Siegelmann & Sontag: Neural Nets as Turing Machines

Siegelmann and Sontag proved in 1992 that a fixed recurrent neural network can simulate any Turing machine.

This document explains their approach and how to use the implementation in this repo.

The Core Idea: Stacks Instead of a Tape

Rather than operating on a tape directly, Siegelmann & Sontag reformulate the problem using $p$-stack machines. A $p$-stack machine is equivalent to a Turing machine but stores memory in $p$ stacks instead of a tape. Each stack supports three operations: push, pop, and peek (read the top).

For the balanced parentheses problem, two stacks are enough:

The transition function for a $p$-stack machine takes the current state and the top of each stack as inputs, and outputs the next state and a stack operation for each stack:

# (state, top_of_stack_1, top_of_stack_2) → (next_state, op_stack_1, op_stack_2)
# ops: 'noop', 'push (', 'push )', 'pop'
# None = stack is empty; '*' = wildcard (matches any value)

balanced_parentheses_delta_stack = {
    ('*',  '(',  '(') : ('A',  'pop',    'push ('),
    ('*',  '(', None) : ('A',  'pop',    'push ('),
    ('*',  ')',  '(') : ('A',  'pop',       'pop'),
    ('I', None, None) : ('T', 'noop',      'noop'),
    ('A', None, None) : ('T', 'noop',      'noop'),
    # not balanced
    ('*',  ')', None) : ('F', 'noop',      'noop'),
    ('*', None,  '(') : ('F', 'noop',      'noop'),
    # terminal states loop
    ('F',  '*',  '*') : ('F', 'noop',      'noop'),
    ('T',  '*',  '*') : ('T', 'noop',      'noop'),
    ('*',  '*',  ')') : ('F', 'noop',      'noop'),
}
terminal_states = ['T', 'F']

Encoding a Stack as a Number

The key mathematical trick is encoding an entire stack as a single rational number using a Cantor-set encoding. For a binary stack (symbols 0 and 1), each stack value is encoded as:

$$\mathcal{E}(a_1 a_2 \cdots a_k) = \sum_{i=1}^{k} \frac{b - 1 + 4\rho(a_i - 1)}{b^i}$$

where $b$ is the base and $\rho$ is a scaling factor. For example with $b=4$, $\rho=1/2$:

Stack contents	Encoded value
(empty)	0
`[0]`	1/4
`[1]`	3/4
`[0, 0, 0]`	21/64

The encoding has a crucial property: push, pop, and peek are all linear functions of the encoded value. This means the network can manipulate the stack using only linear layers and a saturated ReLU activation $\sigma(x) = \mathrm{clamp}(x, 0, 1)$.

Two Network Architectures

The paper describes two constructions:

version=4. Uses 4 layers per step. Each layer is a stage of the computation.

version=1. Uses a single recurrent layer, operating in "real time" (one network step per Turing machine step). Uses base $b = 10\rho^2$ (= 40 for $\rho=2$) and float64 arithmetic. Reliable for strings up to ~8 characters due to error amplification in the pop operation ($b\times$ per pop).

Usage

from turing.ss.simulator import (
    Description, Simulator,
    balanced_parentheses_delta_stack,
    balanced_parentheses_terminal_states,
)

description = Description(
    balanced_parentheses_delta_stack,
    balanced_parentheses_terminal_states,
)

# 4-layer
sim4 = Simulator(description, version=4)
sim4.simulate("(()())")

# 1-layer - requires float64, reliable up to ~8 chars
sim1 = Simulator(description, version=1)
sim1.simulate("(())", T=12)

Output: each step prints the current machine state.

# version=4 output (state vector, one-hot):
   I  ['', '(()())']
   A  ['(', '()()']
   ...
   T  ['', '']

# version=1 output (state name):
   I
   A
   ...
   T

Defining your own machine:

The Description class accepts any p-stack transition function. Constraints:

Alphabet must have at most 2 symbols (mapped internally to 0/1)
'*' in a key is a wildcard matching any value
None means the stack is empty

my_delta = {
    # (state, top_stack1, top_stack2) : (next_state, op1, op2)
    ("start", None, None): ("done", "noop", "noop"),
}
description = Description(my_delta, terminal_states=["done"])
tx = Simulator(description, version=4)

Architecture

The network state at each step is a vector containing:

State part - one-hot encoding of the current machine state
Stack part - each stack encoded as a single scalar using the Cantor-set encoding

At each step, a configuration detector reads the current state and the top of each stack, identifies the matching transition rule, and produces the next state and stack operations. The weights are set analytically via least-squares from the transition function.

The saturated_relu $\sigma(x) = \mathrm{clamp}(x,0,1)$ serves as the nonlinearity throughout, since its behavior on rational inputs is exact.

The 4-Layer Architecture (version=4)

The 4-layer RNN (SiegelmannSontag4) processes one step of the stack machine through four sequential layers, each followed by $\sigma$ (saturated ReLU). The network input $x$ has dimension $s{-}1+p$, where $s$ is the number of states and $p$ is the number of stacks.

Why $s{-}1$? One state dimension is redundant: if no other state is active, the machine must be in state 0. The first layer reconstructs the full $s$-dimensional state from $s{-}1$ inputs.

F4: Configuration Reading

Three parallel linear layers extract information from the raw input:

Layer	Formula	Purpose
`linear_state0`	$w_0 = -1,; b_0 = 1;; w_{j} = e_j$	Recover full $s$-dim state: state 0 = $1 - \sum_{j>0} x_j$
`linear_top`	$4x - 2$	Extract top-of-stack symbol: $\sigma(4x-2)$ maps Cantor encoding to 0 or 1
`linear_nonempty`	$4x$	Detect non-empty stack: $\sigma(4x)$ is 0 iff stack is empty

The outputs are concatenated with the raw stack values and passed through $\sigma$:

$$o = \sigma\bigl(\mathtt{linear_state0}(x_{\text{state}}),; \mathtt{linear_top}(x_{\text{stack}}),; \mathtt{linear_nonempty}(x_{\text{stack}}),; x_{\text{stack}}\bigr)$$

After $\sigma$, the first $s+2p$ elements are clean binary values: one-hot state, top symbols, and non-empty flags.

F3: Configuration Detector

The configuration detector (ConfigurationDetector4) is a single linear layer followed by $\sigma$, with $s \cdot 3^p$ neurons. Each neuron fires for exactly one combination of (state, top/nonempty pattern).

Why $3^p$? Each stack has three valid configurations: empty ($\text{top}=0, \text{nonempty}=0$), top-is-0 ($\text{top}=0, \text{nonempty}=1$), top-is-1 ($\text{top}=1, \text{nonempty}=1$). The fourth case ($\text{top}=1, \text{nonempty}=0$) is impossible and skipped.

Each neuron's weight row is $[e_z \mid v_i]$ — a one-hot state selector concatenated with a binary pattern. The bias is $-\sum v_i$, creating a hard-AND gate: the neuron outputs 1 only when all inputs match. The weights are universal (independent of the transition function).

F2: Transition

Two learned linear layers read the configuration detector output:

$\beta$: maps the $s \cdot 3^p$ indicator vector to the next state ($s$ dims)
$\gamma$: maps to stack action indicators ($4p$ dims, with bias $-1$)

Each stack has 4 action slots: noop, push 0, push 1, pop. The indicator selects which action to apply. Meanwhile, linear_update encodes the four stack operations as linear functions of the stack value:

Action	Weight	Bias	Effect on Cantor value $x$
noop	$1$	$0$	$x$ (identity)
push 0	$1/4$	$1/4$	$x/4 + 1/4$
push 1	$1/4$	$3/4$	$x/4 + 3/4$
pop	$4$	$-2\cdot\text{top}-1$	$4x - 2\cdot\text{top} - 1$

The transition output is $\sigma\bigl(\beta(u) + \gamma(u) - 1 + \mathtt{linear_update}(o)\bigr)$. Since exactly one action indicator is 1 and the rest are 0, only the correct operation survives after the $-1$ bias and $\sigma$ clamping.

F1: Output

The final layer reassembles the output vector:

Drops state 0 (back to $s{-}1$ dims, since state 0 is implicit)
Sums the 4 sub-stack slots back to 1 scalar per stack (only the active operation produced a nonzero value)

How fit() Works

The configuration detector weights (F3) are fixed and universal. Only $\beta$ and $\gamma$ need to be set per-machine. fit() generates training pairs from the transition function, runs them through the configuration detector to get $H$, then solves $H \cdot C^T = Y$ via torch.linalg.solve. The solution $C$ is split into $\beta$ (state rows) and $\gamma$ (action rows).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Siegelmann & Sontag: Neural Nets as Turing Machines

The Core Idea: Stacks Instead of a Tape

Encoding a Stack as a Number

Two Network Architectures

Usage

Architecture

The 4-Layer Architecture (version=4)

F4: Configuration Reading

F3: Configuration Detector

F2: Transition

F1: Output

How fit() Works

Further Reading

FilesExpand file tree

STACK.md

Latest commit

History

STACK.md

File metadata and controls

Siegelmann & Sontag: Neural Nets as Turing Machines

The Core Idea: Stacks Instead of a Tape

Encoding a Stack as a Number

Two Network Architectures

Usage

Architecture

The 4-Layer Architecture (version=4)

F4: Configuration Reading

F3: Configuration Detector

F2: Transition

F1: Output

How fit() Works

Further Reading