Siegelmann and Sontag proved in 1992 that a fixed recurrent neural network can simulate any Turing machine.
This document explains their approach and how to use the implementation in this repo.
Rather than operating on a tape directly, Siegelmann & Sontag reformulate the problem using $p$-stack machines, which are equivalent to Turing machines for $p \ge 2$.
For the balanced parentheses problem, two stacks are enough:
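The transition function for a 2-stack machine maps the current state and the top symbol of each stack to the next state and one operation per stack: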
# (state, top_of_stack_1, top_of_stack_2) → (next_state, op_stack_1, op_stack_2)
# ops: 'noop', 'push (', 'push )', 'pop'
# None = stack is empty; '*' = wildcard (matches any value)
balanced_parentheses_delta_stack = {
('*', '(', '(') : ('A', 'pop', 'push ('),
('*', '(', None) : ('A', 'pop', 'push ('),
('*', ')', '(') : ('A', 'pop', 'pop'),
('I', None, None) : ('T', 'noop', 'noop'),
('A', None, None) : ('T', 'noop', 'noop'),
# not balanced
('*', ')', None) : ('F', 'noop', 'noop'),
('*', None, '(') : ('F', 'noop', 'noop'),
# terminal states loop
('F', '*', '*') : ('F', 'noop', 'noop'),
('T', '*', '*') : ('T', 'noop', 'noop'),
('*', '*', ')') : ('F', 'noop', 'noop'),
}
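terminal_states = ['T', 'F']

The key mathematical trick is encoding an entire stack as a single rational number using a Cantor-set encoding. For a binary stack (symbols 0 and 1), each stack value is encoded as:

$$q = \sum_{i=1}^{n} \frac{2 a_i + 1}{4^i}$$

where $a_i \in \{0, 1\}$ is the $i$-th symbol counting from the top of the stack and $n$ is the stack depth; the empty stack encodes to $0$.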
| Stack contents | Encoded value |
|---|---|
| (empty) | 0 |
| [0] | 1/4 |
| [1] | 3/4 |
| [0, 0, 0] | 21/64 |
The encoding has a crucial property: push, pop, and peek can each be computed from the encoded value with a linear map plus saturation. This means the network can manipulate the stack using only linear layers and a saturated ReLU activation, $\sigma(x) = \min(\max(x, 0), 1)$.
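The arithmetic behind these operations can be checked by hand with exact rationals. The sketch below is only an illustration, not the repo's code: the helper names (`sigma`, `encode`, `push`, `top`, `nonempty`, `pop`) are hypothetical, and it assumes the base-4 encoding implied by the table above, in which symbol $a$ contributes the digit $2a + 1$.

```python
from fractions import Fraction

def sigma(x):
    """Saturated ReLU: clamp x to the interval [0, 1]."""
    return min(max(x, Fraction(0)), Fraction(1))

def encode(stack):
    """Encode a binary stack (top symbol first) as a base-4 rational with digits 1 and 3."""
    digits = (Fraction(2 * a + 1, 4 ** i) for i, a in enumerate(stack, start=1))
    return sum(digits, Fraction(0))

def push(q, a):
    """Shift the existing digits down one place and put the new digit on top."""
    return q / 4 + Fraction(2 * a + 1, 4)

def top(q):
    """1 if the top symbol is 1; 0 if it is 0 or the stack is empty."""
    return sigma(4 * q - 2)

def nonempty(q):
    """1 if the stack holds at least one symbol, else 0."""
    return sigma(4 * q)

def pop(q):
    """Shift the digits up one place and subtract the digit that was on top."""
    return 4 * q - (2 * top(q) + 1)

# The values from the table above
assert encode([]) == 0
assert encode([0]) == Fraction(1, 4)
assert encode([1]) == Fraction(3, 4)
assert encode([0, 0, 0]) == Fraction(21, 64)
```

Using `Fraction` keeps every value exact, which makes it easy to see that `pop(push(q, a)) == q` holds without rounding. A floating-point network loses this exactness once the stack gets deep enough.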
The paper describes two constructions:
- version=4: uses 4 layers per step. Each layer is a stage of the computation.
- version=1: uses a single recurrent layer, operating in "real time" (one network step per Turing machine step). Uses base
from turing.ss.simulator import (
Description, Simulator,
balanced_parentheses_delta_stack,
balanced_parentheses_terminal_states,
)
description = Description(
balanced_parentheses_delta_stack,
balanced_parentheses_terminal_states,
)
# 4-layer
sim4 = Simulator(description, version=4)
sim4.simulate("(()())")
# 1-layer - requires float64, reliable up to ~8 chars
sim1 = Simulator(description, version=1)
sim1.simulate("(())", T=12)

Output: each step prints the current machine state.
# version=4 output (state vector, one-hot):
I ['', '(()())']
A ['(', '()()']
...
T ['', '']
# version=1 output (state name):
I
A
...
T
Defining your own machine:
The Description class accepts any p-stack transition function. Constraints:
- Alphabet must have at most 2 symbols (mapped internally to 0/1)
- '*' in a key is a wildcard matching any value
- None means the stack is empty
my_delta = {
# (state, top_stack1, top_stack2) : (next_state, op1, op2)
("start", None, None): ("done", "noop", "noop"),
}
description = Description(my_delta, terminal_states=["done"])
tx = Simulator(description, version=4)

The network state at each step is a vector containing:
- State part - one-hot encoding of the current machine state
- Stack part - each stack encoded as a single scalar using the Cantor-set encoding
At each step, a configuration detector reads the current state and the top of each stack, identifies the matching transition rule, and produces the next state and stack operations. The weights are set analytically via least-squares from the transition function.
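Before looking at the neural encoding, it may help to see the same step written symbolically. The following is a hypothetical reference interpreter, not the repo's `Simulator`: stacks are plain Python lists with the top at index 0, the input string is loaded onto stack 1, rules are tried in dictionary order, and the loop stops as soon as a terminal state is reached (so the terminal self-loop rules are omitted). All of these choices are assumptions of the sketch.

```python
# Transition table for balanced parentheses (terminal self-loops left out,
# because this sketch halts on reaching 'T' or 'F').
DELTA = {
    ('*', '(', '('): ('A', 'pop', 'push ('),
    ('*', '(', None): ('A', 'pop', 'push ('),
    ('*', ')', '('): ('A', 'pop', 'pop'),
    ('I', None, None): ('T', 'noop', 'noop'),
    ('A', None, None): ('T', 'noop', 'noop'),
    ('*', ')', None): ('F', 'noop', 'noop'),
    ('*', None, '('): ('F', 'noop', 'noop'),
    ('*', '*', ')'): ('F', 'noop', 'noop'),
}
TERMINAL = {'T', 'F'}

def matches(pattern, value):
    """A '*' pattern matches anything, including None (empty stack)."""
    return pattern == '*' or pattern == value

def apply_op(stack, op):
    """Return a new stack after applying 'noop', 'pop', or 'push <symbol>'."""
    if op == 'pop':
        return stack[1:]
    if op.startswith('push '):
        return [op[len('push '):]] + stack
    return stack

def step(state, stacks):
    """Find the first matching rule and apply its operations to every stack."""
    tops = [s[0] if s else None for s in stacks]
    for (p_state, *p_tops), (next_state, *ops) in DELTA.items():
        if matches(p_state, state) and all(matches(p, t) for p, t in zip(p_tops, tops)):
            return next_state, [apply_op(s, op) for s, op in zip(stacks, ops)]
    raise KeyError(f"no rule for {(state, *tops)}")

def run(text, max_steps=100):
    """Run from initial state 'I' until a terminal state or the step limit."""
    state, stacks = 'I', [list(text), []]
    for _ in range(max_steps):
        if state in TERMINAL:
            break
        state, stacks = step(state, stacks)
    return state

assert run("(()())") == 'T'
assert run("(()") == 'F'
```

The network described in this document computes the same `step` function, with the state as a one-hot vector and each list replaced by one Cantor-encoded scalar.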
The saturated_relu activation clamps each value to the interval $[0, 1]$.
The 4-layer RNN (SiegelmannSontag4) processes one step of the stack machine through four sequential layers, each followed by saturated_relu.
Why
Three parallel linear layers extract information from the raw input:
| Layer | Formula | Purpose |
|---|---|---|
| linear_state0 | $1 - \sum_{i \ge 1} x_i$ | Recover full $s$-dim one-hot state |
| linear_top | $4q - 2$ | Extract top-of-stack symbol |
| linear_nonempty | $4q$ | Detect non-empty stack |
The outputs are concatenated with the raw stack values and passed through saturated_relu.
After
The configuration detector (ConfigurationDetector4) is a single linear layer followed by saturated_relu.
Why
Each neuron's weight row is
Two learned linear layers read the configuration detector output:
- $\beta$: maps the $s \cdot 3^p$ indicator vector to the next state ($s$ dims)
- $\gamma$: maps to stack action indicators ($4p$ dims, with bias $-1$)
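To make the $s \cdot 3^p$ indicator vector concrete, here is one way a single linear layer plus saturation can produce it. This is a sketch under assumptions, not the code of ConfigurationDetector4: the function names are hypothetical, and each stack is assumed to be summarized by two binary features `(top, nonempty)`. The three possible stack situations (empty, top is 0, top is 1) are each a linear expression in those features, and a neuron computes their logical AND with the state bit by summing and subtracting a threshold.

```python
from itertools import product

def sigma(x):
    """Saturated ReLU: clamp x to [0, 1]."""
    return min(max(x, 0), 1)

def stack_literals(top, nonempty):
    """Three mutually exclusive 0/1 indicators: [empty, top is 0, top is 1]."""
    return [1 - nonempty, nonempty - top, top]

def detect(state_onehot, stack_features):
    """Return the s * 3**p indicator vector.

    state_onehot   -- list of s bits, exactly one of them 1
    stack_features -- list of p (top, nonempty) pairs, one per stack
    """
    literals = [stack_literals(t, n) for t, n in stack_features]
    p = len(stack_features)
    out = []
    for state_bit in state_onehot:
        for choice in product(range(3), repeat=p):
            total = state_bit + sum(literals[k][c] for k, c in enumerate(choice))
            # AND of p + 1 binary inputs: the sum reaches p + 1 only when all
            # inputs are 1, so subtracting p leaves 1 for a match and <= 0 otherwise.
            out.append(sigma(total - p))
    return out

# State 1 of 4, stack 1 has a 1 on top, stack 2 is empty
indicator = detect([0, 1, 0, 0], [(1, 1), (0, 0)])
assert len(indicator) == 4 * 3 ** 2
assert sum(indicator) == 1
```

Because exactly one neuron fires for every valid input, any layer that reads this vector acts as a lookup table indexed by the configuration.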
Each stack has 4 action slots: noop, push 0, push 1, pop. The indicator selects which action to apply. Meanwhile, linear_update encodes the four stack operations as linear functions of the stack value:
| Action | Weight | Bias | Effect on Cantor value |
|---|---|---|---|
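| noop | $1$ | $0$ | $q$ |
| push 0 | $1/4$ | $1/4$ | $q/4 + 1/4$ |
| push 1 | $1/4$ | $3/4$ | $q/4 + 3/4$ |
| pop | $4$ | $-(2\,\text{top} + 1)$ | $4q - (2\,\text{top} + 1)$ |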
The transition output is
The final layer reassembles the output vector:
- Drops state 0 (back to $s{-}1$ dims, since state 0 is implicit)
- Sums the 4 sub-stack slots back to 1 scalar per stack (only the active operation produced a nonzero value)
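The interplay between the four action slots, the $-1$ bias, and the final sum can be sketched for a single stack as follows. This is an illustration under assumptions rather than the repo's linear_update: the function names are hypothetical, the update formulas are the ones that follow from the base-4 Cantor encoding, and the gating (adding `indicator - 1` before saturation) is one mechanism consistent with the description above.

```python
from fractions import Fraction

ACTIONS = ['noop', 'push 0', 'push 1', 'pop']

def sigma(x):
    """Saturated ReLU: clamp x to [0, 1]."""
    return min(max(x, Fraction(0)), Fraction(1))

def update_slots(q, top):
    """Candidate next stack value for each action, affine in q and top."""
    return [
        q,                        # noop
        q / 4 + Fraction(1, 4),   # push 0
        q / 4 + Fraction(3, 4),   # push 1
        4 * q - (2 * top + 1),    # pop
    ]

def apply_action(q, top, action):
    """Compute all four candidates, keep only the selected one, and sum the slots."""
    indicator = [1 if a == action else 0 for a in ACTIONS]
    # Adding (indicator - 1) leaves the active slot unchanged and shifts every
    # inactive slot to <= 0, which sigma then clamps to exactly 0.
    gated = [sigma(v + (i - 1)) for v, i in zip(update_slots(q, top), indicator)]
    return sum(gated)

q = Fraction(13, 16)  # encodes the stack [1, 0], top first
assert apply_action(q, 1, 'noop') == Fraction(13, 16)
assert apply_action(q, 1, 'pop') == Fraction(1, 4)       # leaves [0]
assert apply_action(q, 1, 'push 0') == Fraction(29, 64)  # gives [0, 1, 0]
```

The gate works because every candidate value is at most 1, so subtracting 1 from an inactive slot can never leave a positive remainder.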
The configuration detector weights (F3) are fixed and universal. Only $\beta$ and $\gamma$ depend on the machine: fit() generates training pairs from the transition function, runs them through the configuration detector to get the indicator vectors, and solves the resulting linear system with torch.linalg.solve.
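As a rough picture of what such a fit involves, the toy example below solves for a next-state matrix by least squares through the normal equations. It is a sketch with made-up data, not the repo's fit(): it uses NumPy's `np.linalg.solve` in place of `torch.linalg.solve`, the matrices `X`, `Y`, and `B` are hypothetical, and it assumes every configuration column appears in at least one training pair so that $X^\top X$ is invertible.

```python
import numpy as np

# Row k of X is the detector's one-hot indicator for training pair k
# (3 configurations here); row k of Y is the one-hot next state (2 states).
# Configuration 0 appears twice, as it would when a wildcard rule expands.
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
Y = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])

# Least squares via the normal equations: (X^T X) B = X^T Y
B = np.linalg.solve(X.T @ X, X.T @ Y)

# With one-hot inputs and consistent targets the fit reproduces every pair exactly.
assert np.allclose(X @ B, Y)
```

Since the inputs are one-hot, each row of `B` is simply the target for one configuration, which is why an exact solution exists whenever the transition function is deterministic.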
- Siegelmann & Sontag (1995)
- src/turing/ss/networks.py: SiegelmannSontag4, ConfigurationDetector4, SiegelmannSontag1, ConfigurationDetector1


