
Commit 5b20d1e

NSHkr authored and committed
add strat
1 parent cd6cfc4 commit 5b20d1e

File tree

7 files changed

+1978
-0
lines changed


docs/design/1.md

Lines changed: 146 additions & 0 deletions
@@ -0,0 +1,146 @@
Excellent question. This gets to the heart of the architectural trade-offs between the two libraries. The Elixir `json_remedy` is architecturally clean and performant, while the Python `json_repair` is pragmatically robust, hardened by numerous real-world, messy examples.

To make the Elixir version as robust and encompassing as the Python one without sacrificing its superior architecture, we need to evolve its deterministic, layered pipeline into a more flexible, context-aware, and probabilistic decision-making engine. This involves borrowing concepts from information theory, compiler design (specifically error recovery), and probabilistic modeling.

Here is a breakdown of the necessary design enhancements.

### Conceptual Framework: From Determinism to Probabilistic Repair

The core limitation of the current Elixir design is that each layer makes a **single, deterministic decision**. `ContentCleaning` removes comments and returns one string. `StructuralRepair` adds a missing brace and returns one string. This works well for simple errors but fails when a situation is ambiguous.

The Python version resolves ambiguity with complex, nested `if/else` heuristics that implicitly weigh different possibilities. To make the Elixir version superior, we must make this weighing process **explicit, configurable, and principled.**

The guiding principle should be: **What is the most probable valid JSON, given this malformed input?** This is a classic "noisy channel" problem from information theory. The repair process is an attempt to recover the original signal (valid JSON) from the noise (syntax errors).

### Design Enhancements
#### 1. Introduce a Probabilistic Repair Model (A "Cost" System)

Instead of each layer returning one definitive result, it should return a list of *potential repair candidates*, each with an associated "cost" or "negative log-likelihood". The cost represents how "drastic" or "unlikely" a given repair is.

A `repair_candidate` would look like this:

```elixir
@type repair_candidate :: %{
        content: String.t(),
        context: JsonContext.t(),
        cost: non_neg_integer(),
        log: [repair_action()]
      }
```
- **Lower cost is better.** A simple fix like changing `'` to `"` has a low cost (e.g., 1). Deleting an entire line of text has a high cost (e.g., 50).
- Layers would generate multiple candidates. For example, when faced with `{"key": "value" "another_key": "value"}`, Layer 3 could generate two candidates (a sketch of such a generator follows below):
  1. Insert a comma: `{"key": "value", "another_key": "value"}` (Cost: 5).
  2. Insert a colon and treat "value" as a key: `{"key": {"value": "another_key"}, "value": ...}` (Cost: 25; a much more drastic change).
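A minimal sketch of such a candidate generator, with hypothetical helper names and illustrative costs (this is not the library's existing API):

```elixir
# Hypothetical sketch: a layer proposes several cost-ranked repair
# candidates instead of committing to a single repaired string.
defmodule JsonRemedy.Layer3.CandidateGen do
  # Insert a string at a character offset.
  defp splice(content, pos, insertion) do
    {head, tail} = String.split_at(content, pos)
    head <> insertion <> tail
  end

  @doc "Propose repairs for a suspected missing delimiter at `pos`."
  def candidates(content, context, pos) do
    [
      # Cheap, common fix: the comma was probably just forgotten.
      %{content: splice(content, pos, ","), context: context, cost: 5,
        log: [{:inserted_comma, pos}]},
      # More drastic alternative: insert a colon instead, reinterpreting
      # the preceding value as a key.
      %{content: splice(content, pos, ":"), context: context, cost: 25,
        log: [{:inserted_colon, pos}]}
    ]
    |> Enum.sort_by(& &1.cost)
  end
end
```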
#### 2. Evolve the Pipeline into a "Beam Search" Engine

The simple `with` pipeline is no longer sufficient. It must be replaced by a **Repair Engine** that orchestrates the layers. This engine would implement a form of **beam search** to explore the most promising repair paths without an exponential explosion of possibilities.

The `beam_width` would be a configurable parameter (e.g., 3).

**Workflow:**

1. **Input:** Start with one candidate: `{content: initial_string, cost: 0, ...}`.
2. **Layer 1:** `ContentCleaning.process/2` is called. It returns a list of candidates (e.g., one for the content inside a ```` ```json ```` fence, another if it finds a second JSON block).
3. **Prune:** The Repair Engine takes all candidates, sorts them by cost, and keeps only the top `beam_width` candidates.
4. **Layer 2:** For *each* of the surviving candidates, `StructuralRepair.process/2` is called. This generates a new, larger list of candidates.
5. **Prune:** The engine again sorts all new candidates by their cumulative cost and keeps the top `beam_width`.
6. **Repeat:** This continues through all layers.
7. **Final Selection:** After the final layer, the candidate with the lowest total cost that successfully parses in the `Validation` layer is chosen as the winner.

This transforms the pipeline from a linear filter into a search for the lowest-cost path through the "repair state space."
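A minimal sketch of the engine's core loop, assuming each layer exposes a `process/1` that maps one candidate to a list of candidates (names and shapes are illustrative, not the library's current API):

```elixir
# Sketch of the Repair Engine's beam search over repair candidates.
defmodule JsonRemedy.RepairEngine do
  @beam_width 3

  def run(input, layers) do
    start = %{content: input, context: nil, cost: 0, log: []}

    layers
    |> Enum.reduce([start], fn layer, beam ->
      beam
      # Each layer expands every surviving candidate into new candidates...
      |> Enum.flat_map(&layer.process/1)
      # ...then the engine prunes back to the cheapest `beam_width` paths.
      |> Enum.sort_by(& &1.cost)
      |> Enum.take(@beam_width)
    end)
    # Final selection: the cheapest candidate that actually parses.
    |> Enum.sort_by(& &1.cost)
    |> Enum.find(fn cand -> match?({:ok, _}, Jason.decode(cand.content)) end)
  end
end
```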
#### 3. Create a Richer, More Granular `JsonContext`

The current context (`:object_key`, `:object_value`) is good but insufficient for the Python version's level of nuance. The context needs to track more state to make better cost calculations.

**Enhanced `JsonContext`:**

```elixir
defstruct current: :root,
          stack: [],
          position: 0,
          in_string: false,
          # NEW FIELDS
          last_significant_char: nil, # the last non-whitespace char seen (e.g., "}", ",", ":")
          last_token_type: nil,       # the last logical token seen (e.g., :string, :number, :close_brace)
          lookahead_buffer: ""        # a small buffer of upcoming characters for lookahead without re-reading
```
This richer context allows for more intelligent repairs:
- If `last_token_type` was `:string_value` and we now see another `string_value` in an object, the cost of inserting a comma is very low.
- If we see a `{` and the `last_significant_char` was `}`, the cost of inserting a comma between them is very low.
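Expressed as a sketch, these heuristics become plain pattern-matched function clauses over the enriched context (the numeric costs are illustrative):

```elixir
# Sketch: cost estimates become pattern matches on the enriched context.
defmodule JsonRemedy.Costs do
  # Two string values back-to-back inside an object: a comma is almost
  # certainly missing, so this repair is nearly free.
  def insert_comma(%{current: :object_value, last_token_type: :string_value}), do: 1

  # `}` immediately followed by `{`: adjacent objects missing a separator.
  def insert_comma(%{last_significant_char: "}"}), do: 1

  # Anywhere else, inserting a comma is a more speculative repair.
  def insert_comma(_context), do: 10
end
```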
#### 4. Codify Heuristics into a Declarative Rule Set

The Python version's strength is its many hard-coded heuristics. The Elixir version should adopt these not as tangled `if/else` logic, but as a declarative, extensible rule set within `Layer3.SyntaxNormalization`.

```elixir
# In Layer3.SyntaxNormalization
# (Captures use &__MODULE__.fun/1 so they are legal inside a module
# attribute; `:any` stands in for a wildcard, since `_` is only valid
# in patterns, not in data.)
@rules [
  # Rule for missing comma between values in an array
  %{
    name: :missing_array_comma,
    # The pattern to match in the context
    context_pattern: %{current: :array, last_token_type: {:value, :any}},
    # The pattern to match in the upcoming text
    char_pattern: &__MODULE__.is_value_start?/1,
    # The repair to apply
    repair: {:insert, ","},
    # The cost of this repair
    cost: 5
  },
  # Rule for missing colon after a key
  %{
    name: :missing_colon,
    context_pattern: %{current: :object_value, last_token_type: :key},
    char_pattern: &__MODULE__.is_value_start?/1,
    repair: {:insert, ":"},
    cost: 8
  },
  # Rule for quoting an unquoted key
  %{
    name: :unquoted_key,
    context_pattern: %{current: :object_key, last_token_type: :open_brace_or_comma},
    char_pattern: &__MODULE__.is_identifier_start?/1,
    # This is a more complex repair
    repair: {:quote_unquoted_key},
    cost: 10
  }
]
```
The `SyntaxNormalization` layer would iterate through these rules at each character, generating repair candidates whenever a pattern matches. This makes the logic easy to read, extend, and test.
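A minimal sketch of that matching step, reusing the rule shape above (the context-matching helper and its wildcard handling are illustrative):

```elixir
# Sketch: find every declarative rule that fires at the current position.
defmodule JsonRemedy.Layer3.RuleMatcher do
  def matching_rules(rules, context, <<next::utf8, _::binary>>) do
    Enum.filter(rules, fn rule ->
      context_matches?(rule.context_pattern, context) and
        rule.char_pattern.(<<next::utf8>>)
    end)
  end

  def matching_rules(_rules, _context, ""), do: []

  # Every key in the pattern must agree with the context.
  defp context_matches?(pattern, context) do
    Enum.all?(pattern, fn {key, expected} ->
      value_matches?(expected, Map.get(context, key))
    end)
  end

  # `:any`, alone or inside a tagged tuple, acts as a wildcard.
  defp value_matches?(:any, _actual), do: true
  defp value_matches?({tag, :any}, {tag, _}), do: true
  defp value_matches?(expected, actual), do: expected == actual
end
```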
### Putting It All Together: A Worked Example

Consider the input: `{"key1": "value1" "key2": "value2"}`.

1. **Engine Start:** The pipeline receives the input string.
2. **Layers 1 & 2:** Pass through unchanged (cost = 0).
3. **Layer 3 (SyntaxNormalization):**
   * It parses up to `"value1"`. The context is now `last_token_type: :string_value`.
   * It sees the next non-whitespace char is `"`. This is the start of another string (`"key2"`).
   * It consults its rules. A rule like `context_pattern: %{last_token_type: :string_value}, char_pattern: &is_quote?/1` matches.
   * This rule knows two likely repairs for this situation in an object:
     * **Candidate A:** Insert a comma. `{"key1": "value1", "key2": ...}`. This is a common typo. **Cost: 5**.
     * **Candidate B:** Assume the first string was a key to a nested object. `{"key1": {"value1": "key2", ...}}`. This is a major structural change. **Cost: 25**.
4. **Beam Search:** The engine now has two candidates in its beam (assuming `beam_width >= 2`).
   * Path A: `(content_with_comma, cost: 5)`
   * Path B: `(content_with_nesting, cost: 25)`
5. **Continuation:** The engine continues parsing *both* paths.
   * Path A continues smoothly. It finds ` "key2": "value2"}` and parses to the end.
   * Path B immediately runs into trouble trying to make sense of the remaining `: "value2"}`. Its cost will skyrocket as it tries more drastic repairs.
6. **Layer 4 (Validation):**
   * Path A's final string is `{"key1": "value1", "key2": "value2"}`. `Jason.decode` succeeds. **Final cost: 5**.
   * Path B's final string is a mess. `Jason.decode` fails. **Final cost: Infinity**.
7. **Result:** The engine selects Path A as the winner.
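The selection step at the end of this walk-through can be sketched directly (candidate shape as above; the `nil` fallback is an assumption for the no-survivor case):

```elixir
# Sketch of the final selection in the Validation layer: only candidates
# that parse survive, and the cheapest survivor wins.
defmodule JsonRemedy.Layer4.Validation do
  def select_winner(candidates) do
    candidates
    |> Enum.filter(fn cand -> match?({:ok, _}, Jason.decode(cand.content)) end)
    |> Enum.min_by(& &1.cost, &<=/2, fn -> nil end)
  end
end
```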
### Summary of Benefits from these Enhancements

- **Robustness:** The library no longer has to be "right" at every step. It can explore multiple plausible repairs and pick the one that leads to the most coherent final structure. This directly addresses the "hard-coded use cases" from the Python version by providing a framework to handle ambiguity.
- **Extensibility:** Adding a new, weird repair rule is as simple as adding a new map to the `@rules` list, rather than weaving more logic into a complex function.
- **Maintainability:** The separation of the *search mechanism* (Repair Engine) from the *repair logic* (layer rules) keeps the code clean and testable.
- **Principled Design:** The "cost" system provides a formal, tunable model for what constitutes a "good" repair, moving beyond hard-coded heuristics to a more robust, information-theoretic foundation.

docs/design/10.md

Lines changed: 192 additions & 0 deletions
@@ -0,0 +1,192 @@
Looking at both libraries, I see complementary strengths that suggest a hybrid approach rather than making one exactly like the other. Let me offer a different perspective on how to enhance the Elixir library while preserving its architectural advantages.

## A Complementary Perspective: Architectural Symbiosis Over Convergence

After analyzing both libraries deeply, I believe the most valuable approach isn't to make the Elixir library exactly like the Python one, but rather to leverage their complementary strengths while preserving what makes each excellent in its respective domain.

## The Core Insight: Different Paradigms, Different Strengths

The **Python `json_repair`** is fundamentally a **heuristic-driven, imperative parser** that has evolved through extensive battle-testing with real-world malformed JSON. Its strength lies in its pragmatic handling of edge cases through accumulated wisdom.

The **Elixir `JsonRemedy`** is a **declarative, composable pipeline** that leverages Elixir's inherent strengths: binary pattern matching, immutable data structures, and functional composition. Its strength lies in architectural elegance and predictable performance characteristics.

## Proposed Enhancement Strategy: Selective Adoption with Architectural Preservation

Rather than a complete overhaul, I propose a **hybrid approach** that selectively adopts Python's robustness while preserving Elixir's architectural advantages:
### 1. **Empirical Knowledge Integration** (Not Probabilistic Overhead)

Instead of introducing a complex probabilistic cost system, extract the **empirical patterns** from the Python library's accumulated fixes and encode them as **compile-time decision trees**:
```elixir
# Extract patterns from Python's battle-tested heuristics.
# (Repairs are maps rather than `{key: value}` tuples, and captures use
# &__MODULE__.fun/2 so they are legal inside a module attribute.)
@repair_patterns [
  # Pattern: "value1" "value2" in object context
  %{
    context: :object_value,
    pattern: ~r/"\s*"/,
    repairs: [
      %{priority: 1, action: :insert_comma, condition: &__MODULE__.followed_by_key?/2},
      %{priority: 2, action: :merge_strings, condition: &__MODULE__.looks_like_continuation?/2}
    ]
  },
  # Pattern: missing closing quote before a colon
  %{
    context: :object_key,
    pattern: ~r/[^"]\s*:/,
    repairs: [%{priority: 1, action: :add_missing_quote, position: :before_colon}]
  }
]
```
This captures Python's empirical knowledge without abandoning Elixir's deterministic approach.
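Applying such a table stays fully deterministic: take the first pattern that matches, then the highest-priority repair whose condition holds. A minimal sketch, with illustrative names:

```elixir
# Sketch: deterministic application of an empirical pattern table.
defmodule JsonRemedy.EmpiricalRepair do
  def first_applicable_repair(patterns, input, context) do
    patterns
    # Only consider patterns declared for the current context...
    |> Enum.filter(fn p -> p.context == context.current end)
    # ...and take the first whose regex actually occurs in the input.
    |> Enum.find_value(fn p ->
      if Regex.match?(p.pattern, input) do
        # Lower priority number wins, provided the repair's condition
        # (if any) approves of this input/context.
        p.repairs
        |> Enum.sort_by(& &1.priority)
        |> Enum.find(fn r ->
          condition = Map.get(r, :condition, fn _input, _ctx -> true end)
          condition.(input, context)
        end)
      end
    end)
  end
end
```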
### 2. **Context-Aware Character Lookahead** (Not Full Beam Search)

Instead of expensive beam search, enhance the existing context with **minimal lookahead** that leverages the efficiency of Elixir's binary pattern matching:
```elixir
defmodule JsonRemedy.Context.EnhancedContext do
  defstruct current: :root,
            stack: [],
            position: 0,
            # Enhanced for Python-level awareness
            last_token: nil,
            lookahead_cache: %{}, # cache 3-5 char lookaheads
            char_sequence: []     # track the last 3 chars for patterns

  # Efficient pattern matching over a small lookahead window
  def peek_pattern(context, input, patterns) do
    remaining = String.slice(input, context.position, 10) # small window

    Enum.find(patterns, fn pattern ->
      binary_matches_pattern?(remaining, pattern)
    end)
  end

  # Use Elixir's binary matching for cheap pattern detection
  defp binary_matches_pattern?(<<"\"", _::binary>>, :quote_start), do: true

  defp binary_matches_pattern?(<<char::utf8, rest::binary>>, :identifier_colon)
       when char in ?a..?z or char in ?A..?Z do
    find_colon_after_identifier(rest)
  end

  defp binary_matches_pattern?(_, _), do: false

  # Scan past identifier characters looking for a `:`. (This helper was
  # implied but left undefined in the original sketch.)
  defp find_colon_after_identifier(<<char::utf8, rest::binary>>)
       when char in ?a..?z or char in ?A..?Z or char in ?0..?9 or char == ?_,
       do: find_colon_after_identifier(rest)

  defp find_colon_after_identifier(<<":", _::binary>>), do: true
  defp find_colon_after_identifier(_), do: false
end
```
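Illustrative usage of the module above:

```elixir
ctx = %JsonRemedy.Context.EnhancedContext{position: 0}

JsonRemedy.Context.EnhancedContext.peek_pattern(ctx, ~s("quoted), [:quote_start, :identifier_colon])
#=> :quote_start

JsonRemedy.Context.EnhancedContext.peek_pattern(ctx, "name: 1", [:quote_start, :identifier_colon])
#=> :identifier_colon
```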
### 3. **Fast Path Optimization with Fallback Layers**

Instead of making every layer probabilistic, create **fast paths** for common patterns while preserving the existing deterministic pipeline:
```elixir
defmodule JsonRemedy.FastPath do
  # Handle the bulk of common cases with cheap regex/binary patterns
  @common_fixes [
    # Pattern matches for frequent Python fixes
    {~r/True/, "true"},
    {~r/False/, "false"},
    {~r/'\s*([^']*)\s*'/, "\"\\1\""},
    # Trailing commas: keep the closing delimiter, drop only the comma
    {~r/,(\s*[}\]])/, "\\1"}
  ]

  def attempt_fast_repair(input) do
    case detect_simple_patterns(input) do
      {:ok, repaired} -> {:fast_path, repaired}
      :complex -> {:fallback_to_pipeline, input}
    end
  end

  # Use binary pattern matching for detection
  defp detect_simple_patterns(input) do
    case input do
      <<"True", rest::binary>> -> {:ok, "true" <> rest}
      <<"False", rest::binary>> -> {:ok, "false" <> rest}
      <<"'", _::binary>> = quoted -> attempt_quote_conversion(quoted)
      _ -> :complex
    end
  end

  # Run the regex table; if nothing changed, the input needs the full
  # pipeline. (This helper was implied but undefined in the original sketch.)
  defp attempt_quote_conversion(input) do
    repaired =
      Enum.reduce(@common_fixes, input, fn {pattern, replacement}, acc ->
        Regex.replace(pattern, acc, replacement)
      end)

    if repaired == input, do: :complex, else: {:ok, repaired}
  end
end
```
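Illustrative usage of the sketch above:

```elixir
JsonRemedy.FastPath.attempt_fast_repair("True")
#=> {:fast_path, "true"}

JsonRemedy.FastPath.attempt_fast_repair("{'a': 1,}")
#=> {:fallback_to_pipeline, "{'a': 1,}"}  # not a fast-path prefix; the full pipeline decides
```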
### 4. **Incremental Enhancement Through Pattern Mining**

Rather than rewriting the architecture, **systematically extract patterns** from the Python library and add them as **new rules** to the existing layers:
```elixir
# In Layer3.SyntaxNormalization - add Python-derived rules.
# (Captures use &__MODULE__.fun/1 so they are legal inside a module attribute.)
@python_derived_rules [
  # Extracted from Python's parse_string edge cases
  %{name: :doubled_quotes, pattern: "\"\"", replacement: "\""},
  %{name: :unmatched_delimiters, pattern: "\" \"", context: :object_value,
    action: :check_key_value_pattern},

  # Extracted from Python's object parsing
  %{name: :missing_comma_after_value,
    pattern: {&__MODULE__.value_ending?/1, &__MODULE__.key_starting?/1},
    action: :insert_comma}
]
```
## Performance-First Architecture Decisions

Given Elixir's binary pattern matching performance advantages:

### 1. **Leverage Elixir's Binary Matching Superiority**

Elixir's binary pattern matching creates efficient sub-binaries without copying, and the compiler can optimize away unnecessary allocations when patterns are well-structured. This gives Elixir a fundamental advantage over Python's character-by-character string manipulation.
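A small illustration of the sub-binary behavior this relies on:

```elixir
# Matching a tail as `::binary` yields a sub-binary that references the
# original bytes instead of copying them, so repeated "consume one token,
# keep the rest" steps stay cheap.
<<head::binary-size(5), rest::binary>> = "hello world"
head #=> "hello"
rest #=> " world" (a sub-binary sharing storage with the original)
```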
### 2. **Preserve the Pipeline but Add Intelligence**

Instead of abandoning the clean pipeline, enhance each layer with **Python-derived intelligence**:
```elixir
defmodule JsonRemedy.Layer3.IntelligentSyntax do
  # Keep the existing efficient pipeline
  def process(input, _context) do
    input
    |> apply_fast_patterns()   # binary-matched rewrites for common cases
    |> apply_context_repairs() # Python-derived contextual fixes
    |> fallback_to_existing()  # original character-by-character pass when needed
  end

  # Use Elixir's strengths: recursive binary matching. (A variable-length
  # `prefix::binary` segment can only appear at the end of a pattern, so
  # we scan left-to-right and rebuild the output instead.)
  defp apply_fast_patterns(input, acc \\ "")

  defp apply_fast_patterns(<<"True", rest::binary>>, acc),
    do: apply_fast_patterns(rest, acc <> "true")

  defp apply_fast_patterns(<<"False", rest::binary>>, acc),
    do: apply_fast_patterns(rest, acc <> "false")

  # Naive single-quote normalization; the real layer is context-aware.
  defp apply_fast_patterns(<<"'", rest::binary>>, acc),
    do: apply_fast_patterns(rest, acc <> "\"")

  defp apply_fast_patterns(<<char::utf8, rest::binary>>, acc),
    do: apply_fast_patterns(rest, acc <> <<char::utf8>>)

  defp apply_fast_patterns(<<>>, acc), do: acc

  # Placeholder stages: the real implementations live elsewhere in the layer.
  defp apply_context_repairs(input), do: input
  defp fallback_to_existing(input), do: input
end
```
### 3. **Selective Complexity Introduction**

Only add complexity where the Python library demonstrates clear empirical advantages:

- **String parsing edge cases**: Python handles many malformed string scenarios
- **Object key detection**: Python's heuristics for unquoted keys are sophisticated
- **Context-sensitive comma handling**: Python has nuanced comma insertion/removal logic
## The Result: Best of Both Worlds

This approach would yield:

1. **Maintainability**: Keep Elixir's clean, testable architecture
2. **Performance**: Leverage binary pattern matching for cheap, allocation-light handling of common cases
3. **Robustness**: Selectively adopt Python's battle-tested edge case handling
4. **Extensibility**: Add new patterns as they're discovered, without architectural changes
## Why This Beats Pure Convergence

Benchmark tests across JSON libraries show that different libraries excel in different scenarios: some are better for large files, others for small requests. Similarly, the Python and Elixir libraries have different optimal use cases.

Rather than making them identical, this approach:

- **Preserves** Elixir's architectural advantages for maintainability and performance
- **Incorporates** Python's empirical knowledge without its architectural complexity
- **Leverages** each language's inherent strengths (Python's string manipulation vs. Elixir's binary matching)
- **Avoids** the complexity overhead of probabilistic systems for deterministic problems

The goal isn't to replicate Python's approach in Elixir, but to create a library that's **better than both** by combining their strengths while avoiding their weaknesses.
