Commit 0861e6b

Author: nshkrdotcom
Commit message: prep for v0.1.5 - add example, update changelog and readme
1 parent cadd8dd commit 0861e6b

3 files changed

+368
-45
lines changed

CHANGELOG.md

Lines changed: 53 additions & 16 deletions
@@ -10,8 +10,37 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
1010
## [0.1.5] - 2025-10-24
1111

1212
### Added
13-
- **Professional hex-shaped logo**: New `assets/json_remedy_logo.svg` with modern design featuring medical cross and Elixir drop, representing JSON healing in the Elixir ecosystem
14-
- **HTML content handling in Layer 3**: New `HtmlHandlers` module for intelligent detection and quoting of unquoted HTML values
13+
14+
#### **🔄 Pre-processing Pipeline** - Major Architectural Enhancement
15+
A new pre-processing stage now runs **before** the main layer pipeline to handle complex patterns that would otherwise be broken by subsequent layers. This is inspired by the [json_repair](https://github.com/mangiucugna/json_repair) Python library.
16+
17+
**New Pre-processing Modules**:
18+
- **`MultipleJsonDetector`** utility: Detects and aggregates consecutive JSON values
19+
- Pattern: `[]{}` → `[[], {}]`
20+
- Prevents Layer 1 from treating subsequent JSON as "wrapper text"
21+
- Runs first in the pipeline before any layer processing
22+
- **Test status**: ✅ 10/10 tests passing
23+
24+
- **`ObjectMerger`** (Layer 3): Merges key-value pairs after premature closing braces
25+
- Pattern: `{"a":"b"},"c":"d"}` → `{"a":"b","c":"d"}`
26+
- Handles malformed objects with extra closing braces
27+
- Merges additional pairs erroneously placed outside objects
28+
- **Test status**: ✅ 10/10 tests passing
29+
30+
**New Layer 3 Filters**:
31+
- **`EllipsisFilter`**: Removes unquoted ellipsis (`...`) placeholders
32+
- Pattern: `[1,2,3,...]` → `[1,2,3]`
33+
- Common in LLM-generated content to indicate truncation
34+
- Preserves quoted `"..."` as valid string values
35+
- **Test status**: ✅ 10/10 tests passing
36+
37+
- **`KeywordFilter`**: Removes unquoted comment-like keywords
38+
- Pattern: `{"a":1, COMMENT "b":2}` → `{"a":1,"b":2}`
39+
- Filters: `COMMENT`, `SHOULD_NOT_EXIST`, `DEBUG_INFO`, `PLACEHOLDER`, `TODO`, `FIXME`, etc.
40+
- **Test status**: ✅ 10/10 tests passing
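The ellipsis behavior described in the entries above (drop unquoted `...`, keep quoted `"..."`) can be illustrated with a small scanner. This is only a sketch under stated assumptions: the module name `EllipsisSketch` and its `filter/1` function are hypothetical and are not the library's `EllipsisFilter`. It also assumes the ellipsis follows a comma, as in the changelog pattern.

```elixir
defmodule EllipsisSketch do
  @moduledoc false
  # Hypothetical illustration; not the JsonRemedy implementation.

  # Strip unquoted `...` together with the separator that preceded it.
  def filter(input) when is_binary(input), do: scan(input, false, [])

  # End of input: the accumulator is reversed iodata.
  defp scan(<<>>, _in_string, acc), do: acc |> Enum.reverse() |> IO.iodata_to_binary()

  # Inside a string, an escaped byte is copied verbatim.
  defp scan(<<"\\", c, rest::binary>>, true, acc),
    do: scan(rest, true, [c, "\\" | acc])

  # A double quote toggles the in-string flag.
  defp scan(<<"\"", rest::binary>>, in_string, acc),
    do: scan(rest, not in_string, ["\"" | acc])

  # Outside a string, drop `...` and any comma or whitespace emitted just before it.
  defp scan(<<"...", rest::binary>>, false, acc),
    do: scan(rest, false, drop_separator(acc))

  # Every other byte is copied through unchanged.
  defp scan(<<c, rest::binary>>, in_string, acc),
    do: scan(rest, in_string, [c | acc])

  defp drop_separator([c | rest]) when c in [?\s, ?\t, ?\n, ?,], do: drop_separator(rest)
  defp drop_separator(acc), do: acc
end

IO.puts(EllipsisSketch.filter("[1,2,3,...]"))
IO.puts(EllipsisSketch.filter(~s(["..."])))
```

Tracking the in-string flag is what lets the quoted ellipsis survive. A leading ellipsis such as `[...,1]` would leave a dangling comma in this sketch, so a fuller version would need to handle that case too.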
41+
42+
#### **🌐 HTML Content Handling in Layer 3**
43+
- **`HtmlHandlers`** module: Intelligent detection and quoting of unquoted HTML values
1544
- **DOCTYPE declarations**: `<!DOCTYPE HTML ...>` properly detected and quoted
1645
- **HTML comments**: `<!-- ... -->` handled correctly without breaking tag depth tracking
1746
- **Void elements**: Self-closing tags (`<meta>`, `<br>`, `<hr>`, `<img>`, etc.) tracked without expecting closing tags
@@ -20,37 +49,45 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
2049
- **Smart extraction**: Tracks HTML tag depth to determine end of HTML block
2150
- **Proper escaping**: Quotes, newlines, tabs, backslashes all escaped for valid JSON
2251
- **Array support**: HTML values in arrays work correctly
23-
- **Comprehensive test suite**: 15 new tests covering HTML content scenarios
24-
- API error responses with full HTML pages
25-
- Simple HTML fragments
26-
- HTML with nested JSON-like braces
27-
- Multiple HTML values in objects and arrays
28-
- Complex real-world edge cases
52+
- **Test status**: ✅ 15/15 tests passing
53+
54+
#### **📚 Documentation & Examples**
55+
- **Professional hex-shaped logo**: New `assets/json_remedy_logo.svg` with modern design featuring medical cross and Elixir drop
2956
- **Example documentation**: `examples/html_content_examples.exs` with 5 challenging real-world scenarios
30-
- **Missing patterns analysis**: Test files documenting 4 patterns from json_repair Python library not yet implemented:
31-
- `test_missing_pattern_1_multiple_json.exs` - Multiple JSON value aggregation (0/10 pass)
32-
- `test_missing_pattern_2_object_merging.exs` - Object boundary merging (0/10 pass)
33-
- `test_missing_pattern_3_ellipsis.exs` - Ellipsis filtering (1/10 pass)
34-
- `test_missing_pattern_4_comment_keywords.exs` - Comment keyword filtering (0/10 pass)
57+
- **Comprehensive test suite**: 65 new tests total (40 pattern tests + 15 HTML tests + 10 pre-processing tests)
3558

3659
### Enhanced
60+
- **Pre-processing architecture**: New stage before layer pipeline prevents pattern interference
61+
- **Layer 1 (ContentCleaning)**: Smarter trailing wrapper text removal - checks if trailing content is valid JSON before removing
62+
- **Pipeline orchestration**: Integrated pre-processing with main repair pipeline for seamless operation
3763
- **Documentation**: Logo integrated in README.md and HexDocs
3864
- **Package assets**: Logo included in hex package for professional documentation display
39-
- **README.md**: Added HTML handling documentation and known missing patterns section
40-
- **Test coverage**: All 82 critical tests passing, 15 new HTML tests passing (100% success rate)
65+
- **README.md**: Added pre-processing pipeline documentation, HTML handling, and updated pattern status
66+
- **Test coverage**: All critical tests passing (82 + 65 new tests = 147 tests, 100% success rate)
4167

4268
### Fixed
69+
- **Multiple JSON values**: Consecutive JSON values like `[]{}` now properly aggregated
70+
- **Object boundary issues**: Extra key-value pairs after closing braces now merged correctly
71+
- **Ellipsis placeholders**: Unquoted `...` in arrays removed while preserving quoted ellipsis
72+
- **Debug keywords**: Comment-like keywords (COMMENT, DEBUG_INFO, etc.) filtered from output
4373
- **HTML in JSON values**: Unquoted HTML after colons (e.g., `"body":<!DOCTYPE HTML>`) now properly quoted and escaped
4474
- **API error pages**: Full HTML error responses from APIs (503, 404, etc.) now handled correctly
4575
- **Complex HTML**: Nested tags, attributes with quotes, special entities all work properly
4676

4777
### Technical Details
48-
- **Smart depth tracking**: Monitors both HTML tag depth and JSON-like structure depth
78+
- **Pre-processing stage**: Runs before Layer 1 to handle patterns that would otherwise break
79+
- **Smart JSON detection**: Parses multiple consecutive JSON values with proper position tracking
80+
- **Object boundary analysis**: Tracks brace balance to identify and merge split objects
81+
- **Context-aware filtering**: Preserves quoted ellipsis and keywords while removing unquoted ones
82+
- **HTML depth tracking**: Monitors both HTML tag depth and JSON-like structure depth
4983
- **Context awareness**: Only stops at JSON delimiters when all HTML tags are closed
5084
- **Void element list**: 15 HTML5 void elements recognized (`area`, `base`, `br`, `col`, `embed`, `hr`, `img`, `input`, `link`, `meta`, `param`, `source`, `track`, `wbr`)
5185
- **Binary optimization**: HTML detection integrated into Layer 3's binary processing pipeline
5286
- **Zero regressions**: All existing tests remain passing
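The tag depth tracking mentioned in the details above can be approximated with a short reducer over tag tokens. This is a rough sketch, not the `HtmlHandlers` code: the module name `HtmlDepthSketch`, the `depth/1` function, and the regular expression are all assumptions made for illustration, and attribute values containing `>` are not handled.

```elixir
defmodule HtmlDepthSketch do
  @moduledoc false
  # Hypothetical illustration; not the JsonRemedy implementation.

  # Void element names as listed in the changelog entry.
  @void ~w(area base br col embed hr img input link meta param source track wbr)

  # Matches comments, declarations such as DOCTYPE, and opening/closing tags.
  defp token, do: ~r{<!--.*?-->|<![^>]*>|<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>}s

  # Returns how many non-void tags are still open after scanning `html`.
  def depth(html) when is_binary(html) do
    token()
    |> Regex.scan(html)
    |> Enum.reduce(0, &step/2)
  end

  # Closing tag: one level up, never below zero.
  defp step([_full, "/", _name, _self_close], depth), do: max(depth - 1, 0)

  # Explicitly self-closed tag such as `<br/>`: depth unchanged.
  defp step([_full, "", _name, "/"], depth), do: depth

  # Opening tag: void elements never expect a closing tag.
  defp step([_full, "", name, ""], depth) when name != "" do
    if String.downcase(name) in @void, do: depth, else: depth + 1
  end

  # Comments and DOCTYPE declarations do not affect depth.
  defp step(_comment_or_declaration, depth), do: depth
end

IO.inspect(HtmlDepthSketch.depth("<div><br><p>hi</p>"))
```

A depth of zero is the point at which a JSON delimiter such as `,` or `}` could safely be treated as the end of the HTML value, which matches the "context awareness" item above.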
5387

88+
### Cleanup
89+
- Removed temporary test scripts: `test_boolean.exs`, `test_weiss.exs`
90+
5491
## [0.1.4] - 2025-10-07
5592

5693
### Added

README.md

Lines changed: 69 additions & 29 deletions
Original file line numberDiff line numberDiff line change
@@ -13,7 +13,7 @@
1313

1414
A comprehensive, production-ready JSON repair library for Elixir that intelligently fixes malformed JSON strings from any source—LLMs, legacy systems, data pipelines, streaming APIs, and human input.
1515

16-
**JsonRemedy** uses a sophisticated 5-layer repair pipeline where each layer employs the most appropriate technique: content cleaning, state machines for structural repairs, character-by-character parsing for syntax normalization, and battle-tested parsers for validation. The result is a robust system that handles virtually any JSON malformation while preserving valid content.
16+
**JsonRemedy** uses a sophisticated pre-processing stage followed by a 5-layer repair pipeline where each layer employs the most appropriate technique: pattern detection, content cleaning, state machines for structural repairs, character-by-character parsing for syntax normalization, and battle-tested parsers for validation. The result is a robust system that handles virtually any JSON malformation while preserving valid content.
1717

1818
## The Problem
1919

@@ -54,6 +54,16 @@ Standard JSON parsers fail completely on these inputs. JsonRemedy fixes them int
5454

5555
## Comprehensive Repair Capabilities
5656

57+
### 🔄 **Pre-processing Pipeline** *(v0.1.5+)*
58+
Runs **before** the main layer pipeline to handle complex patterns that would otherwise be broken by subsequent layers:
59+
60+
- **Multiple JSON detection**: `[]{}` → `[[], {}]` - Aggregates consecutive JSON values
61+
- **Object boundary merging**: `{"a":"b"},"c":"d"}` → `{"a":"b","c":"d"}` - Merges split objects
62+
- **Ellipsis filtering**: `[1,2,3,...]` → `[1,2,3]` - Removes unquoted ellipsis placeholders (LLM truncation markers)
63+
- **Keyword filtering**: `{"a":1, COMMENT "b":2}` → `{"a":1,"b":2}` - Removes debug keywords (COMMENT, DEBUG_INFO, PLACEHOLDER, TODO, etc.)
64+
65+
*Inspired by patterns from the [json_repair](https://github.com/mangiucugna/json_repair) Python library*
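As a rough illustration of the multiple JSON detection pattern listed above, the sketch below splits the input into top-level bracketed values by tracking bracket depth and string state, then wraps them in one array when more than one is found. The module name `MultiJsonSketch` and the `aggregate/1` function are hypothetical; the library's `MultipleJsonDetector` may work differently.

```elixir
defmodule MultiJsonSketch do
  @moduledoc false
  # Hypothetical illustration; not the JsonRemedy implementation.

  # Wrap consecutive top-level values in an array; leave single values alone.
  def aggregate(input) when is_binary(input) do
    case split(input, 0, false, [], []) do
      [_, _ | _] = values -> "[" <> Enum.join(values, ", ") <> "]"
      _zero_or_one -> input
    end
  end

  # End of input: flush pending bytes and restore the original order.
  defp split(<<>>, _depth, _in_str, cur, done), do: Enum.reverse(flush(cur, done))

  # Escaped byte inside a string: copy both bytes untouched.
  defp split(<<"\\", c, rest::binary>>, depth, true, cur, done),
    do: split(rest, depth, true, [c, "\\" | cur], done)

  # A double quote toggles string state.
  defp split(<<"\"", rest::binary>>, depth, in_str, cur, done),
    do: split(rest, depth, not in_str, ["\"" | cur], done)

  # Opening bracket outside a string increases depth.
  defp split(<<c, rest::binary>>, depth, false, cur, done) when c in [?{, ?[],
    do: split(rest, depth + 1, false, [c | cur], done)

  # Closing bracket that returns to depth 0 completes a top-level value.
  defp split(<<c, rest::binary>>, 1, false, cur, done) when c in [?}, ?]],
    do: split(rest, 0, false, [], flush([c | cur], done))

  # Any other closing bracket only decreases depth.
  defp split(<<c, rest::binary>>, depth, false, cur, done) when c in [?}, ?]],
    do: split(rest, max(depth - 1, 0), false, [c | cur], done)

  # Everything else is copied.
  defp split(<<c, rest::binary>>, depth, in_str, cur, done),
    do: split(rest, depth, in_str, [c | cur], done)

  # Turn the reversed iodata into a trimmed binary; skip blank segments.
  defp flush(cur, done) do
    case cur |> Enum.reverse() |> IO.iodata_to_binary() |> String.trim() do
      "" -> done
      value -> [value | done]
    end
  end
end

IO.puts(MultiJsonSketch.aggregate("[]{}"))
```

Brackets that appear inside string values are ignored because the scanner only counts depth while outside a string.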
66+
5767
### 🧹 **Content Cleaning (Layer 1)**
5868
- **Code fences**: ````json ... ```` → clean JSON
5969
- **Comments**: `// line comments` and `/* block comments */` → removed
@@ -253,7 +263,7 @@ Learn the fundamentals with step-by-step examples:
253263
- Repairing structural issues
254264
- Processing LLM outputs
255265

256-
### 🔧 **Hardcoded Patterns Examples** *NEW*
266+
### 🔧 **Hardcoded Patterns Examples** *NEW in v0.1.4*
257267
```bash
258268
mix run examples/hardcoded_patterns_examples.exs
259269
```
@@ -265,6 +275,19 @@ Demonstrates advanced cleanup patterns ported from Python's `json_repair` librar
265275
- **International text**: UTF-8 support with smart quotes
266276
- **Combined patterns**: Real-world LLM output examples
267277

278+
### 🌐 **HTML Content Examples** *NEW in v0.1.5*
279+
```bash
280+
mix run examples/html_content_examples.exs
281+
```
282+
Demonstrates handling of unquoted HTML content in JSON values (common when APIs return error pages):
283+
- **API 503 Service Unavailable**: Full HTML error page in JSON response
284+
- **API 404 Not Found**: HTML 404 page with comments and metadata
285+
- **Simple HTML fragments**: Bio fields and content with HTML tags
286+
- **Multiple HTML values**: Arrays of templates with HTML content
287+
- **Complex nested HTML**: HTML with JSON-like attributes and embedded scripts
288+
289+
This example showcases the HTML detection and quoting capabilities added in v0.1.5, which handle real-world scenarios where API endpoints return HTML error pages instead of JSON.
290+
268291
### 🌍 **Real-World Scenarios**
269292
```bash
270293
mix run examples/real_world_scenarios.exs
@@ -457,50 +480,62 @@ The current implementation handles **~95% of real-world malformed JSON** through
457480
- ⏳ Stream-safe parsing for incomplete JSON
458481
- ⏳ Literal disambiguation algorithms
459482

460-
### 📋 **Known Missing Patterns**
483+
### **Previously Missing Patterns - Now Implemented!** *(v0.1.5)*
461484

462-
Based on comprehensive analysis of the [json_repair](https://github.com/mangiucugna/json_repair) Python library, the following patterns are **documented but not yet implemented**. Test cases exist in the repository to track these:
485+
Based on comprehensive analysis of the [json_repair](https://github.com/mangiucugna/json_repair) Python library, the following patterns were initially documented as missing but are **now fully implemented** in v0.1.5:
463486

464-
**Critical Missing Patterns** *(test files provided)*:
465-
1. **Multiple JSON Values Aggregation** - `test_missing_pattern_1_multiple_json.exs`
487+
**Implemented Advanced Patterns** *(all tests passing)*:
488+
1. **Multiple JSON Values Aggregation** - `test/missing_patterns/pattern1_multiple_json_test.exs`
466489
- Pattern: `[]{}` → `[[],{}]`
467-
- Status: 0/10 tests pass
468-
- Will wrap multiple complete JSON values into an array
490+
- **Status: ✅ 10/10 tests pass**
491+
- Implementation: `MultipleJsonDetector` utility in pre-processing pipeline
492+
- Wraps multiple complete JSON values into an array
469493

470-
2. **Object Boundary Merging** - `test_missing_pattern_2_object_merging.exs`
494+
2. **Object Boundary Merging** - `test/missing_patterns/pattern2_object_merging_test.exs`
471495
- Pattern: `{"a":"b"},"c":"d"}` → `{"a":"b","c":"d"}`
472-
- Status: 0/10 tests pass
473-
- Will merge additional key-value pairs after object close
496+
- **Status: ✅ 10/10 tests pass**
497+
- Implementation: `ObjectMerger` module in Layer 3
498+
- Merges additional key-value pairs after premature object close
474499

475-
3. **Ellipsis Filtering** - `test_missing_pattern_3_ellipsis.exs`
500+
3. **Ellipsis Filtering** - `test/missing_patterns/pattern3_ellipsis_test.exs`
476501
- Pattern: `[1,2,3,...]` → `[1,2,3]`
477-
- Status: 1/10 tests pass (quoted ellipsis preserved correctly)
478-
- Will filter unquoted `...` placeholders from arrays
502+
- **Status: ✅ 10/10 tests pass**
503+
- Implementation: `EllipsisFilter` module in Layer 3
504+
- Filters unquoted `...` placeholders from arrays (common in LLM output)
479505

480-
4. **Comment Keywords Filtering** - `test_missing_pattern_4_comment_keywords.exs`
506+
4. **Comment Keywords Filtering** - `test/missing_patterns/pattern4_comment_keywords_test.exs`
481507
- Pattern: `{"a":1, COMMENT "b":2}` → `{"a":1,"b":2}`
482-
- Status: 0/10 tests pass
483-
- Will filter unquoted keywords like `COMMENT`, `SHOULD_NOT_EXIST`
508+
- **Status: ✅ 10/10 tests pass**
509+
- Implementation: `KeywordFilter` module in Layer 3
510+
- Filters unquoted keywords: `COMMENT`, `SHOULD_NOT_EXIST`, `DEBUG_INFO`, `PLACEHOLDER`, `TODO`, `FIXME`, etc.
484511

485-
These patterns are planned for future releases but do not block production use for most real-world scenarios. The current implementation handles the vast majority of malformed JSON encountered in practice.
512+
These advanced patterns handle edge cases commonly found in LLM outputs, debug logs, and malformed API responses. All 40 pattern tests pass with 100% success rate.
486513

487-
## The 5-Layer Architecture
514+
## The Pre-processing + 5-Layer Architecture
488515

489-
JsonRemedy's strength comes from its pragmatic, layered approach where each layer uses the optimal technique:
516+
JsonRemedy's strength comes from its pragmatic, layered approach where each stage uses the optimal technique:
490517

491518
```elixir
492519
defmodule JsonRemedy.LayeredRepair do
493520
def repair(input) do
494521
input
495-
|> Layer1.content_cleaning() # Cleaning: Remove wrappers, comments, normalize encoding
496-
|> Layer2.structural_repair() # State machine: Fix delimiters, nesting, structure
497-
|> Layer3.syntax_normalization() # Char parsing: Fix quotes, booleans, commas
498-
|> Layer4.validation_attempt() # Jason.decode: Fast path for clean JSON
499-
|> Layer5.tolerant_parsing() # Custom parser: Handle edge cases gracefully (FUTURE)
522+
|> PreProcessing.detect_and_fix() # Pre-process: Multiple JSON, object merging, filtering
523+
|> Layer1.content_cleaning() # Cleaning: Remove wrappers, comments, normalize encoding
524+
|> Layer2.structural_repair() # State machine: Fix delimiters, nesting, structure
525+
|> Layer3.syntax_normalization() # Char parsing: Fix quotes, booleans, commas
526+
|> Layer4.validation_attempt() # Jason.decode: Fast path for clean JSON
527+
|> Layer5.tolerant_parsing() # Custom parser: Handle edge cases gracefully (FUTURE)
500528
end
501529
end
502530
```
503531

532+
### 🔄 **Pre-processing Stage** *(v0.1.5)*
533+
**Technique**: Pattern detection and early transformation
534+
- Detects and aggregates multiple consecutive JSON values
535+
- Merges split objects with boundary issues
536+
- Filters ellipsis and debug keywords
537+
- Runs before Layer 1 to prevent pattern interference
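To make the "merges split objects" step more concrete, here is a minimal sketch of brace-balance merging: a `}` that would close the outermost object is dropped when a `,"key":` pair follows it. The module name `ObjectMergeSketch`, the `merge/1` function, and the lookahead regular expression are assumptions for illustration and do not reflect the actual `ObjectMerger` code.

```elixir
defmodule ObjectMergeSketch do
  @moduledoc false
  # Hypothetical illustration; not the JsonRemedy implementation.

  # Remove a premature closing brace when more key-value pairs follow it.
  def merge(input) when is_binary(input), do: scan(input, 0, false, [])

  # End of input: the accumulator is reversed iodata.
  defp scan(<<>>, _depth, _in_str, acc), do: acc |> Enum.reverse() |> IO.iodata_to_binary()

  # Escaped byte inside a string: copy both bytes untouched.
  defp scan(<<"\\", c, rest::binary>>, depth, true, acc),
    do: scan(rest, depth, true, [c, "\\" | acc])

  # A double quote toggles string state.
  defp scan(<<"\"", rest::binary>>, depth, in_str, acc),
    do: scan(rest, depth, not in_str, ["\"" | acc])

  defp scan(<<"{", rest::binary>>, depth, false, acc),
    do: scan(rest, depth + 1, false, ["{" | acc])

  # A brace closing the outermost object is dropped if `,"key":` comes next.
  defp scan(<<"}", rest::binary>>, 1, false, acc) do
    if continues_with_pair?(rest),
      do: scan(rest, 1, false, acc),
      else: scan(rest, 0, false, ["}" | acc])
  end

  # Nested closing braces are always kept.
  defp scan(<<"}", rest::binary>>, depth, false, acc),
    do: scan(rest, max(depth - 1, 0), false, ["}" | acc])

  defp scan(<<c, rest::binary>>, depth, in_str, acc),
    do: scan(rest, depth, in_str, [c | acc])

  # Lookahead: optional whitespace, a comma, a quoted key, then a colon.
  defp continues_with_pair?(rest),
    do: Regex.match?(~r/\A\s*,\s*"(?:[^"\\]|\\.)*"\s*:/, rest)
end

IO.puts(ObjectMergeSketch.merge(~s({"a":"b"},"c":"d"})))
```

Checking the depth before dropping a brace is what keeps well-formed nested objects such as `{"a":{"b":1},"c":2}` unchanged.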
538+
504539
### 🧹 **Layer 1: Content Cleaning**
505540
**Technique**: String operations
506541
- Removes code fences, comments, wrapper text
@@ -972,20 +1007,25 @@ mix run bench/memory_profile.exs
9721007

9731008
```
9741009
lib/
975-
├── json_remedy.ex # Main API
1010+
├── json_remedy.ex # Main API with pre-processing
9761011
├── json_remedy/
9771012
│ ├── layer_behaviour.ex # Common interface for all layers
1013+
│ ├── utils/
1014+
│ │ └── multiple_json_detector.ex # ✅ Pre-processing: Multiple JSON aggregation
9781015
│ ├── layer1/
9791016
│ │ └── content_cleaning.ex # ✅ Code fences, comments, wrappers
9801017
│ ├── layer2/
9811018
│ │ └── structural_repair.ex # ✅ Delimiters, nesting, state machine
9821019
│ ├── layer3/
983-
│ │ └── syntax_normalization.ex # ✅ Quotes, booleans, char-by-char parsing
1020+
│ │ ├── syntax_normalization.ex # ✅ Quotes, booleans, char-by-char parsing
1021+
│ │ ├── object_merger.ex # ✅ Pre-processing: Object boundary merging
1022+
│ │ ├── ellipsis_filter.ex # ✅ Filter unquoted ellipsis
1023+
│ │ └── keyword_filter.ex # ✅ Filter debug keywords
9841024
│ ├── layer4/
985-
│ │ └── validation.ex # Jason.decode optimization
1025+
│ │ └── validation.ex # Jason.decode optimization
9861026
│ ├── layer5/ # ⏳ PLANNED
9871027
│ │ └── tolerant_parsing.ex # ⏳ Custom parser with error recovery
988-
│ ├── pipeline.ex # Layer orchestration
1028+
│ ├── pipeline.ex # Layer orchestration with pre-processing
9891029
│ ├── performance.ex # Monitoring and health checks
9901030
│ └── config.ex # Configuration management
9911031
```
