feat: support Claude Code transcripts#168
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements a parser for Claude Code JSONL transcripts, adding _content_to_text and _parse_claude_code_transcript functions to server.py and src/api/routes/memory.py, and includes a new test file. Feedback indicates that the parsing logic is duplicated and should be moved to a shared module to reduce maintenance overhead. Additionally, a performance optimization was suggested to include a heuristic check for JSON content before attempting to parse the transcript lines.
| def _content_to_text(content: Any) -> str: | ||
| """Extract readable text from Claude Code message content blocks.""" | ||
| if isinstance(content, str): | ||
| return content.strip() | ||
| if isinstance(content, list): | ||
| chunks: list[str] = [] | ||
| for item in content: | ||
| if isinstance(item, str): | ||
| chunks.append(item) | ||
| elif isinstance(item, dict) and item.get("type") == "text": | ||
| chunks.append(str(item.get("text", ""))) | ||
| return "\n".join(chunk.strip() for chunk in chunks if chunk.strip()).strip() | ||
| return "" |
There was a problem hiding this comment.
The logic for _content_to_text and _parse_claude_code_transcript is duplicated between server.py and src/api/routes/memory.py. This increases maintenance overhead and the risk of inconsistencies as the parsing logic evolves. Consider moving these utilities to a shared module (e.g., src/utils/transcripts.py) that both files can import from.
| current_user_query: str | None = None | ||
| assistant_chunks: list[str] = [] | ||
|
|
||
| for raw_line in text.splitlines(): |
There was a problem hiding this comment.
The current implementation of _parse_claude_code_transcript iterates through every line of the input text and attempts to parse it as JSON. This can be inefficient for large non-JSON transcripts (e.g., standard markdown files that don't match Cursor or Antigravity formats). Since Claude Code transcripts are JSONL files, adding a quick heuristic check at the beginning of the function can avoid unnecessary processing.
| for raw_line in text.splitlines(): | |
| if not text.strip().startswith("{"): | |
| return [] | |
| for raw_line in text.splitlines(): |
| current_user_query: str | None = None | ||
| assistant_chunks: List[str] = [] | ||
|
|
||
| for raw_line in text.splitlines(): |
There was a problem hiding this comment.
The current implementation of _parse_claude_code_transcript iterates through every line of the input text and attempts to parse it as JSON. This can be inefficient for large non-JSON transcripts. Adding a quick heuristic check at the beginning of the function can avoid unnecessary processing for files that are clearly not in JSONL format.
| for raw_line in text.splitlines(): | |
| if not text.strip().startswith("{"): | |
| return [] | |
| for raw_line in text.splitlines(): |
| current_user_query: str | None = None | ||
| assistant_chunks: List[str] = [] | ||
|
|
||
| for raw_line in text.splitlines(): |
There was a problem hiding this comment.
Good call. Since Claude Code transcripts are JSONL, the shared parser should first reject obvious non-JSONL input before iterating through every line. This should be fixed in the shared parser rather than separately in both files.
|
|
||
|
|
||
| def _content_to_text(content: Any) -> str: | ||
| """Extract readable text from Claude Code message content blocks.""" |
There was a problem hiding this comment.
Agree. Since this parser is used by both the standalone server and the production memory route, please move the Claude transcript parsing into src/utils/transcripts.py and have both server.py and src/api/routes/memory.py import the shared parser from there.
Ankit-Kotnala
left a comment
There was a problem hiding this comment.
The feature is good, but @LoikStyle should centralize the parser and clean up the test before merge.
|
@LoikStyle please have a look on the suggestions |
|
| Filename | Overview |
|---|---|
| server.py | Adds _content_to_text and _parse_claude_code_transcript as a catch-all JSONL parser in _parse_transcript_text; logic is duplicated from memory.py |
| src/api/routes/memory.py | Primary implementation of Claude Code JSONL parsing; placed as catch-all fallback with no format guard, risking false-positive detection of arbitrary JSONL files |
| tests/test_claude_code_transcript.py | New test file using AST-based function extraction to test the parser in isolation; covers happy path and tool-only filtering, but misses false-positive and edge-case scenarios |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Upload transcript text] --> B{Contains '_Exported on' and 'from Cursor'?}
B -- Yes --> C[_parse_cursor_transcript]
C --> D{pairs found?}
D -- Yes --> E[return 'cursor', pairs]
D -- No --> F
B -- No --> F{Contains '# Chat Conversation' and heading markers?}
F -- Yes --> G[_parse_antigravity_transcript]
G --> H{pairs found?}
H -- Yes --> I[return 'antigravity', pairs]
H -- No --> J
F -- No --> J[_parse_claude_code_transcript - NO format guard]
J --> K{pairs found?}
K -- Yes --> L[return 'claude_code', pairs]
K -- No --> M[return 'unknown', empty]
Reviews (1): Last reviewed commit: "feat: support Claude Code transcripts" | Re-trigger Greptile
| pairs = _parse_claude_code_transcript(text) | ||
| if pairs: | ||
| return "claude_code", pairs |
There was a problem hiding this comment.
Catch-all JSONL parser with no format guard
Unlike the Cursor and Antigravity parsers, _parse_claude_code_transcript has no upfront marker check, so it runs on any input that isn't matched by the earlier detectors. Any JSONL file containing objects with role: "user" / role: "assistant" (or matching type values) and text content — such as an OpenAI or Gemini API log, a generic chat export, or a database dump — will be silently misidentified as a Claude Code transcript and its contents ingested as memories. Consider adding a lightweight guard, e.g. checking that at least one line is a dict containing a "message" key wrapping another dict with a "role" field, before committing to the "claude_code" format. The same issue exists in server.py at the equivalent call site.
| has_nested_message = isinstance(event.get("message"), dict) | ||
| message = event["message"] if has_nested_message else event | ||
| role = message.get("role") or (event.get("type") if not has_nested_message else None) |
There was a problem hiding this comment.
event.type not used as role fallback when message dict is present but lacks role
event.get("type") is only tried as a fallback when has_nested_message is False. When a nested message dict exists but contains no "role" key (e.g. {"type":"user","message":{"content":"..."}} — a plausible compact variant), role resolves to None and the event is silently dropped. Using the outer type field as a fallback in both branches would be safer.
| if role == "user": | ||
| if current_user_query and assistant_chunks: | ||
| pairs.append(MessagePair( | ||
| user_query=current_user_query, | ||
| agent_response="\n\n".join(assistant_chunks).strip(), | ||
| )) | ||
| current_user_query = content | ||
| assistant_chunks = [] | ||
| elif current_user_query: | ||
| current_user_query = f"{current_user_query}\n\n{content}" | ||
| else: | ||
| current_user_query = content | ||
| elif role == "assistant" and current_user_query: | ||
| assistant_chunks.append(content) |
There was a problem hiding this comment.
Tool-only assistant turns silently merge the preceding and following user messages
When an assistant turn contains only tool calls, _content_to_text returns "" and the line is skipped entirely via continue. From the state machine's perspective there is no assistant response between the two user messages, so the elif current_user_query branch fires and concatenates them. For example: User "Add tests" → Assistant [tool_use only] → User "Run them" → Assistant "Done." produces the pair user_query = "Add tests\n\nRun them" instead of two separate pairs, misrepresenting the conversation structure in stored memories.
| def _content_to_text(content: Any) -> str: | ||
| """Extract readable text from Claude Code message content blocks.""" | ||
| if isinstance(content, str): | ||
| return content.strip() | ||
| if isinstance(content, list): | ||
| chunks: list[str] = [] | ||
| for item in content: | ||
| if isinstance(item, str): | ||
| chunks.append(item) | ||
| elif isinstance(item, dict) and item.get("type") == "text": | ||
| chunks.append(str(item.get("text", ""))) | ||
| return "\n".join(chunk.strip() for chunk in chunks if chunk.strip()).strip() | ||
| return "" | ||
|
|
||
|
|
||
| def _parse_claude_code_transcript(text: str) -> list[dict[str, str]]: | ||
| """Parse Claude Code JSONL transcripts into message pairs.""" | ||
| pairs: list[dict[str, str]] = [] | ||
| current_user_query: str | None = None | ||
| assistant_chunks: list[str] = [] | ||
|
|
||
| for raw_line in text.splitlines(): | ||
| raw_line = raw_line.strip() | ||
| if not raw_line: | ||
| continue | ||
|
|
||
| try: | ||
| event = json.loads(raw_line) | ||
| except json.JSONDecodeError: | ||
| continue | ||
|
|
||
| if not isinstance(event, dict): | ||
| continue | ||
|
|
||
| has_nested_message = isinstance(event.get("message"), dict) | ||
| message = event["message"] if has_nested_message else event | ||
| role = message.get("role") or (event.get("type") if not has_nested_message else None) | ||
| content = _content_to_text(message.get("content")) | ||
| if not content: | ||
| continue | ||
|
|
||
| if role == "user": | ||
| if current_user_query and assistant_chunks: | ||
| pairs.append({ | ||
| "user_query": current_user_query, | ||
| "agent_response": "\n\n".join(assistant_chunks).strip(), | ||
| }) | ||
| current_user_query = content | ||
| assistant_chunks = [] | ||
| elif current_user_query: | ||
| current_user_query = f"{current_user_query}\n\n{content}" | ||
| else: | ||
| current_user_query = content | ||
| elif role == "assistant" and current_user_query: | ||
| assistant_chunks.append(content) | ||
|
|
||
| if current_user_query and assistant_chunks: | ||
| pairs.append({ | ||
| "user_query": current_user_query, | ||
| "agent_response": "\n\n".join(assistant_chunks).strip(), | ||
| }) | ||
|
|
||
| return pairs |
There was a problem hiding this comment.
_content_to_text and _parse_claude_code_transcript are duplicated verbatim
Both functions are copied character-for-character between server.py and src/api/routes/memory.py, continuing the existing duplication pattern for the Cursor and Antigravity parsers. Any future bug fix or format change will need to be applied in two places. Centralising the logic in a shared module (e.g. src/utils/transcript_parsers.py) and importing from both files would remove the maintenance risk.
Summary
Test Plan
python3 -m pytest tests/test_claude_code_transcript.py -q -o addopts=''python3 -m py_compile src/api/routes/memory.py server.pyFixes #156