Skip to content

feat: support Claude Code transcripts#168

Open
LoikStyle wants to merge 1 commit into
XortexAI:mainfrom
LoikStyle:loikstyle/claude-code-transcript-156
Open

feat: support Claude Code transcripts#168
LoikStyle wants to merge 1 commit into
XortexAI:mainfrom
LoikStyle:loikstyle/claude-code-transcript-156

Conversation

@LoikStyle
Copy link
Copy Markdown

Summary

  • Add Claude Code JSONL transcript parsing to the context import pipeline
  • Extract only conversational text from user/assistant turns and ignore tool-only blocks
  • Include focused regression coverage for Claude Code transcript uploads

Test Plan

  • python3 -m pytest tests/test_claude_code_transcript.py -q -o addopts=''
  • python3 -m py_compile src/api/routes/memory.py server.py

Fixes #156

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a parser for Claude Code JSONL transcripts, adding _content_to_text and _parse_claude_code_transcript functions to server.py and src/api/routes/memory.py, and includes a new test file. Feedback indicates that the parsing logic is duplicated and should be moved to a shared module to reduce maintenance overhead. Additionally, a performance optimization was suggested to include a heuristic check for JSON content before attempting to parse the transcript lines.

Comment thread server.py
Comment on lines +798 to +810
def _content_to_text(content: Any) -> str:
"""Extract readable text from Claude Code message content blocks."""
if isinstance(content, str):
return content.strip()
if isinstance(content, list):
chunks: list[str] = []
for item in content:
if isinstance(item, str):
chunks.append(item)
elif isinstance(item, dict) and item.get("type") == "text":
chunks.append(str(item.get("text", "")))
return "\n".join(chunk.strip() for chunk in chunks if chunk.strip()).strip()
return ""
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic for _content_to_text and _parse_claude_code_transcript is duplicated between server.py and src/api/routes/memory.py. This increases maintenance overhead and the risk of inconsistencies as the parsing logic evolves. Consider moving these utilities to a shared module (e.g., src/utils/transcripts.py) that both files can import from.

Comment thread server.py
current_user_query: str | None = None
assistant_chunks: list[str] = []

for raw_line in text.splitlines():
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current implementation of _parse_claude_code_transcript iterates through every line of the input text and attempts to parse it as JSON. This can be inefficient for large non-JSON transcripts (e.g., standard markdown files that don't match Cursor or Antigravity formats). Since Claude Code transcripts are JSONL files, adding a quick heuristic check at the beginning of the function can avoid unnecessary processing.

Suggested change
for raw_line in text.splitlines():
if not text.strip().startswith("{"):
return []
for raw_line in text.splitlines():

Comment thread src/api/routes/memory.py
current_user_query: str | None = None
assistant_chunks: List[str] = []

for raw_line in text.splitlines():
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current implementation of _parse_claude_code_transcript iterates through every line of the input text and attempts to parse it as JSON. This can be inefficient for large non-JSON transcripts. Adding a quick heuristic check at the beginning of the function can avoid unnecessary processing for files that are clearly not in JSONL format.

Suggested change
for raw_line in text.splitlines():
if not text.strip().startswith("{"):
return []
for raw_line in text.splitlines():

Comment thread src/api/routes/memory.py
current_user_query: str | None = None
assistant_chunks: List[str] = []

for raw_line in text.splitlines():
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. Since Claude Code transcripts are JSONL, the shared parser should first reject obvious non-JSONL input before iterating through every line. This should be fixed in the shared parser rather than separately in both files.

Comment thread server.py


def _content_to_text(content: Any) -> str:
"""Extract readable text from Claude Code message content blocks."""
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. Since this parser is used by both the standalone server and the production memory route, please move the Claude transcript parsing into src/utils/transcripts.py and have both server.py and src/api/routes/memory.py import the shared parser from there.

Copy link
Copy Markdown
Collaborator

@Ankit-Kotnala Ankit-Kotnala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The feature is good, but @LoikStyle should centralize the parser and clean up the test before merge.

@ishaanxgupta
Copy link
Copy Markdown
Member

@LoikStyle please have a look on the suggestions

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented May 23, 2026

Greptile Summary

This PR adds Claude Code JSONL transcript parsing to the context import pipeline in both server.py and src/api/routes/memory.py, extracting only conversational text from user/assistant turns while filtering out tool-use blocks.

  • A new _content_to_text helper strips tool-use and other non-text content blocks from message payloads, and _parse_claude_code_transcript builds user/assistant pairs from the resulting JSONL stream.
  • The parser is registered as the last-resort fallback in _parse_transcript_text with no format-specific guard, meaning any JSONL file containing role-paired objects will be detected as Claude Code — unlike the Cursor and Antigravity detectors which check for unique textual markers first.
  • Two focused regression tests are included, but both files duplicate the full implementation rather than sharing it from a common module.

Confidence Score: 3/5

The new parser introduces a catch-all detection path with no format guard, which means any JSONL file with user/assistant role pairs could be silently ingested as Claude Code memories — a present defect in format detection that should be addressed before merging.

The format detection in _parse_transcript_text now relies on order alone for Claude Code: if Cursor and Antigravity checks fail, any JSONL with role-paired content is treated as Claude Code and could store incorrect memory pairs. Additionally, tool-only assistant turns between two user messages silently merge those messages into a single pair, misrepresenting the conversation structure stored in memory.

The detection guard in _parse_transcript_text in both src/api/routes/memory.py (line 581) and server.py (line 922) warrants the most attention.

Important Files Changed

Filename Overview
server.py Adds _content_to_text and _parse_claude_code_transcript as a catch-all JSONL parser in _parse_transcript_text; logic is duplicated from memory.py
src/api/routes/memory.py Primary implementation of Claude Code JSONL parsing; placed as catch-all fallback with no format guard, risking false-positive detection of arbitrary JSONL files
tests/test_claude_code_transcript.py New test file using AST-based function extraction to test the parser in isolation; covers happy path and tool-only filtering, but misses false-positive and edge-case scenarios

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Upload transcript text] --> B{Contains '_Exported on' and 'from Cursor'?}
    B -- Yes --> C[_parse_cursor_transcript]
    C --> D{pairs found?}
    D -- Yes --> E[return 'cursor', pairs]
    D -- No --> F
    B -- No --> F{Contains '# Chat Conversation' and heading markers?}
    F -- Yes --> G[_parse_antigravity_transcript]
    G --> H{pairs found?}
    H -- Yes --> I[return 'antigravity', pairs]
    H -- No --> J
    F -- No --> J[_parse_claude_code_transcript - NO format guard]
    J --> K{pairs found?}
    K -- Yes --> L[return 'claude_code', pairs]
    K -- No --> M[return 'unknown', empty]
Loading

Fix All in Cursor Fix All in Codex Fix All in Claude Code

Reviews (1): Last reviewed commit: "feat: support Claude Code transcripts" | Re-trigger Greptile

Comment thread src/api/routes/memory.py
Comment on lines +581 to +583
pairs = _parse_claude_code_transcript(text)
if pairs:
return "claude_code", pairs
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Catch-all JSONL parser with no format guard

Unlike the Cursor and Antigravity parsers, _parse_claude_code_transcript has no upfront marker check, so it runs on any input that isn't matched by the earlier detectors. Any JSONL file containing objects with role: "user" / role: "assistant" (or matching type values) and text content — such as an OpenAI or Gemini API log, a generic chat export, or a database dump — will be silently misidentified as a Claude Code transcript and its contents ingested as memories. Consider adding a lightweight guard, e.g. checking that at least one line is a dict containing a "message" key wrapping another dict with a "role" field, before committing to the "claude_code" format. The same issue exists in server.py at the equivalent call site.

Fix in Cursor Fix in Codex Fix in Claude Code

Comment thread src/api/routes/memory.py
Comment on lines +490 to +492
has_nested_message = isinstance(event.get("message"), dict)
message = event["message"] if has_nested_message else event
role = message.get("role") or (event.get("type") if not has_nested_message else None)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 event.type not used as role fallback when message dict is present but lacks role

event.get("type") is only tried as a fallback when has_nested_message is False. When a nested message dict exists but contains no "role" key (e.g. {"type":"user","message":{"content":"..."}} — a plausible compact variant), role resolves to None and the event is silently dropped. Using the outer type field as a fallback in both branches would be safer.

Fix in Cursor Fix in Codex Fix in Claude Code

Comment thread src/api/routes/memory.py
Comment on lines +497 to +510
if role == "user":
if current_user_query and assistant_chunks:
pairs.append(MessagePair(
user_query=current_user_query,
agent_response="\n\n".join(assistant_chunks).strip(),
))
current_user_query = content
assistant_chunks = []
elif current_user_query:
current_user_query = f"{current_user_query}\n\n{content}"
else:
current_user_query = content
elif role == "assistant" and current_user_query:
assistant_chunks.append(content)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Tool-only assistant turns silently merge the preceding and following user messages

When an assistant turn contains only tool calls, _content_to_text returns "" and the line is skipped entirely via continue. From the state machine's perspective there is no assistant response between the two user messages, so the elif current_user_query branch fires and concatenates them. For example: User "Add tests" → Assistant [tool_use only] → User "Run them" → Assistant "Done." produces the pair user_query = "Add tests\n\nRun them" instead of two separate pairs, misrepresenting the conversation structure in stored memories.

Fix in Cursor Fix in Codex Fix in Claude Code

Comment thread server.py
Comment on lines +798 to +860
def _content_to_text(content: Any) -> str:
"""Extract readable text from Claude Code message content blocks."""
if isinstance(content, str):
return content.strip()
if isinstance(content, list):
chunks: list[str] = []
for item in content:
if isinstance(item, str):
chunks.append(item)
elif isinstance(item, dict) and item.get("type") == "text":
chunks.append(str(item.get("text", "")))
return "\n".join(chunk.strip() for chunk in chunks if chunk.strip()).strip()
return ""


def _parse_claude_code_transcript(text: str) -> list[dict[str, str]]:
"""Parse Claude Code JSONL transcripts into message pairs."""
pairs: list[dict[str, str]] = []
current_user_query: str | None = None
assistant_chunks: list[str] = []

for raw_line in text.splitlines():
raw_line = raw_line.strip()
if not raw_line:
continue

try:
event = json.loads(raw_line)
except json.JSONDecodeError:
continue

if not isinstance(event, dict):
continue

has_nested_message = isinstance(event.get("message"), dict)
message = event["message"] if has_nested_message else event
role = message.get("role") or (event.get("type") if not has_nested_message else None)
content = _content_to_text(message.get("content"))
if not content:
continue

if role == "user":
if current_user_query and assistant_chunks:
pairs.append({
"user_query": current_user_query,
"agent_response": "\n\n".join(assistant_chunks).strip(),
})
current_user_query = content
assistant_chunks = []
elif current_user_query:
current_user_query = f"{current_user_query}\n\n{content}"
else:
current_user_query = content
elif role == "assistant" and current_user_query:
assistant_chunks.append(content)

if current_user_query and assistant_chunks:
pairs.append({
"user_query": current_user_query,
"agent_response": "\n\n".join(assistant_chunks).strip(),
})

return pairs
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 _content_to_text and _parse_claude_code_transcript are duplicated verbatim

Both functions are copied character-for-character between server.py and src/api/routes/memory.py, continuing the existing duplication pattern for the Cursor and Antigravity parsers. Any future bug fix or format change will need to be applied in two places. Centralising the logic in a shared module (e.g. src/utils/transcript_parsers.py) and importing from both files would remove the maintenance risk.

Fix in Cursor Fix in Codex Fix in Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

add support of claude code transcript in /context page

3 participants