feat: support Claude Code transcripts by LoikStyle · Pull Request #168 · XortexAI/XMem

LoikStyle · 2026-05-11T05:08:16Z

Summary

Add Claude Code JSONL transcript parsing to the context import pipeline
Extract only conversational text from user/assistant turns and ignore tool-only blocks
Include focused regression coverage for Claude Code transcript uploads

Test Plan

python3 -m pytest tests/test_claude_code_transcript.py -q -o addopts=''
python3 -m py_compile src/api/routes/memory.py server.py

Fixes #156

gemini-code-assist

Code Review

This pull request implements a parser for Claude Code JSONL transcripts, adding _content_to_text and _parse_claude_code_transcript functions to server.py and src/api/routes/memory.py, and includes a new test file. Feedback indicates that the parsing logic is duplicated and should be moved to a shared module to reduce maintenance overhead. Additionally, a performance optimization was suggested to include a heuristic check for JSON content before attempting to parse the transcript lines.

gemini-code-assist · 2026-05-11T05:13:28Z

+def _content_to_text(content: Any) -> str:
+    """Extract readable text from Claude Code message content blocks."""
+    if isinstance(content, str):
+        return content.strip()
+    if isinstance(content, list):
+        chunks: list[str] = []
+        for item in content:
+            if isinstance(item, str):
+                chunks.append(item)
+            elif isinstance(item, dict) and item.get("type") == "text":
+                chunks.append(str(item.get("text", "")))
+        return "\n".join(chunk.strip() for chunk in chunks if chunk.strip()).strip()
+    return ""


The logic for _content_to_text and _parse_claude_code_transcript is duplicated between server.py and src/api/routes/memory.py. This increases maintenance overhead and the risk of inconsistencies as the parsing logic evolves. Consider moving these utilities to a shared module (e.g., src/utils/transcripts.py) that both files can import from.
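A minimal sketch of what the shared module could look like, assuming the `src/utils/transcripts.py` path suggested above. The public name `content_to_text` (without the leading underscore) and the import path in the trailing comment are assumptions; the body mirrors the helper shown in the diff.

```python
# Hypothetical src/utils/transcripts.py: a single home for the helper that is
# currently copied into both server.py and src/api/routes/memory.py.
from typing import Any


def content_to_text(content: Any) -> str:
    """Extract readable text from Claude Code message content blocks."""
    if isinstance(content, str):
        return content.strip()
    if isinstance(content, list):
        chunks: list[str] = []
        for item in content:
            if isinstance(item, str):
                chunks.append(item)
            elif isinstance(item, dict) and item.get("type") == "text":
                chunks.append(str(item.get("text", "")))
        # Blocks of any other type (for example tool_use) are skipped.
        return "\n".join(c.strip() for c in chunks if c.strip())
    return ""


# Both call sites would then import the helper instead of redefining it.
# The exact import path depends on the package layout (assumption):
#   from src.utils.transcripts import content_to_text
```

With one definition, a later fix to the block filtering only has to be made once.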

gemini-code-assist · 2026-05-11T05:13:28Z

+    current_user_query: str | None = None
+    assistant_chunks: list[str] = []
+
+    for raw_line in text.splitlines():


The current implementation of _parse_claude_code_transcript iterates through every line of the input text and attempts to parse it as JSON. This can be inefficient for large non-JSON transcripts (e.g., standard markdown files that don't match Cursor or Antigravity formats). Since Claude Code transcripts are JSONL files, adding a quick heuristic check at the beginning of the function can avoid unnecessary processing.

Suggested change

-    for raw_line in text.splitlines():
+    if not text.strip().startswith("{"):
+        return []
+    for raw_line in text.splitlines():
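The suggested check calls `text.strip()`, which builds a stripped copy of the whole input before looking at one character. A possible variant, sketched here under the hypothetical name `looks_like_jsonl`, gives the same answer by scanning only up to the first non-whitespace character:

```python
def looks_like_jsonl(text: str) -> bool:
    """Cheap pre-check: is the first non-whitespace character a '{'?

    Same result as text.strip().startswith("{"), but it stops at the first
    non-whitespace character instead of copying the whole input.
    """
    for ch in text:
        if not ch.isspace():
            return ch == "{"
    return False  # empty or whitespace-only input
```

Like the suggested one-liner, this is only a heuristic: it rejects a JSONL file whose first line is not a JSON object, and it accepts any text that merely starts with a brace.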

gemini-code-assist · 2026-05-11T05:13:29Z

+    current_user_query: str | None = None
+    assistant_chunks: List[str] = []
+
+    for raw_line in text.splitlines():


The current implementation of _parse_claude_code_transcript iterates through every line of the input text and attempts to parse it as JSON. This can be inefficient for large non-JSON transcripts. Adding a quick heuristic check at the beginning of the function can avoid unnecessary processing for files that are clearly not in JSONL format.

Suggested change

-    for raw_line in text.splitlines():
+    if not text.strip().startswith("{"):
+        return []
+    for raw_line in text.splitlines():

Ankit-Kotnala · 2026-05-11T18:33:23Z

+    current_user_query: str | None = None
+    assistant_chunks: List[str] = []
+
+    for raw_line in text.splitlines():


Good call. Since Claude Code transcripts are JSONL, the shared parser should first reject obvious non-JSONL input before iterating through every line. This should be fixed in the shared parser rather than separately in both files.

Ankit-Kotnala · 2026-05-11T18:33:31Z



+def _content_to_text(content: Any) -> str:
+    """Extract readable text from Claude Code message content blocks."""


Agree. Since this parser is used by both the standalone server and the production memory route, please move the Claude transcript parsing into src/utils/transcripts.py and have both server.py and src/api/routes/memory.py import the shared parser from there.

Ankit-Kotnala

The feature is good, but @LoikStyle should centralize the parser and clean up the test before merge.

ishaanxgupta · 2026-05-16T12:17:31Z

@LoikStyle please have a look at the suggestions

greptile-apps · 2026-05-23T09:21:34Z

Greptile Summary

This PR adds Claude Code JSONL transcript parsing to the context import pipeline in both server.py and src/api/routes/memory.py, extracting only conversational text from user/assistant turns while filtering out tool-use blocks.

A new _content_to_text helper strips tool-use and other non-text content blocks from message payloads, and _parse_claude_code_transcript builds user/assistant pairs from the resulting JSONL stream.
The parser is registered as the last-resort fallback in _parse_transcript_text with no format-specific guard, meaning any JSONL file containing role-paired objects will be detected as Claude Code — unlike the Cursor and Antigravity detectors which check for unique textual markers first.
Two focused regression tests are included, but both files duplicate the full implementation rather than sharing it from a common module.

Confidence Score: 3/5

The new parser introduces a catch-all detection path with no format guard, which means any JSONL file with user/assistant role pairs could be silently ingested as Claude Code memories — a present defect in format detection that should be addressed before merging.

The format detection in _parse_transcript_text now relies on order alone for Claude Code: if Cursor and Antigravity checks fail, any JSONL with role-paired content is treated as Claude Code and could store incorrect memory pairs. Additionally, tool-only assistant turns between two user messages silently merge those messages into a single pair, misrepresenting the conversation structure stored in memory.

The detection guard in _parse_transcript_text in both src/api/routes/memory.py (line 581) and server.py (line 922) warrants the most attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| server.py | Adds `_content_to_text` and `_parse_claude_code_transcript` as a catch-all JSONL parser in `_parse_transcript_text`; logic is duplicated from `memory.py` |
| src/api/routes/memory.py | Primary implementation of Claude Code JSONL parsing; placed as catch-all fallback with no format guard, risking false-positive detection of arbitrary JSONL files |
| tests/test_claude_code_transcript.py | New test file using AST-based function extraction to test the parser in isolation; covers happy path and tool-only filtering, but misses false-positive and edge-case scenarios |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Upload transcript text] --> B{Contains '_Exported on' and 'from Cursor'?}
    B -- Yes --> C[_parse_cursor_transcript]
    C --> D{pairs found?}
    D -- Yes --> E[return 'cursor', pairs]
    D -- No --> F
    B -- No --> F{Contains '# Chat Conversation' and heading markers?}
    F -- Yes --> G[_parse_antigravity_transcript]
    G --> H{pairs found?}
    H -- Yes --> I[return 'antigravity', pairs]
    H -- No --> J
    F -- No --> J[_parse_claude_code_transcript - NO format guard]
    J --> K{pairs found?}
    K -- Yes --> L[return 'claude_code', pairs]
    K -- No --> M[return 'unknown', empty]

Reviews (1): Last reviewed commit: "feat: support Claude Code transcripts"
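The detection order in the flowchart can be sketched as a list of (name, guard, parser) entries, so that every format, including Claude Code, must pass a cheap guard before its parser runs. Everything here is illustrative: `detect_and_parse`, the `Detector` alias, and the idea of passing guards as callables are assumptions, not the PR's code.

```python
from typing import Callable

Pairs = list[dict[str, str]]
# (format name, cheap guard, parser); all three are supplied by the caller.
Detector = tuple[str, Callable[[str], bool], Callable[[str], Pairs]]


def detect_and_parse(text: str, detectors: list[Detector]) -> tuple[str, Pairs]:
    """Try each format in order; a parser only runs if its guard accepts."""
    for name, guard, parser in detectors:
        if not guard(text):
            continue
        pairs = parser(text)
        if pairs:
            return name, pairs
    # No guard matched, or every matching parser came back empty.
    return "unknown", []
```

In this shape the fallthrough case is explicit: input that passes no guard is reported as "unknown" rather than being handed to the last parser in the chain.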

greptile-apps · 2026-05-23T09:21:38Z

+    pairs = _parse_claude_code_transcript(text)
+    if pairs:
+        return "claude_code", pairs


Catch-all JSONL parser with no format guard

Unlike the Cursor and Antigravity parsers, _parse_claude_code_transcript has no upfront marker check, so it runs on any input that isn't matched by the earlier detectors. Any JSONL file containing objects with role: "user" / role: "assistant" (or matching type values) and text content — such as an OpenAI or Gemini API log, a generic chat export, or a database dump — will be silently misidentified as a Claude Code transcript and its contents ingested as memories. Consider adding a lightweight guard, e.g. checking that at least one line is a dict containing a "message" key wrapping another dict with a "role" field, before committing to the "claude_code" format. The same issue exists in server.py at the equivalent call site.
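One way the suggested guard might look, as a sketch only. The function name `looks_like_claude_code` is hypothetical; the check itself follows the comment above: at least one line must be a JSON object whose "message" value is a dict with a "role" key.

```python
import json


def looks_like_claude_code(text: str) -> bool:
    """Return True if some line is a JSON object wrapping {"message": {"role": ...}}."""
    for raw_line in text.splitlines():
        raw_line = raw_line.strip()
        if not raw_line:
            continue
        try:
            event = json.loads(raw_line)
        except json.JSONDecodeError:
            continue  # not JSON; keep looking
        if not isinstance(event, dict):
            continue
        message = event.get("message")
        if isinstance(message, dict) and "role" in message:
            return True
    return False
```

A guard like this would stop flat `{"role": ..., "content": ...}` logs from being labelled "claude_code". The trade-off is that the parser's current tolerance for flat events would no longer be reachable through this path, and other formats that also nest a "message" object would still pass.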

greptile-apps · 2026-05-23T09:21:39Z

+        has_nested_message = isinstance(event.get("message"), dict)
+        message = event["message"] if has_nested_message else event
+        role = message.get("role") or (event.get("type") if not has_nested_message else None)


event.type not used as role fallback when message dict is present but lacks role

event.get("type") is only tried as a fallback when has_nested_message is False. When a nested message dict exists but contains no "role" key (e.g. {"type":"user","message":{"content":"..."}} — a plausible compact variant), role resolves to None and the event is silently dropped. Using the outer type field as a fallback in both branches would be safer.
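A small sketch of the fallback described above, pulled out into a hypothetical helper named `resolve_role`. It prefers the role on the (possibly nested) message and falls back to the outer `type` field in both the nested and the flat case.

```python
from typing import Any, Optional


def resolve_role(event: dict[str, Any]) -> Optional[str]:
    """Prefer message["role"]; otherwise fall back to the outer event["type"]."""
    nested = event.get("message")
    message = nested if isinstance(nested, dict) else event
    return message.get("role") or event.get("type")
```

For example, `resolve_role({"type": "user", "message": {"content": "hi"}})` returns `"user"`, where the expression in the diff yields `None`. Outer `type` values other than "user" or "assistant" are returned as-is and would simply not match either branch of the pairing logic.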

greptile-apps · 2026-05-23T09:21:40Z

+        if role == "user":
+            if current_user_query and assistant_chunks:
+                pairs.append(MessagePair(
+                    user_query=current_user_query,
+                    agent_response="\n\n".join(assistant_chunks).strip(),
+                ))
+                current_user_query = content
+                assistant_chunks = []
+            elif current_user_query:
+                current_user_query = f"{current_user_query}\n\n{content}"
+            else:
+                current_user_query = content
+        elif role == "assistant" and current_user_query:
+            assistant_chunks.append(content)


Tool-only assistant turns silently merge the preceding and following user messages

When an assistant turn contains only tool calls, _content_to_text returns "" and the line is skipped entirely via continue. From the state machine's perspective there is no assistant response between the two user messages, so the elif current_user_query branch fires and concatenates them. For example: User "Add tests" → Assistant [tool_use only] → User "Run them" → Assistant "Done." produces the pair user_query = "Add tests\n\nRun them" instead of two separate pairs, misrepresenting the conversation structure in stored memories.
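One possible way to keep the two user messages apart is to let an assistant turn close the current user turn even when its extracted text is empty. The sketch below works on already-extracted `(role, text)` tuples; the function name `pair_turns` and the `placeholder` response for tool-only turns are assumptions, and whether such placeholder pairs should be stored at all is a project decision.

```python
def pair_turns(
    turns: list[tuple[str, str]],
    placeholder: str = "[tool use only]",  # hypothetical stand-in response
) -> list[dict[str, str]]:
    """Pair (role, text) turns; empty assistant turns still act as a boundary."""
    pairs: list[dict[str, str]] = []
    query = None
    chunks: list[str] = []
    answered = False  # True once any assistant turn followed the current query

    def flush() -> None:
        nonlocal query, chunks, answered
        if query is not None and answered:
            response = "\n\n".join(chunks).strip() or placeholder
            pairs.append({"user_query": query, "agent_response": response})
        query, chunks, answered = None, [], False

    for role, text in turns:
        if role == "user":
            if not text:
                continue
            if query is not None and answered:
                flush()
                query = text
            elif query is not None:
                # Consecutive user messages with no assistant turn still merge.
                query = f"{query}\n\n{text}"
            else:
                query = text
        elif role == "assistant" and query is not None:
            answered = True
            if text:
                chunks.append(text)
    flush()
    return pairs
```

With the example from the comment, `[("user", "Add tests"), ("assistant", ""), ("user", "Run them"), ("assistant", "Done.")]` yields two pairs rather than one merged query.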

greptile-apps · 2026-05-23T09:21:41Z

+def _content_to_text(content: Any) -> str:
+    """Extract readable text from Claude Code message content blocks."""
+    if isinstance(content, str):
+        return content.strip()
+    if isinstance(content, list):
+        chunks: list[str] = []
+        for item in content:
+            if isinstance(item, str):
+                chunks.append(item)
+            elif isinstance(item, dict) and item.get("type") == "text":
+                chunks.append(str(item.get("text", "")))
+        return "\n".join(chunk.strip() for chunk in chunks if chunk.strip()).strip()
+    return ""
+
+
+def _parse_claude_code_transcript(text: str) -> list[dict[str, str]]:
+    """Parse Claude Code JSONL transcripts into message pairs."""
+    pairs: list[dict[str, str]] = []
+    current_user_query: str | None = None
+    assistant_chunks: list[str] = []
+
+    for raw_line in text.splitlines():
+        raw_line = raw_line.strip()
+        if not raw_line:
+            continue
+
+        try:
+            event = json.loads(raw_line)
+        except json.JSONDecodeError:
+            continue
+
+        if not isinstance(event, dict):
+            continue
+
+        has_nested_message = isinstance(event.get("message"), dict)
+        message = event["message"] if has_nested_message else event
+        role = message.get("role") or (event.get("type") if not has_nested_message else None)
+        content = _content_to_text(message.get("content"))
+        if not content:
+            continue
+
+        if role == "user":
+            if current_user_query and assistant_chunks:
+                pairs.append({
+                    "user_query": current_user_query,
+                    "agent_response": "\n\n".join(assistant_chunks).strip(),
+                })
+                current_user_query = content
+                assistant_chunks = []
+            elif current_user_query:
+                current_user_query = f"{current_user_query}\n\n{content}"
+            else:
+                current_user_query = content
+        elif role == "assistant" and current_user_query:
+            assistant_chunks.append(content)
+
+    if current_user_query and assistant_chunks:
+        pairs.append({
+            "user_query": current_user_query,
+            "agent_response": "\n\n".join(assistant_chunks).strip(),
+        })
+
+    return pairs


_content_to_text and _parse_claude_code_transcript are duplicated verbatim

Both functions are copied character-for-character between server.py and src/api/routes/memory.py, continuing the existing duplication pattern for the Cursor and Antigravity parsers. Any future bug fix or format change will need to be applied in two places. Centralising the logic in a shared module (e.g. src/utils/transcript_parsers.py) and importing from both files would remove the maintenance risk.

feat: support Claude Code transcripts

dc401c6

LoikStyle requested review from ishaanxgupta and ved015 as code owners May 11, 2026 05:08

github-actions Bot added tests api labels May 11, 2026

LoikStyle mentioned this pull request May 11, 2026

add support of claude code transcript in /context page #156

Open

gemini-code-assist Bot reviewed May 11, 2026

ishaanxgupta requested review from Ankit-Kotnala May 11, 2026 08:37

Ankit-Kotnala reviewed May 11, 2026

greptile-apps Bot reviewed May 23, 2026

3 participants