Quick Start Guide

This guide covers setup and basic usage of the Doc + Code Extractor, a script that writes JSONL records used for building a drift-detection dataset.

Unlike the old drift miner, this script does not try to guess drift or filter commits by “docs fix” keywords. It simply extracts evidence (docstrings + code + hierarchy/call context) and leaves labeling (drift / no drift) to a downstream LLM pipeline.

Prerequisites

Python 3.8 or higher
pip (Python package manager)
GitHub account (needed only to create an API token, which is optional but strongly recommended)

Installation

Step 1: Clone the Repository

git clone https://github.com/pranavgupta0001/Coding-Doc-Agent.git
cd Coding-Doc-Agent

Step 2: Install Dependencies

pip install -r requirements.txt

Step 3: Set Up GitHub Token (Optional but Recommended)

Without a token, you're limited to 60 API requests per hour. With a token, you get 5,000 requests per hour.

Create a GitHub Personal Access Token

Go to https://github.com/settings/tokens
Click "Generate new token (classic)"
Give it a name like "Drift Miner"
Select scope: public_repo (for accessing public repositories)
Click "Generate token"
Copy the token (you won't see it again!)

Configure the Token

Option A: Environment variable

export GITHUB_TOKEN="your_token_here"

Option B: .env file

cp .env.example .env
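# Edit .env and add your token, or append it from the shell.
# Use >> (append); a single > would overwrite the file you just copied.
echo "GITHUB_TOKEN=your_token_here" >> .env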
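
To make the effect of the token concrete, here is a minimal sketch of how a client could turn GITHUB_TOKEN into request headers for the GitHub REST API. The helper name build_github_headers is hypothetical; this guide does not show how doc_code_extractor.py reads the token internally, so treat this as an illustration rather than the script's actual code. Note that a .env file is not loaded into the environment automatically; the sketch only looks at the process environment.

```python
import os
from typing import Dict, Optional


def build_github_headers(token: Optional[str] = None) -> Dict[str, str]:
    """Build headers for GitHub REST API requests (hypothetical helper)."""
    # Fall back to the GITHUB_TOKEN environment variable when no token is passed.
    token = token or os.environ.get("GITHUB_TOKEN")
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        # Authenticated requests are counted against the higher hourly limit.
        headers["Authorization"] = f"Bearer {token}"
    return headers


if __name__ == "__main__":
    has_auth = "Authorization" in build_github_headers()
    print("authenticated" if has_auth else "unauthenticated (60 requests/hour)")
```

Without a token the headers simply omit Authorization, so the same code path works for both the authenticated and unauthenticated cases.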

Basic Usage (Updated)

The extractor produces per-symbol records (functions, classes, and class methods).

The output format is JSONL: one symbol record per line.

This is designed to support hierarchical drift evaluation:

function-level evidence
method-within-class evidence (includes class_doc)
module-level evidence (includes module_doc)
lightweight intra-file call context (callees, callers)
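
The evidence levels listed above can be sketched with Python's standard ast module. This is a simplified illustration under stated assumptions, not the extractor's actual implementation: the function extract_symbols and the flat record layout are hypothetical, and only top-level functions, classes, and their direct methods are handled.

```python
import ast
from typing import Dict, List, Optional


def extract_symbols(source: str) -> List[Dict[str, Optional[str]]]:
    """Return one record per top-level function, class, and class method."""
    tree = ast.parse(source)
    module_doc = ast.get_docstring(tree)
    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    records: List[Dict[str, Optional[str]]] = []

    def make_record(node, symbol_type, parent=None):
        path = f"{parent.name}.{node.name}" if parent else node.name
        return {
            "symbol_path": path,
            "symbol_type": symbol_type,
            "doc": ast.get_docstring(node),  # None when no docstring exists
            "code": ast.get_source_segment(source, node),
            "module_doc": module_doc,
            "class_doc": ast.get_docstring(parent) if parent else None,
        }

    for node in tree.body:
        if isinstance(node, func_types):
            records.append(make_record(node, "function"))
        elif isinstance(node, ast.ClassDef):
            records.append(make_record(node, "class"))
            for child in node.body:
                if isinstance(child, func_types):
                    records.append(make_record(child, "method", parent=node))
    return records


SAMPLE = '''"""Toy module."""

class Client:
    """HTTP client."""

    def get(self, url):
        """Fetch a URL."""
        return url


def tokenize(text):
    return text.split()
'''

if __name__ == "__main__":
    for rec in extract_symbols(SAMPLE):
        print(rec["symbol_type"], rec["symbol_path"], repr(rec["doc"]))
```

Carrying module_doc and class_doc on every record is what lets a downstream labeler judge a method against its class and module documentation without reopening the source file.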

Example 1: Mine NumPy (Small Sample)

python3 doc_code_extractor.py \
  --repos numpy/numpy \
  --max-files 25 \
  --max-symbols 300 \
  --output doc_code_records.jsonl

Expected output (exact counts may vary with the repository state and the caps you set):

Mining repo: numpy/numpy
  Found 25 .py files (capped at 25)
  Wrote 300 symbol records for numpy/numpy

Done. Total records written: 300
Output: doc_code_records.jsonl

Example 2: Mine SciPy and NumPy

python3 doc_code_extractor.py \
  --repos scipy/scipy numpy/numpy \
  --max-files 40 \
  --max-symbols 600 \
  --output multi_repo_records.jsonl

Example 3: Specify a Git Ref (Branch/SHA/Tag)

python3 doc_code_extractor.py \
  --repos numpy/numpy \
  --ref 10e9faf1afbecca9316ce752c8a1dc8807137edb \
  --max-files 25 \
  --max-symbols 300 \
  --output numpy_pinned.jsonl

Example 4: With GitHub Token

python3 doc_code_extractor.py \
  --repos numpy/numpy scipy/scipy \
  --token YOUR_GITHUB_TOKEN \
  --max-files 50 \
  --max-symbols 800 \
  --output doc_code_records.jsonl

Understanding the Output

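Each line of the output file holds one symbol (function, class, or method) at one repository ref. The record below is pretty-printed across several lines for readability; in the file itself it occupies a single line.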

{
  "repository": "numpy/numpy",
  "ref": "main",
  "commit_sha": "10e9faf1afbecca9316ce752c8a1dc8807137edb",
  "file": "tools/check_python_h_first.py",

  "symbol_path": "sort_order",
  "symbol_type": "function",
  "signature": "def sort_order(path: str) -> tuple[int, str]:",

  "doc": null,
  "code": "def sort_order(path: str) -> tuple[int, str]:\n    ...",

  "context": {
    "module_doc": "Check that Python.h is included before any stdlib headers.\n\nMay be a bit overzealous, but it should get the job done.",
    "class_doc": null,
    "parent_class": null,
    "siblings": ["check_python_h_included_first", "sort_order", "process_files"],
    "callees": ["os.path.basename", "os.path.splitext"],
    "callers": []
  }
}
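
Because each record is a standalone JSON object on its own line, the file can be streamed with the standard json module. The sketch below counts records by symbol_type and tallies missing docstrings; the helper names iter_records and summarize are illustrative, and the two sample lines stand in for a real output file.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def iter_records(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one parsed record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(records: Iterable[Dict]) -> Tuple[Counter, int]:
    """Count records per symbol_type and how many have no docstring."""
    counts: Counter = Counter()
    missing_doc = 0
    for record in records:
        counts[record["symbol_type"]] += 1
        if record.get("doc") is None:
            missing_doc += 1
    return counts, missing_doc


if __name__ == "__main__":
    sample_lines = [
        '{"symbol_path": "sort_order", "symbol_type": "function", "doc": null}',
        '{"symbol_path": "Client.get", "symbol_type": "method", "doc": "Fetch a URL."}',
    ]
    counts, missing_doc = summarize(iter_records(sample_lines))
    print(dict(counts), "missing docstrings:", missing_doc)
```

For a real run, pass an open file handle instead of the sample list, for example `with open("doc_code_records.jsonl", encoding="utf-8") as fh: summarize(iter_records(fh))`.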

Key Fields

symbol_path: Unique “path” to the symbol
  - Function: "tokenize"
  - Class: "Client"
  - Method: "Client.get"
symbol_type: "function" | "class" | "method"
doc: The extracted docstring (may be null if missing)
code: The extracted source segment for the symbol
context.module_doc: Module-level docstring (top of file, if present)
context.class_doc: Class docstring (for methods)
context.callees / context.callers: Lightweight intra-file call context
  - callees: what the symbol calls
  - callers: what calls this symbol (within the same file, best-effort)
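
To illustrate what "lightweight" and "best-effort" mean for the call context, here is one way such information could be derived with the ast module. This is an assumption-laden sketch, not the extractor's algorithm: call_context and _call_name are hypothetical names, only top-level functions are considered, and callers are derived purely from direct calls inside the same file.

```python
import ast
from typing import Dict, List


def _call_name(func: ast.expr) -> str:
    """Render a call target such as os.path.basename as a dotted name."""
    parts = []
    while isinstance(func, ast.Attribute):
        parts.append(func.attr)
        func = func.value
    if isinstance(func, ast.Name):
        parts.append(func.id)
        return ".".join(reversed(parts))
    return ""  # e.g. a call on a subscript or another call result


def call_context(source: str) -> Dict[str, Dict[str, List[str]]]:
    """Map each top-level function to its callees and same-file callers."""
    tree = ast.parse(source)
    funcs = [n for n in tree.body
             if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    context = {f.name: {"callees": [], "callers": []} for f in funcs}
    for f in funcs:
        for node in ast.walk(f):
            if isinstance(node, ast.Call):
                name = _call_name(node.func)
                if name and name not in context[f.name]["callees"]:
                    context[f.name]["callees"].append(name)
    # Invert the callee lists to obtain callers within the same file.
    for name, info in context.items():
        for callee in info["callees"]:
            if callee in context and name not in context[callee]["callers"]:
                context[callee]["callers"].append(name)
    return context


SAMPLE = '''
import os

def sort_order(path):
    base = os.path.basename(path)
    return os.path.splitext(base)

def process_files(paths):
    return sorted(paths, key=sort_order)

def main(paths):
    return process_files(paths)
'''

if __name__ == "__main__":
    for name, info in call_context(SAMPLE).items():
        print(name, info)
```

In the sample, sort_order is passed as key=sort_order rather than called directly, so this sketch reports no callers for it. That kind of gap is what a best-effort, syntax-only analysis leaves behind.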

Why There Aren't Any Drift Labels

This extractor is intentionally label-free.

We use it to gather neutral evidence from open-source repositories. A downstream agentic LLM system then classifies each example as:

consistent (no drift)
inconsistent (drift)

Optionally, the dataset can be supplemented with synthetic drift examples created by mutating otherwise-consistent extracted records.
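
As a rough idea of what such a mutation could look like, the sketch below renames an identifier in the doc field only, so the documentation no longer matches the code. Everything here is hypothetical: the function make_synthetic_drift and the label and synthetic fields are not part of the extractor's output. It also ignores the fact that the code field may embed the original docstring, which a real pipeline would need to handle.

```python
import copy
import re
from typing import Dict, Optional


def make_synthetic_drift(record: Dict, old_name: str,
                         new_name: str) -> Optional[Dict]:
    """Return a copy of record whose doc refers to new_name instead of old_name.

    Returns None when the docstring is missing or never mentions old_name,
    since the mutation would then be invisible.
    """
    doc = record.get("doc")
    pattern = re.compile(rf"\b{re.escape(old_name)}\b")
    if not doc or not pattern.search(doc):
        return None
    mutated = copy.deepcopy(record)  # leave the original record untouched
    mutated["doc"] = pattern.sub(new_name, doc)
    mutated["label"] = "inconsistent"  # hypothetical field
    mutated["synthetic"] = True        # hypothetical field
    return mutated


if __name__ == "__main__":
    original = {
        "symbol_path": "Client.get",
        "doc": "Fetch url and return the body.",
        "code": "def get(self, url): ...",
    }
    print(make_synthetic_drift(original, "url", "endpoint"))
```

Marking mutated records explicitly keeps synthetic examples separable from naturally occurring ones during evaluation.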
