19 commits
344a171
Integrate Ollama & OpenAI, fix PYTHONPATH, update pyproject and tests
AhmadHakami Sep 9, 2025
f6fbfba
Enhance README.md for clarity and readability
AhmadHakami Sep 9, 2025
659a925
Resolve empty QA pairs by fixing batch completion routing
AhmadHakami Sep 10, 2025
e4473ab
Resolve empty QA pairs by fixing batch completion routing
AhmadHakami Sep 10, 2025
95b16b2
_openai_token_param(): picks max_completion_tokens for gpt-5-*.
AhmadHakami Sep 10, 2025
5ca3aa2
remove new_dirctory folder
AhmadHakami Sep 10, 2025
c422537
feat(cli,generators): add language control and .env support; tests in…
AhmadHakami Sep 10, 2025
3ec5f42
fix: preserve non-ASCII in JSON outputs; add language flag docs/tests
AhmadHakami Sep 10, 2025
c7e091f
Updated LLMClient to adapt to OpenAI’s O-series constraints:
AhmadHakami Sep 10, 2025
ae1843f
Add CLI flag --difficulty [easy|medium|advanced]
AhmadHakami Sep 10, 2025
79cf39c
feat(cot): add visible progress and speed up Ollama; fix(cli): show l…
AhmadHakami Sep 10, 2025
aa1ea2e
Cleaned up CLI and added --page-range/--page_range to both ingest and…
AhmadHakami Sep 11, 2025
23e7d3f
QA/CoT now derive content from intelligently chunked text in the spec…
AhmadHakami Sep 11, 2025
9f27376
Enforce difficulty and specificity in QA and CoT prompts
AhmadHakami Sep 11, 2025
2a9a9e9
Fixed QAGenerator.generate_qa_pairs:
AhmadHakami Sep 11, 2025
8fa9550
Updated utils/llm_processing.parse_qa_pairs to:
AhmadHakami Sep 11, 2025
9c3aacb
QA: harden prompts + parsing (Arabic/fenced JSON) and anti-meta filters
AhmadHakami Sep 11, 2025
7b358eb
Enforce requested number of pairs in QA.
AhmadHakami Sep 11, 2025
c3f4c3d
tests/unit/test_multimodal_qa_generator.py: ensures the multimodal ge…
AhmadHakami Sep 11, 2025
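Two of the commits above (95b16b2 and 3ec5f42) describe small behaviors that are easy to illustrate. The following is a minimal sketch, not the PR's actual code: the function names `openai_token_param` and `dump_pairs` are hypothetical, the `max_tokens` fallback is an assumption (the commit message only states which parameter is chosen for `gpt-5-*`), and the JSON part simply relies on the standard library's `ensure_ascii=False` option.

```python
import json


def openai_token_param(model: str) -> str:
    """Return the token-limit parameter name for a model (hypothetical helper)."""
    # Grounded in the commit message: gpt-5-* models take max_completion_tokens.
    if model.startswith("gpt-5-"):
        return "max_completion_tokens"
    # Assumption: every other model keeps the older max_tokens parameter.
    return "max_tokens"


def dump_pairs(pairs: list[dict]) -> str:
    """Serialize QA pairs without escaping non-ASCII characters."""
    # ensure_ascii=False keeps text such as Arabic readable instead of
    # emitting \uXXXX escape sequences.
    return json.dumps(pairs, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    print(openai_token_param("gpt-5-mini"))
    print(dump_pairs([{"question": "ما هي العاصمة؟", "answer": "الرياض"}]))
```

When writing such output to disk, the file should also be opened with `encoding="utf-8"` so the unescaped characters survive on platforms whose default encoding differs.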
42 changes: 41 additions & 1 deletion .gitignore
@@ -8,4 +8,44 @@ __pycache__
data/\ndata/\n*.pdf
.venv-ci/
data/
example_output
example_output
.vscode

# Generated and output files
data/generated/
data/inference/
data/outputs/
*.log
logs/
.cache/
__pycache__/
*.cache
*.tmp
*.temp
inference_outputs/
model_outputs/
results/
outputs/

# Jupyter notebook checkpoints
.ipynb_checkpoints/

# IDE files
.vscode/
.idea/
*.swp
*.swo

# OS files
.DS_Store
Thumbs.db

{new_directory}
.env
# Common misspellings / alternate names for generated artifacts
inferenced/
outptu/
data/inferenced/
data/outptu/
inferenced
outptu
8 changes: 7 additions & 1 deletion DOCS.md
@@ -417,7 +417,8 @@ synthetic-data-kit create [OPTIONS] INPUT

| Option | Description |
|--------|-------------|
| `--type TEXT` | Content type to generate [qa\|summary\|cot] |
| `--type TEXT` | Content type to generate [qa\|summary\|cot\|cot-enhance\|multimodal-qa] |
| `-d, --difficulty TEXT` | Question difficulty [easy\|medium\|advanced] (for qa, cot, multimodal-qa) |
| `-o, --output-dir PATH` | Directory to save generated content |
| `--api-base TEXT` | VLLM API base URL |
| `-m, --model TEXT` | Model to use |
@@ -439,6 +440,11 @@ synthetic-data-kit create data/output/document.txt --type summary
# Generate Chain of Thought (CoT) reasoning examples
synthetic-data-kit create data/output/document.txt --type cot

# Control difficulty
synthetic-data-kit create data/output/document.txt --type qa --difficulty medium
synthetic-data-kit create data/output/document.txt --type cot --difficulty advanced
synthetic-data-kit create data/output/document.lance --type multimodal-qa --difficulty easy

# Use custom model
synthetic-data-kit create data/output/document.txt -m "meta-llama/Llama-3.3-8B-Instruct"
```
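The option table and examples above can be sketched as a small argument parser. This is an illustration only: it uses the standard library's `argparse` as a stand-in, not the project's real CLI code. The program name `create-sketch`, the default of `qa` for `--type`, and the choice to reject `--difficulty` for content types outside `qa`, `cot`, and `multimodal-qa` are all assumptions made for the example.

```python
import argparse

# Values taken from the option table in the diff.
CONTENT_TYPES = ("qa", "summary", "cot", "cot-enhance", "multimodal-qa")
DIFFICULTIES = ("easy", "medium", "advanced")
# Per the docs, difficulty applies to these content types.
DIFFICULTY_TYPES = ("qa", "cot", "multimodal-qa")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="create-sketch")  # hypothetical name
    parser.add_argument("input", help="Input file to generate content from")
    # Assumption: "qa" as the default content type.
    parser.add_argument("--type", dest="content_type",
                        choices=CONTENT_TYPES, default="qa")
    parser.add_argument("-d", "--difficulty", choices=DIFFICULTIES, default=None)
    return parser


def parse(argv: list[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: passing a difficulty with an unsupported type is an error.
    if args.difficulty and args.content_type not in DIFFICULTY_TYPES:
        parser.error(f"--difficulty is not supported for --type {args.content_type}")
    return args


if __name__ == "__main__":
    ns = parse(["data/output/document.txt", "--type", "cot", "--difficulty", "advanced"])
    print(ns.content_type, ns.difficulty)
```

Restricting the values through `choices` means a typo such as `--difficulty hard` fails at parse time with a usage message, before any generation starts.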