Skip to content

Additional modifications to train on plant genomes#12

Merged
gonzalobenegas merged 2 commits intomainfrom
experiment-plants
Dec 1, 2025
Merged

Additional modifications to train on plant genomes#12
gonzalobenegas merged 2 commits intomainfrom
experiment-plants

Conversation

@gonzalobenegas
Copy link
Copy Markdown
Member

No description provided.

@claude
Copy link
Copy Markdown

claude bot commented Dec 1, 2025

PR Review: Additional modifications to train on plant genomes

This PR adds support for training on plant genomes by creating sharded versions of the Angiosperm_16_genomes dataset. I've conducted a comprehensive review following CLAUDE.md standards.


Summary of Changes

  • Added 2 utility scripts in experiments/training_data/ for dataset preparation
  • Modified configs/data/plants.yaml to use local sharded dataset
  • Added new experiment config: plants_clm_transformer_base.yaml
  • Updated README with dataset download command

✅ Strengths

  1. Proper experiments directory usage - Scripts correctly placed in experiments/ with appropriate standards
  2. PEP 723 inline script metadata - Excellent use of inline dependency declarations
  3. Clear separation of concerns - Sharding and uploading properly separated
  4. Good documentation - Scripts have clear docstrings

🔍 Critical Issues Requiring Action

1. Missing Issue Reference ⚠️

Problem: No GitHub issue referenced, violating Agentic Git Flow
CLAUDE.md: "All work starts with GitHub issue"
Action: Link to issue or create one

2. Missing Tests ⚠️

Problem: New experiment config not tested
Action: Add test to test_configs.py to verify config instantiation

3. Dataset Path Validation ⚠️

Problem: Changed from HF dataset ID to local path - needs verification
Action: Test that load_dataset() works with local path and streaming

4. Missing Documentation ⚠️

Problem: No README in experiments/training_data/
Action: Add README explaining workflow (shard → upload → download)


📝 Code Quality Issues

Type Hints

  • shard_dataset.py:77 - main() missing return type hint (add -> None)
  • upload_dataset.py:24 - main() missing return type hint

Hardcoded Values

  • upload_dataset.py:37 - Default repo ID hardcoded to "gonzalobenegas/..."
  • Recommendation: Make required or document that users should customize

Error Handling

  • upload_dataset.py:73-80 - No try/except for network errors
  • shard_dataset.py - No validation that output dir is writable

⚡ Performance Observations

Good:

  • Gradient accumulation correctly calculated: 2048 / (256 * 8) = 1
  • Uses zstd compression with multithreading

Could Improve:

  • shard_dataset.py:64 - Serial shard writing could be parallelized
  • Compression threads=-1 may oversubscribe on shared machines
  • log_every_n_steps: 1000 very sparse (only 20 logs for 20k steps)

🚦 Recommendation: REQUEST CHANGES

Required Before Merge:

  1. ✅ Associate with GitHub issue
  2. ✅ Add config test to test_configs.py
  3. ✅ Verify dataset loading works with local path
  4. ✅ Add experiments/training_data/README.md

Strongly Recommended:

  1. Fix type hints (add -> None to main())
  2. Add error handling to upload script
  3. Consider log_every_n_steps: 100 instead of 1000
  4. Make HF username configurable

📋 Next Steps

  1. Create/link GitHub issue
  2. Test config: uv run python glm_experiments/train.py experiment=plants_clm_transformer_base debug=fdr
  3. Add test case to test_configs.py
  4. Run make format and make test-full

Overall: Core implementation is solid! Just needs polish to align with project standards. Happy to help with any of these items.

@gonzalobenegas gonzalobenegas merged commit 338e46d into main Dec 1, 2025
6 checks passed
@gonzalobenegas gonzalobenegas deleted the experiment-plants branch December 1, 2025 23:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant