Additional modifications to train on plant genomes by gonzalobenegas · Pull Request #12 · Open-Athena/glm-experiments

gonzalobenegas · 2025-12-01T23:27:31Z

No description provided.

claude · 2025-12-01T23:31:59Z

PR Review: Additional modifications to train on plant genomes

This PR adds support for training on plant genomes by creating sharded versions of the Angiosperm_16_genomes dataset. I've conducted a comprehensive review following CLAUDE.md standards.

Summary of Changes

Added 2 utility scripts in experiments/training_data/ for dataset preparation
Modified configs/data/plants.yaml to use local sharded dataset
Added new experiment config: plants_clm_transformer_base.yaml
Updated README with dataset download command

✅ Strengths

Proper experiments directory usage - Scripts correctly placed in experiments/ with appropriate standards
PEP 723 inline script metadata - Excellent use of inline dependency declarations
Clear separation of concerns - Sharding and uploading properly separated
Good documentation - Scripts have clear docstrings

🔍 Critical Issues Requiring Action

1. Missing Issue Reference ⚠️

Problem: No GitHub issue referenced, violating Agentic Git Flow
CLAUDE.md: "All work starts with GitHub issue"
Action: Link to issue or create one

2. Missing Tests ⚠️

Problem: New experiment config not tested
Action: Add test to test_configs.py to verify config instantiation

3. Dataset Path Validation ⚠️

Problem: Changed from HF dataset ID to local path - needs verification
Action: Test that load_dataset() works with local path and streaming

4. Missing Documentation ⚠️

Problem: No README in experiments/training_data/
Action: Add README explaining workflow (shard → upload → download)

📝 Code Quality Issues

Type Hints

shard_dataset.py:77 - main() missing return type hint (add -> None)
upload_dataset.py:24 - main() missing return type hint (add -> None)

Hardcoded Values

upload_dataset.py:37 - Default repo ID hardcoded to "gonzalobenegas/..."
Recommendation: Make required or document that users should customize
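One way to act on this recommendation is to drop the default and make the repo ID a required argument. The following is a minimal sketch, assuming the script parses its options with argparse; the `--repo-id` flag name and the `parse_args` helper are hypothetical and not taken from upload_dataset.py.

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse CLI arguments. --repo-id has no default, so callers must supply it."""
    parser = argparse.ArgumentParser(description="Upload a sharded dataset.")
    parser.add_argument(
        "--repo-id",  # hypothetical flag name, chosen for illustration
        required=True,
        help="Target dataset repo, e.g. <username>/<dataset-name>",
    )
    return parser.parse_args(argv)


def main(argv: Optional[Sequence[str]] = None) -> None:
    """Entry point with an explicit -> None return type hint."""
    args = parse_args(argv)
    print(f"Would upload to {args.repo_id}")
```

With this shape, running the script without `--repo-id` exits with a usage error instead of silently uploading to someone else's namespace.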

Error Handling

upload_dataset.py:73-80 - No try/except for network errors
shard_dataset.py - No validation that output dir is writable
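Both points could be handled with small helpers along these lines. This is a sketch under stated assumptions: `ensure_writable_dir` and `with_retries` are hypothetical names, and retrying on `OSError` is an assumption, since the exception type raised by the actual upload call is not shown in the PR.

```python
import os
import time
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def ensure_writable_dir(path: Path) -> Path:
    """Create the directory if needed and fail early if it is not writable."""
    path.mkdir(parents=True, exist_ok=True)
    if not os.access(path, os.W_OK):
        raise PermissionError(f"Output directory is not writable: {path}")
    return path


def with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 1.0) -> T:
    """Call fn, retrying on OSError with exponential backoff between attempts."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except OSError:  # assumption: network failures surface as OSError
            if attempt == attempts:
                raise  # out of retries, propagate the last error
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The upload call would then be wrapped as `with_retries(lambda: upload(...))`, where `upload` stands in for whatever function the script actually uses.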

⚡ Performance Observations

Good:

Gradient accumulation correctly calculated: 2048 / (256 * 8) = 1
Uses zstd compression with multithreading
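The gradient accumulation arithmetic above can be checked with a short helper. This is only a sketch: reading 2048 as the target global batch size, 256 as the per-device batch size, and 8 as the number of devices is an assumption about the config, and `grad_accumulation_steps` is a hypothetical name.

```python
def grad_accumulation_steps(global_batch: int, per_device_batch: int, num_devices: int) -> int:
    """Return how many micro-batches to accumulate to reach the global batch size."""
    samples_per_step = per_device_batch * num_devices
    if global_batch % samples_per_step != 0:
        raise ValueError(
            f"global batch {global_batch} is not a multiple of {samples_per_step}"
        )
    return global_batch // samples_per_step


# The values quoted in the review: 2048 / (256 * 8)
print(grad_accumulation_steps(2048, 256, 8))  # → 1
```

Raising on a non-divisible combination makes a mismatched batch configuration fail loudly rather than silently training with a different effective batch size.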

Could Improve:

shard_dataset.py:64 - Serial shard writing could be parallelized
Compression threads=-1 may oversubscribe on shared machines
log_every_n_steps: 1000 is very sparse (only 20 log entries over 20k steps)
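To make the last two observations concrete, here is a rough sketch of the logging arithmetic and of one way to cap compression threads. The helper names are hypothetical, and treating a negative thread count as "use all cores" is an assumption about how the compression library interprets threads=-1.

```python
import os


def num_log_events(total_steps: int, log_every_n_steps: int) -> int:
    """Number of log entries produced over a run of total_steps."""
    return total_steps // log_every_n_steps


def capped_threads(requested: int, limit: int) -> int:
    """Clamp a requested thread count to a limit and to the cores available."""
    available = os.cpu_count() or 1
    if requested < 0:  # assumption: negative means "all cores"
        requested = available
    return max(1, min(requested, limit, available))


print(num_log_events(20_000, 1000))  # → 20
print(num_log_events(20_000, 100))   # → 200
```

On a shared machine, passing `capped_threads(-1, limit=4)` instead of -1 would keep the compressor to at most four threads.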

🚦 Recommendation: REQUEST CHANGES

Required Before Merge:

✅ Associate with GitHub issue
✅ Add config test to test_configs.py
✅ Verify dataset loading works with local path
✅ Add experiments/training_data/README.md

Strongly Recommended:

Fix type hints (add -> None to main())
Add error handling to upload script
Consider log_every_n_steps: 100 instead of 1000
Make HF username configurable

📋 Next Steps

Create/link GitHub issue
Test config: uv run python glm_experiments/train.py experiment=plants_clm_transformer_base debug=fdr
Add test case to test_configs.py
Run make format and make test-full

Overall: Core implementation is solid! Just needs polish to align with project standards. Happy to help with any of these items.

gonzalobenegas added 2 commits December 1, 2025 23:12

Prepare for plant experiments

a97a01b

Update readme

e90b6f4

gonzalobenegas merged commit 338e46d into main Dec 1, 2025
6 checks passed

gonzalobenegas deleted the experiment-plants branch December 1, 2025 23:32
