@nick-youngblut
Contributor

This pull request introduces several enhancements and new features to the embedding and transcriptomic workflows, focusing on improving usability, adding functionality, and refining the CLI interface. The key changes include adding a new vectordb command for summarizing vector database statistics, improving CLI descriptions and formatting, and updating database operations for better metadata handling.

Enhancements to embedding workflows:

  • Upserting: Added LanceDB upserting keyed on the cell barcode and dataset, so existing cell records are updated with new embeddings.
  • New vectordb command: Added a vectordb subcommand to summarize LanceDB vector database statistics, including dataset counts, embedding keys, and dimensions. Supports output in table, JSON, and YAML formats (src/state/_cli/_emb/_vectordb.py, src/state/__main__.py, README.md) [1] [2] [3].
  • Improved query results formatting: Updated the query results to include new columns (e.g., query_cell_id, subject_rank) and renamed columns for clarity. Added parallel processing support with a --max-workers option (src/state/_cli/_emb/_query.py) [1] [2] [3].
  • Enhanced embedding metadata: Added a --dataset-name option for specifying dataset names during embedding transformations. Removed the --lancedb-update option and improved logging for database operations (src/state/_cli/_emb/_transform.py, src/state/emb/inference.py) [1] [2] [3].
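
As a rough illustration of the upsert and summary behavior described above, here is a minimal pure-Python sketch. It does not use LanceDB or the PR's `StateVectorDB` class; the `upsert` and `summarize` helpers, the record fields, and the `X_state` embedding key are all hypothetical names chosen for the example. LanceDB itself offers a `merge_insert` builder for this kind of operation, but whether the PR relies on it is not stated here.

```python
import json
from collections import Counter


def upsert(store, records):
    """Insert records, replacing any row that shares (dataset, cell_id)."""
    for rec in records:
        # The composite key mirrors "cell barcode and dataset" from the PR text.
        store[(rec["dataset"], rec["cell_id"])] = rec
    return store


def summarize(store):
    """Compute dataset counts, embedding keys, and embedding dimensions."""
    rows = list(store.values())
    return {
        "total_cells": len(rows),
        "cells_per_dataset": dict(Counter(r["dataset"] for r in rows)),
        "embedding_keys": sorted({r["embed_key"] for r in rows}),
        "embedding_dims": sorted({len(r["vector"]) for r in rows}),
    }


store = {}
# First load: two cells from one dataset (all values are made up).
upsert(store, [
    {"dataset": "ds1", "cell_id": "AAAC", "embed_key": "X_state", "vector": [0.1, 0.2]},
    {"dataset": "ds1", "cell_id": "AAAG", "embed_key": "X_state", "vector": [0.3, 0.4]},
])
# Second load: one existing cell is updated, one cell from a new dataset is added.
upsert(store, [
    {"dataset": "ds1", "cell_id": "AAAC", "embed_key": "X_state", "vector": [0.9, 0.8]},
    {"dataset": "ds2", "cell_id": "AAAC", "embed_key": "X_state", "vector": [0.5, 0.6]},
])

summary = summarize(store)
print(json.dumps(summary, indent=2))  # JSON is one of the output formats the PR lists
```

The same barcode in two datasets yields two rows, while re-embedding a cell within one dataset overwrites the earlier row.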

Improvements to CLI interface:

  • Custom CLI formatter: Introduced a CustomFormatter class to combine default and raw formatting styles for better CLI help messages (src/state/_cli/_utils.py, src/state/__main__.py) [1] [2].
  • Detailed subcommand descriptions: Added descriptive help messages for embedding (emb) and transcriptomic (tx) subcommands, improving usability (src/state/_cli/_emb/__init__.py, src/state/_cli/_tx/__init__.py) [1] [2].
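
For readers unfamiliar with the pattern, a formatter that combines "default" and "raw" styles can be sketched with standard `argparse` classes. This is an assumption about the approach, not a copy of the PR's `CustomFormatter`; in particular, the choice of `RawDescriptionHelpFormatter` (rather than `RawTextHelpFormatter`) and the `demo` parser are illustrative.

```python
import argparse


class CustomFormatter(argparse.ArgumentDefaultsHelpFormatter,
                      argparse.RawDescriptionHelpFormatter):
    """Append defaults to argument help and keep description line breaks."""


parser = argparse.ArgumentParser(
    prog="demo",  # hypothetical program name
    description="Line one.\n  Indented line two.",
    formatter_class=CustomFormatter,
)
parser.add_argument("--max-workers", type=int, default=4,
                    help="Number of worker threads")

help_text = parser.format_help()
print(help_text)
```

Multiple inheritance works here because each base class overrides a different hook: one controls how the description is wrapped, the other how per-argument help strings are rendered.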

Documentation updates:

  • README enhancements: Updated the README to reflect the new vectordb command and clarified examples for querying and transforming embeddings (README.md) [1] [2].

- Updated README to clarify that existing cell records will be updated with new embeddings and provided example dataset details.
- Improved CLI argument parsing by adding custom descriptions for embedding and transcriptomic commands.
- Introduced `CustomFormatter` for better help message formatting.
- Added `max-workers` argument for parallel processing in embedding queries.
- Refined result formatting in `run_emb_query` to include additional metadata and renamed columns for clarity.
- Updated `StateVectorDB` to support merging and updating entries in LanceDB.
- Added custom descriptions for embedding commands in `add_arguments_emb`.
- Improved help messages for query filtering in `add_arguments_query`.
- Enhanced transcriptomic command descriptions in `add_arguments_tx` for clarity.
- Introduced `CustomFormatter` for better formatting of help messages across CLI commands.
- Revised the description for the `transform` command to specify that results can be inserted into a LanceDB vector store.
- Enhanced the `query` command description to indicate it searches for cells with similar embeddings in a LanceDB vector store created with the `transform` command.
- Introduced `run_emb_vectordb` command to retrieve and display summary statistics of the LanceDB vector database.
- Updated `README.md` with usage instructions and output format options for the new command.
- Enhanced `StateVectorDB` class with `get_database_summary` method to compute and return comprehensive database statistics.
- Added argument parsing for the new command in the CLI module.
- Updated argument names in `add_arguments_infer` and `add_arguments_preprocess_infer` to use kebab-case (e.g., `--embed-key`, `--pert-col`, `--model-dir`, `--celltype-col`, `--batch-size`).
- Modified help message for the `--seed` argument in `add_arguments_preprocess_infer` to remove default value mention.
- Removed commented-out help handler in `add_arguments_train` for clarity.
- Updated argument names in the CLI to use kebab-case for consistency (e.g., `--output-dir`, `--num-hvgs`, `--control-condition`, `--pert-col`).
- Added `--log-level` argument to allow dynamic logging level configuration across commands.
- Modified logging setup in various modules to utilize the new `log_level` argument for consistent logging behavior.
- Updated `README.md` with corresponding changes to command usage examples.
- Reformatted command examples in the README to use multi-line syntax for better readability.
- Ensured consistency in the presentation of CLI commands for `tx predict` and `tx infer` sections.
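
The kebab-case and `--log-level` changes in the list above can be illustrated with a short `argparse` sketch. The parser, the default values, the logger name, and the `X_state` value are assumptions made for the example; only the flag names `--embed-key`, `--batch-size`, and `--log-level` come from the PR description.

```python
import argparse
import logging

parser = argparse.ArgumentParser(prog="demo")  # hypothetical program name
parser.add_argument("--embed-key", default=None, help="Embedding key to use")
parser.add_argument("--batch-size", type=int, default=32, help="Batch size")
parser.add_argument(
    "--log-level",
    default="INFO",
    choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
    help="Logging verbosity",
)

args = parser.parse_args(
    ["--embed-key", "X_state", "--batch-size", "64", "--log-level", "DEBUG"]
)

# argparse converts kebab-case flags to snake_case attribute names,
# so code that reads args.embed_key keeps working after the rename.
print(args.embed_key, args.batch_size)

# Map the parsed level name onto the logging module's numeric constant.
logger = logging.getLogger("demo")
logger.setLevel(getattr(logging, args.log_level))
```

Because the attribute names are unchanged, the switch to kebab-case mainly affects command lines and README examples rather than downstream code.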