Last Dev Release into Main #124
base: main
Conversation
…nly still failing.
Duplication bug in task caching
Removed the old increment calls to assign event IDs in favor of hashes of subject IDs and timestamps which can be run lazily.
Enable flash attention
Fixes a metrics thing and enables tuning subset sizes as well for speed.
Specify test set exclusions
…_flat_reps Add computation over future summary windows to flat reps
…resource intensive manner.
Fixes the slowdowns and bugs caused by the prior improved compute practices, but requires a nested tensor package.
…removed references to np.NaN as that is removed in np 2.0
Some small changes to update things for more recent versions of packages.
Walkthrough
This pull request introduces numerous refactorings and removals across the codebase. Most changes replace standard print statements with structured logging via Loguru, update method signatures, and modify parameters. Several configuration classes have been enhanced with new attributes and validation logic, while deprecated preprocessing and evaluation modules are removed. Updates also include changes to dataset handling (Polars and PyTorch), attention mechanisms in Transformer modules, and documentation revisions. Additionally, new scripts are added and tests are updated to reflect these modifications.
Changes
Sequence Diagram(s)
sequenceDiagram
participant User
participant Hydra
participant LoggerInit
participant ScriptMain
User->>Hydra: Launch script with Hydra config
Hydra->>LoggerInit: Call hydra_loguru_init()
LoggerInit->>Logger: Initialize Loguru logger (set up log file, etc.)
Hydra->>ScriptMain: Provide configuration
ScriptMain->>Logger: Log messages via logger.info/error/warning during execution
Possibly related PRs
📜 Recent review details
Configuration used: CodeRabbit UI
⛔ Files ignored due to path filters (1)
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (1)
🪧 Tips
Chat
There are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
Actionable comments posted: 5
🧹 Nitpick comments (15)
sample_data/generate_synthetic_data.py (4)
118-119: Code style: multiple statements on one line. Multiple statements appear on single lines after colons, which violates the PEP 8 style guide.
Apply this formatting change:
- if running_HR < 30: running_HR = 30
- if running_HR > 300: running_HR = 300
+ if running_HR < 30:
+     running_HR = 30
+ if running_HR > 300:
+     running_HR = 300
- if running_temp < 95: running_temp = 95
- if running_temp > 104: running_temp = 104
+ if running_temp < 95:
+     running_temp = 95
+ if running_temp > 104:
+     running_temp = 104
Also applies to: 122-123
🧰 Tools
🪛 Ruff (0.8.2)
118-118: Multiple statements on one line (colon)
(E701)
119-119: Multiple statements on one line (colon)
(E701)
190-190: Remove unused variables. The variable months is defined but never used in multiple places. Consider removing these unused variable assignments or using them if they were intended to be part of the logic:
- months = 30 * days
Also applies to: 250-250
🧰 Tools
🪛 Ruff (0.8.2)
190-190: Local variable months is assigned to but never used
Remove assignment to unused variable months
(F841)
193-193: Remove unused variable 'lab_ps'. The variable lab_ps is assigned a value from np.random.dirichlet() but never used. Either remove the assignment or use the variable as intended:
- lab_ps = np.random.dirichlet(alpha=[0.1 for _ in range(5)])
🧰 Tools
🪛 Ruff (0.8.2)
193-193: Local variable lab_ps is assigned to but never used
Remove assignment to unused variable lab_ps
(F841)
212-212: Simplify dictionary key check. Using .keys() in a membership test is redundant and less efficient. Simplify the code by removing the .keys() call:
- for lab in base_lab_gaps.keys():
+ for lab in base_lab_gaps:
🧰 Tools
🪛 Ruff (0.8.2)
212-212: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
EventStream/tasks/profile.py (1)
7-7: Update to modern Polars API usage. The code now uses pl.enable_string_cache() without an explicit boolean parameter, aligning with the current Polars API, which defaults this parameter to True.
EventStream/data/dataset_polars.py (1)
2013-2013: Use direct membership check instead of .keys(). Prefer if key in some_dict: over if key in some_dict.keys():.
- if mod_col in source_df[mod_col].dtype.keys():
+ if mod_col in source_df[mod_col].dtype:
🧰 Tools
🪛 Ruff (0.8.2)
2013-2013: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
EventStream/transformer/transformer.py (3)
240-240: Potentially redundant cast.
sdpa_result is re-cast to value.dtype even though query and key were already cast. If there is no type change from intermediate operations, you can safely omit this step.
242-242: Duplicate return statement.
You return (sdpa_result, None) at both line 238 and line 242. Consider consolidating the code paths if they are identical, to avoid confusion.
613-613: Initialize sinusoidal terms with normal distribution.
Using random initialization for sin_div_term and cos_div_term breaks from typical fixed-frequency sinusoidal encodings. If you're intentionally learning these frequencies, clarify the rationale in docstrings to assist maintainers in understanding the design choice.
Also applies to: 616-620
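For reference, the conventional fixed-frequency alternative alluded to here could be sketched roughly as below; the function name, odd-dimension handling, and max_timescale default are illustrative assumptions, not the repository's implementation.

```python
import math

import torch


def fixed_sinusoidal_div_terms(embedding_dim: int, max_timescale: float = 10000.0):
    """Classic fixed frequencies 1 / max_timescale**(2i / d), not learned parameters."""
    exponent = torch.arange(0, embedding_dim, 2, dtype=torch.float32) / embedding_dim
    div_term = torch.exp(-math.log(max_timescale) * exponent)
    sin_div_term = div_term                        # ceil(d / 2) frequencies for sin
    cos_div_term = div_term[: embedding_dim // 2]  # one fewer when d is odd
    return sin_div_term, cos_div_term


sin_dt, cos_dt = fixed_sinusoidal_div_terms(7)
print(sin_dt.shape, cos_dt.shape)  # torch.Size([4]) torch.Size([3])
```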
EventStream/data/dataset_base.py (1)
1484-1492: Construct a unified vocabulary list with UNK at index 0.
The logic is clear. However, the variable m in for m, idxmap in self.unified_vocabulary_idxmap.items(): is never used inside the loop. Renaming it to _ would better reflect that it's unused.
🧰 Tools
🪛 Ruff (0.8.2)
1488-1488: Loop control variable m not used within loop body
Rename unused m to _m
(B007)
EventStream/data/pytorch_dataset.py (3)
132-155: Usage of just_cache argument. The newly introduced just_cache: bool = False parameter is not referenced within __init__. If this is intentional (e.g., future usage or external calls to skip steps), consider adding clarifying docstrings or removing the parameter if it's not needed.
def __init__(self, config: PytorchDatasetConfig, split: str, just_cache: bool = False):
    ...
+   # If you plan to use just_cache in the future, documenting it here
+   # might prevent confusion among collaborators.
209-209: Avoid shadowing the n_events loop variable. The identifier n_events inside the for loop parameter overshadows the outer n_events definition, which can confuse maintainers or static analysis tools.
for i, (subj, subj_n_events) in enumerate(zip(subject_ids, n_events)):
    if subj in self.subj_indices or subj in self.subj_seq_bounds:
        raise ValueError(f"Duplicate subject {subj} in {shard}!")
    ...
🧰 Tools
🪛 Ruff (0.8.2)
209-209: Loop control variable n_events overrides iterable it iterates
(B020)
364-395: Data anomalies handling. Automatically removing malformed subjects meshes well with real-world data patterns where negative or zero inter-event times occur. Document any assumptions about potential side effects (like how it impacts training set distribution).
EventStream/data/config.py (2)
979-983: Loading measurement configs. Filtering out dropped measurements is a logical design choice. No issues spotted, but consider logging the count of loaded vs. dropped measurements for auditing.
1011-1046: Data parameter hashing & caching. The _data_parameters_and_hash property is a pragmatic approach to uniquely identify dataset settings. Consider clarifying that changes in any of these fields generate a new directory.
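As a rough illustration of the pattern such a property typically follows (the field names and hash length below are hypothetical, not the repository's actual parameters):

```python
import hashlib
import json


def data_parameters_hash(params: dict) -> str:
    """Hash a parameter dict deterministically so any field change yields a new cache key."""
    # Sorting keys makes the serialization, and therefore the hash, order-independent.
    blob = json.dumps(params, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16]


params = {"max_seq_len": 256, "train_subset_size": 0.1, "seq_padding_side": "right"}
print(f"cache/{data_parameters_hash(params)}")  # one directory per distinct parameter set
```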
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (7)
- .github/workflows/tests.yml is excluded by !**/*.yml
- configs/dataset_base.yaml is excluded by !**/*.yaml
- configs/normalizer_config/standard_scaler.yaml is excluded by !**/*.yaml
- configs/outlier_detector_config/stddev_cutoff.yaml is excluded by !**/*.yaml
- pyproject.toml is excluded by !**/*.toml
- sample_data/dataset.yaml is excluded by !**/*.yaml
- sample_data/pretrain_NA.yaml is excluded by !**/*.yaml
📒 Files selected for processing (52)
- EventStream/baseline/FT_task_baseline.py (6 hunks)
- EventStream/data/README.md (2 hunks)
- EventStream/data/config.py (14 hunks)
- EventStream/data/dataset_base.py (21 hunks)
- EventStream/data/dataset_polars.py (44 hunks)
- EventStream/data/preprocessing/README.md (0 hunks)
- EventStream/data/preprocessing/__init__.py (0 hunks)
- EventStream/data/preprocessing/preprocessor.py (0 hunks)
- EventStream/data/preprocessing/standard_scaler.py (0 hunks)
- EventStream/data/preprocessing/stddev_cutoff.py (0 hunks)
- EventStream/data/pytorch_dataset.py (9 hunks)
- EventStream/data/time_dependent_functor.py (3 hunks)
- EventStream/data/types.py (1 hunks)
- EventStream/data/visualize.py (7 hunks)
- EventStream/evaluation/MCF_evaluation.py (0 hunks)
- EventStream/evaluation/general_generative_evaluation.py (0 hunks)
- EventStream/logger.py (1 hunks)
- EventStream/tasks/profile.py (4 hunks)
- EventStream/transformer/config.py (5 hunks)
- EventStream/transformer/lightning_modules/embedding.py (2 hunks)
- EventStream/transformer/lightning_modules/fine_tuning.py (7 hunks)
- EventStream/transformer/lightning_modules/generative_modeling.py (6 hunks)
- EventStream/transformer/lightning_modules/zero_shot_evaluator.py (3 hunks)
- EventStream/transformer/model_output.py (7 hunks)
- EventStream/transformer/transformer.py (4 hunks)
- EventStream/utils.py (2 hunks)
- README.md (1 hunks)
- configs/README.md (0 hunks)
- docs/MIMIC_IV_tutorial/data_extraction_processing.md (2 hunks)
- sample_data/build_sample_task_DF.py (1 hunks)
- sample_data/generate_synthetic_data.py (9 hunks)
- scripts/build_dataset.py (6 hunks)
- scripts/convert_to_ESDS.py (1 hunks)
- scripts/finetune.py (1 hunks)
- scripts/generate_trajectories.py (0 hunks)
- scripts/get_embeddings.py (2 hunks)
- scripts/launch_finetuning_wandb_hp_sweep.py (2 hunks)
- scripts/launch_from_scratch_supervised_wandb_hp_sweep.py (2 hunks)
- scripts/launch_pretraining_wandb_hp_sweep.py (2 hunks)
- scripts/launch_sklearn_baseline_supervised_wandb_hp_sweep.py (2 hunks)
- scripts/prepare_pretrain_subsets.py (3 hunks)
- scripts/pretrain.py (2 hunks)
- scripts/sklearn_baseline.py (1 hunks)
- scripts/zeroshot.py (2 hunks)
- setup.py (1 hunks)
- tests/data/preprocessing/test_standard_scaler.py (0 hunks)
- tests/data/preprocessing/test_stddev_cutoff.py (0 hunks)
- tests/data/test_config.py (8 hunks)
- tests/data/test_dataset_base.py (2 hunks)
- tests/data/test_dataset_polars.py (0 hunks)
- tests/data/test_pytorch_dataset.py (9 hunks)
- tests/test_e2e_runs.py (5 hunks)
💤 Files with no reviewable changes (12)
- EventStream/data/preprocessing/README.md
- configs/README.md
- EventStream/data/preprocessing/init.py
- EventStream/evaluation/MCF_evaluation.py
- EventStream/evaluation/general_generative_evaluation.py
- EventStream/data/preprocessing/preprocessor.py
- scripts/generate_trajectories.py
- EventStream/data/preprocessing/stddev_cutoff.py
- tests/data/preprocessing/test_standard_scaler.py
- tests/data/preprocessing/test_stddev_cutoff.py
- tests/data/test_dataset_polars.py
- EventStream/data/preprocessing/standard_scaler.py
🧰 Additional context used
🧬 Code Definitions (15)
scripts/finetune.py (3)
EventStream/transformer/lightning_modules/fine_tuning.py (2) (2)
FinetuneConfig(273-442)train(446-586)scripts/get_embeddings.py (1) (1)
main(24-28)scripts/zeroshot.py (1) (1)
main(25-29)
EventStream/transformer/lightning_modules/fine_tuning.py (3)
EventStream/transformer/config.py (1) (1)
StructuredTransformerConfig(353-935)EventStream/utils.py (1) (1)
from_json_file(329-365)EventStream/data/dataset_base.py (1) (1)
load(339-377)
scripts/sklearn_baseline.py (8)
scripts/launch_sklearn_baseline_supervised_wandb_hp_sweep.py (1) (1)
main(81-111)scripts/get_embeddings.py (1) (1)
main(24-28)scripts/finetune.py (1) (1)
main(25-39)scripts/launch_finetuning_wandb_hp_sweep.py (1) (1)
main(77-107)scripts/launch_from_scratch_supervised_wandb_hp_sweep.py (1) (1)
main(81-111)scripts/launch_pretraining_wandb_hp_sweep.py (1) (1)
main(79-109)scripts/pretrain.py (1) (1)
main(29-43)scripts/zeroshot.py (1) (1)
main(25-29)
scripts/prepare_pretrain_subsets.py (10)
EventStream/data/config.py (2) (2)
SeqPaddingSide(748-761)SubsequenceSamplingStrategy(764-784)scripts/finetune.py (1) (1)
main(25-39)scripts/get_embeddings.py (1) (1)
main(24-28)scripts/pretrain.py (1) (1)
main(29-43)scripts/zeroshot.py (1) (1)
main(25-29)scripts/launch_finetuning_wandb_hp_sweep.py (1) (1)
main(77-107)scripts/launch_from_scratch_supervised_wandb_hp_sweep.py (1) (1)
main(81-111)scripts/launch_pretraining_wandb_hp_sweep.py (1) (1)
main(79-109)scripts/launch_sklearn_baseline_supervised_wandb_hp_sweep.py (1) (1)
main(81-111)scripts/sklearn_baseline.py (1) (1)
main(19-24)
scripts/build_dataset.py (1)
EventStream/logger.py (1) (1)
hydra_loguru_init(7-10)
EventStream/data/time_dependent_functor.py (1)
EventStream/data/config.py (2) (2)
measurement_metadata(1429-1455)measurement_metadata(1458-1469)
EventStream/transformer/lightning_modules/generative_modeling.py (1)
EventStream/transformer/config.py (3) (3)
MetricsConfig(106-199)Split(26-42)MetricCategories(45-61)
tests/test_e2e_runs.py (1)
tests/data/test_pytorch_dataset.py (1) (1)
tearDown(360-361)
EventStream/transformer/model_output.py (1)
EventStream/transformer/utils.py (1) (1)
str_summary(11-30)
EventStream/transformer/transformer.py (1)
EventStream/data/types.py (1) (1)
batch_size(178-183)
EventStream/data/visualize.py (1)
EventStream/data/dataset_base.py (2) (2)
events_df(302-313)events_df(316-317)
EventStream/data/dataset_polars.py (3)
EventStream/data/config.py (3) (3)
drop(1391-1415)measurement_configs(980-983)vocabulary_config(972-973)EventStream/data/types.py (5) (5)
keys(310-312)items(306-308)DataModality(832-868)NumericDataModalitySubtype(871-897)TemporalityType(808-829)EventStream/data/vocabulary.py (2) (2)
filter(186-274)idxmap(78-90)
tests/data/test_pytorch_dataset.py (4)
EventStream/data/config.py (2) (2)
PytorchDatasetConfig(788-1062)VocabularyConfig(699-745)tests/data/test_dataset_base.py (1) (1)
setUp(179-186)tests/data/test_types.py (1) (1)
setUp(65-67)EventStream/data/pytorch_dataset.py (2) (2)
PytorchDataset(61-609)max_seq_len(417-418)
EventStream/data/pytorch_dataset.py (2)
EventStream/utils.py (2) (2)
count_or_proportion(26-88)to_dict(252-287)EventStream/data/config.py (6) (6)
PytorchDatasetConfig(788-1062)SeqPaddingSide(748-761)SubsequenceSamplingStrategy(764-784)to_dict(955-959)to_dict(1538-1554)to_dict(1934-1944)
EventStream/data/config.py (1)
EventStream/data/dataset_base.py (2) (2)
vocabulary_config(1426-1448)measurement_configs(975-984)
🪛 Ruff (0.8.2)
sample_data/generate_synthetic_data.py
118-118: Multiple statements on one line (colon)
(E701)
119-119: Multiple statements on one line (colon)
(E701)
122-122: Multiple statements on one line (colon)
(E701)
123-123: Multiple statements on one line (colon)
(E701)
190-190: Local variable months is assigned to but never used
Remove assignment to unused variable months
(F841)
193-193: Local variable lab_ps is assigned to but never used
Remove assignment to unused variable lab_ps
(F841)
212-212: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
250-250: Local variable months is assigned to but never used
Remove assignment to unused variable months
(F841)
EventStream/data/dataset_polars.py
1648-1648: Function definition does not bind loop variable m
(B023)
1648-1648: Function definition does not bind loop variable key_prefix
(B023)
1652-1652: Function definition does not bind loop variable m
(B023)
1652-1652: Function definition does not bind loop variable val_prefix
(B023)
1654-1654: Function definition does not bind loop variable m
(B023)
1654-1654: Function definition does not bind loop variable val_prefix
(B023)
1657-1657: Function definition does not bind loop variable m
(B023)
1657-1657: Function definition does not bind loop variable val_prefix
(B023)
1658-1658: Function definition does not bind loop variable m
(B023)
1658-1658: Function definition does not bind loop variable val_prefix
(B023)
1659-1659: Function definition does not bind loop variable m
(B023)
1659-1659: Function definition does not bind loop variable val_prefix
(B023)
1666-1666: Function definition does not bind loop variable m
(B023)
2013-2013: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
EventStream/data/pytorch_dataset.py
209-209: Loop control variable n_events overrides iterable it iterates
(B020)
EventStream/data/dataset_base.py
1488-1488: Loop control variable m not used within loop body
Rename unused m to _m
(B007)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: run_tests_ubuntu
🔇 Additional comments (169)
README.md (1)
34-36: Good addition of compatibility information. This note clearly explains the version requirement for Polars and why it matters, helping users troubleshoot common issues when using outdated environments. The specific mention of the function naming change (groupby vs group_by) directly ties to code changes in this PR.
sample_data/build_sample_task_DF.py (1)
29-30: Function name updated to match Polars >= 0.19. The change from groupby to group_by aligns with the newer Polars API, as noted in the README update. This ensures compatibility with the required library version.
sample_data/generate_synthetic_data.py (1)
3-5: Improved docstring formatting and clarity.The docstring has been reformatted with proper line wrapping for better readability, without changing the content.
Also applies to: 10-12
EventStream/data/README.md (1)
79-81: Consistent representation of missing values in documentation. The change from np.NaN to float('nan') ensures documentation consistency with the actual implementation for representing missing values in the codebase.
Also applies to: 393-395
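A minimal illustration of the NumPy 2.0 change being documented (values here are arbitrary):

```python
import numpy as np

# np.NaN was removed in NumPy 2.0; the lowercase np.nan and float("nan") remain valid.
x = float("nan")
print(np.isnan(x), np.isnan(np.nan))  # True True
```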
EventStream/utils.py (1)
19-19: Appropriate import addition for structured logging.Adding the Loguru logger import is consistent with the PR objective of replacing print statements with structured logging.
EventStream/transformer/lightning_modules/zero_shot_evaluator.py (3)
13-13: Appropriate import addition for structured logging.Adding the Loguru logger import is consistent with the PR objective of replacing print statements with structured logging.
172-175: Good replacement of print with logger.error for failures. Replacing print statements with structured logging for error conditions is appropriate and improves the ability to track and filter issues.
384-384: Good replacement of print with logger.info for status messages. Using logger.info for informational messages is consistent with good logging practices.
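For context, a minimal sketch of the Loguru calls this refactor standardizes on; the messages are illustrative, not taken from the repository.

```python
from loguru import logger

logger.info("Zero-shot evaluation started")          # informational status message
logger.warning("Metrics config is empty; skipping")  # recoverable issue worth surfacing
try:
    1 / 0
except ZeroDivisionError:
    # logger.exception logs at ERROR level and appends the traceback automatically.
    logger.exception("Metric computation failed")
```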
EventStream/transformer/lightning_modules/embedding.py (3)
9-9: Appropriate import addition for structured logging.Adding the Loguru logger import is consistent with the PR objective of replacing print statements with structured logging.
157-159: Good replacement of print with logger.info for file existence checks.Using structured logging for messages about existing files improves log consistency and filtering capabilities.
161-161: Good replacement of print with logger.info for saving operations.Using structured logging for saving operations enhances observability and maintains a consistent logging approach.
EventStream/data/types.py (1)
753-753: Improved DataFrame display format for truncated lists.The change to use ellipses for truncation improves readability of the DataFrame representation, especially for large lists of floating-point values.
EventStream/tasks/profile.py (2)
85-85: DataFrame schema updates for nullable columns.The schema for
start_timecolumn has been updated fromf32tonulltype, correctly reflecting that this column can contain null values, which is consistent with the example data.Also applies to: 98-98
154-154:❓ Verification inconclusive
Standardize Polars API method naming.
The code has been updated to use group_by instead of groupby and pl.len() instead of pl.count(). These changes align with the current Polars API naming conventions for better consistency. Let's confirm these changes match the current Polars API:
Also applies to: 156-156, 182-182, 256-257, 283-284, 300-301, 341-341, 356-357
🏁 Script executed:
#!/bin/bash
# Let's check if these are the correct method names in the current Polars version
pip show polars | grep Version
python -c "import polars as pl; print('group_by method exists:', hasattr(pl.DataFrame, 'group_by')); print('len function exists:', hasattr(pl, 'len'))"
Length of output: 368
Polars API Naming Update – Manual Verification Needed
The updated code now uses group_by in place of groupby and pl.len() instead of pl.count(), which aligns with the intended new naming conventions. However, our automated verification could not confirm this because the Polars module wasn't present in the environment.
- Action Required: Please manually verify that the current Polars version used in your project supports these method names.
- Suggestion: Ensure the Polars package is installed (e.g., via your CI or local environment) and re-run a version check to confirm the API changes.
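A small sketch of the renamed API, assuming Polars >= 0.20 (the data is illustrative):

```python
import polars as pl

df = pl.DataFrame({"subject_id": [1, 1, 2], "value": [0.5, 1.5, 2.0]})

# `group_by` replaces the deprecated `groupby`, and `pl.len()` replaces `pl.count()`.
counts = df.group_by("subject_id").agg(pl.len().alias("n_rows"))
print(counts.sort("subject_id"))
```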
EventStream/transformer/lightning_modules/generative_modeling.py (4)
15-15: Add structured logging with Loguru.Added import of
loggerfromloguruto implement structured logging throughout the module.
283-286: Improve error handling with structured logging.Replaced print statements with
logger.errorfor improved error visibility and organization when metric computation fails. This provides better debugging capabilities in production environments.
522-529: Enhanced metrics configuration for validation phases.Updated
final_validation_metrics_configto specifically include loss metrics for tuning and held-out splits instead of skipping all metrics. This provides more meaningful validation metrics during final model evaluation.The previous default (
do_skip_all_metrics=False) was non-specific, while the new implementation explicitly defines which metrics to include for each split, making the configuration more predictable and explicit.
599-600: Replace print statements with structured logging.Changed multiple print statements to appropriate
logger.infocalls throughout the file. This standardizes the logging approach and enables better control over log output management.Also applies to: 604-605, 627-628, 708-709
EventStream/data/visualize.py (1)
164-165: Standardize Polars API method naming conventions.Updated all occurrences of
groupbytogroup_byto align with current Polars API naming conventions. This ensures compatibility with newer versions of the library and maintains consistent method naming throughout the codebase.Also applies to: 182-183, 256-257, 283-284, 300-301, 341-342, 356-357
EventStream/transformer/model_output.py (7)
12-12: Add structured logging with Loguru.Added import of
loggerfromloguruto implement structured logging throughout the module.
434-437: Improve warning messages with structured logging.Replaced print statements with
logger.warningcalls for improper measurement detection. This provides better log level control and visibility of warnings in production environments.
461-461: Standardize warning logging.Replaced print statement with
logger.warningfor consistent logging of improper measurement warnings.
529-534: Enhance error message clarity with detailed context.Improved the error handling by providing more detailed error messages that include the measurement name, shape information, vocabulary offset, and the actual indices that caused the error. This makes debugging easier by providing more context about the failure.
1027-1031: Improve error message for index construction failures.Enhanced the error message when concatenating dynamic indices fails, making it easier to debug by showing the shapes of both tensor arguments.
1496-1503: Provide comprehensive error context for classification issues.Improved error message for classification loss computation failures with detailed information about vocabulary ranges and label values. This makes it significantly easier to identify and diagnose issues with classification predictions.
1361-1361: Enhance TTE error messages with tensor summaries.Improved the error message for time-to-event log probability computation failures by including tensor shape and content information, which aids in debugging.
EventStream/data/time_dependent_functor.py (2)
147-149: Use of .dt.total_nanoseconds() for better precision. This change correctly updates the age calculation to use .dt.total_nanoseconds(), which enhances clarity and precision.
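A hedged sketch of an age computation built on this accessor; the column names and the nanoseconds-per-year constant are assumptions for illustration, not the repository's values.

```python
from datetime import datetime

import polars as pl

df = pl.DataFrame({
    "dob": [datetime(1980, 6, 1)],
    "timestamp": [datetime(2020, 6, 1)],
})

NS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1e9  # approximate Julian year, assumed here
ages = df.with_columns(
    ((pl.col("timestamp") - pl.col("dob")).dt.total_nanoseconds() / NS_PER_YEAR).alias("age_years")
)
print(ages)
```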
190-193: Example parameters initialization in docstring. These added fields in the example (mean, std, thresh_large, thresh_small) align with the updated metadata logic. No concerns here.
tests/test_e2e_runs.py (6)
33-40: Added "esds" directory for new dataset workflow.Including
"esds"in the loop is consistent with the new ESDS dataset creation steps.
48-56: Conditional subtests for clearer reporting.Using
subTestwhenuse_subtest=Trueimproves test structure and error isolation. This is a good testing practice.
73-73: Skipping subtests for dataset build command.Passing
use_subtest=Falseis correct if we want a single fail/success outcome for the build process.
75-82: New method for ESDS dataset creation.The
build_ESDS_datasetmethod is a clean addition to generate ESDS format from the core dataset.
110-110: Disable subtest for pretraining command.Again, disabling subtest for pretraining is consistent if the entire command is treated as one test step.
256-256: Invoke ESDS dataset build in end-to-end test.Ensures ESDS creation is part of the comprehensive E2E pipeline.
EventStream/data/dataset_polars.py (20)
13-13: Importingdefaultdictto manage grouped data structures.
No issues here.
15-15: Importingtimedeltafor rolling window offsets.
Seems appropriate for time-based aggregation.
22-22: Importingpyarrowfor schema definitions.
Enables better handling of Arrow schemas.
23-23: Added Loguru for structured logging.
Enhances debug logging capabilities.
38-38: Enabling Polars string cache.Helps with reducing memory usage when dealing with many repeated strings.
40-55: Mapping Polars dtypes to PyArrow dtypes.Covering a wide range of data types ensures consistent schema conversions.
115-118: New columns inMETADATA_SCHEMA.Adds
thresh_high,thresh_low,mean,stdfor refined numeric metadata processing.
193-193: New debug logging for query.
No issues; helpful for troubleshooting.
222-224: Reading from database with updated parameters.Changes ensure correct argument passing to
read_database_uri.
231-231: Aliassubject_idin column expressions.Provides consistent naming for subject ID column.
642-642: Calling.collect()onevents_df.Collecting lazy computations at this step is crucial for upcoming transformations.
645-646: Group by unique timestamps and subject_id. When agg_by_time_scale is None, grouping ensures distinct event definitions.
649-656: Aggregating timestamps into dynamic buckets. Uses group_by_dynamic for time-bucketed event grouping. Logical approach.
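A minimal sketch of time-bucketed grouping with group_by_dynamic; the hourly window and column names are illustrative, not the repository's configuration.

```python
from datetime import datetime

import polars as pl

events = pl.DataFrame({
    "timestamp": [
        datetime(2020, 1, 1, 0, 5),
        datetime(2020, 1, 1, 0, 20),
        datetime(2020, 1, 1, 1, 10),
    ]
}).sort("timestamp")

# Bucket events into 1-hour windows; each bucket would then define one aggregated event.
bucketed = events.group_by_dynamic("timestamp", every="1h").agg(pl.len().alias("n_events"))
print(bucketed)
```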
664-667: Hash new aggregated event group.Same hashing approach applies here. Watch for collisions if the dataset grows large.
707-707: Debug log for subject event counts.No issues, improves visibility during the collection step.
725-742: Robust error handling for column filtering.This block clarifies type mismatches by enumerating erroneous data. Looks good.
869-873: Joining integer-inference results into metadata.Combining
is_int_exproutcomes is seamless. No concerns.
883-883: Group byvocab_keys_colin numerical inference.No direct issues with grouping logic.
634-640:✅ Verification successful
Ensure matching event_id dtype across dataframes.
Casting
dynamic_measurements_dfevent_id to matchevents_dfavoids type mismatches. Verify that no data is inadvertently truncated.
🏁 Script executed:
#!/bin/bash # Check for all references to event_id casts and potential mismatches. rg -A 3 'col\("event_id"\).*cast'Length of output: 348
Review Verification: Event ID Casting is Correct, but Check for Data Truncation
The cast operation in
EventStream/data/dataset_polars.pyat lines 634–640 appropriately ensures that theevent_iddatatype ofdynamic_measurements_dfmatches that ofevents_df. Our search confirms that this is the sole usage of a cast forevent_idand no extraneous casts are present.
- Ensure that the cast from the original dtype to
event_id_dtdoes not inadvertently truncate or alter data (e.g., if converting from a higher-precision numeric type to a lower one).Overall, the implementation appears correct as long as you’ve confirmed through testing that no precision loss occurs during the cast.
332-334:❓ Verification inconclusive
Hash-based event_id assignment.
Generating event_id through hashing is convenient but can theoretically risk collisions for extremely large datasets. Consider a collision-safe approach if data volumes are massive.
🏁 Script executed:
#!/bin/bash
# If there's a collision check mechanism, let's locate references to event hashing.
rg 'event_id.*hash'
Length of output: 21
Hash-based event_id assignment – verify collision management
The current implementation generates event_id via a hash on selected columns without any collision detection. While collisions are very unlikely, especially on typical dataset sizes, there isn't any accompanying mechanism to detect or handle collisions, which could be a potential risk when processing extremely large datasets. Please verify whether a collision-safe approach is needed for your expected volumes or if additional safeguards should be implemented.
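To make the trade-off concrete, here is a hedged sketch of deriving event IDs lazily by hashing subject ID and timestamp; the exact expression in the repository may differ, and the resulting IDs are only probabilistically unique.

```python
import polars as pl

events = pl.LazyFrame({
    "subject_id": [1, 1, 2],
    "timestamp": ["2020-01-01T00:00", "2020-01-02T12:30", "2020-01-01T08:15"],
})

# A hash over (subject_id, timestamp) can be computed lazily, unlike a global counter,
# but 64-bit hashes can collide, so very large datasets may warrant a uniqueness check.
with_ids = events.with_columns(
    pl.concat_str([pl.col("subject_id").cast(pl.Utf8), pl.col("timestamp")], separator="|")
    .hash()
    .alias("event_id")
)
print(with_ids.collect())
```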
EventStream/transformer/transformer.py (12)
122-122: Use a descriptive dropout naming.
The rename from attn_dropout to attn_dropout_p clarifies that it is a dropout probability. This is standard practice and appears consistent with the rest of the code.
179-179: Explicitly raise an error for unsupported head_mask.
You raise aValueErrorifhead_maskis notNone, which is appropriate since the feature is currently unsupported. Consider adding more detail in the docstring about future plans or alternatives for head masking.Also applies to: 185-187
188-188: Initialize mask_value with negative infinity.
Usingtorch.finfo(value.dtype).minand creating atorch.fullfor masking is a standard approach for attention masking. The implementation is straightforward and correct.Also applies to: 190-191
193-196: Casting query and key to match value dtype.
Ensuringqueryandkeyshare the same floating precision asvaluehelps maintain consistency in operations. This is aligned with certain GPT-Neo/J designs.
199-202: Handling batch_size=1 and padded attention.
Raising an error whenattention_mask[0, 0, 0, -1] < -1forbatch_size == 1appears to be a specific constraint of your “BetterTransformer.” Please verify edge cases to ensure no unexpected user interruptions in multi-pass evaluation.
204-218: Attention mask ignored in single-batch or training mode.
Bypassingattention_maskin this branch could be risky if the model relies on specialized or variable-length masks. Confirm that ignoring the mask forbatch_size == 1or in training is truly the intended behavior.
219-232: Applying causal and external attention masks together.
Combining thecausal_maskandattention_maskis valid. Double-check the indexing (key_length - query_length) for scenarios wherequery_lengthequals or exceedskey_length.
234-236: Scaled dot product attention for multi-batch scenario.
Usingis_causal=Falsehere complements the explicit causal mask above. This separation of responsibilities keeps the logic clearer.
238-238: Returning None for attention weights.
Providing(sdpa_result, None)makes sense if you do not currently expose attention weights. This is consistent with the rest of the_attnusage.
626-631: Parameter assignment for even vs. odd embedding dimensions.
Truncatingcos_div_termfor odd embedding sizes handles uneven dimensional splits. Check for corner cases likeembedding_dim=1to ensure no out-of-range errors.
678-680: Exponent-based frequency calculation.
Using exponentials oftorch.arange(...)is a standard approach to generate sinusoidal frequencies. This is consistent with well-known positional encoding methods.Also applies to: 682-683
689-691: Non-trainable parameters for sinusoidal encoding.
Definingsin_div_termandcos_div_termwithrequires_grad=Falseensures static embeddings, closely matching the original sinusoidal approach.Also applies to: 692-694
EventStream/data/dataset_base.py (8)
21-22: Added polars and loguru imports.
Importingpolarsfor data manipulation and usinglogurufor structured logging is consistent with your broader logging refactor.Also applies to: 24-24
187-187: Refined signature for loading subject data.
Passingschema.columns_to_loadandsubject_id_col=schema.subject_id_colstandardizes_load_input_dfcalls. Check that any previously handled ID-type conversions are still addressed.Also applies to: 189-189, 191-191
220-220: Consolidated input loading.
Reusing_load_input_dfwith combinedall_columnshelps unify data retrieval. Ensure the function handles missing or duplicate columns gracefully.
226-227: Debug logging for filtering.
Addinglogger.debug("Filtering")provides traceability when applying conditions inschema.filter_on. This is good for diagnosing potential data coverage issues.
264-264: Appending events and measurements.
Appending the newly processedeventsandmeasurementsis straightforward. Consider confirming they are valid (non-None) before adding to the lists.Also applies to: 266-266
290-290: Enhanced logging during load and extraction.
Replacing silent operations or print statements with structuredlogger.info/logger.debugcalls improves observability of the data flow.Also applies to: 310-310, 329-329, 367-367, 475-475, 479-479
510-511: Post-processing events after assignment.
Applying_agg_by_timeand_sort_eventswhenevents_dfis not None should help maintain chronological correctness. Ensure_validate_initial_dfsis still valid for edge cases.Also applies to: 513-515, 521-521
578-578: Mandatory subject sets in splitting logic.
Your newmandatory_set_IDsfeature ensures some subject IDs are forced into specific splits. Confirm that leftover IDs and fraction-based splits still behave as expected without conflict.Also applies to: 592-592, 625-634, 635-637
EventStream/logger.py (1)
1-1: Hydra + Loguru integration.
Defining hydra_loguru_init() to place logs in Hydra's output directory is a neat solution. This improves consistency across scripts.
Also applies to: 3-4, 7-10
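A minimal sketch of what such a bridge can look like, assuming Hydra >= 1.2; this is an illustration, not necessarily the repository's exact implementation.

```python
from hydra.core.hydra_config import HydraConfig
from loguru import logger


def hydra_loguru_init() -> None:
    """Add a Loguru sink inside the current Hydra run directory so logs live with the run outputs."""
    run_dir = HydraConfig.get().runtime.output_dir
    logger.add(f"{run_dir}/main.log")
```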
scripts/get_embeddings.py (1)
14-14: Implementation of structured logging looks good.The addition of the Loguru initialization properly integrates structured logging into the script's execution flow, which is consistent with the codebase-wide refactoring approach described in the PR.
Also applies to: 25-25
scripts/zeroshot.py (1)
15-15: Structured logging implementation looks good.The Loguru initialization is properly placed at the start of the main function, ensuring logging is set up before any operations are performed. This is consistent with the logging pattern implemented across other scripts.
Also applies to: 26-26
scripts/sklearn_baseline.py (1)
15-15: Consistent logging implementation.The integration of Loguru-based logging follows the same pattern as other scripts in the codebase, providing consistency in how logging is initialized across the application.
Also applies to: 20-20
scripts/launch_sklearn_baseline_supervised_wandb_hp_sweep.py (1)
18-19: Logging implementation matches the project pattern.The addition of Loguru initialization at the beginning of the main function follows the established pattern across the codebase, ensuring consistent logging behavior before the main execution flow.
Also applies to: 82-82
scripts/launch_pretraining_wandb_hp_sweep.py (2)
18-18: Well-structured import of the logging initialization function.The addition of the
hydra_loguru_initimport from theEventStream.loggermodule follows good practices, clearly indicating the source of the function.
80-80: Appropriate placement of logging initialization.The
hydra_loguru_init()function is called at the beginning of themainfunction, ensuring logging is configured before any other operations occur. This is a good practice for enabling comprehensive logging throughout the script's execution.setup.py (1)
27-27: Script entry point correctly updated.The script entry has been updated from
"scripts/launch_wandb_hp_sweep.py"to"scripts/launch_pretraining_wandb_hp_sweep.py", which aligns with the refactoring and specialization of the script for pretraining tasks.scripts/pretrain.py (2)
19-19: Appropriate import of logging initialization function.The import of
hydra_loguru_initfromEventStream.loggeris well-placed and consistent with the logging standardization across the codebase.
30-30: Correct placement of logging initialization.The
hydra_loguru_init()function is called at the beginning of themainfunction, ensuring logging is properly configured before any other operations. This approach is consistent with the standardized logging strategy being implemented across the codebase.scripts/finetune.py (2)
18-18: Well-structured import of logging initialization.The import of
hydra_loguru_initfromEventStream.loggerfollows good practices and is consistent with the logging standardization across the codebase.
26-26: Proper placement of logging initialization.The
hydra_loguru_init()call at the beginning of themainfunction ensures consistent logging configuration before any processing begins. This follows the same pattern implemented across all scripts, creating a unified logging approach.scripts/launch_finetuning_wandb_hp_sweep.py (2)
18-19: Good addition of structured logging import.Adding the
hydra_loguru_initimport supports the transition to structured logging with Loguru.
78-78: Excellent logging initialization.Calling
hydra_loguru_init()at the beginning of themainfunction ensures proper logging setup before any operations are performed. This will create log files in the Hydra output directory, improving traceability.scripts/launch_from_scratch_supervised_wandb_hp_sweep.py (2)
18-19: Good addition of structured logging import.Adding the
hydra_loguru_initimport supports the transition to structured logging with Loguru.
82-82: Excellent logging initialization.Calling
hydra_loguru_init()at the beginning of themainfunction ensures proper logging setup before any operations are performed. This will create log files in the Hydra output directory, improving traceability.EventStream/transformer/config.py (11)
14-14: Good addition of logger import.Adding Loguru's logger import supports the transition from print statements to structured logging.
575-577: Improved logging for warning messages.Replacing print statements with
logger.warningprovides better log structure, level control, and output options.
581-583: Improved logging for warning messages.Replacing print statements with
logger.warningprovides better log structure, level control, and output options.
629-633: Improved logging for configuration warnings.Using
logger.warninginstead of print statements for configuration parameter warnings.
636-640: Improved logging for configuration warnings.Using
logger.warninginstead of print statements for configuration parameter warnings.
643-648: Improved logging for configuration warnings.Using
logger.warninginstead of print statements for configuration parameter warnings.
651-653: Improved logging for configuration warnings.Using
logger.warninginstead of print statements for configuration parameter warnings.
656-658: Improved logging for configuration warnings.Using
logger.warninginstead of print statements for configuration parameter warnings.
759-764: Improved logging for TTE configuration warnings.Using
logger.warninginstead of print statements for Time-To-Event configuration parameter warnings.
767-771: Improved logging for TTE configuration warnings.Using
logger.warninginstead of print statements for Time-To-Event configuration parameter warnings.
774-778: Improved logging for TTE configuration warnings.Using
logger.warninginstead of print statements for Time-To-Event configuration parameter warnings.scripts/build_dataset.py (6)
18-18: Good addition of logger import.Adding Loguru's logger import supports the transition from print statements to structured logging.
34-34: Good addition of hydra_loguru_init import.This import brings in the Hydra-compatible logger initialization function.
54-58: Updated docstring example for logger usage.The docstring examples have been updated to demonstrate the use of Loguru logger instead of print statements, which provides better guidance for developers.
74-74: Improved warning logging.Replaced print statement with
logger.warningfor clearer and more structured warning messages.
83-84: Excellent logging initialization.Calling
hydra_loguru_init()at the beginning of themainfunction ensures proper logging setup before any operations are performed. This will create log files in the Hydra output directory, improving traceability.
365-365: Improved informational logging.Replaced print statement with
logger.infofor providing information about omitted configuration parameters, which improves log consistency and structure.docs/MIMIC_IV_tutorial/data_extraction_processing.md (3)
24-25: Updated repository name from EFGPT to ESGPT.The repository naming has been consistently updated throughout the document, reflecting the project's new name.
27-33: Updated command line references to use ESGPT_PATH.The commands now correctly reference the updated environment variable
ESGPT_PATHinstead of the previousEFGPT_PATH, maintaining consistency with the project renaming.
46-47: Updated Hydra configuration reference to use ESGPT.The Hydra configuration now correctly points to the ESGPT repository's configs path, completing the renaming update throughout the document.
scripts/prepare_pretrain_subsets.py (4)
24-28: Added required imports for structured logging.The addition of the
logurulogger and the customhydra_loguru_initutility function enables structured logging throughout the script, which is a significant improvement over using print statements.
33-33: Initialized structured logging system.Added initialization of the Loguru logging system via
hydra_loguru_init()at the beginning of the main function, ensuring consistent logging configuration throughout script execution.
63-63: Replaced print with structured logging for experiment directory.Using
logger.info()instead ofprint()provides better log management, including timestamps, log levels, and potential filtering capabilities.
255-255: Replaced print with structured logging for command file writing.Using
logger.info()instead ofprint()for notifying about command file creation ensures consistent log formatting and improves observability of the script's execution.EventStream/baseline/FT_task_baseline.py (7)
19-19: Added import for structured logging.The addition of the
logurulogger import enables structured logging throughout the file, replacing print statements with more flexible and powerful logging capabilities.
39-39: Enabled string caching for polars.Added
pl.enable_string_cache()to improve performance when working with string columns in polars dataframes, which is especially beneficial for repeated operations on the same string values.
191-191: Added explicit column selection after caching filtered task.This new line selects specific columns (
"subject_id","timestamp", and window features) after caching the filtered task data, which ensures consistent column ordering and potentially reduces memory usage by excluding unnecessary columns.
654-654: Replaced print with structured logging for config saving.Using
logger.info()instead ofprint()for configuration saving notification provides better log management, including timestamps and log levels.
679-679: Replaced print with structured logging for representations loading.Using
logger.info()instead ofprint()for loading representations notification improves log consistency and management.
711-718: Replaced print with structured logging for dataset loading.Using
logger.info()instead ofprint()for dataset loading notification improves observability and log management. The multi-line format maintains readability while providing structured logging benefits.
721-730: Replaced print statements with structured logging for model lifecycle.Converting all model initialization, fitting, saving, and evaluation print statements to
logger.info()calls creates a consistent logging pattern throughout the model lifecycle, improving debugging and monitoring capabilities.EventStream/transformer/lightning_modules/fine_tuning.py (7)
17-17: Added import for structured logging.The addition of the
logurulogger import enables structured logging throughout the file, replacing print statements with more flexible and configurable logging capabilities.
187-190: Enhanced error reporting with structured logging.Replaced print statement with
logger.error()for metric computation failures, which provides better visibility for errors and maintains consistent logging practices throughout the codebase.
400-403: Improved warning visibility with structured logging.Using
logger.warning()instead of print for notifying about missing train subset seed provides better log categorization and filtering capabilities, making it easier to identify potential configuration issues.
406-443: Replaced print statements with structured logging for configuration handling.Replaced multiple print statements with appropriate
logger.info()andlogger.warning()calls for loading and overwriting configuration parameters, providing consistent logging throughout the configuration management process.
470-476: Enhanced configuration file saving with structured logging.Using
logger.info()instead of print for configuration file saving operations improves logging consistency and enhances visibility of file operations.
492-492: Updated commented-out code to use structured logging.Even in commented-out code, print statements were updated to use
logger.info(), maintaining consistency throughout the codebase for future use when this code is uncommented.
579-579: Enhanced metrics saving notification with structured logging.Using
logger.info()instead of print for final metrics saving notification maintains consistent logging practices throughout the file.scripts/convert_to_ESDS.py (6)
4-9: Good practice using stackprinter with a graceful fallback.The script correctly implements a try-except block for the stackprinter import, allowing the script to continue running without this development dependency. This is a good pattern for optional dependencies.
26-31: Well-structured configuration class with appropriate type hints.The
ConversionConfigclass is properly decorated with@hydra_dataclassand includes clear type hints with union types for string or Path parameters. The default values are sensible for optional parameters.
33-37: Good use of post-initialization to standardize path types.The
__post_init__method ensures consistency by converting string path arguments to Path objects, which will make path operations more reliable throughout the code.
40-46: Configuration setup and validation is implemented correctly.The main function is properly decorated with Hydra's decorator, initializes logging, and handles configuration instantiation if needed. The type check ensures the configuration is of the expected type.
47-55: Thorough check for existing files with clear error messages.The script checks if output files already exist and either raises a descriptive error or handles overwriting as specified in the config. Good use of f-strings in the error messages for clarity.
58-73: Well-implemented dataset processing with progress tracking.The code effectively:
- Loads the dataset
- Splits subjects into configurable chunk sizes
- Uses tqdm for progress tracking (both for the overall process and individual chunks)
- Saves each chunk as a Parquet file with properly nested directories
The approach is scalable and user-friendly with its progress visualization.
tests/data/test_dataset_base.py (2)
259-277: Good addition of test case for mandatory subject IDs.This new test comprehensively verifies that the split functionality correctly handles mandatory subject IDs, ensuring they're assigned to the specified splits regardless of the random seed. The test asserts all key aspects:
- The correct keys in the split
- All subject IDs being accounted for
- Mandatory sets appearing in the correct keys
- Expected sizes of each split
491-498: Updated measurement metadata structure aligns with new implementation.The change from
outlier_modelandnormalizercolumns tomean,std,thresh_small, andthresh_largereflects a shift in how measurement statistics are stored and used throughout the codebase. This update ensures the test correctly validates the new structure.tests/data/test_config.py (6)
183-190: Updated measurement metadata structure is consistent with other changes.The replacement of
outlier_modelandnormalizercolumns withmean,std,thresh_small, andthresh_largemaintains consistency with the changes in other parts of the codebase. This ensures all tests properly validate the new metadata structure.
204-206: Consistent metadata structure update in Series-based metadata.The change to the Series-based measurement metadata maintains the same pattern, ensuring that all metadata-related code and tests are using the same structure regardless of whether the metadata is stored in a DataFrame or Series.
253-257: Simplified univariate metadata structure.The univariate measurement metadata has been simplified to only include the
meanvalue, removing the previous dictionary-based structure that included outlier model and normalizer. This aligns with the broader changes to metadata handling.
317-324: Consistent metadata update in empty metadata tests.The same pattern of replacing
outlier_modelandnormalizerwithmean,std,thresh_small, andthresh_largeis applied to the tests for empty metadata initialization, ensuring consistency across all test cases.
341-343: Maintained consistency in test for functional time-dependent measurement metadata.The Series-based metadata for functional time-dependent measurements is updated with the same structure, ensuring that all test cases reflect the changes to metadata formatting throughout the codebase.
454-455: Simplified outlier detector configuration.The test for valid parameter validation now uses an empty dictionary for
outlier_detector_configinstead of a complex structure with aclskey, reflecting a simplification in how outlier detection is configured.tests/data/test_pytorch_dataset.py (8)
11-13: Updated imports to use ragged tensor implementation.The import of
JointNestedRaggedTensorDictfromnested_ragged_tensors.ragged_numpyreplaces the previous direct import oftorch, indicating a shift to a more specialized tensor handling implementation.
56-59: Added time delta calculations for better event sequence representation.The addition of time delta calculations for each subject provides a more complete representation of event sequences, which is valuable for time-series analysis. The approach correctly handles the calculation of differences between consecutive event times and appends NaN for the last event.
Also applies to: 66-69, 77-79
85-86: Updated DataFrame to include time delta information and handle empty cases.The DataFrame structure now includes the time_delta field and properly handles empty cases with empty lists. This allows the test to verify time delta handling behavior when no events exist.
315-359: Improved test setup with proper isolation and resource management.The addition of
setUpmethod initializes a temporary directory and prepares all necessary test files and directories. This is a significant improvement in test organization and isolation:
- Creates temporary directories that will be automatically cleaned up
- Prepares fake split data with proper directory structure
- Writes necessary configuration and data files
- Ensures tests run in an isolated environment
360-361: Proper resource cleanup in tearDown.The addition of a
tearDownmethod ensures that temporary resources are properly cleaned up after test execution, preventing resource leaks and maintaining test isolation.
363-374: Refactored get_pyd method for better reusability.The
get_pydmethod has been refactored to use the temporary directory created in setUp and accept optional parameters. This makes the method more flexible and reusable across different test cases.
421-424: Improved assertion with better error messages.The assertion in
test_normalize_taskhas been updated to useassertTruewith a detailed error message that shows both the expected and actual values. This will make debugging failed tests much easier.
427-427: Updated test initialization to use refactored get_pyd method.The test now uses the refactored
get_pydmethod with appropriate parameters, which is consistent with the other changes in the file.EventStream/data/pytorch_dataset.py (7)
8-19: Replace print statements with structured logging.These import changes align well with the switch to Loguru-based logging, improving flexibility and configurability over plain print statements.
57-57: Check for off-by-one or null edge cases.The usage of
unique().sort().search_sorted()is correct for building indices, but be cautious about columns with extremely large or all-null data. Ensure downstream code handles columns with no valid data gracefully.Could you confirm if there's an upstream validation that ensures
colcontains at least some valid entries (non-null) before callingto_int_index?
61-61: Appropriate class inheritance.Inheriting from both
SeedableMixinandtorch.utils.data.Datasetis a good approach for handling reproducible random operations. No concerns here.
295-320: Extend unit testing for newly introduced logic.The new task logic (like
normalize_taskand stored vocabs) is well structured. Ensure this logic is thoroughly tested, particularly handling corner cases (e.g., empty or invalid columns).Do you have tests covering tasks with missing labels or unexpected data types?
511-547: Robust dynamic-only collate function.Logic for splitting dense vs. dynamic fields and converting each data type is sound. The usage of masks and
.astype(...)conversions are consistent with torch's type system.
549-555: Conditionally included fields.Enforcing checks before accessing collated fields (e.g.,
start_time,subject_id) is good practice. Confirm that the calling site sets the config flags appropriately to avoid potential key errors when fields aren't in the batch.Would you like to run checks ensuring no key misalignment can occur if external code sets these flags incorrectly?
563-571: Explicit task handling.Using Python's structural pattern matching for
task_typesis clean and explicit. The_branch raises aNotImplementedError, preventing silent failures.EventStream/data/config.py (12)
7-8: Enhanced tooling imports.Imports for
hashlibandjsonappear consistent with the new caching features and parameter hashing. No concerns here.
19-19: Loguru-based logging integration.Replacing print statements with
loggercalls is consistent with the broader codebase shift, consolidating logging.
809-812: New subset parameters for tuning data.Adding
tuning_subset_sizeandtuning_subset_seedextends dataset configurability. Validate via the same logic used fortrain_subset_sizeto avoid mismatch or unexpected seeds.
896-907: Caching with random vs. non-random subsequences.The logic disallowing multi-epoch caching for non-random sampling is helpful to avoid stale subsequences. The exception message is clear.
929-931: Auto-generation of seeds.Randomly assigning subset seeds helps reproducibility. Confirm that logs clearly inform the user of the assigned seed for easy reference.
939-953: Consistent logic for tuning subset size.Validation and fallback behavior here mirror the training subset logic. This consistency is excellent for maintainability and clarity.
967-974: Vocabulary config paths.Accessible property-based approach for configuration paths is neatly organized. Ensure offline usage of these files handles missing file scenarios gracefully.
Do you handle missing or corrupted JSON gracefully in
VocabularyConfig.from_json_file?
989-995: Derived paths for caching task dataframes.
Conditional logic is straightforward. If `task_df_name` is None, returning `None` is a good design to reduce confusion.

997-1002: Safely constructing file paths.
This ensures only relevant paths are used when a task name is present. It is consistent with the logic in `.cached_task_dir`.

1004-1009: Task info file path.
Returning `None` if `task_df_name` is not set is consistent with the property approach used elsewhere.

1051-1057: Safely persisting data parameters.
Storing data parameters as JSON fosters reproducibility. Ensure partial writes or file permission errors are handled gracefully to avoid a broken cache.
Are you handling potential file I/O exceptions anywhere in the codebase?

1059-1062: Tensorized cache discovery.
Glob-based scanning for pre-collated NPZ files is a neat approach. Implementation is straightforward and flexible.
```diff
  # raising the below exception to avoid multirun failure
- print(f"EXCEPTION: {ex}")
- print(traceback.print_exc(), file=sys.stderr)
+ logger.error(f"EXCEPTION: {ex}\nTRACEBACK:\n{traceback.print_exc()}")
```
Fix potential issue with traceback handling in logging.
The current implementation uses traceback.print_exc() within an f-string, but this function prints to stderr and returns None. This will result in the string "None" being included in your log message instead of the actual traceback.
Replace with traceback.format_exc() to properly capture the traceback as a string:
```diff
-logger.error(f"EXCEPTION: {ex}\nTRACEBACK:\n{traceback.print_exc()}")
+logger.error(f"EXCEPTION: {ex}\nTRACEBACK:\n{traceback.format_exc()}")
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```suggestion
logger.error(f"EXCEPTION: {ex}\nTRACEBACK:\n{traceback.format_exc()}")
```
```diff
+ mean = float(measurement_metadata["mean"]) if "mean" in measurement_metadata else 0
+ std = float(measurement_metadata["std"]) if "std" in measurement_metadata else 1

+ if "thresh_large" in measurement_metadata:
+     thresh_large = float(measurement_metadata["thresh_large"])
+ else:
+     thresh_large = float("inf")

- thresh_large = measurement_metadata["outlier_model"]["thresh_large_"]
- thresh_small = measurement_metadata["outlier_model"]["thresh_small_"]
+ if "thresh_small" in measurement_metadata:
+     thresh_small = float(measurement_metadata["thresh_small"])
+ else:
+     thresh_small = float("-inf")
```
💡 Verification agent
🧩 Analysis chain
Handle zero standard deviation case.
While defaulting mean and std to 0 and 1 respectively is fine, watch out for scenarios where std might actually be 0 in real data, which can cause divisions by zero downstream.
Run a quick check across the codebase to ensure that zero std is handled or flagged appropriately.
🏁 Script executed:

```shell
#!/bin/bash
# Search for code paths that do 1/std or similar
rg --context 3 '1\s*/\s*std|\/\s*std'
```

Length of output: 138
ACTION: Verify Division-by-Zero Handling for std
The defaults of mean and std (0 and 1, respectively) are acceptable only if real data never results in a zero standard deviation. However, if std comes in as 0, downstream operations—especially divisions using std—can lead to runtime errors. Please verify that every code path performing a division by std has an appropriate check or fallback to avoid a division-by-zero error.
To facilitate this review, I've prepared a new verification script that searches the repository for any use of std in division contexts. Please run the following script:
If the command finds any divisions by std without proper zero-checks, then please update the relevant sections to handle the zero case (for example, by checking if std is 0 and providing a safe alternative).
🏁 Script executed:

```shell
#!/bin/bash
rg --context 3 '/\s*std'
```

Length of output: 1641
ACTION REQUIRED: Handle Division-by-Zero on Standard Deviation
It appears that the file EventStream/data/time_dependent_functor.py contains a division by std without checking if it's zero (specifically, in the line new_age = (new_age - mean) / std). This poses a risk of division-by-zero errors if measurement_metadata["std"] is provided as zero.
- Location: `EventStream/data/time_dependent_functor.py`
  - Affected line: `new_age = (new_age - mean) / std`
- Suggestion: Add a condition to check if `std` is zero (or near zero), and either raise an error or set a safe fallback to prevent runtime exceptions.
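A minimal sketch of the suggested guard; the epsilon threshold and error message are illustrative choices, not project code:

```python
def safe_normalize(value: float, mean: float, std: float, eps: float = 1e-8) -> float:
    """Normalize `value`, guarding against a (near-)zero standard deviation."""
    if abs(std) < eps:
        raise ValueError(
            f"Standard deviation is ~0 ({std}); refusing to normalize to avoid divide-by-zero."
        )
    return (value - mean) / std

# e.g. new_age = safe_normalize(new_age, mean, std)
```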
```diff
+                .name.map(lambda c: f"dynamic/{m}/{c.replace(key_prefix, '')}/count"),
                 (
                     (cs.starts_with(val_prefix).is_not_null() & cs.starts_with(val_prefix).is_not_nan())
                     .sum()
-                    .map_alias(lambda c: f"dynamic/{m}/{c.replace(val_prefix, '')}/has_values_count")
+                    .name.map(lambda c: f"dynamic/{m}/{c.replace(val_prefix, '')}/has_values_count")
                 ),
-                val_col.sum().map_alias(lambda c: f"dynamic/{m}/{c.replace(val_prefix, '')}/sum"),
+                val_col.sum().name.map(lambda c: f"dynamic/{m}/{c.replace(val_prefix, '')}/sum"),
                 (val_col**2)
                 .sum()
-                .map_alias(lambda c: f"dynamic/{m}/{c.replace(val_prefix, '')}/sum_sqd"),
-                val_col.min().map_alias(lambda c: f"dynamic/{m}/{c.replace(val_prefix, '')}/min"),
-                val_col.max().map_alias(lambda c: f"dynamic/{m}/{c.replace(val_prefix, '')}/max"),
+                .name.map(lambda c: f"dynamic/{m}/{c.replace(val_prefix, '')}/sum_sqd"),
+                val_col.min().name.map(lambda c: f"dynamic/{m}/{c.replace(val_prefix, '')}/min"),
+                val_col.max().name.map(lambda c: f"dynamic/{m}/{c.replace(val_prefix, '')}/max"),
             ]
         else:
             column_cols = [m]
-            values_cols = [m]
+            select_cols = [m]
             aggs = [
-                pl.all()
-                .is_not_null()
-                .sum()
-                .cast(count_type)
-                .map_alias(lambda c: f"dynamic/{m}/{c}/count")
+                pl.all().is_not_null().sum().cast(count_type).name.map(lambda c: f"dynamic/{m}/{c}/count")
```
Lambda closure capturing loop variable in dynamic aggregator.
Using `m` inside `.name.map(lambda c: ...)` in a loop can be risky: Python closures capture the loop variable by reference, so every lambda may end up seeing the final value of `m`. Consider defining a small helper function or using a default-argument trick to bind `m` at definition time and avoid the late-binding pitfall.

```diff
-                .name.map(lambda c: f"dynamic/{m}/{c.replace(key_prefix, '')}/count")
+                def make_col_name(prefix_m):
+                    return lambda c: f"dynamic/{prefix_m}/{c.replace(key_prefix, '')}/count"
+                .name.map(make_col_name(m))
```

Also applies to: 1652-1652, 1654-1654, 1657-1657, 1658-1658, 1659-1659, 1666-1666
🧰 Tools
🪛 Ruff (0.8.2)
1648-1648: Function definition does not bind loop variable m
(B023)
1648-1648: Function definition does not bind loop variable key_prefix
(B023)
1652-1652: Function definition does not bind loop variable m
(B023)
1652-1652: Function definition does not bind loop variable val_prefix
(B023)
1654-1654: Function definition does not bind loop variable m
(B023)
1654-1654: Function definition does not bind loop variable val_prefix
(B023)
1657-1657: Function definition does not bind loop variable m
(B023)
1657-1657: Function definition does not bind loop variable val_prefix
(B023)
1658-1658: Function definition does not bind loop variable m
(B023)
1658-1658: Function definition does not bind loop variable val_prefix
(B023)
1659-1659: Function definition does not bind loop variable m
(B023)
1659-1659: Function definition does not bind loop variable val_prefix
(B023)
1666-1666: Function definition does not bind loop variable m
(B023)
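A small standalone illustration of the late-binding pitfall behind these B023 findings and the default-argument fix; it uses plain closures rather than polars:

```python
naive = [lambda c: f"dynamic/{m}/{c}" for m in ("hr", "temp")]
bound = [lambda c, m=m: f"dynamic/{m}/{c}" for m in ("hr", "temp")]

print(naive[0]("count"))  # "dynamic/temp/count" -- both closures see the final value of m
print(bound[0]("count"))  # "dynamic/hr/count"   -- m was captured at definition time
```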
```python
            .alias("event_id")
        )
        .with_columns(
            "event_id",
```
💡 Verification agent
🧩 Analysis chain
Potential no-op in with_columns.
The string literal "event_id" alone doesn’t look like a valid expression. Confirm it’s intended or specify an expression if needed.
🏁 Script executed:

```shell
#!/bin/bash
rg -A 2 '\.with_columns\(\s*"event_id"'
```

Length of output: 41
I'm not seeing enough context from the initial search results. Let's gather more details about how "event_id" is used within the .with_columns() call. Please run the following shell script to print lines 660–680 from the file EventStream/data/dataset_polars.py so we can better understand the surrounding code context:
Once we have that output, we'll be in a better position to verify if "event_id" is intended as-is, or if it should be replaced with a valid expression (e.g., using a column accessor like pl.col("event_id")).
🏁 Script executed:

```shell
#!/bin/bash
sed -n '660,680p' EventStream/data/dataset_polars.py
```

Length of output: 812
Action Required: Update the expression in .with_columns
The current usage:
```python
.with_columns(
    "event_id",
    pl.col("event_type")
    .list.eval(pl.col("").cast(pl.Utf8))
    .list.join("&")
    .cast(pl.Categorical)
    .alias("event_type"),
)
```

appears problematic because the string literal "event_id" does not create a valid expression; it will likely insert a constant string rather than referencing the existing column. If the intention is to keep the existing "event_id" column unchanged, please replace the literal with an expression, for example using `pl.col("event_id")`.
Please update this as follows:
```python
.with_columns(
    pl.col("event_id"),  # Referencing the existing column
    pl.col("event_type")
    .list.eval(pl.col("").cast(pl.Utf8))
    .list.join("&")
    .cast(pl.Categorical)
    .alias("event_type"),
)
```

```python
if events is None:
    logger.warning(f"Empty new events dataframe of type {event_type}!")
```
🛠️ Refactor suggestion
Switched from error to warning for empty events.
Allowing the pipeline to continue if no events exist may be beneficial for robust data handling, but verify that downstream logic gracefully handles empty frames.
Actionable comments posted: 2
🧹 Nitpick comments (15)
EventStream/logger.py (1)
7-10: Consider adding additional logging configuration options
While the current implementation works well, you might want to consider adding more configuration options for the logger, such as:
- Log rotation settings
- Log level configuration
- Formatting options
- Error handling for cases where Hydra config isn't available
This would make the logging functionality more flexible and robust for different usage scenarios.
```diff
 def hydra_loguru_init() -> None:
     """Must be called from a hydra main!"""
     hydra_path = hydra.core.hydra_config.HydraConfig.get().runtime.output_dir
-    log.add(os.path.join(hydra_path, "main.log"))
+    log.add(
+        os.path.join(hydra_path, "main.log"),
+        rotation="10 MB",
+        level="INFO",
+        format="{time} | {level} | {message}",
+    )
```

sample_data/generate_synthetic_data.py (5)
118-119: Fix multiple statements on a single line
There are several instances of multiple statements on a single line using colons, which violates PEP 8 style guidelines.

```diff
-    if running_HR < 30: running_HR = 30
-    if running_HR > 300: running_HR = 300
+    if running_HR < 30:
+        running_HR = 30
+    if running_HR > 300:
+        running_HR = 300
```

🧰 Tools
🪛 Ruff (0.8.2)
118-118: Multiple statements on one line (colon)
(E701)
119-119: Multiple statements on one line (colon)
(E701)
122-123: Fix multiple statements on a single line
Similar to the previous comment, these lines also have multiple statements on a single line.

```diff
-    if running_temp < 95: running_temp = 95
-    if running_temp > 104: running_temp = 104
+    if running_temp < 95:
+        running_temp = 95
+    if running_temp > 104:
+        running_temp = 104
```

🧰 Tools
🪛 Ruff (0.8.2)
122-122: Multiple statements on one line (colon)
(E701)
123-123: Multiple statements on one line (colon)
(E701)
190-190: Remove unused variable assignment
The variable `months` is defined but never used in this function.

```diff
-    months = 30 * days
```

Similar unused variable also appears at line 250.
🧰 Tools
🪛 Ruff (0.8.2)
190-190: Local variable `months` is assigned to but never used
Remove assignment to unused variable `months`
(F841)
193-193: Remove unused variable assignment
The variable `lab_ps` is assigned but never used in this function.

```diff
-    lab_ps = np.random.dirichlet(alpha=[0.1 for _ in range(5)])
```

🧰 Tools
🪛 Ruff (0.8.2)
193-193: Local variable `lab_ps` is assigned to but never used
Remove assignment to unused variable `lab_ps`
(F841)
212-212: Simplify dictionary key lookup
When checking if a key exists in a dictionary, it's more efficient to use `key in dict` instead of `key in dict.keys()`.

```diff
-    for lab in base_lab_gaps.keys():
+    for lab in base_lab_gaps:
```

🧰 Tools
🪛 Ruff (0.8.2)
212-212: Use `key in dict` instead of `key in dict.keys()`
Remove `.keys()`
(SIM118)
docs/MIMIC_IV_tutorial/data_extraction_processing.md (1)
28-33: Updated command examples to use ESGPT_PATH and fix hard tabs.
The command examples have been updated to reference the new ESGPT_PATH variable name. However, the markdown contains hard tabs which should be replaced with spaces for better consistency.

```diff
-PYTHONPATH="$ESGPT_PATH:$PYTHONPATH" python \
-	$ESGPT_PATH/scripts/build_dataset.py \
-	--config-path=$(pwd)/configs \
-	--config-name=dataset \
-	"hydra.searchpath=[$ESGPT_PATH/configs]" [configuration args...]
+PYTHONPATH="$ESGPT_PATH:$PYTHONPATH" python \
+    $ESGPT_PATH/scripts/build_dataset.py \
+    --config-path=$(pwd)/configs \
+    --config-name=dataset \
+    "hydra.searchpath=[$ESGPT_PATH/configs]" [configuration args...]
```

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
29-29: Hard tabs
Column: 1(MD010, no-hard-tabs)
30-30: Hard tabs
Column: 1(MD010, no-hard-tabs)
31-31: Hard tabs
Column: 1(MD010, no-hard-tabs)
32-32: Hard tabs
Column: 1(MD010, no-hard-tabs)
EventStream/data/config.py (1)
939-954: Use logger.warning instead of print.
When the tuning subset seed is auto-assigned, switch from:

```diff
-        print(f"WARNING! tuning_subset_size is set, but tuning_subset_seed is not. Setting to {seed}")
+        logger.warning(f"tuning_subset_size is set, but tuning_subset_seed is not. Setting to {seed}")
```

scripts/convert_to_ESDS.py (3)
4-10: Consider using a more robust error handling approach for stackprinter.
The current implementation silently passes if the stackprinter import fails. While this prevents failing on a missing dev dependency, it might make debugging harder since you lose enhanced error tracing without any notification.

```diff
 try:
     import stackprinter

     stackprinter.set_excepthook(style="darkbg2")
 except ImportError:
-    pass  # no need to fail because of missing dev dependency
+    import warnings
+    warnings.warn("stackprinter not available; using standard exception handling")
```
34-37: Use `isinstance()` instead of `type() is` for type checking.
Using `type() is` for type checking is less idiomatic and more brittle than `isinstance()`, which also handles subclasses appropriately.

```diff
-        if type(self.dataset_dir) is str:
+        if isinstance(self.dataset_dir, str):
             self.dataset_dir = Path(self.dataset_dir)
-        if type(self.ESDS_save_dir) is str:
+        if isinstance(self.ESDS_save_dir, str):
             self.ESDS_save_dir = Path(self.ESDS_save_dir)
```
44-46: Add more details to handle different configuration types.
The conversion from Hydra config to your custom config class is handled, but there's no information about what other config types might be expected or what might cause this conversion to be needed.

```diff
     if type(cfg) is not ConversionConfig:
+        # Hydra might return a generic OmegaConf object, convert it to our specific config class
         cfg = hydra.utils.instantiate(cfg, _convert_="object")
```
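For context, a self-contained sketch of that instantiate-based conversion pattern; the dataclass and values are stand-ins for `ConversionConfig`, and `_convert_="object"` assumes a recent Hydra release that supports it:

```python
import dataclasses

import hydra
from omegaconf import OmegaConf

@dataclasses.dataclass
class DemoConfig:
    dataset_dir: str = "data/"
    ESDS_save_dir: str = "out/"

# An OmegaConf node carrying `_target_` is turned into a concrete dataclass instance.
cfg = OmegaConf.create({"_target_": "__main__.DemoConfig", "dataset_dir": "data/raw"})
obj = hydra.utils.instantiate(cfg, _convert_="object")
assert isinstance(obj, DemoConfig)
```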
421-424: Use more descriptive assertion messages.
While the assertion includes output values, a more descriptive message would make test failures easier to understand.

```diff
         self.assertTrue(
             (got_vals == want_vals).all(),
-            f"want_vals:\n{want_vals.to_pandas()}\ngot_vals:\n{got_vals.to_pandas()}",
+            f"Values mismatch for {task} normalization. \nExpected:\n{want_vals.to_pandas()}\nGot:\n{got_vals.to_pandas()}",
         )
```

tests/test_e2e_runs.py (1)
48-64: Improved test command execution with subtest support.
The `_test_command` method has been enhanced to optionally execute commands within a subtest context, which provides better organization of test results. However, there is code duplication between the subtest and non-subtest cases.

```diff
 def _test_command(self, command_parts: list[str], case_name: str, use_subtest: bool = True):
+    def run_and_check_command():
+        command_out = subprocess.run(" ".join(command_parts), shell=True, capture_output=True)
+        stderr = command_out.stderr.decode()
+        stdout = command_out.stdout.decode()
+        self.assertEqual(
+            command_out.returncode, 0, f"Command errored!\nstderr:\n{stderr}\nstdout:\n{stdout}"
+        )
+        return stdout, stderr
+
     if use_subtest:
         with self.subTest(case_name):
-            command_out = subprocess.run(" ".join(command_parts), shell=True, capture_output=True)
-            stderr = command_out.stderr.decode()
-            stdout = command_out.stdout.decode()
-            self.assertEqual(
-                command_out.returncode, 0, f"Command errored!\nstderr:\n{stderr}\nstdout:\n{stdout}"
-            )
+            return run_and_check_command()
     else:
-        command_out = subprocess.run(" ".join(command_parts), shell=True, capture_output=True)
-        stderr = command_out.stderr.decode()
-        stdout = command_out.stdout.decode()
-        self.assertEqual(
-            command_out.returncode, 0, f"Command errored!\nstderr:\n{stderr}\nstdout:\n{stdout}"
-        )
+        return run_and_check_command()
```

EventStream/data/dataset_polars.py (2)
646-646: Grouping by unique timestamps
Be mindful of performance if numerous timestamps exist; large data might require chunked approaches.
670-670: Concatenating event types
Joining multiple event types with '&' may complicate parsing. Consider a more structured approach or array-like storage.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (7)
- `.github/workflows/tests.yml` is excluded by `!**/*.yml`
- `configs/dataset_base.yaml` is excluded by `!**/*.yaml`
- `configs/normalizer_config/standard_scaler.yaml` is excluded by `!**/*.yaml`
- `configs/outlier_detector_config/stddev_cutoff.yaml` is excluded by `!**/*.yaml`
- `pyproject.toml` is excluded by `!**/*.toml`
- `sample_data/dataset.yaml` is excluded by `!**/*.yaml`
- `sample_data/pretrain_NA.yaml` is excluded by `!**/*.yaml`
📒 Files selected for processing (52)
- `EventStream/baseline/FT_task_baseline.py` (6 hunks)
- `EventStream/data/README.md` (2 hunks)
- `EventStream/data/config.py` (14 hunks)
- `EventStream/data/dataset_base.py` (21 hunks)
- `EventStream/data/dataset_polars.py` (44 hunks)
- `EventStream/data/preprocessing/README.md` (0 hunks)
- `EventStream/data/preprocessing/__init__.py` (0 hunks)
- `EventStream/data/preprocessing/preprocessor.py` (0 hunks)
- `EventStream/data/preprocessing/standard_scaler.py` (0 hunks)
- `EventStream/data/preprocessing/stddev_cutoff.py` (0 hunks)
- `EventStream/data/pytorch_dataset.py` (9 hunks)
- `EventStream/data/time_dependent_functor.py` (3 hunks)
- `EventStream/data/types.py` (1 hunks)
- `EventStream/data/visualize.py` (7 hunks)
- `EventStream/evaluation/MCF_evaluation.py` (0 hunks)
- `EventStream/evaluation/general_generative_evaluation.py` (0 hunks)
- `EventStream/logger.py` (1 hunks)
- `EventStream/tasks/profile.py` (4 hunks)
- `EventStream/transformer/config.py` (5 hunks)
- `EventStream/transformer/lightning_modules/embedding.py` (2 hunks)
- `EventStream/transformer/lightning_modules/fine_tuning.py` (7 hunks)
- `EventStream/transformer/lightning_modules/generative_modeling.py` (6 hunks)
- `EventStream/transformer/lightning_modules/zero_shot_evaluator.py` (3 hunks)
- `EventStream/transformer/model_output.py` (7 hunks)
- `EventStream/transformer/transformer.py` (4 hunks)
- `EventStream/utils.py` (2 hunks)
- `README.md` (1 hunks)
- `configs/README.md` (0 hunks)
- `docs/MIMIC_IV_tutorial/data_extraction_processing.md` (2 hunks)
- `sample_data/build_sample_task_DF.py` (1 hunks)
- `sample_data/generate_synthetic_data.py` (9 hunks)
- `scripts/build_dataset.py` (6 hunks)
- `scripts/convert_to_ESDS.py` (1 hunks)
- `scripts/finetune.py` (1 hunks)
- `scripts/generate_trajectories.py` (0 hunks)
- `scripts/get_embeddings.py` (2 hunks)
- `scripts/launch_finetuning_wandb_hp_sweep.py` (2 hunks)
- `scripts/launch_from_scratch_supervised_wandb_hp_sweep.py` (2 hunks)
- `scripts/launch_pretraining_wandb_hp_sweep.py` (2 hunks)
- `scripts/launch_sklearn_baseline_supervised_wandb_hp_sweep.py` (2 hunks)
- `scripts/prepare_pretrain_subsets.py` (3 hunks)
- `scripts/pretrain.py` (2 hunks)
- `scripts/sklearn_baseline.py` (1 hunks)
- `scripts/zeroshot.py` (2 hunks)
- `setup.py` (1 hunks)
- `tests/data/preprocessing/test_standard_scaler.py` (0 hunks)
- `tests/data/preprocessing/test_stddev_cutoff.py` (0 hunks)
- `tests/data/test_config.py` (8 hunks)
- `tests/data/test_dataset_base.py` (2 hunks)
- `tests/data/test_dataset_polars.py` (0 hunks)
- `tests/data/test_pytorch_dataset.py` (9 hunks)
- `tests/test_e2e_runs.py` (5 hunks)
💤 Files with no reviewable changes (12)
- tests/data/preprocessing/test_stddev_cutoff.py
- configs/README.md
- EventStream/data/preprocessing/README.md
- EventStream/evaluation/general_generative_evaluation.py
- `EventStream/data/preprocessing/__init__.py`
- EventStream/data/preprocessing/standard_scaler.py
- EventStream/data/preprocessing/stddev_cutoff.py
- tests/data/preprocessing/test_standard_scaler.py
- EventStream/data/preprocessing/preprocessor.py
- EventStream/evaluation/MCF_evaluation.py
- tests/data/test_dataset_polars.py
- scripts/generate_trajectories.py
🧰 Additional context used
🧬 Code Definitions (15)
scripts/prepare_pretrain_subsets.py (6)
scripts/get_embeddings.py (1) (1)
main(24-28)scripts/finetune.py (1) (1)
main(25-39)scripts/pretrain.py (1) (1)
main(29-43)scripts/zeroshot.py (1) (1)
main(25-29)scripts/build_dataset.py (1) (1)
main(82-376)scripts/sklearn_baseline.py (1) (1)
main(19-24)
scripts/sklearn_baseline.py (8)
scripts/launch_sklearn_baseline_supervised_wandb_hp_sweep.py (1) (1)
main(81-111)scripts/get_embeddings.py (1) (1)
main(24-28)scripts/finetune.py (1) (1)
main(25-39)scripts/launch_finetuning_wandb_hp_sweep.py (1) (1)
main(77-107)scripts/launch_from_scratch_supervised_wandb_hp_sweep.py (1) (1)
main(81-111)scripts/launch_pretraining_wandb_hp_sweep.py (1) (1)
main(79-109)scripts/pretrain.py (1) (1)
main(29-43)scripts/zeroshot.py (1) (1)
main(25-29)
scripts/finetune.py (3)
EventStream/transformer/lightning_modules/fine_tuning.py (2) (2)
FinetuneConfig(273-442)train(446-586)scripts/get_embeddings.py (1) (1)
main(24-28)scripts/zeroshot.py (1) (1)
main(25-29)
tests/data/test_dataset_base.py (2)
EventStream/data/pytorch_dataset.py (1) (1)
subject_ids(402-403)EventStream/data/dataset_base.py (1) (1)
split(573-653)
EventStream/transformer/model_output.py (1)
EventStream/transformer/utils.py (1) (1)
str_summary(11-30)
EventStream/transformer/lightning_modules/generative_modeling.py (1)
EventStream/transformer/config.py (3) (3)
MetricsConfig(106-199)Split(26-42)MetricCategories(45-61)
EventStream/baseline/FT_task_baseline.py (2)
EventStream/data/dataset_base.py (1) (1)
split(573-653)EventStream/data/config.py (1) (1)
drop(1391-1415)
EventStream/transformer/lightning_modules/fine_tuning.py (1)
EventStream/transformer/config.py (1) (1)
StructuredTransformerConfig(353-935)
EventStream/data/visualize.py (1)
EventStream/data/dataset_base.py (2) (2)
events_df(302-313)events_df(316-317)
scripts/build_dataset.py (1)
EventStream/logger.py (1) (1)
hydra_loguru_init(7-10)
EventStream/transformer/transformer.py (1)
EventStream/data/types.py (1) (1)
batch_size(178-183)
tests/test_e2e_runs.py (1)
tests/data/test_pytorch_dataset.py (1) (1)
tearDown(360-361)
EventStream/data/pytorch_dataset.py (3)
EventStream/utils.py (1) (1)
count_or_proportion(26-88)EventStream/data/config.py (1) (1)
PytorchDatasetConfig(788-1062)EventStream/data/dataset_base.py (1) (1)
split(573-653)
EventStream/data/dataset_polars.py (2)
EventStream/data/config.py (3) (3)
drop(1391-1415)measurement_configs(980-983)vocabulary_config(972-973)EventStream/data/types.py (3) (3)
items(306-308)DataModality(832-868)TemporalityType(808-829)
tests/data/test_pytorch_dataset.py (2)
EventStream/data/config.py (4) (4)
PytorchDatasetConfig(788-1062)to_dict(955-959)to_dict(1538-1554)to_dict(1934-1944)EventStream/data/pytorch_dataset.py (2) (2)
PytorchDataset(61-609)max_seq_len(417-418)
🪛 Ruff (0.8.2)
sample_data/generate_synthetic_data.py
118-118: Multiple statements on one line (colon)
(E701)
119-119: Multiple statements on one line (colon)
(E701)
122-122: Multiple statements on one line (colon)
(E701)
123-123: Multiple statements on one line (colon)
(E701)
190-190: Local variable months is assigned to but never used
Remove assignment to unused variable months
(F841)
193-193: Local variable lab_ps is assigned to but never used
Remove assignment to unused variable lab_ps
(F841)
212-212: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
250-250: Local variable months is assigned to but never used
Remove assignment to unused variable months
(F841)
EventStream/data/dataset_base.py
1488-1488: Loop control variable m not used within loop body
Rename unused m to _m
(B007)
EventStream/data/pytorch_dataset.py
209-209: Loop control variable n_events overrides iterable it iterates
(B020)
EventStream/data/dataset_polars.py
1648-1648: Function definition does not bind loop variable m
(B023)
1648-1648: Function definition does not bind loop variable key_prefix
(B023)
1652-1652: Function definition does not bind loop variable m
(B023)
1652-1652: Function definition does not bind loop variable val_prefix
(B023)
1654-1654: Function definition does not bind loop variable m
(B023)
1654-1654: Function definition does not bind loop variable val_prefix
(B023)
1657-1657: Function definition does not bind loop variable m
(B023)
1657-1657: Function definition does not bind loop variable val_prefix
(B023)
1658-1658: Function definition does not bind loop variable m
(B023)
1658-1658: Function definition does not bind loop variable val_prefix
(B023)
1659-1659: Function definition does not bind loop variable m
(B023)
1659-1659: Function definition does not bind loop variable val_prefix
(B023)
1666-1666: Function definition does not bind loop variable m
(B023)
2013-2013: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
🪛 markdownlint-cli2 (0.17.2)
docs/MIMIC_IV_tutorial/data_extraction_processing.md
29-29: Hard tabs
Column: 1
(MD010, no-hard-tabs)
30-30: Hard tabs
Column: 1
(MD010, no-hard-tabs)
31-31: Hard tabs
Column: 1
(MD010, no-hard-tabs)
32-32: Hard tabs
Column: 1
(MD010, no-hard-tabs)
🔇 Additional comments (253)
EventStream/logger.py (1)
1-10: Good implementation of Loguru integration with HydraThe implementation is clean and effective for integrating the Loguru logger with Hydra's output directory structure.
README.md (1)
34-36: Helpful compatibility note about polars version requirementsThis note provides valuable information about version requirements and potential error symptoms, which will help users troubleshoot installation issues.
sample_data/generate_synthetic_data.py (1)
3-11: Improved docstring formattingThe docstring formatting changes improve readability and make the purpose of the script clearer.
EventStream/data/README.md (1)
393-395: Consistency update from np.NaN to float('nan')Updating the documentation to use
float('nan')instead ofnp.NaNimproves consistency with the code implementation and makes the docs less dependent on NumPy-specific terminology.setup.py (1)
27-27: Script name updated to be more specific.The script entry has been renamed from "launch_wandb_hp_sweep.py" to "launch_pretraining_wandb_hp_sweep.py", which better reflects its specific purpose for pretraining tasks.
EventStream/transformer/lightning_modules/zero_shot_evaluator.py (2)
13-13: Added structured logging with loguru.Adding the loguru logger import to support structured logging, which is more suitable for production environments than print statements.
384-384: Replaced print statement with structured logging.Improved logging by replacing a print statement with logger.info(), enhancing log readability and consistency.
docs/MIMIC_IV_tutorial/data_extraction_processing.md (2)
24-25: Repository reference updated from EFGPT to ESGPT.The environment variable has been renamed from EFGPT_PATH to ESGPT_PATH, maintaining consistency with the repository name change throughout the documentation.
46-47: Updated hydra searchpath reference to ESGPT_PATH.The hydra configuration explanation has been updated to reflect the repository name change from EFGPT to ESGPT.
EventStream/transformer/config.py (11)
14-14: Added structured logging with loguru.Adding the loguru logger import to support structured logging throughout the configuration handling.
575-578: Replaced print with logger.warning for categorical_embedding_dim.Enhanced logging by replacing print warnings with structured logger.warning() calls, improving log readability and consistency.
581-584: Replaced print with logger.warning for numerical_embedding_dim.Enhanced logging by replacing print warnings with structured logger.warning() calls, improving log readability and consistency.
629-634: Replaced print with logger.warning for measurements_per_dep_graph_level.Enhanced logging by replacing print warnings with structured logger.warning() calls, improving log readability and consistency.
636-641: Replaced print with logger.warning for do_full_block_in_seq_attention.Enhanced logging by replacing print warnings with structured logger.warning() calls, improving log readability and consistency.
643-649: Replaced print with logger.warning for do_full_block_in_dep_graph_attention.Enhanced logging by replacing print warnings with structured logger.warning() calls, improving log readability and consistency.
651-654: Replaced print with logger.warning for dep_graph_attention_types.Enhanced logging by replacing print warnings with structured logger.warning() calls, improving log readability and consistency.
656-659: Replaced print with logger.warning for dep_graph_window_size.Enhanced logging by replacing print warnings with structured logger.warning() calls, improving log readability and consistency.
759-765: Replaced print with logger.warning for TTE_lognormal_generation_num_components.Enhanced logging by replacing print warnings with structured logger.warning() calls, improving log readability and consistency.
767-772: Replaced print with logger.warning for mean_log_inter_event_time_min.Enhanced logging by replacing print warnings with structured logger.warning() calls, improving log readability and consistency.
774-779: Replaced print with logger.warning for std_log_inter_event_time_min.Enhanced logging by replacing print warnings with structured logger.warning() calls, improving log readability and consistency.
EventStream/transformer/lightning_modules/embedding.py (3)
9-9: New dependency added properly.Added the loguru logger import, which aligns with the broader initiative to replace print statements with structured logging.
157-159: Good conversion from print to structured logging.Replaced print statement with logger.info for better logging control and visibility when embeddings already exist.
161-161: Good conversion from print to structured logging.Replaced print statement with logger.info for better logging control and visibility when saving embeddings.
EventStream/data/types.py (1)
753-753: Improved truncated representation in DataFrame output.Enhanced the output formatting for dynamic indices, dynamic measurement indices, and dynamic values in the DataFrame representation by using ellipses to indicate truncated content for better readability.
EventStream/utils.py (2)
19-19: New dependency added properly.Added the loguru logger import, which aligns with the broader initiative to replace print statements with structured logging.
383-383: Improved error logging.Replaced print statements with structured logging using logger.error, which consolidates the error message and traceback into a single log entry for better error tracking and visibility.
EventStream/transformer/model_output.py (7)
12-12: New dependency added properly.Added the loguru logger import, which aligns with the broader initiative to replace print statements with structured logging.
434-437: Good conversion from print to structured logging.Replaced print statement with logger.warning for better logging control and visibility when detecting improper measurement access attempts.
461-461: Good conversion from print to structured logging.Replaced print statement with logger.warning for better logging control and visibility when detecting improper measurement generation.
529-534: Enhanced error reporting with better context.Improved error handling by preserving the original exception context with
from eand providing more detailed information about the failure, including shapes and indices. This will make debugging easier.
1027-1031: Enhanced error reporting with better context.Improved error handling by preserving the original exception context with
from eand including shape information in the error message, which will help diagnose concatenation issues more efficiently.
1361-1361: Enhanced error reporting with better context.Improved error handling by preserving the original exception context with
from eand including tensor summary information using thestr_summaryutility, which will make debugging time-to-event calculation failures more straightforward.
1496-1504: Enhanced error reporting with better context.Improved error handling with a more detailed error message that includes vocabulary information, label ranges, and tensor shapes. This will help diagnose classification validation failures more effectively.
EventStream/tasks/profile.py (5)
7-7: API update to match Polars library changesThe
pl.enable_string_cache()call has been updated to remove the explicitTrueparameter, which is now the default behavior in newer versions of Polars.
85-85: Updated datatype specification for the start_time columnThe datatype for
start_timehas been changed from a previous type tonull, which better reflects that this column can contain null values as shown in the example data.
98-98: Updated datatype specification consistent with previous changesSimilar to the previous change, the datatype for
start_timein the "foobar/bar" example has been updated tonullto consistently handle null values.
154-154: API update: groupby → group_by
Updated method name from `groupby` to `group_by` to follow Polars' API changes and align with Python's snake_case naming convention.

156-156: API update: pl.count() → pl.len()
Changed the aggregation function from `pl.count()` to `pl.len()` to align with Polars' API changes and to use a more appropriate function for counting samples per subject.
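A small sketch of the renamed APIs on made-up subject data, not the repository's actual query:

```python
import polars as pl

df = pl.DataFrame({"subject_id": [1, 1, 2, 2, 2], "event": ["a", "b", "a", "c", "d"]})
# group_by replaces groupby; pl.len() replaces pl.count() for row counts per group.
per_subject = df.group_by("subject_id").agg(pl.len().alias("n_samples"))
print(per_subject.sort("subject_id"))
```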
15-15: Added structured logging with loguruAdded import for the
logurulogger to replace basic print statements with structured logging, improving error traceability and log management.
283-286: Enhanced error reporting with structured loggingReplaced print statement with
logger.error()for better error visibility and traceability when metric computation fails. This provides clearer context about the error and standardizes logging throughout the codebase.
523-528: Updated metrics configuration for validationModified
final_validation_metrics_configto explicitly include loss part metrics for both tuning and held-out splits. This ensures consistent metric tracking across validation phases instead of skipping all metrics by default.The change aligns with the
MetricsConfigclass behavior defined inEventStream/transformer/config.py, where setting specific include_metrics ensures only those metrics are tracked.
599-599: Standardized information loggingReplaced print statements with
logger.info()calls to standardize logging throughout the codebase, making debug information more consistent and easier to filter.Also applies to: 604-604, 627-627, 708-708
EventStream/data/visualize.py (7)
164-168: API update: group_by_dynamic replaces groupby_dynamicUpdated method name to follow Polars' API changes and Python's snake_case naming convention. This is part of a broader standardization of method naming in the Polars library.
182-182: API update: groupby → group_byChanged method call from
groupbytogroup_byto maintain consistency with Polars' updated API.
256-256: API update: groupby → group_by in plot_static_variables_breakdownUpdated method call in the static variables breakdown function to maintain consistency with Polars' API changes.
283-284: API update: groupby → group_by in plot_counts_over_ageConsistent update to the group_by method in the age-based plotting functionality.
300-300: API update: groupby → group_by in aggregation logicAnother instance of updating to the new method name within the counts_at_age aggregation logic.
341-341: API update: groupby → group_by in plot_events_per_patientConsistent method name update in the events per patient plotting logic.
356-356: API update: groupby → group_by in plot methodFinal instance of updating the method name in the plot function to maintain consistency throughout the file.
EventStream/transformer/lightning_modules/fine_tuning.py (7)
17-17: Added structured logging with loguruAdded import for the
logurulogger to replace basic print statements with structured logging throughout the file.
187-190: Enhanced error reporting with structured loggingReplaced print statement with
logger.error()for better error visibility and traceability when metric computation fails. This provides more context about the error and fits with the broader standardization of logging.
400-403: Improved warning messages with structured loggingUpdated warning message about train_subset_seed using
logger.warning()instead of print, providing better visibility for important configuration warnings.
415-418: Enhanced warning visibility for task_df_name conflictsReplaced print with
logger.warning()to highlight potential conflicts in task_df_name settings, making it easier to identify configuration issues.
420-422: Better logging for configuration overridesImproved logging of configuration parameter overrides with
logger.info(), providing clearer tracking of what's being changed during initialization.
470-470: Standardized information logging for file operationsReplaced print statements with
logger.info()calls for consistent logging of file operations and configuration saving, improving traceability.Also applies to: 475-475, 579-579
492-492: Consistent logging style in commented codeUpdated the commented-out print statement about model compilation to use
logger.info(), maintaining consistency with the rest of the codebase if this code is ever uncommented.EventStream/data/time_dependent_functor.py (5)
147-149: Ensure correctness of nanosecond-based time conversion.Switching from
dt.nanoseconds()todt.total_nanoseconds()offers finer granularity and improved precision in calculating age. This appears correct, but consider verifying any downstream code that might rely on the older approach to ensure consistency.
210-212: Validate default mean and standard deviation.Defaulting to mean=0 and std=1 is typically safe, but verify that downstream logic can handle these defaults without unexpected outcomes (e.g., skipping or incorrectly normalizing).
213-221: Graceful handling of threshold boundaries.Assigning
float("inf")andfloat("-inf")is a sensible default when thresholds are absent. Confirm that the rest of the pipeline (e.g., outlier detection and any numeric stability checks) properly accommodates these infinite bounds.
613-631: Random initialization of sine/cosine parameters.
Defining `sin_div_term` and `cos_div_term` via `torch.empty(...)` and applying normal initialization allows flexible learnable frequencies. This approach is coherent with the described architecture.

678-694: Maintain clarity with separate sin/cos div terms.
Using identical expressions for `sin_div_term` and `cos_div_term` and separating them for even and odd embedding dims is consistent. This helps ensure correct shape handling for odd dimensions.
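A hedged sketch of the learnable-frequency idea; shapes, init scale, and the module name are illustrative, not copied from the project:

```python
import math

import torch
import torch.nn as nn

class LearnableSinusoidalEmbedding(nn.Module):
    def __init__(self, embedding_dim: int):
        super().__init__()
        n_sin = math.ceil(embedding_dim / 2)   # even output slots
        n_cos = embedding_dim // 2             # odd output slots (one fewer if dim is odd)
        self.sin_div_term = nn.Parameter(torch.empty(n_sin))
        self.cos_div_term = nn.Parameter(torch.empty(n_cos))
        nn.init.normal_(self.sin_div_term)
        nn.init.normal_(self.cos_div_term)

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (batch, seq_len) continuous timestamps -> (batch, seq_len, embedding_dim)
        sin_part = torch.sin(t.unsqueeze(-1) * self.sin_div_term)
        cos_part = torch.cos(t.unsqueeze(-1) * self.cos_div_term)
        dim = self.sin_div_term.numel() + self.cos_div_term.numel()
        out = torch.zeros(*t.shape, dim, device=t.device)
        out[..., 0::2] = sin_part
        out[..., 1::2] = cos_part
        return out
```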
122-122: Prefer naming consistency for dropout probability.Renaming
self.attn_dropouttoself.attn_dropout_pclarifies that it stores a probability rather than a Dropout layer. This naming improvement is welcome.
179-180: Document the unsupported head_mask parameter clearly.The docstring now warns that
head_maskis not supported, aligning with the raised exception below. Good job ensuring the documentation reflects actual functionality.
185-187: Explicitly raise error for unsupported head masking.Raising an error clarifies that the feature is unavailable. This is helpful for quickly catching invalid usage.
188-189: Batch size extraction is straightforward.Storing
batch_size = query.shape[0]clarifies downstream usage. Ensure that inputs always follow the (batch, head, seq_len, head_dim) pattern as expected.
190-192: Mask value extraction.Using
torch.finfo(value.dtype).minthen creating a singleton tensor clarifies that it is a scalar sentinel. This is a neat approach to avoid repeating finfo calls.
195-197: Casting query and key to match value’s dtype.Ensuring all attention inputs share the same dtype helps avoid subtle numerical issues. Good consistency fix.
199-203: Reject single-batch usage with certain padding.The logic here properly prevents a known edge case with
padding='max_length'in single-batch scenarios. This explicit check is beneficial for user clarity.
204-206: Conditional dropout usage.Applying dropout only in training mode is a standard approach. This ensures reproducibility and stable attention in evaluation mode.
209-217: Use scaled_dot_product_attention for minimal seq length.
Conditionally implementing `is_causal=True` or `False` is a neat trick for short sequences. Double-check that masks remain consistent in each scenario if you reintroduce attention masks here.
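A standalone sketch of that branch, assuming the usual (batch, heads, seq_len, head_dim) layout; shapes and the wrapper function are illustrative:

```python
import torch
import torch.nn.functional as F

def sdpa_branch(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, dropout_p: float = 0.0):
    # For a sequence of length 1 there is nothing prior to attend to, so causal masking can be dropped.
    seq_len = q.shape[-2]
    return F.scaled_dot_product_attention(
        q, k, v, attn_mask=None, dropout_p=dropout_p, is_causal=(seq_len > 1)
    )

q = k = v = torch.randn(2, 4, 1, 8)
out_single = sdpa_branch(q, k, v)  # seq_len == 1 -> is_causal=False
out_multi = sdpa_branch(torch.randn(2, 4, 6, 8), torch.randn(2, 4, 6, 8), torch.randn(2, 4, 6, 8))
```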
218-233: Construct and append causal mask properly.Your approach to slicing
self.biasensures an accurate local or causal context. Combining it withattention_maskby summation is a succinct way to unify multiple masks.
234-236: Single point of scaled dot-product call highlights clarity.Routing all logic into one final
scaled_dot_product_attentioncall for this branch is simpler than multiple calls. This is an effective consolidation.
242-242: Return no explicit attention weights.Returning
Nonefor attention weights indicates they’re not tracked. This is consistent with your docstring about not returning them in this path.EventStream/data/pytorch_dataset.py (66)
8-19: Enhanced logging and imports.Switching to
logurufor logging and importingSeedableMixinimproves maintainability and consistency across the codebase.
39-54: Informative code example.The updated docstring example clearly illustrates how
to_int_indexworks. It's helpful and consistent with polars usage.
57-57: Use ofsearch_sortedfor indexing.This approach finds the correct insertion indices for unique values. Confirm that lexicographic order is always the intended default for your application.
62-63: Inheritance fromSeedableMixin.Adding
SeedableMixintoPytorchDatasetfosters reproducible sampling in_seeded_getitem. This is beneficial for debugging and consistent data splits.
102-102: Extended type checking with polars Categorical.Including polars
Categorical(ordering="physical")or(ordering="lexical")inTYPE_CHECKERSensures consistent usage ofto_int_indexfor multi-class classification. Nicely done.
133-133: Ensure proper superclass initialization.Calling
super().__init__()from within__init__ensures theSeedableMixinlogic is set up. Good practice.
136-137: Expose split as instance attribute.Storing
splitallows referencing dataset segments clearly. This is straightforward and helpful during debugging.
138-139: Logging vocabulary loading.These lines add transparency about which step is running. Such logs help track dataset preparation progress.
140-141: User-friendly logging for split and shards.Communicating “Reading splits & patient shards” helps track where the code is in pipeline. Good practice.
142-142: Explicit descriptor reading step.Logging the next major step clarifies the data loading flow.
145-149: Filter subjects by minimum sequence length.Emitting a log line and then calling
filter_to_min_seq_lenclarifies the process. Ensure it’s not repeated inadvertently in multi-phase pipelines.
150-154: Optionally apply subset filtering for training.Conditionally filtering the training split with a user-defined subset size is a nice feature. The logs further clarify when random sampling is triggered.
156-156: Define property for static directory.This property-based approach organizes path references.
158-159: Clean direct path reference to “DL_reps”.This ensures the code references the correct directory consistently.
164-164: Expose task_dir for clarity.Similarly, referencing
task_dfsvia property clarifies the structure.
167-168: Property for “NRTs_dir”.This aligns with the other path-based properties. Good consistent pattern.
169-172: Vocabulary reading method separated.This dedicated function improves clarity and single responsibility.
173-174: Dedicated function for reading shards.Separating shard reading from the constructor fosters modular design.
175-178: Load shard mapping from JSON.Reading shard data once avoids repeated overhead. Good approach for large datasets.
179-182:measurement_configsproperty.This property neatly surfaces measurement configuration definitions.
189-189: Progress bar for static shards.Using tqdm for user feedback is helpful when reading many shards.
192-203: Selective column loading from parquet.Reading minimal columns reduces memory overhead. This is an efficient approach for large parquet data.
206-206: Store shard DataFrame references.Retaining them in
self.static_dfshelps unify subsequent lookups.
214-214: Track sequence bounds for each subject.Storing
(0, n_events)is straightforward for subsequent slicing logic.
215-215: Handle dataset with/without tasks consistently.Conditionally setting tasks, tasks types, and so forth keeps logic straightforward when
task_df_nameis absent.
217-221: Default fields when no tasks.Setting
index,labels,tasks, etc. to default ensures a uniform interface.
223-223: Generate path to task parquet.Combining
self.task_dirandconfig.task_df_nameis consistent with the property-based approach.
224-242: Reading and validating task configurations.Merging with existing static data, verifying columns, and checking for consistent
task_infois thorough. This ensures correctness and alignment of tasks with the actual data.
322-343: Filtering to min sequence length.Applying the user-defined minimum, logging the change, and removing short sequences is essential for data quality. The caution about comparability addresses potential differences across runs.
344-364: Random subset filtering for training.Extracting a subset of subjects with
np.random.default_rngis a clean approach. The logs highlight changes in row and subject counts, which is quite helpful.
365-365: Collecting inter-event time stats.
set_inter_event_time_statsorganizes the pipeline.
366-366: Concatenate shards before computing stats.Combining all data into one lazy frame is memory-friendly and consistent.
369-371: Select relevant column for inter-event time.Dropping nulls and nans ensures robust log-based computations.
389-389: Write malformed records for debugging.Logging them to parquet is a good fallback for investigating data anomalies.
395-395: Warn about non-positive inter-event times.Aggregating these warnings helps debug data issues promptly.
396-396: Remove malformed subjects.This ensures no zero or negative inter-event times remain, preserving model consistency.
401-401: Property-based approach for subject IDs.Returning a list of
[x[0] for x in self.index]is straightforward.
402-403: Getter method for subject_ids.Exposing it as a property to unify usage throughout the dataset.
405-407: Length matches the index size.Returning
len(self.index)is consistent with typical dataset patterns.
409-410: Check presence of tasks.
has_taskclarifies if the dataset is labeled or not.
413-414: Expose padding side.Just returning the configuration’s
seq_padding_side.
416-416: Expose maximum sequence length from config.Encapsulating
max_seq_lenhelps avoid referencing config in user code.
420-423: Indicate if it’s a subset dataset.Checks if
train_subset_size != "FULL". This is a simple, clear property.
425-425: Docstring clarifies output structure of items.Transparently describing the shape and type is beneficial for debugging.
445-446: Additional clarifications on placeholder for numeric data.Documenting
float('nan')usage for missing numeric elements helps future maintainers.
456-457: Use seed context manager in_seeded_getitem.Properly controlling randomness ensures reproducibility across fetch operations.
459-459: Clarify purpose for_seeded_getitem.Short doc clarifies it’s essentially the random-sampling-aware version of
__getitem__.
462-466: Retrieve subject setup for single index.This chunk extracts the correct shard, subject index, and static row for the requested item.
469-471: Initialize dictionary with static data arrays.This standard representation matches the docstring structure.
473-474: Conditionally include subject_id.Exposing user IDs can be critical for debugging, so having this optional is good.
476-492: Implement subsequence sampling.The random, end-based, or from-start approach covers typical scenarios. It's nice that it logs changes above.
497-499: Load dynamic slice fromJointNestedRaggedTensorDict.Efficiently reading partial data helps with large sequences.
502-505: Conditionally include start_time in minutes.Summing partial
time_deltaensures the correct offset. Good detail for time-based tasks.
507-508: Attach label(s) for tasks.Populating
out[t]fromself.labelsis straightforward for each requested task.
511-511: Dynamic-only collate function extracted.Separating dynamic collation from overall logic improves maintainability.
512-517: Fetch dense keys separately.Distinguishing “dynamic” from normal keys organizes the merging process.
521-523: VStack the dynamic ragged data.Leveraging
vstacksimplifies merging variable-length data.
524-526: Separate event and values masks.Renaming
dim1/maskanddim2/maskis more explicit, clarifying which dimension is masked.
527-537: Carefully map numpy dtypes to torch.Casting integer and float data is handled robustly. The TypeError ensures unknown dtypes are promptly flagged.
538-539: Merge dynamic and dense dictionaries.Combining them into
collatedfor straightforward post-processing is neat.
540-547: Populate the final batch dictionary.This ensures consistent naming and shape for time deltas, dynamic indices, etc.
549-555: Optional fields for start_time and subsequence bounds.Conditional logic avoids extraneous data when not needed, saving memory.
559-559: Early return if no tasks.Skipping label prep when no tasks exist saves overhead.
563-572: Collect and cast labels by task type.Handling multi-class, binary, and regression tasks distinctly ensures correct numeric transformations.
589-590: Collate dynamic data first, then handle static.This separation clarifies the two major forms of data.
591-610: Efficiently pad static fields.
Looping once with `torch.nn.functional.pad` ensures consistent shapes. Catting them is typical for batching.
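An illustrative padding sketch in that spirit; the tensors are made up, not the project's static fields:

```python
import torch
import torch.nn.functional as F

# Variable-length per-subject static vectors are padded to a common width before stacking.
static_indices = [torch.tensor([3, 7]), torch.tensor([1]), torch.tensor([2, 5, 9])]
max_len = max(t.shape[0] for t in static_indices)

padded = torch.stack(
    [F.pad(t, (0, max_len - t.shape[0]), value=0) for t in static_indices]
)
print(padded.shape)  # torch.Size([3, 3])
```

EventStream/data/config.py (11)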
7-8: Imports for hashing & JSON handling look fine.
These imports are needed for computing and storing configuration hashes, and seem appropriate.
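A minimal sketch of how these imports can be combined into a deterministic parameter hash; the key names are placeholders, not the project's actual parameters:

```python
import hashlib
import json

params = {"max_seq_len": 256, "subsequence_sampling_strategy": "random", "train_subset_size": "FULL"}
# Sorting keys makes the serialized form, and therefore the hash, stable across runs.
params_str = json.dumps(params, sort_keys=True)
params_hash = hashlib.sha256(params_str.encode()).hexdigest()
print(params_hash[:16])
```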
19-19: Introduce Loguru logger import.
This shift to structured logging is consistent with the broader logging refactor.
809-812: Docstring refinements for tuning subset size.
These lines clarify how the tuner subset is managed. The description is coherent and matches the pattern used for train subsets.
859-862: Demonstration snippet for Loguru in doctest.
This snippet showcases logger usage in a docstring example. Looks fine as an illustrative test block.
886-887: New parameters for tuning subset.
Introducingtuning_subset_sizeandtuning_subset_seedaligns well with the existing train subset logic and improves customization.
896-898: New configuration comment and field.
Declaringcache_for_epochswith a default of 1 clarifies caching behavior. The inline comment about speed/disk/mem trade-offs is helpful.
929-931: Logging seed assignment for training subset.
Using `logger.warning` instead of print makes the auto-assigned seed visible in the run logs for later reference.
933-933: Clearing seed when subset size is “FULL” or None.
This logic is consistent with the train subset usage pattern and ensures predictable behavior.
1869-1870: Added new config fields for outlier detection and normalization.
These fields enable flexible outlier handling and toggling of numerical standardization. Well-structured addition.
1920-1923: Validation of outlier detector configuration.
Ensuringoutlier_detector_configis eitherNoneor a dictionary helps maintain consistent usage of the new outlier detection feature.
900-908: ⚠️ Potential issue
Likely extraneous closing parenthesis causes syntax error.
There appears to be an unmatched closing parenthesis at line 907, which would cause a syntax error. Consider removing the extra parenthesis and ensuring proper indentation, for example:

```diff
 902         if self.subsequence_sampling_strategy != "random" and self.cache_for_epochs > 1:
 903             raise ValueError(
 904                 f"It does not make sense to cache for {self.cache_for_epochs} with non-random "
 905                 "subsequence sampling."
 906             )
-907         )
```

Likely an incorrect or invalid review comment.
scripts/pretrain.py (2)
19-19: Add import for Hydra-Loguru initialization.
This import is part of the ongoing transition to Loguru logging.
30-30: Initialize Loguru withinmain.
Callinghydra_loguru_init()at the start ensures all subsequent logging is captured consistently.scripts/launch_sklearn_baseline_supervised_wandb_hp_sweep.py (2)
18-19: Add Loguru import.
Importinghydra_loguru_initkeeps logging consistent across scripts.
82-82: Hydra-Loguru initialization at script start.
Good practice to set up logging prior to reading config or performing logic.scripts/get_embeddings.py (1)
14-14: Adding structured logging with Loguru - good improvement!The addition of
hydra_loguru_init()provides better logging capabilities compared to standard print statements. This change aligns with similar updates across other scripts in the codebase, establishing a consistent logging pattern.Also applies to: 25-25
sample_data/build_sample_task_DF.py (1)
29-30: Method update for Polars API consistencyThe change from
.groupby()to.group_by()correctly aligns with the Polars DataFrame API (imported asplon line 2). This ensures the code uses the proper method naming conventions for the Polars library rather than pandas-style methods.scripts/sklearn_baseline.py (1)
15-15: Adding structured logging with Loguru - good improvement!The addition of
hydra_loguru_init()enhances the script's logging capabilities. This change is part of a consistent pattern implemented across multiple scripts in this PR, establishing a standardized approach to logging throughout the codebase.Also applies to: 20-20
scripts/launch_finetuning_wandb_hp_sweep.py (1)
18-19: Adding structured logging with Loguru - good improvement!The addition of
hydra_loguru_init()enhances the script's logging capabilities and follows the same pattern implemented across other scripts in this PR. This creates a consistent logging approach throughout the codebase, which will improve debugging and monitoring capabilities.Also applies to: 78-78
scripts/zeroshot.py (2)
15-15: Good addition of structured logging import.This import adds the logging initialization function that will be used to configure Loguru, which is a structured logging library that offers better logging capabilities compared to simple print statements.
26-26: Good logging initialization placement.The
hydra_loguru_init()call is appropriately placed at the beginning of the main function, ensuring that logging is configured before any other operations occur. This follows the best practice of initializing logging early in the execution flow.scripts/launch_pretraining_wandb_hp_sweep.py (2)
18-19: Good addition of structured logging import.Adding the
hydra_loguru_initimport fromEventStream.loggerstandardizes logging initialization across the codebase. The blank line also helps maintain code readability.
80-80: Good logging initialization placement.The
hydra_loguru_init()call is appropriately placed at the beginning of the main function, ensuring that logging is configured before any other operations occur, including the configuration handling and WandB sweep setup.scripts/finetune.py (2)
18-18: Good addition of structured logging import.
This import adds the logging initialization function that will be used to configure Loguru for structured logging, aligning with the codebase-wide standardization of logging.
26-26: Good logging initialization placement.
The `hydra_loguru_init()` call is correctly placed at the beginning of the main function, before handling configurations and training. This ensures consistent logging behavior throughout the execution flow.
scripts/launch_from_scratch_supervised_wandb_hp_sweep.py (2)
18-19: Good addition of structured logging import.
Adding the `hydra_loguru_init` import standardizes logging initialization across the codebase, consistent with the changes in other scripts. The blank line maintains good code formatting.
82-82: Good logging initialization placement.
The `hydra_loguru_init()` call is appropriately placed at the beginning of the main function, ensuring logging is configured before any configuration processing or WandB sweep operations.
scripts/prepare_pretrain_subsets.py (5)
23-24: Added Loguru for structured logging
This import adds the Loguru library for structured logging, which is a more powerful alternative to the standard print statements previously used. This provides better log management, filtering, and formatting capabilities.
27-28: Added `hydra_loguru_init` import
This import brings in the logging initialization function that will configure Loguru to work with Hydra's output directory structure, ensuring logs are properly captured in the experiment directories.
32-33: Added logging initialization
The `hydra_loguru_init()` call initializes the logging system at the beginning of the main function, ensuring all subsequent logs are properly formatted and saved to Hydra's output directory.
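For context, a Hydra-aware Loguru initializer of this kind usually just attaches a file sink inside the active run directory. A minimal sketch, assuming it is called from within a `@hydra.main`-decorated function (the real `EventStream.logger.hydra_loguru_init` may differ):

```python
# Hedged sketch only -- the actual EventStream.logger.hydra_loguru_init may differ.
from pathlib import Path

from hydra.core.hydra_config import HydraConfig
from loguru import logger


def hydra_loguru_init() -> None:
    """Attach a Loguru file sink inside the current Hydra run directory."""
    run_dir = Path(HydraConfig.get().runtime.output_dir)  # requires an active Hydra job
    logger.add(run_dir / "main.log", level="INFO")
```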
63-63: Replaced print with structured logging
Properly migrated a print statement to Loguru's structured logging system. This provides consistency with other scripts like `finetune.py` and `pretrain.py` that also use this pattern.
255-255: Replaced print with structured logging
Converted a print statement about command file writing to a structured log message, maintaining the same informational content while benefiting from Loguru's features like log levels and formatting.
EventStream/baseline/FT_task_baseline.py (10)
18-19: Added Loguru for structured logging
This import adds the Loguru library for structured logging, consistent with the pattern applied across the codebase.
39-39: Updated Polars string cache enabling
The parameter `True` has been removed from `pl.enable_string_cache()`. This reflects an API change in Polars: the function no longer takes a boolean argument and simply enables the global string cache, so passing `True` explicitly is unnecessary (and deprecated in newer releases).
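A small before/after sketch of the call (generic, not the surrounding file's code):

```python
import polars as pl

# Older Polars accepted a boolean: pl.enable_string_cache(True)
# Newer Polars takes no argument:
pl.enable_string_cache()

# A scoped alternative is the StringCache context manager:
with pl.StringCache():
    s1 = pl.Series(["a", "b"], dtype=pl.Categorical)
    s2 = pl.Series(["b", "c"], dtype=pl.Categorical)
```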
191-191: Added explicit column selection after caching
This line adds explicit column selection for `subject_id`, `timestamp`, and any window features when loading data from the cached parquet file. This ensures consistency in the column set and potentially improves performance by only loading needed columns.
654-654: Converted print to structured logging
Replaced print statement with Loguru's structured logging for saving configuration information.
711-712: Converted print to structured logging
Replaced print statement with Loguru's structured logging for dataset loading information.
716-718: Converted print to structured logging
Replaced print statement with Loguru's structured logging for dataset loading completion information, maintaining the same format string with elapsed time calculation.
721-721: Converted print to structured logging
Replaced print statement with Loguru's structured logging for model initialization information.
724-724: Converted print to structured logging
Replaced print statement with Loguru's structured logging for model fitting information.
726-726: Converted print to structured logging
Replaced print statement with Loguru's structured logging for model saving information.
730-730: Converted print to structured logging
Replaced print statement with Loguru's structured logging for model evaluation information.
scripts/build_dataset.py (6)
17-18: Added Loguru for structured logging
Added import for Loguru's logger module to enable structured logging throughout the script.
34-34: Added `hydra_loguru_init` import
Imported the logging initialization function to configure Loguru with Hydra's output directory.
54-57: Updated docstring example to use Logger
The docstring example for `add_to_container` has been updated to demonstrate how to use Loguru for logging in tests, including setup and configuration steps.
74-74: Converted warning to structured logging
Replaced a print warning with a proper Loguru warning call. This improves logging consistency and enables better log filtering and management.
82-84: Added logging initialization
Added a `hydra_loguru_init()` call at the beginning of the main function, with an empty line after it, to initialize the logging system, consistent with other scripts in the codebase.
365-365: Converted print to structured logging
Replaced the print statement about omitted configuration parameters with a Loguru structured logging call.
tests/data/test_dataset_base.py (2)
259-277: Added test for dataset splitting with mandatory IDs
A new test method `test_split_mandatory_ids` has been added to verify the functionality of the `split` method when mandatory subject IDs are specified for certain splits. The test confirms that:
- The resulting split contains all expected keys
- All subject IDs are included in at least one split
- The mandatory IDs are correctly assigned to their respective splits
- The remaining IDs are properly distributed among the random splits
This test improves coverage for the dataset splitting functionality, particularly for the case where some IDs must be assigned to specific splits.
494-497: Updated measurement metadata structure
The structure of the empty measurement metadata DataFrame has been updated to reflect changes in the `MeasurementConfig` class. The previous columns "outlier_model" and "normalizer" have been replaced with new columns "mean", "std", "thresh_small", and "thresh_large", all initialized with float data type.
This change indicates a shift in how measurement values are transformed and normalized in the system, moving from a model-based approach to a statistical thresholding approach.
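To make the shift concrete, here is a hedged sketch of how such mean/std/threshold columns could drive outlier filtering and normalization (the project's actual transform logic may differ; the values and names are illustrative):

```python
# Hypothetical use of mean/std/thresh_small/thresh_large metadata:
# values outside the thresholds are nulled, the rest are centered and scaled.
import polars as pl

metadata = {"mean": 80.0, "std": 12.0, "thresh_small": 30.0, "thresh_large": 300.0}

values = pl.DataFrame({"value": [25.0, 72.0, 95.0, 410.0]})
cleaned = values.with_columns(
    pl.when(
        (pl.col("value") < metadata["thresh_small"])
        | (pl.col("value") > metadata["thresh_large"])
    )
    .then(None)
    .otherwise((pl.col("value") - metadata["mean"]) / metadata["std"])
    .alias("value_normalized")
)
print(cleaned)
```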
scripts/convert_to_ESDS.py (3)
26-38: LGTM: Well-defined configuration class.
The `ConversionConfig` class properly uses the `hydra_dataclass` decorator and includes appropriate attributes with type hints. The `__post_init__` method correctly handles string path conversion.
47-55: Well-implemented check for existing files.
The code properly checks for existing files in the output directory and raises a helpful error message when overwrite is not enabled. It also handles the overwrite case well by removing the directory.
61-73: Robust chunk-based processing approach.
The implementation properly handles large datasets by splitting subjects into chunks and processing each chunk separately, which is important for memory efficiency. The use of `tqdm` for progress tracking is also helpful.
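A generic sketch of this chunking pattern (names and sizes are illustrative, not the script's exact code):

```python
# Split subject IDs into fixed-size chunks and convert each chunk separately,
# so memory usage stays bounded regardless of cohort size.
from tqdm import tqdm

subject_ids = list(range(100))  # hypothetical subject IDs
chunk_size = 25

chunks = [
    subject_ids[i : i + chunk_size] for i in range(0, len(subject_ids), chunk_size)
]
for chunk in tqdm(chunks, desc="Converting subject chunks"):
    # In the real script, this chunk's events would be converted and written
    # to its own output file here.
    pass
```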
tests/data/test_pytorch_dataset.py (7)
56-58: Good addition of explicit time delta calculations.
The addition of time delta calculations for events is valuable for testing time-based features of the system. The approach of calculating deltas and adding a NaN at the end follows a common pattern for time series data.
66-68: Consistent implementation across all event time delta calculations.
The implementation of delta calculations is consistent across all three subjects, which ensures uniform test data behavior.
Also applies to: 77-79
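For reference, the delta-then-NaN pattern used here looks roughly like this (a generic NumPy sketch with made-up event times, not the test's actual data):

```python
import numpy as np

# Event times for one subject, in minutes (hypothetical values).
event_times_min = np.array([0.0, 30.0, 90.0, 180.0])

# Delta to the *next* event, with NaN padding the final event.
time_deltas = np.append(np.diff(event_times_min), np.nan)
print(time_deltas)  # [ 30.  60.  90.  nan]
```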
85-86: Ensure empty lists are handled consistently in the DataFrame.
The change to use empty lists instead of None for subjects with no events is appropriate, but ensure that all processing code handles empty lists consistently.
143-144: Replaced null arrays with empty lists for consistency.
The change from `None` to empty lists for subjects with no dynamic indices, measurement indices, and values ensures consistent handling of empty data throughout the test dataset.
Also applies to: 178-179, 185-186
315-345: Comprehensive test setup with proper resource management.
The new `setUp` method creates a well-structured test environment with all necessary files and directories. The test uses temporary directories properly and sets up realistic test data including NRT (Nested Ragged Tensor) representations.
However, consider breaking this large setup method into smaller, focused helper methods for better maintainability:

```python
def setUp(self):
    self._create_temp_directories()
    self._setup_shards()
    self._setup_dl_representation()
    self._setup_nrt_representation()
    self._setup_task_data()
    self._setup_configs()

def _create_temp_directories(self):
    self.dir_obj = TemporaryDirectory()
    self.path = Path(self.dir_obj.name)
    self.split = "fake_split"

# Additional helper methods...
```
360-361: Good resource cleanup in tearDown.
The `tearDown` method properly cleans up the temporary directory, which is important to prevent resource leaks in tests.
363-374: Simplified method signature improves maintainability.
The simplified `get_pyd` method makes the tests more maintainable by focusing on essential parameters and using default values and kwargs for the rest.
tests/data/test_config.py (5)
185-190: Updated measurement metadata structure with new statistical fields.
The test has been updated to reflect changes in the metadata structure, replacing `outlier_model` and `normalizer` with new statistical fields (`mean`, `std`, `thresh_small`, and `thresh_large`). This aligns with changes to how measurement metadata is defined across the codebase.
204-206: Updated measurement metadata field order.
The field order in the measurement metadata has been changed to accommodate the new statistical fields. This is consistent with the changes to the `MeasurementConfig` class.
341-343: Updated metadata fields for univariate regression.
The test for the univariate regression case has been updated with the new metadata fields, ensuring consistency across test cases.
454-455: Simplified outlier detector configuration.
The outlier detector configuration now allows an empty dictionary, which is a simplification from the previous implementation. This suggests a change in how outlier detection is configured in the system.
477-478: Enhanced validation for outlier detector configuration.
The validation now checks that the outlier detector configuration is a dictionary, not a string, which improves type checking and prevents potential type-related issues.
tests/test_e2e_runs.py (6)
33-42: Added ESDS directory to test setup.
The addition of the "esds" directory to the test environment setup is necessary to support the new ESDS dataset building functionality. This aligns with the addition of the new script `convert_to_ESDS.py`.
73-74: Non-subtest approach for build_dataset is appropriate.
Using `use_subtest=False` for the dataset building command is appropriate since this is a prerequisite for other tests rather than a separate test case itself.
75-83: Well-implemented ESDS dataset building method.
The `build_ESDS_dataset` method properly constructs the command to convert the dataset to the ESDS format, including appropriate parameters. It correctly uses the new `convert_to_ESDS.py` script.
Consider parameterizing the method for more flexibility:

```diff
-def build_ESDS_dataset(self):
+def build_ESDS_dataset(self, chunk_size=25):
     command_parts = [
         "./scripts/convert_to_ESDS.py",
         f"dataset_dir={self.paths['dataset']}",
         f"ESDS_save_dir={self.paths['esds']}",
-        "ESDS_chunk_size=25",
+        f"ESDS_chunk_size={chunk_size}",
     ]
     self._test_command(command_parts, "Build ESDS Dataset", use_subtest=True)
```
110-111: Non-subtest approach for pretraining is consistent with dataset building.
Using `use_subtest=False` for pretraining is consistent with the approach used for dataset building, as both are prerequisites for subsequent tests.
201-208: Enhanced window size options and model parameters.
The addition of new window size options (`-1h` and `-FULL`) and the explicit setting of `n_estimators` for the random forest classifier provide more comprehensive testing scenarios while keeping the test efficient with a small number of estimators.
255-257: Complete end-to-end test flow with ESDS dataset building.
The addition of the ESDS dataset building step to the end-to-end test ensures that this new functionality is properly tested as part of the complete workflow.
EventStream/data/dataset_base.py (24)
21-22: Streamlined imports for data manipulation & logging
These imports facilitate the usage of polars for data processing and loguru for structured logging.
24-24: Ensure runtime availability
This new import from `nested_ragged_tensors` requires verification that the dependency is installed and resolvable in all environments.
187-191: Refactored `_load_input_df` usage
Confirm that prior parameters like `subject_ids_map` or `subject_id_dtype` are removed or replaced. This ensures consistency with the simplified method signature.
194-194: Check column collisions
Using `{i: o for i, (o, _) in schema.unified_schema.items()}` is convenient, but verify no duplicate out-names.
220-220: Consistent function call
The revised `_load_input_df` call aligns with the new signature. Good job keeping the codebase in sync.
226-226: Enhanced debugging
Logging “Filtering” provides better traceability during data preprocessing steps.
230-231: Timestamp resolution improvements
The debug message clarifies event handling, and `_resolve_ts_col` ensures consistent timestamp columns.
241-242: Range-based event processing
These debug/log steps clarify range event handling. Ensures `start_time` is properly resolved.
260-261: Guard for empty events
Logging a warning on an empty events dataframe aids in diagnosing unexpected data conditions.
264-264: Appending events
Ensure subsequent concatenation merges schemas consistently across event DataFrames.
266-266: Appending measurements
Likewise, confirm consistent measurement schema when building a unified DataFrame later.
293-293: Loading subjects info
This info-level log helps in diagnosing any file I/O issues.
310-310: Logging event loading
Keeping logs consistent fosters maintainability across I/O boundaries.
329-329: Loading dynamic measurements
Ensures clarity when reading dynamic data from disk, aligning with the broader logging pattern.
366-366: Updating config directory
Informs that the `config.save_dir` is changed during reload; improves clarity for stateful loading logic.
473-474: Building subjects & logging extraction step
Combining subject build logic with a debug statement keeps the dataflow transparent.
479-479: Events and measurements built
A debug log confirming final data construction is helpful for debugging.
509-510: Assigning main DataFrames
Storing `events_df` and `dynamic_measurements_df` ensures consistent usage across methods.
512-514: Conditional event preprocessing
Only if `events_df` is not None are `_agg_by_time()` and `_sort_events()` applied, preventing accidental calls on missing data.
521-521: Validating initial dataframes
Destructuring the results of `_validate_initial_dfs` clarifies the final subject, event, and measurement DataFrames.
722-730: Comprehensive preprocessing pipeline logs
Provides clear step-by-step updates: from subject filtering, time-derived measurement addition, to final transformations.
793-793: Warning on missing measure
Helps identify absent columns at runtime, triggering a drop to avoid silent failures.
805-805: Logging no events
Reporting that a measure has no observed events clarifies the reason behind subsequent config drops.
806-806: Dropping measure config
Dropping the measure config after detecting no events is consistent, but verify it doesn't cause unhandled references.
EventStream/data/dataset_polars.py (19)
13-13: Reason for `defaultdict`
Using `collections.defaultdict` can simplify code when grouping or accumulating values.
15-15: Time offset usage
`timedelta` likely supports rolling or dynamic grouping logic; ensure all edge cases are handled.
22-22: PyArrow integration
Facilitates interoperability with Arrow-based formats. Good for performance in columnar operations.
23-23: Structured logging
Consistent logging with `loguru` improves observability.
38-38: String cache activation
`pl.enable_string_cache()` may significantly optimize repetitive string operations.
40-55: Polars-to-PyArrow dtype map
This dictionary ensures controlled mapping between Polars and Arrow. Verify coverage for all used dtypes to prevent runtime errors.
116-119: Expanded `METADATA_SCHEMA`
Adding `thresh_high`, `thresh_low`, `mean`, and `std` supports robust outlier detection and normalization.
301-301: Method signature flexibility
Clarifying the return with `None` for optional data aligns with the code handling empty or missing frames.
312-312: Added debug statement
Logging event processing step helps isolate issues in the pipeline.
332-335: Hashing for `event_id`
Struct-based hashing is concise but confirm collision risk is minimal for large datasets or repeated timestamps.
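A minimal sketch of struct-based event-ID hashing in Polars (illustrative columns; the project's exact expression may differ):

```python
import polars as pl
from datetime import datetime

events = pl.DataFrame(
    {
        "subject_id": [1, 1, 2],
        "timestamp": [datetime(2020, 1, 1), datetime(2020, 1, 2), datetime(2020, 1, 1)],
    }
)

# Hash the (subject_id, timestamp) struct to derive a deterministic event_id.
events = events.with_columns(
    pl.struct(["subject_id", "timestamp"]).hash().alias("event_id")
)
print(events)
```

Since `Expr.hash` yields 64-bit values, collisions are unlikely but not impossible at scale, which is the concern raised above.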
635-635: Conditional type enforcement
Ensures `dynamic_measurements_df["event_id"]` aligns with the `events_df["event_id"]` type.
637-639: Casting event_id
Repeated cast can cause overhead or data truncation if ID ranges are large. Validate the numeric limits.
640-640: Skipping streaming segfaults
This log clarifies a known issue with streaming in certain polars versions. Helpful for maintainers.
649-649: Temporal bucketing
`group_by_dynamic(every=...)` is powerful, but confirm memory usage is acceptable on large data.
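For reference, the time-bucketed aggregation pattern looks like this (generic sketch; the window size and columns are illustrative):

```python
import polars as pl
from datetime import datetime

df = pl.DataFrame(
    {
        "timestamp": [datetime(2020, 1, 1, h) for h in range(6)],
        "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
)

# group_by_dynamic requires the index column to be sorted.
bucketed = (
    df.sort("timestamp")
    .group_by_dynamic("timestamp", every="2h")
    .agg(pl.col("value").mean().alias("mean_value"))
)
print(bucketed)
```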
664-667: Re-hashing events
Using `.hash(...)` again for final event IDs. Follow up on collision prevention if the data grows large.
683-683: Debug: Re-map measurements
The debug log clarifies re-mapping steps for dynamic_measurements_df.
685-687: Collect dynamic measurements
Disabling streaming ensures stability but may raise memory needs. Monitor large-file performance.
694-694: Collecting event types
Reinforces the discovered categories. Validate we handle unexpected categories gracefully upstream.
703-703: Event count storage
Maintaining a subject-to-event-count map speeds subsequent lookups and stats, if kept in sync.
```python
for i, (subj, n_events) in enumerate(zip(subject_ids, n_events)):
    if subj in self.subj_indices or subj in self.subj_seq_bounds:
        raise ValueError(f"Duplicate subject {subj} in {shard}!")
```
|
|
🛠️ Refactor suggestion
Potential variable overshadowing in loop.
Your loop reads `for i, (subj, n_events) in enumerate(zip(subject_ids, n_events)):`. The variable `n_events` is used both in the `zip` and as the local loop variable, which might cause confusion. Renaming the loop variable could avoid overshadowing.
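One possible rename, shown as a self-contained sketch (the real loop also consults `self.subj_indices` / `self.subj_seq_bounds` and names the offending shard):

```python
# Renaming the loop variable to n_subj_events avoids shadowing the n_events list.
subject_ids = [101, 102, 103]  # hypothetical values
n_events = [5, 3, 7]

seen: dict[int, int] = {}
for i, (subj, n_subj_events) in enumerate(zip(subject_ids, n_events)):
    if subj in seen:
        raise ValueError(f"Duplicate subject {subj}!")
    seen[subj] = n_subj_events
```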
🧰 Tools
🪛 Ruff (0.8.2)
209-209: Loop control variable n_events overrides iterable it iterates
(B020)
```python
center_and_scale=True,
save_dir=None,
```
💡 Verification agent
🧩 Analysis chain
Added center_and_scale parameter to default configuration.
The addition of the center_and_scale parameter with a default value of True replaces the previous normalizer_config. This change should be documented to ensure users understand the new parameter's purpose and how it differs from the previous approach.
🌐 Web query:
What is the purpose of the center_and_scale parameter in data normalization?
💡 Result:
The center_and_scale parameter in data normalization determines whether to apply centering (adjusting the mean to zero) and scaling (adjusting the spread to unit variance or a specified range) to the dataset. Here's a breakdown of its purpose:
Core Functions
- Centering:
  - Subtracts the mean (or median) from each feature, shifting the data distribution to center around zero[1][3][6].
  - Example: If a feature's mean is 50, centering adjusts values to fluctuate around 0 instead of 50[3][13].
- Scaling:
  - Adjusts the spread of data by dividing features by a metric like standard deviation, range, or median absolute deviation (MAD)[1][4][13].
  - Common methods:
    - Z-score: Divides by standard deviation (unit variance)[2][13].
    - Range scaling: Rescales data to a predefined range (e.g., [0, 1])[14][16].
Why Use center_and_scale?
- Algorithm Compatibility: Many machine learning algorithms (e.g., SVM, PCA, k-NN) require features to have similar scales for accurate performance[2][12].
- Gradient Descent Optimization: Centered and scaled gradients converge faster during model training[16][12].
- Outlier Handling: Methods like MAD scaling reduce sensitivity to outliers compared to standard deviation[1][5].
| Method | Effect on Data | Typical Use Case |
|---|---|---|
| Z-score | Mean=0, Std=1 | Gradient-based models (e.g., SVM) |
| Min-Max Scaling | Range=[0,1] | Neural networks, image processing |
| MAD Scaling | Median=0, MAD=1 | Robust scaling for outlier-prone data |
Practical Example
Enabling center_and_scale=True in a function like MATLAB’s normalize or scikit-learn’s StandardScaler performs:
X_normalized = (X - μ) / σ
where:
- μ = mean (centering)
- σ = standard deviation (scaling)[2][16].
This ensures features are standardized (mean=0, variance=1), aligning with assumptions of many statistical models[6][13]. Without centering and scaling, features with larger ranges could dominate model outcomes[2][5].
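As a concrete minimal example (generic NumPy, not this project's code):

```python
import numpy as np

# Column-wise centering and scaling (z-scoring).
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_normalized = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_normalized)
```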
Citations:
- 1: https://www.mathworks.com/help/matlab/ref/normalizedata.html
- 2: https://www.datacamp.com/tutorial/normalization-vs-standardization
- 3: https://www.stathelp.se/en/center_en.html
- 4: https://www.mathworks.com/help/matlab/ref/double.normalize.html
- 5: https://julielinx.github.io/blog/08_center_scale_and_latex/
- 6: https://www.myrelab.com/learn/normalization-standardization-and-data-transformation
- 7: https://blogs.sas.com/content/iml/2020/08/10/standardize-data-sas.html
- 8: https://www.datacamp.com/tutorial/normalization-in-machine-learning
- 9: https://uclouvain-cbio.github.io/WSBIM1322/sec-norm.html
- 10: https://www.youtube.com/watch?v=hOZfAFIQ4sg
- 11: https://www.splunk.com/en_us/blog/learn/data-normalization.html
- 12: https://towardsai.net/p/data-science/how-when-and-why-should-you-normalize-standardize-rescale-your-data-3f083def38ff
- 13: https://estuary.dev/blog/data-normalization/
- 14: https://www.kaggle.com/code/alexisbcook/scaling-and-normalization
- 15: https://developers.google.com/machine-learning/crash-course/numerical-data/normalization
- 16: https://scikit-learn.org/stable/modules/preprocessing.html
- 17: https://recipes.tidymodels.org/reference/step_normalize.html
Documentation Update Needed: Clarify center_and_scale Behavior
The new center_and_scale parameter (defaulted to True) now governs whether data is centered (shifting the mean to zero) and scaled (adjusting the spread to unit variance). This implementation replaces the old normalizer_config mechanism. To aid users in understanding the update, please ensure that the documentation (and any related changelogs) clearly explains:
- What It Does:
  - Centering: Subtracts the mean so that the data distribution centers around zero.
  - Scaling: Divides by the standard deviation (or an equivalent metric) to standardize the data.
- Rationale:
  - Aligns with common data normalization practices widely used in algorithms like SVM, PCA, and other gradient-based models.
  - Addresses potential issues where features with larger ranges might dominate model outcomes.
- Migration Impact:
  - Highlights differences from the previous `normalizer_config`, ensuring users understand any behavioral changes they might need to account for.
Location:
- File: `tests/data/test_config.py`
- Lines: 492-493