This repository was archived by the owner on Jul 2, 2025. It is now read-only.

Conversation

@mmcdermott
Owner

No description provided.

Removed the old incrementing calls used to assign event IDs in favor of hashes of subject IDs and timestamps, which can be computed lazily.
Fixes a metrics issue and enables tuning subset sizes for speed.
…_flat_reps

Add computation over future summary windows to flat reps
Fixes the slowdowns and bugs introduced by the earlier compute improvements, but requires a nested tensor package.
…removed references to np.NaN, as that is removed in NumPy 2.0
Some small changes to update the codebase for more recent package versions.
@coderabbitai

coderabbitai bot commented Mar 20, 2025

Walkthrough

This pull request introduces numerous refactorings and removals across the codebase. Most changes replace standard print statements with structured logging via Loguru, update method signatures, and modify parameters. Several configuration classes have been enhanced with new attributes and validation logic, while deprecated preprocessing and evaluation modules are removed. Updates also include changes to dataset handling (Polars and PyTorch), attention mechanisms in Transformer modules, and documentation revisions. Additionally, new scripts are added and tests are updated to reflect these modifications.

Changes

| File(s) | Change Summary |
| --- | --- |
| EventStream/baseline/FT_task_baseline.py, EventStream/transformer/config.py, EventStream/transformer/lightning_modules/*, EventStream/utils.py | Replaced print statements with structured logging using Loguru; added hydra_loguru_init for consistent logger initialization. |
| EventStream/data/config.py, tests/data/test_config.py | Added new attributes (tuning_subset_size, tuning_subset_seed), introduced caching properties and a hash method; updated MeasurementConfig (removed outlier_model/normalizer, added thresh, mean, std) and validation logic. |
| EventStream/data/dataset_base.py, EventStream/data/dataset_polars.py, EventStream/data/types.py | Simplified input dataframe loading, removed obsolete parameters (e.g., subject_ids_map), changed groupby to group_by, added ESDS representation methods, and revised missing value representations. |
| EventStream/data/preprocessing/* | Removed preprocessing files (README.md, __init__.py, preprocessor.py, standard_scaler.py, stddev_cutoff.py), deprecating legacy normalization and outlier detection functionality. |
| EventStream/data/pytorch_dataset.py | Removed mixin classes; updated constructor (added just_cache parameter) and data retrieval to return torch.Tensor; refined indexing and collate operations. |
| EventStream/data/time_dependent_functor.py | Updated age calculation to use dt.total_nanoseconds() and simplified measurement metadata handling in functor methods. |
| EventStream/transformer/transformer.py and related lightning modules | Revised attention mechanism (using scaled_dot_product_attention, separate dropout probability), restructured sinusoidal position encoding, and enhanced logging in training modules. |
| EventStream/evaluation/*.py | Removed evaluation files (MCF_evaluation.py, general_generative_evaluation.py) and their associated functions and classes. |
| README.md, configs/README.md, docs/MIMIC_IV_tutorial/* | Updated documentation: replaced np.NaN with float('nan'), renamed repository path variable (EFGPT to ESGPT), and removed the normalizer_config entry in YAML. |
| scripts/* (e.g., build_dataset.py, finetune.py, get_embeddings.py, launch_*, pretrain.py, sklearn_baseline.py, zeroshot.py, prepare_pretrain_subsets.py) | Integrated logging initialization via hydra_loguru_init, replaced print statements with logger calls, and updated script entries in setup; applied consistent logging across scripts. |
| scripts/convert_to_ESDS.py | New script added to build dataset representations in the ESDS format based on Hydra configurations. |
| tests/* (e.g., test_standard_scaler.py, test_stddev_cutoff.py, test_dataset_polars.py, test_pytorch_dataset.py, test_e2e_runs.py) | Removed deprecated tests; added tests for mandatory split IDs and updated dataset splitting; restructured test setup and teardown procedures. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Hydra
    participant LoggerInit
    participant ScriptMain
    participant Logger

    User->>Hydra: Launch script with Hydra config
    Hydra->>LoggerInit: Call hydra_loguru_init()
    LoggerInit->>Logger: Initialize Loguru logger (set up log file, etc.)
    Hydra->>ScriptMain: Provide configuration
    ScriptMain->>Logger: Log messages via logger.info/error/warning during execution

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ce9e2c3 and 6cb0906.

⛔ Files ignored due to path filters (1)
  • pyproject.toml is excluded by !**/*.toml
📒 Files selected for processing (1)
  • README.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • README.md
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: run_tests_ubuntu



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (15)
sample_data/generate_synthetic_data.py (4)

118-119: Code style: Multiple statements on one line.

Multiple statements appear on single lines after colons, which violates the PEP 8 style guide.

Apply this formatting change:

-                if running_HR < 30: running_HR = 30
-                if running_HR > 300: running_HR = 300
+                if running_HR < 30:
+                    running_HR = 30
+                if running_HR > 300:
+                    running_HR = 300

-                if running_temp < 95: running_temp = 95
-                if running_temp > 104: running_temp = 104
+                if running_temp < 95:
+                    running_temp = 95
+                if running_temp > 104:
+                    running_temp = 104

Also applies to: 122-123

🧰 Tools
🪛 Ruff (0.8.2)

118-118: Multiple statements on one line (colon)

(E701)


119-119: Multiple statements on one line (colon)

(E701)


190-190: Remove unused variables.

The variable months is defined but never used in multiple places.

Consider removing these unused variable assignments or using them if they were intended to be part of the logic:

-    months = 30 * days

Also applies to: 250-250

🧰 Tools
🪛 Ruff (0.8.2)

190-190: Local variable months is assigned to but never used

Remove assignment to unused variable months

(F841)


193-193: Remove unused variable 'lab_ps'.

The variable lab_ps is assigned a value from np.random.dirichlet() but never used.

Either remove the assignment or use the variable as intended:

-        lab_ps = np.random.dirichlet(alpha=[0.1 for _ in range(5)])
🧰 Tools
🪛 Ruff (0.8.2)

193-193: Local variable lab_ps is assigned to but never used

Remove assignment to unused variable lab_ps

(F841)


212-212: Simplify dictionary key check.

Using .keys() in a membership test is redundant and less efficient.

Simplify the code by removing the .keys() call:

-            for lab in base_lab_gaps.keys():
+            for lab in base_lab_gaps:
🧰 Tools
🪛 Ruff (0.8.2)

212-212: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

EventStream/tasks/profile.py (1)

7-7: Update to modern Polars API usage.

The code now uses pl.enable_string_cache() without an explicit boolean parameter, aligning with the current Polars API which defaults this parameter to True.

EventStream/data/dataset_polars.py (1)

2013-2013: Use direct membership check instead of .keys().

Prefer if key in some_dict: over if key in some_dict.keys():.

- if mod_col in source_df[mod_col].dtype.keys():
+ if mod_col in source_df[mod_col].dtype:
🧰 Tools
🪛 Ruff (0.8.2)

2013-2013: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

EventStream/transformer/transformer.py (3)

240-240: Potentially redundant cast.
sdpa_result is re-cast to value.dtype even though query and key were already cast. If there is no type change from intermediate operations, you can safely omit this step.


242-242: Duplicate return statement.
You return (sdpa_result, None) at both line 238 and line 242. Consider consolidating the code paths if they are identical, to avoid confusion.


613-613: Initialize sinusoidal terms with normal distribution.
Using random initialization for sin_div_term and cos_div_term breaks from typical fixed-frequency sinusoidal encodings. If you're intentionally learning these frequencies, clarify the rationale in docstrings to assist maintainers in understanding the design choice.

Also applies to: 616-620

EventStream/data/dataset_base.py (1)

1484-1492: Construct a unified vocabulary list with UNK at index 0.
The logic is clear. However, the loop variable m in for m, idxmap in self.unified_vocabulary_idxmap.items(): is never used inside the loop body; renaming it to _ (or _m, per the lint hint) would make that explicit. A brief sketch of the overall pattern follows the lint note below.

🧰 Tools
🪛 Ruff (0.8.2)

1488-1488: Loop control variable m not used within loop body

Rename unused m to _m

(B007)
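
For context, a minimal sketch of the construction being described, with the unused loop variable renamed; the unified_vocabulary_idxmap layout shown here is an assumption for illustration, not the repository's exact structure:

# Hypothetical layout: measurement name -> {vocabulary element: global index}.
unified_vocabulary_idxmap: dict[str, dict[str, int]] = {
    "lab_test": {"lab_test/HR": 1, "lab_test/temp": 2},
    "diagnosis": {"diagnosis/flu": 3},
}

# Index 0 is reserved for the unknown token; every other slot is filled from the idxmaps.
total = sum(len(idxmap) for idxmap in unified_vocabulary_idxmap.values())
unified_vocabulary = ["UNK"] + [None] * total
for _m, idxmap in unified_vocabulary_idxmap.items():  # _m marks the key as intentionally unused
    for element, idx in idxmap.items():
        unified_vocabulary[idx] = element

assert unified_vocabulary == ["UNK", "lab_test/HR", "lab_test/temp", "diagnosis/flu"]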

EventStream/data/pytorch_dataset.py (3)

132-155: Usage of just_cache argument.

The newly introduced just_cache: bool = False parameter is not referenced within __init__. If this is intentional (e.g., future usage or external calls to skip steps), consider adding clarifying docstrings or removing the parameter if it's not needed.

 def __init__(self, config: PytorchDatasetConfig, split: str, just_cache: bool = False):
     ...
+    # If you plan to use just_cache in the future, documenting it here
+    # might prevent confusion among collaborators.

209-209: Avoid shadowing the n_events loop variable.

The loop variable n_events shadows the outer n_events iterable it iterates over, which can confuse maintainers and static analysis tools; renaming it (for example, to subj_n_events as shown below) removes the ambiguity.

 for i, (subj, subj_n_events) in enumerate(zip(subject_ids, n_events)):
     if subj in self.subj_indices or subj in self.subj_seq_bounds:
         raise ValueError(f"Duplicate subject {subj} in {shard}!")
     ...
🧰 Tools
🪛 Ruff (0.8.2)

209-209: Loop control variable n_events overrides iterable it iterates

(B020)


364-395: Data anomalies handling.

Automatically removing malformed subjects meshes well with real-world data patterns where negative or zero inter-event times occur. Document any assumptions about potential side effects (like how it impacts training set distribution).
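
For illustration only, a Polars sketch of this kind of cleanup; the subject_id and time_delta column names are assumptions, not the dataset's actual schema:

import polars as pl

events = pl.DataFrame({
    "subject_id": [1, 1, 1, 2, 2],
    "time_delta": [1.0, 2.0, None, -3.0, 5.0],  # minutes to the next event; last event is null
})

# Any subject with a negative or zero inter-event time is considered malformed and dropped.
bad_subjects = events.filter(pl.col("time_delta") <= 0).get_column("subject_id").unique()
clean_events = events.filter(~pl.col("subject_id").is_in(bad_subjects))

Dropping whole subjects rather than individual events keeps each sequence internally consistent, but it does shift the training distribution, which is why documenting the assumption is worthwhile.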

EventStream/data/config.py (2)

979-983: Loading measurement configs.

Filtering out dropped measurements is a logical design choice. No issues spotted, but consider logging the count of loaded vs. dropped measurements for auditing.
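
A small sketch of what that audit log could look like (the is_dropped attribute here is an assumption about the config objects, not the actual API):

from loguru import logger

def load_measurement_configs(all_configs: dict) -> dict:
    """Illustrative helper: keep non-dropped configs and log the kept/dropped counts."""
    kept = {name: cfg for name, cfg in all_configs.items() if not getattr(cfg, "is_dropped", False)}
    logger.info(f"Loaded {len(kept)} measurement configs; dropped {len(all_configs) - len(kept)}.")
    return kept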


1011-1046: Data parameter hashing & caching.

The _data_parameters_and_hash property is a pragmatic approach to uniquely identify dataset settings. Consider clarifying that changes in any of these fields generate a new directory.
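
As a generic illustration of the technique (not the property's actual implementation), hashing the parameter dictionary to derive a cache directory makes the "new settings, new directory" behavior explicit:

import hashlib
import json
from pathlib import Path

def cache_dir_for(params: dict, root: Path) -> Path:
    """Any change to `params` produces a different hash and therefore a different directory."""
    canonical = json.dumps(params, sort_keys=True, default=str)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return root / digest

print(cache_dir_for({"max_seq_len": 256, "tuning_subset_size": 0.1}, Path("cache")))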

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7680461 and ce9e2c3.

⛔ Files ignored due to path filters (7)
  • .github/workflows/tests.yml is excluded by !**/*.yml
  • configs/dataset_base.yaml is excluded by !**/*.yaml
  • configs/normalizer_config/standard_scaler.yaml is excluded by !**/*.yaml
  • configs/outlier_detector_config/stddev_cutoff.yaml is excluded by !**/*.yaml
  • pyproject.toml is excluded by !**/*.toml
  • sample_data/dataset.yaml is excluded by !**/*.yaml
  • sample_data/pretrain_NA.yaml is excluded by !**/*.yaml
📒 Files selected for processing (52)
  • EventStream/baseline/FT_task_baseline.py (6 hunks)
  • EventStream/data/README.md (2 hunks)
  • EventStream/data/config.py (14 hunks)
  • EventStream/data/dataset_base.py (21 hunks)
  • EventStream/data/dataset_polars.py (44 hunks)
  • EventStream/data/preprocessing/README.md (0 hunks)
  • EventStream/data/preprocessing/__init__.py (0 hunks)
  • EventStream/data/preprocessing/preprocessor.py (0 hunks)
  • EventStream/data/preprocessing/standard_scaler.py (0 hunks)
  • EventStream/data/preprocessing/stddev_cutoff.py (0 hunks)
  • EventStream/data/pytorch_dataset.py (9 hunks)
  • EventStream/data/time_dependent_functor.py (3 hunks)
  • EventStream/data/types.py (1 hunks)
  • EventStream/data/visualize.py (7 hunks)
  • EventStream/evaluation/MCF_evaluation.py (0 hunks)
  • EventStream/evaluation/general_generative_evaluation.py (0 hunks)
  • EventStream/logger.py (1 hunks)
  • EventStream/tasks/profile.py (4 hunks)
  • EventStream/transformer/config.py (5 hunks)
  • EventStream/transformer/lightning_modules/embedding.py (2 hunks)
  • EventStream/transformer/lightning_modules/fine_tuning.py (7 hunks)
  • EventStream/transformer/lightning_modules/generative_modeling.py (6 hunks)
  • EventStream/transformer/lightning_modules/zero_shot_evaluator.py (3 hunks)
  • EventStream/transformer/model_output.py (7 hunks)
  • EventStream/transformer/transformer.py (4 hunks)
  • EventStream/utils.py (2 hunks)
  • README.md (1 hunks)
  • configs/README.md (0 hunks)
  • docs/MIMIC_IV_tutorial/data_extraction_processing.md (2 hunks)
  • sample_data/build_sample_task_DF.py (1 hunks)
  • sample_data/generate_synthetic_data.py (9 hunks)
  • scripts/build_dataset.py (6 hunks)
  • scripts/convert_to_ESDS.py (1 hunks)
  • scripts/finetune.py (1 hunks)
  • scripts/generate_trajectories.py (0 hunks)
  • scripts/get_embeddings.py (2 hunks)
  • scripts/launch_finetuning_wandb_hp_sweep.py (2 hunks)
  • scripts/launch_from_scratch_supervised_wandb_hp_sweep.py (2 hunks)
  • scripts/launch_pretraining_wandb_hp_sweep.py (2 hunks)
  • scripts/launch_sklearn_baseline_supervised_wandb_hp_sweep.py (2 hunks)
  • scripts/prepare_pretrain_subsets.py (3 hunks)
  • scripts/pretrain.py (2 hunks)
  • scripts/sklearn_baseline.py (1 hunks)
  • scripts/zeroshot.py (2 hunks)
  • setup.py (1 hunks)
  • tests/data/preprocessing/test_standard_scaler.py (0 hunks)
  • tests/data/preprocessing/test_stddev_cutoff.py (0 hunks)
  • tests/data/test_config.py (8 hunks)
  • tests/data/test_dataset_base.py (2 hunks)
  • tests/data/test_dataset_polars.py (0 hunks)
  • tests/data/test_pytorch_dataset.py (9 hunks)
  • tests/test_e2e_runs.py (5 hunks)
💤 Files with no reviewable changes (12)
  • EventStream/data/preprocessing/README.md
  • configs/README.md
  • EventStream/data/preprocessing/__init__.py
  • EventStream/evaluation/MCF_evaluation.py
  • EventStream/evaluation/general_generative_evaluation.py
  • EventStream/data/preprocessing/preprocessor.py
  • scripts/generate_trajectories.py
  • EventStream/data/preprocessing/stddev_cutoff.py
  • tests/data/preprocessing/test_standard_scaler.py
  • tests/data/preprocessing/test_stddev_cutoff.py
  • tests/data/test_dataset_polars.py
  • EventStream/data/preprocessing/standard_scaler.py
🧰 Additional context used
🧬 Code Definitions (15)
scripts/finetune.py (3)
EventStream/transformer/lightning_modules/fine_tuning.py (2)
  • FinetuneConfig (273-442)
  • train (446-586)
scripts/get_embeddings.py (1)
  • main (24-28)
scripts/zeroshot.py (1)
  • main (25-29)
EventStream/transformer/lightning_modules/fine_tuning.py (3)
EventStream/transformer/config.py (1)
  • StructuredTransformerConfig (353-935)
EventStream/utils.py (1)
  • from_json_file (329-365)
EventStream/data/dataset_base.py (1)
  • load (339-377)
scripts/sklearn_baseline.py (8)
scripts/launch_sklearn_baseline_supervised_wandb_hp_sweep.py (1)
  • main (81-111)
scripts/get_embeddings.py (1)
  • main (24-28)
scripts/finetune.py (1)
  • main (25-39)
scripts/launch_finetuning_wandb_hp_sweep.py (1)
  • main (77-107)
scripts/launch_from_scratch_supervised_wandb_hp_sweep.py (1)
  • main (81-111)
scripts/launch_pretraining_wandb_hp_sweep.py (1)
  • main (79-109)
scripts/pretrain.py (1)
  • main (29-43)
scripts/zeroshot.py (1)
  • main (25-29)
scripts/prepare_pretrain_subsets.py (10)
EventStream/data/config.py (2)
  • SeqPaddingSide (748-761)
  • SubsequenceSamplingStrategy (764-784)
scripts/finetune.py (1)
  • main (25-39)
scripts/get_embeddings.py (1)
  • main (24-28)
scripts/pretrain.py (1)
  • main (29-43)
scripts/zeroshot.py (1)
  • main (25-29)
scripts/launch_finetuning_wandb_hp_sweep.py (1)
  • main (77-107)
scripts/launch_from_scratch_supervised_wandb_hp_sweep.py (1)
  • main (81-111)
scripts/launch_pretraining_wandb_hp_sweep.py (1)
  • main (79-109)
scripts/launch_sklearn_baseline_supervised_wandb_hp_sweep.py (1)
  • main (81-111)
scripts/sklearn_baseline.py (1)
  • main (19-24)
scripts/build_dataset.py (1)
EventStream/logger.py (1)
  • hydra_loguru_init (7-10)
EventStream/data/time_dependent_functor.py (1)
EventStream/data/config.py (2)
  • measurement_metadata (1429-1455)
  • measurement_metadata (1458-1469)
EventStream/transformer/lightning_modules/generative_modeling.py (1)
EventStream/transformer/config.py (3)
  • MetricsConfig (106-199)
  • Split (26-42)
  • MetricCategories (45-61)
tests/test_e2e_runs.py (1)
tests/data/test_pytorch_dataset.py (1)
  • tearDown (360-361)
EventStream/transformer/model_output.py (1)
EventStream/transformer/utils.py (1)
  • str_summary (11-30)
EventStream/transformer/transformer.py (1)
EventStream/data/types.py (1)
  • batch_size (178-183)
EventStream/data/visualize.py (1)
EventStream/data/dataset_base.py (2)
  • events_df (302-313)
  • events_df (316-317)
EventStream/data/dataset_polars.py (3)
EventStream/data/config.py (3)
  • drop (1391-1415)
  • measurement_configs (980-983)
  • vocabulary_config (972-973)
EventStream/data/types.py (5)
  • keys (310-312)
  • items (306-308)
  • DataModality (832-868)
  • NumericDataModalitySubtype (871-897)
  • TemporalityType (808-829)
EventStream/data/vocabulary.py (2)
  • filter (186-274)
  • idxmap (78-90)
tests/data/test_pytorch_dataset.py (4)
EventStream/data/config.py (2)
  • PytorchDatasetConfig (788-1062)
  • VocabularyConfig (699-745)
tests/data/test_dataset_base.py (1)
  • setUp (179-186)
tests/data/test_types.py (1)
  • setUp (65-67)
EventStream/data/pytorch_dataset.py (2)
  • PytorchDataset (61-609)
  • max_seq_len (417-418)
EventStream/data/pytorch_dataset.py (2)
EventStream/utils.py (2)
  • count_or_proportion (26-88)
  • to_dict (252-287)
EventStream/data/config.py (6)
  • PytorchDatasetConfig (788-1062)
  • SeqPaddingSide (748-761)
  • SubsequenceSamplingStrategy (764-784)
  • to_dict (955-959)
  • to_dict (1538-1554)
  • to_dict (1934-1944)
EventStream/data/config.py (1)
EventStream/data/dataset_base.py (2)
  • vocabulary_config (1426-1448)
  • measurement_configs (975-984)
🪛 Ruff (0.8.2)
sample_data/generate_synthetic_data.py

118-118: Multiple statements on one line (colon)

(E701)


119-119: Multiple statements on one line (colon)

(E701)


122-122: Multiple statements on one line (colon)

(E701)


123-123: Multiple statements on one line (colon)

(E701)


190-190: Local variable months is assigned to but never used

Remove assignment to unused variable months

(F841)


193-193: Local variable lab_ps is assigned to but never used

Remove assignment to unused variable lab_ps

(F841)


212-212: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


250-250: Local variable months is assigned to but never used

Remove assignment to unused variable months

(F841)

EventStream/data/dataset_polars.py

1648-1648: Function definition does not bind loop variable m

(B023)


1648-1648: Function definition does not bind loop variable key_prefix

(B023)


1652-1652: Function definition does not bind loop variable m

(B023)


1652-1652: Function definition does not bind loop variable val_prefix

(B023)


1654-1654: Function definition does not bind loop variable m

(B023)


1654-1654: Function definition does not bind loop variable val_prefix

(B023)


1657-1657: Function definition does not bind loop variable m

(B023)


1657-1657: Function definition does not bind loop variable val_prefix

(B023)


1658-1658: Function definition does not bind loop variable m

(B023)


1658-1658: Function definition does not bind loop variable val_prefix

(B023)


1659-1659: Function definition does not bind loop variable m

(B023)


1659-1659: Function definition does not bind loop variable val_prefix

(B023)


1666-1666: Function definition does not bind loop variable m

(B023)


2013-2013: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

EventStream/data/pytorch_dataset.py

209-209: Loop control variable n_events overrides iterable it iterates

(B020)

EventStream/data/dataset_base.py

1488-1488: Loop control variable m not used within loop body

Rename unused m to _m

(B007)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: run_tests_ubuntu
🔇 Additional comments (169)
README.md (1)

34-36: Good addition of compatibility information.

This note clearly explains the version requirement for Polars and why it matters, helping users troubleshoot common issues when using outdated environments. The specific mention of the function naming change (groupby vs group_by) directly ties to code changes in this PR.

sample_data/build_sample_task_DF.py (1)

29-30: Function name updated to match Polars >= 0.19.

The change from groupby to group_by aligns with the newer Polars API, as noted in the README update. This ensures compatibility with the required library version.
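
For reference, the rename is a drop-in substitution; a minimal example with synthetic data (not taken from the repository):

import polars as pl

df = pl.DataFrame({"subject_id": [1, 1, 2], "value": [0.5, 1.5, 2.0]})

# Polars < 0.19 spelled this `groupby`; newer releases use `group_by`.
per_subject = df.group_by("subject_id").agg(pl.col("value").mean())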

sample_data/generate_synthetic_data.py (1)

3-5: Improved docstring formatting and clarity.

The docstring has been reformatted with proper line wrapping for better readability, without changing the content.

Also applies to: 10-12

EventStream/data/README.md (1)

79-81: Consistent representation of missing values in documentation.

The change from np.NaN to float('nan') ensures documentation consistency with the actual implementation for representing missing values in the codebase.

Also applies to: 393-395

EventStream/utils.py (1)

19-19: Appropriate import addition for structured logging.

Adding the Loguru logger import is consistent with the PR objective of replacing print statements with structured logging.

EventStream/transformer/lightning_modules/zero_shot_evaluator.py (3)

13-13: Appropriate import addition for structured logging.

Adding the Loguru logger import is consistent with the PR objective of replacing print statements with structured logging.


172-175: Good replacement of print with logger.error for failures.

Replacing print statements with structured logging for error conditions is appropriate and improves the ability to track and filter issues.


384-384: Good replacement of print with logger.info for status messages.

Using logger.info for informational messages is consistent with good logging practices.

EventStream/transformer/lightning_modules/embedding.py (3)

9-9: Appropriate import addition for structured logging.

Adding the Loguru logger import is consistent with the PR objective of replacing print statements with structured logging.


157-159: Good replacement of print with logger.info for file existence checks.

Using structured logging for messages about existing files improves log consistency and filtering capabilities.


161-161: Good replacement of print with logger.info for saving operations.

Using structured logging for saving operations enhances observability and maintains a consistent logging approach.

EventStream/data/types.py (1)

753-753: Improved DataFrame display format for truncated lists.

The change to use ellipses for truncation improves readability of the DataFrame representation, especially for large lists of floating-point values.

EventStream/tasks/profile.py (2)

85-85: DataFrame schema updates for nullable columns.

The schema for start_time column has been updated from f32 to null type, correctly reflecting that this column can contain null values, which is consistent with the example data.

Also applies to: 98-98


154-154:

❓ Verification inconclusive

Standardize Polars API method naming.

The code has been updated to use group_by instead of groupby and pl.len() instead of pl.count(). These changes align with the current Polars API naming conventions for better consistency.

Let's confirm these changes match the current Polars API:

Also applies to: 156-156, 182-182, 256-257, 283-284, 300-301, 341-341, 356-357


🏁 Script executed:

#!/bin/bash
# Let's check if these are the correct method names in the current Polars version
pip show polars | grep Version
python -c "import polars as pl; print('group_by method exists:', hasattr(pl.DataFrame, 'group_by')); print('len function exists:', hasattr(pl, 'len'))"

Length of output: 368


Polars API Naming Update – Manual Verification Needed

The updated code now uses group_by in place of groupby and pl.len() instead of pl.count(), which aligns with the intended new naming conventions. However, our automated verification could not confirm this because the Polars module wasn’t present in the environment.

  • Action Required: Please manually verify that the current Polars version used in your project supports these method names.
  • Suggestion: Ensure the Polars package is installed (e.g., via your CI or local environment) and re-run a version check to confirm the API changes.
EventStream/transformer/lightning_modules/generative_modeling.py (4)

15-15: Add structured logging with Loguru.

Added import of logger from loguru to implement structured logging throughout the module.


283-286: Improve error handling with structured logging.

Replaced print statements with logger.error for improved error visibility and organization when metric computation fails. This provides better debugging capabilities in production environments.


522-529: Enhanced metrics configuration for validation phases.

Updated final_validation_metrics_config to specifically include loss metrics for tuning and held-out splits instead of skipping all metrics. This provides more meaningful validation metrics during final model evaluation.

The previous default (do_skip_all_metrics=False) was non-specific, while the new implementation explicitly defines which metrics to include for each split, making the configuration more predictable and explicit.


599-600: Replace print statements with structured logging.

Changed multiple print statements to appropriate logger.info calls throughout the file. This standardizes the logging approach and enables better control over log output management.

Also applies to: 604-605, 627-628, 708-709

EventStream/data/visualize.py (1)

164-165: Standardize Polars API method naming conventions.

Updated all occurrences of groupby to group_by to align with current Polars API naming conventions. This ensures compatibility with newer versions of the library and maintains consistent method naming throughout the codebase.

Also applies to: 182-183, 256-257, 283-284, 300-301, 341-342, 356-357

EventStream/transformer/model_output.py (7)

12-12: Add structured logging with Loguru.

Added import of logger from loguru to implement structured logging throughout the module.


434-437: Improve warning messages with structured logging.

Replaced print statements with logger.warning calls for improper measurement detection. This provides better log level control and visibility of warnings in production environments.


461-461: Standardize warning logging.

Replaced print statement with logger.warning for consistent logging of improper measurement warnings.


529-534: Enhance error message clarity with detailed context.

Improved the error handling by providing more detailed error messages that include the measurement name, shape information, vocabulary offset, and the actual indices that caused the error. This makes debugging easier by providing more context about the failure.


1027-1031: Improve error message for index construction failures.

Enhanced the error message when concatenating dynamic indices fails, making it easier to debug by showing the shapes of both tensor arguments.


1496-1503: Provide comprehensive error context for classification issues.

Improved error message for classification loss computation failures with detailed information about vocabulary ranges and label values. This makes it significantly easier to identify and diagnose issues with classification predictions.


1361-1361: Enhance TTE error messages with tensor summaries.

Improved the error message for time-to-event log probability computation failures by including tensor shape and content information, which aids in debugging.

EventStream/data/time_dependent_functor.py (2)

147-149: Use of .dt.total_nanoseconds() for better precision.

This change correctly updates the age calculation to use .dt.total_nanoseconds(), which enhances clarity and precision.
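
A rough sketch of the idea (the column names and the nanoseconds-per-year constant are assumptions for illustration, not the functor's exact code):

import polars as pl
from datetime import datetime

NANOS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1e9  # approximate

df = pl.DataFrame({
    "dob": [datetime(1990, 1, 1)],
    "timestamp": [datetime(2020, 6, 1)],
})

# Datetime subtraction yields a Duration; total_nanoseconds() converts it to a plain number.
age = df.select(
    ((pl.col("timestamp") - pl.col("dob")).dt.total_nanoseconds() / NANOS_PER_YEAR).alias("age_years")
)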


190-193: Example parameters initialization in docstring.

These added fields in the example (mean, std, thresh_large, thresh_small) align with the updated metadata logic. No concerns here.

tests/test_e2e_runs.py (6)

33-40: Added "esds" directory for new dataset workflow.

Including "esds" in the loop is consistent with the new ESDS dataset creation steps.


48-56: Conditional subtests for clearer reporting.

Using subTest when use_subtest=True improves test structure and error isolation. This is a good testing practice.


73-73: Skipping subtests for dataset build command.

Passing use_subtest=False is correct if we want a single fail/success outcome for the build process.


75-82: New method for ESDS dataset creation.

The build_ESDS_dataset method is a clean addition to generate ESDS format from the core dataset.


110-110: Disable subtest for pretraining command.

Again, disabling subtest for pretraining is consistent if the entire command is treated as one test step.


256-256: Invoke ESDS dataset build in end-to-end test.

Ensures ESDS creation is part of the comprehensive E2E pipeline.

EventStream/data/dataset_polars.py (20)

13-13: Importing defaultdict to manage grouped data structures.
No issues here.


15-15: Importing timedelta for rolling window offsets.
Seems appropriate for time-based aggregation.


22-22: Importing pyarrow for schema definitions.
Enables better handling of Arrow schemas.


23-23: Added Loguru for structured logging.
Enhances debug logging capabilities.


38-38: Enabling Polars string cache.

Helps with reducing memory usage when dealing with many repeated strings.


40-55: Mapping Polars dtypes to PyArrow dtypes.

Covering a wide range of data types ensures consistent schema conversions.
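
The general shape of such a mapping, using a small hypothetical subset of types (the repository's table covers more dtypes than shown here):

import polars as pl
import pyarrow as pa

# Illustrative Polars -> PyArrow dtype mapping.
PL_TO_PA = {
    pl.Int64: pa.int64(),
    pl.UInt32: pa.uint32(),
    pl.Float32: pa.float32(),
    pl.Float64: pa.float64(),
    pl.Boolean: pa.bool_(),
    pl.Utf8: pa.string(),
    pl.Datetime("us"): pa.timestamp("us"),
}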


115-118: New columns in METADATA_SCHEMA.

Adds thresh_high, thresh_low, mean, std for refined numeric metadata processing.


193-193: New debug logging for query.
No issues; helpful for troubleshooting.


222-224: Reading from database with updated parameters.

Changes ensure correct argument passing to read_database_uri.


231-231: Alias subject_id in column expressions.

Provides consistent naming for subject ID column.


642-642: Calling .collect() on events_df.

Collecting lazy computations at this step is crucial for upcoming transformations.


645-646: Group by unique timestamps and subject_id.

When agg_by_time_scale is None, grouping ensures distinct event definitions.


649-656: Aggregating timestamps into dynamic buckets.

Uses group_by_dynamic for time-bucketed event grouping. Logical approach.


664-667: Hash new aggregated event group.

Same hashing approach applies here. Watch for collisions if the dataset grows large.


707-707: Debug log for subject event counts.

No issues, improves visibility during the collection step.


725-742: Robust error handling for column filtering.

This block clarifies type mismatches by enumerating erroneous data. Looks good.


869-873: Joining integer-inference results into metadata.

Combining is_int_expr outcomes is seamless. No concerns.


883-883: Group by vocab_keys_col in numerical inference.

No direct issues with grouping logic.


634-640:

✅ Verification successful

Ensure matching event_id dtype across dataframes.

Casting dynamic_measurements_df event_id to match events_df avoids type mismatches. Verify that no data is inadvertently truncated.


🏁 Script executed:

#!/bin/bash
# Check for all references to event_id casts and potential mismatches.
rg -A 3 'col\("event_id"\).*cast'

Length of output: 348


Review Verification: Event ID Casting is Correct, but Check for Data Truncation

The cast operation in EventStream/data/dataset_polars.py at lines 634–640 appropriately ensures that the event_id datatype of dynamic_measurements_df matches that of events_df. Our search confirms that this is the sole usage of a cast for event_id and no extraneous casts are present.

  • Ensure that the cast from the original dtype to event_id_dt does not inadvertently truncate or alter data (e.g., if converting from a higher-precision numeric type to a lower one).

Overall, the implementation appears correct as long as you’ve confirmed through testing that no precision loss occurs during the cast.


332-334:

❓ Verification inconclusive

Hash-based event_id assignment.

Generating event_id through hashing is convenient but can theoretically risk collisions for extremely large datasets. Consider a collision-safe approach if data volumes are massive.


🏁 Script executed:

#!/bin/bash
# If there's a collision check mechanism, let's locate references to event hashing.
rg 'event_id.*hash'

Length of output: 21


Hash-based event_id assignment – verify collision management

The current implementation generates event_id via a hash on selected columns without any collision detection. While collisions are very unlikely, especially on typical dataset sizes, there isn’t any accompanying mechanism to detect or handle collisions—this could be a potential risk when processing extremely large datasets. Please verify whether a collision-safe approach is needed for your expected volumes or if additional safeguards should be implemented.
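
A minimal sketch of the hashing approach under discussion, on synthetic columns; note that Expr.hash is a 64-bit non-cryptographic hash, which is why collision handling is worth considering at extreme scales:

import polars as pl
from datetime import datetime

events = pl.LazyFrame({
    "subject_id": [1, 1, 2],
    "timestamp": [datetime(2020, 1, 1), datetime(2020, 1, 2), datetime(2020, 1, 1)],
})

# Derive a deterministic event_id lazily from (subject_id, timestamp).
events = events.with_columns(
    pl.struct(["subject_id", "timestamp"]).hash(seed=0).alias("event_id")
)
print(events.collect())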

EventStream/transformer/transformer.py (12)

122-122: Use a descriptive dropout naming.
The rename from attn_dropout to attn_dropout_p clarifies that it is a dropout probability. This is standard practice and appears consistent with the rest of the code.


179-179: Explicitly raise an error for unsupported head_mask.
You raise a ValueError if head_mask is not None, which is appropriate since the feature is currently unsupported. Consider adding more detail in the docstring about future plans or alternatives for head masking.

Also applies to: 185-187


188-188: Initialize mask_value with negative infinity.
Using torch.finfo(value.dtype).min and creating a torch.full for masking is a standard approach for attention masking. The implementation is straightforward and correct.

Also applies to: 190-191
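
For context, the standard pattern looks roughly like this (a generic sketch, not the module's exact code):

import torch

def apply_causal_mask(attn_weights: torch.Tensor, causal_mask: torch.Tensor) -> torch.Tensor:
    """Replace masked-out positions with the most negative representable value."""
    mask_value = torch.finfo(attn_weights.dtype).min
    mask_value = torch.full([], mask_value, dtype=attn_weights.dtype, device=attn_weights.device)
    return torch.where(causal_mask, attn_weights, mask_value)

weights = torch.randn(1, 1, 4, 4)
mask = torch.ones(4, 4).tril().bool().view(1, 1, 4, 4)
print(apply_causal_mask(weights, mask)[0, 0])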


193-196: Casting query and key to match value dtype.
Ensuring query and key share the same floating precision as value helps maintain consistency in operations. This is aligned with certain GPT-Neo/J designs.


199-202: Handling batch_size=1 and padded attention.
Raising an error when attention_mask[0, 0, 0, -1] < -1 for batch_size == 1 appears to be a specific constraint of your “BetterTransformer.” Please verify edge cases to ensure no unexpected user interruptions in multi-pass evaluation.


204-218: Attention mask ignored in single-batch or training mode.
Bypassing attention_mask in this branch could be risky if the model relies on specialized or variable-length masks. Confirm that ignoring the mask for batch_size == 1 or in training is truly the intended behavior.


219-232: Applying causal and external attention masks together.
Combining the causal_mask and attention_mask is valid. Double-check the indexing (key_length - query_length) for scenarios where query_length equals or exceeds key_length.


234-236: Scaled dot product attention for multi-batch scenario.
Using is_causal=False here complements the explicit causal mask above. This separation of responsibilities keeps the logic clearer.
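
For reference, a minimal usage sketch of the fused PyTorch kernel with an explicit boolean mask (shapes are illustrative):

import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 2, 4, 16, 32
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# True means "may attend"; since the causal structure is encoded in the mask, is_causal=False.
mask = torch.ones(seq_len, seq_len).tril().bool()
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=False)
print(out.shape)  # torch.Size([2, 4, 16, 32])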


238-238: Returning None for attention weights.
Providing (sdpa_result, None) makes sense if you do not currently expose attention weights. This is consistent with the rest of the _attn usage.


626-631: Parameter assignment for even vs. odd embedding dimensions.
Truncating cos_div_term for odd embedding sizes handles uneven dimensional splits. Check for corner cases like embedding_dim=1 to ensure no out-of-range errors.


678-680: Exponent-based frequency calculation.
Using exponentials of torch.arange(...) is a standard approach to generate sinusoidal frequencies. This is consistent with well-known positional encoding methods.

Also applies to: 682-683


689-691: Non-trainable parameters for sinusoidal encoding.
Defining sin_div_term and cos_div_term with requires_grad=False ensures static embeddings, closely matching the original sinusoidal approach.

Also applies to: 692-694
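
For comparison, the classic fixed-frequency construction keeps the terms non-trainable by registering them as buffers; a generic sketch, not this module's exact layout:

import math
import torch
import torch.nn as nn

class FixedSinusoidalEncoding(nn.Module):
    """Standard sinusoidal positional encoding with frequencies kept non-trainable."""

    def __init__(self, max_len: int, embedding_dim: int):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1).float()
        div_term = torch.exp(
            torch.arange(0, embedding_dim, 2).float() * (-math.log(10000.0) / embedding_dim)
        )
        pe = torch.zeros(max_len, embedding_dim)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term[: embedding_dim // 2])
        self.register_buffer("pe", pe)  # excluded from gradient updates

    def forward(self, positions: torch.Tensor) -> torch.Tensor:
        return self.pe[positions]

enc = FixedSinusoidalEncoding(max_len=128, embedding_dim=64)
print(enc(torch.arange(10)).shape)  # torch.Size([10, 64])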

EventStream/data/dataset_base.py (8)

21-22: Added polars and loguru imports.
Importing polars for data manipulation and using loguru for structured logging is consistent with your broader logging refactor.

Also applies to: 24-24


187-187: Refined signature for loading subject data.
Passing schema.columns_to_load and subject_id_col=schema.subject_id_col standardizes _load_input_df calls. Check that any previously handled ID-type conversions are still addressed.

Also applies to: 189-189, 191-191


220-220: Consolidated input loading.
Reusing _load_input_df with combined all_columns helps unify data retrieval. Ensure the function handles missing or duplicate columns gracefully.


226-227: Debug logging for filtering.
Adding logger.debug("Filtering") provides traceability when applying conditions in schema.filter_on. This is good for diagnosing potential data coverage issues.


264-264: Appending events and measurements.
Appending the newly processed events and measurements is straightforward. Consider confirming they are valid (non-None) before adding to the lists.

Also applies to: 266-266


290-290: Enhanced logging during load and extraction.
Replacing silent operations or print statements with structured logger.info/logger.debug calls improves observability of the data flow.

Also applies to: 310-310, 329-329, 367-367, 475-475, 479-479


510-511: Post-processing events after assignment.
Applying _agg_by_time and _sort_events when events_df is not None should help maintain chronological correctness. Ensure _validate_initial_dfs is still valid for edge cases.

Also applies to: 513-515, 521-521


578-578: Mandatory subject sets in splitting logic.
Your new mandatory_set_IDs feature ensures some subject IDs are forced into specific splits. Confirm that leftover IDs and fraction-based splits still behave as expected without conflict.

Also applies to: 592-592, 625-634, 635-637

EventStream/logger.py (1)

1-1: Hydra + Loguru integration.
Defining hydra_loguru_init() to place logs in Hydra’s output directory is a neat solution. This improves consistency across scripts.

Also applies to: 3-4, 7-10
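
The helper itself is only a few lines; a minimal sketch of what such a function might look like (the log filename and the use of HydraConfig.get().runtime.output_dir are assumptions):

import os

from hydra.core.hydra_config import HydraConfig
from loguru import logger

def hydra_loguru_init() -> None:
    """Send Loguru output to a log file inside the current Hydra run directory."""
    hydra_path = HydraConfig.get().runtime.output_dir
    logger.add(os.path.join(hydra_path, "main.log"))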

scripts/get_embeddings.py (1)

14-14: Implementation of structured logging looks good.

The addition of the Loguru initialization properly integrates structured logging into the script's execution flow, which is consistent with the codebase-wide refactoring approach described in the PR.

Also applies to: 25-25

scripts/zeroshot.py (1)

15-15: Structured logging implementation looks good.

The Loguru initialization is properly placed at the start of the main function, ensuring logging is set up before any operations are performed. This is consistent with the logging pattern implemented across other scripts.

Also applies to: 26-26

scripts/sklearn_baseline.py (1)

15-15: Consistent logging implementation.

The integration of Loguru-based logging follows the same pattern as other scripts in the codebase, providing consistency in how logging is initialized across the application.

Also applies to: 20-20

scripts/launch_sklearn_baseline_supervised_wandb_hp_sweep.py (1)

18-19: Logging implementation matches the project pattern.

The addition of Loguru initialization at the beginning of the main function follows the established pattern across the codebase, ensuring consistent logging behavior before the main execution flow.

Also applies to: 82-82

scripts/launch_pretraining_wandb_hp_sweep.py (2)

18-18: Well-structured import of the logging initialization function.

The addition of the hydra_loguru_init import from the EventStream.logger module follows good practices, clearly indicating the source of the function.


80-80: Appropriate placement of logging initialization.

The hydra_loguru_init() function is called at the beginning of the main function, ensuring logging is configured before any other operations occur. This is a good practice for enabling comprehensive logging throughout the script's execution.

setup.py (1)

27-27: Script entry point correctly updated.

The script entry has been updated from "scripts/launch_wandb_hp_sweep.py" to "scripts/launch_pretraining_wandb_hp_sweep.py", which aligns with the refactoring and specialization of the script for pretraining tasks.

scripts/pretrain.py (2)

19-19: Appropriate import of logging initialization function.

The import of hydra_loguru_init from EventStream.logger is well-placed and consistent with the logging standardization across the codebase.


30-30: Correct placement of logging initialization.

The hydra_loguru_init() function is called at the beginning of the main function, ensuring logging is properly configured before any other operations. This approach is consistent with the standardized logging strategy being implemented across the codebase.

scripts/finetune.py (2)

18-18: Well-structured import of logging initialization.

The import of hydra_loguru_init from EventStream.logger follows good practices and is consistent with the logging standardization across the codebase.


26-26: Proper placement of logging initialization.

The hydra_loguru_init() call at the beginning of the main function ensures consistent logging configuration before any processing begins. This follows the same pattern implemented across all scripts, creating a unified logging approach.

scripts/launch_finetuning_wandb_hp_sweep.py (2)

18-19: Good addition of structured logging import.

Adding the hydra_loguru_init import supports the transition to structured logging with Loguru.


78-78: Excellent logging initialization.

Calling hydra_loguru_init() at the beginning of the main function ensures proper logging setup before any operations are performed. This will create log files in the Hydra output directory, improving traceability.

scripts/launch_from_scratch_supervised_wandb_hp_sweep.py (2)

18-19: Good addition of structured logging import.

Adding the hydra_loguru_init import supports the transition to structured logging with Loguru.


82-82: Excellent logging initialization.

Calling hydra_loguru_init() at the beginning of the main function ensures proper logging setup before any operations are performed. This will create log files in the Hydra output directory, improving traceability.

EventStream/transformer/config.py (11)

14-14: Good addition of logger import.

Adding Loguru's logger import supports the transition from print statements to structured logging.


575-577: Improved logging for warning messages.

Replacing print statements with logger.warning provides better log structure, level control, and output options.


581-583: Improved logging for warning messages.

Replacing print statements with logger.warning provides better log structure, level control, and output options.


629-633: Improved logging for configuration warnings.

Using logger.warning instead of print statements for configuration parameter warnings.


636-640: Improved logging for configuration warnings.

Using logger.warning instead of print statements for configuration parameter warnings.


643-648: Improved logging for configuration warnings.

Using logger.warning instead of print statements for configuration parameter warnings.


651-653: Improved logging for configuration warnings.

Using logger.warning instead of print statements for configuration parameter warnings.


656-658: Improved logging for configuration warnings.

Using logger.warning instead of print statements for configuration parameter warnings.


759-764: Improved logging for TTE configuration warnings.

Using logger.warning instead of print statements for Time-To-Event configuration parameter warnings.


767-771: Improved logging for TTE configuration warnings.

Using logger.warning instead of print statements for Time-To-Event configuration parameter warnings.


774-778: Improved logging for TTE configuration warnings.

Using logger.warning instead of print statements for Time-To-Event configuration parameter warnings.

scripts/build_dataset.py (6)

18-18: Good addition of logger import.

Adding Loguru's logger import supports the transition from print statements to structured logging.


34-34: Good addition of hydra_loguru_init import.

This import brings in the Hydra-compatible logger initialization function.


54-58: Updated docstring example for logger usage.

The docstring examples have been updated to demonstrate the use of Loguru logger instead of print statements, which provides better guidance for developers.


74-74: Improved warning logging.

Replaced print statement with logger.warning for clearer and more structured warning messages.


83-84: Excellent logging initialization.

Calling hydra_loguru_init() at the beginning of the main function ensures proper logging setup before any operations are performed. This will create log files in the Hydra output directory, improving traceability.


365-365: Improved informational logging.

Replaced print statement with logger.info for providing information about omitted configuration parameters, which improves log consistency and structure.

docs/MIMIC_IV_tutorial/data_extraction_processing.md (3)

24-25: Updated repository name from EFGPT to ESGPT.

The repository naming has been consistently updated throughout the document, reflecting the project's new name.


27-33: Updated command line references to use ESGPT_PATH.

The commands now correctly reference the updated environment variable ESGPT_PATH instead of the previous EFGPT_PATH, maintaining consistency with the project renaming.


46-47: Updated Hydra configuration reference to use ESGPT.

The Hydra configuration now correctly points to the ESGPT repository's configs path, completing the renaming update throughout the document.

scripts/prepare_pretrain_subsets.py (4)

24-28: Added required imports for structured logging.

The addition of the loguru logger and the custom hydra_loguru_init utility function enables structured logging throughout the script, which is a significant improvement over using print statements.


33-33: Initialized structured logging system.

Added initialization of the Loguru logging system via hydra_loguru_init() at the beginning of the main function, ensuring consistent logging configuration throughout script execution.


63-63: Replaced print with structured logging for experiment directory.

Using logger.info() instead of print() provides better log management, including timestamps, log levels, and potential filtering capabilities.


255-255: Replaced print with structured logging for command file writing.

Using logger.info() instead of print() for notifying about command file creation ensures consistent log formatting and improves observability of the script's execution.

EventStream/baseline/FT_task_baseline.py (7)

19-19: Added import for structured logging.

The addition of the loguru logger import enables structured logging throughout the file, replacing print statements with more flexible and powerful logging capabilities.


39-39: Enabled string caching for polars.

Added pl.enable_string_cache() to improve performance when working with string columns in polars dataframes, which is especially beneficial for repeated operations on the same string values.


191-191: Added explicit column selection after caching filtered task.

This new line selects specific columns ("subject_id", "timestamp", and window features) after caching the filtered task data, which ensures consistent column ordering and potentially reduces memory usage by excluding unnecessary columns.


654-654: Replaced print with structured logging for config saving.

Using logger.info() instead of print() for configuration saving notification provides better log management, including timestamps and log levels.


679-679: Replaced print with structured logging for representations loading.

Using logger.info() instead of print() for loading representations notification improves log consistency and management.


711-718: Replaced print with structured logging for dataset loading.

Using logger.info() instead of print() for dataset loading notification improves observability and log management. The multi-line format maintains readability while providing structured logging benefits.


721-730: Replaced print statements with structured logging for model lifecycle.

Converting all model initialization, fitting, saving, and evaluation print statements to logger.info() calls creates a consistent logging pattern throughout the model lifecycle, improving debugging and monitoring capabilities.

EventStream/transformer/lightning_modules/fine_tuning.py (7)

17-17: Added import for structured logging.

The addition of the loguru logger import enables structured logging throughout the file, replacing print statements with more flexible and configurable logging capabilities.


187-190: Enhanced error reporting with structured logging.

Replaced print statement with logger.error() for metric computation failures, which provides better visibility for errors and maintains consistent logging practices throughout the codebase.


400-403: Improved warning visibility with structured logging.

Using logger.warning() instead of print for notifying about missing train subset seed provides better log categorization and filtering capabilities, making it easier to identify potential configuration issues.


406-443: Replaced print statements with structured logging for configuration handling.

Replaced multiple print statements with appropriate logger.info() and logger.warning() calls for loading and overwriting configuration parameters, providing consistent logging throughout the configuration management process.


470-476: Enhanced configuration file saving with structured logging.

Using logger.info() instead of print for configuration file saving operations improves logging consistency and enhances visibility of file operations.


492-492: Updated commented-out code to use structured logging.

Even in commented-out code, print statements were updated to use logger.info(), maintaining consistency throughout the codebase for future use when this code is uncommented.


579-579: Enhanced metrics saving notification with structured logging.

Using logger.info() instead of print for final metrics saving notification maintains consistent logging practices throughout the file.

scripts/convert_to_ESDS.py (6)

4-9: Good practice using stackprinter with a graceful fallback.

The script correctly implements a try-except block for the stackprinter import, allowing the script to continue running without this development dependency. This is a good pattern for optional dependencies.


26-31: Well-structured configuration class with appropriate type hints.

The ConversionConfig class is properly decorated with @hydra_dataclass and includes clear type hints with union types for string or Path parameters. The default values are sensible for optional parameters.


33-37: Good use of post-initialization to standardize path types.

The __post_init__ method ensures consistency by converting string path arguments to Path objects, which will make path operations more reliable throughout the code.
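
A generic sketch of the pattern with hypothetical field names (not the actual ConversionConfig definition):

from dataclasses import dataclass
from pathlib import Path

@dataclass
class PathsConfig:
    dataset_dir: str | Path
    output_dir: str | Path

    def __post_init__(self):
        # Hydra overrides arrive as strings; normalize them to Path objects up front.
        if isinstance(self.dataset_dir, str):
            self.dataset_dir = Path(self.dataset_dir)
        if isinstance(self.output_dir, str):
            self.output_dir = Path(self.output_dir)

cfg = PathsConfig(dataset_dir="data/raw", output_dir=Path("data/esds"))
assert isinstance(cfg.dataset_dir, Path)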


40-46: Configuration setup and validation is implemented correctly.

The main function is properly decorated with Hydra's decorator, initializes logging, and handles configuration instantiation if needed. The type check ensures the configuration is of the expected type.


47-55: Thorough check for existing files with clear error messages.

The script checks if output files already exist and either raises a descriptive error or handles overwriting as specified in the config. Good use of f-strings in the error messages for clarity.


58-73: Well-implemented dataset processing with progress tracking.

The code effectively:

  1. Loads the dataset
  2. Splits subjects into configurable chunk sizes
  3. Uses tqdm for progress tracking (both for the overall process and individual chunks)
  4. Saves each chunk as a Parquet file with properly nested directories

The approach is scalable and user-friendly with its progress visualization.
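
A simplified sketch of that flow (the dataframe layout, chunk size, and shard naming are assumptions; the real script works from the loaded Dataset object and its configuration):

from pathlib import Path

import polars as pl
from tqdm import tqdm

def write_in_chunks(df: pl.DataFrame, out_dir: Path, chunk_size: int = 2) -> None:
    """Split subjects into fixed-size chunks and write one Parquet file per chunk."""
    subjects = df.get_column("subject_id").unique().sort().to_list()
    chunks = [subjects[i : i + chunk_size] for i in range(0, len(subjects), chunk_size)]
    for i, chunk in enumerate(tqdm(chunks, desc="Writing shards")):
        shard_dir = out_dir / f"shard_{i}"
        shard_dir.mkdir(parents=True, exist_ok=True)
        df.filter(pl.col("subject_id").is_in(chunk)).write_parquet(shard_dir / "data.parquet")

write_in_chunks(
    pl.DataFrame({"subject_id": [1, 2, 3, 4], "value": [0.1, 0.2, 0.3, 0.4]}),
    Path("esds_out"),
)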

tests/data/test_dataset_base.py (2)

259-277: Good addition of test case for mandatory subject IDs.

This new test comprehensively verifies that the split functionality correctly handles mandatory subject IDs, ensuring they're assigned to the specified splits regardless of the random seed. The test asserts all key aspects:

  1. The correct keys in the split
  2. All subject IDs being accounted for
  3. Mandatory sets appearing in the correct keys
  4. Expected sizes of each split

491-498: Updated measurement metadata structure aligns with new implementation.

The change from outlier_model and normalizer columns to mean, std, thresh_small, and thresh_large reflects a shift in how measurement statistics are stored and used throughout the codebase. This update ensures the test correctly validates the new structure.
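For reference, a toy illustration of what the new metadata columns look like; the values are invented.

import polars as pl

# Hypothetical per-measurement statistics in the new column layout.
measurement_metadata = pl.DataFrame(
    {
        "mean": [98.6],
        "std": [0.7],
        "thresh_small": [95.0],
        "thresh_large": [104.0],
    }
)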

tests/data/test_config.py (6)

183-190: Updated measurement metadata structure is consistent with other changes.

The replacement of outlier_model and normalizer columns with mean, std, thresh_small, and thresh_large maintains consistency with the changes in other parts of the codebase. This ensures all tests properly validate the new metadata structure.


204-206: Consistent metadata structure update in Series-based metadata.

The change to the Series-based measurement metadata maintains the same pattern, ensuring that all metadata-related code and tests are using the same structure regardless of whether the metadata is stored in a DataFrame or Series.


253-257: Simplified univariate metadata structure.

The univariate measurement metadata has been simplified to only include the mean value, removing the previous dictionary-based structure that included outlier model and normalizer. This aligns with the broader changes to metadata handling.


317-324: Consistent metadata update in empty metadata tests.

The same pattern of replacing outlier_model and normalizer with mean, std, thresh_small, and thresh_large is applied to the tests for empty metadata initialization, ensuring consistency across all test cases.


341-343: Maintained consistency in test for functional time-dependent measurement metadata.

The Series-based metadata for functional time-dependent measurements is updated with the same structure, ensuring that all test cases reflect the changes to metadata formatting throughout the codebase.


454-455: Simplified outlier detector configuration.

The test for valid parameter validation now uses an empty dictionary for outlier_detector_config instead of a complex structure with a cls key, reflecting a simplification in how outlier detection is configured.

tests/data/test_pytorch_dataset.py (8)

11-13: Updated imports to use ragged tensor implementation.

The import of JointNestedRaggedTensorDict from nested_ragged_tensors.ragged_numpy replaces the previous direct import of torch, indicating a shift to a more specialized tensor handling implementation.


56-59: Added time delta calculations for better event sequence representation.

The addition of time delta calculations for each subject provides a more complete representation of event sequences, which is valuable for time-series analysis. The approach correctly handles the calculation of differences between consecutive event times and appends NaN for the last event.

Also applies to: 66-69, 77-79
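A small illustration of that calculation, with made-up event times in minutes:

import numpy as np

event_times_min = np.array([0.0, 15.0, 75.0, 80.0])  # one subject's event times
time_delta = np.append(np.diff(event_times_min), np.nan)  # delta to the next event; NaN for the last
# time_delta -> array([15., 60., 5., nan])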


85-86: Updated DataFrame to include time delta information and handle empty cases.

The DataFrame structure now includes the time_delta field and properly handles empty cases with empty lists. This allows the test to verify time delta handling behavior when no events exist.


315-359: Improved test setup with proper isolation and resource management.

The addition of setUp method initializes a temporary directory and prepares all necessary test files and directories. This is a significant improvement in test organization and isolation:

  1. Creates temporary directories that will be automatically cleaned up
  2. Prepares fake split data with proper directory structure
  3. Writes necessary configuration and data files
  4. Ensures tests run in an isolated environment

360-361: Proper resource cleanup in tearDown.

The addition of a tearDown method ensures that temporary resources are properly cleaned up after test execution, preventing resource leaks and maintaining test isolation.
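A minimal sketch of this setUp/tearDown isolation pattern, using a placeholder subdirectory rather than the repository's actual layout:

import tempfile
import unittest
from pathlib import Path


class ExampleIsolatedTest(unittest.TestCase):
    def setUp(self):
        # Fresh temporary directory per test; removed again in tearDown.
        self._tmpdir = tempfile.TemporaryDirectory()
        self.root = Path(self._tmpdir.name)
        (self.root / "task_dfs").mkdir(parents=True)

    def tearDown(self):
        self._tmpdir.cleanup()

    def test_directory_exists(self):
        self.assertTrue((self.root / "task_dfs").is_dir())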


363-374: Refactored get_pyd method for better reusability.

The get_pyd method has been refactored to use the temporary directory created in setUp and accept optional parameters. This makes the method more flexible and reusable across different test cases.


421-424: Improved assertion with better error messages.

The assertion in test_normalize_task has been updated to use assertTrue with a detailed error message that shows both the expected and actual values. This will make debugging failed tests much easier.


427-427: Updated test initialization to use refactored get_pyd method.

The test now uses the refactored get_pyd method with appropriate parameters, which is consistent with the other changes in the file.

EventStream/data/pytorch_dataset.py (7)

8-19: Replace print statements with structured logging.

These import changes align well with the switch to Loguru-based logging, improving flexibility and configurability over plain print statements.


57-57: Check for off-by-one or null edge cases.

The usage of unique().sort().search_sorted() is correct for building indices, but be cautious about columns with extremely large or all-null data. Ensure downstream code handles columns with no valid data gracefully.

Could you confirm if there's an upstream validation that ensures col contains at least some valid entries (non-null) before calling to_int_index?


61-61: Appropriate class inheritance.

Inheriting from both SeedableMixin and torch.utils.data.Dataset is a good approach for handling reproducible random operations. No concerns here.


295-320: Extend unit testing for newly introduced logic.

The new task logic (like normalize_task and stored vocabs) is well structured. Ensure this logic is thoroughly tested, particularly handling corner cases (e.g., empty or invalid columns).

Do you have tests covering tasks with missing labels or unexpected data types?


511-547: Robust dynamic-only collate function.

Logic for splitting dense vs. dynamic fields and converting each data type is sound. The usage of masks and .astype(...) conversions is consistent with torch's type system.


549-555: Conditionally included fields.

Enforcing checks before accessing collated fields (e.g., start_time, subject_id) is good practice. Confirm that the calling site sets the config flags appropriately to avoid potential key errors when fields aren't in the batch.

Would you like to run checks ensuring no key misalignment can occur if external code sets these flags incorrectly?


563-571: Explicit task handling.

Using Python's structural pattern matching for task_types is clean and explicit. The _ branch raises a NotImplementedError, preventing silent failures.
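A hedged sketch of that dispatch style; the task-type names and conversions here are illustrative, not the module's actual branches.

def collate_label(task_type: str, value):
    match task_type:
        case "binary_classification":
            return bool(value)
        case "multi_class_classification":
            return int(value)
        case "regression":
            return float(value)
        case _:
            # Fail loudly on unknown task types instead of passing them through silently.
            raise NotImplementedError(f"Unsupported task type: {task_type}")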

EventStream/data/config.py (12)

7-8: Enhanced tooling imports.

Imports for hashlib and json appear consistent with the new caching features and parameter hashing. No concerns here.


19-19: Loguru-based logging integration.

Replacing print statements with logger calls is consistent with the broader codebase shift, consolidating logging.


809-812: New subset parameters for tuning data.

Adding tuning_subset_size and tuning_subset_seed extends dataset configurability. Validate via the same logic used for train_subset_size to avoid mismatch or unexpected seeds.


896-907: Caching with random vs. non-random subsequences.

The logic disallowing multi-epoch caching for non-random sampling is helpful to avoid stale subsequences. The exception message is clear.


929-931: Auto-generation of seeds.

Randomly assigning subset seeds helps reproducibility. Confirm that logs clearly inform the user of the assigned seed for easy reference.


939-953: Consistent logic for tuning subset size.

Validation and fallback behavior here mirror the training subset logic. This consistency is excellent for maintainability and clarity.


967-974: Vocabulary config paths.

Accessible property-based approach for configuration paths is neatly organized. Ensure offline usage of these files handles missing file scenarios gracefully.

Do you handle missing or corrupted JSON gracefully in VocabularyConfig.from_json_file?
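One way such a guard could look, sketched with a hypothetical helper rather than the library's actual from_json_file implementation:

import json
from pathlib import Path


def load_vocab_config_dict(path: Path) -> dict:
    if not path.is_file():
        raise FileNotFoundError(f"Vocabulary config not found at {path}")
    try:
        return json.loads(path.read_text())
    except json.JSONDecodeError as e:
        raise ValueError(f"Vocabulary config at {path} is not valid JSON") from e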


989-995: Derived paths for caching task dataframes.

Conditional logic is straightforward. If task_df_name is None, returning None is a good design to reduce confusion.


997-1002: Safely constructing file paths.

This ensures only relevant paths are used when a task name is present. It is consistent with the logic in .cached_task_dir.


1004-1009: Task info file path.

Returning None if task_df_name is not set is consistent with the property approach used elsewhere.


1051-1057: Safely persisting data parameters.

Storing data parameters as JSON fosters reproducibility. Ensure partial writes or file permission errors are handled gracefully to avoid a broken cache.

Are you handling potential file I/O exceptions anywhere in the codebase?
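One defensive option is an atomic write through a temporary file, sketched here under the assumption that the parameters are JSON-serializable; the function name is hypothetical.

import json
from pathlib import Path


def save_params_atomically(params: dict, path: Path) -> None:
    tmp_path = path.with_suffix(path.suffix + ".tmp")
    tmp_path.write_text(json.dumps(params, indent=2))
    # Path.replace is atomic on POSIX filesystems, so readers never observe a partial file.
    tmp_path.replace(path)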


1059-1062: Tensorized cache discovery.

Glob-based scanning for pre-collated NPZ files is a neat approach. Implementation is straightforward and flexible.
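For illustration, the discovery step amounts to something like the following, assuming the shards live directly in the cache directory:

from pathlib import Path


def find_cached_shards(cache_dir: Path) -> list[Path]:
    # Sorted so shard ordering is deterministic across runs.
    return sorted(cache_dir.glob("*.npz"))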

# raising the below exception to avoid multirun failure
-print(f"EXCEPTION: {ex}")
-print(traceback.print_exc(), file=sys.stderr)
+logger.error(f"EXCEPTION: {ex}\nTRACEBACK:\n{traceback.print_exc()}")

⚠️ Potential issue

Fix potential issue with traceback handling in logging.

The current implementation uses traceback.print_exc() within an f-string, but this function prints to stderr and returns None. This will result in the string "None" being included in your log message instead of the actual traceback.

Replace with traceback.format_exc() to properly capture the traceback as a string:

-logger.error(f"EXCEPTION: {ex}\nTRACEBACK:\n{traceback.print_exc()}")
+logger.error(f"EXCEPTION: {ex}\nTRACEBACK:\n{traceback.format_exc()}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-logger.error(f"EXCEPTION: {ex}\nTRACEBACK:\n{traceback.print_exc()}")
+logger.error(f"EXCEPTION: {ex}\nTRACEBACK:\n{traceback.format_exc()}")

Comment on lines +210 to 222
mean = float(measurement_metadata["mean"]) if "mean" in measurement_metadata else 0
std = float(measurement_metadata["std"]) if "std" in measurement_metadata else 1

if "thresh_large" in measurement_metadata:
    thresh_large = float(measurement_metadata["thresh_large"])
else:
    thresh_large = float("inf")

-thresh_large = measurement_metadata["outlier_model"]["thresh_large_"]
-thresh_small = measurement_metadata["outlier_model"]["thresh_small_"]
if "thresh_small" in measurement_metadata:
    thresh_small = float(measurement_metadata["thresh_small"])
else:
    thresh_small = float("-inf")


💡 Verification agent

🧩 Analysis chain

Handle zero standard deviation case.

While defaulting mean and std to 0 and 1 respectively is fine, watch out for scenarios where std might actually be 0 in real data, which can cause divisions by zero downstream.

Run a quick check across the codebase to ensure that zero std is handled or flagged appropriately.


🏁 Script executed:

#!/bin/bash
# Search for code paths that do 1/std or similar
rg --context 3 '1\s*/\s*std|\/\s*std'

Length of output: 138


ACTION: Verify Division-by-Zero Handling for std

The defaults of mean and std (0 and 1, respectively) are acceptable only if real data never results in a zero standard deviation. However, if std comes in as 0, downstream operations—especially divisions using std—can lead to runtime errors. Please verify that every code path performing a division by std has an appropriate check or fallback to avoid a division-by-zero error.

To facilitate this review, I've prepared a new verification script that searches the repository for any use of std in division contexts. Please run the following script:

If the command finds any divisions by std without proper zero-checks, then please update the relevant sections to handle the zero case (for example, by checking if std is 0 and providing a safe alternative).


🏁 Script executed:

#!/bin/bash
rg --context 3 '/\s*std'

Length of output: 1641


ACTION REQUIRED: Handle Division-by-Zero on Standard Deviation

It appears that the file EventStream/data/time_dependent_functor.py contains a division by std without checking if it's zero (specifically, in the line new_age = (new_age - mean) / std). This poses a risk of division-by-zero errors if measurement_metadata["std"] is provided as zero.

  • Location: EventStream/data/time_dependent_functor.py
    • Affected Line: Calculation line new_age = (new_age - mean) / std
  • Suggestion: Add a condition to check if std is zero (or near zero), and either raise an error or set a safe fallback to prevent runtime exceptions.
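A minimal sketch of such a guard, reusing the variable names from time_dependent_functor.py; the epsilon and the centering-only fallback are suggestions, not existing code.

def normalize_age(new_age: float, mean: float, std: float, eps: float = 1e-8) -> float:
    if abs(std) < eps:
        # Degenerate standard deviation: center only rather than divide by ~0.
        return new_age - mean
    return (new_age - mean) / std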

Comment on lines +1648 to +1666
    .name.map(lambda c: f"dynamic/{m}/{c.replace(key_prefix, '')}/count"),
    (
        (cs.starts_with(val_prefix).is_not_null() & cs.starts_with(val_prefix).is_not_nan())
        .sum()
-       .map_alias(lambda c: f"dynamic/{m}/{c.replace(val_prefix, '')}/has_values_count")
+       .name.map(lambda c: f"dynamic/{m}/{c.replace(val_prefix, '')}/has_values_count")
    ),
-   val_col.sum().map_alias(lambda c: f"dynamic/{m}/{c.replace(val_prefix, '')}/sum"),
+   val_col.sum().name.map(lambda c: f"dynamic/{m}/{c.replace(val_prefix, '')}/sum"),
    (val_col**2)
    .sum()
-   .map_alias(lambda c: f"dynamic/{m}/{c.replace(val_prefix, '')}/sum_sqd"),
-   val_col.min().map_alias(lambda c: f"dynamic/{m}/{c.replace(val_prefix, '')}/min"),
-   val_col.max().map_alias(lambda c: f"dynamic/{m}/{c.replace(val_prefix, '')}/max"),
+   .name.map(lambda c: f"dynamic/{m}/{c.replace(val_prefix, '')}/sum_sqd"),
+   val_col.min().name.map(lambda c: f"dynamic/{m}/{c.replace(val_prefix, '')}/min"),
+   val_col.max().name.map(lambda c: f"dynamic/{m}/{c.replace(val_prefix, '')}/max"),
]
else:
    column_cols = [m]
    values_cols = [m]
    select_cols = [m]
    aggs = [
-       pl.all()
-       .is_not_null()
-       .sum()
-       .cast(count_type)
-       .map_alias(lambda c: f"dynamic/{m}/{c}/count")
+       pl.all().is_not_null().sum().cast(count_type).name.map(lambda c: f"dynamic/{m}/{c}/count")

⚠️ Potential issue

Lambda closure capturing loop variable in dynamic aggregator.

Using m inside .name.map(lambda c: ...) within a loop is risky: Python closures bind loop variables late, so each lambda may end up seeing the final value of m rather than the value from its own iteration. Consider defining a small helper function or using a default-argument trick to bind m at definition time.

- .name.map(lambda c: f"dynamic/{m}/{c.replace(key_prefix, '')}/count")
+ def make_col_name(prefix_m):
+     return lambda c: f"dynamic/{prefix_m}/{c.replace(key_prefix, '')}/count"
+ .name.map(make_col_name(m))

Also applies to: 1652-1652, 1654-1654, 1657-1657, 1658-1658, 1659-1659, 1666-1666
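A small self-contained illustration of the default-argument binding mentioned above; the measurement names and prefixes are invented, not the dataset code.

renamers = []
for m, key_prefix in [("lab", "lab/"), ("vital", "vital/")]:
    # Default arguments are evaluated at definition time, so each function keeps
    # its own m and key_prefix instead of the loop's final values.
    def rename_count(c: str, m: str = m, key_prefix: str = key_prefix) -> str:
        return f"dynamic/{m}/{c.replace(key_prefix, '')}/count"

    renamers.append(rename_count)

print(renamers[0]("lab/HR"))    # dynamic/lab/HR/count
print(renamers[1]("vital/BP"))  # dynamic/vital/BP/count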

🧰 Tools
🪛 Ruff (0.8.2)

1648, 1652, 1654, 1657, 1658, 1659, 1666: Function definition does not bind loop variable m (B023)
1648: Function definition does not bind loop variable key_prefix (B023)
1652, 1654, 1657, 1658, 1659: Function definition does not bind loop variable val_prefix (B023)

.alias("event_id")
)
.with_columns(
"event_id",

💡 Verification agent

🧩 Analysis chain

Potential no-op in with_columns.

The string literal "event_id" alone doesn’t look like a valid expression. Confirm it’s intended or specify an expression if needed.


🏁 Script executed:

#!/bin/bash
rg -A 2 '\.with_columns\(\s*"event_id"'

Length of output: 41


I'm not seeing enough context from the initial search results. Let's gather more details about how "event_id" is used within the .with_columns() call. Please run the following shell script to print lines 660–680 from the file EventStream/data/dataset_polars.py so we can better understand the surrounding code context:

Once we have that output, we'll be in a better position to verify if "event_id" is intended as-is, or if it should be replaced with a valid expression (e.g., using a column accessor like pl.col("event_id")).


🏁 Script executed:

#!/bin/bash
sed -n '660,680p' EventStream/data/dataset_polars.py

Length of output: 812


Action Required: Update the expression in .with_columns

The current usage:

.with_columns(
    "event_id",
    pl.col("event_type")
      .list.eval(pl.col("").cast(pl.Utf8))
      .list.join("&")
      .cast(pl.Categorical)
      .alias("event_type"),
)

appears problematic because the string literal "event_id" does not create a valid expression—it will likely insert a constant string rather than referencing the existing column. If the intention is to keep the existing "event_id" column unchanged, please replace the literal with an expression, for example using pl.col("event_id").

Please update this as follows:

.with_columns(
    pl.col("event_id"),  # Referencing the existing column
    pl.col("event_type")
      .list.eval(pl.col("").cast(pl.Utf8))
      .list.join("&")
      .cast(pl.Categorical)
      .alias("event_type"),
)

Comment on lines +260 to +261
if events is None:
logger.warning(f"Empty new events dataframe of type {event_type}!")

🛠️ Refactor suggestion

Switched from error to warning for empty events.
Allowing the pipeline to continue if no events exist may be beneficial for robust data handling, but verify that downstream logic gracefully handles empty frames.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (15)
EventStream/logger.py (1)

7-10: Consider adding additional logging configuration options

While the current implementation works well, you might want to consider adding more configuration options for the logger, such as:

  • Log rotation settings
  • Log level configuration
  • Formatting options
  • Error handling for cases where Hydra config isn't available

This would make the logging functionality more flexible and robust for different usage scenarios.

 def hydra_loguru_init() -> None:
     """Must be called from a hydra main!"""
     hydra_path = hydra.core.hydra_config.HydraConfig.get().runtime.output_dir
-    log.add(os.path.join(hydra_path, "main.log"))
+    log.add(
+        os.path.join(hydra_path, "main.log"),
+        rotation="10 MB",
+        level="INFO",
+        format="{time} | {level} | {message}",
+    )
sample_data/generate_synthetic_data.py (5)

118-119: Fix multiple statements on a single line

There are several instances of multiple statements on a single line using colons, which violates PEP 8 style guidelines.

-                if running_HR < 30: running_HR = 30
-                if running_HR > 300: running_HR = 300
+                if running_HR < 30:
+                    running_HR = 30
+                if running_HR > 300:
+                    running_HR = 300
🧰 Tools
🪛 Ruff (0.8.2)

118-118: Multiple statements on one line (colon)

(E701)


119-119: Multiple statements on one line (colon)

(E701)


122-123: Fix multiple statements on a single line

Similar to the previous comment, these lines also have multiple statements on a single line.

-                if running_temp < 95: running_temp = 95
-                if running_temp > 104: running_temp = 104
+                if running_temp < 95:
+                    running_temp = 95
+                if running_temp > 104:
+                    running_temp = 104
🧰 Tools
🪛 Ruff (0.8.2)

122-122: Multiple statements on one line (colon)

(E701)


123-123: Multiple statements on one line (colon)

(E701)


190-190: Remove unused variable assignment

The variable months is defined but never used in this function.

-    months = 30 * days

Similar unused variable also appears at line 250.

🧰 Tools
🪛 Ruff (0.8.2)

190-190: Local variable months is assigned to but never used

Remove assignment to unused variable months

(F841)


193-193: Remove unused variable assignment

The variable lab_ps is assigned but never used in this function.

-        lab_ps = np.random.dirichlet(alpha=[0.1 for _ in range(5)])
🧰 Tools
🪛 Ruff (0.8.2)

193-193: Local variable lab_ps is assigned to but never used

Remove assignment to unused variable lab_ps

(F841)


212-212: Simplify dictionary key lookup

When checking if a key exists in a dictionary, it's more efficient to use key in dict instead of key in dict.keys().

-            for lab in base_lab_gaps.keys():
+            for lab in base_lab_gaps:
🧰 Tools
🪛 Ruff (0.8.2)

212-212: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

docs/MIMIC_IV_tutorial/data_extraction_processing.md (1)

28-33: Updated command examples to use ESGPT_PATH and fix hard tabs.

The command examples have been updated to reference the new ESGPT_PATH variable name. However, the markdown contains hard tabs which should be replaced with spaces for better consistency.

-PYTHONPATH="$ESGPT_PATH:$PYTHONPATH" python \
-	$ESGPT_PATH/scripts/build_dataset.py \
-	--config-path=$(pwd)/configs \
-	--config-name=dataset \
-	"hydra.searchpath=[$ESGPT_PATH/configs]" [configuration args...]
+PYTHONPATH="$ESGPT_PATH:$PYTHONPATH" python \
+    $ESGPT_PATH/scripts/build_dataset.py \
+    --config-path=$(pwd)/configs \
+    --config-name=dataset \
+    "hydra.searchpath=[$ESGPT_PATH/configs]" [configuration args...]
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

29-29: Hard tabs
Column: 1

(MD010, no-hard-tabs)


30-30: Hard tabs
Column: 1

(MD010, no-hard-tabs)


31-31: Hard tabs
Column: 1

(MD010, no-hard-tabs)


32-32: Hard tabs
Column: 1

(MD010, no-hard-tabs)

EventStream/data/config.py (1)

939-954: Use logger.warning instead of print.
When the tuning subset seed is auto-assigned, switch from print to a logger statement for a consistent logging style:

- print(f"WARNING! tuning_subset_size is set, but tuning_subset_seed is not. Setting to {seed}")
+ logger.warning(f"tuning_subset_size is set, but tuning_subset_seed is not. Setting to {seed}")
scripts/convert_to_ESDS.py (3)

4-10: Consider using a more robust error handling approach for stackprinter.

The current implementation silently passes if the stackprinter import fails. While this prevents failing on a missing dev dependency, it might make debugging harder since you lose enhanced error tracing without any notification.

try:
    import stackprinter

    stackprinter.set_excepthook(style="darkbg2")
except ImportError:
-    pass  # no need to fail because of missing dev dependency
+    import warnings
+    warnings.warn("stackprinter not available; using standard exception handling")

34-37: Use isinstance() instead of type() is for type checking.

Using a type(...) is comparison for type checking is less idiomatic and more brittle than isinstance(), which also handles subclasses appropriately.

-        if type(self.dataset_dir) is str:
+        if isinstance(self.dataset_dir, str):
            self.dataset_dir = Path(self.dataset_dir)
-        if type(self.ESDS_save_dir) is str:
+        if isinstance(self.ESDS_save_dir, str):
            self.ESDS_save_dir = Path(self.ESDS_save_dir)

44-46: Add more details to handle different configuration types.

The conversion from Hydra config to your custom config class is handled, but there's no information about what other config types might be expected or what might cause this conversion to be needed.

    if type(cfg) is not ConversionConfig:
+        # Hydra might return a generic OmegaConf object, convert it to our specific config class
        cfg = hydra.utils.instantiate(cfg, _convert_="object")
tests/data/test_pytorch_dataset.py (1)

421-424: Use more descriptive assertion messages.

While the assertion includes output values, a more descriptive message would make test failures easier to understand.

                    self.assertTrue(
                        (got_vals == want_vals).all(),
-                       f"want_vals:\n{want_vals.to_pandas()}\ngot_vals:\n{got_vals.to_pandas()}",
+                       f"Values mismatch for {task} normalization. \nExpected:\n{want_vals.to_pandas()}\nGot:\n{got_vals.to_pandas()}",
                    )
tests/test_e2e_runs.py (1)

48-64: Improved test command execution with subtest support.

The _test_command method has been enhanced to optionally execute commands within a subtest context, which provides better organization of test results. However, there is code duplication between the subtest and non-subtest cases.

def _test_command(self, command_parts: list[str], case_name: str, use_subtest: bool = True):
+    def run_and_check_command():
+        command_out = subprocess.run(" ".join(command_parts), shell=True, capture_output=True)
+        stderr = command_out.stderr.decode()
+        stdout = command_out.stdout.decode()
+        self.assertEqual(
+            command_out.returncode, 0, f"Command errored!\nstderr:\n{stderr}\nstdout:\n{stdout}"
+        )
+        return stdout, stderr
+
    if use_subtest:
        with self.subTest(case_name):
-            command_out = subprocess.run(" ".join(command_parts), shell=True, capture_output=True)
-            stderr = command_out.stderr.decode()
-            stdout = command_out.stdout.decode()
-            self.assertEqual(
-                command_out.returncode, 0, f"Command errored!\nstderr:\n{stderr}\nstdout:\n{stdout}"
-            )
+            return run_and_check_command()
    else:
-        command_out = subprocess.run(" ".join(command_parts), shell=True, capture_output=True)
-        stderr = command_out.stderr.decode()
-        stdout = command_out.stdout.decode()
-        self.assertEqual(
-            command_out.returncode, 0, f"Command errored!\nstderr:\n{stderr}\nstdout:\n{stdout}"
-        )
+        return run_and_check_command()
EventStream/data/dataset_polars.py (2)

646-646: Grouping by unique timestamps
Be mindful of performance if numerous timestamps exist; large data might require chunked approaches.


670-670: Concatenating event types
Joining multiple event types with '&' may complicate parsing. Consider a more structured approach or array-like storage.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7680461 and ce9e2c3.

⛔ Files ignored due to path filters (7)
  • .github/workflows/tests.yml is excluded by !**/*.yml
  • configs/dataset_base.yaml is excluded by !**/*.yaml
  • configs/normalizer_config/standard_scaler.yaml is excluded by !**/*.yaml
  • configs/outlier_detector_config/stddev_cutoff.yaml is excluded by !**/*.yaml
  • pyproject.toml is excluded by !**/*.toml
  • sample_data/dataset.yaml is excluded by !**/*.yaml
  • sample_data/pretrain_NA.yaml is excluded by !**/*.yaml
📒 Files selected for processing (52)
  • EventStream/baseline/FT_task_baseline.py (6 hunks)
  • EventStream/data/README.md (2 hunks)
  • EventStream/data/config.py (14 hunks)
  • EventStream/data/dataset_base.py (21 hunks)
  • EventStream/data/dataset_polars.py (44 hunks)
  • EventStream/data/preprocessing/README.md (0 hunks)
  • EventStream/data/preprocessing/__init__.py (0 hunks)
  • EventStream/data/preprocessing/preprocessor.py (0 hunks)
  • EventStream/data/preprocessing/standard_scaler.py (0 hunks)
  • EventStream/data/preprocessing/stddev_cutoff.py (0 hunks)
  • EventStream/data/pytorch_dataset.py (9 hunks)
  • EventStream/data/time_dependent_functor.py (3 hunks)
  • EventStream/data/types.py (1 hunks)
  • EventStream/data/visualize.py (7 hunks)
  • EventStream/evaluation/MCF_evaluation.py (0 hunks)
  • EventStream/evaluation/general_generative_evaluation.py (0 hunks)
  • EventStream/logger.py (1 hunks)
  • EventStream/tasks/profile.py (4 hunks)
  • EventStream/transformer/config.py (5 hunks)
  • EventStream/transformer/lightning_modules/embedding.py (2 hunks)
  • EventStream/transformer/lightning_modules/fine_tuning.py (7 hunks)
  • EventStream/transformer/lightning_modules/generative_modeling.py (6 hunks)
  • EventStream/transformer/lightning_modules/zero_shot_evaluator.py (3 hunks)
  • EventStream/transformer/model_output.py (7 hunks)
  • EventStream/transformer/transformer.py (4 hunks)
  • EventStream/utils.py (2 hunks)
  • README.md (1 hunks)
  • configs/README.md (0 hunks)
  • docs/MIMIC_IV_tutorial/data_extraction_processing.md (2 hunks)
  • sample_data/build_sample_task_DF.py (1 hunks)
  • sample_data/generate_synthetic_data.py (9 hunks)
  • scripts/build_dataset.py (6 hunks)
  • scripts/convert_to_ESDS.py (1 hunks)
  • scripts/finetune.py (1 hunks)
  • scripts/generate_trajectories.py (0 hunks)
  • scripts/get_embeddings.py (2 hunks)
  • scripts/launch_finetuning_wandb_hp_sweep.py (2 hunks)
  • scripts/launch_from_scratch_supervised_wandb_hp_sweep.py (2 hunks)
  • scripts/launch_pretraining_wandb_hp_sweep.py (2 hunks)
  • scripts/launch_sklearn_baseline_supervised_wandb_hp_sweep.py (2 hunks)
  • scripts/prepare_pretrain_subsets.py (3 hunks)
  • scripts/pretrain.py (2 hunks)
  • scripts/sklearn_baseline.py (1 hunks)
  • scripts/zeroshot.py (2 hunks)
  • setup.py (1 hunks)
  • tests/data/preprocessing/test_standard_scaler.py (0 hunks)
  • tests/data/preprocessing/test_stddev_cutoff.py (0 hunks)
  • tests/data/test_config.py (8 hunks)
  • tests/data/test_dataset_base.py (2 hunks)
  • tests/data/test_dataset_polars.py (0 hunks)
  • tests/data/test_pytorch_dataset.py (9 hunks)
  • tests/test_e2e_runs.py (5 hunks)
💤 Files with no reviewable changes (12)
  • tests/data/preprocessing/test_stddev_cutoff.py
  • configs/README.md
  • EventStream/data/preprocessing/README.md
  • EventStream/evaluation/general_generative_evaluation.py
  • EventStream/data/preprocessing/__init__.py
  • EventStream/data/preprocessing/standard_scaler.py
  • EventStream/data/preprocessing/stddev_cutoff.py
  • tests/data/preprocessing/test_standard_scaler.py
  • EventStream/data/preprocessing/preprocessor.py
  • EventStream/evaluation/MCF_evaluation.py
  • tests/data/test_dataset_polars.py
  • scripts/generate_trajectories.py
🧰 Additional context used
🧬 Code Definitions (15)
scripts/prepare_pretrain_subsets.py (6)
scripts/get_embeddings.py (1) (1)
  • main (24-28)
scripts/finetune.py (1) (1)
  • main (25-39)
scripts/pretrain.py (1) (1)
  • main (29-43)
scripts/zeroshot.py (1) (1)
  • main (25-29)
scripts/build_dataset.py (1) (1)
  • main (82-376)
scripts/sklearn_baseline.py (1) (1)
  • main (19-24)
scripts/sklearn_baseline.py (8)
scripts/launch_sklearn_baseline_supervised_wandb_hp_sweep.py (1) (1)
  • main (81-111)
scripts/get_embeddings.py (1) (1)
  • main (24-28)
scripts/finetune.py (1) (1)
  • main (25-39)
scripts/launch_finetuning_wandb_hp_sweep.py (1) (1)
  • main (77-107)
scripts/launch_from_scratch_supervised_wandb_hp_sweep.py (1) (1)
  • main (81-111)
scripts/launch_pretraining_wandb_hp_sweep.py (1) (1)
  • main (79-109)
scripts/pretrain.py (1) (1)
  • main (29-43)
scripts/zeroshot.py (1) (1)
  • main (25-29)
scripts/finetune.py (3)
EventStream/transformer/lightning_modules/fine_tuning.py (2) (2)
  • FinetuneConfig (273-442)
  • train (446-586)
scripts/get_embeddings.py (1) (1)
  • main (24-28)
scripts/zeroshot.py (1) (1)
  • main (25-29)
tests/data/test_dataset_base.py (2)
EventStream/data/pytorch_dataset.py (1) (1)
  • subject_ids (402-403)
EventStream/data/dataset_base.py (1) (1)
  • split (573-653)
EventStream/transformer/model_output.py (1)
EventStream/transformer/utils.py (1) (1)
  • str_summary (11-30)
EventStream/transformer/lightning_modules/generative_modeling.py (1)
EventStream/transformer/config.py (3) (3)
  • MetricsConfig (106-199)
  • Split (26-42)
  • MetricCategories (45-61)
EventStream/baseline/FT_task_baseline.py (2)
EventStream/data/dataset_base.py (1) (1)
  • split (573-653)
EventStream/data/config.py (1) (1)
  • drop (1391-1415)
EventStream/transformer/lightning_modules/fine_tuning.py (1)
EventStream/transformer/config.py (1) (1)
  • StructuredTransformerConfig (353-935)
EventStream/data/visualize.py (1)
EventStream/data/dataset_base.py (2) (2)
  • events_df (302-313)
  • events_df (316-317)
scripts/build_dataset.py (1)
EventStream/logger.py (1) (1)
  • hydra_loguru_init (7-10)
EventStream/transformer/transformer.py (1)
EventStream/data/types.py (1) (1)
  • batch_size (178-183)
tests/test_e2e_runs.py (1)
tests/data/test_pytorch_dataset.py (1) (1)
  • tearDown (360-361)
EventStream/data/pytorch_dataset.py (3)
EventStream/utils.py (1) (1)
  • count_or_proportion (26-88)
EventStream/data/config.py (1) (1)
  • PytorchDatasetConfig (788-1062)
EventStream/data/dataset_base.py (1) (1)
  • split (573-653)
EventStream/data/dataset_polars.py (2)
EventStream/data/config.py (3) (3)
  • drop (1391-1415)
  • measurement_configs (980-983)
  • vocabulary_config (972-973)
EventStream/data/types.py (3) (3)
  • items (306-308)
  • DataModality (832-868)
  • TemporalityType (808-829)
tests/data/test_pytorch_dataset.py (2)
EventStream/data/config.py (4) (4)
  • PytorchDatasetConfig (788-1062)
  • to_dict (955-959)
  • to_dict (1538-1554)
  • to_dict (1934-1944)
EventStream/data/pytorch_dataset.py (2) (2)
  • PytorchDataset (61-609)
  • max_seq_len (417-418)
🪛 Ruff (0.8.2)
sample_data/generate_synthetic_data.py

118-118: Multiple statements on one line (colon)

(E701)


119-119: Multiple statements on one line (colon)

(E701)


122-122: Multiple statements on one line (colon)

(E701)


123-123: Multiple statements on one line (colon)

(E701)


190-190: Local variable months is assigned to but never used

Remove assignment to unused variable months

(F841)


193-193: Local variable lab_ps is assigned to but never used

Remove assignment to unused variable lab_ps

(F841)


212-212: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


250-250: Local variable months is assigned to but never used

Remove assignment to unused variable months

(F841)

EventStream/data/dataset_base.py

1488-1488: Loop control variable m not used within loop body

Rename unused m to _m

(B007)

EventStream/data/pytorch_dataset.py

209-209: Loop control variable n_events overrides iterable it iterates

(B020)

EventStream/data/dataset_polars.py

1648, 1652, 1654, 1657, 1658, 1659, 1666: Function definition does not bind loop variable m (B023)

1648: Function definition does not bind loop variable key_prefix (B023)

1652, 1654, 1657, 1658, 1659: Function definition does not bind loop variable val_prefix (B023)


2013-2013: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

🪛 markdownlint-cli2 (0.17.2)
docs/MIMIC_IV_tutorial/data_extraction_processing.md

29-29: Hard tabs
Column: 1

(MD010, no-hard-tabs)


30-30: Hard tabs
Column: 1

(MD010, no-hard-tabs)


31-31: Hard tabs
Column: 1

(MD010, no-hard-tabs)


32-32: Hard tabs
Column: 1

(MD010, no-hard-tabs)

🔇 Additional comments (253)
EventStream/logger.py (1)

1-10: Good implementation of Loguru integration with Hydra

The implementation is clean and effective for integrating the Loguru logger with Hydra's output directory structure.

README.md (1)

34-36: Helpful compatibility note about polars version requirements

This note provides valuable information about version requirements and potential error symptoms, which will help users troubleshoot installation issues.

sample_data/generate_synthetic_data.py (1)

3-11: Improved docstring formatting

The docstring formatting changes improve readability and make the purpose of the script clearer.

EventStream/data/README.md (1)

393-395: Consistency update from np.NaN to float('nan')

Updating the documentation to use float('nan') instead of np.NaN improves consistency with the code implementation and makes the docs less dependent on NumPy-specific terminology.

setup.py (1)

27-27: Script name updated to be more specific.

The script entry has been renamed from "launch_wandb_hp_sweep.py" to "launch_pretraining_wandb_hp_sweep.py", which better reflects its specific purpose for pretraining tasks.

EventStream/transformer/lightning_modules/zero_shot_evaluator.py (2)

13-13: Added structured logging with loguru.

Adding the loguru logger import to support structured logging, which is more suitable for production environments than print statements.


384-384: Replaced print statement with structured logging.

Improved logging by replacing a print statement with logger.info(), enhancing log readability and consistency.

docs/MIMIC_IV_tutorial/data_extraction_processing.md (2)

24-25: Repository reference updated from EFGPT to ESGPT.

The environment variable has been renamed from EFGPT_PATH to ESGPT_PATH, maintaining consistency with the repository name change throughout the documentation.


46-47: Updated hydra searchpath reference to ESGPT_PATH.

The hydra configuration explanation has been updated to reflect the repository name change from EFGPT to ESGPT.

EventStream/transformer/config.py (11)

14-14: Added structured logging with loguru.

Adding the loguru logger import to support structured logging throughout the configuration handling.


575-578: Replaced print with logger.warning for categorical_embedding_dim.

Enhanced logging by replacing print warnings with structured logger.warning() calls, improving log readability and consistency.


581-584: Replaced print with logger.warning for numerical_embedding_dim.

Enhanced logging by replacing print warnings with structured logger.warning() calls, improving log readability and consistency.


629-634: Replaced print with logger.warning for measurements_per_dep_graph_level.

Enhanced logging by replacing print warnings with structured logger.warning() calls, improving log readability and consistency.


636-641: Replaced print with logger.warning for do_full_block_in_seq_attention.

Enhanced logging by replacing print warnings with structured logger.warning() calls, improving log readability and consistency.


643-649: Replaced print with logger.warning for do_full_block_in_dep_graph_attention.

Enhanced logging by replacing print warnings with structured logger.warning() calls, improving log readability and consistency.


651-654: Replaced print with logger.warning for dep_graph_attention_types.

Enhanced logging by replacing print warnings with structured logger.warning() calls, improving log readability and consistency.


656-659: Replaced print with logger.warning for dep_graph_window_size.

Enhanced logging by replacing print warnings with structured logger.warning() calls, improving log readability and consistency.


759-765: Replaced print with logger.warning for TTE_lognormal_generation_num_components.

Enhanced logging by replacing print warnings with structured logger.warning() calls, improving log readability and consistency.


767-772: Replaced print with logger.warning for mean_log_inter_event_time_min.

Enhanced logging by replacing print warnings with structured logger.warning() calls, improving log readability and consistency.


774-779: Replaced print with logger.warning for std_log_inter_event_time_min.

Enhanced logging by replacing print warnings with structured logger.warning() calls, improving log readability and consistency.

EventStream/transformer/lightning_modules/embedding.py (3)

9-9: New dependency added properly.

Added the loguru logger import, which aligns with the broader initiative to replace print statements with structured logging.


157-159: Good conversion from print to structured logging.

Replaced print statement with logger.info for better logging control and visibility when embeddings already exist.


161-161: Good conversion from print to structured logging.

Replaced print statement with logger.info for better logging control and visibility when saving embeddings.

EventStream/data/types.py (1)

753-753: Improved truncated representation in DataFrame output.

Enhanced the output formatting for dynamic indices, dynamic measurement indices, and dynamic values in the DataFrame representation by using ellipses to indicate truncated content for better readability.

EventStream/utils.py (2)

19-19: New dependency added properly.

Added the loguru logger import, which aligns with the broader initiative to replace print statements with structured logging.


383-383: Improved error logging.

Replaced print statements with structured logging using logger.error, which consolidates the error message and traceback into a single log entry for better error tracking and visibility.

EventStream/transformer/model_output.py (7)

12-12: New dependency added properly.

Added the loguru logger import, which aligns with the broader initiative to replace print statements with structured logging.


434-437: Good conversion from print to structured logging.

Replaced print statement with logger.warning for better logging control and visibility when detecting improper measurement access attempts.


461-461: Good conversion from print to structured logging.

Replaced print statement with logger.warning for better logging control and visibility when detecting improper measurement generation.


529-534: Enhanced error reporting with better context.

Improved error handling by preserving the original exception context with from e and providing more detailed information about the failure, including shapes and indices. This will make debugging easier.


1027-1031: Enhanced error reporting with better context.

Improved error handling by preserving the original exception context with from e and including shape information in the error message, which will help diagnose concatenation issues more efficiently.


1361-1361: Enhanced error reporting with better context.

Improved error handling by preserving the original exception context with from e and including tensor summary information using the str_summary utility, which will make debugging time-to-event calculation failures more straightforward.


1496-1504: Enhanced error reporting with better context.

Improved error handling with a more detailed error message that includes vocabulary information, label ranges, and tensor shapes. This will help diagnose classification validation failures more effectively.

EventStream/tasks/profile.py (5)

7-7: API update to match Polars library changes

The pl.enable_string_cache() call has been updated to remove the explicit True parameter, which is now the default behavior in newer versions of Polars.


85-85: Updated datatype specification for the start_time column

The datatype for start_time has been changed from a previous type to null, which better reflects that this column can contain null values as shown in the example data.


98-98: Updated datatype specification consistent with previous changes

Similar to the previous change, the datatype for start_time in the "foobar/bar" example has been updated to null to consistently handle null values.


154-154: API update: groupby → group_by

Updated method name from groupby to group_by to follow Polars' API changes and align with Python's snake_case naming convention.


156-156: API update: pl.count() → pl.len()

Changed the aggregation function from pl.count() to pl.len() to align with Polars' API changes or to use a more appropriate function for counting samples per subject.

EventStream/transformer/lightning_modules/generative_modeling.py (4)

15-15: Added structured logging with loguru

Added import for the loguru logger to replace basic print statements with structured logging, improving error traceability and log management.


283-286: Enhanced error reporting with structured logging

Replaced print statement with logger.error() for better error visibility and traceability when metric computation fails. This provides clearer context about the error and standardizes logging throughout the codebase.


523-528: Updated metrics configuration for validation

Modified final_validation_metrics_config to explicitly include loss part metrics for both tuning and held-out splits. This ensures consistent metric tracking across validation phases instead of skipping all metrics by default.

The change aligns with the MetricsConfig class behavior defined in EventStream/transformer/config.py, where setting specific include_metrics ensures only those metrics are tracked.


599-599: Standardized information logging

Replaced print statements with logger.info() calls to standardize logging throughout the codebase, making debug information more consistent and easier to filter.

Also applies to: 604-604, 627-627, 708-708

EventStream/data/visualize.py (7)

164-168: API update: group_by_dynamic replaces groupby_dynamic

Updated method name to follow Polars' API changes and Python's snake_case naming convention. This is part of a broader standardization of method naming in the Polars library.


182-182: API update: groupby → group_by

Changed method call from groupby to group_by to maintain consistency with Polars' updated API.


256-256: API update: groupby → group_by in plot_static_variables_breakdown

Updated method call in the static variables breakdown function to maintain consistency with Polars' API changes.


283-284: API update: groupby → group_by in plot_counts_over_age

Consistent update to the group_by method in the age-based plotting functionality.


300-300: API update: groupby → group_by in aggregation logic

Another instance of updating to the new method name within the counts_at_age aggregation logic.


341-341: API update: groupby → group_by in plot_events_per_patient

Consistent method name update in the events per patient plotting logic.


356-356: API update: groupby → group_by in plot method

Final instance of updating the method name in the plot function to maintain consistency throughout the file.

EventStream/transformer/lightning_modules/fine_tuning.py (7)

17-17: Added structured logging with loguru

Added import for the loguru logger to replace basic print statements with structured logging throughout the file.


187-190: Enhanced error reporting with structured logging

Replaced print statement with logger.error() for better error visibility and traceability when metric computation fails. This provides more context about the error and fits with the broader standardization of logging.


400-403: Improved warning messages with structured logging

Updated warning message about train_subset_seed using logger.warning() instead of print, providing better visibility for important configuration warnings.


415-418: Enhanced warning visibility for task_df_name conflicts

Replaced print with logger.warning() to highlight potential conflicts in task_df_name settings, making it easier to identify configuration issues.


420-422: Better logging for configuration overrides

Improved logging of configuration parameter overrides with logger.info(), providing clearer tracking of what's being changed during initialization.


470-470: Standardized information logging for file operations

Replaced print statements with logger.info() calls for consistent logging of file operations and configuration saving, improving traceability.

Also applies to: 475-475, 579-579


492-492: Consistent logging style in commented code

Updated the commented-out print statement about model compilation to use logger.info(), maintaining consistency with the rest of the codebase if this code is ever uncommented.

EventStream/data/time_dependent_functor.py (5)

147-149: Ensure correctness of nanosecond-based time conversion.

Switching from dt.nanoseconds() to dt.total_nanoseconds() offers finer granularity and improved precision in calculating age. This appears correct, but consider verifying any downstream code that might rely on the older approach to ensure consistency.


210-212: Validate default mean and standard deviation.

Defaulting to mean=0 and std=1 is typically safe, but verify that downstream logic can handle these defaults without unexpected outcomes (e.g., skipping or incorrectly normalizing).


213-221: Graceful handling of threshold boundaries.

Assigning float("inf") and float("-inf") is a sensible default when thresholds are absent. Confirm that the rest of the pipeline (e.g., outlier detection and any numeric stability checks) properly accommodates these infinite bounds.


613-631: Random initialization of sine/cosine parameters.

Defining sin_div_term and cos_div_term via torch.empty(...) and applying normal initialization allows flexible learnable frequencies. This approach is coherent with the described architecture.


678-694: Maintain clarity with separate sin/cos div terms.

Using identical expressions for sin_div_term and cos_div_term and separating them for even and odd embedding dims is consistent. This helps ensure correct shape handling for odd dimensions.

EventStream/transformer/transformer.py (12)

122-122: Prefer naming consistency for dropout probability.

Renaming self.attn_dropout to self.attn_dropout_p clarifies that it stores a probability rather than a Dropout layer. This naming improvement is welcome.


179-180: Document the unsupported head_mask parameter clearly.

The docstring now warns that head_mask is not supported, aligning with the raised exception below. Good job ensuring the documentation reflects actual functionality.


185-187: Explicitly raise error for unsupported head masking.

Raising an error clarifies that the feature is unavailable. This is helpful for quickly catching invalid usage.


188-189: Batch size extraction is straightforward.

Storing batch_size = query.shape[0] clarifies downstream usage. Ensure that inputs always follow the (batch, head, seq_len, head_dim) pattern as expected.


190-192: Mask value extraction.

Using torch.finfo(value.dtype).min then creating a singleton tensor clarifies that it is a scalar sentinel. This is a neat approach to avoid repeating finfo calls.


195-197: Casting query and key to match value’s dtype.

Ensuring all attention inputs share the same dtype helps avoid subtle numerical issues. Good consistency fix.


199-203: Reject single-batch usage with certain padding.

The logic here properly prevents a known edge case with padding='max_length' in single-batch scenarios. This explicit check is beneficial for user clarity.


204-206: Conditional dropout usage.

Applying dropout only in training mode is a standard approach. This ensures reproducibility and stable attention in evaluation mode.


209-217: Use scaled_dot_product_attention for minimal seq length.

Conditionally implementing is_causal=True or False is a neat trick for short sequences. Double-check that masks remain consistent in each scenario if you reintroduce attention masks here.
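For reference, a minimal standalone use of the underlying PyTorch call with both branches; shapes are arbitrary and this is not the module's code.

import torch
import torch.nn.functional as F

q = torch.randn(2, 4, 8, 16)  # (batch, heads, seq_len, head_dim)
k = torch.randn(2, 4, 8, 16)
v = torch.randn(2, 4, 8, 16)

causal_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
masked_out = F.scaled_dot_product_attention(q, k, v, attn_mask=None, is_causal=False)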


218-233: Construct and append causal mask properly.

Your approach to slicing self.bias ensures an accurate local or causal context. Combining it with attention_mask by summation is a succinct way to unify multiple masks.


234-236: Single point of scaled dot-product call highlights clarity.

Routing all logic into one final scaled_dot_product_attention call for this branch is simpler than multiple calls. This is an effective consolidation.


242-242: Return no explicit attention weights.

Returning None for attention weights indicates they’re not tracked. This is consistent with your docstring about not returning them in this path.

EventStream/data/pytorch_dataset.py (66)

8-19: Enhanced logging and imports.

Switching to loguru for logging and importing SeedableMixin improves maintainability and consistency across the codebase.


39-54: Informative code example.

The updated docstring example clearly illustrates how to_int_index works. It's helpful and consistent with polars usage.


57-57: Use of search_sorted for indexing.

This approach finds the correct insertion indices for unique values. Confirm that lexicographic order is always the intended default for your application.


62-63: Inheritance from SeedableMixin.

Adding SeedableMixin to PytorchDataset fosters reproducible sampling in _seeded_getitem. This is beneficial for debugging and consistent data splits.


102-102: Extended type checking with polars Categorical.

Including polars Categorical(ordering="physical") or (ordering="lexical") in TYPE_CHECKERS ensures consistent usage of to_int_index for multi-class classification. Nicely done.


133-133: Ensure proper superclass initialization.

Calling super().__init__() from within __init__ ensures the SeedableMixin logic is set up. Good practice.


136-137: Expose split as instance attribute.

Storing split allows referencing dataset segments clearly. This is straightforward and helpful during debugging.


138-139: Logging vocabulary loading.

These lines add transparency about which step is running. Such logs help track dataset preparation progress.


140-141: User-friendly logging for split and shards.

Communicating “Reading splits & patient shards” helps track where the code is in the pipeline. Good practice.


142-142: Explicit descriptor reading step.

Logging the next major step clarifies the data loading flow.


145-149: Filter subjects by minimum sequence length.

Emitting a log line and then calling filter_to_min_seq_len clarifies the process. Ensure it’s not repeated inadvertently in multi-phase pipelines.


150-154: Optionally apply subset filtering for training.

Conditionally filtering the training split with a user-defined subset size is a nice feature. The logs further clarify when random sampling is triggered.


156-156: Define property for static directory.

This property-based approach organizes path references.


158-159: Clean direct path reference to “DL_reps”.

This ensures the code references the correct directory consistently.


164-164: Expose task_dir for clarity.

Similarly, referencing task_dfs via property clarifies the structure.


167-168: Property for “NRTs_dir”.

This aligns with the other path-based properties. Good consistent pattern.


169-172: Vocabulary reading method separated.

This dedicated function improves clarity and single responsibility.


173-174: Dedicated function for reading shards.

Separating shard reading from the constructor fosters modular design.


175-178: Load shard mapping from JSON.

Reading shard data once avoids repeated overhead. Good approach for large datasets.


179-182: measurement_configs property.

This property neatly surfaces measurement configuration definitions.


189-189: Progress bar for static shards.

Using tqdm for user feedback is helpful when reading many shards.


192-203: Selective column loading from parquet.

Reading minimal columns reduces memory overhead. This is an efficient approach for large parquet data.


206-206: Store shard DataFrame references.

Retaining them in self.static_dfs helps unify subsequent lookups.


214-214: Track sequence bounds for each subject.

Storing (0, n_events) is straightforward for subsequent slicing logic.


215-215: Handle dataset with/without tasks consistently.

Conditionally setting tasks, tasks types, and so forth keeps logic straightforward when task_df_name is absent.


217-221: Default fields when no tasks.

Setting index, labels, tasks, etc. to default ensures a uniform interface.


223-223: Generate path to task parquet.

Combining self.task_dir and config.task_df_name is consistent with the property-based approach.


224-242: Reading and validating task configurations.

Merging with existing static data, verifying columns, and checking for consistent task_info is thorough. This ensures correctness and alignment of tasks with the actual data.


322-343: Filtering to min sequence length.

Applying the user-defined minimum, logging the change, and removing short sequences is essential for data quality. The caution about comparability addresses potential differences across runs.
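
A simplified sketch of the filtering step described here (column and parameter names are illustrative):

import polars as pl

min_seq_len = 4
index_df = pl.DataFrame({"subject_id": [1, 2, 3], "n_events": [10, 2, 6]})
kept = index_df.filter(pl.col("n_events") >= min_seq_len)  # drops subject 2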


344-364: Random subset filtering for training.

Extracting a subset of subjects with np.random.default_rng is a clean approach. The logs highlight changes in row and subject counts, which is quite helpful.
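
A minimal, self-contained illustration of reproducible subset selection with np.random.default_rng (variable names are placeholders, not the repository's identifiers):

import numpy as np

all_subject_ids = np.arange(100)
subset_size, subset_seed = 10, 1
rng = np.random.default_rng(subset_seed)
kept_subjects = set(rng.choice(all_subject_ids, size=subset_size, replace=False).tolist())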


365-365: Collecting inter-event time stats.

Delegating this computation to set_inter_event_time_stats keeps the pipeline organized.


366-366: Concatenate shards before computing stats.

Combining all data into one lazy frame is memory-friendly and consistent.


369-371: Select relevant column for inter-event time.

Dropping nulls and nans ensures robust log-based computations.


389-389: Write malformed records for debugging.

Logging them to parquet is a good fallback for investigating data anomalies.


395-395: Warn about non-positive inter-event times.

Aggregating these warnings helps debug data issues promptly.


396-396: Remove malformed subjects.

This ensures no zero or negative inter-event times remain, preserving model consistency.


401-401: Property-based approach for subject IDs.

Returning a list of [x[0] for x in self.index] is straightforward.


402-403: Getter method for subject_ids.

Exposing it as a property unifies usage throughout the dataset.


405-407: Length matches the index size.

Returning len(self.index) is consistent with typical dataset patterns.


409-410: Check presence of tasks.

has_task clarifies if the dataset is labeled or not.


413-414: Expose padding side.

Just returning the configuration’s seq_padding_side.


416-416: Expose maximum sequence length from config.

Encapsulating max_seq_len helps avoid referencing config in user code.


420-423: Indicate if it’s a subset dataset.

Checks if train_subset_size != "FULL". This is a simple, clear property.


425-425: Docstring clarifies output structure of items.

Transparently describing the shape and type is beneficial for debugging.


445-446: Additional clarifications on placeholder for numeric data.

Documenting float('nan') usage for missing numeric elements helps future maintainers.


456-457: Use seed context manager in _seeded_getitem.

Properly controlling randomness ensures reproducibility across fetch operations.


459-459: Clarify purpose for _seeded_getitem.

Short doc clarifies it’s essentially the random-sampling-aware version of __getitem__.


462-466: Retrieve subject setup for single index.

This chunk extracts the correct shard, subject index, and static row for the requested item.


469-471: Initialize dictionary with static data arrays.

This standard representation matches the docstring structure.


473-474: Conditionally include subject_id.

Exposing user IDs can be critical for debugging, so having this optional is good.


476-492: Implement subsequence sampling.

The random, end-based, and from-start approaches cover the typical scenarios, and the logging added above makes the chosen behavior easy to trace.
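
A sketch of the three strategies under simplifying assumptions (the strategy names and function signature are illustrative, not the project's exact API):

import numpy as np

def sample_bounds(seq_len: int, max_seq_len: int, strategy: str, rng: np.random.Generator):
    if seq_len <= max_seq_len:
        return 0, seq_len
    if strategy == "random":
        start = int(rng.integers(0, seq_len - max_seq_len + 1))
    elif strategy == "to_end":
        start = seq_len - max_seq_len
    elif strategy == "from_start":
        start = 0
    else:
        raise ValueError(f"Unknown strategy: {strategy}")
    return start, start + max_seq_len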


497-499: Load dynamic slice from JointNestedRaggedTensorDict.

Efficiently reading partial data helps with large sequences.


502-505: Conditionally include start_time in minutes.

Summing partial time_delta ensures the correct offset. Good detail for time-based tasks.


507-508: Attach label(s) for tasks.

Populating out[t] from self.labels is straightforward for each requested task.


511-511: Dynamic-only collate function extracted.

Separating dynamic collation from overall logic improves maintainability.


512-517: Fetch dense keys separately.

Distinguishing “dynamic” from normal keys organizes the merging process.


521-523: VStack the dynamic ragged data.

Leveraging vstack simplifies merging variable-length data.


524-526: Separate event and values masks.

Renaming dim1/mask and dim2/mask is more explicit, clarifying which dimension is masked.


527-537: Carefully map numpy dtypes to torch.

Casting integer and float data is handled robustly. The TypeError ensures unknown dtypes are promptly flagged.
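
An illustrative version of that dispatch (a simplification; the project may handle a different set of dtypes):

import numpy as np
import torch

def to_tensor(arr: np.ndarray) -> torch.Tensor:
    if np.issubdtype(arr.dtype, np.bool_):
        return torch.as_tensor(arr, dtype=torch.bool)
    if np.issubdtype(arr.dtype, np.integer):
        return torch.as_tensor(arr, dtype=torch.long)
    if np.issubdtype(arr.dtype, np.floating):
        return torch.as_tensor(arr, dtype=torch.float32)
    raise TypeError(f"Unsupported dtype: {arr.dtype}")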


538-539: Merge dynamic and dense dictionaries.

Combining them into collated for straightforward post-processing is neat.


540-547: Populate the final batch dictionary.

This ensures consistent naming and shape for time deltas, dynamic indices, etc.


549-555: Optional fields for start_time and subsequence bounds.

Conditional logic avoids extraneous data when not needed, saving memory.


559-559: Early return if no tasks.

Skipping label prep when no tasks exist saves overhead.


563-572: Collect and cast labels by task type.

Handling multi-class, binary, and regression tasks distinctly ensures correct numeric transformations.
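
For clarity, a hedged sketch of per-task casting (the task-type names here are assumptions):

import torch

def collate_labels(values: list, task_type: str) -> torch.Tensor:
    if task_type == "multi_class_classification":
        return torch.tensor(values, dtype=torch.long)
    if task_type in ("binary_classification", "regression"):
        return torch.tensor(values, dtype=torch.float32)
    raise ValueError(f"Unknown task type: {task_type}")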


589-590: Collate dynamic data first, then handle static.

This separation clarifies the two major forms of data.


591-610: Efficiently pad static fields.

Looping once with torch.nn.functional.pad ensures consistent shapes, and concatenating the padded tensors is typical for batching.
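
A minimal example of the right-pad-then-cat pattern (illustrative data, not the repository's code):

import torch
import torch.nn.functional as F

items = [torch.tensor([3, 7]), torch.tensor([1]), torch.tensor([2, 5, 9])]
max_len = max(t.numel() for t in items)
batch = torch.cat(
    [F.pad(t, (0, max_len - t.numel())).unsqueeze(0) for t in items], dim=0
)  # shape: (3, 3), zero-padded on the right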

EventStream/data/config.py (11)

7-8: Imports for hashing & JSON handling look fine.
These imports are needed for computing and storing configuration hashes, and seem appropriate.


19-19: Introduce Loguru logger import.
This shift to structured logging is consistent with the broader logging refactor.


809-812: Docstring refinements for tuning subset size.
These lines clarify how the tuning subset is managed. The description is coherent and matches the pattern used for train subsets.


859-862: Demonstration snippet for Loguru in doctest.
This snippet showcases logger usage in a docstring example. Looks fine as an illustrative test block.


886-887: New parameters for tuning subset.
Introducing tuning_subset_size and tuning_subset_seed aligns well with the existing train subset logic and improves customization.


896-898: New configuration comment and field.
Declaring cache_for_epochs with a default of 1 clarifies caching behavior. The inline comment about speed/disk/mem trade-offs is helpful.


929-931: Logging seed assignment for training subset.
Using logger.warning instead of print preserves consistency with the new structured logging approach.


933-933: Clearing seed when subset size is “FULL” or None.
This logic is consistent with the train subset usage pattern and ensures predictable behavior.


1869-1870: Added new config fields for outlier detection and normalization.
These fields enable flexible outlier handling and toggling of numerical standardization. Well-structured addition.


1920-1923: Validation of outlier detector configuration.
Ensuring outlier_detector_config is either None or a dictionary helps maintain consistent usage of the new outlier detection feature.


900-908: ⚠️ Potential issue

Likely extraneous closing parenthesis causes syntax error.
There appears to be an unmatched closing parenthesis at line 907, which would cause a syntax error. Consider removing the extra parenthesis and ensuring proper indentation, for example:

902     if self.subsequence_sampling_strategy != "random" and self.cache_for_epochs > 1:
903         raise ValueError(
904             f"It does not make sense to cache for {self.cache_for_epochs} with non-random "
905             "subsequence sampling."
906         )
-907     )

Likely an incorrect or invalid review comment.

scripts/pretrain.py (2)

19-19: Add import for Hydra-Loguru initialization.
This import is part of the ongoing transition to Loguru logging.


30-30: Initialize Loguru within main.
Calling hydra_loguru_init() at the start ensures all subsequent logging is captured consistently.
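
A minimal sketch of the pattern these scripts follow (the config path and name below are placeholders, not the repository's actual values):

import hydra
from EventStream.logger import hydra_loguru_init

@hydra.main(version_base=None, config_path="../configs", config_name="example_config")
def main(cfg):
    hydra_loguru_init()  # configure Loguru before any config handling or training
    ...

if __name__ == "__main__":
    main()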

scripts/launch_sklearn_baseline_supervised_wandb_hp_sweep.py (2)

18-19: Add Loguru import.
Importing hydra_loguru_init keeps logging consistent across scripts.


82-82: Hydra-Loguru initialization at script start.
Good practice to set up logging prior to reading config or performing logic.

scripts/get_embeddings.py (1)

14-14: Adding structured logging with Loguru - good improvement!

The addition of hydra_loguru_init() provides better logging capabilities compared to standard print statements. This change aligns with similar updates across other scripts in the codebase, establishing a consistent logging pattern.

Also applies to: 25-25

sample_data/build_sample_task_DF.py (1)

29-30: Method update for Polars API consistency

The change from .groupby() to .group_by() correctly aligns with the Polars DataFrame API (imported as pl on line 2). This ensures the code uses the proper method naming conventions for the Polars library rather than pandas-style methods.
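
A small example of the renamed API (illustrative data only):

import polars as pl

df = pl.DataFrame({"subject_id": [1, 1, 2], "value": [3.0, 4.0, 5.0]})
agg = df.group_by("subject_id").agg(pl.col("value").mean())  # formerly df.groupby(...)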

scripts/sklearn_baseline.py (1)

15-15: Adding structured logging with Loguru - good improvement!

The addition of hydra_loguru_init() enhances the script's logging capabilities. This change is part of a consistent pattern implemented across multiple scripts in this PR, establishing a standardized approach to logging throughout the codebase.

Also applies to: 20-20

scripts/launch_finetuning_wandb_hp_sweep.py (1)

18-19: Adding structured logging with Loguru - good improvement!

The addition of hydra_loguru_init() enhances the script's logging capabilities and follows the same pattern implemented across other scripts in this PR. This creates a consistent logging approach throughout the codebase, which will improve debugging and monitoring capabilities.

Also applies to: 78-78

scripts/zeroshot.py (2)

15-15: Good addition of structured logging import.

This import adds the logging initialization function that will be used to configure Loguru, which is a structured logging library that offers better logging capabilities compared to simple print statements.


26-26: Good logging initialization placement.

The hydra_loguru_init() call is appropriately placed at the beginning of the main function, ensuring that logging is configured before any other operations occur. This follows the best practice of initializing logging early in the execution flow.

scripts/launch_pretraining_wandb_hp_sweep.py (2)

18-19: Good addition of structured logging import.

Adding the hydra_loguru_init import from EventStream.logger standardizes logging initialization across the codebase. The blank line also helps maintain code readability.


80-80: Good logging initialization placement.

The hydra_loguru_init() call is appropriately placed at the beginning of the main function, ensuring that logging is configured before any other operations occur, including the configuration handling and WandB sweep setup.

scripts/finetune.py (2)

18-18: Good addition of structured logging import.

This import adds the logging initialization function that will be used to configure Loguru for structured logging, aligning with the codebase-wide standardization of logging.


26-26: Good logging initialization placement.

The hydra_loguru_init() call is correctly placed at the beginning of the main function, before handling configurations and training. This ensures consistent logging behavior throughout the execution flow.

scripts/launch_from_scratch_supervised_wandb_hp_sweep.py (2)

18-19: Good addition of structured logging import.

Adding the hydra_loguru_init import standardizes logging initialization across the codebase, consistent with the changes in other scripts. The blank line maintains good code formatting.


82-82: Good logging initialization placement.

The hydra_loguru_init() call is appropriately placed at the beginning of the main function, ensuring logging is configured before any configuration processing or WandB sweep operations.

scripts/prepare_pretrain_subsets.py (5)

23-24: Added Loguru for structured logging

This import adds the Loguru library for structured logging, which is a more powerful alternative to the standard print statements previously used. This provides better log management, filtering, and formatting capabilities.


27-28: Added hydra_loguru_init import

This import brings in the logging initialization function that will configure Loguru to work with Hydra's output directory structure, ensuring logs are properly captured in the experiment directories.


32-33: Added logging initialization

The hydra_loguru_init() call initializes the logging system at the beginning of the main function, ensuring all subsequent logs are properly formatted and saved to Hydra's output directory.


63-63: Replaced print with structured logging

Properly migrated a print statement to Loguru's structured logging system. This provides consistency with other scripts like finetune.py and pretrain.py that also use this pattern.


255-255: Replaced print with structured logging

Converted a print statement about command file writing to a structured log message, maintaining the same informational content while benefiting from Loguru's features like log levels and formatting.

EventStream/baseline/FT_task_baseline.py (10)

18-19: Added Loguru for structured logging

This import adds the Loguru library for structured logging, consistent with the pattern applied across the codebase.


39-39: Updated Polars string cache enabling

The explicit True argument has been removed from the pl.enable_string_cache() call. This reflects an API change in Polars: enable_string_cache() no longer takes a boolean argument, and calling it simply enables the cache.


191-191: Added explicit column selection after caching

This line adds explicit column selection for subject_id, timestamp, and any window features when loading data from the cached parquet file. This ensures consistency in the column set and potentially improves performance by only loading needed columns.


654-654: Converted print to structured logging

Replaced print statement with Loguru's structured logging for saving configuration information.


711-712: Converted print to structured logging

Replaced print statement with Loguru's structured logging for dataset loading information.


716-718: Converted print to structured logging

Replaced print statement with Loguru's structured logging for dataset loading completion information, maintaining the same format string with elapsed time calculation.


721-721: Converted print to structured logging

Replaced print statement with Loguru's structured logging for model initialization information.


724-724: Converted print to structured logging

Replaced print statement with Loguru's structured logging for model fitting information.


726-726: Converted print to structured logging

Replaced print statement with Loguru's structured logging for model saving information.


730-730: Converted print to structured logging

Replaced print statement with Loguru's structured logging for model evaluation information.

scripts/build_dataset.py (6)

17-18: Added Loguru for structured logging

Added import for Loguru's logger module to enable structured logging throughout the script.


34-34: Added hydra_loguru_init import

Imported the logging initialization function to configure Loguru with Hydra's output directory.


54-57: Updated docstring example to use Logger

The docstring example for add_to_container has been updated to demonstrate how to use Loguru for logging in tests, including setup and configuration steps.


74-74: Converted warning to structured logging

Replaced a print warning with a proper Loguru warning call. This improves logging consistency and enables better log filtering and management.


82-84: Added logging initialization

Added hydra_loguru_init() call at the beginning of the main function with an empty line after it to initialize the logging system, consistent with other scripts in the codebase.


365-365: Converted print to structured logging

Replaced print statement about omitted configuration parameters with a Loguru structured logging call.

tests/data/test_dataset_base.py (2)

259-277: Added test for dataset splitting with mandatory IDs

A new test method test_split_mandatory_ids has been added to verify the functionality of the split method when mandatory subject IDs are specified for certain splits. The test confirms that:

  1. The resulting split contains all expected keys
  2. All subject IDs are included in at least one split
  3. The mandatory IDs are correctly assigned to their respective splits
  4. The remaining IDs are properly distributed among the random splits

This test improves coverage for the dataset splitting functionality, particularly for the case where some IDs must be assigned to specific splits.


494-497: Updated measurement metadata structure

The structure of the empty measurement metadata DataFrame has been updated to reflect changes in the MeasurementConfig class. The previous columns "outlier_model" and "normalizer" have been replaced with new columns "mean", "std", "thresh_small", and "thresh_large", all initialized with float data type.

This change indicates a shift in how measurement values are transformed and normalized in the system, moving from a model-based approach to a statistical thresholding approach.

scripts/convert_to_ESDS.py (3)

26-38: LGTM: Well-defined configuration class.

The ConversionConfig class properly uses the hydra_dataclass decorator and includes appropriate attributes with type hints. The __post_init__ method correctly handles string path conversion.


47-55: Well-implemented check for existing files.

The code properly checks for existing files in the output directory and raises a helpful error message when overwrite is not enabled. It also handles the overwrite case well by removing the directory.


61-73: Robust chunk-based processing approach.

The implementation properly handles large datasets by splitting subjects into chunks and processing each chunk separately, which is important for memory efficiency. The use of tqdm for progress tracking is also helpful.

tests/data/test_pytorch_dataset.py (7)

56-58: Good addition of explicit time delta calculations.

The addition of time delta calculations for events is valuable for testing time-based features of the system. The approach of calculating deltas and adding a NaN at the end follows a common pattern for time series data.


66-68: Consistent implementation across all event time delta calculations.

The implementation of delta calculations is consistent across all three subjects, which ensures uniform test data behavior.

Also applies to: 77-79


85-86: Ensure empty lists are handled consistently in the DataFrame.

The change to use empty lists instead of None for subjects with no events is appropriate, but ensure that all processing code handles empty lists consistently.


143-144: Replaced null arrays with empty lists for consistency.

The change from None to empty lists for subjects with no dynamic indices, measurement indices, and values ensures consistent handling of empty data throughout the test dataset.

Also applies to: 178-179, 185-186


315-345: Comprehensive test setup with proper resource management.

The new setUp method creates a well-structured test environment with all necessary files and directories. The test uses temporary directories properly and sets up realistic test data including NRT (Nested Ragged Tensor) representations.

However, consider breaking this large setup method into smaller, focused helper methods for better maintainability:

def setUp(self):
    self._create_temp_directories()
    self._setup_shards()
    self._setup_dl_representation()
    self._setup_nrt_representation()
    self._setup_task_data()
    self._setup_configs()

def _create_temp_directories(self):
    self.dir_obj = TemporaryDirectory()
    self.path = Path(self.dir_obj.name)
    self.split = "fake_split"

# Additional helper methods...

360-361: Good resource cleanup in tearDown.

The tearDown method properly cleans up the temporary directory, which is important to prevent resource leaks in tests.


363-374: Simplified method signature improves maintainability.

The simplified get_pyd method makes the tests more maintainable by focusing on essential parameters and using default values and kwargs for the rest.

tests/data/test_config.py (5)

185-190: Updated measurement metadata structure with new statistical fields.

The test has been updated to reflect changes in the metadata structure, replacing outlier_model and normalizer with new statistical fields (mean, std, thresh_small, and thresh_large). This aligns with changes to how measurement metadata is defined across the codebase.


204-206: Updated measurement metadata field order.

The field order in the measurement metadata has been changed to accommodate the new statistical fields. This is consistent with the changes to the MeasurementConfig class.


341-343: Updated metadata fields for univariate regression.

The test for the univariate regression case has been updated with the new metadata fields, ensuring consistency across test cases.


454-455: Simplified outlier detector configuration.

The outlier detector configuration now allows an empty dictionary, which is a simplification from the previous implementation. This suggests a change in how outlier detection is configured in the system.


477-478: Enhanced validation for outlier detector configuration.

The validation now checks that the outlier detector configuration is a dictionary, not a string, which improves type checking and prevents potential type-related issues.

tests/test_e2e_runs.py (6)

33-42: Added ESDS directory to test setup.

The addition of the "esds" directory to the test environment setup is necessary to support the new ESDS dataset building functionality. This aligns with the addition of the new script convert_to_ESDS.py.


73-74: Non-subtest approach for build_dataset is appropriate.

Using use_subtest=False for the dataset building command is appropriate since this is a prerequisite for other tests rather than a separate test case itself.


75-83: Well-implemented ESDS dataset building method.

The build_ESDS_dataset method properly constructs the command to convert the dataset to the ESDS format, including appropriate parameters. It correctly uses the new convert_to_ESDS.py script.

Consider parameterizing the method for more flexibility:

-def build_ESDS_dataset(self):
+def build_ESDS_dataset(self, chunk_size=25):
     command_parts = [
         "./scripts/convert_to_ESDS.py",
         f"dataset_dir={self.paths['dataset']}",
         f"ESDS_save_dir={self.paths['esds']}",
-        "ESDS_chunk_size=25",
+        f"ESDS_chunk_size={chunk_size}",
     ]
     self._test_command(command_parts, "Build ESDS Dataset", use_subtest=True)

110-111: Non-subtest approach for pretraining is consistent with dataset building.

Using use_subtest=False for pretraining is consistent with the approach used for dataset building, as both are prerequisites for subsequent tests.


201-208: Enhanced window size options and model parameters.

The addition of new window size options (-1h and -FULL) and the explicit setting of n_estimators for the random forest classifier provide more comprehensive testing scenarios while keeping the test efficient with a small number of estimators.


255-257: Complete end-to-end test flow with ESDS dataset building.

The addition of the ESDS dataset building step to the end-to-end test ensures that this new functionality is properly tested as part of the complete workflow.

EventStream/data/dataset_base.py (24)

21-22: Streamlined imports for data manipulation & logging
These imports facilitate the usage of polars for data processing and loguru for structured logging.


24-24: Ensure runtime availability
This new import from nested_ragged_tensors requires verification that the dependency is installed and resolvable in all environments.


187-191: Refactored _load_input_df usage
Confirm that prior parameters like subject_ids_map or subject_id_dtype are removed or replaced. This ensures consistency with the simplified method signature.


194-194: Check column collisions
Using {i: o for i, (o, _) in schema.unified_schema.items()} is convenient, but verify no duplicate out-names.


220-220: Consistent function call
The revised _load_input_df call aligns with the new signature. Good job keeping the codebase in sync.


226-226: Enhanced debugging
Logging “Filtering” provides better traceability during data preprocessing steps.


230-231: Timestamp resolution improvements
The debug message clarifies event handling, and _resolve_ts_col ensures consistent timestamp columns.


241-242: Range-based event processing
These debug/log steps clarify range event handling. Ensures start_time is properly resolved.


260-261: Guard for empty events
Logging a warning on an empty events dataframe aids in diagnosing unexpected data conditions.


264-264: Appending events
Ensure subsequent concatenation merges schemas consistently across event DataFrames.


266-266: Appending measurements
Likewise, confirm consistent measurement schema when building a unified DataFrame later.


293-293: Loading subjects info
This info-level log helps in diagnosing any file I/O issues.


310-310: Logging event loading
Keeping logs consistent fosters maintainability across I/O boundaries.


329-329: Loading dynamic measurements
Ensures clarity when reading dynamic data from disk, aligning with the broader logging pattern.


366-366: Updating config directory
Informs that the config.save_dir is changed during reload; improves clarity for stateful loading logic.


473-474: Building subjects & logging extraction step
Combining subject build logic with a debug statement keeps the dataflow transparent.


479-479: Events and measurements built
A debug log confirming final data construction is helpful for debugging.


509-510: Assigning main DataFrames
Storing events_df and dynamic_measurements_df ensures consistent usage across methods.


512-514: Conditional event preprocessing
Only if events_df is not None, _agg_by_time() and _sort_events() are applied, preventing accidental calls on missing data.


521-521: Validating initial dataframes
Destructuring the results of _validate_initial_dfs clarifies final subject, event, and measurement DataFrames.


722-730: Comprehensive preprocessing pipeline logs
Provides clear step-by-step updates: from subject filtering, time-derived measurement addition, to final transformations.


793-793: Warning on missing measure
Helps identify absent columns at runtime, triggering a drop to avoid silent failures.


805-805: Logging no events
Reporting that a measure has no observed events clarifies the reason behind subsequent config drops.


806-806: Dropping measure config
Dropping the measure config after detecting no events is consistent but verify it doesn’t cause unhandled references.

EventStream/data/dataset_polars.py (19)

13-13: Reason for defaultdict
Using collections.defaultdict can simplify code when grouping or accumulating values.


15-15: Time offset usage
timedelta likely supports rolling or dynamic grouping logic—ensure all edge cases are handled.


22-22: PyArrow integration
Facilitates interoperability with Arrow-based formats. Good for performance in columnar operations.


23-23: Structured logging
Consistent logging with loguru improves observability.


38-38: String cache activation
pl.enable_string_cache() may significantly optimize repetitive string operations.


40-55: Polars-to-PyArrow dtype map
This dictionary ensures controlled mapping between Polars and Arrow. Verify coverage for all used dtypes to prevent runtime errors.


116-119: Expanded METADATA_SCHEMA
Adding thresh_high, thresh_low, mean, and std supports robust outlier detection and normalization.


301-301: Method signature flexibility
Clarifying the return with None for optional data aligns with the code handling empty or missing frames.


312-312: Added debug statement
Logging event processing step helps isolate issues in the pipeline.


332-335: Hashing for event_id
Struct-based hashing is concise but confirm collision risk is minimal for large datasets or repeated timestamps.
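
A hedged sketch of struct-based ID hashing in Polars (the column names follow the discussion above; this is not the project's exact expression):

import polars as pl

df = pl.DataFrame({
    "subject_id": [1, 1, 2],
    "timestamp": ["2020-01-01 00:00", "2020-01-01 01:00", "2020-01-02 00:00"],
})
df = df.with_columns(pl.struct(["subject_id", "timestamp"]).hash().alias("event_id"))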


635-635: Conditional type enforcement
Ensures dynamic_measurements_df["event_id"] aligns with events_df["event_id"] type.


637-639: Casting event_id
Repeated cast can cause overhead or data truncation if ID ranges are large. Validate the numeric limits.


640-640: Skipping streaming segfaults
This log clarifies a known issue with streaming in certain polars versions. Helpful for maintainers.


649-649: Temporal bucketing
group_by_dynamic(every=...) is powerful, but confirm memory usage is acceptable on large data.


664-667: Re-hashing events
Using .hash(...) again for final event IDs. Follow up on collision prevention if the data grows large.


683-683: Debug: Re-map measurements
The debug log clarifies re-mapping steps for dynamic_measurements_df.


685-687: Collect dynamic measurements
Disabling streaming ensures stability but may raise memory needs. Monitor large-file performance.


694-694: Collecting event types
Reinforces the discovered categories. Validate we handle unexpected categories gracefully upstream.


703-703: Event count storage
Maintaining a subject-to-event-count map speeds subsequent lookups and stats, if kept in sync.

Comment on lines +209 to +212
for i, (subj, n_events) in enumerate(zip(subject_ids, n_events)):
if subj in self.subj_indices or subj in self.subj_seq_bounds:
raise ValueError(f"Duplicate subject {subj} in {shard}!")


🛠️ Refactor suggestion

Potential variable overshadowing in loop.

Your loop reads for i, (subj, n_events) in enumerate(zip(subject_ids, n_events)):. The variable n_events is used both in the zip and as the local loop variable, which might cause confusion. Renaming the loop variable could avoid overshadowing.

🧰 Tools
🪛 Ruff (0.8.2)

209-209: Loop control variable n_events overrides iterable it iterates

(B020)
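
A self-contained illustration of the suggested rename (the data is made up; only the loop variable changes):

subject_ids = [101, 102, 103]
n_events = [4, 2, 7]

for i, (subj, subj_n_events) in enumerate(zip(subject_ids, n_events)):
    print(i, subj, subj_n_events)  # the outer n_events list is no longer shadowed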

Comment on lines +492 to 493
center_and_scale=True,
save_dir=None,

💡 Verification agent

🧩 Analysis chain

Added center_and_scale parameter to default configuration.

The addition of the center_and_scale parameter with a default value of True replaces the previous normalizer_config. This change should be documented to ensure users understand the new parameter's purpose and how it differs from the previous approach.


🌐 Web query:

What is the purpose of the center_and_scale parameter in data normalization?

💡 Result:

The center_and_scale parameter in data normalization determines whether to apply centering (adjusting the mean to zero) and scaling (adjusting the spread to unit variance or a specified range) to the dataset. Here's a breakdown of its purpose:

Core Functions

  1. Centering:

    • Subtracts the mean (or median) from each feature, shifting the data distribution to center around zero[1][3][6].
    • Example: If a feature’s mean is 50, centering adjusts values to fluctuate around 0 instead of 50[3][13].
  2. Scaling:

    • Adjusts the spread of data by dividing features by a metric like standard deviation, range, or median absolute deviation (MAD)[1][4][13].
    • Common methods:
      • Z-score: Divides by standard deviation (unit variance)[2][13].
      • Range scaling: Rescales data to a predefined range (e.g., [0, 1])[14][16].

Why Use center_and_scale?

  • Algorithm Compatibility: Many machine learning algorithms (e.g., SVM, PCA, k-NN) require features to have similar scales for accurate performance[2][12].
  • Gradient Descent Optimization: Centered and scaled gradients converge faster during model training[16][12].
  • Outlier Handling: Methods like MAD scaling reduce sensitivity to outliers compared to standard deviation[1][5].
| Method          | Effect on Data  | Typical Use Case                      |
|-----------------|-----------------|---------------------------------------|
| Z-score         | Mean=0, Std=1   | Gradient-based models (e.g., SVM)     |
| Min-Max Scaling | Range=[0,1]     | Neural networks, image processing     |
| MAD Scaling     | Median=0, MAD=1 | Robust scaling for outlier-prone data |

Practical Example

Enabling center_and_scale=True in a function like MATLAB’s normalize or scikit-learn’s StandardScaler performs:

X_normalized = (X - μ) / σ  

where:

  • μ = mean (centering)
  • σ = standard deviation (scaling)[2][16].

This ensures features are standardized (mean=0, variance=1), aligning with assumptions of many statistical models[6][13]. Without centering and scaling, features with larger ranges could dominate model outcomes[2][5].



Documentation Update Needed: Clarify center_and_scale Behavior

The new center_and_scale parameter (defaulted to True) now governs whether data is centered (shifting the mean to zero) and scaled (adjusting the spread to unit variance). This implementation replaces the old normalizer_config mechanism. To aid users in understanding the update, please ensure that the documentation (and any related changelogs) clearly explains:

  • What It Does:

    • Centering: Subtracts the mean so that the data distribution centers around zero.
    • Scaling: Divides by the standard deviation (or an equivalent metric) to standardize the data.
  • Rationale:

    • Aligns with common data normalization practices widely used in algorithms like SVM, PCA, and other gradient-based models.
    • Addresses potential issues where features with larger ranges might dominate model outcomes.
  • Migration Impact:

    • Highlights differences from the previous normalizer_config, ensuring users understand any behavioral changes they might need to account for.

Location:

  • File: tests/data/test_config.py
  • Lines: 492-493
