-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: modules/datagraph.py #153
Conversation
WalkthroughThe changes in the Changes
Poem
Tip CodeRabbit's docstrings feature is now available as part of our Early Access Program! Simply use the command Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
src/imgtools/modules/datagraph.py (1)
Line range hint
136-191
: Consider refactoring edge type handling for better maintainability.The edge type handling in
_form_edges
uses numeric constants and repeated code patterns. Consider using an enum or dataclass to define edge types and their relationships.+from enum import Enum +from dataclasses import dataclass +from typing import Callable, List + +class EdgeType(Enum): + RTDOSE_RTSTRUCT = 0 + RTDOSE_CT = 1 + RTSTRUCT_CT = 2 + # ... other edge types + +@dataclass +class EdgeDefinition: + type: EdgeType + source_modality: str + target_modality: str + join_key: str + join_target: str + +EDGE_DEFINITIONS = [ + EdgeDefinition(EdgeType.RTDOSE_RTSTRUCT, "RTSTRUCT", "RTDOSE", "instance_uid", "reference_rs"), + # ... other definitions +]
🧹 Nitpick comments (4)
src/imgtools/modules/datagraph.py (4)
30-31
: Consider handling path conversion more robustly.While accepting both
str
andPath
types improves flexibility, consider adding validation for the path existence and type before conversion.- self.edge_path = Path(edge_path) + try: + self.edge_path = Path(edge_path).resolve() + if not self.edge_path.parent.exists(): + raise ValueError(f"Parent directory does not exist: {self.edge_path.parent}") + except Exception as e: + raise ValueError(f"Invalid edge_path: {edge_path}") from eAlso applies to: 44-44
133-134
: Add error handling for file writing operations.The visualization file writing operation should include error handling to manage potential I/O issues.
- vis_path = self.edge_path.parent / "datanet.html" - data_net.show(vis_path) + vis_path = self.edge_path.parent / "datanet.html" + try: + data_net.show(vis_path) + except IOError as e: + logger.error(f"Failed to write visualization to {vis_path}: {e}") + raise
502-506
: Simplify component handling logic.The component handling logic could be simplified by using pandas' built-in methods more effectively.
- for _, study in enumerate(rel_studyids): # per study_id - df_temp = df_edges_processed.loc[df_edges_processed.study_x == study] - ct_locs = df_temp.loc[df_temp.modality_x.isin(['CT', 'MR'])] - ct_series = ct_locs.series_x.unique() + for study in rel_studyids: + df_temp = df_edges_processed.query("study_x == @study") + ct_locs = df_temp.query("modality_x in ['CT', 'MR']") + ct_series = ct_locs.series_x.unique()
586-594
: Consider using more specific type annotations.The type annotations in helper methods could be more specific to improve type safety.
- def _check_save(save_dict: dict, node, dest) -> Tuple[str, str]: + def _check_save(save_dict: Dict[str, Any], node: str, dest: str) -> Tuple[str, str]:
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/imgtools/modules/datagraph.py
(15 hunks)
🔇 Additional comments (1)
src/imgtools/modules/datagraph.py (1)
1-1
: LGTM: Import statements are appropriate.
The addition of Path
and type annotation imports aligns well with the type safety improvements.
Also applies to: 3-3
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #153 +/- ##
==========================================
+ Coverage 43.90% 44.03% +0.13%
==========================================
Files 48 48
Lines 3603 3540 -63
==========================================
- Hits 1582 1559 -23
+ Misses 2021 1981 -40 ☔ View full report in Codecov by Sentry. |
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🔭 Outside diff range comments (2)
src/imgtools/modules/datagraph.py (2)
Line range hint
53-63
: Add parent directory creation for edge_pathThe code checks if edge_path exists but doesn't ensure its parent directory exists. This could cause file creation to fail.
if not self.edge_path.exists(): + self.edge_path.parent.mkdir(parents=True, exist_ok=True) logger.info("Edge table not present. Forming the edge table based on the crawl data...") self.form_graph()
Line range hint
240-358
: Refactor parser method for better maintainabilityThe parser method is complex with multiple responsibilities. Consider breaking it down into smaller, focused methods:
- Query validation
- Single modality handling
- Pair modality handling
- Multi-modality handling
Example refactor for query validation:
def _validate_query(self, query_string: str) -> None: """Validate the query string against supported modalities and combinations.""" supported_modalities = {"RTDOSE", "RTSTRUCT", "CT", "PT", 'MR', 'SEG'} queried_modalities = set(query_string.split(",")) if not queried_modalities.issubset(supported_modalities): raise ValueError(f"Unsupported modalities in query: {queried_modalities - supported_modalities}")
🧹 Nitpick comments (3)
src/imgtools/modules/datagraph.py (3)
1-4
: Consider enhancing type annotations with collections.abcWhile the basic type annotations are good, consider using more specific types from
collections.abc
for better type safety.-from typing import List, Tuple +from typing import List, Tuple, Dict, Union +from collections.abc import Sequence
98-133
: Consider implementing graph visualization paginationThe visualization method loads all nodes and edges into memory at once, which could be problematic for large graphs. Consider implementing pagination or limiting the number of nodes displayed.
422-424
: Optimize string concatenation in list_edgesUsing reduce for string concatenation is less efficient than joining strings directly.
-def list_edges(series) -> str: - return reduce(lambda x, y:str(x) + str(y), series) +def list_edges(series) -> str: + return ''.join(map(str, series))
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/imgtools/modules/datagraph.py
(10 hunks)
🔇 Additional comments (1)
src/imgtools/modules/datagraph.py (1)
357-357
:
Ensure path safety when resolving relative paths
The current path resolution could be vulnerable to path traversal. Consider using resolve()
and validating the final path is within expected boundaries.
-final_df[col] = final_df[col].apply(lambda x: (self.edge_path.parent.parent / x).as_posix() if isinstance(x, str) else x)
+final_df[col] = final_df[col].apply(lambda x: (self.edge_path.parent.parent / x).resolve().as_posix() if isinstance(x, str) else x)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🔭 Outside diff range comments (1)
src/imgtools/modules/datagraph.py (1)
Line range hint
204-317
: Consider breaking down the parser method into smaller methods.The parser method is complex with multiple nested conditions. Consider extracting the single-modality, pair-modality, and multi-modality handling into separate methods for better maintainability.
Example refactor:
+ def _handle_single_modality(self, query_string: str) -> pd.DataFrame: + final_df = self.df.loc[ + self.df["modality"] == query_string, + ["study", "patient_ID", "series", "folder", "subseries"] + ] + final_df.rename( + columns={ + "series": f"series_{query_string}", + "study": f"study_{query_string}", + "folder": f"folder_{query_string}", + "subseries": f"subseries_{query_string}" + }, + inplace=True + ) + return final_df def parser(self, query_string: str) -> pd.DataFrame: # ... if query_string in supported_modalities: - final_df = self.df.loc[...] # existing code + final_df = self._handle_single_modality(query_string)
🧹 Nitpick comments (2)
src/imgtools/modules/datagraph.py (2)
1-4
: Consider adding type hints for class variables.While type hints have been added to method parameters and return types, consider also adding type hints for class variables in the class docstring or as class variable annotations. This would improve code maintainability and IDE support.
class DataGraph: + df: pd.DataFrame + edge_path: Path + df_new: pd.DataFrame | None + df_edges: pd.DataFrame + queried_modalities: List[str] + final_dict: List[dict]
356-358
: Add validation for resolved paths.While the path resolution has been improved, consider adding validation to ensure the resolved paths exist and are accessible.
for col in final_df.columns: if col.startswith("folder"): - final_df[col] = final_df[col].apply(lambda x: (self.edge_path.parent.parent / x).resolve().as_posix() if isinstance(x, str) else x) + final_df[col] = final_df[col].apply( + lambda x: ( + path := (self.edge_path.parent.parent / x).resolve(), + path.as_posix() if path.exists() else None + )[1] if isinstance(x, str) else x + )
self.df = pd.read_csv(path_crawl, index_col=0) | ||
self.edge_path = edge_path | ||
self.edge_path = Path(edge_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add path validation for edge_path.
The code should validate that the parent directory of edge_path
exists before attempting to save files there.
def __init__(self,
path_crawl: str | Path,
edge_path: str | Path = "./patient_id_full_edges.csv",
visualize: bool = False,
update: bool = False) -> None:
self.df = pd.read_csv(path_crawl, index_col=0)
self.edge_path = Path(edge_path)
+ self.edge_path.parent.mkdir(parents=True, exist_ok=True)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
self.df = pd.read_csv(path_crawl, index_col=0) | |
self.edge_path = edge_path | |
self.edge_path = Path(edge_path) | |
def __init__(self, | |
path_crawl: str | Path, | |
edge_path: str | Path = "./patient_id_full_edges.csv", | |
visualize: bool = False, | |
update: bool = False) -> None: | |
self.df = pd.read_csv(path_crawl, index_col=0) | |
self.edge_path = Path(edge_path) | |
self.edge_path.parent.mkdir(parents=True, exist_ok=True) |
Summary by CodeRabbit