Skip to content

Conversation

@fuzhaoyuan
Copy link
Contributor

@fuzhaoyuan fuzhaoyuan commented Aug 15, 2025

Built on top of #11425

Refactor Clinical Data Enrichments

Summary

Refactor clinical data enrichments to follow clean architecture principles, migrating from legacy service layer to domain layer with ClickHouse performance optimizations.

Note

In the event of responses largely differ from legacy responses, it's due to the legacy implementation doesn't unify casing of attribute values, causing inaccurate number of groups, for example:
image
This is because legacy has both 'PRIMARY_SITE' attribute values of "Ascending Colon" and "Ascending colon" as separate groups, having one more group than needed, thus led to inaccurate calculation results.

Key Changes

Architecture Migration

  • Before: Logic in org.cbioportal.legacy.web.ClinicalDataEnrichmentController + ClinicalDataEnrichmentUtil
  • After: Clean separation across layers
    • REST: ColumnStoreClinicalDataEnrichmentController
    • Domain: FetchClinicalDataEnrichmentsUseCase (business logic)
    • Domain: ClinicalDataEnrichmentUtil (pure statistical functions)

Performance Improvements

Legacy approach - Multiple queries per group:

// For each group, separate DB calls
groupedSamples.stream()
    .map(group -> clinicalDataService.fetchClinicalData(...))

New approach - Single optimized query:

// Fetch ALL data for ALL groups in ONE query, regroup in-memory
fetchClinicalDataSummaryForEnrichments(allSampleIds, allPatientIds, ...)

Impact: Reduces database queries from O(groups × attributes) to O(1)

New Domain Layer

  • ClinicalDataEnrichment - immutable record (was mutable bean)
  • EnrichmentTestMethod - type-safe enum (was magic strings)
  • FetchClinicalDataEnrichmentsUseCase - encapsulates enrichment workflow
  • ClinicalDataEnrichmentUtil - pure statistical functions

Repository Enhancements

  • fetchClinicalDataSummaryForEnrichments() - batch fetch with UNION query
  • getClinicalDataCountsForEnrichments() - optimized categorical counts
  • New ClickHouse SQL mapper handling sample/patient/conflicting attributes

Other Improvements

  • Comprehensive test coverage
  • MapStruct DTO mapping
  • Fixed PreAuthorize for ClinicalData and ClinicalDataEnrichment
  • Better integration test organization

@fuzhaoyuan fuzhaoyuan self-assigned this Aug 15, 2025
@fuzhaoyuan fuzhaoyuan force-pushed the master-clean-arch-clinical-data-enrichments branch 4 times, most recently from 5caf22f to 86c83fe Compare October 16, 2025 15:16
@fuzhaoyuan fuzhaoyuan force-pushed the master-clean-arch-clinical-data-enrichments branch 5 times, most recently from 3ff5559 to ad2a5d8 Compare October 23, 2025 21:53
@fuzhaoyuan fuzhaoyuan marked this pull request as ready for review October 23, 2025 21:53
@fuzhaoyuan fuzhaoyuan force-pushed the master-clean-arch-clinical-data-enrichments branch from ad2a5d8 to 1176aff Compare October 24, 2025 14:40
Signed-off-by: Zhaoyuan (Ryan) Fu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant