Fixes #11583 — Optimize molecular data multi-profile fetch for ClickHouse (reduce N+1 queries) #11840
Pull request overview
This PR optimizes molecular data fetching for ClickHouse by eliminating N+1 query patterns when retrieving data across multiple profiles. Instead of querying per gene, the implementation now fetches all requested genes in a single ClickHouse query and aggregates per-sample rows into the legacy CSV format expected by the service layer.
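To illustrate the difference in round trips, here is a minimal sketch. `CountingRepository` and its `fetch` method are hypothetical stand-ins invented for this example, not classes from the PR; the fake repository only counts calls instead of querying a database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative only: none of these types exist in cBioPortal. */
class BatchedFetchSketch {

    /** Hypothetical fake repository that counts how many queries it receives. */
    static class CountingRepository {
        final AtomicInteger queryCount = new AtomicInteger();

        List<String> fetch(String profileId, List<Integer> entrezGeneIds) {
            queryCount.incrementAndGet(); // each call models one database round trip
            List<String> rows = new ArrayList<>();
            for (Integer geneId : entrezGeneIds) {
                rows.add(profileId + ":" + geneId);
            }
            return rows;
        }
    }

    /** Per-gene style: one query for every requested gene. */
    static List<String> fetchPerGene(CountingRepository repo, String profileId, List<Integer> geneIds) {
        List<String> result = new ArrayList<>();
        for (Integer geneId : geneIds) {
            result.addAll(repo.fetch(profileId, List.of(geneId)));
        }
        return result;
    }

    /** Batched style: all requested genes go into a single query. */
    static List<String> fetchBatched(CountingRepository repo, String profileId, List<Integer> geneIds) {
        return repo.fetch(profileId, geneIds);
    }
}
```

Both methods return the same rows; only the number of calls to the repository differs (one per gene versus one in total).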
Key Changes
- Introduced ClickHouse-specific repository that queries `genetic_alteration_derived` and aggregates results into `GeneMolecularAlteration` objects
- Modified service layer to batch all entrez gene IDs into a single repository call
- Added comprehensive unit tests for both service and repository aggregation logic
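
The aggregation step described above could look roughly like the following sketch. `SampleRow` and `GeneValues` are simplified, hypothetical stand-ins (the real row model and `GeneMolecularAlteration` have different fields), and the sketch assumes rows arrive already ordered by sample within each profile and gene.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RowAggregationSketch {

    /** Hypothetical per-sample row; the field names are assumptions. */
    record SampleRow(String profileId, int entrezGeneId, String sampleId, String value) {}

    /** Hypothetical simplified stand-in for GeneMolecularAlteration. */
    record GeneValues(String profileId, int entrezGeneId, String values) {}

    /** Groups rows by (profile, gene) and joins their values into one CSV string. */
    static List<GeneValues> aggregate(List<SampleRow> rows) {
        // LinkedHashMap keeps groups in first-seen order.
        Map<String, List<String>> valuesByKey = new LinkedHashMap<>();
        Map<String, SampleRow> firstRowByKey = new LinkedHashMap<>();
        for (SampleRow row : rows) {
            String key = row.profileId() + "|" + row.entrezGeneId();
            valuesByKey.computeIfAbsent(key, k -> new ArrayList<>()).add(row.value());
            firstRowByKey.putIfAbsent(key, row);
        }
        List<GeneValues> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : valuesByKey.entrySet()) {
            SampleRow first = firstRowByKey.get(entry.getKey());
            result.add(new GeneValues(
                first.profileId(), first.entrezGeneId(), String.join(",", entry.getValue())));
        }
        return result;
    }
}
```

Because the CSV carries no sample IDs, the position of each value has to line up with the profile's sample order; that ordering is assumed here rather than enforced.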
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| MolecularDataServiceImpl.java | Replaced per-gene streaming queries with single batched repository call |
| MolecularDataMyBatisClickhouseRepository.java | New repository implementation that aggregates per-sample ClickHouse rows into CSV format |
| MolecularDataMapper.java | New mapper interface for ClickHouse queries |
| MolecularDataMapper.xml | MyBatis XML query definition for fetching per-sample molecular data |
| MolecularDataRowPerSample.java | New model class representing individual sample-level molecular data rows |
| MolecularDataServiceImplTest.java | Added test verifying multi-profile molecular data fetch with single repository call |
| MolecularDataMyBatisClickhouseRepositoryTest.java | Added test verifying aggregation logic from per-sample rows to CSV format |
onursumer
left a comment
Not sure if master is the best target for this PR. It might be better if we try optimizing the rc-7.0-clickhouse-only branch because eventually we will switch to clickhouse only implementation.

    @Repository
    @ConditionalOnProperty(name = "clickhouse_mode", havingValue = "test")
    public class MolecularDataMyBatisClickhouseRepository implements MolecularDataRepository {

We should probably just modify MolecularDataMyBatisRepository instead of introducing another legacy repository class.

    import java.io.Serializable;

    public class MolecularDataRowPerSample implements Serializable {

Do we really need to introduce a new legacy model? Can't we achieve the same thing by just using a map and the existing GeneMolecularAlteration model?
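
A minimal sketch of what this suggestion might look like, assuming each query row is available as a `Map<String, Object>` (MyBatis can return rows as maps). The `Alteration` class is a hypothetical, simplified stand-in for the existing `GeneMolecularAlteration` model, and the column keys `profileId`, `entrezGeneId`, and `value` are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

class MapBasedAggregationSketch {

    /** Hypothetical simplified stand-in for the existing GeneMolecularAlteration model. */
    static class Alteration {
        String molecularProfileId;
        Integer entrezGeneId;
        String values;
    }

    /** Groups map-shaped rows by profile and gene without a dedicated row model. */
    static List<Alteration> toAlterations(List<Map<String, Object>> rows) {
        // profileId -> (entrezGeneId -> comma-joined values), in first-seen order
        Map<String, Map<Integer, StringJoiner>> byProfileAndGene = new LinkedHashMap<>();
        for (Map<String, Object> row : rows) {
            String profileId = (String) row.get("profileId");      // assumed column key
            Integer geneId = (Integer) row.get("entrezGeneId");    // assumed column key
            String value = (String) row.get("value");              // assumed column key
            byProfileAndGene
                .computeIfAbsent(profileId, k -> new LinkedHashMap<>())
                .computeIfAbsent(geneId, k -> new StringJoiner(","))
                .add(value);
        }
        List<Alteration> result = new ArrayList<>();
        byProfileAndGene.forEach((profileId, byGene) -> byGene.forEach((geneId, joined) -> {
            Alteration alteration = new Alteration();
            alteration.molecularProfileId = profileId;
            alteration.entrezGeneId = geneId;
            alteration.values = joined.toString();
            result.add(alteration);
        }));
        return result;
    }
}
```

The trade-off is the usual one: maps avoid an extra legacy class, at the cost of untyped column access and casts.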

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" "http://mybatis.org/dtd/mybatis-3-mapper.dtd">

    <mapper namespace="org.cbioportal.legacy.persistence.mybatisclickhouse.MolecularDataMapper">

We can probably just modify the existing mapper src/main/resources/org/cbioportal/legacy/persistence/mybatis/MolecularDataMapper.xml instead of introducing a new one.

    import org.apache.ibatis.annotations.Param;
    import org.cbioportal.legacy.model.MolecularDataRowPerSample;

    public interface MolecularDataMapper {

I would just modify the existing legacy mapper instead of introducing another legacy mapper

@onursumer

Refactoring Complete per Reviewer Feedback

I've updated this PR to address all the feedback from @onursumer. Changes Made:

The optimization still eliminates N+1 queries by fetching all genes in a single ClickHouse query, but now integrates cleanly into the existing codebase without introducing parallel legacy classes.

@onursumer waiting for the review and further guidance!!

…ioPortal#11761
- Created MolecularDataCountItem model to represent per-profile counts
- Added fetchMolecularDataCountsInMultipleMolecularProfiles method to service layer
- Implemented new /api/molecular-data/counts POST endpoint
  - Returns JSON array with count per molecular profile in single database query
- Leverages existing getMolecularDataInMultipleMolecularProfiles optimization from PR cBioPortal#11840
- Added unit tests for service and controller layers
- Includes implementation plan document for reference

@zainasir @inodb @onursumer @sheridancbio I've pushed a fix (commit 97a0773) that resolves the circular dependency issue causing the build failures. The fix:

Could you please approve the pending workflows so the new builds can run with the fixed code? Thanks!

Friendly ping on this PR. All feedback from @onursumer has been implemented, the circular dependency was fixed in 97a0773, and tests pass locally. When you have time, could you please take another look?

Hi @immortal71, we recently merged rc-7.0-clickhouse-only into master. Can you change the base branch back to master and rebase your PR? Thanks!

@onursumer done!!

@onursumer Can you review it?

@immortal71 can you also rebase your branch on master (

@onursumer done!!

@onursumer

@immortal71 your branch is still 22 commits behind the master branch. Can you rebase it on the latest

…gle query to reduce N+1 queries (ClickHouse perf)
…ry that aggregates per-sample rows into gene-profile values
…nto existing repository/mapper
- Remove separate ClickHouse classes
- Change conditional property to true

Per reviewer feedback from @onursumer:
- Removed separate ClickHouse-specific repository and mapper classes
- Moved optimization into existing MolecularDataMyBatisRepository
- Updated existing MolecularDataMapper.xml with conditional ClickHouse query
- Changed @ConditionalOnProperty havingValue from 'test' to 'true'
- Reuses existing GeneMolecularAlteration model instead of new legacy classes

The ClickHouse path queries the genetic_alteration_derived table and aggregates per-sample rows into CSV format in the repository layer.

…ization
- Replaced SampleService injection with SampleMapper to avoid circular dependency
- Added null safety check for optional SampleMapper dependency
- Added try-catch with fallback to standard method for database compatibility
- Added SLF4J logger for debugging and error tracking
- Ensures tests pass in both MySQL and ClickHouse environments

Fixes cBioPortal#11583
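
The null check and try-catch fallback described in that commit could be sketched as below. This is not the PR's code: `fetchWithFallback` and both suppliers are hypothetical names, and `java.util.logging` is used in place of SLF4J only to keep the example free of external dependencies.

```java
import java.util.List;
import java.util.function.Supplier;
import java.util.logging.Logger;

class FallbackFetchSketch {

    private static final Logger LOG = Logger.getLogger(FallbackFetchSketch.class.getName());

    /**
     * Tries the optimized path first and falls back to the standard path when the
     * optimized dependency is absent (null) or throws at runtime.
     */
    static <T> List<T> fetchWithFallback(Supplier<List<T>> optimized, Supplier<List<T>> standard) {
        if (optimized == null) {
            // Optional dependency not wired in: go straight to the standard path.
            return standard.get();
        }
        try {
            return optimized.get();
        } catch (RuntimeException e) {
            LOG.warning("Optimized fetch failed, falling back to standard fetch: " + e.getMessage());
            return standard.get();
        }
    }
}
```

A catch-all fallback keeps both database backends working, but it can also hide real failures in the optimized path, so logging the exception matters.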

97a0773 to d4c610f

@onursumer done!!

Fixes #11583
This PR addresses the performance bottleneck when using ClickHouse in multi-profile molecular data fetches. Instead of repeated per-gene queries (N+1), the ClickHouse repository now fetches per-sample rows from the `genetic_alteration_derived` table and aggregates them into the legacy `values` CSV format expected by the service layer. The service now requests all entrez gene IDs in a single call.

Key changes:
- … `genetic_alteration_derived`.
- … `GeneMolecularAlteration`.

Notes & next steps: Add `entrez_gene_id` to the derived table to avoid a join to `gene` during the ClickHouse query, for better performance.