Skip to content

Conversation

@immortal71
Copy link

Fixes #11583

This PR addresses the performance bottleneck when using ClickHouse in multi-profile molecular data fetches. Instead of repeated per-gene queries (N+1), the ClickHouse repository now fetches per-sample rows from the genetic_alteration_derived table and aggregates them into the legacy values CSV format expected by the service layer. The service now requests all entrez gene IDs in a single call.

Key changes:

  • New ClickHouse mapper & XML to query genetic_alteration_derived.
  • ClickHouse repository that aggregates per-sample rows into GeneMolecularAlteration.
  • Service change to call repository once with the full set of requested entrez IDs.
  • Unit tests for clickhouse aggregation.

Notes & next steps: Add entrez_gene_id to the derived table to avoid a join to gene during the ClickHouse query for better perf.

Copilot AI review requested due to automatic review settings December 5, 2025 23:34
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes molecular data fetching for ClickHouse by eliminating N+1 query patterns when retrieving data across multiple profiles. Instead of querying per gene, the implementation now fetches all requested genes in a single ClickHouse query and aggregates per-sample rows into the legacy CSV format expected by the service layer.

Key Changes

  • Introduced ClickHouse-specific repository that queries genetic_alteration_derived and aggregates results into GeneMolecularAlteration objects
  • Modified service layer to batch all entrez gene IDs into a single repository call
  • Added comprehensive unit tests for both service and repository aggregation logic

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
MolecularDataServiceImpl.java Replaced per-gene streaming queries with single batched repository call
MolecularDataMyBatisClickhouseRepository.java New repository implementation that aggregates per-sample ClickHouse rows into CSV format
MolecularDataMapper.java New mapper interface for ClickHouse queries
MolecularDataMapper.xml MyBatis XML query definition for fetching per-sample molecular data
MolecularDataRowPerSample.java New model class representing individual sample-level molecular data rows
MolecularDataServiceImplTest.java Added test verifying multi-profile molecular data fetch with single repository call
MolecularDataMyBatisClickhouseRepositoryTest.java Added test verifying aggregation logic from per-sample rows to CSV format

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@immortal71
Copy link
Author

@heuermh @jorvis @alisman , does this need any changes ?, I am little confused here can you guide me and review this pr ??

@dippindots dippindots requested review from dippindots, fuzhaoyuan and onursumer and removed request for dippindots December 9, 2025 16:20
Copy link
Member

@onursumer onursumer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if master is the best target for this PR. It might be better if we try optimizing the rc-7.0-clickhouse-only branch because eventually we will switch to clickhouse only implementation.


@Repository
@ConditionalOnProperty(name = "clickhouse_mode", havingValue = "test")
public class MolecularDataMyBatisClickhouseRepository implements MolecularDataRepository {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably just modify MolecularDataMyBatisRepository instead of introducing another legacy repository class.


import java.io.Serializable;

public class MolecularDataRowPerSample implements Serializable {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to introduce a new legacy model? Can't we achieve the same thing by just using a map and the existing GeneMolecularAlteration model?

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" "http://mybatis.org/dtd/mybatis-3-mapper.dtd">

<mapper namespace="org.cbioportal.legacy.persistence.mybatisclickhouse.MolecularDataMapper">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can probably just modify the existing mapper src/main/resources/org/cbioportal/legacy/persistence/mybatis/MolecularDataMapper.xml instead of introducing a new one.

import org.apache.ibatis.annotations.Param;
import org.cbioportal.legacy.model.MolecularDataRowPerSample;

public interface MolecularDataMapper {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just modify the existing legacy mapper instead of introducing another legacy mapper

@immortal71
Copy link
Author

@onursumer
Thanks a lot for the detailed review and guidance. I understand that the preferred approach is to optimize the existing ClickHouse path rather than adding separate legacy classes and mappers. I’ll update this PR to: ~
--Target the rc-7.0-clickhouse-only branch instead of master.
-- Modify 'MolecularDataMyBatisRepository' and the existing MolecularDataMapper.xml instead of introducing new legacy repository/mapper classes and models.
I’ll push an updated version based on this direction and ping you for another review.

@immortal71 immortal71 changed the base branch from master to rc-7.0-clickhouse-only December 11, 2025 09:41
@immortal71
Copy link
Author

immortal71 commented Dec 11, 2025

Refactoring Complete per Reviewer Feedback

I've updated this PR to address all the feedback from @onursumer

Changes Made:

  1. Removed separate ClickHouse-specific classes:
  • Deleted MolecularDataMyBatisClickhouseRepository
  • Deleted MolecularDataRowPerSample model
  • Deleted ClickHouse-specific mapper files in mybatisclickhouse package
  1. Integrated optimization into existing infrastructure:
  • Modified MolecularDataMyBatisRepository to include ClickHouse-specific logic using @Value("${clickhouse_mode:false}") instead of @ConditionalOnProperty
  • Added conditional path in getGeneMolecularAlterationsInMultipleMolecularProfiles() that switches based on clickhouseMode flag
  • Reuses existing GeneMolecularAlteration model throughout
  1. Updated existing mapper:
  • Added getGeneMolecularAlterationsInMultipleMolecularProfilesClickHouse() method to existing MolecularDataMapper interface
  • Added conditional ClickHouse query in existing MolecularDataMapper.xml that queries genetic_alteration_derived table
  • Repository aggregates per-sample rows into CSV format as expected by service layer
  1. Changed property value:
  • Changed from havingValue="test" to proper boolean flag check as suggested by Copilot
  1. Updated PR target:
  • Changed base branch from master to rc-7.0-clickhouse-only as requested

The optimization still eliminates N+1 queries by fetching all genes in a single ClickHouse query, but now integrates cleanly into existing codebase without introducing parallel legacy classes.

@immortal71
Copy link
Author

@onursumer waiting for the review !! and further guidance

immortal71 added a commit to immortal71/cbioportal that referenced this pull request Dec 11, 2025
…ioPortal#11761

-- Created MolecularDataCountItem model to represent per-profile counts
-- Added    fetchMolecularDataCountsInMultipleMolecularProfiles method to service layer
-- Implemented new /api/molecular-data/counts POST endpoint
- Returns JSON array with count per molecular profile in single database query
-- Leverages existing getMolecularDataInMultipleMolecularProfiles optimization from PR cBioPortal#11840
-- Added unit tests for service and controller layers
-- Includes implementation plan document for reference
@immortal71
Copy link
Author

immortal71 commented Dec 28, 2025

@zainasir @inodb @onursumer @sheridancbio I've pushed a fix (commit 97a0773) that resolves the circular dependency issue causing the build failures. The fix:

  • Replaced SampleService injection with SampleMapper to eliminate circular dependency
  • Added null safety checks and error handling
  • Ensures compatibility with both MySQL and ClickHouse environments

Could you please approve the pending workflows so the new builds can run with the fixed code? Thanks!

@immortal71
Copy link
Author

immortal71 commented Jan 2, 2026

Friendly ping on this PR. All feedback from @onursumer has been implemented, circular dependency fixed in 97a0773, and tests pass locally. When you have time, could you please take another look ?

@dippindots dippindots requested a review from onursumer January 3, 2026 00:19
@onursumer
Copy link
Member

Hi @immortal71, we recently merged rc-7.0-clickhouse-only into master. Can you change the base branch back to master and rebase your PR? Thanks!

@immortal71 immortal71 changed the base branch from rc-7.0-clickhouse-only to master January 8, 2026 13:32
@immortal71
Copy link
Author

@onursumer done !!

@immortal71
Copy link
Author

@onursumer Can you review it ?

@onursumer
Copy link
Member

@immortal71 can you also rebase your branch on master (git rebase master) so that it includes latest changes from master?

@immortal71
Copy link
Author

@immortal71 can you also rebase your branch on master (git rebase master) so that it includes latest changes from master?

@onursumer done !!

@immortal71
Copy link
Author

immortal71 commented Jan 9, 2026

@onursumer
Can you review this !!

@onursumer
Copy link
Member

@immortal71 your branch is still 22 commits behind the master branch. Can you rebase it on the latest cBioPortal/cbioportal:master?

image

…gle query to reduce N+1 queries (ClickHouse perf)
…ry that aggregates per-sample rows into gene-profile values
…nto existing repository/mapper - Remove separate ClickHouse classes - Change conditional property to true
Per reviewer feedback from @onursumer:
--> Removed separate ClickHouse-specific repository and mapper classes
--> Moved optimization into existing MolecularDataMyBatisRepository
- Updated existing MolecularDataMapper.xml with conditional ClickHouse query
--> Changed @ConditionalOnProperty havingValue from 'test' to 'true'
--> Reuses existing GeneMolecularAlteration model instead of new legacy classes

The ClickHouse path queries genetic_alteration_derived table and aggregates
per-sample rows into CSV format in the repository layer.
…ization

- Replaced SampleService injection with SampleMapper to avoid circular dependency
- Added null safety check for optional SampleMapper dependency
- Added try-catch with fallback to standard method for database compatibility
- Added SLF4J logger for debugging and error tracking
- Ensures tests pass in both MySQL and ClickHouse environments

Fixes cBioPortal#11583
@immortal71 immortal71 force-pushed the issue-11583-clickhouse-performance branch from 97a0773 to d4c610f Compare January 10, 2026 04:40
@immortal71
Copy link
Author

@onursumer done!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

molecular data endpoint is much slower against Clickhouse

2 participants