RFC: Enhance AWS HealthOmics MCP Server with Data Store Management Capabilities #1421

@peterbb148

Summary

This RFC proposes extending the existing aws-healthomics-mcp-server with data store management tools, so that a single server covers both workflow management and data operations for genomic research.

Problem Statement

The current AWS HealthOmics MCP server provides workflow management (creating, executing, and analyzing WDL, CWL, and Nextflow workflows) but has no data access or data management functionality. A complete research workflow also needs the following data capabilities:

  1. Data Management: Access sequence stores, variant stores, reference stores, annotation stores
  2. Data Import: Seamless local file → S3 → HealthOmics import workflows
  3. Data Analysis: Search variants, fetch sequences, get coverage profiles
  4. Access Control: IAM role management and validation

As a result, researchers must use separate tools or write custom scripts for data operations while relying on the MCP server for workflow management.

Proposed Solution

Add complementary data store management tools to the existing aws-healthomics-mcp-server. The resulting server would cover:

Current Capabilities (Workflow Management)

  • ✅ Create and version workflows
  • ✅ Execute and monitor workflow runs
  • ✅ Performance analysis and troubleshooting
  • ✅ Workflow linting and validation

Proposed New Capabilities (Data Management)

  • 🆕 Sequence Store Operations: List, search, and retrieve genomic sequences
  • 🆕 Variant Store Operations: Search variants by gene, position, impact, frequency
  • 🆕 Reference Store Operations: Manage and access reference genomes
  • 🆕 Annotation Store Operations: Search annotations and submit VCF files
  • 🆕 Data Import Workflows: Local file → S3 → HealthOmics orchestration
  • 🆕 S3 Integration: Discover genomic files, manage bucket access

Implementation Approach

New Tool Modules

Following the existing server structure, add these new tool modules:

awslabs/aws_healthomics_mcp_server/tools/
├── sequence_store_tools.py     # Sequence data operations
├── variant_store_tools.py      # Variant search and analysis
├── reference_store_tools.py    # Reference genome management  
├── annotation_store_tools.py   # Annotation operations
└── data_import_tools.py        # S3 integration and import
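
To make the module layout above more concrete, here is a minimal sketch of what one function in sequence_store_tools.py could look like. It assumes the caller passes in a boto3 `omics` client (or any object with the same `list_sequence_stores` method), and it leaves out the MCP tool registration and error handling that the existing server modules would add. The function name and return shape are illustrative, not part of the current codebase.

```python
from typing import Any, Dict, List, Optional


def list_sequence_stores(client: Any, max_results: int = 100) -> List[Dict[str, Any]]:
    """Return id, name, and ARN for every sequence store.

    `client` is assumed to behave like boto3.client("omics"): each call to
    list_sequence_stores returns a page with a "sequenceStores" list and,
    when more pages exist, a "nextToken" string.
    """
    stores: List[Dict[str, Any]] = []
    next_token: Optional[str] = None
    while True:
        kwargs: Dict[str, Any] = {"maxResults": max_results}
        if next_token:
            kwargs["nextToken"] = next_token
        page = client.list_sequence_stores(**kwargs)
        for store in page.get("sequenceStores", []):
            # Keep only the fields an assistant needs to pick a store.
            stores.append(
                {
                    "id": store.get("id"),
                    "name": store.get("name"),
                    "arn": store.get("arn"),
                }
            )
        next_token = page.get("nextToken")
        if not next_token:
            return stores
```

Taking the client as a parameter keeps the function testable with a stub and leaves credential and region handling to whatever helper the server already uses.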

Design Principles

  • Non-breaking: Preserve all existing functionality
  • Consistent: Follow existing patterns and conventions
  • Comprehensive: Add full test coverage
  • Documented: Update README and documentation

Tool Integration

The new tools are designed to compose with the existing workflow tools:

  • Import data using new tools → Execute workflows using existing tools
  • Analyze results using both data access and workflow analysis tools

Value Proposition

For Genomic Researchers

  • Complete Solution: One MCP server for all HealthOmics operations
  • Streamlined Workflows: No need for separate tools or custom scripts
  • Natural Language Interface: AI assistants can orchestrate complex multi-step operations

For the AWS MCP Ecosystem

  • Enhanced Adoption: More comprehensive HealthOmics coverage
  • Community Value: Addresses real user needs in genomic research
  • Reference Implementation: Demonstrates best practices for complex AWS service integration

Example Use Cases

Complete Research Workflow

User: "Import my local FASTQ files, run my variant calling workflow, and analyze the results"

AI Assistant orchestrates:
1. [NEW] Upload files to S3 using data_import_tools
2. [NEW] Import to HealthOmics using sequence_store_tools  
3. [EXISTING] Execute workflow using workflow_execution
4. [NEW] Analyze variants using variant_store_tools
5. [EXISTING] Generate performance report using run_analysis
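
Step 2 above (importing uploaded FASTQ files into a sequence store) could be sketched roughly as follows. This assumes the files are already in S3 and that `client` behaves like a boto3 `omics` client exposing `start_read_set_import_job`. The helper names `build_fastq_source` and `start_fastq_import` are hypothetical and only illustrate the request shape.

```python
from typing import Any, Dict, Optional


def build_fastq_source(
    r1_uri: str,
    r2_uri: Optional[str] = None,
    *,
    subject_id: str,
    sample_id: str,
    name: Optional[str] = None,
) -> Dict[str, Any]:
    """Build one entry for the `sources` list of a read set import job."""
    for uri in (r1_uri, r2_uri):
        if uri is not None and not uri.startswith("s3://"):
            raise ValueError(f"expected an s3:// URI, got {uri!r}")
    source_files = {"source1": r1_uri}
    if r2_uri:
        # Paired-end reads: the second FASTQ goes in source2.
        source_files["source2"] = r2_uri
    source: Dict[str, Any] = {
        "sourceFiles": source_files,
        "sourceFileType": "FASTQ",
        "subjectId": subject_id,
        "sampleId": sample_id,
    }
    if name:
        source["name"] = name
    return source


def start_fastq_import(
    client: Any, sequence_store_id: str, role_arn: str, source: Dict[str, Any]
) -> str:
    """Start the import job and return its job ID."""
    response = client.start_read_set_import_job(
        sequenceStoreId=sequence_store_id,
        roleArn=role_arn,
        sources=[source],
    )
    return response["id"]
```

Separating request construction from the API call lets the S3 URI validation be tested without AWS access. The IAM role passed as `role_arn` must allow HealthOmics to read the source objects.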

Multi-Store Data Discovery

User: "What genomic data do I have available across all my HealthOmics stores?"

AI Assistant provides:
- [NEW] Sequence store inventory
- [NEW] Variant store summaries
- [NEW] Reference genome catalog
- [NEW] Available annotations
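
The inventory described above could be assembled by calling the four store-listing operations in turn. The sketch below assumes `client` behaves like a boto3 `omics` client; `summarize_stores` and the `STORE_LISTINGS` table are hypothetical names used for illustration.

```python
from typing import Any, Dict, List

# Store kind -> (client method name, key holding the results in each page).
STORE_LISTINGS = {
    "sequence": ("list_sequence_stores", "sequenceStores"),
    "reference": ("list_reference_stores", "referenceStores"),
    "variant": ("list_variant_stores", "variantStores"),
    "annotation": ("list_annotation_stores", "annotationStores"),
}


def summarize_stores(client: Any) -> Dict[str, List[str]]:
    """Return the store names (or IDs when unnamed) for each store kind."""
    summary: Dict[str, List[str]] = {}
    for kind, (method_name, result_key) in STORE_LISTINGS.items():
        list_page = getattr(client, method_name)
        names: List[str] = []
        kwargs: Dict[str, Any] = {}
        while True:
            page = list_page(**kwargs)
            for store in page.get(result_key, []):
                names.append(store.get("name") or store.get("id", ""))
            token = page.get("nextToken")
            if not token:
                break
            # Request the next page of the same listing.
            kwargs = {"nextToken": token}
        summary[kind] = names
    return summary
```

A real tool would likely also report per-store item counts (read sets, references), which requires additional calls per store and is left out here.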

Implementation Details

Backward Compatibility

  • Zero breaking changes to existing functionality
  • All existing tools and APIs remain unchanged
  • New tools are additive enhancements

Code Quality Standards

  • Follow existing code patterns and structure
  • Comprehensive unit and integration tests
  • Type hints and documentation
  • Pass all pre-commit hooks

Documentation Updates

  • Expand main README with new capabilities
  • Add tool documentation following existing format
  • Update configuration examples
  • Create usage examples for common workflows

Open Questions

  1. Tool Naming Convention: Should we follow the existing AHO prefix pattern (e.g., ListAHOSequences) or use a different convention for data tools?

  2. Configuration: Should data store IDs be configured via environment variables (like workflow execution) or discovered dynamically?

  3. Error Handling: Should we extend the existing error handling patterns or create new ones for data operations?
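
For question 2, one possible middle ground is to honor an environment variable when it is set and fall back to discovery only when the account has exactly one candidate store. The sketch below illustrates that option for sequence stores. The variable name `HEALTHOMICS_SEQUENCE_STORE_ID` is hypothetical (it is not an existing server setting), `client` is assumed to behave like a boto3 `omics` client, and pagination of the discovery call is omitted for brevity.

```python
import os
from typing import Any, Mapping, Optional

# Hypothetical variable name, used here only for illustration.
ENV_VAR = "HEALTHOMICS_SEQUENCE_STORE_ID"


def resolve_sequence_store_id(
    client: Any, env: Optional[Mapping[str, str]] = None
) -> str:
    """Prefer an explicitly configured store ID, else discover a unique one."""
    env = os.environ if env is None else env
    configured = env.get(ENV_VAR)
    if configured:
        return configured
    stores = client.list_sequence_stores().get("sequenceStores", [])
    if len(stores) == 1:
        return stores[0]["id"]
    # Zero or several stores: refuse to guess and ask for explicit config.
    raise LookupError(
        f"found {len(stores)} sequence stores; set {ENV_VAR} to choose one"
    )
```

Failing loudly when the choice is ambiguous avoids importing data into the wrong store, at the cost of requiring configuration in multi-store accounts.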

Request for Feedback

This RFC proposes a significant enhancement that would let the AWS HealthOmics MCP server cover genomic research workflows end to end, from data import to result analysis. The proposed implementation:

  • ✅ Fills a real gap in current functionality
  • ✅ Follows AWS MCP patterns and guidelines
  • ✅ Maintains backward compatibility
  • ✅ Adds significant value for users

We would appreciate feedback on:

  • Overall approach and feasibility
  • Implementation strategy and structure
  • Any concerns or alternative approaches
  • Timeline and coordination preferences


Proposed by: Peter (CarlsbergGBS)
Implementation Ready: Full working implementation available for review
Timeline: Ready to start immediately upon approval
