RFC: Enhance AWS HealthOmics MCP Server with Data Store Management Capabilities
Summary
Propose enhancing the existing aws-healthomics-mcp-server with comprehensive data store management capabilities to create a complete genomic research solution covering both workflow management and data operations.
Problem Statement
The current AWS HealthOmics MCP server provides excellent workflow management capabilities (create, execute, analyze WDL/CWL/Nextflow workflows) but lacks data access and management functionality. Genomic researchers need both capabilities for complete workflows:
- Data Management: Access sequence stores, variant stores, reference stores, annotation stores
- Data Import: Seamless local file → S3 → HealthOmics import workflows
- Data Analysis: Search variants, fetch sequences, get coverage profiles
- Access Control: IAM role management and validation
This creates a gap where researchers must use separate tools or write custom scripts for data operations while using the MCP server for workflow management.
Proposed Solution
Enhance the existing aws-healthomics-mcp-server by adding complementary data store management tools, creating a comprehensive solution that covers:
Current Capabilities (Workflow Management)
- ✅ Create and version workflows
- ✅ Execute and monitor workflow runs
- ✅ Performance analysis and troubleshooting
- ✅ Workflow linting and validation
Proposed New Capabilities (Data Management)
- 🆕 Sequence Store Operations: List, search, and retrieve genomic sequences
- 🆕 Variant Store Operations: Search variants by gene, position, impact, frequency
- 🆕 Reference Store Operations: Manage and access reference genomes
- 🆕 Annotation Store Operations: Search annotations and submit VCF files
- 🆕 Data Import Workflows: Local file → S3 → HealthOmics orchestration
- 🆕 S3 Integration: Discover genomic files, manage bucket access
Implementation Approach
New Tool Modules
Following the existing server structure, add these new tool modules:
awslabs/aws_healthomics_mcp_server/tools/
├── sequence_store_tools.py # Sequence data operations
├── variant_store_tools.py # Variant search and analysis
├── reference_store_tools.py # Reference genome management
├── annotation_store_tools.py # Annotation operations
└── data_import_tools.py # S3 integration and import
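As a sketch of what one of these modules could look like: the function below lists sequence stores with pagination. The function name, parameters, and summary shape are illustrative assumptions, not the server's actual API; the `client` is expected to be a boto3 `omics` client (or any test double exposing the same `list_sequence_stores` operation), injected so the tool stays testable.

```python
# Hypothetical sketch of sequence_store_tools.py -- names are
# illustrative, not the existing server's API.
from typing import Any


def list_sequence_stores(client: Any, max_results: int = 50) -> list[dict]:
    """Return basic metadata for each HealthOmics sequence store.

    `client` is assumed to be a boto3 `omics` client or a test double
    exposing the same `list_sequence_stores` operation.
    """
    stores: list[dict] = []
    request: dict = {"maxResults": max_results}
    while True:
        response = client.list_sequence_stores(**request)
        for store in response.get("sequenceStores", []):
            stores.append({
                "id": store.get("id"),
                "name": store.get("name"),
                "arn": store.get("arn"),
            })
        # The omics list_* operations paginate via nextToken.
        next_token = response.get("nextToken")
        if not next_token:
            break
        request["nextToken"] = next_token
    return stores
```

Injecting the client rather than constructing it inside the tool keeps the module consistent with unit testing via fakes or `botocore` stubs, without requiring AWS credentials.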
Design Principles
- Non-breaking: Preserve all existing functionality
- Consistent: Follow existing patterns and conventions
- Comprehensive: Add full test coverage
- Documented: Update README and documentation
Tool Integration
The new tools will integrate seamlessly with existing workflow tools:
- Import data using new tools → Execute workflows using existing tools
- Analyze results using both data access and workflow analysis tools
Value Proposition
For Genomic Researchers
- Complete Solution: One MCP server for all HealthOmics operations
- Streamlined Workflows: No need for separate tools or custom scripts
- Natural Language Interface: AI assistants can orchestrate complex multi-step operations
For the AWS MCP Ecosystem
- Enhanced Adoption: More comprehensive HealthOmics coverage
- Community Value: Addresses real user needs in genomic research
- Reference Implementation: Demonstrates best practices for complex AWS service integration
Example Use Cases
Complete Research Workflow
User: "Import my local FASTQ files, run my variant calling workflow, and analyze the results"
AI Assistant orchestrates:
1. [NEW] Upload files to S3 using data_import_tools
2. [NEW] Import to HealthOmics using sequence_store_tools
3. [EXISTING] Execute workflow using workflow_execution
4. [NEW] Analyze variants using variant_store_tools
5. [EXISTING] Generate performance report using run_analysis
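The five steps above can be sketched as a single orchestration. Every function name here is hypothetical; in practice the AI assistant would invoke the individual MCP tools in sequence rather than calling a chained function like this.

```python
# Illustrative chaining of the proposed data tools with the existing
# workflow tools. All callables are hypothetical stand-ins for MCP tools.

def run_complete_workflow(uploader, importer, executor, analyzer, reporter,
                          local_files, sequence_store_id, workflow_id):
    """Orchestrate the five-step research workflow end to end."""
    s3_uris = uploader(local_files)                       # [NEW] data_import_tools
    read_set_ids = importer(sequence_store_id, s3_uris)   # [NEW] sequence_store_tools
    run_id = executor(workflow_id, read_set_ids)          # [EXISTING] workflow_execution
    variants = analyzer(run_id)                           # [NEW] variant_store_tools
    report = reporter(run_id)                             # [EXISTING] run_analysis
    return {"run_id": run_id, "variants": variants, "report": report}
```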
Multi-Store Data Discovery
User: "What genomic data do I have available across all my HealthOmics stores?"
AI Assistant provides:
- [NEW] Sequence store inventory
- [NEW] Variant store summaries
- [NEW] Reference genome catalog
- [NEW] Available annotations
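A minimal sketch of that cross-store discovery: iterate over the four store types and count what is visible to the caller. The operation names follow the boto3 `omics` client (`list_sequence_stores`, `list_variant_stores`, etc.); the summary shape is an illustrative assumption.

```python
# Sketch of a multi-store inventory. The response-key convention
# (e.g. "sequenceStores") follows the boto3 omics list_* operations.

STORE_OPERATIONS = {
    "sequence_stores": "list_sequence_stores",
    "variant_stores": "list_variant_stores",
    "reference_stores": "list_reference_stores",
    "annotation_stores": "list_annotation_stores",
}


def inventory(client) -> dict:
    """Count the stores of each type visible to the caller."""
    counts = {}
    for label, operation in STORE_OPERATIONS.items():
        response = getattr(client, operation)()
        # Each list_* response nests its items under a camelCase key
        # (e.g. "sequenceStores"); derive it from the label.
        head, tail = label.split("_")
        key = head + tail.capitalize()
        counts[label] = len(response.get(key, []))
    return counts
```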
Implementation Details
Backward Compatibility
- Zero breaking changes to existing functionality
- All existing tools and APIs remain unchanged
- New tools are additive enhancements
Code Quality Standards
- Follow existing code patterns and structure
- Comprehensive unit and integration tests
- Type hints and documentation
- Pass all pre-commit hooks
Documentation Updates
- Expand main README with new capabilities
- Add tool documentation following existing format
- Update configuration examples
- Create usage examples for common workflows
Open Questions
- Tool Naming Convention: Should we follow the existing AHO prefix pattern (e.g., ListAHOSequences) or use a different convention for data tools?
- Configuration: Should data store IDs be configured via environment variables (like workflow execution) or discovered dynamically?
- Error Handling: Should we extend the existing error handling patterns or create new ones for data operations?
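On the configuration question, one possible resolution is a hybrid: prefer an environment variable, and fall back to dynamic discovery only when it is unambiguous. The variable name below is hypothetical, not existing server configuration.

```python
# Hypothetical hybrid configuration: env var first, then discovery.
# HEALTHOMICS_SEQUENCE_STORE_ID is an assumed name, not an existing one.
import os


def resolve_sequence_store_id(client=None):
    """Prefer an explicitly configured store ID; otherwise discover one."""
    configured = os.environ.get("HEALTHOMICS_SEQUENCE_STORE_ID")
    if configured:
        return configured
    if client is not None:
        stores = client.list_sequence_stores().get("sequenceStores", [])
        if len(stores) == 1:
            # Unambiguous: a single store can be used without configuration.
            return stores[0].get("id")
    return None
```

This mirrors how workflow execution already reads its configuration from the environment while still supporting zero-config use for accounts with a single store.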
Request for Feedback
This RFC proposes a significant enhancement that would make the AWS HealthOmics MCP server the definitive solution for genomic research workflows. The implementation:
- ✅ Fills a real gap in current functionality
- ✅ Follows AWS MCP patterns and guidelines
- ✅ Maintains backward compatibility
- ✅ Adds significant value for users
We would appreciate feedback on:
- Overall approach and feasibility
- Implementation strategy and structure
- Any concerns or alternative approaches
- Timeline and coordination preferences
References
- Existing AWS HealthOmics MCP Server
- AWS HealthOmics Documentation
- Model Context Protocol Specification
Proposed by: Peter (CarlsbergGBS)
Implementation Ready: Full working implementation available for review
Timeline: Ready to start immediately upon approval