RFC: Enhance AWS HealthOmics MCP Server with Data Store Management Capabilities
Summary
Propose enhancing the existing aws-healthomics-mcp-server with comprehensive data store management capabilities to create a complete genomic research solution covering both workflow management and data operations.
Problem Statement
The current AWS HealthOmics MCP server provides excellent workflow management capabilities (create, execute, analyze WDL/CWL/Nextflow workflows) but lacks data access and management functionality. Genomic researchers need both capabilities for complete workflows:
- Data Management: Access sequence stores, variant stores, reference stores, annotation stores
- Data Import: Seamless local file → S3 → HealthOmics import workflows
- Data Analysis: Search variants, fetch sequences, get coverage profiles
- Access Control: IAM role management and validation
This creates a gap where researchers must use separate tools or write custom scripts for data operations while using the MCP server for workflow management.
Proposed Solution
Enhance the existing aws-healthomics-mcp-server by adding complementary data store management tools, creating a comprehensive solution that covers:
Current Capabilities (Workflow Management)
- ✅ Create and version workflows
- ✅ Execute and monitor workflow runs
- ✅ Performance analysis and troubleshooting
- ✅ Workflow linting and validation
Proposed New Capabilities (Data Management)
- 🆕 Sequence Store Operations: List, search, and retrieve genomic sequences
- 🆕 Variant Store Operations: Search variants by gene, position, impact, frequency
- 🆕 Reference Store Operations: Manage and access reference genomes
- 🆕 Annotation Store Operations: Search annotations and submit VCF files
- 🆕 Data Import Workflows: Local file → S3 → HealthOmics orchestration
- 🆕 S3 Integration: Discover genomic files, manage bucket access
Implementation Approach
New Tool Modules
Following the existing server structure, add these new tool modules:
awslabs/aws_healthomics_mcp_server/tools/
├── sequence_store_tools.py # Sequence data operations
├── variant_store_tools.py # Variant search and analysis
├── reference_store_tools.py # Reference genome management
├── annotation_store_tools.py # Annotation operations
└── data_import_tools.py # S3 integration and import
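As a sketch of what one of these modules could look like: the function below lists sequence stores with pagination. The function name, parameters, and summary shape are illustrative assumptions, not the server's actual API; the `client` is expected to be a boto3 `omics` client (or any test double exposing the same `list_sequence_stores` operation), injected so the tool stays testable.

```python
# Hypothetical sketch of sequence_store_tools.py -- names are
# illustrative, not the existing server's API.
from typing import Any


def list_sequence_stores(client: Any, max_results: int = 50) -> list[dict]:
    """Return basic metadata for each HealthOmics sequence store.

    `client` is assumed to be a boto3 `omics` client or a test double
    exposing the same `list_sequence_stores` operation.
    """
    stores: list[dict] = []
    request: dict = {"maxResults": max_results}
    while True:
        response = client.list_sequence_stores(**request)
        for store in response.get("sequenceStores", []):
            stores.append({
                "id": store.get("id"),
                "name": store.get("name"),
                "arn": store.get("arn"),
            })
        # The omics list_* operations paginate via nextToken.
        next_token = response.get("nextToken")
        if not next_token:
            break
        request["nextToken"] = next_token
    return stores
```

Injecting the client rather than constructing it inside the tool keeps the module consistent with unit testing via fakes or `botocore` stubs, without requiring AWS credentials.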
Design Principles
- Non-breaking: Preserve all existing functionality
- Consistent: Follow existing patterns and conventions
- Comprehensive: Add full test coverage
- Documented: Update README and documentation
Tool Integration
The new tools will integrate seamlessly with existing workflow tools:
- Import data using new tools → Execute workflows using existing tools
- Analyze results using both data access and workflow analysis tools
Value Proposition
For Genomic Researchers
- Complete Solution: One MCP server for all HealthOmics operations
- Streamlined Workflows: No need for separate tools or custom scripts
- Natural Language Interface: AI assistants can orchestrate complex multi-step operations
For the AWS MCP Ecosystem
- Enhanced Adoption: More comprehensive HealthOmics coverage
- Community Value: Addresses real user needs in genomic research
- Reference Implementation: Demonstrates best practices for complex AWS service integration
Example Use Cases
Complete Research Workflow
User: "Import my local FASTQ files, run my variant calling workflow, and analyze the results"
AI Assistant orchestrates:
1. [NEW] Upload files to S3 using data_import_tools
2. [NEW] Import to HealthOmics using sequence_store_tools
3. [EXISTING] Execute workflow using workflow_execution
4. [NEW] Analyze variants using variant_store_tools
5. [EXISTING] Generate performance report using run_analysis
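The five steps above can be sketched as a single orchestration. Every function name here is hypothetical; in practice the AI assistant would invoke the individual MCP tools in sequence rather than calling a chained function like this.

```python
# Illustrative chaining of the proposed data tools with the existing
# workflow tools. All callables are hypothetical stand-ins for MCP tools.

def run_complete_workflow(uploader, importer, executor, analyzer, reporter,
                          local_files, sequence_store_id, workflow_id):
    """Orchestrate the five-step research workflow end to end."""
    s3_uris = uploader(local_files)                       # [NEW] data_import_tools
    read_set_ids = importer(sequence_store_id, s3_uris)   # [NEW] sequence_store_tools
    run_id = executor(workflow_id, read_set_ids)          # [EXISTING] workflow_execution
    variants = analyzer(run_id)                           # [NEW] variant_store_tools
    report = reporter(run_id)                             # [EXISTING] run_analysis
    return {"run_id": run_id, "variants": variants, "report": report}
```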
Multi-Store Data Discovery
User: "What genomic data do I have available across all my HealthOmics stores?"
AI Assistant provides:
- [NEW] Sequence store inventory
- [NEW] Variant store summaries
- [NEW] Reference genome catalog
- [NEW] Available annotations
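A minimal sketch of that cross-store discovery: iterate over the four store types and count what is visible to the caller. The operation names follow the boto3 `omics` client (`list_sequence_stores`, `list_variant_stores`, etc.); the summary shape is an illustrative assumption.

```python
# Sketch of a multi-store inventory. The response-key convention
# (e.g. "sequenceStores") follows the boto3 omics list_* operations.

STORE_OPERATIONS = {
    "sequence_stores": "list_sequence_stores",
    "variant_stores": "list_variant_stores",
    "reference_stores": "list_reference_stores",
    "annotation_stores": "list_annotation_stores",
}


def inventory(client) -> dict:
    """Count the stores of each type visible to the caller."""
    counts = {}
    for label, operation in STORE_OPERATIONS.items():
        response = getattr(client, operation)()
        # Each list_* response nests its items under a camelCase key
        # (e.g. "sequenceStores"); derive it from the label.
        head, tail = label.split("_")
        key = head + tail.capitalize()
        counts[label] = len(response.get(key, []))
    return counts
```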
Implementation Details
Backward Compatibility
- Zero breaking changes to existing functionality
- All existing tools and APIs remain unchanged
- New tools are additive enhancements
Code Quality Standards
- Follow existing code patterns and structure
- Comprehensive unit and integration tests
- Type hints and documentation
- Pass all pre-commit hooks
Documentation Updates
- Expand main README with new capabilities
- Add tool documentation following existing format
- Update configuration examples
- Create usage examples for common workflows
Open Questions
- Tool Naming Convention: Should we follow the existing AHO prefix pattern (e.g., ListAHOSequences) or use a different convention for data tools?
- Configuration: Should data store IDs be configured via environment variables (like workflow execution) or discovered dynamically?
- Error Handling: Should we extend the existing error handling patterns or create new ones for data operations?
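On the configuration question, one possible resolution is a hybrid: prefer an environment variable, and fall back to dynamic discovery only when it is unambiguous. The variable name below is hypothetical, not existing server configuration.

```python
# Hypothetical hybrid configuration: env var first, then discovery.
# HEALTHOMICS_SEQUENCE_STORE_ID is an assumed name, not an existing one.
import os


def resolve_sequence_store_id(client=None):
    """Prefer an explicitly configured store ID; otherwise discover one."""
    configured = os.environ.get("HEALTHOMICS_SEQUENCE_STORE_ID")
    if configured:
        return configured
    if client is not None:
        stores = client.list_sequence_stores().get("sequenceStores", [])
        if len(stores) == 1:
            # Unambiguous: a single store can be used without configuration.
            return stores[0].get("id")
    return None
```

This mirrors how workflow execution already reads its configuration from the environment while still supporting zero-config use for accounts with a single store.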
Request for Feedback
This RFC proposes a significant enhancement that would make the AWS HealthOmics MCP server the definitive solution for genomic research workflows. The implementation:
- ✅ Fills a real gap in current functionality
- ✅ Follows AWS MCP patterns and guidelines
- ✅ Maintains backward compatibility
- ✅ Adds significant value for users
We would appreciate feedback on:
- Overall approach and feasibility
- Implementation strategy and structure
- Any concerns or alternative approaches
- Timeline and coordination preferences
References
- Existing AWS HealthOmics MCP Server
- AWS HealthOmics Documentation
- Model Context Protocol Specification
Proposed by: Peter (CarlsbergGBS)
Implementation Ready: Full working implementation available for review
Timeline: Ready to start immediately upon approval