scomri/marvel-gene-knowledge-graph-qa

Project Gene-Forge

A knowledge graph and AI-powered question-answering system for Marvel character genetic mutations and powers. The system combines a Neo4j knowledge graph with the Google Gemini LLM to provide fact-grounded answers about characters, genes, powers, and team affiliations.

Overview

Project Gene-Forge is a S.H.I.E.L.D. intelligence system that:

  • Stores Marvel character data in a Neo4j knowledge graph
  • Uses deterministic query routing to extract facts from the graph
  • Generates natural language responses using Google Gemini LLM
  • Provides both a web interface and REST API for querying
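As a rough illustration of how graph facts might be handed to the LLM, the sketch below assembles a prompt that limits the model to the supplied facts. The function name, the prompt wording, and the placeholder fact are assumptions made for this example; the project's actual prompt lives in llm_queries/ and may differ.

```python
from typing import List


def build_grounded_prompt(question: str, facts: List[str]) -> str:
    """Assemble a prompt that restricts the model to the supplied graph facts.

    Hypothetical helper for illustration; not taken from the project code.
    """
    if facts:
        fact_block = "\n".join(f"- {fact}" for fact in facts)
    else:
        # With no facts there is nothing to ground on, so state that explicitly.
        fact_block = "(no facts were found in the knowledge graph)"
    return (
        "Answer the question using only the facts listed below. "
        "If the facts are insufficient, say that the information is unavailable.\n\n"
        f"Facts:\n{fact_block}\n\n"
        f"Question: {question}\n"
    )


# Placeholder fact and question, not real dataset content.
prompt = build_grounded_prompt(
    "Which team does Character A belong to?",
    ["Character A is MEMBER_OF Team X"],
)
print(prompt)
```

Keeping the facts in a separate, clearly delimited block makes it easier to check afterwards that an answer only repeats what the graph returned.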

Key Components

  • Neo4j Knowledge Graph: Stores characters, genes, powers, and teams with their relationships
  • Graph Query Engine: Routes natural language questions to Cypher queries
  • LLM Integration: Google Gemini API for generating fact-grounded responses
  • FastAPI Web Application: REST API and web UI for interactive queries
  • Response Caching: In-memory cache to reduce API calls and improve performance
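To make the routing idea concrete, here is a minimal sketch of deterministic routing: keyword-based intent classification, substring-based entity resolution, and a lookup into Cypher templates. The intent names, keyword sets, template text, the relationship directions for MEMBER_OF and POSSESSES_POWER, and the example character names are all assumptions for illustration; see queries/graph_qa.py for the real engine.

```python
import re
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical intents and keywords, checked in insertion order.
INTENT_KEYWORDS: Dict[str, Set[str]] = {
    "character_team": {"team", "teams", "member", "affiliation", "affiliated"},
    "character_genes": {"gene", "genes", "mutation", "mutations"},
    "character_powers": {"power", "powers", "ability", "abilities"},
}

# Hypothetical Cypher templates; whole nodes are returned because the
# property names of Team and Power nodes are not known here.
CYPHER_TEMPLATES: Dict[str, str] = {
    "character_team": "MATCH (c:Character {name: $name})-[:MEMBER_OF]->(t:Team) RETURN t",
    "character_genes": "MATCH (c:Character {name: $name})-[:HAS_MUTATION]->(g:Gene) RETURN g.gene_name",
    "character_powers": "MATCH (c:Character {name: $name})-[:POSSESSES_POWER]->(p:Power) RETURN p",
}


def classify_intent(question: str) -> Optional[str]:
    """Return the first intent whose keyword set overlaps the question tokens."""
    tokens = set(re.findall(r"[a-z]+", question.lower()))
    for intent, keywords in INTENT_KEYWORDS.items():
        if tokens & keywords:
            return intent
    return None


def resolve_entity(question: str, known_names: List[str]) -> Optional[str]:
    """Find a known character name in the question, preferring longer names."""
    lowered = question.lower()
    for name in sorted(known_names, key=len, reverse=True):
        if name.lower() in lowered:
            return name
    return None


def route(question: str, known_names: List[str]) -> Optional[Tuple[str, Dict[str, str]]]:
    """Map a question to a (cypher, parameters) pair, or None if unroutable."""
    intent = classify_intent(question)
    name = resolve_entity(question, known_names)
    if intent is None or name is None:
        return None
    return CYPHER_TEMPLATES[intent], {"name": name}


print(route("What powers does Alpha have?", ["Alpha", "Beta"]))
```

Because no model call is involved in this step, the same question always maps to the same query, which is what makes the routing deterministic.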

Project Structure

├── app/
│   ├── api.py              # FastAPI application with REST endpoints
│   └── index.html          # Web UI for interactive queries
├── data/
│   ├── characters.json     # Character data (12 Marvel characters)
│   ├── gene_power_relationships.json  # Gene-power mappings
│   └── DATA_README.md      # Dataset documentation
├── graph/
│   ├── setup_graph.py      # Script to build Neo4j knowledge graph
│   └── GRAPH_SCHEMA.md     # Graph schema documentation
├── queries/
│   ├── graph_qa.py         # Graph query engine (entity resolution, intent classification)
│   └── graph_query_selection_layer.md  # Query routing documentation
├── llm_queries/
│   ├── llm_graph_qa.py     # Integrated LLM + Graph QA service
│   ├── llm_integration.py  # Gemini API integration
│   └── cache.py            # Response caching implementation
├── example_usage.py        # Example script demonstrating usage
└── requirements.txt        # Python dependencies

Prerequisites

Before setting up the project, ensure you have:

  1. Python 3.8+ installed

    • Check with: python --version or python3 --version
  2. Neo4j Database (Community Edition or Desktop)

    • Download from: https://neo4j.com/download/
    • Neo4j Desktop is recommended for local development
    • Ensure Neo4j is running before proceeding
  3. Google Gemini API Key

Installation Steps

1. Install Python Dependencies

# Create a virtual environment (recommended)
python -m venv venv

# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

2. Install and Start Neo4j

Neo4j Desktop

  1. Download Neo4j Desktop
  2. Install and launch Neo4j Desktop
  3. Create a new database (or use default)
  4. Start the database
  5. Note the connection details (URI, username, password)

3. Set Up Environment Variables

Create a .env file in the project root directory:

# Neo4j Configuration
NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your_neo4j_password_here
NEO4J_DATABASE=gene-forge

# Gemini API Configuration
GEMINI_API_KEY=your_gemini_api_key_here
GEMINI_MODEL=gemini-2.5-flash-lite

# Optional Configuration
LLM_TEMPERATURE=0.0
CACHE_TTL=86400
PORT=8000
HOST=127.0.0.1

Important: Replace your_neo4j_password_here with your actual Neo4j password and your_gemini_api_key_here with your Gemini API key.

Note: The project uses python-dotenv to load environment variables.

Configuration

Required Environment Variables

Variable         Description            Default
NEO4J_URI        Neo4j connection URI   bolt://localhost:7687
NEO4J_USERNAME   Neo4j username         neo4j
NEO4J_PASSWORD   Neo4j password         Required
NEO4J_DATABASE   Database name          gene-forge
GEMINI_API_KEY   Google Gemini API key  Required
GEMINI_MODEL     Gemini model name      gemini-2.5-flash-lite

Optional Environment Variables

Variable         Description                Default
LLM_TEMPERATURE  LLM temperature (0.0-1.0)  0.0
CACHE_TTL        Cache TTL in seconds       86400 (24 hours)
PORT             FastAPI server port        8000
HOST             FastAPI server host        127.0.0.1
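The variables and defaults in the tables above could be read into a typed settings object along the following lines. This is a standard-library sketch: the Settings class and load_settings function are hypothetical names, and the project itself loads the .env file with python-dotenv before the values reach os.environ.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the documented environment variables."""
    neo4j_uri: str
    neo4j_username: str
    neo4j_password: str
    neo4j_database: str
    gemini_api_key: str
    gemini_model: str
    llm_temperature: float
    cache_ttl: int
    host: str
    port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping, applying the defaults listed in the README."""
    # The two variables without a default must be present and non-empty.
    missing = [key for key in ("NEO4J_PASSWORD", "GEMINI_API_KEY") if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
    return Settings(
        neo4j_uri=env.get("NEO4J_URI", "bolt://localhost:7687"),
        neo4j_username=env.get("NEO4J_USERNAME", "neo4j"),
        neo4j_password=env["NEO4J_PASSWORD"],
        neo4j_database=env.get("NEO4J_DATABASE", "gene-forge"),
        gemini_api_key=env["GEMINI_API_KEY"],
        gemini_model=env.get("GEMINI_MODEL", "gemini-2.5-flash-lite"),
        llm_temperature=float(env.get("LLM_TEMPERATURE", "0.0")),
        cache_ttl=int(env.get("CACHE_TTL", "86400")),
        host=env.get("HOST", "127.0.0.1"),
        port=int(env.get("PORT", "8000")),
    )


# Example with an explicit mapping instead of the real process environment.
settings = load_settings({"NEO4J_PASSWORD": "example", "GEMINI_API_KEY": "example"})
print(settings.neo4j_uri, settings.cache_ttl, settings.port)
```

Failing at startup when a required variable is missing tends to give a clearer error than a connection or authentication failure later on.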

Setting Up the Knowledge Graph

Once Neo4j is running and environment variables are set, populate the knowledge graph:

Step 1: Verify Data Files

Ensure the data files exist:

  • data/characters.json - Contains 12 Marvel characters
  • data/gene_power_relationships.json - Contains gene-power mappings

Step 2: Run the Graph Setup Script

python graph/setup_graph.py

This script will:

  1. Connect to Neo4j
  2. Create constraints and indexes
  3. Load character data
  4. Create nodes (Characters, Genes, Powers, Teams)
  5. Create relationships (MEMBER_OF, HAS_MUTATION, CONFERS, POSSESSES_POWER)
  6. Display statistics

Expected Output:

INFO - Connected to Neo4j at bolt://localhost:7687 (database: gene-forge)
INFO - Loaded 12 characters from data/characters.json
INFO - Loaded 31 gene relationships from data/gene_power_relationships.json
INFO - Constraints and indexes created
INFO - Created 4 team nodes
INFO - Created 25 power nodes
INFO - Created 31 gene nodes
INFO - Created 12 character nodes
INFO - Created 12 MEMBER_OF relationships
INFO - Created 36 HAS_MUTATION relationships
INFO - Created 45 CONFERS relationships
INFO - Created 60 POSSESSES_POWER relationships
INFO - Graph construction completed!
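The node and relationship creation steps can be pictured as a series of parameterized MERGE statements. The sketch below only builds (query, parameters) pairs and does not talk to Neo4j. The record layout (name, team, genes, powers), the name property on Team and Power nodes, and the directions of MEMBER_OF and POSSESSES_POWER are assumptions; only the Character-[:HAS_MUTATION]->Gene pattern with c.name and g.gene_name comes from the verification query in this README. See data/DATA_README.md and graph/GRAPH_SCHEMA.md for the real formats.

```python
from typing import Any, Dict, List, Tuple

Statement = Tuple[str, Dict[str, Any]]


def character_statements(record: Dict[str, Any]) -> List[Statement]:
    """Build idempotent MERGE statements for one (hypothetical) character record."""
    name = record["name"]
    statements: List[Statement] = [
        ("MERGE (c:Character {name: $name})", {"name": name}),
    ]
    team = record.get("team")
    if team:
        statements.append((
            "MATCH (c:Character {name: $name}) "
            "MERGE (t:Team {name: $team}) "
            "MERGE (c)-[:MEMBER_OF]->(t)",
            {"name": name, "team": team},
        ))
    for gene in record.get("genes", []):
        statements.append((
            "MATCH (c:Character {name: $name}) "
            "MERGE (g:Gene {gene_name: $gene}) "
            "MERGE (c)-[:HAS_MUTATION]->(g)",
            {"name": name, "gene": gene},
        ))
    for power in record.get("powers", []):
        statements.append((
            "MATCH (c:Character {name: $name}) "
            "MERGE (p:Power {name: $power}) "
            "MERGE (c)-[:POSSESSES_POWER]->(p)",
            {"name": name, "power": power},
        ))
    return statements


# Placeholder record, not real dataset content.
example = {"name": "Character A", "team": "Team X", "genes": ["GENE1", "GENE2"], "powers": ["Power 1"]}
for query, params in character_statements(example):
    print(query, params)
```

Using MERGE rather than CREATE means the statements can be run again without duplicating nodes or relationships.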

Step 3: Verify the Graph

You can verify the graph was created successfully by:

  1. Opening Neo4j Browser (usually at http://localhost:7474)
  2. Running a test query:
    MATCH (c:Character)-[:HAS_MUTATION]->(g:Gene)
    RETURN c.name, g.gene_name
    LIMIT 5

Running the Application

1) FastAPI Web Application (Recommended)

Start the web server:

python app/api.py

Or using uvicorn directly:

uvicorn app.api:app --host 127.0.0.1 --port 8000

Access the Application:

With the default HOST and PORT values, the web UI is served at http://127.0.0.1:8000/.

API Endpoints:

  • GET / - Web UI for interactive queries
  • POST /question - Answer questions using knowledge graph and LLM
  • GET /graph/{character} - Get character's graph neighbors

2) Python Script (CLI Usage)

Run the example script to see the system in action:

python example_usage.py

This demonstrates:

  1. Basic graph querying (GraphQAEngine)
  2. Direct LLM integration
  3. Integrated service with caching
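The caching in the third item can be pictured as an in-memory cache with a time-to-live. The sketch below is an assumption about the general technique, not a copy of llm_queries/cache.py: the class and function names are hypothetical, and the default TTL mirrors the documented CACHE_TTL of 86400 seconds.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Minimal in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 86400, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Drop the stale entry so it is recomputed on the next request.
            del self._entries[key]
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)


def cached_answer(cache: TTLCache, question: str, compute: Callable[[str], str]) -> str:
    """Return a cached answer when available, otherwise compute and store it."""
    key = question.strip().lower()  # normalize so trivial variations share an entry
    hit = cache.get(key)
    if hit is not None:
        return hit
    answer = compute(question)
    cache.set(key, answer)
    return answer


cache = TTLCache(ttl_seconds=86400)
print(cached_answer(cache, "Example question?", lambda q: f"answer to: {q}"))
```

With LLM_TEMPERATURE at 0.0 the model output is intended to be stable, so reusing an answer for a repeated question saves an API call at little cost.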

Additional Resources

Project Documentation

  • Dataset Documentation: See data/DATA_README.md for data format and structure
  • Graph Schema: See graph/GRAPH_SCHEMA.md for detailed schema documentation
  • Query Selection Layer: See queries/graph_query_selection_layer.md for query routing details

If you encounter issues not covered here:

  1. Check the logs for error messages
  2. Verify all prerequisites are installed
  3. Ensure environment variables are set correctly
  4. Review the example script (example_usage.py) for usage patterns

About

Knowledge graph–grounded question answering using an LLM over a Neo4j knowledge graph modeling Marvel characters, genetic mutations, powers, and team affiliations.
