Introducing Supa-Crawl-Chat: A Comprehensive Web Crawling, Semantic Search, and AI-Driven Chat Solution with Supabase & Crawl4AI.
Seamlessly crawl websites, transform content into vector embeddings, and enable advanced semantic search. Supa-Crawl-Chat utilizes Supabase for reliable data storage and incorporates AI-powered chat with long-term memory features.
-
🕷️ High-Performance Web Crawling
- Harness the power of Crawl4AI to efficiently index websites and sitemaps with configurable depth and scope
- Advanced crawling algorithms adapt to different website structures and content types for optimal data extraction
- Seamless handling of JavaScript-rendered content and dynamic websites
-
🔍 Advanced Semantic Search Engine
- Leverage cutting-edge vector similarity and OpenAI embeddings for context-aware search capabilities
- Surface substantially more relevant results than traditional keyword-based search
- Fine-tuned ranking algorithms that understand semantic relationships between concepts
-
📝 AI-Powered Content Intelligence
- Transform raw web content into structured, actionable data from the terminal or the web UI
- Generate human-quality titles, summaries, and site descriptions with remarkable accuracy
- Automatic content categorization and entity extraction for enhanced data organization
-
📊 Interactive Data Visualization
- Explore your data ecosystem through an intuitive Streamlit-based interface
- Real-time analytics and insights into your content repository
- Customizable dashboards for monitoring crawl performance and content metrics
-
🐳 Scalable Deployment Architecture
- Deploy with confidence using our Docker configurations:
- Lightweight: App-only deployment for integration with existing infrastructure
- Standard: App + Crawl4AI for complete content processing capabilities
- Full-Stack: End-to-end solution with App + Crawl4AI + Supabase for maximum autonomy
-
🌐 Comprehensive API Ecosystem
- RESTful API with comprehensive documentation for seamless integration
- Webhook support for event-driven architectures
- Comprehensive access management and multi-factor authentication for enhanced security
- Python 3.10+
- Node 18+
- A running Crawl4AI instance (self-hosted or provided)
- A Supabase instance (self-hosted or provided)
- OpenAI API key for generating embeddings, content summaries and chat
- Docker (optional)
-
Clone this repository:
git clone https://github.com/bigsk1/supa-crawl-chat.git
cd supa-crawl-chat
-
Install the required dependencies:
pip install -r requirements.txt
-
Change to the frontend directory and install dependencies:
cd frontend
npm install
-
Create a `.env` file with your configuration:
# Crawl4AI Configuration
# Run locally in Docker or as an external service - easy to set up with Docker Compose
CRAWL4AI_API_TOKEN=your_crawl4ai_api_token
# Local Docker
# CRAWL4AI_BASE_URL=http://crawl4ai:11235
# External Service
CRAWL4AI_BASE_URL=your_crawl4ai_base_url
# Supabase Configuration
SUPABASE_URL=your_supabase_host:port
# Database credentials
SUPABASE_DB=postgres
SUPABASE_KEY=postgres
SUPABASE_PASSWORD=postgres
# OpenAI Configuration
OPENAI_API_KEY=sk-proj-
# Model to use for embeddings
OPENAI_EMBEDDING_MODEL=text-embedding-3-small
# Model to use for title and summary generation and chat analysis
OPENAI_CONTENT_MODEL=gpt-4o-mini
# Crawl Configuration
# Set to 'url' for a regular website or 'sitemap' for sitemap crawling (child pages from the sitemap will be crawled)
CRAWL_TYPE=url
# URL to crawl (can be a website URL or sitemap URL)
CRAWL_URL=https://example.com
# Maximum number of URLs to crawl from a sitemap (set to 0 for unlimited)
MAX_URLS=30
# Optional name for the site (if not provided, one will be generated)
CRAWL_SITE_NAME=
# Optional description for the site (if not provided, one will be generated)
CRAWL_SITE_DESCRIPTION=
# Chat Configuration
# Model to use for the chat interface
CHAT_MODEL=gpt-4o
# Number of results to retrieve for each query
CHAT_RESULT_LIMIT=5
# Similarity threshold for vector search (0-1)
CHAT_SIMILARITY_THRESHOLD=0.4
# Default session ID (if not provided, a new one will be generated) you can use a random string
CHAT_SESSION_ID=
# Default user ID (optional; a name such as larry)
CHAT_USER_ID=
# Default chat profile (default, pydantic, technical, concise, scifi, pirate, supabase_expert, medieval, etc.)
CHAT_PROFILE=default
# Directory containing profile YAML files
CHAT_PROFILES_DIR=profiles
# Verbose mode (true, false) - enable to see more during chat
CHAT_VERBOSE=false
To run the backend API and the frontend UI, follow these steps:
-
Start the Backend API: Open a terminal and navigate to the root directory of the project. Then run:
python run_api.py
-
Start the Frontend UI: Open a separate terminal, navigate to the frontend directory, and run:
npm run dev
-
Access the Web UI: Open your web browser and go to:
http://localhost:3000/
This will start the backend API on port 8001 and the frontend UI on port 3000.
If you need a complete solution (Crawl4AI, with or without a local Supabase, all in Docker), see the Docker Deployment section of this README.
- As you chat, the AI will record preferences based on your conversation and remember them, or you can add them manually
The project supports two ways to connect to your Supabase database:
- Single URL (Option 1): Use this for both local and remote connections. The URL can be specified with or without protocol.
# With protocol (for remote instances)
SUPABASE_URL=https://your-project.supabase.co:5432
# Without protocol (for local instances)
SUPABASE_URL=192.168.xx.xx:54322
You'll need to provide the database credentials:
SUPABASE_DB=postgres
SUPABASE_KEY=postgres
SUPABASE_PASSWORD=postgres
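For illustration, here is a minimal sketch of how these settings could be turned into a direct PostgreSQL connection. The environment variable names match the `.env` example above, but the URL-parsing approach is an assumption; the project's own `db_client.py` handles this for you.

```python
# Illustrative sketch (not project code): normalize SUPABASE_URL, which may be
# given with or without a protocol, into psycopg2 connection parameters.
import os
from urllib.parse import urlparse

import psycopg2


def connect_from_env():
    raw_url = os.environ["SUPABASE_URL"]  # e.g. "https://host:5432" or "192.168.1.10:54322"
    # urlparse only splits host/port correctly when "//" is present,
    # so prepend it for bare "host:port" values.
    parsed = urlparse(raw_url if "//" in raw_url else f"//{raw_url}")
    return psycopg2.connect(
        host=parsed.hostname,
        port=parsed.port or 5432,
        dbname=os.environ.get("SUPABASE_DB", "postgres"),
        user=os.environ.get("SUPABASE_KEY", "postgres"),
        password=os.environ.get("SUPABASE_PASSWORD", "postgres"),
    )


if __name__ == "__main__":
    conn = connect_from_env()
    print("Connected to:", conn.get_dsn_parameters()["host"])
    conn.close()
```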
The system automatically breaks down large content into smaller, more manageable chunks for better LLM interaction and more precise search results. This provides several benefits:
Improved Search Precision: Instead of matching against entire pages, the system can find the specific chunk that best answers a query.
-
Efficient Token Usage: When interacting with LLMs, only the relevant chunks are sent, reducing token usage and costs.
-
Better Context Management: Each chunk maintains a reference to its parent page, preserving the full context.
-
Automatic Token Limit Handling: Content is automatically chunked to stay within the token limits of the embedding model (8,192 tokens for text-embedding-3-small).
Chunking details:
The system uses a sophisticated semantic chunking strategy:
-
Semantic Boundary Detection: Content is first split along natural semantic boundaries:
- Markdown headers (e.g., `# Section Title`)
- Paragraph breaks
This preserves the meaning and context of each chunk.
-
Token-Based Sizing: Each section is then analyzed to ensure it fits within token limits:
- Sections that fit are kept together
- Sections that exceed limits are further split with token-based chunking
- A 200-token overlap is maintained between chunks for context continuity
-
Smart Overlap: When creating overlaps between chunks, the system looks for natural break points:
- Paragraph breaks
- Sentence endings
- Clause breaks
- Word boundaries
-
Metadata Preservation: Each chunk maintains references to:
- Its parent document
- Its position in the sequence (chunk index)
- Its token count
This approach ensures that chunks are not only sized appropriately for LLMs but also maintain semantic coherence, making them more useful for search and retrieval.
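The sketch below illustrates this strategy in simplified form: split on markdown headers and paragraph breaks, keep sections that fit the token budget, and fall back to token windows with a 200-token overlap for oversized sections. It assumes the tiktoken package and the cl100k_base encoding used by text-embedding-3-small; the project's actual implementation lives in its crawler and content-enhancement code and differs in detail.

```python
# Simplified chunking sketch, not the project's implementation.
import re

import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")


def chunk_text(text: str, max_tokens: int = 4000, overlap_tokens: int = 200) -> list[str]:
    # 1. Split along semantic boundaries: markdown headers and blank lines.
    sections = re.split(r"(?=^#{1,6}\s)|\n\s*\n", text, flags=re.MULTILINE)
    sections = [s.strip() for s in sections if s and s.strip()]

    chunks: list[str] = []
    for section in sections:
        tokens = ENC.encode(section)
        if len(tokens) <= max_tokens:
            chunks.append(section)  # 2. Section fits: keep it whole.
            continue
        # 3. Oversized section: token-based windows with overlap for continuity.
        start = 0
        while start < len(tokens):
            window = tokens[start:start + max_tokens]
            chunks.append(ENC.decode(window))
            start += max_tokens - overlap_tokens
    return chunks
```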
Chunking configuration:
You can adjust the chunking parameters in the code:
# In crawler.py, enhance_pages method
enhanced_pages = asyncio.run(self.enhance_pages(pages, max_tokens_per_chunk=4000))
The default settings are:
- `max_tokens_per_chunk`: 4,000 tokens (half of the 8,192-token limit for safety)
- `overlap_tokens`: 200 tokens (overlap between chunks to maintain context)
Before using the crawler, you can test your setup:
-
Test the database connection:
python tests/test_db_connection.py
-
Test the Crawl4AI API:
python tests/test_crawl_api.py
Before using the crawler, you need to set up the database:
python main.py setup
This will create the necessary tables and extensions in your Supabase database.
Website crawling options:
You can crawl a website in two ways:
-
Using the command-line interface:
python main.py crawl https://example.com --name "Example Site" --description "An example website"
To crawl a sitemap:
python main.py crawl https://example.com/sitemap.xml --sitemap --name "Example Site"
You can limit the number of URLs to crawl from the sitemap:
python main.py crawl https://example.com/sitemap.xml --sitemap --max-urls 20
Note: If you don't provide a description, the system will automatically generate one based on the content of the homepage or main page.
-
Using the `.env` file configuration (recommended): First, update the `.env` file with your crawl settings:
CRAWL_TYPE=url  # or 'sitemap' for sitemap crawling
CRAWL_URL=https://example.com
CRAWL_SITE_NAME=Example Site
CRAWL_SITE_DESCRIPTION=An example website  # Optional - will be auto-generated if empty
Then run:
python run_crawl.py
The crawler automatically generates titles and summaries for crawled content using OpenAI. You can configure the model used for this in the `.env` file:
OPENAI_CONTENT_MODEL=gpt-4o-mini
Content updating options:
If you have existing pages without titles or summaries, or if you want to regenerate them with a different model, you can use the `update_content.py` script:
# Update all sites
python update_content.py
# Update a specific site
python update_content.py --site-id 1
# Limit the number of pages to update
python update_content.py --limit 50
# Force update all pages, even if they already have titles and summaries
python update_content.py --force
Search options:
To search the crawled content using semantic search:
python main.py search "your search query"
To use text-based search instead of semantic search:
python main.py search "your search query" --text-only
To adjust the similarity threshold and limit the number of results:
python main.py search "your search query" --threshold 0.8 --limit 2
To save the search results to a file:
python main.py search "your search query" --output results.json
To list all the sites that have been crawled:
python main.py list-sites
By default, this only counts parent pages (not chunks). To include chunks in the page count:
python main.py list-sites --include-chunks
Working with chunks:
When retrieving or searching content, you can control whether chunks are included:
# Get pages for a site (parent pages only)
pages = crawler.get_site_pages(site_id, limit=100)
# Get pages for a site including chunks
pages_with_chunks = crawler.get_site_pages(site_id, limit=100, include_chunks=True)
When searching, chunks are automatically included and prioritized for more precise results. Each chunk includes context about its parent document:
python main.py search "your search query"
The search results will include:
- The content snippet that matched your query
- Which document it came from
- Which part of the document it represents (e.g., "Part 2 of 5")
This makes it easier to understand the context of each search result, even when it's a small chunk of a larger document.
The project includes a chat interface in the terminal that uses an LLM to answer questions based on the crawled data. The chat interface now supports persistent conversation history, allowing the LLM to remember previous interactions even after restarting the application.
You can start the terminal chat interface using either the dedicated script or the main CLI:
# Using the dedicated script
python chat.py
# Using the main CLI
python main.py chat
Chat interface options:
You can customize the chat interface with various options:
# Specify a different OpenAI model
python main.py chat --model gpt-4
# Set the maximum number of search results to retrieve when chatting
python main.py chat --limit 10
# Adjust the similarity threshold for vector search (0-1)
python main.py chat --threshold 0.6
# Use a specific session ID for persistent conversations
python main.py chat --session my-chat-session
# Associate the conversation with a specific user
python main.py chat --user John
# Enable verbose debug output
python main.py chat --verbose
# Combined
python main.py chat --model gpt-4 --limit 15 --threshold 0.3 --session 12123111111 --user John --verbose
Search functionality details:
The chat interface uses a sophisticated hybrid search approach that combines vector similarity with text matching:
- Vector Search: Uses OpenAI's embeddings to find semantically similar content
- Text Search: Enhances results with keyword matching for better precision
- Hybrid Approach: Combines both methods to provide the most relevant results
This approach ensures that even when vector similarity might not find exact matches, the text search component can still retrieve relevant information. The system automatically adjusts the search strategy based on the query type and available content.
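As a rough illustration of the idea (not the project's actual query), a hybrid lookup could combine a pgvector cosine-similarity leg with a keyword leg against the `crawl_pages` table described in the Database Structure section, then merge the results. The SQL, scoring, and merge rule below are assumptions.

```python
# Conceptual hybrid-search sketch: vector hits are scored via pgvector's <=>
# cosine-distance operator; text hits act as a keyword fallback; results are
# de-duplicated by page id, preferring the scored vector rows.
def hybrid_search(conn, query: str, query_embedding: list[float],
                  threshold: float = 0.4, limit: int = 5):
    vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, url, title, 1 - (embedding <=> %s::vector) AS score
            FROM crawl_pages
            WHERE 1 - (embedding <=> %s::vector) >= %s
            ORDER BY score DESC
            LIMIT %s
            """,
            (vec_literal, vec_literal, threshold, limit),
        )
        vector_hits = cur.fetchall()

        cur.execute(
            """
            SELECT id, url, title, 0.0 AS score
            FROM crawl_pages
            WHERE title ILIKE %s OR content ILIKE %s
            LIMIT %s
            """,
            (f"%{query}%", f"%{query}%", limit),
        )
        text_hits = cur.fetchall()

    # Merge by page id; vector rows overwrite keyword rows so their score wins.
    merged = {row[0]: row for row in text_hits}
    merged.update({row[0]: row for row in vector_hits})
    return sorted(merged.values(), key=lambda r: r[3], reverse=True)[:limit]
```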
Conversation history details:
The chat interface stores all conversation history in the database, allowing the LLM to remember previous interactions. This enables more natural and contextual conversations over time.
Key features:
- Session-based conversations: Each conversation gets a unique session ID
- User identification: Optionally associate conversations with specific users
- Conversation continuity: Continue conversations where you left off, even after restarting
- Chat commands:
  - Type `clear` to clear the conversation history
  - Type `history` to view the conversation history
  - Type `exit` or `bye` to quit the chat interface
Important: To maintain the same conversation across multiple chat sessions, you must use the same session ID. The session ID is displayed when you start the chat interface. You can specify it before starting a new chat session:
# Start a new chat session
python chat.py --user Joe
# Note the session ID displayed (e.g., "Session ID: a24b6b72-e526-4a09-b662-0f85e82f78a7")
# Later, continue the same conversation by specifying the session ID
python chat.py --user Joe --session a24b6b72-e526-4a09-b662-0f85e82f78a7
You can also set a default session ID in your `.env` file:
CHAT_SESSION_ID=your-session-id
This way, the chat interface will always use the same session ID unless you explicitly specify a different one with the `--session` parameter.
User preferences and memory details:
The chat interface can remember user preferences and information shared during conversations, as long as you use the same session ID. For example:
- If you tell the assistant "I like Corvettes" in one session
- Then in a later session (using the same session ID), ask "What cars do I like?"
- The assistant will remember and respond with "You like Corvettes"
This memory persistence works by:
- Storing all messages in the database with the session ID
- Analyzing conversation history when relevant questions are asked
- Extracting user preferences and information from previous messages
To get the most out of this feature, always use the same session ID and user ID when you want the assistant to remember previous conversations.
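Conceptually, this kind of persistence only needs the `chat_conversations` columns listed in the Database Structure section. The sketch below is an illustrative approximation rather than the project's storage code, and it assumes the `timestamp` column has a database-side default.

```python
# Illustrative session-scoped memory: store each turn keyed by session_id and
# replay it later in the shape the OpenAI chat API expects for prior turns.
import json


def save_message(conn, session_id: str, role: str, content: str,
                 user_id: str | None = None, metadata: dict | None = None):
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO chat_conversations (session_id, user_id, role, content, metadata)
            VALUES (%s, %s, %s, %s, %s)
            """,
            (session_id, user_id, role, content, json.dumps(metadata or {})),
        )
    conn.commit()


def load_history(conn, session_id: str) -> list[dict]:
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT role, content FROM chat_conversations
            WHERE session_id = %s
            ORDER BY timestamp
            """,
            (session_id,),
        )
        return [{"role": role, "content": content} for role, content in cur.fetchall()]
```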
The chat interface includes several commands for managing user preferences directly from the command line:
- Viewing preferences: `preferences`
  Displays a table of all active preferences for the current user, including ID, type, value, confidence, context, and last used timestamp.
- Adding preferences: `add preference <type> <value> [confidence]`
  Manually adds a new preference for the current user. If confidence is not specified, it defaults to 0.9. Examples:
  - add preference like Python
  - add preference expertise JavaScript 0.85
  - add preference goal "Learn machine learning"
- Deleting preferences: `delete preference <id>`
  Deletes a specific preference by ID.
- Clearing all preferences: `clear preferences`
  Deletes all preferences for the current user after confirmation.
Important: Preference commands are only available when a user ID is provided (using `--user` when starting the chat). For more detailed information about the user preference system, see the preferences documentation.
Chat profiles:
The chat interface supports different profiles that customize the behavior of the assistant. Each profile has its own system prompt, search settings, and site filtering capabilities. Ideally, crawl the sitemap of a documentation site, then use or create a profile whose additional system prompt makes the assistant an expert on those docs.
Built-in profiles:
- default: General-purpose assistant that searches all sites
- pydantic: Specialized for Pydantic documentation, focusing on technical details and code examples
- technical: Provides detailed technical explanations with step-by-step instructions
- concise: Gives brief, to-the-point answers without unnecessary details
You can switch profiles during a chat session:
profile pydantic
Or start with a specific profile:
python main.py chat --profile technical
You can also view all available profiles:
profiles
Site filtering details:
The `sites` array in each profile's `search_settings` controls which sites the assistant searches through when answering questions:
search_settings:
  sites: ["pydantic"]  # Only search in sites with "pydantic" in the name
  threshold: 0.6
  limit: 8
Here's how the filtering works:
- Empty array (`sites: []`): Searches across ALL sites in the database
- Site patterns: Filters to only include sites where the site name contains any of the specified patterns
- Pattern matching: Uses case-insensitive partial matching, so "bigsk1" would match site names like "Bigsk1 Com", "bigsk1.com", etc.
- Multiple patterns: You can include multiple patterns to search across several related sites
The filtering process:
- When a user asks a question, the system looks at the current profile's `sites` setting
- It queries the `crawl_sites` table to find site IDs where the name contains any of the patterns
- It then only searches for content in pages associated with those site IDs
- This allows profiles to focus on specific content sources, making responses more relevant
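A minimal sketch of that pattern-to-site-ID lookup (case-insensitive partial matching against `crawl_sites.name`) might look like the following; the project's own implementation may differ in detail.

```python
# Resolve profile site patterns to site IDs; an empty pattern list means
# "search all sites", so no filter is returned in that case.
def resolve_site_ids(conn, patterns: list[str]) -> list[int] | None:
    if not patterns:
        return None  # Empty list in the profile: no site filter.
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id FROM crawl_sites WHERE name ILIKE ANY(%s)",
            ([f"%{p}%" for p in patterns],),
        )
        return [row[0] for row in cur.fetchall()]
```

A search query would then add a `site_id = ANY(...)` condition only when this function returns a non-empty list.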
Custom profiles:
You can create your own custom profiles by adding YAML files to the `profiles` directory. Each profile file should include:
- `name`: The name of the profile (used to select it)
- `description`: A brief description of the profile
- `system_prompt`: The system prompt that defines the assistant's behavior
- `search_settings`: Configuration for search behavior
  - `sites`: List of site name patterns to filter by (empty list means search all sites)
  - `threshold`: Similarity threshold for vector search (0-1)
  - `limit`: Maximum number of results to return
Example profile file (`profiles/custom_expert.yaml`):
name: custom_expert
description: Custom expert for specific documentation
system_prompt: |
  You are an expert on [specific topic].
  Your expertise includes:
  - [Area of expertise 1]
  - [Area of expertise 2]
  - [Area of expertise 3]
  When answering questions:
  - [Instruction 1]
  - [Instruction 2]
  - [Instruction 3]
search_settings:
  sites: ["site1", "site2"]  # Only search in sites containing these terms
  threshold: 0.6             # Higher threshold for more precise matches
  limit: 8                   # Number of results to return
You can specify a custom profiles directory:
python main.py chat --profiles-dir my_profiles
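For reference, loading such profiles is straightforward with PyYAML. The sketch below is purely illustrative; the fallback values it applies are assumptions, not the project's actual defaults.

```python
# Illustrative profile loader: read every *.yaml file in a directory and index
# the parsed profiles by their "name" field.
from pathlib import Path

import yaml


def load_profiles(profiles_dir: str = "profiles") -> dict[str, dict]:
    profiles: dict[str, dict] = {}
    for path in Path(profiles_dir).glob("*.yaml"):
        with open(path, "r", encoding="utf-8") as f:
            data = yaml.safe_load(f)
        # Assumed fallback when search_settings is omitted from a profile file.
        data.setdefault("search_settings", {"sites": [], "threshold": 0.5, "limit": 5})
        profiles[data["name"]] = data
    return profiles


if __name__ == "__main__":
    active = load_profiles().get("default")
    print(active["description"] if active else "No default profile found")
```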
.env configuration details:
You can set default values for the chat interface in your `.env` file:
# Chat Configuration
CHAT_MODEL=gpt-4o
CHAT_RESULT_LIMIT=5
CHAT_SIMILARITY_THRESHOLD=0.5
CHAT_SESSION_ID=default-session
CHAT_USER_ID=default-user
CHAT_PROFILE=default
CHAT_PROFILES_DIR=profiles
CHAT_VERBOSE=false
This allows you to maintain consistent settings and continue the same conversation across multiple sessions.
Resetting the database:
If you want to start fresh and delete all data or recreate the tables, you can use the `reset_database.py` script:
python tests/reset_database.py
This script provides two options:
- Delete all data (keep tables) - This will delete all data from the tables but keep the table structure.
- Drop and recreate tables - This will drop the tables and recreate them, effectively starting from scratch.
You can also use the crawler programmatically in your own Python code. See `tests/example.py` for a demonstration.
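As a hypothetical starting point, programmatic usage could look like the sketch below. The `Crawler` class name and constructor are assumptions (check `tests/example.py` for the real names); `get_site_pages` and `enhance_pages` are the calls shown elsewhere in this README.

```python
# Hypothetical programmatic usage; names marked as assumptions may differ.
import asyncio

from crawler import Crawler  # assumed entry point exposed by crawler.py

crawler = Crawler()  # constructor signature is an assumption

# Retrieve parent pages (and optionally chunks) for an already-crawled site.
pages = crawler.get_site_pages(site_id=1, limit=100)
pages_with_chunks = crawler.get_site_pages(site_id=1, limit=100, include_chunks=True)

# Re-run enhancement with a custom chunk size, as shown in the chunking
# configuration section (enhance_pages is an async method).
enhanced = asyncio.run(crawler.enhance_pages(pages, max_tokens_per_chunk=4000))
print(f"{len(pages)} pages retrieved, {len(enhanced)} enhanced")
```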
Project structure:
- `main.py`: Main script with command-line interface
- `crawler.py`: Main crawler class that ties everything together
- `crawl_client.py`: Client for interacting with the Crawl4AI API
- `embeddings.py`: Module for generating OpenAI embeddings
- `content_enhancer.py`: Module for generating titles and summaries using OpenAI
- `db_client.py`: Client for interacting with the Supabase database
- `db_setup.py`: Script for setting up the database
- `chat.py`: Chat interface for interacting with crawled data using an LLM
- `run_api.py`: Script to run the API
- `run_crawl.py`: Script to run a crawl using the configuration from the `.env` file
- `update_content.py`: Script to update existing pages with titles and summaries
- `utils.py`: Utility functions for the CLI
- `requirements.txt`: List of dependencies for the backend
- `.env.example`: Example environment file for the backend
- `api/`: Directory containing the FastAPI implementation
  - `main.py`: FastAPI application entry point
  - `routers/`: Directory containing API route definitions
    - `crawl.py`: Endpoints for crawling websites and sitemaps
    - `search.py`: Endpoints for searching crawled content
    - `sites.py`: Endpoints for managing and retrieving site information
    - `chat.py`: Endpoints for interacting with the chat interface
    - `pages.py`: Endpoints for managing and retrieving page information
  - `README.md`: Comprehensive API documentation
- `docker/`: Directory containing Docker-related files
  - `Dockerfile`: Docker image definition for the backend application
  - `frontend.Dockerfile`: Docker image definition for the frontend application
  - `docker-compose.yml`: Docker Compose configuration for the API service only
  - `crawl4ai-docker-compose.yml`: Docker Compose configuration for integrated API and Crawl4AI services
  - `full-stack-compose.yml`: Docker Compose configuration for the complete stack (API, Crawl4AI, Supabase, Frontend)
  - `setup.sh`: Script to set up the full stack environment
  - `reset.sh`: Script to reset the full stack environment
  - `status.sh`: Script to check the status of the full stack environment
  - `.env`: Environment variables for Docker deployment
  - `.env.example`: Example environment file for Docker deployment
  - `full-stack/`: Documentation and utilities for the full stack setup
    - `README.md`: Documentation for the full stack setup
    - `ENV_GUIDE.md`: Guide for configuring environment variables
    - `check_db_connections.sh`: Script to verify database connections
  - `volumes/`: Directory for Docker volumes
  - `.dockerignore`: Specifies files to exclude from Docker builds
- `supabase_explorer/`: Directory containing the Supabase Explorer Streamlit app
  - `supabase_explorer.py`: Interactive Streamlit app for database exploration
  - `supabase_queries.md`: Collection of useful SQL queries
  - `database_explorer_readme.md`: Documentation for the Supabase Explorer
- `profiles/`: Directory containing chat profile configurations
  - Various YAML files defining different chat personalities and behaviors
- `tests/`: Directory containing test scripts
  - `example.py`: Example script demonstrating programmatic usage
  - `test_db_connection.py`: Script to test the database connection
  - `test_crawl_api.py`: Script to test the Crawl4AI API
  - `reset_database.py`: Script to delete tables or reset the database
- `frontend/`: Directory containing the React-based web UI
  - `src/`: Source code for the frontend application
    - `api/`: API client for communicating with the backend
      - `apiService.ts`: Service for making API requests
      - `apiWrapper.ts`: Wrapper for API endpoints with type definitions
    - `components/`: Reusable UI components
      - `Layout.tsx`: Main layout component with Sidebar and Navbar
      - `Navbar.tsx`: Top navigation bar
      - `Sidebar.tsx`: Side navigation menu
      - `NotificationCenter.tsx`: Notification system for user alerts
      - `PageListItem.tsx`: Component for displaying page items in lists
      - `UserProfileModal.tsx`: Modal for user profile management
      - `ui/`: Shadcn UI component library (buttons, inputs, dialogs, etc.)
    - `context/`: React context providers for state management
    - `hooks/`: Custom React hooks
    - `lib/`: Utility libraries and configurations
    - `pages/`: Main application views
      - `HomePage.tsx`: Landing page
      - `ChatPage.tsx`: AI chat interface
      - `CrawlPage.tsx`: Web crawling interface
      - `SearchPage.tsx`: Search interface
      - `SitesPage.tsx`: Site management
      - `SiteDetailPage.tsx`: Detailed view of a crawled site
      - `NotificationInfo.tsx`: Notification settings and information
      - `UserProfileModal.tsx`: User profile management
      - `UserPreferencesPage.tsx`: User preferences management
    - `styles/`: CSS and styling files
    - `utils/`: Utility functions
    - `App.tsx`: Main application component
    - `main.tsx`: Entry point for the React application
  - `public/`: Static assets
  - `index.html`: HTML entry point
  - `vite.config.ts`: Vite configuration
  - `tailwind.config.js`: Tailwind CSS configuration
  - `tsconfig.json`: TypeScript configuration
  - `package.json`: NPM dependencies and scripts
Database structure:
The project uses the following tables in the Supabase database:
- `crawl_sites`: Stores information about the sites you've crawled
  - `id`: Primary key
  - `name`: Name of the site
  - `url`: URL of the site
  - `description`: Optional description of the site
  - `created_at`: Timestamp when the site was added
- `crawl_pages`: Stores the actual content, embeddings, titles, and summaries for each page
  - `id`: Primary key
  - `site_id`: Foreign key referencing the `crawl_sites` table
  - `url`: URL of the page (unique)
  - `title`: Title of the page
  - `content`: Content of the page
  - `summary`: Summary of the page
  - `embedding`: Vector embedding of the content
  - `metadata`: Additional metadata about the page
  - `is_chunk`: Boolean indicating if this is a chunk of a larger page
  - `chunk_index`: Index of the chunk within the parent page
  - `parent_id`: Foreign key referencing the parent page
  - `created_at`: Timestamp when the page was added
  - `updated_at`: Timestamp when the page was last updated
- `chat_conversations`: Stores conversation history for the chat interface
  - `id`: Primary key
  - `session_id`: Unique identifier for the conversation session
  - `user_id`: Optional identifier for the user
  - `timestamp`: Timestamp when the message was sent
  - `role`: Role of the message sender (user, assistant, system)
  - `content`: Content of the message
  - `metadata`: Additional metadata about the message
When you crawl a site multiple times, the system will update existing pages rather than creating duplicates, ensuring you always have the most recent content. Similarly, the chat interface will maintain conversation history across sessions, allowing for more natural and contextual interactions.
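The update-rather-than-duplicate behavior can be pictured as an `INSERT ... ON CONFLICT` on the unique `url` column; the sketch below is illustrative, not the project's actual database code.

```python
# Illustrative upsert: re-crawling a URL updates the existing row instead of
# inserting a duplicate, relying on crawl_pages.url being unique.
def upsert_page(conn, site_id: int, url: str, title: str, content: str, embedding_literal: str):
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO crawl_pages (site_id, url, title, content, embedding)
            VALUES (%s, %s, %s, %s, %s::vector)
            ON CONFLICT (url) DO UPDATE
            SET title = EXCLUDED.title,
                content = EXCLUDED.content,
                embedding = EXCLUDED.embedding,
                updated_at = now()
            """,
            (site_id, url, title, content, embedding_literal),
        )
    conn.commit()
```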
The project includes a powerful Streamlit-based Supabase Explorer app that allows you to interactively explore and analyze your database. This tool makes it easy to run SQL queries, visualize results, and gain insights from your crawled data.
- Interactive Query Interface: Run predefined or custom SQL queries with a single click
- Data Visualization: Create bar charts, line charts, and pie charts from your query results
- Database Overview: View statistics about your database, including site counts and page distribution
- Export Functionality: Download query results as CSV files for further analysis
- Predefined Queries: Access a comprehensive collection of useful SQL queries organized by category:
- Site queries
- Page queries
- Chunk queries
- Metadata queries
- Conversation history queries
- Statistics queries
- Embedding analysis queries
- Content quality queries
- Advanced conversation analysis
- Performance queries
- Search performance analysis
To launch the Supabase Explorer:
cd supabase_explorer
pip install -r requirements.txt
streamlit run supabase_explorer.py
The app will automatically connect to your Supabase database using the credentials in your root `.env` file.
The Supabase Explorer is also available as part of the Docker setup. When you run either of the Docker Compose configurations, the Streamlit app will be accessible at:
http://localhost:8501
This allows you to explore your database directly from the Docker container without having to install Streamlit locally.
# Start the Docker containers including the Supabase Explorer
docker-compose -f docker/docker-compose.yml up -d
# Or with the integrated Crawl4AI setup
docker-compose -f docker/crawl4ai-docker-compose.yml up -d
Adding custom queries:
You can add your own custom queries to the predefined list by editing the `supabase_explorer/supabase_queries.md` file. Follow the existing format:
Your Category
Your Query Name
```sql
SELECT * FROM your_table WHERE your_condition;
```
After adding your queries, restart the Streamlit app to load the new queries.
# Build and start the container
docker-compose -f docker/docker-compose.yml up -d
# View logs
docker-compose -f docker/docker-compose.yml logs -f
This setup includes:
- API backend on port 8001
- Frontend UI on port 3000
- Streamlit Explorer on port 8501
If you want to run both the API and Crawl4AI in Docker containers (for when you already have Supabase running locally or externally), you can use the provided `crawl4ai-docker-compose.yml` file:
# Build and start both containers
docker-compose -f docker/crawl4ai-docker-compose.yml up -d
# View logs
docker-compose -f docker/crawl4ai-docker-compose.yml logs -f
This setup will:
- Start a Crawl4AI container using the official image from Docker Hub
- Start your API container with the correct configuration to connect to Crawl4AI
- Start the frontend UI container for the web interface
- Start the Streamlit Explorer for database exploration
- Create a network for the containers to communicate with each other
Make sure your `.env` file in the project root includes the necessary Crawl4AI configuration:
# Crawl4AI Configuration
CRAWL4AI_API_TOKEN=your_crawl4ai_api_token
# This will be automatically set to the Docker service name in the container
# CRAWL4AI_BASE_URL=http://crawl4ai:11235
Access the services:
- API: http://localhost:8001
- Frontend UI: http://localhost:3000
- Streamlit Explorer: http://localhost:8501
- Crawl4AI: http://localhost:11235
We provide a comprehensive Docker setup that includes everything you need to run the entire application stack:
- Supa Chat API Backend
- Frontend UI
- Supabase Docker images (Database, Kong, Realtime, etc.)
- Crawl4AI Docker image for web crawling
This setup comes with everything you need to run the complete application without any external dependencies.
The full-stack Docker setup requires careful configuration of environment variables:
-
SUPABASE_URL: This should be commented out or left empty to ensure the API connects directly to the database:
# SUPABASE_URL=http://kong:8002
If this is set, the API will try to connect to Kong for database operations, which will cause SSL negotiation errors.
-
Direct Database Connection: Ensure these database connection parameters are set correctly:
SUPABASE_HOST=db
SUPABASE_PORT=5432
SUPABASE_KEY=supabase_admin
SUPABASE_PASSWORD=${POSTGRES_PASSWORD}
To use the full stack Docker setup:
-
Navigate to the docker directory:
cd docker
-
Run the setup script to create necessary configuration files:
chmod +x setup_update.sh
./setup_update.sh
This script will:
- Check for the existence of the `.env` file
- Create SQL scripts for database initialization
- Download Supabase initialization scripts
- Create application tables and functions
- Generate the Kong configuration file
-
Edit the Docker-specific `.env` file with your actual values:
nano .env
-
Start the services:
docker-compose -f full-stack-compose.yml up -d
-
Access the services:
- API: http://localhost:8001
- API Documentation: http://localhost:8001/docs
- Frontend UI: http://localhost:3000
- Supabase Studio: http://localhost:3001 (username: supabase, password: from your .env file)
- Kong API Gateway: http://localhost:8002
- Crawl4AI: http://localhost:11235
-
Monitor or manage the stack:
# Check status of all services
./status.sh
# Reset the stack (removes all data)
./reset.sh
-
Database Connection Issues:
- If you see SSL negotiation errors, make sure `SUPABASE_URL` is commented out or empty in your `.env` file
- Verify the database credentials in the `.env` file
- Restart the API service after making changes:
  docker-compose -f full-stack-compose.yml restart api
-
REST Service Issues:
- If the REST service is not connecting properly, run the fix script:
  ./fix_rest.sh
-
Checking Logs:
- View logs for a specific service:
  docker logs supachat-api
  docker logs supachat-kong
  docker logs supachat-frontend
For more detailed instructions, see the Docker README and System Flows Documentation.
The project includes a FastAPI-based REST API that allows you to integrate the Supa-Crawl-Chat functionality with other applications or build custom frontends. The API provides endpoints for searching, crawling, managing sites, and chatting.
To start the API server:
python run_api.py
or use:
cd api
uvicorn api.main:app --host 0.0.0.0 --port 8001 --reload
The API will be available at http://localhost:8001
The interactive API documentation is available at:
http://localhost:8001/docs
The API provides the following endpoints:
API endpoints:
- `GET /api/search`: Search for content using semantic search or text search
  - Parameters:
    - `query`: The search query
    - `threshold`: Similarity threshold (0-1)
    - `limit`: Maximum number of results
    - `text_only`: Use text search instead of embeddings
    - `site_id`: Optional site ID to filter results by
- `POST /api/crawl`: Crawl a website or sitemap
  - Body:
    - `url`: URL to crawl
    - `site_name`: Optional name for the site
    - `site_description`: Optional description of the site
    - `is_sitemap`: Whether the URL is a sitemap
    - `max_urls`: Maximum number of URLs to crawl from a sitemap
- `GET /api/crawl/status/{site_id}`: Get the status of a crawl by site ID
- `GET /api/sites`: List all crawled sites
  - Parameters:
    - `include_chunks`: Whether to include chunks in the page count
- `GET /api/sites/{site_id}`: Get a site by ID
  - Parameters:
    - `include_chunks`: Whether to include chunks in the page count
- `GET /api/sites/{site_id}/pages`: Get pages for a specific site
  - Parameters:
    - `include_chunks`: Whether to include chunks in the results
    - `limit`: Maximum number of pages to return
- `POST /api/chat`: Send a message to the chat bot and get a response
  - Body:
    - `message`: The user's message
    - `session_id`: Optional session ID for persistent conversations
    - `user_id`: Optional user ID
    - `profile`: Optional profile to use
  - Parameters:
    - `model`: Optional model to use
    - `result_limit`: Optional maximum number of search results
    - `similarity_threshold`: Optional similarity threshold (0-1)
    - `include_context`: Whether to include search context in the response
    - `include_history`: Whether to include conversation history in the response
- `GET /api/chat/profiles`: List all available profiles
  - Parameters:
    - `session_id`: Optional session ID to get active profile
    - `user_id`: Optional user ID
- `POST /api/chat/profiles/{profile_name}`: Set the active profile for a session
  - Parameters:
    - `session_id`: Session ID
    - `user_id`: Optional user ID
- `GET /api/chat/history`: Get conversation history for a session
  - Parameters:
    - `session_id`: Session ID
    - `user_id`: Optional user ID
- `DELETE /api/chat/history`: Clear conversation history for a session
  - Parameters:
    - `session_id`: Session ID
    - `user_id`: Optional user ID
Here's an example of how to use the API with curl:
# Search for content
curl -X GET "http://localhost:8001/api/search?query=pydantic&threshold=0.3&limit=5" -H "accept: application/json"
# Start a chat session
curl -X POST "http://localhost:8001/api/chat" \
-H "Content-Type: application/json" \
-d '{"message": "Tell me about pydantic", "user_id": "example_user"}'
# Continue the conversation with the same session ID
curl -X POST "http://localhost:8001/api/chat" \
-H "Content-Type: application/json" \
-d '{"message": "How do I use BaseModel?", "session_id": "SESSION_ID_FROM_PREVIOUS_RESPONSE", "user_id": "example_user"}'
Finished crawl example
This project is licensed under the MIT License - see the LICENSE file for details.