
Try it live on the HuggingFace Space: click "Start Quacking" on the landing page.


QueryQuack

Quack the query, crack the PDF!


QueryQuack transforms how you interact with PDF documents by letting you have conversations with them instead of manually searching through them. It is a document query engine that allows users to upload PDFs and query them in natural language; the system uses vector embeddings and retrieval-based question answering to return responses grounded in the document content.

Setup Instructions

Follow these steps to set up QueryQuack locally:

1. Clone the repository

```
git clone https://github.com/negativenagesh/QueryQuack.git
cd QueryQuack
```

2. Create a virtual environment

```
python3 -m venv .venv
source .venv/bin/activate
```

3. Install dependencies

```
pip install -r pkgs.txt
```

4. Set up the Gemini API

```
# Create a .env file in the project directory
touch .env

# Add your Gemini API key to the .env file
echo "api_key=your_api_key_here" >> .env
```

5. Set up the Pinecone API

Create a free Pinecone account at https://www.pinecone.io/, then create a new project and index in the Pinecone console.

```
# Add your Pinecone API key and environment to the .env file
echo "PINECONE_API_KEY=your_pinecone_api_key_here" >> .env
echo "PINECONE_ENVIRONMENT=your_environment_here" >> .env
echo "PINECONE_INDEX=your_index_name_here" >> .env
```
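To confirm the `.env` format, here is a minimal sketch of how those `KEY=VALUE` lines could be parsed into a config dict. In practice a library such as python-dotenv typically handles this; the key names below are the ones from the setup steps, and `parse_env` is an illustrative helper, not part of QueryQuack.

```python
# Minimal sketch: parse .env-style "KEY=VALUE" lines into a dict.
# The key names match the setup steps above; the parser itself is
# illustrative (a real app would likely use python-dotenv).
def parse_env(text: str) -> dict:
    env = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and malformed lines.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

sample = """api_key=your_api_key_here
PINECONE_API_KEY=your_pinecone_api_key_here
PINECONE_ENVIRONMENT=your_environment_here
PINECONE_INDEX=your_index_name_here"""

config = parse_env(sample)
```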

Landing page

Main page

Data Flow

  1. Document Ingestion:
PDF Upload → Text Extraction → Chunking → Embedding Generation → Pinecone Storage
  2. Query Processing:
User Query → Query Processing → Embedding Generation → Vector Search → Chunk Retrieval → Response Generation → Display

How does QueryQuack work?

Document Processing Pipeline

  1. Text Extraction: PDFs are processed page by page to extract raw text. Document structure (headings, paragraphs) is preserved where possible. Images and non-textual elements are noted but not processed.

  2. Chunking: Extracted text is divided into smaller, semantically meaningful segments. Each chunk maintains metadata about its source document and location. Chunks overlap slightly to preserve context across boundaries.

  3. Vector Embedding: Each text chunk is transformed into a high-dimensional vector representation. These embeddings capture the semantic meaning of the text, so similar concepts have similar vectors even when they use different words.

  4. Storage in Pinecone: All vector embeddings are stored in a Pinecone vector database. Documents are organized by user session to maintain privacy, and the database enables fast similarity searching.
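The chunking step above can be sketched as a small pure function. The chunk size, overlap, and metadata field names here are illustrative assumptions, not values taken from the QueryQuack code; the real pipeline would pass the resulting chunks to the embedding model and then to Pinecone.

```python
# Hypothetical sketch of the chunking step: split extracted text into
# overlapping fixed-size chunks, each carrying source metadata.
# chunk_size/overlap and the field names are illustrative only.
def chunk_text(text, source, chunk_size=200, overlap=50):
    chunks = []
    step = chunk_size - overlap  # advance less than chunk_size to overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append({
            "text": text[start:start + chunk_size],
            "source": source,   # originating PDF
            "offset": start,    # position within the extracted text
        })
    return chunks

doc = "".join(str(i % 10) for i in range(500))  # stand-in for extracted text
chunks = chunk_text(doc, "report.pdf")
```

Because each step advances by `chunk_size - overlap`, the tail of one chunk repeats as the head of the next, preserving context across chunk boundaries.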

Query Processing Pipeline

  1. Query Understanding: Your natural language question is analyzed for intent and key concepts. The system considers conversation context from previous questions. The query is transformed into a vector embedding using the same process as the documents.

  2. Semantic Search: The query vector is compared against all document chunk vectors. Pinecone performs this similarity search in milliseconds. The most relevant chunks are retrieved based on semantic similarity, not just keyword matching.

  3. Context Assembly: The top matching chunks are compiled into a comprehensive context that represents the most relevant parts of your documents for the query. Source metadata is preserved for attribution.

  4. Response Generation: The Gemini API uses the assembled context to generate a coherent answer that directly addresses your question. Citations link back to the specific parts of your original documents.
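The semantic-search step can be illustrated with plain cosine similarity over toy vectors. In QueryQuack this comparison happens inside Pinecone over 384-dimensional all-MiniLM-L6-v2 embeddings; the 3-dimensional vectors and chunk IDs below are stand-ins for illustration.

```python
# Toy illustration of semantic search: rank stored chunk vectors by
# cosine similarity to the query vector. Pinecone performs this step
# in the real system; vectors here are 3-D stand-ins.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    # Sort chunk IDs by similarity to the query, highest first.
    scored = sorted(chunk_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in scored[:k]]

index = {
    "chunk-1": [0.9, 0.1, 0.0],
    "chunk-2": [0.0, 1.0, 0.2],
    "chunk-3": [0.8, 0.2, 0.1],
}
matches = top_k([1.0, 0.0, 0.0], index)
```

The retrieved chunk texts would then be concatenated into the context passed to Gemini for response generation.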

Multi-Document Queries

  • Ask questions that span multiple uploaded documents
  • The system automatically finds connections between different sources
  • Compare and contrast information across documents

Conversation Memory

  • References to previous questions are understood (e.g., "Tell me more about that")
  • The system maintains the conversation context throughout your session
  • No need to repeat context in follow-up questions
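One common way to implement this kind of session memory is to keep prior turns and prepend them to the prompt sent to the LLM, so a follow-up like "Tell me more about that" resolves against earlier answers. The sketch below shows that pattern; the prompt format and `build_prompt` helper are assumptions, not QueryQuack's actual implementation.

```python
# Hedged sketch of conversation memory: store (question, answer) turns
# and prepend them to each new prompt. The prompt layout is illustrative.
history = []

def build_prompt(question, context):
    turns = "\n".join(f"Q: {q}\nA: {a}" for q, a in history)
    return f"{turns}\nContext:\n{context}\nQ: {question}\nA:"

history.append(("What is chunking?", "Splitting text into segments."))
prompt = build_prompt("Tell me more about that", "retrieved chunks go here")
```

Because the earlier turn is included verbatim, the model can resolve "that" to chunking without the user restating it.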

Source Attribution

  • Every answer shows exactly which documents contributed to the response
  • Navigate directly to specific sections in source documents
  • Verify information against the original content

Models and Tech Stack Used in QueryQuack

  1. Embedding Model: all-MiniLM-L6-v2. Used for generating vector embeddings from text chunks:
  • Efficient 384-dimensional embeddings that capture semantic meaning
  • Lightweight enough to run on CPU without requiring GPU resources
  • Good balance between performance and computational requirements
  • Well supported through HuggingFace's ecosystem
  • Strong semantic understanding for accurate retrieval
  2. LLM for Response Generation: Gemini 1.5 Flash. Used for generating coherent responses based on the retrieved context:
  • Free
  • Maintains good performance while being more cost-effective than comparable models
  3. Vector Database: Pinecone. Used for storing and retrieving vector embeddings:
  • Purpose-built for vector similarity search at scale
  • Millisecond query times even with large vector collections
  • Supports namespaces for organizing data by user session
  • Cloud-based with simple API integration
  • Specialized indexing for high-dimensional vectors
  • Optimized for semantic search rather than keyword matching
  • Supports metadata filtering for more targeted retrieval
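Two of the Pinecone features listed above, per-session namespaces and metadata filtering, can be illustrated with a toy in-memory index. Real queries go through the Pinecone client; this sketch only mimics the semantics, and all IDs, vectors, and metadata are invented.

```python
# Toy emulation of Pinecone namespaces and metadata filtering.
# Each namespace holds one session's records; a query only searches
# its own namespace, keeping sessions isolated.
index = {
    "session-a": [
        {"id": "c1", "values": [0.9, 0.1], "metadata": {"source": "report.pdf"}},
        {"id": "c2", "values": [0.1, 0.9], "metadata": {"source": "notes.pdf"}},
    ],
    "session-b": [
        {"id": "c3", "values": [0.5, 0.5], "metadata": {"source": "other.pdf"}},
    ],
}

def query(namespace, flt):
    # Return IDs in the namespace whose metadata matches every filter key.
    return [rec["id"] for rec in index.get(namespace, [])
            if all(rec["metadata"].get(k) == v for k, v in flt.items())]

ids = query("session-a", {"source": "report.pdf"})
```

In the real system the filter would be combined with vector similarity ranking, so only matching chunks from the caller's own session are scored and returned.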

The system architecture uses specialized components for each part of the pipeline, creating an efficient, scalable solution for document question-answering without requiring specialized hardware.

License

QueryQuack is released under the Apache License.
