This project is a web scraping and LLM-powered content analysis system built with NestJS. It scrapes articles from various sources, processes them with a large language model (LLM), and stores them in a vector database for semantic search capabilities.
The application follows a multi-step processing flow as shown in the diagram above:
1. **CSV Upload & Initial Processing**
   - Users upload a CSV file containing article URLs through the `/scraper/upload-csv` endpoint
   - The system processes the CSV and extracts URLs and sources

2. **Message Queue Processing**
   - URLs are published to a RabbitMQ queue for asynchronous processing
   - This ensures reliable handling of large numbers of URLs (see the sketch after this list)

3. **Web Scraping**
   - The system fetches content from each URL
   - HTML content is cleaned and converted to markdown format
   - Scraped data is stored in MongoDB

4. **LLM Processing**
   - The scraped content is processed by the LLM (Google Gemini)
   - Content is analyzed and transformed into knowledge representations

5. **Vector Storage**
   - Processed content is stored in the Qdrant vector database
   - This enables semantic search capabilities

6. **Query Processing**
   - Users can interact with the system through the `/agent` endpoint
   - The system uses the stored vector embeddings to provide relevant responses
## Features

- Web scraping of articles
- LLM-powered content analysis
- Vector database integration for semantic search
- RabbitMQ for message queuing
- MongoDB for data storage
- RESTful API endpoints
## Prerequisites

- Node.js (v16 or higher)
- pnpm package manager
- MongoDB
- RabbitMQ
- Qdrant vector database
- Google Gemini API key
## Installation

1. Clone the repository:

```bash
git clone <repository-url>
cd develops-today-llm-challenge
```

2. Install dependencies:

```bash
pnpm install
```

3. Create a `.env` file based on `.env.example`:

```bash
cp .env.example .env
```

4. Update the `.env` file with your configuration:
   - Set your Gemini API key
   - Configure MongoDB connection details
   - Set RabbitMQ URL
   - Configure Qdrant URL
## Environment Variables

The following environment variables are required for the application to function properly:

- `GEMINI_API_KEY`: Your Google Gemini API key for LLM operations
- `DB_URI`: MongoDB connection URI (e.g., `mongodb://localhost:27017`)
- `DB_NAME`: Name of the MongoDB database
- `DB_USER`: MongoDB username
- `DB_PASSWORD`: MongoDB password
- `DB_PORT`: MongoDB port (default: `27017`)
- `RABBITMQ_URL`: RabbitMQ connection URL (e.g., `amqp://localhost`)
- `QDRANT_URL`: Qdrant vector database URL (e.g., `http://localhost:6333`)
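For illustration, here is a short sketch of how one of these variables might be consumed inside the app using `@nestjs/config` (whether this project actually uses `ConfigService`, and the `QdrantConfig` class name, are assumptions):

```typescript
// Sketch only: assumes @nestjs/config (v2.2+ for getOrThrow) is used;
// the real project may read process.env differently.
import { Injectable } from '@nestjs/common';
import { ConfigService } from '@nestjs/config';

@Injectable()
export class QdrantConfig {
  constructor(private readonly config: ConfigService) {}

  get url(): string {
    // Throws at first use if QDRANT_URL is missing, failing fast
    // instead of producing confusing connection errors later.
    return this.config.getOrThrow<string>('QDRANT_URL');
  }
}
```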
Example `.env`:

```env
GEMINI_API_KEY=your_gemini_api_key
RABBITMQ_URL=amqp://localhost
DB_URI=mongodb://localhost:27017
DB_NAME=scraper
DB_USER=admin
DB_PASSWORD=admin
DB_PORT=27017
QDRANT_URL=http://localhost:6333
```

## Running the Application

Development mode:

```bash
pnpm start:dev
```

Production mode:

```bash
pnpm build
pnpm start:prod
```

With Docker:

```bash
docker compose up -d
```

## API Endpoints

### POST /scraper/upload-csv

- Upload a CSV file containing article URLs to be scraped
- Accepts multipart form-data with a `file` field containing the CSV
- CSV should contain 'URL' and 'Source' columns
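A quick way to exercise this endpoint from a script (a sketch: the `file` field name and CSV columns come from the description above; the `localhost:3000` host/port is an assumption):

```typescript
// Sketch: assumes the API listens on localhost:3000.
// Requires Node 18+ (built-in fetch, FormData, and Blob) and ESM
// for top-level await.
const csv = 'URL,Source\nhttps://example.com/article,Example News\n';

const form = new FormData();
form.append('file', new Blob([csv], { type: 'text/csv' }), 'articles.csv');

const res = await fetch('http://localhost:3000/scraper/upload-csv', {
  method: 'POST',
  body: form,
});
console.log(res.status);
```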
### POST /agent

- Generate a prompt using the LLM
- Request body should contain a `query` field with the text to process
- Returns the generated prompt
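An example request for this endpoint (again a sketch, assuming the API listens on `localhost:3000` and accepts JSON):

```typescript
// Sketch: host/port and response shape are assumptions.
const res = await fetch('http://localhost:3000/agent', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ query: 'What do the scraped articles say about X?' }),
});
console.log(await res.text()); // the generated prompt
```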
## Project Structure

- `src/scraper/` - Web scraping functionality
- `src/llm/` - Large language model (LLM) integration
- `src/vectors/` - Vector database operations
- `src/rabbitmq/` - Message queue handling
- `src/database/` - Database operations and models
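Given that layout, the root module plausibly wires these together along the following lines (a sketch: the module class names and file paths are inferred from the directory names, not confirmed by this README):

```typescript
// Sketch: module names are inferred from the directory layout and
// may not match the project's actual modules.
import { Module } from '@nestjs/common';
import { ConfigModule } from '@nestjs/config';
import { ScraperModule } from './scraper/scraper.module';
import { LlmModule } from './llm/llm.module';
import { VectorsModule } from './vectors/vectors.module';
import { RabbitmqModule } from './rabbitmq/rabbitmq.module';
import { DatabaseModule } from './database/database.module';

@Module({
  imports: [
    // Makes environment variables available app-wide via ConfigService.
    ConfigModule.forRoot({ isGlobal: true }),
    ScraperModule,
    LlmModule,
    VectorsModule,
    RabbitmqModule,
    DatabaseModule,
  ],
})
export class AppModule {}
```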
## Testing

Run the test suite:

```bash
pnpm test
```

Run tests with coverage:

```bash
pnpm test:cov
```

## Contributing

1. Fork the repository
2. Create your feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request
## License

This project is marked as UNLICENSED and is not offered under an open-source license.
