This project is a web scraping and LLM-powered content analysis system built with NestJS. It scrapes articles from various sources, processes them with a large language model (LLM), and stores the results in a vector database for semantic search.
## Application Flow

The application follows a multi-step processing flow, as shown in the diagram above:

1. **CSV Upload & Initial Processing**
   - Users upload a CSV file containing article URLs through the `/scraper/upload-csv` endpoint
   - The system parses the CSV and extracts the URLs and their sources
2. **Message Queue Processing**
   - URLs are published to a RabbitMQ queue for asynchronous processing
   - This ensures reliable handling of large numbers of URLs
3. **Web Scraping** (sketched below)
   - The system fetches the content of each URL
   - HTML content is cleaned and converted to Markdown
   - Scraped data is stored in MongoDB
4. **LLM Processing**
   - The scraped content is processed by the LLM (Google Gemini)
   - Content is analyzed and transformed into knowledge representations
5. **Vector Storage** (sketched below)
   - Processed content is stored in the Qdrant vector database
   - This enables semantic search
6. **Query Processing**
   - Users interact with the system through the `/agent` endpoint
   - The system uses the stored vector embeddings to provide relevant responses
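Step 3 might look like the following sketch. This is illustrative only: the actual implementation lives in `src/scraper/` and `src/rabbitmq/`, and the queue name, the libraries (`amqplib`, `turndown`), and the use of the global `fetch` API (Node 18+) are assumptions rather than the project's real choices.

```typescript
// Illustrative sketch, not the project's actual code.
import amqp from 'amqplib';
import TurndownService from 'turndown';

const QUEUE = 'scrape-urls'; // hypothetical queue name

async function runWorker() {
  const conn = await amqp.connect(process.env.RABBITMQ_URL ?? 'amqp://localhost');
  const channel = await conn.createChannel();
  await channel.assertQueue(QUEUE, { durable: true });

  channel.consume(QUEUE, async (msg) => {
    if (!msg) return;
    const { url } = JSON.parse(msg.content.toString());
    const html = await (await fetch(url)).text();          // download the article
    const markdown = new TurndownService().turndown(html); // HTML -> Markdown
    // ...persist { url, markdown } to MongoDB, then hand off to the LLM step
    channel.ack(msg); // acknowledge only after successful processing
  });
}

runWorker().catch(console.error);
```

Steps 4 and 5 (embedding and vector storage) could be sketched in the same spirit; the model name, collection name, and payload shape below are likewise assumptions:

```typescript
// Illustrative sketch, not the project's actual code.
import { GoogleGenerativeAI } from '@google/generative-ai';
import { QdrantClient } from '@qdrant/js-client-rest';

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const qdrant = new QdrantClient({ url: process.env.QDRANT_URL ?? 'http://localhost:6333' });

// Embed a scraped article and store it in Qdrant for semantic search.
async function indexArticle(id: number, url: string, markdown: string) {
  const embedder = genAI.getGenerativeModel({ model: 'text-embedding-004' }); // hypothetical model choice
  const { embedding } = await embedder.embedContent(markdown);
  await qdrant.upsert('articles', { // 'articles' is a hypothetical collection name
    wait: true,
    points: [{ id, vector: embedding.values, payload: { url, markdown } }],
  });
}
```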
## Features

- Web scraping of articles
- LLM-powered content analysis
- Vector database integration for semantic search
- RabbitMQ for message queuing
- MongoDB for data storage
- RESTful API endpoints
## Prerequisites

- Node.js (v16 or higher)
- pnpm package manager
- MongoDB
- RabbitMQ
- Qdrant vector database
- Google Gemini API key
## Installation

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd develops-today-llm-challenge
   ```

2. Install dependencies:

   ```bash
   pnpm install
   ```

3. Create a `.env` file based on `.env.example`:

   ```bash
   cp .env.example .env
   ```

4. Update the `.env` file with your configuration:
   - Set your Gemini API key
   - Configure the MongoDB connection details
   - Set the RabbitMQ URL
   - Configure the Qdrant URL
## Environment Variables

The following environment variables are required for the application to function properly:

- `GEMINI_API_KEY`: your Google Gemini API key for LLM operations
- `DB_URI`: MongoDB connection URI (e.g., `mongodb://localhost:27017`)
- `DB_NAME`: name of the MongoDB database
- `DB_USER`: MongoDB username
- `DB_PASSWORD`: MongoDB password
- `DB_PORT`: MongoDB port (default: 27017)
- `RABBITMQ_URL`: RabbitMQ connection URL (e.g., `amqp://localhost`)
- `QDRANT_URL`: Qdrant vector database URL (e.g., `http://localhost:6333`)
Example `.env`:

```env
GEMINI_API_KEY=your_gemini_api_key
RABBITMQ_URL=amqp://localhost
DB_URI=mongodb://localhost:27017
DB_NAME=scraper
DB_USER=admin
DB_PASSWORD=admin
DB_PORT=27017
QDRANT_URL=http://localhost:6333
```
## Running the Application

Development mode:

```bash
pnpm start:dev
```

Production:

```bash
pnpm build
pnpm start:prod
```

With Docker:

```bash
docker compose up -d
```
## API Endpoints

### `POST /scraper/upload-csv`

Upload a CSV file containing article URLs to be scraped.

- Accepts multipart form data with a `file` field containing the CSV
- The CSV should contain `URL` and `Source` columns
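Example request (the port assumes the NestJS default of 3000, and `articles.csv` is a hypothetical local file; neither is specified by the project):

```bash
curl -X POST http://localhost:3000/scraper/upload-csv \
  -F "file=@articles.csv"
```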
### `POST /agent`

Generate a prompt using the LLM.

- The request body should contain a `query` field with the text to process
- Returns the generated prompt
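Example request (the port and query text are illustrative assumptions):

```bash
curl -X POST http://localhost:3000/agent \
  -H "Content-Type: application/json" \
  -d '{"query": "Summarize the scraped articles about vector databases"}'
```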
## Project Structure

- `src/scraper/`: web scraping functionality
- `src/llm/`: large language model (LLM) integration
- `src/vectors/`: vector database operations
- `src/rabbitmq/`: message queue handling
- `src/database/`: database operations and models
## Testing

Run the test suite:

```bash
pnpm test
```

Run tests with coverage:

```bash
pnpm test:cov
```
## Contributing

1. Fork the repository
2. Create your feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request
## License

This project is marked as UNLICENSED.