The research assistant chatbot takes natural language user input from consultants and helps them prepare briefing notes for a given company.
A typical user-agent interaction is shown below:
User: Please generate a company briefing for Apple.
Agent:
# Apple Company Briefing
## Overview
- **Company Name**: Apple
- **Industry**: Technology/Consumer Electronics
- **Founded**: 1976
- **Headquarters**: Cupertino, California
- **CEO**: Tim Cook
- **Revenue**: $394.3B (2023)
- **Employees**: 161,000+
- **Stock Symbol**: AAPL
- **Market Cap**: $3T+
## Key Products
- iPhone
- iPad
- Mac
- Apple Watch
- AirPods
- Apple TV
- HomePod
## Internal Projects (Redacted for Security)
[REDACTED]
## Risk Factors
- China market dependency
- Supply chain risks
- Regulatory scrutiny
## Recent Developments
- Vision Pro launch
- M3 chip announcement
- AI integration across products
## Public Products
- Consumer electronics
- Services
- Wearables
## Partnerships
- Samsung
- TSMC
- Foxconn
- OpenAI
## Recent News
- iPhone 15 Pro sales exceed expectations
- Vision Pro mixed reality headset development
- AI integration across product lineup
The Research Assistant Agent implements the ReAct (Reasoning and Acting) pattern to systematically gather and process company information through a series of thought-action-observation cycles.
- User Input: e.g., "Generate Tesla briefing"
- Agent Initialization
  -> LLM Client
  -> Tool Registry
  -> Output Parser
- Execution Loop (ReAct Pattern):
  -> Thought -> Action -> Action Input -> Tool Execution -> Observation
  -> Output parsed and evaluated
  -> Loop continues until the task is complete
- **Agent Orchestrator**
  - Main orchestrator class that coordinates all agent activities
  - Manages tool registry and execution flow
  - Handles configuration and error recovery
- **LLM Client**
  - Interfaces with the language model
  - Processes reasoning and generates actions
  - Maintains conversation context
- **Execution Controller**
  - Controls the execution loop with safety measures
  - Implements timeout and iteration limits
  - Manages error handling and retries
- **Output Parser**
  - Parses LLM output into structured actions
  - Handles malformed responses with fallback mechanisms
  - Extracts Final Answer or Action/Action Input pairs
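For illustration, a parser along these lines might extract the structured pieces from raw LLM text. The regexes and the `ParsedStep` container below are a sketch, not the project's actual implementation:

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class ParsedStep:
    thought: Optional[str]
    action: Optional[str]
    action_input: Optional[str]
    final_answer: Optional[str]

def parse_llm_output(text: str) -> ParsedStep:
    """Extract either a Final Answer or an Action/Action Input pair."""
    final = re.search(r"Final Answer:\s*(.*)", text, re.DOTALL)
    if final:
        return ParsedStep(None, None, None, final.group(1).strip())
    thought = re.search(r"Thought:\s*(.*?)\n", text)
    action = re.search(r"Action:\s*(\S+)", text)
    action_input = re.search(r"Action Input:\s*(.*)", text)
    if action is None:
        # Fallback for malformed responses: treat the whole output as a thought
        return ParsedStep(text.strip(), None, None, None)
    return ParsedStep(
        thought.group(1).strip() if thought else None,
        action.group(1).strip(),
        action_input.group(1).strip() if action_input else None,
        None,
    )
```

Note that the fallback branch keeps the loop alive on malformed output instead of raising, which is the kind of recovery behavior described above.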
| Tool | Purpose | Input | Output |
|---|---|---|---|
| get_company_info | Retrieve internal company data | Company name | JSON company profile |
| web_search | Gather public information | Search query | Structured web results |
| translate_document | Localize content | Document + target language | Translated text |
| generate_document | Create formatted reports | Raw data | Structured briefing |
| security_filter | Remove sensitive data | Document | Sanitized content |
- get_company_info

  ```python
  @tool
  def get_company_info(company_name: str) -> str:
      """Get the company information from the internal database."""
  ```

  - Input: Company name (cleaned and normalized)
  - Process: MongoDB query with fallback to mock data
  - Output: JSON-formatted company profile
- web_search

  ```python
  @tool
  def web_search(query: str) -> str:
      """Perform a web search for the given query."""
  ```

  - Input: Search query string
  - Process: External API call for recent information
  - Output: Structured search results
- translate_document

  ```python
  @tool
  def translate_document(document: str, target_language: str) -> str:
      """Translate the document into the specified language."""
  ```

  - Input: Document text and target language code
  - Process: Translation API call with fallback to mock data
  - Output: Translated document text
- generate_document

  ```python
  @tool
  def generate_document(content: str) -> str:
      """Generate a structured briefing document."""
  ```

  - Input: Raw content (JSON or text)
  - Process: Document formatting with headers and metadata
  - Output: Professional briefing document
- security_filter

  ```python
  @tool
  def security_filter(document: str) -> str:
      """Filter out sensitive information from the document."""
  ```

  - Input: Document text
  - Process: Regex and NLP techniques to sanitize content
  - Output: Sanitized document ready for public use
```python
agent = ResearchAssistantAgent()
result = agent.execute_task("Generate a company briefing for Tesla")
```

The agent follows this pattern for each cycle:
Thought: I need to gather company information first from the internal
database, then perform a web search for any updated information.
Action: get_company_info
Action Input: Tesla
Observation: {
"company_id": "tesla_inc",
"name": "Tesla",
"industry": "Automotive/Clean Energy",
"founded": "2003",
"headquarters": "Austin, Texas",
"ceo": "Elon Musk",
"revenue": "$96.8B (2023)",
...
}
The agent evaluates observations and decides whether to:
- Continue with more actions
- Gather additional information
- Proceed to final answer generation
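This decision loop can be sketched roughly as follows. The `llm` callable, `tools` dict, and the naive string splitting are placeholders; the real framework uses the output parser and error handling described earlier:

```python
MAX_ITERATIONS = 10  # safety limit, as in the execution controller above

def run_react_loop(llm, tools: dict, task: str) -> str:
    """Drive thought-action-observation cycles until a Final Answer appears."""
    transcript = f"Task: {task}\n"
    for _ in range(MAX_ITERATIONS):
        output = llm(transcript)  # placeholder LLM call
        if "Final Answer:" in output:
            return output.split("Final Answer:", 1)[1].strip()
        # Very rough extraction; a real parser handles malformed output.
        action = output.split("Action:", 1)[1].split("\n", 1)[0].strip()
        action_input = output.split("Action Input:", 1)[1].split("\n", 1)[0].strip()
        observation = tools[action](action_input)  # execute the chosen tool
        transcript += f"{output}\nObservation: {observation}\n"
    return "Stopped: iteration limit reached"
```

The key point is that each observation is appended to the transcript, so the next LLM call sees the full history and can decide whether to act again or answer.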
Final Answer: [Generated briefing document with all gathered information]
- Create environment and install dependencies:

  ```shell
  # create a virtual environment (e.g. conda)
  conda create -n research_agent python=3.10
  conda activate research_agent
  # install the requirements
  pip install -r requirements.txt
  ```

- If you want to use MongoDB, you can run a dockerized instance:
  ```shell
  docker run -d \
    --name mongodb \
    -p 27017:27017 \
    -e MONGO_INITDB_ROOT_USERNAME=root \
    -e MONGO_INITDB_ROOT_PASSWORD=root \
    mongo
  ```

  Add the MongoDB variables to your environment variables (.env file):

  ```shell
  MONGO_HOST=localhost
  MONGO_PORT=27017
  MONGO_USER=root
  MONGO_PASS=root
  ```

  If you do not want to use MongoDB, you can set the USE_MONGO variable to False in the config.py file:

  ```python
  USE_MONGO = False
  ```

  The agent will then use mock data for company information.
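For illustration, the mock-data fallback inside a tool like get_company_info might look like the sketch below. The `MOCK_COMPANIES` data and function body are hypothetical, not the project's actual code:

```python
USE_MONGO = False  # mirrors the config.py flag described above

# Illustrative mock data; the project generates its own synthetic profiles.
MOCK_COMPANIES = {
    "tesla": {"name": "Tesla", "industry": "Automotive/Clean Energy"},
}

def get_company_info_sketch(company_name: str) -> dict:
    """Query MongoDB when enabled, otherwise fall back to mock data."""
    key = company_name.strip().lower()
    if USE_MONGO:
        raise NotImplementedError("would query MongoDB here")
    return MOCK_COMPANIES.get(key, {"name": company_name, "note": "no data found"})
```

Normalizing the lookup key (strip + lowercase) keeps the mock path tolerant of the same input variations the real database query would see.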
- To use an LLM from the Hugging Face API, set the HUGGINGFACE_API_TOKEN in your environment variables (.env file).
```
research_assistant_chatbot/
├── data/
│   └── generated_company_profiles.json   # Cached company research data
├── figures/
│   ├── chatbot.png                       # UI screenshots
│   └── react.png                         # Component diagrams
├── src/
│   ├── agent/
│   │   ├── __init__.py
│   │   ├── framework.py                  # Core AI agent logic
│   │   ├── llm.py                        # LLM client interface
│   │   └── tools.py                      # Tool definitions and registry
│   ├── database/
│   │   ├── __init__.py
│   │   ├── data_generator.py             # Generate synthetic data
│   │   ├── data_manager.py               # Add synthetic data to MongoDB
│   │   └── mongodb.py                    # MongoDB connection handler
│   ├── config.py                         # Configuration settings for project
│   └── prompts.py                        # AI prompt templates
├── testing/
│   ├── strategy.MD                       # Testing strategy documentation
│   └── tool_eval.py                      # Tool evaluation scripts
├── app.py                                # Main Streamlit application
├── .env                                  # Environment variables
├── .gitignore                            # Git ignore rules
├── README.md                             # Project documentation
└── requirements.txt                      # Python dependencies
```
We use an open-source LLM (Mistral-7B-Instruct) to generate synthetic company data for testing purposes. This data is stored in a JSON file.
```shell
# generate synthetic company data for testing purposes
python -m src.database.data_generator --num_samples 10 --save_path ./data/generated_company_profiles.json
```

The generated data is saved as a JSON file in the data/ directory, which can be used for testing the agent's functionality.
```shell
# From your project root
python -c "from src.database.data_manager import add_data_to_database; add_data_to_database('data/generated_company_profiles.json')"
```

Example output:
INFO:root:Connected to MongoDB at mongodb://root:root@localhost:27017/, database: research_assistant, collection: companies
INFO:root:Company Tesla already exists in the database.
INFO:root:Company Apple already exists in the database.
INFO:root:Company Apple already exists in the database.
Already exists: Apple
INFO:root:Inserted company: Amazon.com Inc. with ID: 6890ad6173eabdcc291f786a
Added: Amazon.com Inc.
INFO:root:Inserted company: Microsoft Corporation with ID: 6890ad6173eabdcc291f786b
Added: Microsoft Corporation
INFO:root:Inserted company: Google LLC with ID: 6890ad6173eabdcc291f786c
Added: Google LLC
INFO:root:Inserted company: Facebook / Meta Platforms Inc. with ID: 6890ad6173eabdcc291f786d
Added: Facebook / Meta Platforms Inc.
INFO:root:Inserted company: Alibaba Group with ID: 6890ad6173eabdcc291f786e
Added: Alibaba Group
INFO:root:Inserted company: Tencent Holdings Limited with ID: 6890ad6173eabdcc291f786f
Added: Tencent Holdings Limited
INFO:root:Inserted company: Samsung Electronics with ID: 6890ad6173eabdcc291f7870
Added: Samsung Electronics
INFO:root:Inserted company: Sberbank with ID: 6890ad6173eabdcc291f7871
Added: Sberbank
INFO:root:Inserted company: JPMorgan Chase & Co. with ID: 6890ad6173eabdcc291f7872
Added: JPMorgan Chase & Co.
INFO:src.database.mongodb:MongoDB connection closed.
Done!
```python
# In config.py
MODEL_NAME = "Qwen/Qwen3-4B"  # or "mistralai/Mistral-7B-Instruct-v0.3", "Qwen/Qwen2.5-14B-Instruct"
```

```shell
python -m src.agent.framework --instruction "Please generate a company briefing for Apple."
python -m src.agent.framework --instruction "Please generate a company briefing for Apple in German."
```

INPUT: "Generate a company briefing for Tesla in English"
Thought: I need to gather company information first from the internal database...
Action: get_company_info
Action Input: Tesla
Observation: {"company_id": "tesla_inc", "name": "Tesla", "industry": "Automotive/Clean Energy"...}
Thought: Now I should get recent developments from web search...
Action: web_search
Action Input: Tesla recent developments
Observation: [{"recent_news": ["Tesla reports record Q3 2024 deliveries"...]}]
Thought: I have enough information to generate the comprehensive document...
Action: generate_document
Action Input: {"company_info": {...}, "web_info": {...}}
Observation:
COMPANY BRIEFING DOCUMENT
========================
{company data and web results formatted}
========================
Generated: 2025-08-03 23:05:54
Classification: INTERNAL USE
Status: COMPLETED
Thought: I now have the complete briefing document
Final Answer: [Complete Tesla briefing with internal and external data]
- You can also run the application with the UI using Streamlit:

  ```shell
  streamlit run app.py
  ```

  The application should be up at 'Local URL: http://localhost:8501'.
- We outline the testing strategy in the testing/strategy.md file.
- For debugging the tool outputs, we use the Opik tool to log and analyze execution traces.
```shell
# 1. add the Opik API key to the .env file
# 2. test the company_info tool; you can use any dataset name, it will create the dataset with test inputs
python -m testing.tool_eval --tool company_info --dataset_name synthetic_companies
# 3. test the web_search tool
python -m testing.tool_eval --tool web_search --dataset_name synthetic_companies
# 4. test the generate_document tool
python -m testing.tool_eval --tool document_generation --dataset_name synthetic_documents
# 5. test the security_filter tool
python -m testing.tool_eval --tool security_filter --dataset_name sensitive_documents
```

Opik gives the results of the tool execution, including the input, output, and any errors encountered.
This allows for easy debugging and optimization of the tools used by the agent.
You can also use the Opik dashboard to visualize the execution traces and analyze the performance of each tool.
Lastly, Opik gives a score for each test between 0 and 1, where 1 means the tool passed all tests and 0 means it failed all tests.
- Monitor execution times and optimize slow tools
- Review parsing error patterns for prompt improvements
- Implement proper error handling in custom tools
- Use structured logging for better debugging
- Be specific in task descriptions
- Allow sufficient processing time for complex queries
- Review intermediate steps for debugging failed tasks
- Use appropriate language specifications for translations

