🧠 StockMind

Where social intelligence meets market prediction.

StockMind bridges the gap between online public discourse and financial markets. By tapping into what people are actively talking about on Reddit and pairing that with real stock movement data, it builds a system capable of forecasting whether a stock will rise, fall, or hold steady — for companies like Tesla, Apple, and Amazon.

✨ What Makes This Different

🔁 Fully Automated Pipeline: From raw internet discussions to trained prediction models — everything runs inside a single notebook.
🧠 Finance-Aware NLP: Uses FinBERT, a transformer model pre-trained on financial text, to understand market-relevant language.
📉 Data Fusion: Goes beyond raw price history by layering in crowd sentiment as a predictive signal.
⚙️ Multi-Model Benchmarking: Several algorithms compete head-to-head so the strongest one earns the job.

🗂 Repository Structure

📂 stock_trends_prediction/
├── stock_trends_prediction.ipynb   # The main Colab notebook
├── data/                          # Directory for datasets
│   ├── stock_data_raw.csv         # Raw Reddit data
│   ├── stock_cleaned.csv          # Cleaned Reddit data
│   ├── stock_preprocessed.csv     # Preprocessed sentiment data
│   ├── all_companies_classification_data.csv  # Stock data with labels
│   └── merged_stock_sentiment_data.csv        # Final merged dataset
├── results/                       # Directory for outputs
│   └── results.txt                # Final result matrix (model evaluation)
├── README.md                      # Project documentation
└── Report.pdf                     # Overview of project

⚙️ How It Works

1️⃣ Gathering the Raw Material

StockMind pulls from two sources simultaneously — community posts from Reddit's financial communities (r/stocks, r/wallstreetbets) via the Reddit API, and historical price data from Yahoo Finance. Together, they form the foundation of the dataset.
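The collection step might look roughly like the sketch below. It is not the notebook's actual code: the function names, the returned fields, and the search query are assumptions made for illustration. `fetch_posts` expects an authenticated `praw.Reddit` instance, and `fetch_prices` imports `yfinance` lazily so the Reddit half can be used without it.

```python
from datetime import datetime, timezone


def fetch_posts(reddit, subreddit_name, query, limit=100):
    """Collect submissions matching `query` from one subreddit.

    `reddit` is expected to be an authenticated praw.Reddit instance.
    The dictionary keys below are hypothetical column names.
    """
    rows = []
    for submission in reddit.subreddit(subreddit_name).search(query, limit=limit):
        created = datetime.fromtimestamp(submission.created_utc, tz=timezone.utc)
        rows.append(
            {
                "id": submission.id,
                "subreddit": subreddit_name,
                "date": created.date().isoformat(),  # UTC calendar date
                "title": submission.title,
                "text": submission.selftext,
                "score": submission.score,
            }
        )
    return rows


def fetch_prices(ticker, start, end):
    """Download daily price history for one ticker from Yahoo Finance."""
    import yfinance as yf  # imported lazily; only needed for this step

    return yf.download(ticker, start=start, end=end)
```

A call such as `fetch_posts(reddit, "stocks", "Tesla")` would then return one dictionary per matching post, ready to be written to a CSV file.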

2️⃣ Cleaning & Structuring

Raw text is messy. Duplicate posts, irrelevant symbols, and noise are stripped out first. FinBERT then reads through the cleaned posts and assigns each one a sentiment label — Positive, Neutral, or Negative — based on financial context. This sentiment data is then aligned with stock prices by date and company.
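As a rough illustration of this stage, the sketch below strips URLs and stray symbols, drops duplicate posts, and then labels the remaining text. The regular expressions and function names are assumptions, not the notebook's own rules. The `ProsusAI/finbert` checkpoint is likewise an assumption: it is one publicly available FinBERT model, and the project may use a different one.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Keep letters, digits, whitespace, and a few symbols that matter in finance text.
NOISE_RE = re.compile(r"[^A-Za-z0-9\s$%.,!?'-]")
SPACE_RE = re.compile(r"\s+")


def clean_post(text):
    """Remove URLs and unwanted symbols, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = NOISE_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def deduplicate(posts):
    """Return cleaned posts, keeping the first copy of each (case-insensitive)."""
    seen = set()
    unique = []
    for post in posts:
        cleaned = clean_post(post)
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


def label_sentiment(texts):
    """Label each text as Positive, Neutral, or Negative with FinBERT.

    Assumes the ProsusAI/finbert checkpoint; downloads the model on first use.
    """
    from transformers import pipeline  # imported lazily; heavy dependency

    classifier = pipeline("text-classification", model="ProsusAI/finbert")
    return [result["label"].capitalize() for result in classifier(texts, truncation=True)]
```

Running `label_sentiment(deduplicate(raw_posts))` would give one label per unique post, which can then be joined to prices by date and company.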

3️⃣ Training the Models

Five algorithms are trained and compared against each other:

Logistic Regression — the reliable baseline
Random Forest — robust against overfitting
Gradient Boosting — sequential error correction
Support Vector Machine (SVM) — effective in high-dimensional space
LightGBM — fast, gradient-based boosting

GridSearchCV tunes each model's hyperparameters through cross-validated grid search, and performance is measured by accuracy, precision, recall, and F1-score.
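The comparison loop can be sketched as follows. This is a minimal illustration under several assumptions: it uses synthetic three-class data in place of the merged dataset, covers only two of the five models to stay short, and the parameter grids and the macro-F1 selection metric are placeholders rather than the notebook's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic three-class data standing in for the merged stock/sentiment dataset.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grids; the real search spaces are defined in the notebook.
candidates = {
    "LogisticRegression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "RandomForest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [3, None]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, scoring="f1_macro", cv=3)
    search.fit(X_train, y_train)
    predictions = search.best_estimator_.predict(X_test)
    results[name] = {
        "params": search.best_params_,
        "cv_f1": search.best_score_,
        "test_f1": f1_score(y_test, predictions, average="macro"),
    }

# Pick the winner on cross-validated score, not on the held-out test score.
best_name = max(results, key=lambda model: results[model]["cv_f1"])
print(best_name, results[best_name])
```

Selecting on the cross-validated score keeps the test split untouched until the final report, so the reported test metrics are not biased by model selection.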

4️⃣ Delivering Predictions

The top-performing model is surfaced with its full configuration and outputs a three-class prediction for stock movement: Increase, Decrease, or No Change.
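For reference, the three classes could be derived from consecutive closing prices as in the sketch below. The 0.5% threshold and the function names are assumptions chosen for illustration; the README does not state how the notebook defines "No Change".

```python
def label_movement(prev_close, close, threshold=0.005):
    """Classify one day's move relative to the previous close.

    `threshold` is a hypothetical cut-off: moves within +/-0.5% count as No Change.
    """
    if prev_close <= 0:
        raise ValueError("prev_close must be positive")
    change = (close - prev_close) / prev_close
    if change > threshold:
        return "Increase"
    if change < -threshold:
        return "Decrease"
    return "No Change"


def label_series(closes, threshold=0.005):
    """Label every day after the first in a list of closing prices."""
    return [
        label_movement(previous, current, threshold)
        for previous, current in zip(closes, closes[1:])
    ]
```

With closes of 100.0, 102.0, 101.9, and 99.0, the three labelled days come out as Increase, No Change, and Decrease.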

📦 Datasets

All data files live inside the data/ folder. Here's what each one contains:

| File | Description |
| --- | --- |
| `stock_data_raw.csv` | Original scraped Reddit posts, unprocessed |
| `stock_cleaned.csv` | Posts after noise removal and deduplication |
| `stock_preprocessed.csv` | Sentiment-labelled, feature-ready data |
| `all_companies_classification_data.csv` | Stock price history with movement labels |
| `merged_stock_sentiment_data.csv` | Final unified dataset for model training |
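To show how the last file relates to the two before it, here is a minimal pandas sketch of joining daily sentiment to labelled prices by company and date. The column names, the numeric sentiment scores, and the choice to treat days without posts as neutral are all assumptions; the actual CSV schemas may differ.

```python
import pandas as pd

# Hypothetical numeric encoding of the three sentiment labels.
SENTIMENT_SCORE = {"Positive": 1, "Neutral": 0, "Negative": -1}


def merge_sentiment_with_prices(posts, prices):
    """Aggregate post sentiment per company and day, then join it to prices."""
    scored = posts.assign(score=posts["sentiment"].map(SENTIMENT_SCORE))
    daily = scored.groupby(["company", "date"], as_index=False).agg(
        mean_sentiment=("score", "mean"),
        post_count=("score", "count"),
    )
    # Left join keeps every price row, even on days with no posts.
    merged = prices.merge(daily, on=["company", "date"], how="left")
    merged["post_count"] = merged["post_count"].fillna(0).astype(int)
    merged["mean_sentiment"] = merged["mean_sentiment"].fillna(0.0)
    return merged


posts = pd.DataFrame(
    {
        "company": ["TSLA", "TSLA", "AAPL"],
        "date": ["2024-01-02", "2024-01-02", "2024-01-02"],
        "sentiment": ["Positive", "Negative", "Positive"],
    }
)
prices = pd.DataFrame(
    {
        "company": ["TSLA", "AAPL", "AMZN"],
        "date": ["2024-01-02", "2024-01-02", "2024-01-02"],
        "label": ["Increase", "No Change", "Decrease"],
    }
)
merged = merge_sentiment_with_prices(posts, prices)
print(merged)
```

Filling missing sentiment with 0.0 makes quiet days indistinguishable from genuinely neutral ones, so `post_count` is kept as a separate feature.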

📥 Download the Data

The datasets are too large for GitHub. Grab them from Google Drive:

👉 Access Datasets

📊 Evaluation Output

Once training is complete, results/results.txt captures a full breakdown of the winning model:

Model name and chosen hyperparameters
Accuracy score
Precision, Recall, and F1-Score per class

This file serves as the baseline for comparing future iterations of the model.
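For readers who want to check the numbers by hand, the per-class metrics listed above can be computed as in this sketch. It is a plain-Python illustration, not the notebook's code (which may rely on scikit-learn's reporting helpers), and the function name and return shape are assumptions.

```python
from collections import Counter

CLASSES = ("Increase", "Decrease", "No Change")


def per_class_report(y_true, y_pred, classes=CLASSES):
    """Return overall accuracy and precision/recall/F1 for each class."""
    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            true_pos[truth] += 1
        else:
            false_pos[pred] += 1   # predicted this class, but it was wrong
            false_neg[truth] += 1  # missed the actual class

    report = {}
    for label in classes:
        predicted = true_pos[label] + false_pos[label]
        actual = true_pos[label] + false_neg[label]
        precision = true_pos[label] / predicted if predicted else 0.0
        recall = true_pos[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}

    accuracy = sum(true_pos.values()) / len(y_true) if y_true else 0.0
    return accuracy, report
```

Classes with no predictions or no true examples get a score of 0.0 here rather than raising an error, which is a convention chosen for this sketch.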

🚀 Getting Started

1. Get the Notebook: Download stock_trends_prediction.ipynb and save it to your machine or Google Drive.

2. Launch in Colab: Open the notebook in Google Colab, either directly or by uploading it from Drive.

3. Install Dependencies: Paste and run the following in the first cell:

!pip install praw transformers torch pandas scikit-learn yfinance matplotlib lightgbm

4. Set Up Reddit API Access

Visit Reddit App Preferences and create a new Script-type application.
Copy your Client ID and Client Secret into the notebook:

import praw

# Replace the placeholders with the values from your Reddit app.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="YOUR_USER_AGENT",
)

5. Execute the Notebook: Run the cells from top to bottom. The pipeline handles scraping, processing, sentiment tagging, model training, and evaluation.

6. Check the Output: Review results/results.txt for model metrics and explore the data/ folder for intermediate datasets.
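If you would rather not paste secrets into the notebook in step 4, one option is to read them from environment variables. The sketch below is a suggestion, not part of the project: the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `REDDIT_USER_AGENT` and the helper name are hypothetical.

```python
import os

# Hypothetical variable names; set them in your shell or Colab session first.
REQUIRED_VARS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")


def load_reddit_credentials(env=None):
    """Build the keyword arguments for praw.Reddit from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "user_agent": env["REDDIT_USER_AGENT"],
    }
```

The client can then be created with `reddit = praw.Reddit(**load_reddit_credentials())`, which keeps credentials out of any notebook you share or commit.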

💡 Takeaways

StockMind is built on the premise that financial markets don't move in isolation: they move with people. By capturing how investors and enthusiasts talk about stocks online, the system adds a sentiment signal that purely price-based models do not see. The project also shows how NLP and classical ML can be combined effectively, with FinBERT handling the text and classical classifiers handling the final prediction step.

💬 Open to Collaboration

Found a bug? Have an idea to improve accuracy? Pull requests and suggestions are welcome. Let's make market prediction more accessible and transparent.
