StockMind bridges the gap between online public discourse and financial markets. By tapping into what people are actively talking about on Reddit and pairing that with real stock movement data, it builds a system capable of forecasting whether a stock will rise, fall, or hold steady — for companies like Tesla, Apple, and Amazon.
- 🔁 Fully Automated Pipeline: From raw internet discussions to trained prediction models — everything runs inside a single notebook.
- 🧠 Finance-Aware NLP: Uses FinBERT, a transformer model pre-trained on financial text, to understand market-relevant language.
- 📉 Data Fusion: Goes beyond raw price history by layering in crowd sentiment as a predictive signal.
- ⚙️ Multi-Model Benchmarking: Several algorithms compete head-to-head so the strongest one earns the job.
📂 stock_trends_prediction/
├── stock_trends_prediction.ipynb # The main Colab notebook
├── data/ # Directory for datasets
│ ├── stock_data_raw.csv # Raw Reddit data
│ ├── stock_cleaned.csv # Cleaned Reddit data
│ ├── stock_preprocessed.csv # Preprocessed sentiment data
│ ├── all_companies_classification_data.csv # Stock data with labels
│ └── merged_stock_sentiment_data.csv # Final merged dataset
├── results/ # Directory for outputs
│ └── results.txt # Final result matrix (model evaluation)
├── README.md # Project documentation
└── Report.pdf # Overview of project
StockMind pulls from two sources simultaneously — community posts from Reddit's financial communities (r/stocks, r/wallstreetbets) via the Reddit API, and historical price data from Yahoo Finance. Together, they form the foundation of the dataset.
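The two pulls could be sketched roughly as follows. This is an illustration, not the notebook's actual code: the helper names (`submission_to_row`, `fetch_posts`, `fetch_prices`), the use of the company name as the search query, and the chosen fields are all assumptions. It relies on PRAW's `subreddit(...).search(...)` and yfinance's `Ticker(...).history(...)`.

```python
from datetime import datetime, timezone


def submission_to_row(submission, company):
    """Flatten one Reddit submission into a plain dict (field choice is an assumption)."""
    created = datetime.fromtimestamp(submission.created_utc, tz=timezone.utc)
    return {
        "id": submission.id,
        "company": company,
        "date": created.date().isoformat(),
        "text": f"{submission.title} {submission.selftext}".strip(),
    }


def fetch_posts(reddit, company, subreddits=("stocks", "wallstreetbets"), limit=100):
    """Search each subreddit for the company; expects an authenticated praw.Reddit."""
    rows = []
    for name in subreddits:
        for submission in reddit.subreddit(name).search(company, limit=limit):
            rows.append(submission_to_row(submission, company))
    return rows


def fetch_prices(ticker, start, end):
    """Download daily price history from Yahoo Finance as a pandas DataFrame."""
    import yfinance as yf  # imported lazily so the helpers above work without it

    return yf.Ticker(ticker).history(start=start, end=end)
```

With a configured client, `fetch_posts(reddit, "Tesla")` and `fetch_prices("TSLA", "2024-01-01", "2024-06-30")` would return the two raw inputs; the ticker and dates here are placeholders.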
Raw text is messy. Duplicate posts, irrelevant symbols, and noise are stripped out first. FinBERT then reads through the cleaned posts and assigns each one a sentiment label — Positive, Neutral, or Negative — based on financial context. This sentiment data is then aligned with stock prices by date and company.
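A minimal sketch of the cleaning and alignment steps, assuming the posts sit in a pandas DataFrame with `date`, `company`, and `text` columns and that FinBERT's output has already been stored in a `sentiment` column. The column names, the regular expressions, and the choice to count labels per day are assumptions for illustration, not the notebook's confirmed logic.

```python
import re

import pandas as pd

URL_RE = re.compile(r"https?://\S+")
NOISE_RE = re.compile(r"[^A-Za-z0-9$%.,!?'\s-]")


def clean_text(text):
    """Drop URLs and stray symbols, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = NOISE_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_posts(posts):
    """Clean the text column and drop empty or duplicate posts."""
    out = posts.copy()
    out["text"] = out["text"].map(clean_text)
    out = out[out["text"] != ""]
    return out.drop_duplicates(subset=["company", "text"]).reset_index(drop=True)


def merge_sentiment_with_prices(labelled_posts, prices):
    """Count sentiment labels per (date, company) and join them onto the price rows."""
    counts = (
        labelled_posts.groupby(["date", "company", "sentiment"])
        .size()
        .unstack(fill_value=0)
        .reset_index()
    )
    return prices.merge(counts, on=["date", "company"], how="inner")
```

The inner join keeps only days that have both a price row and at least one post; a left join with zero-filled counts would be the alternative if days without posts should stay in the dataset.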
Five algorithms are trained and compared against each other:
- Logistic Regression — the reliable baseline
- Random Forest — robust against overfitting
- Gradient Boosting — sequential error correction
- Support Vector Machine (SVM) — effective in high-dimensional space
- LightGBM — fast, gradient-based boosting
GridSearchCV is used to fine-tune each model, with performance measured across accuracy, precision, recall, and F1-score.
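The comparison loop might look something like the sketch below. It runs on synthetic three-class data rather than the merged dataset, the hyperparameter grids are illustrative guesses, and macro-F1 is assumed as the selection metric. LightGBM is left out to keep the example limited to scikit-learn; `lightgbm.LGBMClassifier` could be added to the dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic three-class data standing in for the merged dataset.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Illustrative grids only; the notebook's real search space is not documented here.
candidates = {
    "LogisticRegression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "RandomForest": (RandomForestClassifier(random_state=0), {"n_estimators": [50, 100]}),
    "GradientBoosting": (
        GradientBoostingClassifier(random_state=0),
        {"learning_rate": [0.05, 0.1]},
    ),
    "SVM": (SVC(), {"C": [0.1, 1.0], "kernel": ["linear", "rbf"]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # 3-fold cross-validated grid search, scored by macro-averaged F1.
    search = GridSearchCV(estimator, grid, scoring="f1_macro", cv=3)
    search.fit(X_train, y_train)
    results[name] = (search.best_score_, search.best_params_, search.best_estimator_)

# Pick the model with the highest cross-validated score.
best_name = max(results, key=lambda n: results[n][0])
best_score, best_params, best_model = results[best_name]
print(best_name, best_params, round(best_score, 3))
```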
The top-performing model is surfaced with its full configuration and outputs a three-class prediction for stock movement: Increase, Decrease, or No Change.
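How the three classes are derived from prices is not spelled out here. One common approach, shown as a sketch, labels each day by the next day's relative change in closing price. The function name `label_movement` and the 0.5% threshold are assumptions for illustration.

```python
def label_movement(closes, threshold=0.005):
    """Label each day by the next day's relative change in closing price.

    The 0.5% threshold is an assumption chosen for illustration.
    The last day gets no label because it has no following close.
    """
    labels = []
    for today, tomorrow in zip(closes, closes[1:]):
        change = (tomorrow - today) / today
        if change > threshold:
            labels.append("Increase")
        elif change < -threshold:
            labels.append("Decrease")
        else:
            labels.append("No Change")
    return labels
```

A wider threshold puts more days into No Change, so the value chosen directly affects class balance and should be checked against the label distribution.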
All data files live inside the data/ folder. Here's what each one contains:
| File | Description |
|---|---|
| stock_data_raw.csv | Original scraped Reddit posts, unprocessed |
| stock_cleaned.csv | Posts after noise removal and deduplication |
| stock_preprocessed.csv | Sentiment-labelled, feature-ready data |
| all_companies_classification_data.csv | Stock price history with movement labels |
| merged_stock_sentiment_data.csv | Final unified dataset for model training |
The datasets are too large for GitHub. Grab them from Google Drive:
Once training is complete, results/results.txt captures a full breakdown of the winning model:
- Model name and chosen hyperparameters
- Accuracy score
- Precision, Recall, and F1-Score per class
This file serves as the baseline for comparing future iterations of the model.
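A report with those fields could be produced along these lines. The layout of the text file and the helper name `write_results` are assumptions; the metrics come from scikit-learn's `accuracy_score` and `precision_recall_fscore_support`.

```python
from pathlib import Path

from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def write_results(path, model_name, params, y_true, y_pred, labels):
    """Write model name, hyperparameters, accuracy, and per-class metrics to a text file."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=labels, zero_division=0
    )
    lines = [
        f"Model: {model_name}",
        f"Hyperparameters: {params}",
        f"Accuracy: {accuracy_score(y_true, y_pred):.3f}",
        "",
        f"{'Class':<12}{'Precision':>10}{'Recall':>10}{'F1':>10}",
    ]
    for label, p, r, f in zip(labels, precision, recall, f1):
        lines.append(f"{label:<12}{p:>10.3f}{r:>10.3f}{f:>10.3f}")
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create results/ if needed
    path.write_text("\n".join(lines) + "\n")
    return path
```

Calling `write_results("results/results.txt", name, params, y_test, y_pred, ["Increase", "Decrease", "No Change"])` after evaluation would overwrite the file on each run, so earlier reports need to be copied elsewhere if they are to be kept for comparison.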
1. Get the Notebook
Download stock_trends_prediction.ipynb and save it to your machine or Google Drive.
2. Launch in Colab
Open the notebook via Google Colab directly or by uploading from Drive.
3. Install Dependencies
Paste and run the following in the first cell:
!pip install praw transformers torch pandas scikit-learn yfinance matplotlib lightgbm
4. Set Up Reddit API Access
- Visit Reddit App Preferences and create a new Script-type application.
- Copy your Client ID and Client Secret into the notebook:
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="YOUR_USER_AGENT",
)
5. Execute the Notebook
Run cells top to bottom. The pipeline handles everything: scraping, processing, sentiment tagging, model training, and evaluation.
6. Check the Output
Review results/results.txt for model metrics and explore the data/ folder for intermediate datasets.
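For step 4, an optional alternative to pasting credentials into a cell is to read them from environment variables, so that a shared copy of the notebook carries no secrets. This is a sketch only: the variable names (`REDDIT_CLIENT_ID` and so on) and both helper functions are assumptions, not part of the notebook.

```python
import os


def reddit_credentials(env=os.environ):
    """Collect PRAW keyword arguments from environment variables (names are assumptions)."""
    required = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "user_agent": "REDDIT_USER_AGENT",
    }
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {key: env[var] for key, var in required.items()}


def make_reddit(env=os.environ):
    """Build a praw.Reddit client from the environment."""
    import praw  # imported lazily so credential checks work without PRAW installed

    return praw.Reddit(**reddit_credentials(env))
```

In Colab the variables would need to be set earlier in the session, for example by assigning to `os.environ` in a cell that is cleared before sharing.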
StockMind demonstrates that financial markets don't move in isolation — they move with people. By capturing how investors and enthusiasts talk about stocks online, the system adds a layer of predictive intelligence that pure price-based models miss. The project also shows how NLP and classical ML can be effectively combined without needing deep learning for the final prediction step.
Found a bug? Have an idea to improve accuracy? Pull requests and suggestions are welcome. Let's make market prediction more accessible and transparent.