alphaZytx/StockMind

🧠 StockMind

Where social intelligence meets market prediction.

StockMind bridges the gap between online public discourse and financial markets. By tapping into what people are actively talking about on Reddit and pairing that with real stock movement data, it builds a system capable of forecasting whether a stock will rise, fall, or hold steady — for companies like Tesla, Apple, and Amazon.


✨ What Makes This Different

  • 🔁 Fully Automated Pipeline: From raw internet discussions to trained prediction models — everything runs inside a single notebook.
  • 🧠 Finance-Aware NLP: Uses FinBERT, a transformer model pre-trained on financial text, to understand market-relevant language.
  • 📉 Data Fusion: Goes beyond raw price history by layering in crowd sentiment as a predictive signal.
  • ⚙️ Multi-Model Benchmarking: Several algorithms compete head-to-head so the strongest one earns the job.

🗂 Repository Structure

📂 stock_trends_prediction/
├── stock_trends_prediction.ipynb   # The main Colab notebook
├── data/                          # Directory for datasets
│   ├── stock_data_raw.csv         # Raw Reddit data
│   ├── stock_cleaned.csv          # Cleaned Reddit data
│   ├── stock_preprocessed.csv     # Preprocessed sentiment data
│   ├── all_companies_classification_data.csv  # Stock data with labels
│   └── merged_stock_sentiment_data.csv        # Final merged dataset
├── results/                       # Directory for outputs
│   └── results.txt                # Final result matrix (model evaluation)
├── README.md                      # Project documentation
└── Report.pdf                     # Overview of project

⚙️ How It Works

1️⃣ Gathering the Raw Material

StockMind pulls from two sources simultaneously — community posts from Reddit's financial communities (r/stocks, r/wallstreetbets) via the Reddit API, and historical price data from Yahoo Finance. Together, they form the foundation of the dataset.

2️⃣ Cleaning & Structuring

Raw text is messy. Duplicate posts, irrelevant symbols, and noise are stripped out first. FinBERT then reads through the cleaned posts and assigns each one a sentiment label — Positive, Neutral, or Negative — based on financial context. This sentiment data is then aligned with stock prices by date and company.
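As a rough illustration of the cleaning step, the sketch below strips URLs and stray symbols and then drops empty and duplicate posts. The regular expressions and the function names `clean_post` and `clean_and_dedupe` are assumptions made for this example; the notebook's actual rules may differ.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Keep letters, digits, whitespace, and a few symbols that carry meaning in finance text.
NOISE_RE = re.compile(r"[^A-Za-z0-9\s$%.,!?'-]")
SPACE_RE = re.compile(r"\s+")


def clean_post(text: str) -> str:
    """Strip URLs and stray symbols, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = NOISE_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def clean_and_dedupe(posts: list[str]) -> list[str]:
    """Clean each post; drop empties and case-insensitive duplicates, keeping first-seen order."""
    seen: set[str] = set()
    result: list[str] = []
    for post in posts:
        cleaned = clean_post(post)
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            result.append(cleaned)
    return result


raw_posts = [
    "TSLA to the moon!!! 🚀🚀 https://example.com/post",
    "tsla to the moon!!!",
    "   ",
    "$AAPL earnings beat by 5% *** great quarter",
]
cleaned_posts = clean_and_dedupe(raw_posts)
print(cleaned_posts)
```

Deduplicating after cleaning, rather than before, catches reposts that differ only in links, emoji, or letter case.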

3️⃣ Training the Models

Five algorithms are trained and compared against each other:

  • Logistic Regression — the reliable baseline
  • Random Forest — robust against overfitting
  • Gradient Boosting — sequential error correction
  • Support Vector Machine (SVM) — effective in high-dimensional space
  • LightGBM — fast, gradient-based boosting

GridSearchCV is used to fine-tune each model, with performance measured across accuracy, precision, recall, and F1-score.
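The tuning step could look roughly like the sketch below, shown for the Logistic Regression baseline only. The synthetic data, the parameter grid, and the `f1_macro` scoring choice are assumptions for illustration; the README does not list the grids the notebook actually searches.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic three-class data standing in for the merged sentiment and price features.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid: the values the notebook searches are not given in this README.
param_grid = {"C": [0.1, 1.0, 10.0]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="f1_macro",  # assumed metric for picking the best setting
    cv=3,
)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
report = classification_report(y_test, search.predict(X_test), output_dict=True)
print("test accuracy:", round(report["accuracy"], 3))
```

The same pattern applies to the other four models by swapping the estimator and its grid, which keeps the comparison on equal footing.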

4️⃣ Delivering Predictions

The top-performing model is surfaced with its full configuration and outputs a three-class prediction for stock movement: Increase, Decrease, or No Change.
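To make the three target classes concrete, here is one way such labels might be derived from daily closing prices. The 0.5% threshold and the `label_movements` helper are assumptions for this sketch; the README does not state how "No Change" is defined.

```python
def label_movements(closes: list[float], threshold: float = 0.005) -> list[str]:
    """Label each day after the first by its return relative to the previous close."""
    labels = []
    for prev, curr in zip(closes, closes[1:]):
        change = (curr - prev) / prev
        if change > threshold:
            labels.append("Increase")
        elif change < -threshold:
            labels.append("Decrease")
        else:
            # Moves inside the +/- threshold band count as flat.
            labels.append("No Change")
    return labels


closes = [100.0, 102.0, 101.9, 99.0, 99.0]
labels = label_movements(closes)
print(labels)
```

The width of the threshold band controls class balance: a wider band produces more "No Change" days, which in turn affects the per-class metrics.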


📦 Datasets

All data files live inside the data/ folder. Here's what each one contains:

| File | Description |
|------|-------------|
| stock_data_raw.csv | Original scraped Reddit posts, unprocessed |
| stock_cleaned.csv | Posts after noise removal and deduplication |
| stock_preprocessed.csv | Sentiment-labelled, feature-ready data |
| all_companies_classification_data.csv | Stock price history with movement labels |
| merged_stock_sentiment_data.csv | Final unified dataset for model training |
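The final merged file pairs sentiment with price labels by date and company. A minimal sketch of that join follows, assuming one sentiment label per post and one movement label per (date, ticker) pair; the sample rows, field names, and the `net_sentiment` feature are hypothetical and not taken from the actual CSV schema.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (date, ticker, sentiment label) for each post.
posts = [
    ("2024-01-02", "TSLA", "Positive"),
    ("2024-01-02", "TSLA", "Negative"),
    ("2024-01-02", "TSLA", "Positive"),
    ("2024-01-02", "AAPL", "Neutral"),
    ("2024-01-03", "AAPL", "Negative"),
]
# Hypothetical mapping: (date, ticker) -> movement label.
price_labels = {
    ("2024-01-02", "TSLA"): "Increase",
    ("2024-01-02", "AAPL"): "No Change",
    ("2024-01-04", "AAPL"): "Decrease",
}


def merge_sentiment_with_prices(posts, price_labels):
    """Aggregate post sentiment per (date, ticker) and inner-join with movement labels."""
    counts = defaultdict(Counter)
    for date, ticker, sentiment in posts:
        counts[(date, ticker)][sentiment] += 1

    merged = []
    # Only keys present on both sides survive, mirroring an inner join.
    for key in sorted(counts.keys() & price_labels.keys()):
        c = counts[key]
        total = sum(c.values())
        merged.append({
            "date": key[0],
            "ticker": key[1],
            "n_posts": total,
            "net_sentiment": (c["Positive"] - c["Negative"]) / total,
            "movement": price_labels[key],
        })
    return merged


merged = merge_sentiment_with_prices(posts, price_labels)
for row in merged:
    print(row)
```

Days with posts but no price row (weekends, holidays) drop out of an inner join, so how those posts are handled is a design decision worth making explicit.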

📥 Download the Data

The datasets are too large for GitHub. Grab them from Google Drive:

👉 Access Datasets


📊 Evaluation Output

Once training is complete, results/results.txt captures a full breakdown of the winning model:

  • Model name and chosen hyperparameters
  • Accuracy score
  • Precision, Recall, and F1-Score per class

This file serves as the baseline against which future iterations of the model can be compared.
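For readers unfamiliar with per-class metrics, the sketch below computes precision, recall, and F1 for each of the three movement classes from scratch. The sample labels are invented for illustration and are not results from this project.

```python
def per_class_metrics(y_true, y_pred, classes):
    """Return precision, recall, and F1 for each class, treating it as one-vs-rest."""
    metrics = {}
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        metrics[cls] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


classes = ["Increase", "Decrease", "No Change"]
# Invented labels, for illustration only.
y_true = ["Increase", "Increase", "Decrease", "No Change", "Decrease", "No Change"]
y_pred = ["Increase", "Decrease", "Decrease", "No Change", "Decrease", "Increase"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics = per_class_metrics(y_true, y_pred, classes)

print(f"accuracy: {accuracy:.3f}")
for cls, m in metrics.items():
    print(f"{cls}: P={m['precision']:.3f} R={m['recall']:.3f} F1={m['f1']:.3f}")
```

Reporting per class matters here because accuracy alone can hide a model that rarely predicts the less frequent classes.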


🚀 Getting Started

1. Get the Notebook: Download stock_trends_prediction.ipynb and save it to your machine or Google Drive.

2. Launch in Colab: Open the notebook in Google Colab, either by uploading it directly or by opening it from Google Drive.

3. Install Dependencies: Paste and run the following in the first cell:

!pip install praw transformers torch pandas scikit-learn yfinance matplotlib lightgbm

4. Set Up Reddit API Access

  • Visit Reddit App Preferences and create a new Script-type application.
  • Copy your Client ID and Client Secret into the notebook:
import praw

# Client for a Script-type app; replace the placeholders with your own values.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="YOUR_USER_AGENT",
)

5. Execute the Notebook: Run cells top to bottom. The pipeline handles every stage: scraping, processing, sentiment tagging, model training, and evaluation.

6. Check the Output: Review results/results.txt for model metrics and explore the data/ folder for intermediate datasets.
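For step 4, an alternative to pasting credentials into a notebook cell is to read them from environment variables, which keeps secrets out of a shared or committed notebook. The sketch below is one possible approach; the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `REDDIT_USER_AGENT` and the helper `reddit_credentials` are assumptions, not part of the project.

```python
import os


def reddit_credentials(env=os.environ):
    """Collect praw.Reddit keyword arguments from environment variables."""
    # Hypothetical variable names; pick whatever fits your setup.
    names = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "user_agent": "REDDIT_USER_AGENT",
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {kwarg: env[var] for kwarg, var in names.items()}


# Demo with a stand-in mapping so the example runs without real secrets.
demo_env = {
    "REDDIT_CLIENT_ID": "abc",
    "REDDIT_CLIENT_SECRET": "xyz",
    "REDDIT_USER_AGENT": "stockmind-demo",
}
creds = reddit_credentials(demo_env)
print(sorted(creds))

# In the notebook, the result can be unpacked into the client:
#   reddit = praw.Reddit(**reddit_credentials())
```

Failing early with a list of missing variables tends to be easier to debug than an authentication error raised later during scraping.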


💡 Takeaways

StockMind is built on the premise that financial markets do not move in isolation from the people who trade them. By capturing how investors and enthusiasts talk about stocks online, the system adds a sentiment signal that price-only models do not see. The project also shows that NLP and classical ML can be combined effectively, with a transformer used for sentiment tagging and no deep learning required for the final prediction step.


💬 Open to Collaboration

Found a bug? Have an idea to improve accuracy? Pull requests and suggestions are welcome. Let's make market prediction more accessible and transparent.
