DataDates: Project Description

⚠️ Note: GitHub's notebook viewer may not display all cells correctly. You may see only a partial view of Cat_A_DataDates.ipynb with no warnings. For the complete notebook with all outputs, please clone the repository and open it locally in Jupyter or VS Code.

This data science project presents a machine learning pipeline that transforms Champion Group's raw company data into actionable business intelligence through exploratory data analysis (EDA), unsupervised clustering, and zero-shot prompt engineering. It has 2 main Jupyter notebooks, Cat_A_DataDates.ipynb and lookalike_search.ipynb, both built on the cleaned data produced by eda.ipynb.

Cat_A_DataDates.ipynb performs geospatial analysis based on HDBSCAN clustering, uses prompt engineering to generate a summary description of each cluster, and compares 6 ML models (Ridge, Lasso, ElasticNet, RandomForest, GradientBoosting, XGBoost) to identify the best one for predicting the market revenue and market value of future companies.

lookalike_search.ipynb comprises a simple patron-facing search engine that supports decision-making based on customizable company characteristics. It can be run independently of the ML pipeline trained in Cat_A_DataDates.ipynb.

Environment Setup

To run this project, perform the following steps in sequential order:

Install Python 3.10+.
Run pip install -r requirements.txt.
Download the dataset (champions_group_data.xlsx) into the data folder.
Run eda.ipynb to generate a cleaned dataset (cache/cleaned_data.csv).
Log in to a HuggingFace account and get a free HuggingFace token before running Cat_A_DataDates.ipynb.
Run Cat_A_DataDates.ipynb to train an ML pipeline that predicts market revenue and value.
Run lookalike_search.ipynb to simulate a search engine that lets customers filter companies by preferences.
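
Before opening the notebooks, the prerequisites above can be checked with a short script. The following is a minimal sketch, not part of the repository: the function name `check_setup` is hypothetical, the repository root is assumed to be the working directory, and the token is assumed to be exposed through the `HF_TOKEN` environment variable (the notebook may read it differently).

```python
import os
import sys
from pathlib import Path

# Paths taken from the setup steps above, relative to the repository root.
RAW_DATA = Path("data") / "champions_group_data.xlsx"
CLEANED_DATA = Path("cache") / "cleaned_data.csv"


def check_setup(root=Path("."), env=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10+ is required")
    if not (root / RAW_DATA).is_file():
        problems.append(f"missing dataset: {RAW_DATA}")
    if not (root / CLEANED_DATA).is_file():
        problems.append(f"missing cleaned data: {CLEANED_DATA} (run eda.ipynb)")
    # Assumption: the HuggingFace token is provided via HF_TOKEN.
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set (needed for Cat_A_DataDates.ipynb)")
    return problems


for problem in check_setup():
    print("-", problem)
```

Run from the repository root, the script prints one line per unmet prerequisite and nothing when everything is in place.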

Repository Structure

After completing the environment setup, the following repository structure is expected locally:

├── data
│   └── champions_group_data.xlsx
├── cache
│   ├── cleaned_data.csv
│   ├── cluster_summaries.json
│   ├── clusters.json
│   ├── geocode_cache.db
│   └── sic_lookup.json
├── interactive_maps
│   ├── cluster_map.html
│   ├── self_to_targets_map_adaptive.html
│   └── self_to_targets_maps.html
├── pkg
│   ├── __init__.py
│   ├── eda.py
│   ├── embedding.py
│   ├── examples.py
│   ├── geocode.py
│   ├── geography.py
│   ├── helpers.py
│   ├── ml_pipeline.py
│   ├── prediction_utils.py
│   ├── prompts.py
│   ├── run_pipeline.py
│   └── sic.py
├── .gitattributes
├── .gitignore
├── Cat_A_DataDates.ipynb
├── README.md
├── Report.docx
├── eda.ipynb
├── lookalike_search.ipynb
└── requirements.txt

Project Workflow Overview

Cat_A_DataDates.ipynb performs multi-dimensional analysis of 8,559 companies. It comprises the following steps, in order:

Data Preprocessing
Unsupervised Clustering
Comparative Industry and Geographical Analysis of Clusters
Prompt Engineering for Cluster Summaries
Regression Model Selection and Hyperparameter Tuning, to train an ML pipeline that predicts market revenue and value
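
To illustrate the prompt engineering step, a zero-shot prompt gives the model an instruction and the cluster's data, with no worked examples. The sketch below is hypothetical: the function name, field names, and wording are assumptions for illustration, and the project's actual prompts (in pkg/prompts.py) may look quite different.

```python
def build_cluster_prompt(cluster_id, stats, top_cities):
    """Build a zero-shot prompt: an instruction plus data, with no examples."""
    stat_lines = "\n".join(
        f"- {name}: {value:,.2f}" for name, value in stats.items()
    )
    cities = ", ".join(top_cities) if top_cities else "unknown"
    return (
        "You are a business analyst. In two or three sentences, describe the "
        "typical company in the cluster below. Use only the figures provided.\n\n"
        f"Cluster {cluster_id}\n"
        f"Median values:\n{stat_lines}\n"
        f"Most common cities: {cities}\n"
    )


# Made-up figures for illustration only; not results from the notebook.
prompt = build_cluster_prompt(
    2,
    {"it_spend": 125000.0, "pc_count": 340.0},
    ["Shanghai", "Beijing"],
)
print(prompt)
```

Restricting the model to the figures provided is one way to keep a summary grounded in the cluster statistics rather than in the model's prior knowledge.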

lookalike_search.ipynb implements a search engine that lets patrons find company recommendations based on preferred characteristics. This goes beyond the simple high-level classifications that conventional EDA alone provides.
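
One common way to build such a search is to filter on categorical preferences and then rank the remaining companies by distance to a target profile. The sketch below assumes that approach; the records, field names, and the `lookalikes` function are hypothetical and are not taken from the notebook.

```python
from math import sqrt

# Hypothetical records and field names, for illustration only.
COMPANIES = [
    {"name": "A", "city": "Shanghai", "employees": 120, "it_spend": 50_000},
    {"name": "B", "city": "Shanghai", "employees": 900, "it_spend": 400_000},
    {"name": "C", "city": "Beijing", "employees": 150, "it_spend": 60_000},
    {"name": "D", "city": "Shanghai", "employees": 80, "it_spend": 30_000},
]
NUMERIC_FIELDS = ("employees", "it_spend")


def lookalikes(companies, target, city=None, top_k=3):
    """Rank companies by closeness to `target` on min-max-scaled fields."""
    pool = [c for c in companies if city is None or c["city"] == city]
    if not pool:
        return []
    # Scale each field to [0, 1] so large-valued fields do not dominate.
    bounds = {}
    for field in NUMERIC_FIELDS:
        values = [c[field] for c in pool] + [target[field]]
        bounds[field] = (min(values), max(values))

    def scaled(record, field):
        lo, hi = bounds[field]
        return 0.0 if hi == lo else (record[field] - lo) / (hi - lo)

    def distance(company):
        return sqrt(sum(
            (scaled(company, f) - scaled(target, f)) ** 2
            for f in NUMERIC_FIELDS
        ))

    return sorted(pool, key=distance)[:top_k]


target = {"employees": 110, "it_spend": 48_000}
print([c["name"] for c in lookalikes(COMPANIES, target, city="Shanghai")])
```

Scaling before measuring distance matters here because raw IT spend is several orders of magnitude larger than headcount and would otherwise decide the ranking on its own.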

4 Key Insights

Input companies can be split into 4 main clusters based on IT spend, PC usage, market revenue, and market value: small branch subsidiaries, big enterprises, lower-value and less IT-intensive companies (possibly outside the tech industry), and high-IT tech firms.
Geospatial analysis reveals that most companies in all 4 identified clusters are, as expected, concentrated in first-tier Chinese cities and regions (Chongqing, Guangdong, Shanghai, Beijing, Shenzhen, etc.). Nonetheless, the 2nd cluster is more broadly distributed, the 3rd cluster lies further inland, and the last cluster leans coastal. The LLM summaries corroborate this pattern.
IT spend and IT budget are both strong predictors of market revenue and market value. However, the correlation between IT spend and market value does not hold on noisy data (outlier companies with incomplete records).
The XGBoost regression model performs best at predicting market revenue and value, with RMSE as the loss function, presumably because it copes best with high-dimensional inputs. However, when the high-cardinality string descriptions are dropped, RandomForest outperforms it with all else held equal, presumably because the removed strings carry significant noise and overfitting becomes less pronounced.
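
For reference, RMSE is the square root of the mean squared difference between actual and predicted values, and the model with the lowest RMSE ranks best. The sketch below shows that ranking in plain Python; the helper names and the numbers are made up for illustration and do not reproduce the notebook's results or code.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return sqrt(squared / len(y_true))


def rank_models(y_true, predictions):
    """Return (model_name, rmse) pairs sorted from best (lowest) to worst."""
    scores = {name: rmse(y_true, y_pred) for name, y_pred in predictions.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Made-up held-out targets and predictions; not results from the notebook.
y_true = [10.0, 12.0, 15.0, 20.0]
predictions = {
    "model_a": [11.0, 12.0, 14.0, 19.0],
    "model_b": [10.0, 13.0, 15.0, 20.0],
}
for name, score in rank_models(y_true, predictions):
    print(f"{name}: RMSE = {score:.3f}")
```

Because RMSE squares each error before averaging, a few large misses on outlier companies can outweigh many small ones, which is worth keeping in mind when comparing models on noisy data.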

Future Improvements

Integrate transformer-based text embeddings (e.g., SBERT) and other Natural Language Processing (NLP) techniques to extract semantic features from unstructured company descriptions.
Implement a polished GUI for the search engine so that patrons without a technical background can search for the companies they want.
Enhance the ML pipeline to recommend the cities in which franchisors should open their next franchise, based on the geospatial data and market condition of existing competitors (possibly inferred from their market revenue and value) and the spread of existing franchisees.
