⚠️ Note: GitHub's notebook viewer may not display all cells correctly. You may see only a partial view of `Cat_A_DataDates.ipynb` with no warnings. For the complete notebook with all outputs, please clone the repository and open it locally in Jupyter or VS Code.
This data science project presents a Machine Learning pipeline that transforms Champion Group's raw company data into actionable Business Intelligence via robust Exploratory Data Analysis, unsupervised clustering and zero-shot prompt engineering. There are 2 main Jupyter notebooks, namely CAT_A_DataDates.ipynb and lookalike_search.ipynb, both built upon cleaned data from eda.ipynb.
CAT_A_DataDates.ipynb performs geospatial analysis based on HDBSCAN clustering, employs prompt engineering to generate cluster description summaries, and compares 6 ML models (Ridge, Lasso, ElasticNet, RandomForest, GradientBoosting, XGBoost) to identify the best model to predict market revenue and value of future companies.
lookalike_search.ipynb comprises a simple patron-facing search engine to support decision-making based on customizable company characteristics. This can be run independently from the ML pipeline trained in CAT_A_DataDates.ipynb.
To run this project, perform the following steps in sequential order:
- Install Python 3.10+.
- Run `pip install -r requirements.txt`.
- Download the dataset (`champions_group_data.xlsx`) into the `data` directory.
- Run `eda.ipynb` to generate a cleaned dataset (`cache/cleaned_data.csv`).
- Log in to a HuggingFace account and get a free HuggingFace token before running `CAT_A_DataDates.ipynb`.
- Run `CAT_A_DataDates.ipynb` to train an ML pipeline to predict market revenue and value.
- Run `lookalike_search.ipynb` to simulate a search engine for customers to filter companies by preferences.
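Before opening the notebooks, the prerequisites listed above can be sanity-checked with a short script. This is a minimal sketch and not part of the repository: the function name `check_setup` is hypothetical, and it only checks the Python version and the two file paths named in the steps.

```python
import sys
from pathlib import Path

# Paths taken from the setup steps; relative to the repository root.
DATASET = Path("data") / "champions_group_data.xlsx"
CLEANED = Path("cache") / "cleaned_data.csv"


def check_setup(root="."):
    """Return a list of setup problems; an empty list means nothing was found missing."""
    root = Path(root)
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10+ is required")
    if not (root / DATASET).is_file():
        problems.append(f"missing dataset: {DATASET.as_posix()}")
    if not (root / CLEANED).is_file():
        problems.append(f"missing cleaned data: {CLEANED.as_posix()} (run eda.ipynb first)")
    return problems


# Print one line per problem when run from the repository root.
for problem in check_setup():
    print("-", problem)
```

The script does not check the HuggingFace token or the installed packages; those still need to be verified manually.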
After completing the environment setup, the following repository structure is expected locally:
├── data
│ └── champions_group_data.xlsx
├── cache
│ ├── cleaned_data.csv
│ ├── cluster_summaries.json
│ ├── clusters.json
│ ├── geocode_cache.db
│ └── sic_lookup.json
├── interactive_maps
│ ├── cluster_map.html
│ ├── self_to_targets_map_adaptive.html
│ └── self_to_targets_maps.html
├── pkg
│ ├── __init__.py
│ ├── eda.py
│ ├── embedding.py
│ ├── examples.py
│ ├── geocode.py
│ ├── geography.py
│ ├── helpers.py
│ ├── ml_pipeline.py
│ ├── prediction_utils.py
│ ├── prompts.py
│ ├── run_pipeline.py
│ └── sic.py
├── .gitattributes
├── .gitignore
├── Cat_A_DataDates.ipynb
├── README.md
├── Report.docx
├── eda.ipynb
├── lookalike_search.ipynb
└── requirements.txt
CAT_A_DataDates.ipynb performs multi-dimensional analysis of 8,559 companies. It comprises these steps in sequential order:
- Data Preprocessing
- Unsupervised Clustering
- Comparative Industry and Geographical Analysis of Clusters
- Prompt Engineering for Cluster Summaries
- Regression Model Selection and Hyperparameter Tuning to train an ML pipeline to predict market revenue and value.
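The model selection step can be illustrated with a small sketch that ranks candidate regressors by cross-validated RMSE. This is not the notebook's code: `MeanBaseline` and `OneFeatureOLS` are hypothetical stand-ins for the six models, the data is an invented toy example, and the folds are contiguous without shuffling to keep the example dependency-free. Any object with scikit-learn-style `fit` and `predict` methods could be passed in the same way.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def cross_val_rmse(make_model, X, y, k=5):
    """Mean RMSE over k contiguous folds; make_model() must return a fresh model."""
    n = len(X)
    scores = []
    for fold in range(k):
        test_idx = set(range(fold * n // k, (fold + 1) * n // k))
        X_train = [x for i, x in enumerate(X) if i not in test_idx]
        y_train = [v for i, v in enumerate(y) if i not in test_idx]
        X_test = [x for i, x in enumerate(X) if i in test_idx]
        y_test = [v for i, v in enumerate(y) if i in test_idx]
        model = make_model()
        model.fit(X_train, y_train)
        scores.append(rmse(y_test, list(model.predict(X_test))))
    return sum(scores) / k


def select_best_model(candidates, X, y, k=5):
    """Return (name of the lowest-RMSE candidate, dict of all scores)."""
    scores = {name: cross_val_rmse(factory, X, y, k) for name, factory in candidates.items()}
    return min(scores, key=scores.get), scores


class MeanBaseline:
    """Hypothetical stand-in: always predicts the training mean."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_] * len(X)


class OneFeatureOLS:
    """Hypothetical stand-in: least-squares line fitted on the first feature."""
    def fit(self, X, y):
        xs = [row[0] for row in X]
        mx, my = sum(xs) / len(xs), sum(y) / len(y)
        sxx = sum((x - mx) ** 2 for x in xs)
        self.slope_ = sum((x - mx) * (v - my) for x, v in zip(xs, y)) / sxx if sxx else 0.0
        self.intercept_ = my - self.slope_ * mx
        return self

    def predict(self, X):
        return [self.intercept_ + self.slope_ * row[0] for row in X]


# Invented toy data: one feature with an exactly linear target.
X = [[float(i)] for i in range(20)]
y = [3.0 * i + 2.0 for i in range(20)]
best, scores = select_best_model({"mean": MeanBaseline, "ols": OneFeatureOLS}, X, y)
print(best, scores)
```

Passing factories rather than fitted models ensures each fold trains a fresh model, so no information leaks between folds.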
lookalike_search.ipynb implements a search engine for patrons to find company recommendations based on preferred characteristics. This goes beyond the simple high-level classifications that conventional EDA alone provides.
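One way such a search could work is to rank companies by their distance to the preferred characteristics. The sketch below is an assumption about the general approach, not the notebook's implementation: the function `lookalike_search`, the field names `it_spend` and `employees`, and the sample records are all hypothetical.

```python
import math


def lookalike_search(companies, preferences, top_n=3):
    """Rank companies by range-scaled Euclidean distance to the preferred values."""
    fields = list(preferences)
    # Scale each field by its observed range so large-valued fields do not dominate.
    spans = {}
    for field in fields:
        values = [company[field] for company in companies]
        spans[field] = (max(values) - min(values)) or 1.0

    def distance(company):
        return math.sqrt(
            sum(((company[f] - preferences[f]) / spans[f]) ** 2 for f in fields)
        )

    return sorted(companies, key=distance)[:top_n]


# Hypothetical sample records for illustration only.
companies = [
    {"name": "A", "it_spend": 100.0, "employees": 50},
    {"name": "B", "it_spend": 900.0, "employees": 400},
    {"name": "C", "it_spend": 120.0, "employees": 60},
    {"name": "D", "it_spend": 500.0, "employees": 200},
]
matches = lookalike_search(companies, {"it_spend": 105.0, "employees": 52}, top_n=2)
print([m["name"] for m in matches])  # ['A', 'C']
```

Only the fields named in `preferences` contribute to the ranking, so a patron can search on as few or as many characteristics as they like.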
- Input companies can be split into 4 main clusters based on IT spend, PC usage, market revenue, and market value. They are small branch subsidiaries, big enterprises, lower-value lower-IT intensive companies (possibly not mainly in the tech industry), and high-IT tech firms.
- Geospatial analysis reveals that most companies in all 4 identified clusters are, as expected, concentrated in first-tier Chinese cities (Chongqing, Guangdong, Shanghai, Beijing, Shenzhen, etc.). Nonetheless, the 2nd cluster is more broadly distributed, the 3rd cluster lies further inland, and the last cluster leans towards the coast. The LLM summaries reaffirm this.
- IT spend and IT budget are both strong predictors of market revenue and market value. However, the correlation between IT spend and market value no longer holds for noisy data (outlier companies with incomplete data).
- The XGBoost regression model performs best at predicting market revenue and value, with RMSE as the loss function, ostensibly because it handles high-dimensional inputs well. However, if high-cardinality string descriptions are dropped, RandomForest outperforms it, all else being equal, ostensibly because the removed strings carried significant noise and overfitting is less pronounced without them.
- Integrate transformer-based text embeddings (e.g., SBERT) and other Natural Language Processing (NLP) techniques to extract semantic features from unstructured company descriptions.
- Implement a polished GUI for the search engine so that patrons with no technical background can easily search for their desired companies.
- Enhance the ML pipeline to recommend the cities in which franchisors should open their next franchise, based on existing competitors' geospatial data, market conditions (possibly inferred from competitors' market revenue and value), and the spread of existing franchisees.