!!! This repository is under construction. Cleaning of the notebooks, documentation, and descriptions is in progress. Thank you for your understanding. !!!
Online retailers for fashion products suffer from the paradox of choice: with an increasing number of articles to choose from, customers tend to make fewer or wrong purchases. This results in lower customer satisfaction, more returns, and lower turnover.
This capstone project deals with the challenging task of creating a recommendation system for a fashion label. Such a system increases customer satisfaction and the sustainability of the sales process, and a targeted recommendation generates significant economic value. In addition, a recommendation system provides valuable insights into customer preferences and forecasting opportunities for stakeholders. The task is to read customers' minds without any explicit feedback, based solely on customers' purchase decisions, using clever feature engineering and targeted models. An in-depth error analysis was conducted to select an appropriate model, and the winning model was further trained on the proper metric. Beyond the challenging nature of a recommender system itself, several additional obstacles had to be overcome: quickly adapting to an unfamiliar subject area, dealing with big data, and applying targeted feature engineering to a previously unfamiliar methodology. All these efforts were made because “you have the right to look fabulous.”
The data was provided by H&M via their fashion recommendations Kaggle challenge.
The dataset contains three tables:
- article metadata: 105,542 articles and 25 features
- customer data: 1,371,980 customers and 7 features
- transactions table: 31,788,324 transactions and 5 features, covering 2018-09-20 to 2020-09-22
The results of this capstone project were presented on the 19th of May 2022 during the graduation event of the neue fische data science bootcamp.
The final slides can be accessed here.
The main table is the transactions table, which connects the article table with the customer table. Every single purchased item is recorded there. An overview is given in the following:
Column name | Datatype | Meaning | nunique | annotation |
---|---|---|---|---|
t_dat | object | Transaction date | 734 | needs to be transformed |
customer_id | object | Customer ID | 1362281 | primary key |
article_id | int64 | Article ID | 104547 | secondary key |
price | float64 | price | 9857 | appears transformed or in a different currency |
sales_channel_id | int64 | 2 is online and 1 is store | 2 | encode to 0 & 1? |
The article table describes the 105,542 articles in more detail with 25 features, which are shown in the following:
Column name | Datatype | Meaning | nunique | annotation |
---|---|---|---|---|
article_id | int64 | ID connected to the image name | 105542 | |
product_code | int64 | Overlying product category | 45875 | Takes the first 7 digits of the article ID |
prod_name | object | Overall product name | 47224 | General product name |
product_type_no | int64 | Classification by product type number | 132 | -1 = unknown = NaN? (121 articles) |
product_type_name | object | Classification by product type label | 131 | 131 values + -1 for unknown |
product_group_name | object | Product type categories by name | 19 | 121 unknown |
graphical_appearance_no | int64 | Appearance number | 30 | -1 = unknown = NaN? (52 articles) |
graphical_appearance_name | object | Appearance label | 30 | |
colour_group_code | int64 | (dominating) colour group number | 50 | Not a closed numbering, e.g. 18, 24-29, 34-39 etc. are missing |
colour_group_name | object | (dominating) colour group label | 50 | |
perceived_colour_value_id | int64 | "grey" scale or lighting condition of the image | 8 | -1 = unknown = NaN? (28 articles) |
perceived_colour_value_name | object | Description of the lighting condition (label) | 8 | |
perceived_colour_master_id | int64 | Colour master ID | 20 | 17 is missing and -1 = unknown |
perceived_colour_master_name | object | Label of the colour master ID | 20 | |
department_no | int64 | Department number | 299 | More than one department might be responsible for a certain topic |
department_name | object | Department label (topics) | 250 | |
index_code | object | "Target group classification" index code | 10 | A-F |
index_name | object | "Target group classification" index label | 10 | Ladieswear, Menswear etc. |
index_group_no | int64 | Coarse "target group classification" | 5 | 1, 2, 3, 4, 26 |
index_group_name | object | Coarse "target group classification" label | 5 | 'Ladieswear' = 1, 'Divided' = 2, 'Menswear' = 3, 'Baby/Children' = 4, 'Sport' = 26 |
section_no | int64 | "Target group classification" + usage numbering | 57 | Not a closed numbering |
section_name | object | "Target group classification" + usage label | 56 | -1 is missing |
garment_group_no | int64 | Garment group number | 21 | 1002, 1003, 1007 ... 1025 |
garment_group_name | object | Garment group label | 21 | |
detail_desc | object | More detailed description | 43404 | Not every item has an individual description; 416 NaN values |
The third table is the customer table. It contains 1,371,980 customers and 7 features, shown in the following:
Column name | Datatype | Meaning | nunique | annotation |
---|---|---|---|---|
customer_id | object | Customer ID | 1371980 | hashed customer ID |
FN | float64 | Receives Fashion News | 1 | |
Active | float64 | Customer is active for communication | 1 | Active = ACTIVE |
club_member_status | object | Club status | 3 | ACTIVE = Active ? |
fashion_news_frequency | object | Fashion News frequency | 4 | interested in topics |
age | float64 | Age | 84 | |
postal_code | object | Zip code | 352899 | hashed postal code |
Further investigation of the basic data can be found in the notebook Columsdescription.ipynb.
To deal with the big data, a cloud service (Google Cloud Platform) was used. The datasets were transformed into .parquet files (pyarrow) and loaded into Google Cloud Storage. In addition, BigQuery was enabled as a data warehouse that can be addressed with SQL queries. For parallelization of the data wrangling and EDA, the Python library dask was used.
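As a rough illustration of this setup (a minimal sketch; file names, paths, and the example aggregation are assumptions, not the project's exact code):

```python
import pandas as pd
import dask.dataframe as dd

# Convert the raw Kaggle CSV to parquet once (pyarrow engine).
transactions = pd.read_csv("transactions_train.csv", parse_dates=["t_dat"])
transactions.to_parquet("transactions_train.parquet", engine="pyarrow")

# Read the parquet file lazily with dask for parallel wrangling/EDA.
# A gs://bucket/... path would work the same way once gcsfs is installed.
ddf = dd.read_parquet("transactions_train.parquet", engine="pyarrow")
revenue_per_article = ddf.groupby("article_id")["price"].sum().compute()
```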
See also the following notebooks for more EDA:
Four different algorithms plus a baseline model were applied. This chapter introduces the algorithms and gives an overview of the notebooks used.
This approach tries to find similarities between articles solely based on their features. It recommends articles that are similar to what a user likes (based on actions or explicit feedback).
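A minimal sketch of this idea (purely illustrative, not taken from the project's notebooks; the feature selection and the sampling are assumptions):

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# One-hot encode a few categorical article features; sample to keep the
# similarity matrix small (the full 105k x 105k matrix would not fit in memory).
articles = pd.read_csv("articles.csv").sample(5000, random_state=42).reset_index(drop=True)
features = pd.get_dummies(
    articles[["product_type_name", "colour_group_name", "index_group_name"]]
)

similarity = cosine_similarity(features)  # (n_articles x n_articles)

def similar_articles(article_idx: int, top_n: int = 12):
    """Indices of the top_n articles most similar to the given article."""
    scores = similarity[article_idx]
    return scores.argsort()[::-1][1 : top_n + 1]  # skip the article itself
```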
@ SINAN: Please list your notebooks and provide short descriptions (see market basket approach)
This approach filters articles based on past user-item interactions, which in our case means purchases. The main collaborative-filtering model in this project is an alternating least squares (ALS) model, i.e., a model that learns 'latent' user and item factors (1280 in the standard configuration) based on the utility matrix derived from the transactions data frame. Similar to the content-based approach, each item factor can be interpreted as an attribute of an item, and the corresponding user factor is then the preference of a user for that attribute. However, in this approach both sets of factors are learned from the data by alternating between optimizing the user factors and the item factors, based on how well they explain the recorded user-item interactions in the utility matrix.
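A minimal sketch of such an ALS model using the implicit library (an assumption for illustration; the repo's modeling/train_and_predict.py may be implemented differently, and the hyperparameters simply mirror the result file names shown further below):

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from implicit.als import AlternatingLeastSquares

transactions = pd.read_parquet("transactions_train.parquet")

# Utility matrix: rows = users, columns = items, entries = implicit feedback (purchases).
users = transactions["customer_id"].astype("category")
items = transactions["article_id"].astype("category")
utility = csr_matrix(
    (np.ones(len(transactions)), (users.cat.codes, items.cat.codes))
)

# Learn latent user and item factors by alternating least squares.
# Note: 1280 factors on ~1.4M users needs a lot of memory; implicit >= 0.5
# expects a user-item matrix in fit().
model = AlternatingLeastSquares(factors=1280, regularization=0.01, iterations=30)
model.fit(utility)

# Twelve recommendations for the first user; indices map back via the categories.
item_idx, scores = model.recommend(0, utility[0], N=12)
recommended_articles = items.cat.categories[item_idx]
```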
To run the ALS model just go to the root directory of the repo and enter:
```bash
python modeling/train_and_predict.py path/to/config.json
```
Note that the path to the configuration file is optional. If no file is provided the default config.json from the modeling/ directory will be used.
The model will be trained for all test weeks defined in the config file. See modeling/config.json for available options. For the default configuration these are the last 5 weeks of the data. Resulting predictions will be saved in the data/ directory as csv-files. The format is in line with the one required by the associated Kaggle challenge and can be used to submit on Kaggle without further processing. Note: training and prediction can take several hours.
To evaluate the MAP@k for the resulting predictions run:
```bash
python modeling/eval.py path/to/config.json
```
The path to the config file is optional and the default will be used if not provided. Use the same config file that was used in training and predicting. All test weeks will be scored according to the MAP@k (default for k is 12 as per Kaggle challenge). For reference here are the results for the standard configuration:
| submission_file | MAP_at_12 |
|---|---|
| data/results_ALS_1280_f_0.01_r_30_it_2020-08-26--2020-09-01.csv | 0.01371742561481911 |
| data/results_ALS_1280_f_0.01_r_30_it_2020-09-02--2020-09-08.csv | 0.01647618039373869 |
| data/results_ALS_1280_f_0.01_r_30_it_2020-09-09--2020-09-15.csv | 0.016722822849194158 |
| data/results_ALS_1280_f_0.01_r_30_it_2020-09-16--2020-09-22.csv | 0.01583545858718839 |
Note that the 5th (Kaggle) test week cannot be validated locally, because the data is not public. Submitting it to Kaggle yielded a MAP@12 of 0.01474, consistent with the above results.
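For reference, the metric itself can be sketched as follows (a minimal MAP@k implementation; the repo's modeling/eval.py may differ in details):

```python
def average_precision_at_k(actual, predicted, k=12):
    """AP@k for one customer: precision at every rank that scores a new hit."""
    if not actual:
        return 0.0
    hits, score = 0, 0.0
    for rank, item in enumerate(predicted[:k], start=1):
        if item in actual and item not in predicted[: rank - 1]:
            hits += 1
            score += hits / rank
    return score / min(len(actual), k)

def map_at_k(actual_lists, predicted_lists, k=12):
    """MAP@k: mean of AP@k over all customers."""
    aps = [average_precision_at_k(a, p, k) for a, p in zip(actual_lists, predicted_lists)]
    return sum(aps) / len(aps)
```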
This approach is more or less a bridge between NLP and recommender systems and can be explained as a kind of "auto-complete for purchase decisions".
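One way to make this idea concrete (purely an illustration, not necessarily the technique used in the missing notebooks) is to treat each customer's purchase history as a "sentence" of article IDs and train a word2vec-style embedding on it, so that articles bought in similar contexts end up close together:

```python
import pandas as pd
from gensim.models import Word2Vec

transactions = pd.read_parquet("transactions_train.parquet")

# Each customer's chronological purchase history becomes one "sentence" of article IDs.
sentences = (
    transactions.sort_values("t_dat")
    .groupby("customer_id")["article_id"]
    .apply(lambda s: s.astype(str).tolist())
    .tolist()
)

model = Word2Vec(sentences, vector_size=64, window=5, min_count=5, workers=4)

# "Auto-complete" for a purchase: articles most likely to appear in the same context.
print(model.wv.most_similar("706016001", topn=12))  # article ID is a placeholder
```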
@ MARTIN: Please list your notebooks and provide short descriptions (see market basket approach)
This approach mines for association rules between articles which often appear together in market baskets. Instead of using market baskets, we decided to apply the algorithm to "wardrobes". A wardrobe is defined as all articles an individual customer has purchased; in our case, all articles which appear in the transactions_train dataset and can be assigned to an individual customer ID.
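A minimal sketch of this mining step with apyori (support, confidence, and lift thresholds are illustrative placeholders, not the values used in the notebooks):

```python
import itertools
import pandas as pd
from apyori import apriori

transactions = pd.read_parquet("transactions_train.parquet")

# A "wardrobe" is the set of all articles an individual customer has ever purchased.
wardrobes = (
    transactions.groupby("customer_id")["article_id"]
    .apply(lambda s: [str(a) for a in s.unique()])
    .tolist()
)

# Mine association rules over the wardrobes.
rules = apriori(wardrobes, min_support=0.001, min_confidence=0.1, min_lift=1.5)
for rule in itertools.islice(rules, 5):
    stat = rule.ordered_statistics[0]
    print(set(stat.items_base), "->", set(stat.items_add), f"confidence={stat.confidence:.2f}")
```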
The following notebooks are used for this approach:
- Generate_wardrobe_df_CMM: This notebook helps to generate a dataframe and CSV which stores all wardrobes that can be collected from the transactions dataset.
- Marketbasket_Analysis_on_wardrobes_CMM: This notebook mines association rules from the given wardrobes based on the apyori library.
- WIP: Marketbasket_Analysis_on_baskets_CMM: This notebook mines association rules based on market baskets. This approach is work-in-progress; the notebook does not include final and applicable code.
- Marketbasket_Analysis_Build_recommendations_CMM: This notebook generates submission files (CSV) with 12 recommendations for each customer_id. Empty fields are filled up with the Top12 most sold articles. The CSV can either be uploaded to Kaggle or used to calculate results with the eval code or Monetary_value.
- Monetary_value: This notebook takes a submission CSV and checks which recommendations were actually purchased in the 4 test weeks. Based on this, the hit rate, the monetary value of the hits (based on article prices), and the share of predicted turnover are calculated (see the sketch below). Finally, the results are stored in a CSV.
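The hit-rate and monetary-value calculation could be sketched roughly as follows (the submission column names, the test-week file, and the exact definition of the hit rate are assumptions; the notebook may define them differently):

```python
import pandas as pd

# Submission in Kaggle format: one row per customer, 12 space-separated article IDs.
preds = pd.read_csv("submission.csv")  # assumed columns: customer_id, prediction
preds["prediction"] = preds["prediction"].str.split()
preds = preds.explode("prediction").rename(columns={"prediction": "article_id"})
preds["article_id"] = preds["article_id"].astype("int64")

# Actual purchases during the test weeks (hypothetical pre-filtered file).
actual = pd.read_parquet("transactions_test_weeks.parquet")

# A hit = a recommended article that the customer actually bought in the test weeks.
hits = actual.merge(preds, on=["customer_id", "article_id"], how="inner")
hit_rate = len(hits) / len(actual)                       # share of purchases that were predicted
monetary_value = hits["price"].sum()                     # turnover generated by the hits
turnover_share = monetary_value / actual["price"].sum()  # share of predicted turnover
print(hit_rate, monetary_value, turnover_share)
```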
While our original baseline model takes the Top12 most sold articles of the entire customer base, this approach calculates the Top12 most sold articles for specific age groups of customers. See Top12_clusters for the code.
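A minimal sketch of this age-group baseline (the age bins are illustrative assumptions; see Top12_clusters for the actual implementation):

```python
import pandas as pd

transactions = pd.read_parquet("transactions_train.parquet")
customers = pd.read_csv("customers.csv")[["customer_id", "age"]]

# Attach an (illustrative) age group to every transaction.
merged = transactions.merge(customers, on="customer_id", how="left")
merged["age_group"] = pd.cut(merged["age"], bins=[0, 19, 29, 39, 49, 59, 120])

# Top 12 most sold articles per age group.
top12_per_group = (
    merged.groupby(["age_group", "article_id"], observed=True)
    .size()
    .rename("n_sold")
    .reset_index()
    .sort_values(["age_group", "n_sold"], ascending=[True, False])
    .groupby("age_group", observed=True)
    .head(12)
)
```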
...tbd