This project focuses on processing and analyzing international football match data using big data technologies. The dataset covers historical match results, penalty shootout details, and goal scorers from 1872 to 2024.
This project was implemented in two different ways:
- Using Jupyter Notebook: The implementation is available in the `notebooks/` folder.
- Using Apache Airflow: The workflow is defined in the `scripts/` and `dags/` folders.
📝 Note: This project was executed twice: once without Airflow, using Jupyter Notebook in the `notebooks/` folder, and once with Airflow, in the `scripts/` and `dags/` folders.
- Apache Spark (Local Mode) - for distributed data processing
- Hadoop & HDFS - for storage and data management
- Jupyter Notebook - for interactive data exploration and processing
- Google Cloud Storage (GCS) - for cloud-based storage
- BigQuery - for querying large datasets efficiently
- Looker - for data visualization and reporting
- Apache Airflow - for workflow automation and orchestration
The dataset contains three main CSV files:
This file contains historical international football match results, including:
- `date`: Match date
- `home_team`: Name of the home team
- `away_team`: Name of the away team
- `home_score`: Home team score
- `away_score`: Away team score
- `tournament`: Tournament name
- `city`: City where the match was played
- `country`: Country where the match was played
- `neutral`: Whether the match was played at a neutral venue
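As a sketch of the typed structure these columns imply (the rows and file contents below are invented for illustration; the project itself loads the data with Spark), each raw CSV record can be parsed into native Python types before further processing:

```python
import csv
from datetime import date
from io import StringIO

# Invented sample rows matching the column layout described above.
SAMPLE = """date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,FALSE
2022-12-18,Argentina,France,3,3,FIFA World Cup,Lusail,Qatar,TRUE
"""

def parse_results(text):
    """Convert raw CSV strings into typed records: dates, ints, booleans."""
    rows = []
    for row in csv.DictReader(StringIO(text)):
        row["date"] = date.fromisoformat(row["date"])
        row["home_score"] = int(row["home_score"])
        row["away_score"] = int(row["away_score"])
        row["neutral"] = row["neutral"].upper() == "TRUE"
        rows.append(row)
    return rows

matches = parse_results(SAMPLE)
print(matches[1])  # second sample row, now with typed fields
```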
This file includes penalty shootout details, with columns:
- `date`: Match date
- `home_team`: Home team name
- `away_team`: Away team name
- `winner`: Winner of the shootout
- `first_shooter`: Team that shot first in the penalty shootout
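A match and its shootout share the same `date`, `home_team`, and `away_team` values, so the shootout file can be joined back onto the results on that composite key. A minimal sketch with invented records (the real join runs in Spark/BigQuery):

```python
# Invented sample records with the shootout columns described above.
shootouts = [
    {"date": "2022-12-18", "home_team": "Argentina", "away_team": "France",
     "winner": "Argentina", "first_shooter": "France"},
]
results = [
    {"date": "2022-12-18", "home_team": "Argentina", "away_team": "France",
     "home_score": 3, "away_score": 3},
]

# Index results by the composite key (date, home_team, away_team).
by_key = {(r["date"], r["home_team"], r["away_team"]): r for r in results}

# Attach the shootout winner to the matching result row, if any.
for s in shootouts:
    match = by_key.get((s["date"], s["home_team"], s["away_team"]))
    if match is not None:
        match["shootout_winner"] = s["winner"]

print(results[0]["shootout_winner"])
```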
This file contains details of goals scored in international matches, including:
- `date`: Match date
- `home_team`: Home team name
- `away_team`: Away team name
- `team`: Team that scored the goal
- `scorer`: Player who scored the goal
- `minute`: Minute in which the goal was scored
- `own_goal`: Whether it was an own goal
- `penalty`: Whether the goal was a penalty
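These columns support per-player aggregations such as goal tallies. A small sketch over invented sample rows (the project computes such aggregates with Spark or BigQuery):

```python
from collections import Counter

# Invented sample rows with the goal-scorer columns described above.
goals = [
    {"date": "2022-12-18", "home_team": "Argentina", "away_team": "France",
     "team": "Argentina", "scorer": "Lionel Messi", "minute": 23,
     "own_goal": False, "penalty": True},
    {"date": "2022-12-18", "home_team": "Argentina", "away_team": "France",
     "team": "France", "scorer": "Kylian Mbappé", "minute": 80,
     "own_goal": False, "penalty": True},
    {"date": "2022-12-18", "home_team": "Argentina", "away_team": "France",
     "team": "France", "scorer": "Kylian Mbappé", "minute": 81,
     "own_goal": False, "penalty": False},
]

# Tally goals per player, excluding own goals.
tally = Counter(g["scorer"] for g in goals if not g["own_goal"])
print(tally.most_common(1))
```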
- Data Cleaning & Transformation: Load, clean, and transform the data into a structured format.
- Schema Design: Design a star schema with fact and dimension tables.
- ETL Pipelines: Build pipelines to process, store, and query the data efficiently.
- Big Data Processing: Utilize Spark and Hadoop for scalable processing.
- Cloud Integration: Store and query data using GCS and BigQuery.
- Data Visualization: Create interactive dashboards using Looker.
- Workflow Automation: Orchestrate ETL processes with Apache Airflow.
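The star-schema step above can be sketched as follows. Table and column names here (`dim_team`, `fact_match`, `*_key`) are illustrative, not the project's actual schema: a team dimension assigns each distinct team a surrogate key, and the match fact table references those keys instead of raw names.

```python
# Invented sample match rows in the shape of the results file.
matches = [
    {"date": "1872-11-30", "home_team": "Scotland", "away_team": "England",
     "home_score": 0, "away_score": 0},
    {"date": "1873-03-08", "home_team": "England", "away_team": "Scotland",
     "home_score": 4, "away_score": 2},
]

# Build dim_team: one surrogate key per distinct team name.
team_names = sorted({m["home_team"] for m in matches} |
                    {m["away_team"] for m in matches})
dim_team = {name: key for key, name in enumerate(team_names, start=1)}

# Build fact_match: replace team names with foreign keys into dim_team.
fact_match = [
    {"date": m["date"],
     "home_team_key": dim_team[m["home_team"]],
     "away_team_key": dim_team[m["away_team"]],
     "home_score": m["home_score"],
     "away_score": m["away_score"]}
    for m in matches
]
print(fact_match[0])
```

In the pipeline itself this normalization would be done with Spark DataFrames before loading the tables into BigQuery, but the key idea (surrogate keys in dimensions, foreign keys in the fact table) is the same.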