This project focuses on processing and analyzing international football match data using big data technologies. The dataset covers historical match results, penalty shootout details, and goal scorers from 1872 to 2024.
This project was implemented in two different ways:
- Using Jupyter Notebook: The implementation is available in the `notebooks/` folder.
- Using Apache Airflow: The workflow is defined in the `scripts/` and `dags/` folders.
📝 Note: This project was executed twice: once without Airflow, using Jupyter Notebook in the `notebooks/` folder, and once with Airflow, in the `scripts/` and `dags/` folders.
- Apache Spark (Local Mode) - for distributed data processing
- Hadoop & HDFS - for storage and data management
- Jupyter Notebook - for interactive data exploration and processing
- Google Cloud Storage (GCS) - for cloud-based storage
- BigQuery - for querying large datasets efficiently
- Looker - for data visualization and reporting
- Apache Airflow - for workflow automation and orchestration
The dataset contains three main CSV files:
This file contains historical international football match results, including:
- `date`: Match date
- `home_team`: Name of the home team
- `away_team`: Name of the away team
- `home_score`: Home team score
- `away_score`: Away team score
- `tournament`: Tournament name
- `city`: City where the match was played
- `country`: Country where the match was played
- `neutral`: Whether the match was played at a neutral venue
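As a sketch of the typed structure these columns imply (the rows and file contents below are invented for illustration; the project itself loads the data with Spark), each raw CSV record can be parsed into native Python types before further processing:

```python
import csv
from datetime import date
from io import StringIO

# Invented sample rows matching the column layout described above.
SAMPLE = """date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,FALSE
2022-12-18,Argentina,France,3,3,FIFA World Cup,Lusail,Qatar,TRUE
"""

def parse_results(text):
    """Convert raw CSV strings into typed records: dates, ints, booleans."""
    rows = []
    for row in csv.DictReader(StringIO(text)):
        row["date"] = date.fromisoformat(row["date"])
        row["home_score"] = int(row["home_score"])
        row["away_score"] = int(row["away_score"])
        row["neutral"] = row["neutral"].upper() == "TRUE"
        rows.append(row)
    return rows

matches = parse_results(SAMPLE)
print(matches[1])  # second sample row, now with typed fields
```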
This file includes penalty shootout details, with columns:
- `date`: Match date
- `home_team`: Home team name
- `away_team`: Away team name
- `winner`: Winner of the shootout
- `first_shooter`: Team that shot first in the penalty shootout
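A match and its shootout share the same `date`, `home_team`, and `away_team` values, so the shootout file can be joined back onto the results on that composite key. A minimal sketch with invented records (the real join runs in Spark/BigQuery):

```python
# Invented sample records with the shootout columns described above.
shootouts = [
    {"date": "2022-12-18", "home_team": "Argentina", "away_team": "France",
     "winner": "Argentina", "first_shooter": "France"},
]
results = [
    {"date": "2022-12-18", "home_team": "Argentina", "away_team": "France",
     "home_score": 3, "away_score": 3},
]

# Index results by the composite key (date, home_team, away_team).
by_key = {(r["date"], r["home_team"], r["away_team"]): r for r in results}

# Attach the shootout winner to the matching result row, if any.
for s in shootouts:
    match = by_key.get((s["date"], s["home_team"], s["away_team"]))
    if match is not None:
        match["shootout_winner"] = s["winner"]

print(results[0]["shootout_winner"])
```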
This file contains details of goals scored in international matches, including:
- `date`: Match date
- `home_team`: Home team name
- `away_team`: Away team name
- `team`: Team that scored the goal
- `scorer`: Player who scored the goal
- `minute`: Minute in which the goal was scored
- `own_goal`: Whether it was an own goal
- `penalty`: Whether the goal was a penalty
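These columns support per-player aggregations such as goal tallies. A small sketch over invented sample rows (the project computes such aggregates with Spark or BigQuery):

```python
from collections import Counter

# Invented sample rows with the goal-scorer columns described above.
goals = [
    {"date": "2022-12-18", "home_team": "Argentina", "away_team": "France",
     "team": "Argentina", "scorer": "Lionel Messi", "minute": 23,
     "own_goal": False, "penalty": True},
    {"date": "2022-12-18", "home_team": "Argentina", "away_team": "France",
     "team": "France", "scorer": "Kylian Mbappé", "minute": 80,
     "own_goal": False, "penalty": True},
    {"date": "2022-12-18", "home_team": "Argentina", "away_team": "France",
     "team": "France", "scorer": "Kylian Mbappé", "minute": 81,
     "own_goal": False, "penalty": False},
]

# Tally goals per player, excluding own goals.
tally = Counter(g["scorer"] for g in goals if not g["own_goal"])
print(tally.most_common(1))
```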
- Data Cleaning & Transformation: Load, clean, and transform the data into a structured format.
- Schema Design: Design a star schema with fact and dimension tables.
- ETL Pipelines: Build pipelines to process, store, and query the data efficiently.
- Big Data Processing: Utilize Spark and Hadoop for scalable processing.
- Cloud Integration: Store and query data using GCS and BigQuery.
- Data Visualization: Create interactive dashboards using Looker.
- Workflow Automation: Orchestrate ETL processes with Apache Airflow.
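The star-schema step above can be sketched as follows. Table and column names here (`dim_team`, `fact_match`, `*_key`) are illustrative, not the project's actual schema: a team dimension assigns each distinct team a surrogate key, and the match fact table references those keys instead of raw names.

```python
# Invented sample match rows in the shape of the results file.
matches = [
    {"date": "1872-11-30", "home_team": "Scotland", "away_team": "England",
     "home_score": 0, "away_score": 0},
    {"date": "1873-03-08", "home_team": "England", "away_team": "Scotland",
     "home_score": 4, "away_score": 2},
]

# Build dim_team: one surrogate key per distinct team name.
team_names = sorted({m["home_team"] for m in matches} |
                    {m["away_team"] for m in matches})
dim_team = {name: key for key, name in enumerate(team_names, start=1)}

# Build fact_match: replace team names with foreign keys into dim_team.
fact_match = [
    {"date": m["date"],
     "home_team_key": dim_team[m["home_team"]],
     "away_team_key": dim_team[m["away_team"]],
     "home_score": m["home_score"],
     "away_score": m["away_score"]}
    for m in matches
]
print(fact_match[0])
```

In the pipeline itself this normalization would be done with Spark DataFrames before loading the tables into BigQuery, but the key idea (surrogate keys in dimensions, foreign keys in the fact table) is the same.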