Exercises in Python/SQL: semester project for the Advanced Topics in Database Systems course at ECE⚡, NTUA🎓, academic year 2021-2022.
The dataset used for this project is the Full MovieLens Dataset.
The project consists of two main parts:
- Implement and test the 5 requested queries using the RDD API and Spark SQL
- Analyze the performance of the Reduce-Side join and Map-Side join implementations
Details:
- We used 3 VMs for our cluster (1 NameNode, 2 DataNodes)
- Dataset formats used: CSV, DataFrame, Parquet
- Get familiar with the Spark API
- Evaluate performance for a list of queries
- Compare different join algorithms in Spark Map-Reduce
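
The difference between the two join strategies compared in this project can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions, not the project's implementation: the `movies` and `ratings` lists are hypothetical toy data, and the function names `reduce_side_join` and `map_side_join` are made up for this sketch. A reduce-side (repartition) join groups records from both inputs by key before pairing them, while a map-side (broadcast) join keeps the small input in memory and probes it while streaming the large one.

```python
from collections import defaultdict

# Hypothetical toy data standing in for MovieLens-style tables.
movies = [(1, "Toy Story"), (2, "Jumanji"), (3, "Heat")]  # (movie_id, title)
ratings = [(1, 4.0), (1, 5.0), (2, 3.5), (4, 2.0)]        # (movie_id, rating)


def reduce_side_join(left, right):
    """Repartition join: group records from both inputs by key
    (the 'shuffle'), then pair left and right values inside each group."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    joined = []
    for key in sorted(groups):
        left_vals, right_vals = groups[key]
        # Keys present on only one side produce no output (inner join).
        for lv in left_vals:
            for rv in right_vals:
                joined.append((key, lv, rv))
    return joined


def map_side_join(small, large):
    """Broadcast join: build an in-memory lookup from the small input
    and probe it while streaming the large input; no grouping step."""
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    return [(key, sv, lv) for key, lv in large for sv in lookup.get(key, [])]


if __name__ == "__main__":
    print(reduce_side_join(movies, ratings))
    print(map_side_join(movies, ratings))
```

Both functions return the same inner-join result on this data. In a cluster, the practical difference is that the reduce-side variant has to move both inputs across the network to group them by key, while the map-side variant only works when the small input fits in each worker's memory.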
The project's assignment and report are written in Greek.

Name | GitHub |
---|---|
Stylianos Kandylakis | |
Kitsos Orfanopoulos | |
Christos Tsoufis | |

OS | CPUs | RAM | Disk space |
---|---|---|---|
Ubuntu 16.04 LTS (Xenial) | 2 | 2GB | 30GB |