This repository will be the central location for the hands-on programming component of the course.
The goal of the course is to build an end-to-end data pipeline processing Amazon reviews.
You will build the pipeline one week at a time:
- Week 1 - Environment Setup - configure your environment for the programming coursework
- Week 2 - Spark SQL - write a Python Spark application to analyze local Amazon review data
- Week 3 - Write to Amazon S3 - extend the application to connect to Amazon S3 and write data there
- Week 4 - Kafka + Bronze layer - read from Kafka instead of the local file, and use Spark Structured Streaming to write the output to Amazon S3, creating the Bronze layer
- Week 5 - Silver layer - transform and enrich data from the Bronze layer, creating the Silver layer
- Week 6 - Gold layer - define a schema for the Silver layer, stream data from it, transform that data, and write the result as the Gold layer
- TODO: Week 7 BI
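
To make the Bronze, Silver, and Gold layers from Weeks 4 to 6 more concrete, here is a minimal plain-Python sketch of the idea. It is not the course solution and deliberately uses no Spark, Kafka, or S3. The sample records, the field names (`review_id`, `product_id`, `star_rating`), and the function names are all hypothetical and chosen only for illustration.

```python
import json
from collections import defaultdict

# Hypothetical raw review messages, standing in for what would arrive from Kafka.
RAW_LINES = [
    '{"review_id": "R1", "product_id": "P1", "star_rating": "5"}',
    '{"review_id": "R2", "product_id": "P1", "star_rating": "3"}',
    '{"review_id": "R3", "product_id": "P2", "star_rating": "4"}',
    'not valid json',
]


def to_bronze(lines):
    """Bronze: keep every record exactly as it was received."""
    return [{"raw": line} for line in lines]


def to_silver(bronze):
    """Silver: parse and type-cast records, skipping any that cannot be parsed."""
    silver = []
    for row in bronze:
        try:
            record = json.loads(row["raw"])
            record["star_rating"] = int(record["star_rating"])
        except (ValueError, KeyError, TypeError):
            continue  # malformed record stays in Bronze only
        silver.append(record)
    return silver


def to_gold(silver):
    """Gold: aggregate the cleaned records to one summary row per product."""
    ratings = defaultdict(list)
    for record in silver:
        ratings[record["product_id"]].append(record["star_rating"])
    return {
        product_id: {
            "review_count": len(values),
            "avg_rating": sum(values) / len(values),
        }
        for product_id, values in ratings.items()
    }


bronze = to_bronze(RAW_LINES)
silver = to_silver(bronze)
gold = to_gold(silver)
```

In this sketch each layer is derived only from the one before it, so the raw Bronze data is never modified and the later layers can be rebuilt from it if the transformation logic changes.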
Want to continue your learning in Data Engineering? Great -- check out these links:
- STL Big Data - Innovation, Data Engineering, Analytics Group: A meetup for users of Big Data services and tools in the Saint Louis area. We are interested in Innovation (new tools, techniques, and services), Data Engineering (architecture and design of data movement systems), and Analytics (converting information into meaning). (with Kit Menke and Matt Harris)
- Data Engineering Podcast: This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.