Polish up your data processing skills using PySpark!
Check here to install Spark 3.0+.
This repo contains 50+ example scripts and 100+ minimal PySpark processing examples so far.
The tutorial is from spark-examples/pyspark-examples.
The notebook is a cheatsheet containing 60+ problems and their PySpark solutions.
| Content ID | Date | Content | Note | 
|---|---|---|---|
| 001 | 1/11 | hello_world | |
| 002 | 1/12 | create_spark_session | |
| 003 | 1/12 | accumulator | |
| 004 | 1/13 | RDD creation | |
| 005 | 1/13 | RDD parallelization: repartition() vs coalesce() | |
| 006 | 1/18 | RDD operations - transformations (from 006 - 0064) | |
| 007 | 2/8 | cluster managers | |
| 008 | 2/22 | spark UI | |
| 009 | 2/23 | RDD shuffle | |
| 009 | 2/23 | RDD persist | |
| 010 | 3/9 | Broadcasting | |

| Content ID | Date | Content | Note | 
|---|---|---|---|
| d001 | 1/18 | create_dataframe (from d001 - d0012) | |
| d0011 | 1/18 | create_dataframe_csv | |
| d0012 | 1/18 | create_dataframe_json | |
| d002 | 1/18 | create_empty_dataframe | |
| d003 | 1/18 | spark_frame_to_pandas_frame | |
| d004 | 1/20 | StructType/StructField (from d004 - d0042) | |
| d005 | 1/20 | Row object d005 | |
| d006 | 1/20 | select column from dataframe | |
| d007 | 1/26 | retreve_data_from_dataframe | |
| d008 | 1/26 | add, update, drop column in a dataframe | |
| d009 | 1/27 | filter rows | |
| d010 | 1/27 | filter null | |
| d011 | 1/27 | drop_na | |
| d012 | 1/27 | drop_duplicated | |
| d013 | 1/27 | sorting | |
| d014 | 2/8 | groupby, pivot (from d014 - d0141) | |
| d015 | 2/8 | join | |
| d016 | 2/8 | union | |
| d017 | 2/9 | udf | |
| d018 | 2/9 | flatmap | |
| d019 | 2/9 | map | |
| d020 | 2/13 | sampling | |
| d021 | 2/13 | aggregation | |
| d022 | 2/13 | add_month | |
| d023 | 2/13 | split | |
| d024 | 2/23 | regular expression on pyspark dataframe | |
| d025 | 3/1 | extract img src tags from HTML with pyspark | |

| Content ID | Date | Content | Note | 
|---|---|---|---|
| p001 | 2/13 | spark-df-profiling | setup doc on pkg/p001 | 
| p002 | 5/20 | graphframes | |

| Content ID | Date | Content | Note | 
|---|---|---|---|
| 001 | 1/21 | MapReduce | |
| 002 | 1/26 | Introduction to Spark(I) - rdd ops, shuffle and stage | revisited 4/13 | 
| 003 | 2/14 | Apache Parquet 2.0 | |
| 004 | 2/16 | Introduction to Parquet | |
| 005 | 4/13 | Introduction to Spark(II) - Driver, Executor, Application, ... | |
| 006 | 4/27 | spark join I | |
| 007 | 4/27 | spark join II | |
| 008 | | detect data skew in Spark UI | |
| 009 | 7/21 | Spark OOM | |

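
The MapReduce model from the first entry above can be sketched in plain Python, with no Spark involved. This is an illustrative word count; the function names `map_phase`, `shuffle`, and `reduce_phase` and the input lines are made up for the example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: fold the grouped values for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


# Hypothetical input, standing in for lines read from a file.
lines = ["spark makes big data simple", "big data needs spark"]

mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())

print(counts)
```

In Spark the same three steps roughly correspond to `flatMap`/`map`, the shuffle triggered by a wide transformation, and `reduceByKey`.
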
- rdd
- repartition/coalesce
- map-reduce
- yarn
- mesos
- parquet

- 2017 - Optimizing Apache Spark SQL Joins: Spark Summit East talk by Vida Ha
- 2019 - Optimizing Apache Spark SQL at LinkedIn

 
| Content ID | Date | Content | Note | 
|---|---|---|---|
| 001 | 5/20 | why graph? why spark | |

- spark-examples/pyspark-examples
- Spark Python API documentation 3.0.1
- Learning Apache Spark with Python
- 2017 - Optimizing Apache Spark SQL Joins: Spark Summit East talk by Vida Ha