- Don’t collect large RDDs
- Don't use count() when you don't need to return the exact number of rows
- Avoiding Shuffle: "Fewer stages, run faster"
- Picking the Right Operators
- Avoid List of Iterators
- Avoid groupByKey when performing a group of multiple items by key
- Avoid groupByKey when performing an associative reductive operation
- Avoid reduceByKey when the input and output value types are different
- Avoid the flatMap-join-groupBy pattern
- Use treeReduce/treeAggregate instead of reduce/aggregate
- Hash-partition before transformations over pair RDDs
- Use coalesce to decrease the number of partitions
- TreeReduce and TreeAggregate Demystified
- When to use broadcast variables
- Joining a large and a small RDD
- Joining a large and a medium-sized RDD