- Appealing APIs - Python, R, Scala, and Java, with a SQL-like DataFrame API
- Lazy Execution - helpful when you define a complex series of transformations, since nothing runs until an action is called
- Easy Conversion - toPandas(): do the aggregation in Spark, plot with pandas!
- Open Source Community - sparknlp, h2o, and so on
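The lazy-execution and "aggregate in Spark, plot with pandas" points above can be sketched as follows. This is a minimal illustration, not a prescribed pattern: the column names `category` and `amount`, the function names, and the toy data are all hypothetical. The chained calls only build a plan, and `toPandas()` is the action that triggers execution and brings the small aggregated result to the driver.

```python
def top_categories(df, n=10):
    """Build a lazy Spark plan, then collect only the small result to pandas.

    `df` is assumed to be a Spark DataFrame with hypothetical columns
    `category` and `amount`.
    """
    plan = (
        df.filter("amount > 0")          # transformation: nothing runs yet
        .groupBy("category")
        .sum("amount")                   # result column is named "sum(amount)"
        .withColumnRenamed("sum(amount)", "total")
        .orderBy("total", ascending=False)
        .limit(n)
    )
    # toPandas() is the action: Spark executes the whole plan here, and only
    # the aggregated rows (at most n) are sent to the driver.
    return plan.toPandas()


def demo():
    """Run the sketch on toy data; needs pyspark, pandas, and a Java runtime."""
    # Imported lazily so this file can be loaded without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("lazy-demo").getOrCreate()
    df = spark.createDataFrame(
        [("a", 1.0), ("a", 2.0), ("b", 5.0), ("c", -1.0)],
        ["category", "amount"],
    )
    pdf = top_categories(df, n=2)
    # pdf is a pandas DataFrame; pdf.plot.bar(x="category", y="total")
    # would plot it if matplotlib is available.
    print(pdf)
    spark.stop()
```

Calling `demo()` runs the sketch end to end. The heavy work (filter, group, sort) stays in Spark, and pandas only sees a handful of rows.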
- Cluster Management - really difficult to maintain; OOM errors are very common.
- Debugging - OOM errors; logging inside a UDF is hard.
- Slowness of PySpark UDFs - Python objects must be serialized to and from the JVM.
- Hard-to-Guarantee Maximal Parallelism - it is controlled by Spark.
- API Awkwardness - accessing array elements is very hard, and a lot of Spark ML functions return arrays.
- Lack of Maturity and Feature Completeness - MLlib and ML
- Random Forest did not have feature importance in the new ML library until Spark 2.0
- Gradient Boosted Trees did not expose a probability score until Spark 2.2
- Instead of exposing a floating-point score, models return an ArrayType such as [0.25, 0.75]. Shockingly, there is no built-in function to extract that 0.75; it requires a UDF. As a result, we found ourselves falling back to training models locally using the more mature scikit-learn library.
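The UDF mentioned in the last point might look like the sketch below. The names `positive_class_probability`, `with_positive_probability`, and the output column `p1` are hypothetical. The input column name `probability` is assumed to be the array-like score column produced by the model.

```python
def positive_class_probability(probability):
    """Return element 1 of an array-like probability value as a Python float."""
    # Spark ML vectors support integer indexing but yield a NumPy scalar;
    # float() converts it to a type that DoubleType accepts.
    return float(probability[1])


def with_positive_probability(df, input_col="probability", output_col="p1"):
    """Add a double column holding the second element of `input_col`.

    `df` is assumed to be a Spark DataFrame, e.g. the output of
    model.transform(...). Column names here are assumptions.
    """
    # Imported lazily so the helper above can be used without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    extract = udf(positive_class_probability, DoubleType())
    return df.withColumn(output_col, extract(df[input_col]))
```

Because this is a Python UDF, it pays the serialization cost noted earlier. On Spark 3.0 and later, `pyspark.ml.functions.vector_to_array` may be a cheaper alternative for vector columns, since it avoids the Python round trip.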