## Ref

https://thenewstack.io/the-good-bad-and-ugly-apache-spark-for-data-science-work/?fbclid=IwAR2JdXmkinD2jUzUVNHd9uTvQ9ZjlkqB4ePw4Qmr3qfGiX8vAchZ2AuiUwQ

## Good

- Appealing APIs - Python, R, Scala, and Java, with a SQL-like DataFrame API
- Lazy Execution - helpful when you define a complex series of transformations
- Easy Conversion - toPandas() - do the aggregation in Spark, plot with pandas! (sketched below)
- Open Source Community - sparknlp, h2o, and so on
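
Both the lazy-execution and toPandas() points fit in a few lines. A minimal sketch, assuming a local SparkSession and a hypothetical `events.parquet` file with `country` and `revenue` columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

# Lazy: this only records the source; nothing is read yet.
df = spark.read.parquet("events.parquet")

# Transformations just build up a logical plan; no job runs here.
agg = (
    df.filter(F.col("revenue") > 0)
      .groupBy("country")
      .agg(F.sum("revenue").alias("total_revenue"))
)

# toPandas() is an action: Spark executes the whole plan and hands only
# the small aggregated result to pandas, which is ideal for plotting.
pdf = agg.toPandas()
pdf.plot.bar(x="country", y="total_revenue")
```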

## Bad

- Cluster Management - clusters are genuinely difficult to maintain, and OOM (out-of-memory) errors are very common.

- Debugging - OOM errors are opaque, and logging from inside a UDF is hard.

- Slowness of PySpark UDFs - every row must be serialized between Python and the JVM (see the sketch after this list).

- Hard-to-Guarantee Maximal Parallelism - the degree of parallelism is controlled by Spark, not by you.
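
To make the UDF point concrete, here is a minimal sketch (reusing the hypothetical `df` from the sketch above): the row-at-a-time Python UDF pushes every value through pickling between the JVM and a Python worker, while the equivalent built-in Column expression never leaves the JVM.

```python
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Slow path: every `revenue` value crosses the JVM <-> Python boundary.
add_tax = F.udf(lambda x: x * 1.1 if x is not None else None, DoubleType())
taxed_slow = df.withColumn("with_tax", add_tax("revenue"))

# Fast path: the same arithmetic as a native expression, no serialization.
taxed_fast = df.withColumn("with_tax", F.col("revenue") * 1.1)
```

(Since Spark 2.3, Arrow-based pandas UDFs narrow this gap for logic that genuinely needs Python.)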

## Ugly

- API Awkwardness - accessing array elements is surprisingly hard, yet a lot of Spark ML functions return arrays.

- Lack of Maturity and Feature Completeness - MLlib and ML

  - Random Forest did not expose feature importances in the new ML library until Spark 2.0
  - Gradient-Boosted Trees did not expose a probability score until Spark 2.2
  - Even when a model exposes a floating-point score, it is returned as an ArrayType, e.g. [0.25, 0.75]. Shockingly, there is no built-in function to extract that 0.75; it requires a UDF (sketched below). As a result, the article's authors found themselves falling back to training models locally with the more mature scikit-learn library.
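
A minimal sketch of that workaround, assuming a fitted binary classifier's output DataFrame `predictions` with the usual `probability` column:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# No built-in function can index into the model's probability output,
# so extracting the positive-class score (the 0.75) takes a UDF.
positive_prob = F.udf(lambda v: float(v[1]), DoubleType())
scored = predictions.withColumn("p_positive", positive_prob("probability"))
```

(Spark 3.0 later added pyspark.ml.functions.vector_to_array, which allows vector_to_array(F.col("probability"))[1] without a UDF.)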