Apache Spark is a lightning-fast unified analytics engine for big data and machine learning.
Apache Spark is an open-source distributed general-purpose cluster computing framework with an in-memory data processing engine that can do
Machine learning &
Graph processing on large volumes of data
at rest (batch processing) or in motion (streaming processing) with rich concise high-level APIs for the programming languages: Scala, Python, Java, R, and SQL.
Highlights of Apache Spark
|You could also describe Spark as a distributed, data processing engine for batch and streaming modes featuring SQL queries, graph processing, and machine learning.|
|In contrast to Hadoop’s two-stage disk-based MapReduce computation engine, Spark’s multi-stage in-memory computing engine allows for running most computations in memory.
Hence provides better performance for certain applications, e.g. iterative algorithms or interactive data mining.
|Aims at speed, ease of use, extensibility and interactive analytics.|
|Often called cluster computing engine or simply execution engine.|
|Distributed platform for executing complex multi-stage applications, like machine learning algorithms, and interactive ad hoc queries.|
|Provides an efficient abstraction for in-memory cluster computing called Resilient Distributed Dataset.|
|Using Spark Application Frameworks, Spark simplifies access to machine learning and predictive analytics at scale.|
|Mainly written in Scala, but provides developer API for languages like Java, Python, and R.|
|Access any data type across any data source.|
|Huge demand for storage and data processing.|
|The Apache Spark project is an umbrella for SQL (with Datasets), streaming, machine learning (pipelines) and graph processing engines built on top of the Spark Core.
You can run them all in a single application using a consistent API.
|Spark runs locally as well as in clusters, on-premises, or in the cloud. It runs on top of Hadoop YARN, Apache Mesos, standalone or in the cloud (Amazon EC2 or IBM Bluemix).|
|Apache Spark’s Structured Streaming and SQL programming models with MLlib and GraphX make it easier for developers and data scientists to build applications that exploit machine learning and graph analytics.|
Apache Spark | The Good
It does in-memory, distributed and iterative computation, which is particularly useful when working with machine learning algorithms. Other tools might require writing intermediate results to disk and reading them back into memory, which can make using iterative algorithms painfully slow.
Appealing APIs and Lazy Execution
Spark’s API is truly appealing. Users can choose from multiple languages: Python, R, Scala, and Java. Spark offers a data frame abstraction with object-oriented methods for transformations, joins, filters, and more. This object orientation makes it easy to create custom reusable code that is also testable with mature testing frameworks.
“Lazy execution” is especially helpful as it allows you to define a complex series of transformations represented as an object. Further, you can inspect the structure of the end result without even executing the individual intermediate steps.
And Spark checks for errors in the execution plan before submitting so that bad code fails fast.
PySpark offers a “toPandas()” method to seamlessly convert Spark DataFrames to Pandas, and its “SparkSession.createDataFrame()” can do the reverse.
“toPandas()” method allows you to work in-memory once Spark has crunched the data into smaller datasets. When combined with Pandas’ plotting method, you can chain together commands to join your large datasets, filter, aggregate and plot all in one command.
Python is a language that enables rapid operationalization of data and the PySpark package extends this functionality to massive datasets.
Pivoting is a challenge for many big data frameworks. In SQL, it typically requires many case statements. Spark has an easy and intuitive way of pivoting a DataFrame. The user simply performs a “groupBy” on the target index columns, a pivot of the target field to use as columns, and finally an aggregation step. It executes surprisingly fast and is also easy to use.
Another asset of Spark is the “map-side join” broadcast method. This method speeds up joins significantly when one of the tables is smaller than the other and can fit in its entirety on individual machines.
The smaller one gets sent to all nodes so the data from the bigger table doesn’t need to be moved around. This also helps mitigate problems from skew. If the big table has a lot of skew on the join keys, it will try to send a large amount of data from the big table to a small number of nodes to perform the join and overwhelm those nodes.
Open Source Community
Spark has a massive open-source community behind it. The community improves the core software and contributes practical add-on packages.
For example, a team has developed a natural language processing library for Spark. Previously, a user would either have to use other software or rely on slow user-defined functions to leverage Python packages such as Natural Language Toolkit.
Apache Spark | The Bad
Apache Spark is notoriously difficult to tune and maintain.
If your cluster isn’t expertly managed, this can negate “the Good” as we described above. Jobs failing with out-of-memory errors is very common and having many concurrent users makes resource management even more challenging.
Do you go with fixed or dynamic memory allocation? How many of your cluster’s cores do you allow Spark to use?
How much memory does each executor get? How many partitions should Spark use when it shuffles data?
Getting all these settings right for data science workloads is difficult.
Debugging Spark can be frustrating. The client-side type checking for “DataFrame” operations in PySpark can catch some bugs. But memory errors and errors occurring within user-defined functions can be difficult to track down.
Distributed systems are inherently complex, and so it goes for Spark. Error messages can be misleading or suppressed. Logging from a PySpark User Defined Function (UDF) is difficult and introspection into current processes is not feasible.
Creating tests for your UDFs that run locally helps, but sometimes a function that passes local tests fail when running on the cluster. Figuring out the cause in those cases is challenging.
The slowness of PySpark UDFs
PySpark UDFs are much slower and more memory-intensive than Scala and Java UDFs are. The performance skew towards Scala and Java is understandable, since Spark is written in Scala and runs on the Java Virtual Machine (JVM).
Python UDFs require moving data from the executor’s JVM to a Python interpreter, which is slow. If Python UDF performance is problematic, Spark does enable a user to create Scala UDFs, which can be run in Python. However, this slows down development time.
Hard-to-Guarantee Maximal Parallelism
One of Spark’s key value propositions is distributed computation, yet it can be difficult to ensure Spark parallelizes computations as much as possible. Spark tries to elastically scale how many executors a job uses based on the job’s needs, but it often fails to scale up on its own.
So if you set the minimum number of executors too low, your job may not utilize more executors when it needs them.
Spark divides RDDs (Resilient Distributed Dataset)/DataFrames into partitions, which is the smallest unit of work that an executor takes on. If you set too few partitions, then there may not be enough chunks of work for all the executors to work on. Also, fewer partitions means larger partitions, which can cause executors to run out of memory.