Apache Spark

Apache Spark is a unified analytics engine for large-scale data processing. Matei Zaharia started it at UC Berkeley's AMPLab in 2009; it was open-sourced in 2010 and became an Apache top-level project in 2014. Spark's core innovation is in-memory computation: by keeping intermediate results in RAM rather than writing them to disk between steps, it avoids re-reading and re-computing the same data on every pass. This matters most for iterative workloads such as machine learning and interactive queries, where the project has historically cited speedups of up to 100x over Hadoop MapReduce.
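The effect of keeping intermediate results in memory can be sketched with a toy model. The code below is plain Python, not the Spark API: `CountingSource`, `parse`, and both `iterate_*` functions are hypothetical names invented for illustration. It only shows why an iterative job that re-reads its input on every pass does more work than one that materializes the parsed data once.

```python
"""Toy model of in-memory reuse for iterative jobs (not the Spark API)."""


class CountingSource:
    """Stands in for a dataset on disk and counts full scans of it."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def scan(self):
        self.reads += 1
        return list(self._records)


def parse(records):
    """An intermediate step, e.g. parsing raw text into numbers."""
    return [float(r) for r in records]


def iterate_without_cache(source, iterations):
    """Every pass re-reads and re-parses the input, like a chain of
    jobs that exchange data through disk between steps."""
    estimate = 0.0
    for _ in range(iterations):
        values = parse(source.scan())
        estimate = 0.5 * estimate + 0.5 * (sum(values) / len(values))
    return estimate


def iterate_with_cache(source, iterations):
    """The parsed data is computed once and kept in memory across passes."""
    values = parse(source.scan())  # materialized once, reused below
    estimate = 0.0
    for _ in range(iterations):
        estimate = 0.5 * estimate + 0.5 * (sum(values) / len(values))
    return estimate


if __name__ == "__main__":
    uncached = CountingSource(["1", "2", "3"])
    cached = CountingSource(["1", "2", "3"])
    iterate_without_cache(uncached, 10)
    iterate_with_cache(cached, 10)
    print("scans without cache:", uncached.reads)
    print("scans with cache:", cached.reads)
```

Both functions return the same result; only the number of scans differs. In Spark itself, the analogous step is calling `cache()` or `persist()` on a DataFrame or RDD that is reused across several actions.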

The current release is Spark 4.0. The engine provides several integrated modules: Spark SQL for querying structured data with SQL and DataFrames, Structured Streaming for continuous data processing, MLlib for machine learning at scale, and GraphX for graph computation. Spark runs on Hadoop YARN, on Kubernetes, or in standalone mode; support for Apache Mesos was deprecated in Spark 3.2 and removed in 4.0. It provides APIs in Scala, Java, Python (PySpark), and R.

The official documentation covers programming guides, SQL references, and deployment. The source code is on GitHub under the Apache 2.0 license.

spark.apache.org

Related technologies

Apache Hadoop, Apache Kafka
