Apache Spark
Spark 2.4.8: Spark is a MapReduce-like cluster computing framework designed for low-latency iterative jobs and interactive use from an interpreter. It provides clean, language-integrated APIs in Scala and Java, with a rich array of parallel operators. Spark can run on top of the Apache Mesos cluster manager, Hadoop YARN, Amazon EC2, or without an independent resource manager ("standalone mode").
Spark 3.1.2: Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.
Spark components:
- Driver program
- Cluster manager
- Worker node
https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.RDD.html#pyspark-rdd
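The RDD API linked above is built around lazy transformations (such as `map` and `filter`), which only record work to be done, and actions (such as `collect` and `reduce`), which trigger execution. The toy class below is a minimal single-process sketch of that model, not real Spark; the name `ToyRDD` and its internals are illustrative assumptions:

```python
from functools import reduce as _reduce

class ToyRDD:
    """Toy, single-process stand-in for a Spark RDD (illustration only).

    Transformations (map, filter) are recorded lazily in a pipeline;
    actions (collect, reduce) run the whole pipeline over the data.
    """

    def __init__(self, data, ops=None):
        self._data = list(data)
        self._ops = ops or []  # deferred transformation pipeline

    def map(self, f):
        # Returns a new ToyRDD; nothing is computed yet.
        return ToyRDD(self._data, self._ops + [("map", f)])

    def filter(self, f):
        return ToyRDD(self._data, self._ops + [("filter", f)])

    def collect(self):
        # Action: apply the recorded transformations in order.
        out = self._data
        for kind, f in self._ops:
            if kind == "map":
                out = [f(x) for x in out]
            else:  # "filter"
                out = [x for x in out if f(x)]
        return out

    def reduce(self, f):
        return _reduce(f, self.collect())

if __name__ == "__main__":
    rdd = ToyRDD(range(1, 6))
    evens_squared = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
    print(evens_squared.collect())                    # [4, 16]
    print(evens_squared.reduce(lambda a, b: a + b))   # 20
```

In real Spark the same chaining style applies, but the data is partitioned across worker nodes and the driver program only ships the functions; here everything runs locally to keep the lazy-pipeline idea visible.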