= Apache Spark =

[[https://spark.apache.org/docs/2.4.8/index.html|Spark 2.4.8]] is a '''MapReduce-like cluster computing framework designed for low-latency iterative jobs and interactive use from an interpreter'''. It provides clean, language-integrated APIs in Scala and Java, with a rich array of parallel operators. Spark can run on top of the Apache Mesos cluster manager, Hadoop YARN, Amazon EC2, or without an independent resource manager (“standalone mode”).

[[https://spark.apache.org/docs/3.1.2/index.html|Spark 3.1.2]] Apache Spark is a '''unified analytics engine for large-scale data processing'''. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.

== Spark components ==

* Driver program
* Cluster manager
* Worker node
* [[https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.RDD.html#pyspark-rdd|Resilient distributed dataset (RDD)]]
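The key idea behind the RDD listed above is that transformations (such as <code>map</code> and <code>filter</code>) are lazy: they only record a lineage of operations, and nothing runs until an action (such as <code>collect</code>) is called. The following is a minimal plain-Python sketch of that execution model; the <code>MiniRDD</code> class is a hypothetical illustration, not the real <code>pyspark.RDD</code> API, and it ignores partitioning and distribution entirely.

```python
# Sketch of the RDD execution model (illustration only, not the real
# pyspark API): transformations are lazy and merely extend a lineage;
# an action replays the lineage over the underlying data.

class MiniRDD:
    def __init__(self, data, transforms=None):
        self._data = data
        self._transforms = transforms or []  # recorded lineage, not yet run

    def map(self, f):
        # Transformation: returns a new RDD, computes nothing yet.
        return MiniRDD(self._data, self._transforms + [("map", f)])

    def filter(self, p):
        # Transformation: also lazy.
        return MiniRDD(self._data, self._transforms + [("filter", p)])

    def collect(self):
        # Action: only here is the lineage actually evaluated.
        out = list(self._data)
        for kind, fn in self._transforms:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

rdd = MiniRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

In real Spark the same shape appears as <code>sc.parallelize(range(10)).map(...).filter(...).collect()</code>, with the driver program building the lineage and the worker nodes executing it when the action fires.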