= Apache Spark =

[[https://spark.apache.org/docs/2.4.8/index.html|Spark 2.4.8]] is a '''MapReduce-like cluster computing framework designed for low-latency iterative jobs and interactive use from an interpreter'''. It provides clean, language-integrated APIs in Scala and Java, with a rich array of parallel operators. Spark can run on top of the Apache Mesos cluster manager, Hadoop YARN, Amazon EC2, or without an independent resource manager ("standalone mode").

[[https://spark.apache.org/docs/3.1.2/index.html|Spark 3.1.2]] Apache Spark is a '''unified analytics engine for large-scale data processing'''. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.

== Spark components ==

 * Driver program
 * Cluster manager
 * Worker node
 * [[https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.RDD.html#pyspark-rdd|Resilient distributed dataset (RDD)]]

=== RDD ===

A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel.

== Install Spark ==

{{{#!highlight bash
cd ~/tmp
curl -O https://apache.claz.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
tar tvzf spark-3.1.2-bin-hadoop3.2.tgz   # list the archive contents
tar xvzf spark-3.1.2-bin-hadoop3.2.tgz   # extract

# add Spark to the environment
vi ~/.bashrc
export SPARK_HOME=/home/vitor/tmp/spark-3.1.2-bin-hadoop3.2
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
. ~/.bashrc

# bind the master to the loopback address
cd ~/tmp/spark-3.1.2-bin-hadoop3.2/conf
vi spark-env.sh
SPARK_MASTER_HOST=127.0.0.1

cd ~/tmp
start-master.sh
# stop it with: stop-master.sh
# log file: /home/vitor/tmp/spark-3.1.2-bin-hadoop3.2/logs/spark-vitor-org.apache.spark.deploy.master.Master-1-debian.out
# 21/07/23 12:02:49 INFO Master: Starting Spark master at spark://127.0.0.1:7077
# 21/07/23 12:02:49 INFO Master: Running Spark version 3.1.2
# 21/07/23 12:02:49 WARN Utils: Service 'MasterUI' could not bind on port 8080. Attempting port 8081.
# 21/07/23 12:02:49 INFO Utils: Successfully started service 'MasterUI' on port 8081.
# 21/07/23 12:02:49 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://10.0.2.15:8081
# web UI: http://127.0.0.1:8081

# start a worker and register it with the master
start-worker.sh spark://127.0.0.1:7077
# http://127.0.0.1:8081 now shows "Alive Workers: 1"
}}}

=== pyspark example ===

==== Install pyspark ====

{{{#!highlight bash
cd ~/tmp
mkdir pyspark-test
cd pyspark-test
sudo apt install python3-venv
python3 -m venv virtenv
. virtenv/bin/activate
pip3 install --upgrade setuptools pip distlib
pip3 install pyspark

# run the test script
cd ~/tmp/pyspark-test/
. virtenv/bin/activate
python3 test_spark1.py
}}}

==== test_spark1.py ====

{{{#!highlight python
from pyspark.sql import SparkSession

# connect to the standalone master started above
master_url = "spark://127.0.0.1:7077"
spark = SparkSession.builder.master(master_url).getOrCreate()
print("spark session created")
}}}