Apache Spark


Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning workloads on single-node machines or clusters. This document provides instructions for running a multi-node Spark cluster on the Andes system and shows an example PySpark job.

Getting Started

Download Spark from the Apache Spark download page. This guide was written against Spark 3.1, but it should work for newer versions as well. Untar the downloaded file and rename the directory to spark.

To set up a Spark cluster, we use the following Slurm script to request compute nodes. The script requests four nodes and spawns a Spark cluster with one master node and three worker nodes. The number of worker nodes can be increased or decreased by changing the value of the -N option in the Slurm script.

#!/bin/bash
#SBATCH -N 4
#SBATCH --mem=0
#SBATCH -A <ABC1234>
#SBATCH -t 1:00:00
#SBATCH -J spark_test
#SBATCH -o o.spark_test
#SBATCH -e e.spark_test

module load spark
module load python

nodes=($(scontrol show hostnames ${SLURM_JOB_NODELIST} | sort | uniq ))
numnodes=${#nodes[@]}
last=$(( numnodes - 1 ))

# Start the master on the first node; workers connect to it on Spark's default port 7077
masterurl="spark://${nodes[0]}:7077"
ssh ${nodes[0]} "cd ${SPARK_HOME}; source /etc/profile ; module load spark; ./sbin/start-master.sh"
for i in $( seq 1 $last ); do
    ssh ${nodes[$i]} "cd ${SPARK_HOME}; source /etc/profile ; module load spark; ./sbin/start-worker.sh ${masterurl}"
done

ssh ${nodes[0]} "cd ${SPARK_HOME}; source /etc/profile ; module load spark; /usr/bin/time -v ./bin/spark-submit --deploy-mode client --executor-cores 32 --executor-memory 250G --conf spark.standalone.submit.waitAppCompletion=true --master $masterurl spark_test.py"
echo 'end'
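The node-list arithmetic in the script above can be tried in isolation. The sketch below substitutes a hard-coded host list for the `scontrol show hostnames` output (the andes* hostnames are placeholders), then derives the worker count and the master URL the same way the script does:

```shell
# Stand-in for: nodes=($(scontrol show hostnames ${SLURM_JOB_NODELIST} | sort | uniq))
nodes=(andes001 andes002 andes003 andes004)

numnodes=${#nodes[@]}          # total allocated nodes
last=$(( numnodes - 1 ))       # workers = all nodes except the master

# The first node hosts the master; 7077 is Spark's default master port
masterurl="spark://${nodes[0]}:7077"

echo "master: ${masterurl}, workers: ${last}"
```

With four nodes this prints `master: spark://andes001:7077, workers: 3`, matching the one-master/three-worker layout described above.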

The Slurm script submits a test Python script (spark_test.py), described below, which runs PySpark code to test the Spark cluster. Copy the content below and save it as a spark_test.py file in the SPARK_HOME directory. You can also change the spark_test.py file's path, but you will have to update the Slurm script accordingly.

import random
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('Test-app').getOrCreate()

#Generate sample dataset
cola_list = ['2022-01-01', '2022-01-02', '2022-01-03' ]
colb_list = ['CSC', 'PHY', 'MAT', 'ENG', 'CHE', 'ENV', 'BIO', 'PHRM']
colc_list = [100, 200, 300, 400, 500, 600, 700, 800, 900]

# Seed the random number generator so every run generates the same data
# (42 is an arbitrary choice)
random.seed(42)

sample_data = []
for idx in range(1000):
    sample_data.append([random.choice(cola_list), random.choice(colb_list), random.choice(colc_list)])

columns = ["date", "org", "value"]
# Create a Spark DataFrame from the sample data
df = spark.createDataFrame(data=sample_data, schema=columns)

# Count rows per (date, org) group
res = (df.groupBy('date', 'org')
         .agg(F.count('value').alias('count_value')))
res.show()

spark.stop()
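Seeding the generator is what makes the sample dataset identical on every run. A minimal pure-Python sketch of that idea, runnable without PySpark (the function name and seed value are illustrative, not part of the test script):

```python
import random

def make_sample(n=1000, seed=42):
    """Generate n rows of (date, org, value); a fixed seed reproduces the data."""
    rng = random.Random(seed)  # private generator, avoids touching global state
    dates = ['2022-01-01', '2022-01-02', '2022-01-03']
    orgs = ['CSC', 'PHY', 'MAT', 'ENG', 'CHE', 'ENV', 'BIO', 'PHRM']
    values = [100, 200, 300, 400, 500, 600, 700, 800, 900]
    return [[rng.choice(dates), rng.choice(orgs), rng.choice(values)]
            for _ in range(n)]

# Two runs with the same seed yield identical rows
assert make_sample() == make_sample()
```

The same reproducibility applies on the cluster: with a fixed seed, the driver builds the same sample_data list each run, so repeated jobs aggregate identical input.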

If the Spark cluster is set up and spark_test.py executes successfully, the output in the log file o.spark_test should look similar to the table below.

+----------+----+-----------+
|      date| org|count_value|
+----------+----+-----------+
|2022-01-03| BIO|         37|
|2022-01-02| ENV|         53|
|2022-01-03| CHE|         39|
|2022-01-03| PHY|         46|
|2022-01-01| CSC|         45|
|2022-01-03| CSC|         48|
|2022-01-01| BIO|         39|
|2022-01-01| MAT|         42|
|2022-01-02| CHE|         44|
|2022-01-03| ENV|         33|
|2022-01-01| ENG|         33|
|2022-01-02| ENG|         28|
|2022-01-01| ENV|         33|
|2022-01-02| CSC|         45|
|2022-01-02| MAT|         51|
|2022-01-01| PHY|         38|
|2022-01-01|PHRM|         40|
|2022-01-03|PHRM|         42|
|2022-01-02|PHRM|         43|
|2022-01-03| ENG|         56|
+----------+----+-----------+
only showing top 20 rows
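The groupBy/count in spark_test.py is the distributed equivalent of tallying (date, org) pairs. A plain-Python sketch of the same aggregation using collections.Counter (the three rows here are illustrative, not the actual job data):

```python
from collections import Counter

# A few sample rows in the same (date, org, value) shape as the test script
rows = [
    ['2022-01-01', 'CSC', 100],
    ['2022-01-01', 'CSC', 300],
    ['2022-01-02', 'BIO', 200],
]

# Tally rows per (date, org) pair, mirroring
# df.groupBy('date', 'org').agg(F.count('value'))
counts = Counter((date, org) for date, org, _ in rows)
# counts[('2022-01-01', 'CSC')] == 2
```

Spark performs the same tally, but partitions the rows across executors and merges the partial counts in a shuffle stage.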

Spark also provides a web UI for monitoring the cluster, and you can access it from your local machine by port forwarding from the master node.

  • For example, if the master node is running on andes338, you can run the following command in a terminal on your local machine.

    ssh -N <USERNAME>@andes-login1.olcf.ornl.gov -L 8080:andes338.olcf.ornl.gov:8080

  • Then access the Spark dashboard at http://localhost:8080/ in a web browser on your local machine.


The Spark documentation is a very useful resource; go through it to explore Spark's capabilities.