== Starting a Spark cluster with many nodes on Hyak ==

It is pretty easy to start a multi-node cluster on Hyak. If you have <code>/com/local/bin</code> in your <code>$PATH</code>, you should be able to run:

<syntaxhighlight lang="bash">
get_spark_nodes.sh 4
</syntaxhighlight>

to check out 4 nodes for use as a Spark cluster. The cluster will have 4 worker nodes, one of which doubles as the "master" node. When you run <code>get_spark_nodes.sh</code> you will be routed to the machine that will become the master. If you only want 2 nodes, run:

<syntaxhighlight lang="bash">
get_spark_nodes.sh 2
</syntaxhighlight>

Once you have the nodes and a shell on the master node, run:

<syntaxhighlight lang="bash">
start_spark_cluster.sh
</syntaxhighlight>

This sets up the cluster. Make sure you start the cluster from the same session you used to run <code>get_spark_nodes.sh</code>; otherwise the startup script does not have access to the assigned node list and will fail. Take note of which node is assigned to be the master, and use it to set your <code>$SPARK_MASTER</code> environment variable, for example <code>export SPARK_MASTER="n0650"</code>.

The program <code>spark-submit</code> submits your script to the running Spark cluster:

<syntaxhighlight lang="bash">
spark-submit --master spark://$SPARK_MASTER:18899 your_script.py [arguments to your script here]
</syntaxhighlight>

For example, we can submit the script we used in the walkthrough as:

<syntaxhighlight lang="bash">
spark-submit --master spark://$SPARK_MASTER:18899 wikiq_users_spark.py --output-format tsv -i "/com/output/wikiq-enwiki-20180301/enwiki-20180301-pages-meta-history*.tsv" -o "/com/output/wikiq-users-enwiki-20180301-tsv/" --num-partitions 500
</syntaxhighlight>

While a Spark cluster is running, it serves some nice monitoring tools on ports 8989 and 4040 of the master. You can build an SSH tunnel between your laptop and these nodes to monitor the progress of your Spark jobs.

When you are done with the cluster, shut it down using the script in <code>$SPARK_HOME/sbin</code>:

<syntaxhighlight lang="bash">
$SPARK_HOME/sbin/stop-all.sh
</syntaxhighlight>

A condensed recap of a whole session appears at the end of this section.

=== Monitoring the cluster ===

From a login node (hyak):

<syntaxhighlight lang="bash">
ssh -L localhost:8989:localhost:8989 $SPARK_MASTER -N -f && ssh -L localhost:4040:localhost:4040 $SPARK_MASTER -N -f
</syntaxhighlight>

From your laptop:

<syntaxhighlight lang="bash">
ssh -L localhost:8989:localhost:8989 hyak -N -f && ssh -L localhost:4040:localhost:4040 hyak -N -f
</syntaxhighlight>

Point your browser to localhost:8989 to see the cluster status and to localhost:4040 to monitor jobs.
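Because the <code>-f</code> flag backgrounds each tunnel, the <code>ssh</code> processes keep running after the cluster itself is gone. One way to clean them up (a sketch; it assumes no other <code>ssh</code> processes on the machine match these patterns) is:

<syntaxhighlight lang="bash">
# Kill the backgrounded tunnels by matching their command lines.
pkill -f "ssh -L localhost:8989"
pkill -f "ssh -L localhost:4040"
</syntaxhighlight>

Run this on each machine where you started tunnels (the login node and your laptop).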
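=== Recap: a whole session at a glance ===

The sketch below strings the commands above into one session. The node name <code>n0650</code> and the script name <code>your_script.py</code> are placeholders; substitute the master node you were actually assigned and your own script.

<syntaxhighlight lang="bash">
# Request 4 nodes; you are routed to the node that will become the master.
get_spark_nodes.sh 4

# From that same session, start the cluster and record the master's name.
start_spark_cluster.sh
export SPARK_MASTER="n0650"

# Submit work to the cluster.
spark-submit --master spark://$SPARK_MASTER:18899 your_script.py

# Shut the cluster down when you are finished.
$SPARK_HOME/sbin/stop-all.sh
</syntaxhighlight>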