CommunityData:Hyak Spark

This page will help you decide if you should use Spark on Hyak for your problem and provide instructions on how to get started.


 
So far only Ikt is supported.


== Pros and Cons of Spark ==
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\  version 2.4.4
           /_/


If so then you are ready to start running Spark programs on a single node. If you don't see this, then you need to check your <code>$SPARK_HOME</code>, <code>$PATH</code>, <code>$JAVA_HOME</code>, and <code>$PYTHONPATH</code> environment variables.
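
A quick way to check those variables (a generic sketch, not a CDSC-specific script; the version you see will depend on the install) is:

<syntaxhighlight lang="bash">
# print the variables pyspark depends on
echo "SPARK_HOME=$SPARK_HOME"
echo "JAVA_HOME=$JAVA_HOME"
echo "PYTHONPATH=$PYTHONPATH"

# confirm the executables and the python package are reachable
which pyspark spark-submit
python3 -c "import pyspark; print(pyspark.__version__)"
</syntaxhighlight>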


If you are using the [[CommunityData:Hyak-Mox | cdsc mox setup]] then you should have a working spark configuration in your environment already.  Otherwise, you'll need to have the following in your .bashrc (remember to source .bashrc or re-login to load up these changes):


<syntaxhighlight lang="bash">
<syntaxhighlight lang="bash">
Line 64: Line 64:
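(The exact values depend on where Java and Spark live on your system; the sketch below assumes the shared install at <code>/com/local/spark</code> and uses a placeholder for the JDK path, so adjust as needed.)

<syntaxhighlight lang="bash">
# sketch only -- adjust paths for your installation
export JAVA_HOME=/path/to/your/jdk            # placeholder; see the Oracle JDK notes below
export SPARK_HOME=/com/local/spark
export PATH=$SPARK_HOME/bin:$JAVA_HOME/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
</syntaxhighlight>
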
== Spark Walkthrough ==


Spark programming is somewhat different from normal python programming. This section will walk you through a script to help you learn how to work with Spark. You may find this script useful as a template for building variables on top of [[wikiq]] data.


This section presents a pyspark program that


If you have <code> /com/local/bin </code> in your <code> $PATH </code> then you should be able to run:
 
<syntaxhighlight lang="bash">
    get_spark_nodes.sh 4
</syntaxhighlight>


This checks out 4 nodes that can be used as a Spark cluster. The Spark cluster will have 4 worker nodes; one of these is also the "master" node. When you run <code>get_spark_nodes.sh</code> you will be routed to the machine that will become the master. If you only want 2 nodes, run:


<syntaxhighlight lang="bash">
     get_spark_nodes.sh 2
</syntaxhighlight>


After you get the nodes and have a shell on the master node, run:

<syntaxhighlight lang="bash">
  start_spark_cluster.sh
</syntaxhighlight>

This will set up the cluster. Make sure you start the cluster from the same session you used for get_spark_nodes.sh -- otherwise, the startup script doesn't have access to the assigned node list and will fail. Take note of the node that is assigned to be the master, and use that information to set your $SPARK_MASTER environment variable, for example <code>export SPARK_MASTER="n0650"</code>. The program <code>spark-submit</code> submits your script to the running Spark cluster.


<syntaxhighlight lang="bash">
     spark-submit --master spark://$SPARK_MASTER:18899 your_script.py [Arguments to your script here]
</syntaxhighlight>


For example, we can submit the script we used in the walkthrough as:


<syntaxhighlight lang="bash">
     spark-submit --master  spark://$SPARK_MASTER:18899 wikiq_users_spark.py --output-format tsv  -i "/com/output/wikiq-enwiki-20180301/enwiki-20180301-pages-meta-history*.tsv" -o  "/com/output/wikiq-users-enwiki-20180301-tsv/" --num-partitions 500
</syntaxhighlight>


When you have a spark cluster running, it will serve some nice monitoring tools on ports 8989 and 4040 of the master. You can build an ssh tunnel between your laptop and these nodes to monitor the progress of your spark jobs.
When you are done with the cluster, you should shut it down using the script in <code> $SPARK_HOME/sbin </code>


<syntaxhighlight lang="bash">
   $SPARK_HOME/sbin/stop-all.sh
</syntaxhighlight>
 


=== Monitoring the cluster ===


From a login node (hyak):
 
<syntaxhighlight lang="bash">
     ssh -L localhost:8989:localhost:8989 $SPARK_MASTER -N -f && ssh -L localhost:4040:localhost:4040 $SPARK_MASTER -N -f
</syntaxhighlight>


From your laptop:

<syntaxhighlight lang="bash">
     ssh -L localhost:8989:localhost:8989 hyak -N -f && ssh -L localhost:4040:localhost:4040 hyak -N -f
</syntaxhighlight>


Point your browser to localhost:8989 to see the cluster status and to localhost:4040 to monitor jobs.
# Edit your environment variables (i.e. in your .bashrc) to:


<syntaxhighlight lang="bash">
     export JAVA_HOME=/home/you/Oracle_JDK/
     export PATH=$JAVA_HOME/bin:$PATH
</syntaxhighlight>




You should see this:


<syntaxhighlight lang="bash">
 
     Welcome to
     Welcome to
           ____              __
           ____              __
Line 364: Line 343:
       /__ / .__/\_,_/_/ /_/\_\  version 2.3.1
       /__ / .__/\_,_/_/ /_/\_\  version 2.3.1
           /_/
           /_/
</syntaxhighlight>






Add to your .bashrc:
<syntaxhighlight lang="bash">
     export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
</syntaxhighlight>


= Tips, Recipes, and Resources =
To gain access to various useful SparkContext functions, you need to instantiate a pointer to the context which encloses your session. It seems to be common for Spark users to call this pointer sc, e.g. after you do


<syntaxhighlight lang="python">
     from pyspark.sql import SparkSession  # import needed if you have not already done so
     spark = SparkSession.builder.getOrCreate()
</syntaxhighlight>


add a line like


<syntaxhighlight lang="python">
     sc = spark.sparkContext
</syntaxhighlight>


and then you can use sc to access the functions described here: [http://spark.apache.org/docs/2.1.0/api/python/pyspark.html#pyspark.SparkContext].
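
For instance (a toy illustration, not part of the walkthrough script), <code>sc</code> lets you build RDDs directly:

<syntaxhighlight lang="python">
# assumes spark and sc were created as above
rdd = sc.parallelize(range(100))   # distribute a small local collection
print(rdd.count())                 # 100
print(sc.defaultParallelism)       # how many partitions Spark uses by default
</syntaxhighlight>
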
One way to create an empty dataframe is to generate a schema as in the example script, and then pass the schema into the create method, with an empty RDD object as data.


<syntaxhighlight lang="python">
     myAwesomeDataset = spark.createDataFrame(data=sc.emptyRDD(), schema=myGroovySchema)
</syntaxhighlight>
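
If you don't already have a schema handy, one can be built with <code>pyspark.sql.types</code>; the field names below are made-up placeholders, not ones from the example script:

<syntaxhighlight lang="python">
from pyspark.sql.types import StructType, StructField, StringType, LongType

# hypothetical schema with two columns
myGroovySchema = StructType([
    StructField("editor", StringType(), True),
    StructField("n_edits", LongType(), True),
])
</syntaxhighlight>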


==== Pyspark string slicing seems to be non-pythonic ====


You can access 2015 with
<syntaxhighlight lang="python">
     articleDF.timestamp[1:4]
</syntaxhighlight>


And to get 07:
<syntaxhighlight lang="python">
     articleDF.timestamp[5:2]
</syntaxhighlight>
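
The same slices can be written with <code>Column.substr(startPos, length)</code>, which makes the non-pythonic (start, length) semantics explicit; both arguments are 1-indexed:

<syntaxhighlight lang="python">
articleDF.timestamp.substr(1, 4)   # equivalent to articleDF.timestamp[1:4]
articleDF.timestamp.substr(5, 2)   # equivalent to articleDF.timestamp[5:2]
</syntaxhighlight>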


==== When Reading In Multiple Files with Different Schemas ====


Make sure you re-instantiate your reader object, e.g.
<syntaxhighlight lang="python">
     sparkReader = spark.read
</syntaxhighlight>


when changing to a new file. The reader may cache the schema of the previous file and fail to detect the new schema. To make sure you have what you're expecting, try a
 
 
<syntaxhighlight lang="python">
     if DEBUG:
         yourDataset.show()
</syntaxhighlight>


This will get you the same behavior as pandas <code>print(yourDataset.head())</code> -- 20 rows, nicely formatted in your stdout.
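
Putting the pieces together, a sketch of the pattern (file paths and reader options here are placeholders, assuming tab-separated input with a header row):

<syntaxhighlight lang="python">
DEBUG = True

# build a fresh reader for each input so the schema is inferred anew
for path in ["first_file.tsv", "second_file.tsv"]:
    df = spark.read.csv(path, sep="\t", header=True, inferSchema=True)
    if DEBUG:
        df.show()   # prints the first 20 rows
</syntaxhighlight>
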
     spark java.io.IOException: No subfolder can be created in .


The University of Google told me it was a disk space issue, but shutting down the cluster and restarting it solved the problem -- maybe it hadn't shut down cleanly the last time someone used Spark.


=== Slurm kills my job! Encountering Memory Limits ===
# Be careful moving data out of distributed Spark objects into normal python objects. Work your way up from a small sample to a larger sample.
# You might try tweaking memory management options in <code>$SPARK_HOME/conf/spark-env.sh</code> and <code>$SPARK_HOME/conf/spark-defaults.conf</code>. Decreasing the number of executors and the total memory allocated to executors should make Spark more resilient, at the cost of performance; a command-line alternative is sketched below.
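
For a single job, you can also cap memory on the <code>spark-submit</code> command line rather than editing the conf files (the values here are made up; pick ones that fit your nodes):

<syntaxhighlight lang="bash">
spark-submit --master spark://$SPARK_MASTER:18899 \
    --driver-memory 4g \
    --executor-memory 4g \
    your_script.py
</syntaxhighlight>
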
=== Errors While Starting Cluster ===
Sometimes I get errors like this:

    n0650: failed to launch: nice -n 0 /com/local/spark/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://n0649:18899

Usually it seems to happen if I relinquish my spark cluster (whether I use the kill script or not) and then immediately restart one. The error goes away if I shut down and wait a minute or two before re-launching; my assumption is that there's some hygienic work being done behind the scenes that the scheduler doesn't know about, and I need to let that finish.
And sometimes I get errors like this:

    scontrol: error: host list is empty

when trying to launch spark. This means I'm in a session that doesn't know what nodes are assigned to the spark cluster, and it will launch a dysfunctional cluster. Run <code>stop-all.sh</code> and then <code>start_spark_cluster.sh</code> from the same session you landed in when you ran <code>get_spark_nodes.sh</code>, as sketched below.
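
In other words, using the scripts already described above:

<syntaxhighlight lang="bash">
$SPARK_HOME/sbin/stop-all.sh   # shut down the half-started cluster
start_spark_cluster.sh         # relaunch from the session that ran get_spark_nodes.sh
</syntaxhighlight>
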
When I get errors about full logs, I go in and clean up the temporary files folder it refers to.