== Starting a Spark cluster with many nodes on Hyak ==

It is pretty easy to start a multi-node cluster on Hyak. If you have <code>/com/local/bin</code> in your <code>$PATH</code>, you should be able to run:

<syntaxhighlight lang="bash">
get_spark_nodes.sh 4
</syntaxhighlight>

to check out 4 nodes for use as a Spark cluster. The cluster will have 4 worker nodes, one of which doubles as the "master" node. When you run <code>get_spark_nodes.sh</code> you will be routed to the machine that will become the master. If you only want 2 nodes, run:

<syntaxhighlight lang="bash">
get_spark_nodes.sh 2
</syntaxhighlight>

Once you have the nodes and a shell on the master node, run:

<syntaxhighlight lang="bash">
start_spark_cluster.sh
</syntaxhighlight>

This sets up the cluster. Make sure you start the cluster from the same session you used to run <code>get_spark_nodes.sh</code>; otherwise the startup script does not have access to the assigned node list and will fail. Take note of which node is assigned to be the master, and use it to set your <code>$SPARK_MASTER</code> environment variable, for example <code>export SPARK_MASTER="n0650"</code>.

The program <code>spark-submit</code> submits your script to the running Spark cluster:

<syntaxhighlight lang="bash">
spark-submit --master spark://$SPARK_MASTER:18899 your_script.py [arguments to your script here]
</syntaxhighlight>

For example, we can submit the script we used in the walkthrough as:

<syntaxhighlight lang="bash">
spark-submit --master spark://$SPARK_MASTER:18899 wikiq_users_spark.py --output-format tsv -i "/com/output/wikiq-enwiki-20180301/enwiki-20180301-pages-meta-history*.tsv" -o "/com/output/wikiq-users-enwiki-20180301-tsv/" --num-partitions 500
</syntaxhighlight>

While a Spark cluster is running, it serves some nice monitoring tools on ports 8989 and 4040 of the master. You can build an SSH tunnel between your laptop and these nodes to monitor the progress of your Spark jobs.

When you are done with the cluster, shut it down using the script in <code>$SPARK_HOME/sbin</code>:

<syntaxhighlight lang="bash">
$SPARK_HOME/sbin/stop-all.sh
</syntaxhighlight>

A condensed recap of a whole session appears at the end of this section.

=== Monitoring the cluster ===

From a login node (hyak):

<syntaxhighlight lang="bash">
ssh -L localhost:8989:localhost:8989 $SPARK_MASTER -N -f && ssh -L localhost:4040:localhost:4040 $SPARK_MASTER -N -f
</syntaxhighlight>

From your laptop:

<syntaxhighlight lang="bash">
ssh -L localhost:8989:localhost:8989 hyak -N -f && ssh -L localhost:4040:localhost:4040 hyak -N -f
</syntaxhighlight>

Point your browser to localhost:8989 to see the cluster status and to localhost:4040 to monitor jobs.
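Because the <code>-f</code> flag backgrounds each tunnel, the <code>ssh</code> processes keep running after the cluster itself is gone. One way to clean them up (a sketch; it assumes no other <code>ssh</code> processes on the machine match these patterns) is:

<syntaxhighlight lang="bash">
# Kill the backgrounded tunnels by matching their command lines.
pkill -f "ssh -L localhost:8989"
pkill -f "ssh -L localhost:4040"
</syntaxhighlight>

Run this on each machine where you started tunnels (the login node and your laptop).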
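=== Recap: a whole session at a glance ===

The sketch below strings the commands above into one session. The node name <code>n0650</code> and the script name <code>your_script.py</code> are placeholders; substitute the master node you were actually assigned and your own script.

<syntaxhighlight lang="bash">
# Request 4 nodes; you are routed to the node that will become the master.
get_spark_nodes.sh 4

# From that same session, start the cluster and record the master's name.
start_spark_cluster.sh
export SPARK_MASTER="n0650"

# Submit work to the cluster.
spark-submit --master spark://$SPARK_MASTER:18899 your_script.py

# Shut the cluster down when you are finished.
$SPARK_HOME/sbin/stop-all.sh
</syntaxhighlight>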