
== Spark on Hyak == 

If you have already set up your Hyak account following the instructions on [[CommunityData:Hyak]], you should already have a working Spark installation. Test this by running
   pyspark

from a Hyak cluster node (running it directly on the login node will give you an insufficient-memory error).

You should see this: 

    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 3.3.1
          /_/

If so, you are ready to start running Spark programs on a single node. If you don't see this, check your <code>$SPARK_HOME</code>, <code>$PATH</code>, <code>$JAVA_HOME</code>, and <code>$PYTHONPATH</code> environment variables.
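One way to check those four variables in a single pass is a small shell function like the sketch below. The name <code>check_spark_env</code> is hypothetical and is not part of the CDSC setup; the function only reports whether each variable is set and whether <code>pyspark</code> can be found on <code>$PATH</code>. It does not verify that the paths point at a working installation.

```shell
# check_spark_env is a hypothetical helper name, not part of the CDSC setup.
# It prints each Spark-related variable, flags the ones that are unset or
# empty, and returns a non-zero status if anything is missing.
check_spark_env() {
    missing=0
    for name in SPARK_HOME PATH JAVA_HOME PYTHONPATH; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            missing=1
        else
            echo "$name=$value"
        fi
    done
    # command -v succeeds only if pyspark can be resolved through $PATH.
    if ! command -v pyspark >/dev/null 2>&1; then
        echo "pyspark not found on PATH"
        missing=1
    fi
    return "$missing"
}
```

After pasting the function into your shell, run <code>check_spark_env</code> on a cluster node. Any line ending in "is not set" names a variable to fix.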

If you are using the [[CommunityData:Hyak-Mox | cdsc mox setup]], you should already have a working Spark configuration in your environment. Otherwise, you'll need the following in your <code>.bashrc</code> (remember to run <code>source ~/.bashrc</code> or log in again to load these changes):

<syntaxhighlight lang="bash">
    export JAVA_HOME='/gscratch/comdata/local/open-jdk'
    export PATH="$JAVA_HOME/bin:$PATH"
    export SPARK_HOME='/gscratch/comdata/local/spark'
    export PATH="$SPARK_HOME/bin:$PATH"
    export PYTHONPATH="$SPARK_HOME/python:$PYTHONPATH"

    export TMPDIR="/gscratch/comdata/users/$USER/tmpdir"
</syntaxhighlight>
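The <code>export TMPDIR=...</code> line only sets the variable; it does not create the directory. Tools that honor <code>$TMPDIR</code> (for example <code>mktemp</code>) generally fail if the directory does not exist, so it may be worth creating it once. The sketch below is an assumption about how you might do that. The name <code>ensure_tmpdir</code> is hypothetical, and the source instructions do not state that this step is required.

```shell
# ensure_tmpdir is a hypothetical helper name.
# It creates the given directory (default: $TMPDIR) if it does not exist
# and succeeds only if the directory ends up present and writable.
ensure_tmpdir() {
    dir="${1:-${TMPDIR:-}}"
    if [ -z "$dir" ]; then
        echo "TMPDIR is not set" >&2
        return 1
    fi
    # mkdir -p creates missing parent directories and is a no-op if dir exists.
    mkdir -p "$dir" && [ -w "$dir" ]
}
```

With the <code>.bashrc</code> settings above loaded, running <code>ensure_tmpdir</code> with no arguments would act on the exported <code>$TMPDIR</code> path.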

You can also run Spark programs on many nodes, but this requires additional steps, which are described below.