== Running Jobs on Hyak ==

When you first log in to Hyak, you will be on a "login node". These are nodes that have access to the Internet and can be used to update code, move files around, and so on. They should not be used for computationally intensive tasks. To actually run jobs, there are a few different options, described in detail [http://wiki.cac.washington.edu/display/hyakusers/Mox_scheduler in the Hyak User documentation]. The following are basic instructions for some common use cases.

=== Interactive nodes ===

Interactive nodes are systems where you get a <code>bash</code> shell from which you can run your code. This mode of operation is conceptually similar to running your code on your own computer, the difference being that you have access to much more CPU and memory. To check out an interactive node, run the <code>big_machine</code> or <code>any_machine</code> command from your login shell. Before running these commands, you will want to be in a [[CommunityData:Tmux|<code>tmux</code>]] or <code>screen</code> session so that you can start your job and log off without worrying about it being terminated.

{{note}} At any given time, unless you are using the <code>ckpt</code> (formerly the <code>bf</code>) queue, our entire group can collectively have one instance of <code>big_machine</code> and three instances of <code>any_machine</code> running at the same time. You may need to coordinate over IRC if you need to use a specific node for any reason.
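For example, a typical interactive session might look like the sketch below. This is a minimal illustration: <code>big_machine</code> is the command described above, and the <code>tmux</code> session name is arbitrary.

<source lang="bash">
## start (or reattach to) a tmux session on the login node, so the
## allocation survives if your SSH connection drops
tmux new-session -A -s hyak

## request an interactive node; your prompt will change once the
## allocation is granted
big_machine

## ... run your computationally intensive work here ...

## when finished, exit the node to release it for the rest of the group
exit
</source>

You can detach from the <code>tmux</code> session with <code>Ctrl-b d</code> and reattach later with <code>tmux attach</code>.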
=== Killing jobs on compute nodes ===

The Slurm scheduler provides a command called [https://slurm.schedmd.com/scancel.html scancel] to terminate jobs. For example, you might run <tt>queue_state</tt> from a login node to find the ID number of your job (let's say it's 12345), then run <tt>scancel --signal=TERM 12345</tt> to send a SIGTERM signal, or <tt>scancel --signal=KILL 12345</tt> to send a SIGKILL signal, which will bring job 12345 to an end.

=== Parallelization Tips ===

Our nodes on Mox have 28 CPU cores, and our nodes on Klone have 40. These can help speed up your analysis ''significantly''. If you are using R functions such as <code>lapply</code>, there are parallelized equivalents (e.g., <code>mclapply</code>) which can take advantage of all the cores and give you a 2800% (Mox) or 4000% (Klone) boost! However, be aware of your code's memory requirements: if you are running 28 processes in parallel, your memory needs can also go up to 28x, which may be more than the ~200GB available on the <code>big_machine</code> node on Mox. In such cases, you may want to dial down the number of CPU cores being used. One way to do that globally in your code is to run the following snippet before calling any of the parallelized functions. If you find yourself doing this often, consider whether it is possible to reduce your memory usage via streaming, databases (like SQLite, Parquet files, or DuckDB), or lower-precision data types (e.g., 32-bit or even 16-bit floating point numbers instead of the standard 64-bit).

<source lang="r">
library(parallel)
options(mc.cores=20)  ## tell the mc* functions to use 20 cores unless otherwise specified
mcaffinity(1:20)      ## bind this R process to the first 20 cores
</source>

More information on parallelizing your R code can be found in the [https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf <code>parallel</code> package documentation].

=== Using the Checkpoint Queue ===

Hyak has a special way of scheduling jobs using the '''checkpoint queue'''. When you run jobs on the checkpoint queue, they run on someone else's Hyak node that they aren't using right now. This is great, as it gives us a huge amount of free (as in beer) computing. But using the checkpoint queue does take some effort, mainly because your jobs can get killed at any time if the owner of the node checks it out. So if you want to run a job for more than a few minutes on the checkpoint queue, it will need to be able to "checkpoint" by periodically saving its state and then restarting from that state.

==== Starting a checkpoint queue job ====

To start a checkpoint queue job, we'll use <code>sbatch</code> instead of <code>srun</code>. See the [https://slurm.schedmd.com/sbatch.html documentation] for a refresher on starting HPC jobs with <code>sbatch</code>. To request a job on the checkpoint queue, put the following at the top of your <code>sbatch</code> script:

<source lang="bash">
#SBATCH --export=ALL
#SBATCH --account=comdata-ckpt
#SBATCH --partition=ckpt
</source>
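Putting it together, a complete submission script might look like the sketch below. This is only an illustration: the job name, resource requests, and <code>my_checkpointable_job.R</code> are hypothetical placeholders, not part of our actual setup.

<source lang="bash">
#!/bin/bash
## checkpoint queue header from above
#SBATCH --export=ALL
#SBATCH --account=comdata-ckpt
#SBATCH --partition=ckpt
## hypothetical resource requests -- adjust these for your job
#SBATCH --job-name=ckpt-example
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=28
#SBATCH --time=4:00:00
#SBATCH --mem=100G

## because checkpoint jobs can be preempted at any time, the program run
## here should periodically save its state and resume from it on restart
Rscript my_checkpointable_job.R
</source>

Submit the script with <code>sbatch</code> (e.g., <code>sbatch my_job.sh</code>, where the filename is also a placeholder).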