= Tips and FAQs =

== 5 productivity tips ==

# Find a workflow that works for you. There isn't a standardized workflow for quantitative / computational social science or social computing. People normally develop idiosyncratic workflows around the tools they know or have been exposed to and that meet their diverse needs and tastes. Be aware of how you're spending your time and effort, and adopt tools that make things easier or more efficient. For example, if you're spending a lot of time typing into the hyak command line, bash-completion and bash-history can help, and a pipeline (see below) might help even more.
# If you find yourself spending time manually rerunning code in a multistage project, learn [https://en.wikipedia.org/wiki/Make_(software) Make] or another pipeline tool. Such tools take some effort to learn but really help you organize, test, and refine your project. Make is a good choice because it is old, incredibly polished, and featureful. It is also language agnostic and can run shell commands. You don't need to learn every feature, just the basics. One downside is that its interface has a different flavor than more recently designed tools.
# [https://slurm.schedmd.com/documentation.html Slurm], the system that you use to access hyak nodes, is also very powerful. The hyak team used to maintain a tool called parallel-sql, which helped with running a large number of short-running programs. This tool is no longer supported, but [https://slurm.schedmd.com/job_array.html job arrays] are a Slurm feature that is even better.
# Use the free resources. Job arrays (mentioned above) are great in combination with the [https://wiki.cac.washington.edu/display/hyakusers/Mox_checkpoint checkpoint queue]. The checkpoint (or ckpt) queue runs your jobs on other people's idle nodes. You can access thousands of cores and terabytes of RAM on the checkpoint queue. There are limitations. If the owner of a node wants to use it, your job will be cancelled and the scheduler will automatically restart it. Each job also has a maximum total running time, and restarts don't reset the clock. The queue is therefore best suited for jobs that can be paused (saved) and restarted. If you can design a script to catch the checkpoint signal, save progress, and restart, you will be able to make excellent use of the checkpoint queue. Note that checkpoint jobs get run according to a priority system, and if members of our group overuse this resource then our jobs will have lower priority. <br /> There is also virtually [https://hyak.uw.edu/docs/storage/gscratch/ unlimited free storage] on hyak under <code>/gscratch/scrubbed/comdata</code>, with the catch that the storage is much slower and that files will be automatically deleted after a short time (currently 21 days).
# Get connected to the hyak team and other hyak users. Hyak isn't perfect and has had many recent issues related to the new Klone system. If you run into trouble and it feels like the system isn't working, email help@uw.edu with a subject line that starts with "hyak:". They are nice and helpful. Other good resources are the [https://mailman12.u.washington.edu/mailman/listinfo/hyak-users mailing list] and, if you are a UW student, the [https://depts.washington.edu/uwrcc/getting-started-2/getting-started/ research computing club]. The club has its own nodes, including GPU nodes, that only students who join the club can use.
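
As a rough illustration of tips 3 and 4, the sketch below shows what a job-array batch script that can survive restarts might look like. It is not an official Hyak recipe. The input file names, the state-file naming, the step count, and the assumption that an interruption arrives as SIGTERM are all made up for the example, and the account and partition options your job needs are left out. <code>#SBATCH --array</code> and <code>SLURM_ARRAY_TASK_ID</code> are standard Slurm.

```shell
#!/bin/bash
#SBATCH --job-name=array_sketch
#SBATCH --array=0-3

# Map a 0-based array index to an input file (file names are hypothetical).
pick_input() {
    index="$1"
    set -- part_a.tsv part_b.tsv part_c.tsv part_d.tsv
    shift "$index"
    printf '%s\n' "$1"
}

# Print the last completed step stored in a state file, or 0 if none exists.
load_progress() {
    if [ -f "$1" ]; then cat "$1"; else echo 0; fi
}

# Run steps up to $2, recording each finished step in state file $1 so that
# a restarted job resumes where the previous run stopped.
run_steps() {
    state_file="$1"
    total_steps="$2"
    step="$(load_progress "$state_file")"
    while [ "$step" -lt "$total_steps" ]; do
        step=$((step + 1))
        # ... the real work for this step would go here ...
        echo "$step" > "$state_file"
    done
}

# Only do work when Slurm started this script as an array task.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    # Assumption: the job is interrupted with SIGTERM. Progress is already
    # on disk after every step, so the handler only logs and exits.
    trap 'echo "interrupted; progress is in the state file" >&2; exit 1' TERM
    input_file="$(pick_input "$SLURM_ARRAY_TASK_ID")"
    echo "task ${SLURM_ARRAY_TASK_ID} processing ${input_file}"
    run_steps "progress_${SLURM_ARRAY_TASK_ID}.state" 100
fi
```

Saving progress after every step, rather than only when a signal arrives, means a restarted task loses at most one step of work even if the signal is never delivered to the script.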

== Common Troubles and How to Solve Them ==
=== Help! I'm over CPU quota and Hyak is angry! ===

'''Don't panic.''' Everyone has done this at least once. Mako has done it dozens of times. It is a little bit difficult to deal with but can be solved. You are not in trouble.

The usual cause is that you've accidentally run something on a login node that ought to be run on a compute node. The solution is to find the badly behaved process and kill it.

If it's a script or command running on your command line, press '''Ctrl-c''' to kill it. If you backgrounded it, type <code>fg</code> to foreground it and then press '''Ctrl-c'''. But if you ran parallel, you'll need to kill parallel itself.

<code>ps -faux | grep <your username></code> will show you all the things you are running (or have someone else run it for you if the spam is so terrible you can't get a command to run). The first column has the usernames, the second column has the process IDs, the last column has the things you're running.

[[File:faux.jpg]]

In the screenshot, the red is the user name being grepped for. The last three entries on each line are the start time (in hyak time; type <code>date</code> if you want to compare hyak time to your time), then how much CPU time the process has consumed, then a little diagram of parent and child processes. The process you want is parallel (in the example, process ID 9977).

Killing the child process (in the example, 9992) likely won't help, because parallel will just go on to the next task you queued up for it. You will need to run something like: <code>kill <process id></code>
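
If a plain <code>kill</code> doesn't seem to be enough, a small helper along these lines can save some typing. This is only a sketch: the function name <code>stop_process</code> and the five-second wait are made up for the example, and falling back to <code>kill -9</code> is a blunt last resort that gives the process no chance to clean up.

```shell
#!/bin/bash

# Ask a process to exit with SIGTERM, wait up to about five seconds,
# then fall back to SIGKILL if it is still running.
# stop_process is a hypothetical helper name, not a hyak tool.
stop_process() {
    pid="$1"
    # If the signal cannot be sent, the process is already gone (or not yours).
    kill "$pid" 2>/dev/null || return 0
    tries=0
    # kill -0 sends no signal; it only checks whether the process still exists.
    while kill -0 "$pid" 2>/dev/null && [ "$tries" -lt 5 ]; do
        sleep 1
        tries=$((tries + 1))
    done
    if kill -0 "$pid" 2>/dev/null; then
        kill -9 "$pid" 2>/dev/null
    fi
    return 0
}

# Usage, with the process ID of parallel from the ps output above:
#   stop_process 9977
```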

=== My R Job is getting Killed ===

First, make sure you're running on a compute node (like n2344) or else on the int_machine. Also, don't use the <code>--time-min</code> flag: there seems to be a bug with it where jobs are evicted incorrectly.

Second, see if you can narrow down where in your R code the problem is happening. Kaylea has seen it primarily when reading or writing files, and this tip is from that experience. Breaking the read or write into smaller chunks (if that makes sense for your project) might be all it takes.
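
One way to get smaller chunks without changing much R code is to split the input file before R ever sees it. The sketch below assumes a line-oriented text file with a single header row (such as a CSV or TSV with no line breaks inside fields). The function name <code>split_with_header</code> is made up for the example, and the output prefix should not match any files that already exist.

```shell
#!/bin/bash

# Split a delimited text file ($1) into pieces of at most $2 data rows,
# written to files starting with prefix $3. The header line is repeated
# at the top of every piece so each one can be read on its own.
# split_with_header is a hypothetical helper name.
split_with_header() {
    input="$1"
    rows="$2"
    prefix="$3"
    header="$(head -n 1 "$input")"
    # Everything after the header goes to split, which names the pieces
    # <prefix>aa, <prefix>ab, and so on.
    tail -n +2 "$input" | split -l "$rows" - "$prefix"
    for piece in "$prefix"*; do
        [ -f "$piece" ] || continue
        { printf '%s\n' "$header"; cat "$piece"; } > "${piece}.tmp"
        mv "${piece}.tmp" "$piece"
    done
}

# Usage (file names are placeholders):
#   split_with_header big_table.csv 100000 big_table_part_
```

Your R script can then loop over the pieces and read them one at a time.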