Editing CommunityData:Hyak

# If you find yourself spending time manually rerunning code in a multistage project, learn [https://en.wikipedia.org/wiki/Make_(software) Make] or another pipeline tool.  Such tools take some effort to learn but really help you organize, test, and refine your project.  Make is a good choice because it is old, highly polished, and feature-rich. You don't need to learn every feature, just the basics. Its interface has a different flavor than more recently designed tools, which can be a downside.  Other positives are that it is language-agnostic and can run shell commands.
# [https://slurm.schedmd.com/documentation.html Slurm], the system that you use to access hyak nodes, is also very powerful.  The hyak team used to maintain a tool called parallel-sql, which helped with running a large number of short-running programs. This tool is no longer supported, but [https://slurm.schedmd.com/job_array.html job arrays] are a Slurm feature that is even better.
# Use the free resources.  Job arrays (mentioned above) are great in combination with the [https://wiki.cac.washington.edu/display/hyakusers/Mox_checkpoint checkpoint queue]. The checkpoint (or ckpt) queue runs your jobs on other people's idle nodes.  You can access thousands of cores and terabytes of RAM on the checkpoint queue.  There are limitations. If the owner of a node wants to use it, they will cancel your job.  If this happens, the scheduler will automatically restart your job, but each job has a maximum total running time (restarts don't reset the clock). Therefore, the queue is best suited for jobs that can be paused (saved) and restarted.  If you can design a script to catch the checkpoint signal, save progress, and restart, you will be able to make excellent use of the checkpoint queue. Note that checkpoint jobs are run according to a priority system, and if members of our group overuse this resource, our jobs will have lower priority. <br /> There is also virtually [https://hyak.uw.edu/docs/storage/gscratch/ unlimited free storage] on hyak under <code>/gscratch/scrubbed/comdata</code>, with the catch that the storage is much slower and that files will be automatically deleted after a short time (currently 21 days).
# Get connected to the hyak team and other hyak users.  Hyak isn't perfect and has had many recent issues related to the new Klone system. If you run into trouble and it feels like the system isn't working, you should email help@uw.edu with a subject line that starts with "hyak:". The hyak team is nice and helpful.  Other good resources are the [https://mailman12.u.washington.edu/mailman/listinfo/hyak-users mailing list] and, if you are a UW student, the [https://depts.washington.edu/uwrcc/getting-started-2/getting-started/ research computing club].  The club has its own nodes, including GPU nodes, that only students who join the club can use.
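
As an illustration of the job array tip above, here is a minimal Python sketch of how a single script might pick its share of the work inside an array task. It relies on the <code>SLURM_ARRAY_TASK_ID</code> and <code>SLURM_ARRAY_TASK_COUNT</code> environment variables that Slurm sets for array tasks. The input file names and the helper <code>chunk_for_task</code> are hypothetical, and the sketch assumes that array indices start at 0.

```python
import os


def chunk_for_task(items, task_id, n_tasks):
    """Return the items that array task `task_id` should process.

    Striding assigns item i to task (i % n_tasks), so every item is
    handled by exactly one task. Assumes task ids run from 0 to n_tasks - 1.
    """
    return items[task_id::n_tasks]


def main():
    # Slurm sets these variables inside each array task. The defaults
    # let the script run as a single task outside of Slurm.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    n_tasks = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", "1"))

    # Hypothetical list of inputs; replace with your own files or parameters.
    inputs = [f"input_{i:03d}.tsv" for i in range(10)]

    for name in chunk_for_task(inputs, task_id, n_tasks):
        # Stand-in for the real per-file work.
        print(f"task {task_id} would process {name}")


if __name__ == "__main__":
    main()
```

A script like this would typically be submitted as an array with the <code>--array</code> option of <code>sbatch</code> (for example <code>--array=0-3</code> for four tasks); the remaining options depend on your account and partition.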
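
The advice above about catching the checkpoint signal, saving progress, and restarting can be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes the scheduler delivers <code>SIGTERM</code> before stopping a job (the signal actually used on hyak may differ, so check the hyak documentation), and the file name <code>progress.json</code>, the <code>Checkpointer</code> class, and the toy workload are all hypothetical.

```python
import json
import os
import signal


class Checkpointer:
    """Keeps job progress in a small JSON file so a restart can resume."""

    def __init__(self, path):
        self.path = path
        self.stop_requested = False

    def request_stop(self, signum=None, frame=None):
        # Signal handler: only set a flag; the main loop exits cleanly.
        self.stop_requested = True

    def load(self):
        # Resume from the saved state if a previous run left one behind.
        if os.path.exists(self.path):
            with open(self.path) as f:
                return json.load(f)
        return {"next_item": 0, "total": 0}

    def save(self, state):
        # Write to a temporary file and rename it, so an interruption
        # during the write cannot leave a corrupt checkpoint.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self.path)


def run(ckpt, n_items):
    """Process items until done or until a stop has been requested."""
    state = ckpt.load()
    while state["next_item"] < n_items and not ckpt.stop_requested:
        state["total"] += state["next_item"]  # stand-in for real work
        state["next_item"] += 1
        ckpt.save(state)  # real jobs may prefer to save less often
    return state


if __name__ == "__main__":
    ckpt = Checkpointer("progress.json")  # hypothetical checkpoint file
    # Assumption: the job receives SIGTERM shortly before it is stopped.
    signal.signal(signal.SIGTERM, ckpt.request_stop)
    print(run(ckpt, n_items=100))
```

Because the state is reloaded at startup, rerunning the same command after an interruption continues from the last saved item instead of starting over.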

