Editing CommunityData:Hyak
From CommunityData
Warning: You are not logged in. Your IP address will be publicly visible if you make any edits. If you log in or create an account, your edits will be attributed to your username, along with other benefits.
The edit can be undone. Please check the comparison below to verify that this is what you want to do, and then publish the changes below to finish undoing the edit.
Latest revision | Your text | ||
Line 180: | Line 180: | ||
# If you find yourself spending time manually rerunning code in a multistage project, learn [https://en.wikipedia.org/wiki/Make_(software) Make] or another pipeline tool. Such tools take some effort but really help you organize, test, and refine your project. Make is a good choice because it is old and incredibly polished and featureful. You don't need to learn every feature, just the basics. Its interface has a different flavor than more recently designed tools which can be a downside. Other positives are that it is language agnostic and can run shell commands. | # If you find yourself spending time manually rerunning code in a multistage project, learn [https://en.wikipedia.org/wiki/Make_(software) Make] or another pipeline tool. Such tools take some effort but really help you organize, test, and refine your project. Make is a good choice because it is old and incredibly polished and featureful. You don't need to learn every feature, just the basics. Its interface has a different flavor than more recently designed tools which can be a downside. Other positives are that it is language agnostic and can run shell commands. | ||
# [https://slurm.schedmd.com/documentation.html Slurm] the system that you use to access hyak nodes, is also a very powerful system. The hyak team used to maintain a tool called parallel-sql which helped with running a large number of short-running programs. This tool is no longer supported, but [https://slurm.schedmd.com/job_array.html job arrays] are slurm feature that is even better. | # [https://slurm.schedmd.com/documentation.html Slurm] the system that you use to access hyak nodes, is also a very powerful system. The hyak team used to maintain a tool called parallel-sql which helped with running a large number of short-running programs. This tool is no longer supported, but [https://slurm.schedmd.com/job_array.html job arrays] are slurm feature that is even better. | ||
# Use the free resources. Job arrays (mentioned above) are great in combination with the [https://wiki.cac.washington.edu/display/hyakusers/Mox_checkpoint checkpoint queue]. The checkpoint (or ckpt) queue runs your jobs on other people's idle nodes. You can access thousands of cores and terabytes of RAM on the checkpoint queue. There are limitations. If the owner of a node wants to use it, they will cancel your job. If this happens, the scheduler will automatically restart it, and it has a maximum total running time (restarts don't reset the clock). Therefore, it is best suited for jobs that can be paused (saved) and restarted. If you can design a script to catch the checkpoint signal, save progress, and restart you will be able to make excellent use of the checkpoint queue | # Use the free resources. Job arrays (mentioned above) are great in combination with the [https://wiki.cac.washington.edu/display/hyakusers/Mox_checkpoint checkpoint queue]. The checkpoint (or ckpt) queue runs your jobs on other people's idle nodes. You can access thousands of cores and terabytes of RAM on the checkpoint queue. There are limitations. If the owner of a node wants to use it, they will cancel your job. If this happens, the scheduler will automatically restart it, and it has a maximum total running time (restarts don't reset the clock). Therefore, it is best suited for jobs that can be paused (saved) and restarted. If you can design a script to catch the checkpoint signal, save progress, and restart you will be able to make excellent use of the checkpoint queue. <br /> There is also virtually [https://hyak.uw.edu/docs/storage/gscratch/ unlimited free storage] on hyak under <code>/gscratch/scrubbed/comdata</code> with the catch that the storage is much slower and that files will be automatically deleted after a short time (currently 21 days). | ||
# Get connected to the hyak team and other hyak users. Hyak isn't perfect and has many recent issues related to the new Klone system. If you run into trouble and it feels like the system isn't working you should email help@uw.edu with a subject line that starts with "hyak:". They are nice and helpful. Other good resources are the [https://mailman12.u.washington.edu/mailman/listinfo/hyak-users mailing list] and if you are a UW student, the [https://depts.washington.edu/uwrcc/getting-started-2/getting-started/ research computing club]. The club has its own nodes, including GPU nodes that only students who join the club can use. | # Get connected to the hyak team and other hyak users. Hyak isn't perfect and has many recent issues related to the new Klone system. If you run into trouble and it feels like the system isn't working you should email help@uw.edu with a subject line that starts with "hyak:". They are nice and helpful. Other good resources are the [https://mailman12.u.washington.edu/mailman/listinfo/hyak-users mailing list] and if you are a UW student, the [https://depts.washington.edu/uwrcc/getting-started-2/getting-started/ research computing club]. The club has its own nodes, including GPU nodes that only students who join the club can use. | ||