CommunityData:Hyak Ikt (Deprecreated)
To use Hyak, you must first have a UW NetID, access to Hyak, and a two-factor authentication token. Details on getting set up with all three are available at [[CommunityData:Hyak setup]].
The basic workflow is:
1. Prepare the code, and test it with a single file (either on your computer, or on an interactive node).
2. Write a job_script file. This tells the node what job to run. There is an example on the Parallel SQL wiki page (linked above), and an example in the wikiresearch/hyak_example directory.
3. Create a task_list file. This is a list of commands that should be run, with one line per file that the command should operate on. An example file might look something like:
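If you have many input files, you may prefer to generate the task_list rather than type it by hand. The following is only a rough sketch in Python, not the method used in wikiresearch/hyak_example: the helper names build_task_list and write_task_list, the script name process_file.py, and the data/ paths are all hypothetical and stand in for whatever your own job uses.

```python
import shlex


def build_task_list(paths, command_template):
    """Return one shell command per input path, in sorted order.

    command_template is a hypothetical template containing "{path}",
    for example "python3 process_file.py {path}".
    """
    # shlex.quote protects paths that contain spaces or shell metacharacters.
    return [
        command_template.format(path=shlex.quote(str(p)))
        for p in sorted(paths)
    ]


def write_task_list(tasks, out_path):
    """Write the commands to a file, one command per line."""
    with open(out_path, "w") as f:
        for task in tasks:
            f.write(task + "\n")
```

For example, build_task_list(["data/b.json", "data/a.json"], "python3 process_file.py {path}") gives one line per file, and write_task_list(tasks, "task_list") saves them. Check the generated file by eye before submitting the job.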
Note that only four user accounts at a time can have the bits necessary to kill other people's jobs, so while you can do this on your own jobs, you'll need to bother the IRC channel to find help cancelling others' jobs (we think that Jeremy, Nate, Aaron, and Mako currently have the bits). Also, check out the [http://docs.adaptivecomputing.com/maui/commands/mjobctl.php documentation for mjobctl] for more info.
== Common Troubles and How to Solve Them ==
=== Help! I'm over CPU Quota and Hyak is Angry! ===
# Don't panic. Everyone has done this at least once. It is spammy and a little difficult to deal with, but it can be solved. You are not in trouble.
# This usually happens because you've accidentally run something on a login node that ought to be run on a compute node. The solution is to find the badly behaved process and then use kill to kill it.
# If it's a script or command running in your terminal, press control-c to kill it. If you backgrounded it, type fg to bring it to the foreground and then press control-c. But if you ran parallel, you'll need to kill parallel itself.
# ps -faux | grep <your username> will show you everything you are running (or have someone else run it for you if the spam is so bad that you can't get a command to run). The first column has the usernames, the second column has the process IDs, and the last column has the things you're running. In the screenshot, the red smudge is the (obscured) username, followed by the process ID. The last three entries on each line are the time the process started (in Hyak time; type date if you want to compare Hyak time to your own), then how much CPU time the process has consumed, then the command, drawn as a little diagram of parent and child processes. You want parallel (in the example, 9977). Killing the child process (in the example, 9992) probably won't help, because parallel will just go on to the next task you queued up for it.
# kill <process id>
[[File:faux.jpg]]
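If the ps listing is too long to read comfortably, the search for parallel can be scripted. The sketch below is an illustration only, written in Python: the function names find_parallel_pids and my_parallel_pids are made up for this page, and it assumes that parallel appears as the first or second word of the command line (for example "parallel ..." or "perl /usr/bin/parallel ..."). It only reports process IDs; it does not kill anything.

```python
import os
import subprocess


def find_parallel_pids(ps_output):
    """Return PIDs whose command line looks like GNU parallel.

    Expects lines of the form "<pid> <ppid> <command line...>".
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        pid, _ppid, args = fields
        # Assumption: parallel is the first or second word of the command,
        # e.g. "parallel -j 4 ..." or "perl /usr/bin/parallel -j 4 ...".
        words = args.split()[:2]
        if any(os.path.basename(w) == "parallel" for w in words):
            pids.append(int(pid))
    return pids


def my_parallel_pids(user):
    """Run ps for one user and return the PIDs that look like parallel."""
    # "-o name=" prints the column without a header line.
    out = subprocess.run(
        ["ps", "-u", user, "-o", "pid=", "-o", "ppid=", "-o", "args="],
        capture_output=True, text=True, check=True,
    ).stdout
    return find_parallel_pids(out)
```

Calling my_parallel_pids("<your username>") returns a list of candidate process IDs to pass to kill. The match is a heuristic (a command such as grep parallel would also match), so compare the result against the ps output before killing anything.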