CommunityData:Hyak Ikt (Deprecreated)
Note that only four user accounts at a time can have the bits necessary to kill other people's jobs, so while you can do this on your own jobs, you'll need to bother the IRC channel to find help cancelling others' jobs (we think that Jeremy, Nate, Aaron, and Mako currently have the bits). Also, check out the [http://docs.adaptivecomputing.com/maui/commands/mjobctl.php documentation for mjobctl] for more info.
== Common Troubles and How to Solve Them ==
=== Help! I'm over CPU Quota and Hyak is Angry! ===
# Don't panic. Everyone has done this at least once. It is spammy and a little difficult to deal with, but it can be solved. You are not in trouble.
# The usual cause is that you've accidentally run something on a login node that ought to be run on a compute node. The solution is to find the badly behaved process and then use kill to kill it.
# If it's a script or command on your command line, press control-c to kill it. If you backgrounded it, type fg to foreground it and then press control-c. But if you ran parallel, you'll need to kill parallel itself.
# ps -faux | grep <your username> will show you everything you are running (or have someone else run it for you if the spam is so terrible that you can't get a command to run). The first column has the usernames, the second column has the process IDs, and the last column has the things you're running. In the screenshot, the red text is the username being grepped for. The last three entries on each line are the time the process started (in Hyak time; type date if you want to compare Hyak time to your own), then how much CPU time the process has consumed, then a little diagram of parent and child processes. You want parallel (in the example, 9977). Killing the child process (in the example, 9992) likely won't help, because parallel will just go on to the next task you queued up for it.
# kill <process id>
[[File:faux.jpg]]
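The steps above can be sketched as a short shell session. This is only a sketch under a few assumptions: that the login node has the usual Linux procps tools (ps and pgrep), and that your runaway parent process shows up under the name parallel. The variable names me and parallel_pids are made up for illustration. The kill line is left commented out so that nothing is killed until you have checked the process IDs yourself.

```shell
#!/bin/sh
# Sketch only: assumes a Linux login node with procps `ps` and `pgrep`.

# Your username on the system (hypothetical variable name).
me="$(id -un)"

# List only your own processes as a parent/child tree:
# PID, parent PID, start time, CPU time consumed, and the command.
ps -u "$me" -o pid,ppid,stime,time,args --forest

# Look for processes you own whose name is exactly "parallel".
# Assumption: the parent appears under that name; if nothing is found,
# fall back to reading the tree printed above.
parallel_pids="$(pgrep -u "$me" -x parallel || true)"
echo "parallel PIDs: ${parallel_pids:-none found}"

# After checking the PIDs against the tree, kill the parent, not the children:
# kill $parallel_pids
```

Plain kill sends a polite termination signal (SIGTERM). If the process ignores it, kill -9 followed by the process ID forces it to stop, but try plain kill first.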