Editing CommunityData:Hyak tutorial

This file provides a complete, step-by-step walk-through for how to parse a list of Wikia wikis with wikiq. The same principles can be followed for other tasks.

== Things you should know before you start ==

* Computing paradigms: [[:wikipedia:High performance computing|HPC]] versus [[:wikipedia:MapReduce|MapReduce/Hadoop]]
* [https://wiki.cac.washington.edu/display/hyakusers/WIKI+for+Hyak+users#WIKIforHyakusers-HyakOverview ikt versus mox] and the transition
** This material will cover getting setup on the older ikt cluster
** Our mox cluster is online and we will migrating to it in late 2019/early 2020

== Connecting to Hyak ==

Details information on setting up Hyak is covered [[CommunityData:Hyak]]. Make sure you have:

* Set up SSH
* Connected to Hyak
* Set up your user's Hyak environment with the CDSC aliases and tools

== Interactive Jobs ==

Getting familiar with the MOX scheduler:

* Check out UW-IT's detailed documentation on using the [https://wiki.cac.washington.edu/display/hyakusers/Mox_scheduler Hyak wiki]
* For reference, the system that UW uses is called [https://slurm.schedmd.com/documentation.html Slurm] and you can find lots of other information on it online.
* Some useful commands are:
* <code>sinfo -p comdata</code> — information about our allocation
* <code>squeue -p comdata</code> — information about our current usage
* <code>squeue -u '''makohill'''</code> — information about your current jobs
* <code>hyakalloc</code> — general information on allocations in Hyak

Running interactive jobs is relatively straight forward:

# Run [https://linux.die.net/man/1/screen screen] or [https://www.hamvocke.com/blog/a-quick-and-easy-guide-to-tmux/ tmux] to maintain connections over time ([[CommunityData:tmux|CDSC tmux cheatsheet]])
# Four ways to check out nodes:
:* <code>int_machine</code> — interactive machine (shared with the group) '''[USE THIS FIRST!]'''
:* <code>any_machine</code> — dedicated interactive machine
:* <code>big_machine</code> — dedicated interactive machine with large amounts of memory
:* <code>build_machine</code> — interactive machine with an Internet connection for building R modules and so on

=== Running a job across many cores using GNU R's parallelization features ===

The hyak machines have 16 cpu cores.  The Mox machines will have 28! Running your program on all the cores can speed things up a lot! We make heavy use of R for building datasets and for fitting models. Like most programming languages, R uses only one cpu by default. However, for typical computation-heavy data science tasks it is pretty easy to make R use all the cores. 

For fitting models, the R installed should use all cores automatically. This is thanks to OpenBlas, which is a numerical library that implements and parallelizes linear algebra routines like matrix factorization, matrix inversion, and other operations that bottleneck model fitting. 

However, for building datasets, you need to do a little extra work. One common strategy is to break up the data into independent chunks (for example, when building wikia datasets there is one input file for each wiki) and then use <code>mcapply</code> from <code>library(parallel)</code> to build variables from each chunk. Here is an example: 

    library(parallel)
    options(mc.cores=detectCores())  ## tell R to use all the cores
    
    mcaffinity(1:detectCores()) ## required and explained below 
   
    library(data.table) ## for rbindlist, which concatenates a list of data.tables into a single data.table
    
    ## imagine defining a list of wikis to analyze
    ## and a function to build variables for each wiki
    source("wikilist_and_buildvars")
    
    dataset <- rbindlist(mclapply(wikilist,buildvars))
    
    mcaffinity(rep(1,detectCores())) ## return processor affinities to the status preferred by OpenBlas

A working example can be found in the [[Message Walls]] git repository. 

<code>mcaffinity(1:detectCores())</code> is required for the R <code>library(parallel)</code> to use multiple cores. The reason is technical and has to do with OpenBlas. Essentially, OpenBlas changes settings that govern how R assigns processes to cores. OpenBlas wants all processes assigned to the same core, so that the other cores do not interfere with it's fancy multicore linear algebra. However, when building datasets, the linear algebra is not typically the bottleneck. The bottleneck is instead operations like sorting and merging that OpenBlas does not parallelize. 

The important thing to know is that if you want to use mclapply, you need to do <code>mcaffinity(1:detectCores())</code>. If you want to then fit models you should do <code>mcaffinity(rep(1,detectCores())</code> so that OpenBlas can do its magic.

=== Running jobs across many cores with GNU parallel ===

Generate a task list:

 $ find ./input/ -mindepth 1 | xargs -I {} echo "python3 /com/local/bin/wikiq {} -o ./output" > task_list

Run:

 $ parallel < task_list

Connect to your node with ssh to check on it:

 $ ssh '''n0648'''
 $ htop

== Batch Jobs ==

{{notice|This information is not fully updated yet. We'll cover this next week!}}

=== Setup for running batch jobs on Hyak (only need to be done once) ===

Create a users directory for yourself in /com/users:

You will want to store the output of your script in /com/, or you will run out of space in your personal filesystem (/usr/lusers/...)

 $ mkdir /com/users/USERNAME  # Replace USERNAME with your user name

2. Create a batch_jobs directory

 $ mkdir /com/users/USERNAME/batch_jobs

3. Create a symlink from your home directory to this directory (this lets you use the /com storage from the more convenient home directory)

 $ ln -s /com/users/USERNAME/batch_jobs ~/batch_jobs

4. Create a user in parallel SQL

 $ sudo pssu --initial
 $ [sudo] password for USERID: <Enter your UW NetID password>

=== Project-specific steps (done for each project) ===

1. Create a new project in your batch_jobs directory

 $ mkdir ~/batch_jobs/wikiq_test
 $ cd ~/batch_jobs/wikiq_test

2. Create a symlink to the data that you will be using as an input (in this case, the 2010 wikia dump)

 $ ln -s /com/raw_data/wikia_dumps/2010-04-mako  ./input

3. Create an output directory

 $ mkdir ./output

4. Test to make sure everything is working well, and everything is where it should be, run wikiq on one file

 $ python3 /com/local/bin/wikiq ./input/012thfurryarmybrigade.xml.7z -o ./output

This should provide some output in the terminal, and should create a file at ~/batch_jobs/wikiq_test/output/012thfurryarmybrigade.tsv. You should examine this file to make sure it looks as expected

When you're done, remove it

 $ rm ./output/*

5. Now we'll use that command as a template for creating a task_list. This is a file with a line for each command we would like our job to run. In this case, we'll use the terminal to find a list of all of the wiki files, which we will pipe to xargs. xargs takes each file name, and uses echo to insert it into the command. Each line is then written to the task_list file.

 $ find ./input/ -mindepth 1 | xargs -I {} echo "python3 /com/local/bin/wikiq {} -o ./output" >  task_list

This will create a file named task_list. Make sure it is as large as expected (it should have 76471 lines) (Note: this will take a while - approx. 1 minute.)

 $ wc -l task_list
        
You can also visually inspect it, to make sure that it looks like it should.

6. Copy the job_script from this directory

 $ cp /PATH/TO/wikiresearch/hyak_example/job_script ~/batch_jobs/wikiq_test/job_script

7. Edit the job_script. https://sig.washington.edu/itsigs/Hyak_parallel-sql has a good example script, with explanations for what each piece does. For our project, you should just change the following two lines, to your user name

 #PBS -o /usr/lusers/USERNAME/batch_jobs/wikiq_test
 #PBS -d /usr/lusers/USERNAME/batch_jobs/wikiq_test

You can do this with vim, or you can just run the following:

 $ sed -i -e 's/USERNAME/<Your User Name>/' job_script

The other part of this file that you will often have to change is the walltime. This is how long you want to have the node assigned to your job. For long jobs, you will need to increase this parameter.

8. Load up 100 tasks into Parallel SQL, as a test. You want to make sure that everything is working end-to-end before trying it on the whole set of files.

 $ module load parallel_sql
 $ cat task_list | head -n 100 | psu --load

Check to make sure that they loaded correctly (they should show up as 100 available jobs)

 $ psu --stats 

9. Check to make sure there are available nodes

 $ showq -w group=hyak-mako

We have 8 nodes currently, so subtract the number of active jobs from 8, and that is the number of available nodes.

10. Run the jobs on the available nodes.

 $ for job in $(seq 1 N); do qsub job_script; done

Replace "N" with the number of available nodes

11. Make sure things are working correctly

 $ watch showq -w group=hyak-mako

This lets you watch to make sure that your jobs are assigned to nodes correctly. Once they are assigned, Ctrl+c gets you out of watch, and you can watch the task list in Parallel SQL

 $ watch psu --stats

This lets you watch the task list. You should see the tasks move from available to completed. When they are all completed, run

 $ ls ./output | wc -l

This checks to make sure all 100 files were written to the output folder. You probably want to also look at a few files, to make sure they look as expected.

If everything looks good, then remove the output files

 $ psu --del-com

12. Finally, run the jobs over the full set of files

 $ cat task_list | psu --load
 $ psu --stats      # Should show all 76471 tasks
 $ showq -w group=hyak-mako     # Find out how many nodes are available
 $ for job in $(seq 1 N); do qsub job_script; done         # Replace N with the nodes available

Keep an eye on the tasks with 

 $ watch showq -w group=hyak-mako 

and 

$ watch psu --stats
@@ Line 31: / Line 31: @@
 # Run [https://linux.die.net/man/1/screen screen] or [https://www.hamvocke.com/blog/a-quick-and-easy-guide-to-tmux/ tmux] to maintain connections over time ([[CommunityData:tmux|CDSC tmux cheatsheet]])
-# We have four pre-defined ways to check out nodes using aliases in the shared bashrc:
+# Four ways to check out nodes:
-:* <code>int_machine</code> — interactive machine (cpu cores of this machine are shared with the group, but memory you allocate will be dedicated to you) '''[USE THIS FIRST!]'''.
+:* <code>int_machine</code> — interactive machine (shared with the group) '''[USE THIS FIRST!]'''
 :* <code>any_machine</code> — dedicated interactive machine
 :* <code>big_machine</code> — dedicated interactive machine with large amounts of memory
 :* <code>build_machine</code> — interactive machine with an Internet connection for building R modules and so on
-=== Solving common mox node woes ===
-==== I'm getting errors about not being able to access the Internet from my node ====
-Only machines using the build_machine profile have Internet access. You can install new software using a build node and it'll be immediately available on other nodes.
-==== It looks like there are no nodes available? ====
-# Try the int_machine
-# Try a shorter lease time. When you check out an interactive node with <code>any_machine</code>, for example, you're essentially leasing it for some amount of time, and if that time overlaps with a scheduled Hyak maintenance, your command will hang in the terminal. To change the lease time, first see the contents of our alias command by typing <code>which <node-alias></code>, i.e. <code>which any_machine</code>, and you'll see a <code>--time=$walltime</code> flag. <code>echo $walltime</code> will tell you that current walltime is a large number, like 200:00:00 -- 200 hours! Copy-paste the alias contents (starting with srun...., no single-quotes) and set the time to something smaller.
-# Check ourjobs to see who is using the other nodes, and ask on the irc channel to see if anyone can free up a node.
-==== My job on int_machine is getting killed, or doesn't have enough memory ====
-When you request an int_machine with srun, the default is 24G. Try a higher number. Technically the max is 240G but that would mean no one else in the group can have any memory if they need to access an int_machine....so ask for no more than 216G unless you're able to vacate the node right away if asked.
-==== My job is running out of time and I'd like to add more ====
-{{forthcoming}}
 === Running a job across many cores using GNU R's parallelization features ===
-The Mox machines have 28 cores. Running your program on all the cores can speed things up a lot. We make heavy use of R for building datasets and for fitting models. Like most programming languages, R uses only one cpu by default. However, for typical computation-heavy data science tasks it is pretty easy to make R use all the cores.
+The hyak machines have 16 cpu cores.  The Mox machines will have 28! Running your program on all the cores can speed things up a lot! We make heavy use of R for building datasets and for fitting models. Like most programming languages, R uses only one cpu by default. However, for typical computation-heavy data science tasks it is pretty easy to make R use all the cores.
 For fitting models, the R installed should use all cores automatically. This is thanks to OpenBlas, which is a numerical library that implements and parallelizes linear algebra routines like matrix factorization, matrix inversion, and other operations that bottleneck model fitting.
@@ Line 106: / Line 87: @@
 === Setup for running batch jobs on Hyak (only need to be done once) ===
-Create a users directory for yourself in /gscratch/comdata/users:
+Create a users directory for yourself in /com/users:
-You will want to store the output of your script in /gscratch/comdata, or you will run out of space in your personal filesystem (/usr/lusers/...)
+You will want to store the output of your script in /com/, or you will run out of space in your personal filesystem (/usr/lusers/...)
-  $ mkdir /gscratch/comdata/users/USERNAME  # Replace USERNAME with your user name
+  $ mkdir /com/users/USERNAME  # Replace USERNAME with your user name
 . Create a batch_jobs directory
-  $ mkdir /gscratch/comdata/users/USERNAME/batch_jobs
+  $ mkdir /com/users/USERNAME/batch_jobs
 . Create a symlink from your home directory to this directory (this lets you use the /com storage from the more convenient home directory)
-  $ ln -s /gscratch/comdata/users/USERNAME/batch_jobs ~/batch_jobs
+  $ ln -s /com/users/USERNAME/batch_jobs ~/batch_jobs
 . Create a user in parallel SQL