CommunityData:Hyak tutorial

Revision as of 19:04, 2 August 2019

This file provides a complete, step-by-step walk-through for how to parse a list of Wikia wikis with wikiq. The same principles can be followed for other tasks.

Things you should know before you start

  • Computing paradigms: HPC versus MapReduce/Hadoop
  • ikt versus mox and the transition
    • This material covers getting set up on the older ikt cluster
    • Our mox cluster is online and we will be migrating to it in late 2019/early 2020

Connecting to Hyak

Detailed information on setting up Hyak is covered in CommunityData:Hyak. Make sure you have:

  • Set up SSH
  • Connected to Hyak
  • Set up your user's Hyak environment with the CDSC aliases and tools

Interactive Jobs

Getting familiar with the MOX scheduler:

  • Check out UW-IT's detailed documentation on using the scheduler [1]
  • For reference, the system that UW uses is called Slurm, and you can find lots of other information on it online.
  • Some useful commands are:
    • sinfo -p comdata: information about our allocation
    • squeue -p comdata: information about our current usage
    • squeue -u makohill: information about your current jobs (replace makohill with your own user name)
    • hyakalloc: general information on allocations in Hyak

Running interactive jobs is relatively straightforward:

  1. Run screen or tmux to maintain connections over time
  2. Check out a node in one of four ways:
    • int_machine: interactive machine (shared with the group) [USE THIS FIRST!]
    • any_machine: dedicated interactive machine
    • big_machine: dedicated interactive machine with large amounts of memory
    • build_machine: interactive machine with an Internet connection for building R modules and so on

Running a job across many cores using GNU R's parallelization features

Running jobs across many cores with GNU parallel

Generate a task list:

$ find ./input/ -mindepth 1 | xargs -I {} echo "python3 /com/local/bin/wikiq {} -o ./output" > task_list

Run:

$ parallel < task_list

Connect to your node with ssh to check on it:

$ ssh n0648
$ htop
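
To see what parallel does with a task list before pointing it at real data, a small sketch with harmless stand-in commands can help. The scratch directory, file names, and echo commands below are illustrative assumptions, not part of the Hyak setup. The fallback to plain sh is only there so the sketch also runs where GNU parallel is not installed.

```shell
# Scratch task list with three harmless commands
# (hypothetical stand-ins for the wikiq calls).
work=$(mktemp -d)
for name in a b c; do
    echo "echo processed $name > $work/$name.out"
done > "$work/task_list"

# GNU parallel runs each line of its standard input as one shell command.
# If GNU parallel is not installed, run the lines one after another instead.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    parallel < "$work/task_list"
else
    sh "$work/task_list"
fi

# Each task should have left one .out file behind.
ls "$work"
```

GNU parallel's -j option limits the number of simultaneous jobs, and --dry-run prints the commands without running them, which can be handy for a first look at a long task list.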

Batch Jobs

This information is not fully updated yet. We'll cover this next week!

Setup for running batch jobs on Hyak (only needs to be done once)

1. Create a user directory for yourself in /com/users

$ mkdir /com/users/USERNAME  # Replace USERNAME with your user name

You will want to store the output of your scripts in /com/, or you will run out of space in your personal filesystem (/usr/lusers/...).

2. Create a batch_jobs directory

$ mkdir /com/users/USERNAME/batch_jobs

3. Create a symlink from your home directory to this directory (this lets you use the /com storage from the more convenient home directory)

$ ln -s /com/users/USERNAME/batch_jobs ~/batch_jobs

4. Create a user in Parallel SQL

$ sudo pssu --initial
[sudo] password for USERID: <Enter your UW NetID password>
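
Steps 1 to 3 can also be written so that they are safe to re-run. The sketch below uses scratch directories so it can be tried anywhere: COM_ROOT and HOME_DIR are hypothetical stand-ins for /com/users/USERNAME and your home directory. The only changes from the commands above are mkdir -p and ln -sfn.

```shell
# Stand-ins so the sketch runs anywhere; on Hyak these would be
# /com/users/USERNAME and your home directory (assumption for illustration).
COM_ROOT=$(mktemp -d)/com/users/USERNAME
HOME_DIR=$(mktemp -d)

# Steps 1 and 2 in one call: -p creates missing parents and
# does not complain if the directory already exists.
mkdir -p "$COM_ROOT/batch_jobs"

# Step 3: -s makes a symlink, -f replaces an existing link,
# -n avoids descending into an existing link to a directory.
ln -sfn "$COM_ROOT/batch_jobs" "$HOME_DIR/batch_jobs"

# Confirm that the link points at the storage directory.
ls -ld "$HOME_DIR/batch_jobs"
```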

Project-specific steps (done for each project)

1. Create a new project in your batch_jobs directory

$ mkdir ~/batch_jobs/wikiq_test
$ cd ~/batch_jobs/wikiq_test

2. Create a symlink to the data that you will be using as input (in this case, the 2010 Wikia dump)

$ ln -s /com/raw_data/wikia_dumps/2010-04-mako  ./input

3. Create an output directory

$ mkdir ./output

4. To test that everything is working and is where it should be, run wikiq on one file

$ python3 /com/local/bin/wikiq ./input/012thfurryarmybrigade.xml.7z -o ./output

This should print some output in the terminal and create a file at ~/batch_jobs/wikiq_test/output/012thfurryarmybrigade.tsv. Examine this file to make sure it looks as expected.

When you're done, remove it:

$ rm ./output/*

5. Now we'll use that command as a template for creating a task_list. This is a file with a line for each command we would like our job to run. In this case, we'll use the terminal to find a list of all of the wiki files, which we will pipe to xargs. xargs takes each file name, and uses echo to insert it into the command. Each line is then written to the task_list file.

$ find ./input/ -mindepth 1 | xargs -I {} echo "python3 /com/local/bin/wikiq {} -o ./output" >  task_list

This will create a file named task_list. Generating it takes a while (approximately 1 minute). Make sure it is as large as expected; it should have 76471 lines:

$ wc -l task_list

You can also visually inspect it, to make sure that it looks like it should.
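
One way to check the size is to compare the task list against the number of input files rather than against a remembered figure. The sketch below does this on a scratch directory with three dummy files. The file names are made up, and the grep pattern assumes every line has exactly the form produced by the echo template in step 5.

```shell
# Scratch project with three dummy input files (hypothetical names).
demo=$(mktemp -d)
mkdir "$demo/input" "$demo/output"
touch "$demo/input/a.xml.7z" "$demo/input/b.xml.7z" "$demo/input/c.xml.7z"
cd "$demo" || exit 1

# Same pipeline as in step 5: one wikiq command per input file.
find ./input/ -mindepth 1 | xargs -I {} echo "python3 /com/local/bin/wikiq {} -o ./output" > task_list

# The task list should have one line per input file.
n_inputs=$(find ./input/ -mindepth 1 | wc -l)
n_tasks=$(wc -l < task_list)
echo "inputs=$((n_inputs)) tasks=$((n_tasks))"

# Count lines that deviate from the template (grep exits non-zero
# when the count is 0, hence the "|| true").
n_bad=$(grep -vc '^python3 /com/local/bin/wikiq \./input/.* -o \./output$' task_list || true)
echo "malformed=$n_bad"
```

Nothing is executed by this pipeline; it only writes the commands out, so it is safe to repeat until the counts agree.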

6. Copy the example job_script from the wikiresearch/hyak_example directory

$ cp /PATH/TO/wikiresearch/hyak_example/job_script ~/batch_jobs/wikiq_test/job_script

7. Edit the job_script. https://sig.washington.edu/itsigs/Hyak_parallel-sql has a good example script, with explanations of what each piece does. For our project, you only need to replace USERNAME with your user name in the following two lines:

#PBS -o /usr/lusers/USERNAME/batch_jobs/wikiq_test
#PBS -d /usr/lusers/USERNAME/batch_jobs/wikiq_test

You can do this with vim, or you can just run the following:

$ sed -i -e 's/USERNAME/<Your User Name>/' job_script

The other part of this file that you will often have to change is the walltime. This is how long you want to have the node assigned to your job. For long jobs, you will need to increase this parameter.
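
The substitution in step 7 can be previewed before the file is changed in place. The sketch below works on a hypothetical, cut-down job_script: the two -o and -d lines are the ones quoted above, while the walltime line uses the usual PBS form and is an assumption about what the real script contains. The user name alice is a placeholder.

```shell
# A hypothetical, cut-down job_script with only the lines discussed above.
script=$(mktemp)
cat > "$script" <<'EOF'
#PBS -l walltime=02:00:00
#PBS -o /usr/lusers/USERNAME/batch_jobs/wikiq_test
#PBS -d /usr/lusers/USERNAME/batch_jobs/wikiq_test
EOF

# Without -i, sed prints the result and leaves the file untouched.
# "alice" is a placeholder user name.
sed -e 's/USERNAME/alice/' "$script"

# The original file still contains the placeholder on two lines.
grep -c 'USERNAME' "$script"

# Show the walltime request, the line to edit for long jobs.
grep 'walltime' "$script"
```

Once the preview looks right, the same expression with -i applies the change to the file, as in the command above.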

8. Load up 100 tasks into Parallel SQL, as a test. You want to make sure that everything is working end-to-end before trying it on the whole set of files.

$ module load parallel_sql
$ cat task_list | head -n 100 | psu --load

Check to make sure that they loaded correctly (they should show up as 100 available jobs)

$ psu --stats 

9. Check to make sure there are available nodes

$ showq -w group=hyak-mako

We have 8 nodes currently, so subtract the number of active jobs from 8, and that is the number of available nodes.

10. Run the jobs on the available nodes.

$ for job in $(seq 1 N); do qsub job_script; done

Replace "N" with the number of available nodes
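
The arithmetic from step 9 and the loop from step 10 can be combined into a dry run. In the sketch below the number of active jobs is a made-up value that you would read off the showq output yourself, and the loop only prints the qsub commands instead of submitting anything.

```shell
total_nodes=8      # from the tutorial: the group currently has 8 nodes
active_jobs=3      # hypothetical value read off the showq output
available=$((total_nodes - active_jobs))
echo "available nodes: $available"

# Dry run: print one qsub command per available node.
# The guard avoids calling seq with an upper bound below 1.
submitted=0
if [ "$available" -gt 0 ]; then
    for job in $(seq 1 "$available"); do
        echo "qsub job_script"
        submitted=$((submitted + 1))
    done
fi
echo "would submit $submitted jobs"
```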

11. Make sure things are working correctly

$ watch showq -w group=hyak-mako

This lets you watch to make sure that your jobs are assigned to nodes correctly. Once they are assigned, press Ctrl+C to exit watch; you can then watch the task list in Parallel SQL:

$ watch psu --stats

You should see the tasks move from available to completed. When they are all completed, run:

$ ls ./output | wc -l

This checks to make sure all 100 files were written to the output folder. You probably want to also look at a few files, to make sure they look as expected.
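
If the count comes up short, it helps to know which inputs have no output. The sketch below assumes the naming seen in step 4, where NAME.xml.7z produces NAME.tsv, and that every input file ends in .xml.7z. It runs on a scratch directory with made-up file names.

```shell
# Scratch layout: three inputs, but only two outputs (b is "missing").
work=$(mktemp -d)
mkdir "$work/input" "$work/output"
touch "$work/input/a.xml.7z" "$work/input/b.xml.7z" "$work/input/c.xml.7z"
touch "$work/output/a.tsv" "$work/output/c.tsv"

# For each input, derive the expected output name and record it if absent.
missing=""
for f in "$work"/input/*.xml.7z; do
    base=$(basename "$f" .xml.7z)
    [ -e "$work/output/$base.tsv" ] || missing="$missing $base"
done
echo "missing:$missing"
```

In the real project directory the same loop would use ./input and ./output in place of the scratch paths.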

If everything looks good, then remove the output files

$ psu --del-com

12. Finally, run the jobs over the full set of files

$ cat task_list | psu --load
$ psu --stats      # Should show all 76471 tasks
$ showq -w group=hyak-mako     # Find out how many nodes are available
$ for job in $(seq 1 N); do qsub job_script; done         # Replace N with the nodes available

Keep an eye on the tasks with

$ watch showq -w group=hyak-mako 

and

$ watch psu --stats