CommunityData:Hyak walkthrough

Warning: This page is outdated and needs to be updated to use the Slurm scheduler.

This page provides a complete, step-by-step walkthrough of parsing a list of Wikia wikis with wikiq. The same principles apply to other tasks.


Setup steps (only need to be done once)

  1. Create a users directory for yourself in /gscratch/comdata/users
    • Store the output of your scripts under /gscratch/comdata/; otherwise you will run out of space in your personal filesystem (/usr/lusers/...)
    mkdir /gscratch/comdata/users/USERNAME # Replace USERNAME with your user name
  2. Create a batch_jobs directory
    mkdir /gscratch/comdata/users/USERNAME/batch_jobs
  3. Create a symlink from your home directory to this directory (this lets you use the /gscratch/comdata storage from the more convenient home directory)
    ln -s /gscratch/comdata/users/USERNAME/batch_jobs ~/batch_jobs
  4. Create a user in Parallel SQL
    module load parallel_sql
    sudo pssu --initial
    [sudo] password for USERID: <Enter your UW NetID password>
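
Steps 1 through 3 above can be wrapped in a small shell function so that they are safe to rerun. This is only a sketch: the function name setup_batch_jobs and its two arguments are illustrative and not part of Hyak or of this walkthrough, and it uses mkdir -p plus an existence check in place of the plain mkdir and ln -s shown above.

```shell
#!/bin/sh
# Hypothetical helper (not a Hyak tool): create the batch_jobs directory
# and a convenience symlink pointing at it.
#   $1 = your users directory, e.g. /gscratch/comdata/users/USERNAME
#   $2 = where the symlink should live, e.g. ~/batch_jobs
setup_batch_jobs() {
    base=$1
    link=$2

    # -p creates the users directory and batch_jobs in one go, and does
    # not fail if they already exist.
    mkdir -p "$base/batch_jobs" || return 1

    # Do not overwrite anything that already sits at the link path.
    if [ -e "$link" ] || [ -L "$link" ]; then
        echo "skipping symlink: $link already exists" >&2
    else
        ln -s "$base/batch_jobs" "$link"
    fi
}
```

On Hyak the call would look like setup_batch_jobs /gscratch/comdata/users/USERNAME ~/batch_jobs, with USERNAME replaced by your user name.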

Project-specific steps (done for each project)

  1. Create a new project in your batch_jobs directory
    mkdir ~/batch_jobs/wikiq_test
    cd ~/batch_jobs/wikiq_test
  2. Create a symlink to the data that you will be using as input (in this case, the 2010 Wikia dump)
    ln -s /com/raw_data/wikia_dumps/2010-04-mako ./input
  3. Create an output directory
    mkdir ./output
  4. To test that everything is working and everything is where it should be, run wikiq on one file
    python3 /com/local/bin/wikiq ./input/012thfurryarmybrigade.xml.7z -o ./output
    • This should provide some output in the terminal, and should create a file at ~/batch_jobs/wikiq_test/output/012thfurryarmybrigade.tsv. You should examine this file to make sure it looks as expected
  5. When you're done, remove the test output
    rm ./output/*
  6. Now we'll use that command as a template for creating a task_list. This is a file with a line for each command we would like our job to run. In this case, we'll use the terminal to find a list of all of the wiki files, which we will pipe to xargs. xargs takes each file name, and uses echo to insert it into the command. Each line is then written to the task_list file.
    find ./input/ -mindepth 1 | xargs -I {} echo "python3 /com/local/bin/wikiq {} -o ./output" > task_list
    • This will create a file named task_list. Note: this takes a while (approximately 1 minute).
  7. Make sure it is as large as expected (it should have 76471 lines)
    wc -l task_list
    • You can also visually inspect it, to make sure that it looks like it should.
    less task_list
  8. Copy this job_script to your wikiq_test directory
    vi ~/batch_jobs/wikiq_test/job_script
  9. Edit the job_script. https://sig.washington.edu/itsigs/Hyak_parallel-sql has a good example script, with explanations of what each piece does. For this project, you only need to replace USERNAME with your user name.
    • You can do this with vim, or you can just run the following (the g flag replaces every occurrence on a line, not only the first):
    sed -i -e 's/USERNAME/<Your User Name>/g' job_script
    • The other part of this file that you will often have to change is the walltime. This is how long you want to have the node assigned to your job. For long jobs, you will need to increase this parameter.
  10. Load up 100 tasks into Parallel SQL, as a test. You want to make sure that everything is working end-to-end before trying it on the whole set of files.
    module load parallel_sql
    head -n 100 task_list | psu --load
    • Check to make sure that they loaded correctly (they should show up as 100 available jobs)
    psu --stats
  11. Check to make sure there are available nodes
    showq -w group=hyak-mako
    • We currently have 8 nodes, so the number of available nodes is 8 minus the number of active jobs.
  12. Run the jobs on the available nodes.
    for job in $(seq 1 N); do qsub job_script; done # Replace N with the number of available nodes
  13. Make sure things are working correctly
    watch showq -w group=hyak-mako
    • This lets you watch to make sure that your jobs are assigned to nodes correctly. Once they are assigned, Ctrl+c gets you out of watch, and you can watch the task list in Parallel SQL
    watch psu --stats
    • This lets you watch the task list. You should see the tasks move from available to completed. When they are all completed, run
    ls ./output | wc -l
    • This checks to make sure all 100 files were written to the output folder. You probably want to also look at a few files, to make sure they look as expected.
    • If everything looks good, then remove the output files
    rm ./output/*
    • and clean up the Parallel SQL database
    psu --del
  14. Finally, run the jobs over the full set of files
    cat task_list | psu --load
    psu --stats # Should show all 76471 tasks
    showq -w group=hyak-mako # Find out how many nodes are available
    for job in $(seq 1 N); do qsub job_script; done # Replace N with the nodes available
    • Keep an eye on the tasks with
    watch showq -w group=hyak-mako
    and
    watch psu --stats
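
Steps 6 and 7 above (building task_list and checking its length) can also be sketched as two shell functions. This is a variation on the find | xargs command, not a replacement for it: the function names make_task_list and check_task_list are hypothetical, only regular files are listed (subdirectories are skipped), and each path is wrapped in double quotes on the assumption that Parallel SQL hands every task line to a shell. File names containing double quotes, dollar signs, or newlines are not handled.

```shell
#!/bin/sh
# Hypothetical helper: write one wikiq command per input file.
#   $1 = input directory (may be a symlink, as ./input is above)
#   $2 = output directory passed to wikiq's -o option
#   $3 = task list file to create
make_task_list() {
    input_dir=$1
    output_dir=$2
    task_file=$3

    # -L follows the input symlink; -type f keeps regular files only.
    # sort makes the task order reproducible between runs.
    find -L "$input_dir" -type f | sort |
        while IFS= read -r f; do
            printf 'python3 /com/local/bin/wikiq "%s" -o "%s"\n' "$f" "$output_dir"
        done > "$task_file"
}

# Hypothetical helper: succeed only if the task list has exactly one
# line per input file.
#   $1 = input directory, $2 = task list file
check_task_list() {
    expected=$(( $(find -L "$1" -type f | wc -l) ))
    actual=$(( $(wc -l < "$2") ))
    if [ "$expected" -ne "$actual" ]; then
        echo "task list has $actual lines, expected $expected" >&2
        return 1
    fi
}
```

From the project directory, this would be used as make_task_list ./input ./output task_list followed by check_task_list ./input task_list. Comparing against the number of input files avoids hard-coding the expected line count.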