
=== Parallel SQL ===

For big jobs you will want to use multiple nodes. Hyak has a tool called Parallel SQL that makes this easy. Detailed instructions are in [https://sig.washington.edu/itsigs/Hyak_parallel-sql the itsigs parallel-sql documentation]. There is also a [[CommunityData:Hyak walkthrough|full walkthrough example with instructions]].

The basic workflow is:

0. Get set up to run parallel_sql. The first time you use parallel_sql, you will need to run:
   login$ module load parallel_sql
   login$ sudo pssu --initial
   [sudo] password for USERID: <Enter your UW NetID password>

See more information at [https://wiki.cac.washington.edu/display/hyakusers/Hyak+parallel-sql the Hyak parallel-sql page on the hyakusers wiki]. If you have not been initialized, you will get the error "Cannot read database config file '/usr/lusers/<<your username>>/.parallel/db.conf': No such file or directory" when you try to use it.

1. Prepare the code, and test it with a single file (either on your computer, or on an interactive node).

2. Write a job_script file. This tells the node what job to run. There is an example on the Parallel SQL wiki page (linked above), and an example in the wikiresearch/hyak_example directory.

3. Create a task_list file. This is a list of commands that should be run, with one line per file that the command should operate on. An example file might look something like:

 python analysis_script.py -i ./input/wiki_1.tsv -o ./output/wiki_1_analysis.tsv
 python analysis_script.py -i ./input/wiki_2.tsv -o ./output/wiki_2_analysis.tsv
 ...

The README in the hyak_example directory has some example bash commands that you might use to generate this file.

4. Load the task_list into Parallel SQL.

 $ module load parallel_sql
 $ cat task_list | psu --load

5. Run the job_script on as many nodes as you need. When each task is finished, the node will get the next task from Parallel SQL.

 $ for job in $(seq 1 N); do qsub job_script; done 
 # N is the number of nodes

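You can also use the -t flag, which submits all of the jobs in a single qsub call. This makes jobs that use multiple nodes easier to kill, but it is not recommended by "the HYAK people". Options go before the script name, and a range of 1-N starts exactly N jobs (0-N would start N+1).

 $ qsub -t 1-N job_script
 # N is the number of nodes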
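
The pull model in step 5, where each node takes the next task as soon as it finishes its current one, can be sketched with a toy simulation. This is only an illustration of the idea, not how Parallel SQL is implemented: the function name run_workers is hypothetical, threads stand in for nodes, and tasks are recorded rather than executed.

```python
import queue
import threading


def run_workers(tasks, n_workers):
    """Toy model: each worker keeps pulling the next task until none remain."""
    pending = queue.Queue()
    for task in tasks:
        pending.put(task)

    done = {}  # maps each task to the id of the worker that took it
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                task = pending.get_nowait()
            except queue.Empty:
                return  # the queue is drained, so this worker stops
            # A real node would run the shell command here.
            with lock:
                done[task] = worker_id

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


if __name__ == "__main__":
    tasks = [
        "python analysis_script.py -i ./input/wiki_%d.tsv -o ./output/wiki_%d_analysis.tsv" % (i, i)
        for i in range(1, 11)
    ]
    assignments = run_workers(tasks, n_workers=3)
    print(len(assignments), "tasks completed")
```

Because workers pull tasks rather than being handed a fixed share, a node that gets a small file simply moves on to the next task, so the number of nodes does not have to divide the number of tasks evenly.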


To produce your task_list file, you might find it useful to write a Python script that collects a list of files from a directory and writes one command per file into a command file:

 #!/usr/bin/env python3
 # Build a task list: one decompress-and-extract command per 7z dump file.
 import glob
 import os
 outfile = "many_Redir_Runs.txt"
 infile_dir = "/com/raw_data/complete_wmf_dumps-20180220/enwiki-20180301/"
 # Get all the 7z meta-history files, sorted so the task order is reproducible.
 file_list = sorted(glob.glob(os.path.join(infile_dir, "enwiki-20180301-pages-meta-history*.7z")))
 with open(outfile, 'w') as out_handle:
     for path in file_list:
         clean_name = os.path.basename(path)
         # Stream the archive to stdout and pipe it into the extraction script.
         command = "7za x -so " + path + " | python ./01-extract_redirects.py > output/redir/" + clean_name + ".tsv\n"
         out_handle.write(command)
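
The same idea can be applied to the analysis_script.py task list shown in step 3. The following is a minimal sketch, not a tested Hyak recipe: the function name build_task_lines and the "_analysis" output naming are assumptions taken from the example lines above, and shlex.quote is added so that paths containing spaces or shell metacharacters do not break the command.

```python
import os
import shlex


def build_task_lines(input_paths, output_dir, script="analysis_script.py"):
    """Return one shell command per input file, in sorted order."""
    lines = []
    for path in sorted(input_paths):
        # wiki_1.tsv -> wiki_1_analysis.tsv (naming assumed from the example)
        stem, ext = os.path.splitext(os.path.basename(path))
        out_path = os.path.join(output_dir, stem + "_analysis" + ext)
        lines.append(
            "python {} -i {} -o {}".format(
                shlex.quote(script), shlex.quote(path), shlex.quote(out_path)
            )
        )
    return lines


if __name__ == "__main__":
    for line in build_task_lines(
        ["./input/wiki_2.tsv", "./input/wiki_1.tsv"], "./output"
    ):
        print(line)
```

Redirecting the printed lines into task_list gives a file that can be loaded with cat task_list | psu --load, as in step 4.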