     #SBATCH --account=comdata-ckpt
     #SBATCH --partition=ckpt
==== Running wikiq with dmtcp and parallel_sql ====
<big>
WARNING! Follow the example below at your own risk. It may not work reliably and can lead to missing data! Try using dmtcp without parallel-sql instead.
</big>
To run wikiq with parallel-sql, the following need to be arranged:
# A shell script for each dumpfile that makes a workspace for <code>dmtcp</code> to keep its data and restart script.
# These shell scripts loaded into <code>parallel-sql</code>.
# An <code>sbatch</code> script that requests a checkpoint node and starts running jobs from <code>parallel-sql</code>.
# A way to restart jobs that get interrupted, using <code>parallel-sql</code>.
You first need to set up parallel-sql on Hyak: https://wiki.cac.washington.edu/display/hyakusers/Hyak+parallel-sql#Hyakparallel-sql-Usingparallel-sql

Nate made a Python script that generates the scripts and writes a file listing all of them. Notice that each dumpfile gets its own script, its own checkpoint directory, and a line in <code>wikiq_parallel_jobs.sh</code>.
<syntaxhighlight lang='python'>
#!/usr/bin/env python3
from os import path
import os
import stat
import glob
archives = glob.glob("/gscratch/comdata/raw_data/wikia_dumps/2010-04-mako/*.xml.7z")
scripts_dir = '/gscratch/comdata/users/nathante/wikiq_parallel_scripts'
output_dir =  '/gscratch/comdata/users/nathante/wikiq_output'
checkpoint_dir = '/gscratch/comdata/users/nathante/wikiq_checkpoint'
if not path.isdir(scripts_dir):
    os.mkdir(scripts_dir)
if not path.isdir(output_dir):
    os.mkdir(output_dir)
script ="""#!/bin/bash
mkdir -p {0}
cd {0}
start_dmtcp_coordinator -i 60  # write a checkpoint every 60 seconds
if [ -x dmtcp_restart_script.sh ]; then
    # A restart script exists, so resume from the most recent checkpoint
    bash dmtcp_restart_script.sh
else
    # On first pass, run program under DMTCP
    dmtcp_launch --rm {1}
fi
"""
with open("wikiq_parallel_jobs.sh",'w') as calls:
    for dumpfile in archives:
        wikiq_base_call = f"wikiq -u -o {output_dir} {dumpfile}"
        wikiq_call = wikiq_base_call
        wiki = path.split(dumpfile)[1]
        wikiq_script = script.format( path.join(checkpoint_dir,wiki), wikiq_call)
        script_file = path.join(scripts_dir, wiki + '.sh')
        with open(script_file,'w') as of:
            of.write(wikiq_script)
       
        os.chmod(script_file,os.stat(script_file).st_mode | stat.S_IEXEC)
        calls.write(script_file)
        calls.write('\n')
</syntaxhighlight>
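For example (assuming the generator above is saved as <code>make_wikiq_scripts.py</code>; that filename is just for illustration), you can generate the per-dump scripts and spot-check the result like this:
<syntaxhighlight lang='bash'>
# Generate one shell script per dumpfile plus the master list of calls
python3 make_wikiq_scripts.py

# Spot-check: how many scripts were generated, and what does one entry look like?
ls /gscratch/comdata/users/nathante/wikiq_parallel_scripts | wc -l
head -n 1 wikiq_parallel_jobs.sh
</syntaxhighlight>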
We also need an sbatch script, saved as <code>parallel_sql_job.sh</code>:
<syntaxhighlight lang='bash'>
#!/bin/bash
## parallel_sql_job.sh
#SBATCH --job-name=wikiq_dmtcp
## Allocation Definition
#SBATCH --account=comdata-ckpt
#SBATCH --partition=ckpt
## Resources
## Nodes. This should always be 1 for parallel-sql.
#SBATCH --nodes=1   
## Walltime (12 hours)
#SBATCH --time=12:00:00
## Memory per node
#SBATCH --mem=100G
module load parallel_sql
#Put here commands to load other modules (e.g. matlab etc.)
#Below command means that parallel_sql will get tasks from the database
#and run them on the node (in parallel). So a 16 core node will have
#16 tasks running at one time.
parallel-sql --sql -a parallel --exit-on-term
</syntaxhighlight>
Next, load the scripts into <code>parallel-sql</code>:
  module load parallel_sql
  cat wikiq_parallel_jobs.sh | psu --load
We can now fire up a whole bunch of checkpoint nodes. The limit is technically 2000!  But let's just ask for 10 nodes :)
  for job in $(seq 1 10); do sbatch parallel_sql_job.sh; done
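To confirm that the checkpoint jobs were submitted and are running, the standard Slurm commands work; for example:
<syntaxhighlight lang='bash'>
# Show your jobs on the checkpoint partition (pending and running)
squeue -u $USER -p ckpt

# Show details for a single job (substitute a real job id)
scontrol show job <jobid>
</syntaxhighlight>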
If our jobs get interrupted, we'll need to run <code>psu --reset-slurm</code> to set them back to the '''avail''' state. We can run a little script on a login node to do this automatically every minute or so.
<syntaxhighlight lang='python'>
#!/usr/bin/env python3
## auto_reset_psu.py
## Run this on a login node. While any parallel-sql tasks are still listed as
## running, it resets interrupted tasks back to the 'avail' state every minute.
import time
import subprocess

def show_running():
    return subprocess.run(["psu", "--show-running"], universal_newlines=True,
                          stdout=subprocess.PIPE, stderr=subprocess.PIPE)

running = show_running()
print(running)

while len(running.stdout) > 0:
    subprocess.run(["psu", "--reset-slurm"])
    time.sleep(60)
    running = show_running()
</syntaxhighlight>
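Since the watcher needs to outlive your terminal session, one option is to start it on a login node with <code>nohup</code> (or inside a <code>tmux</code>/<code>screen</code> session):
<syntaxhighlight lang='bash'>
chmod +x auto_reset_psu.py
nohup ./auto_reset_psu.py > auto_reset_psu.log 2>&1 &
</syntaxhighlight>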
That's it! Unleash the power of the checkpoint queue!  Reach out to Nate if you try this and have problems or if you have any questions!


== New Datasets ==