CommunityData:Hyak tutorial

This page provides a complete, step-by-step walk-through of how to parse a list of Wikia wikis with wikiq. The same principles can be followed for other tasks.

Things you should know before you start

  • Computing paradigms: HPC versus MapReduce/Hadoop
  • ikt versus mox and the transition
    • This material covers getting set up on the older ikt cluster
    • Our mox cluster is online and we will be migrating to it in late 2019/early 2020

Connecting to Hyak

Detailed information on setting up Hyak is covered in CommunityData:Hyak. Make sure you have:

  • Set up SSH (a sample configuration sketch follows this list)
  • Connected to Hyak
  • Set up your user's Hyak environment with the CDSC aliases and tools
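If you connect frequently, an entry in ~/.ssh/config saves typing. This is a minimal sketch; the hostname below is an assumption, so use the login node name given in CommunityData:Hyak:

# ~/.ssh/config
Host hyak
    HostName ikt.hyak.uw.edu   # assumption: confirm the ikt login node name in CommunityData:Hyak
    User USERNAME              # your UW NetID

With that in place, connecting is just:

$ ssh hyak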

Batch Jobs

Setup for running batch jobs on Hyak (only needs to be done once)

1. Create a directory for yourself in /com/users:

You will want to store the output of your script in /com/, or you will run out of space in your personal filesystem (/usr/lusers/...)

$ mkdir /com/users/USERNAME  # Replace USERNAME with your user name
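Standard tools will show how much space is free on each filesystem before you start writing output; nothing here is CDSC-specific:

$ df -h /com ~                  # free space on /com and on your home filesystem
$ du -sh /com/users/USERNAME    # space your directory currently uses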

2. Create a batch_jobs directory

$ mkdir /com/users/USERNAME/batch_jobs

3. Create a symlink from your home directory to this directory (this lets you use the /com storage from the more convenient home directory)

$ ln -s /com/users/USERNAME/batch_jobs ~/batch_jobs
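You can confirm the link points where you expect:

$ ls -ld ~/batch_jobs   # should print something like: batch_jobs -> /com/users/USERNAME/batch_jobs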

4. Create a user in parallel SQL

$ module load parallel_sql
$ sudo pssu --initial
[sudo] password for USERID: <enter your UW NetID password>
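After the account is created, you can sanity-check that the job database is reachable. A sketch, assuming the psu utility that ships with the parallel_sql module accepts a --stats flag (check psu --help if it does not):

$ module load parallel_sql
$ psu --stats   # assumption: reports queued/running/completed task counts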

Project-specific steps (done for each project)

1. Create a new project in your batch_jobs directory

$ mkdir ~/batch_jobs/wikiq_test
$ cd ~/batch_jobs/wikiq_test

2. Create a symlink to the data that you will be using as input (in this case, the 2010 Wikia dump)

$ ln -s /com/raw_data/wikia_dumps/2010-04-mako ./input
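A quick listing confirms the dump files are visible through the link:

$ ls ./input | head   # should show .xml.7z dump files, e.g. 012thfurryarmybrigade.xml.7z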

3. Create an output directory

$ mkdir ./output

4. To test that everything is working and everything is where it should be, run wikiq on a single file

$ python3 /com/local/bin/wikiq ./input/012thfurryarmybrigade.xml.7z -o ./output
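Once the single-file test works, you can generate the same command for every dump in the input directory. This is a sketch of how you might build a task list for parallel SQL; the task_list filename is arbitrary:

$ for f in ./input/*.xml.7z; do echo "python3 /com/local/bin/wikiq $f -o ./output"; done > task_list
$ wc -l task_list   # one wikiq command per wiki dump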