CommunityData:Hyak tutorial
Revision as of 20:35, 2 August 2019
This file provides a complete, step-by-step walk-through for how to parse a list of Wikia wikis with wikiq. The same principles can be followed for other tasks.
Things you should know before you start
- Computing paradigms: HPC versus MapReduce/Hadoop
- ikt versus mox and the transition
- This material will cover getting set up on the older ikt cluster
- Our mox cluster is online and we will be migrating to it in late 2019/early 2020
Connecting to Hyak
Detailed information on setting up Hyak is covered in CommunityData:Hyak. Make sure you have:
- Set up SSH
- Connected to Hyak
- Set up your user's Hyak environment with the CDSC aliases and tools
Batch Jobs
Setup for running batch jobs on Hyak (only need to be done once)
1. Create a users directory for yourself in /com/users
You will want to store the output of your script in /com/, or you will run out of space in your personal filesystem (/usr/lusers/...)
$ mkdir /com/users/USERNAME # Replace USERNAME with your user name
2. Create a batch_jobs directory
$ mkdir /com/users/USERNAME/batch_jobs
3. Create a symlink from your home directory to this directory (this lets you use the /com storage from the more convenient home directory)
$ ln -s /com/users/USERNAME/batch_jobs ~/batch_jobs
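If you want to convince yourself how the directory-plus-symlink pattern in steps 1–3 behaves before touching /com, you can rehearse it in a scratch directory. The paths below are stand-ins for the real /com/users/USERNAME and home-directory locations:

```shell
# Sketch of the symlink pattern using a scratch directory in place of
# /com/users/USERNAME; files created through the link land on the
# backing storage, which is the whole point of the setup.
scratch=$(mktemp -d)
mkdir -p "$scratch/batch_jobs"                     # stands in for /com/users/USERNAME/batch_jobs
ln -s "$scratch/batch_jobs" "$scratch/home_link"   # stands in for ~/batch_jobs
touch "$scratch/home_link/example.txt"             # write through the symlink
ls "$scratch/batch_jobs"                           # the file is on the backing storage
```

On Hyak the effect is the same: anything you write under ~/batch_jobs is actually stored under /com/users/USERNAME/batch_jobs.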
4. Create a user in parallel SQL
$ module load parallel_sql
$ sudo pssu --initial
$ [sudo] password for USERID: <Enter your UW NetID password>
Project-specific steps (done for each project)
1. Create a new project in your batch_jobs directory
$ mkdir ~/batch_jobs/wikiq_test
$ cd ~/batch_jobs/wikiq_test
2. Create a symlink to the data that you will be using as an input (in this case, the 2010 wikia dump)
$ ln -s /com/raw_data/wikia_dumps/2010-04-mako ./input
3. Create an output directory
$ mkdir ./output
4. To check that everything is working and is where it should be, run wikiq on one file
$ python3 /com/local/bin/wikiq ./input/012thfurryarmybrigade.xml.7z -o ./output
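Once the single-file test succeeds, the same command can be generated for every dump in the input directory, producing a task list that can be fed to parallel SQL. This is a minimal sketch, using placeholder files in a scratch directory in place of the real ./input/*.7z dumps:

```shell
# Build a task list with one wikiq invocation per input dump.
# The two empty .7z files stand in for real Wikia dumps.
work=$(mktemp -d)
mkdir -p "$work/input" "$work/output"
touch "$work/input/wiki_a.xml.7z" "$work/input/wiki_b.xml.7z"
for dump in "$work"/input/*.7z; do
    echo "python3 /com/local/bin/wikiq $dump -o $work/output"
done > "$work/task_list"
wc -l < "$work/task_list"   # one line per dump
```

On Hyak, you would run the loop over ./input/*.7z in your project directory and load the resulting task list into parallel SQL rather than running the commands serially.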