Latest revision |
Your text |
Line 1: |
Line 1: |
| '''Klone''' is the latest version of hyak, the UW supercomputing system. We will soon have a larger allocation of machines on Klone than on Mox. The Klone machines have 40 cores and either 384GB or 768GB of RAM. You can check storage allocation usage with the <code>hyakstorage</code> command. | | '''Klone''' is the latest version of hyak, the UW supercomputing system. We will soon have a larger allocation of machines on Klone than on Mox. The Klone machines have 40 cores and either 384GB or 768GB of RAM. |
|
| |
|
| == Setting up SSH == | | == Setup == |
| | The recommended way to manage software for your research projects on Klone is to use [https://sylabs.io/docs/ Singularity containers]. You can build a singularity container using the linux distribution of your choice (e.g., debian, ubuntu, centos). The instructions on this page document how to build the <code>cdsc_base.sif</code> singularity package which provides python, R, julia, and pyspark based on Debian 11 (Bullseye). |
|
| |
|
| When you connect over SSH, it will ask you for a key from your token. Typing this in every time you start a connection can be a pain. One approach is to create an .ssh config file that will create a "tunnel" the first time you connect and send all subsequent connections to Hyak over that tunnel. Some details are [http://wiki.cac.washington.edu/display/hyakusers/Logging+In in the Hyak documentation].
| | Copies of the definition file and a working container are located at <code>/gscratch/comdata/containers/cdsc_base/</code>. |
| | |
| I've added the following config to the file <code>~/.ssh/config</code> on my laptop (you will want to change the username):
| |
| | |
| Host klone klone.hyak.uw.edu
| |
| User '''<YOURNETID>'''
| |
| HostName klone.hyak.uw.edu
| |
| ControlPath ~/.ssh/master-%r@%h:%p
| |
| ControlMaster auto
| |
| ControlPersist yes
| |
| Compression yes
| |
| | |
| {{Note}} If your SSH connection becomes stale or disconnected (e.g., if you change networks) it may take some time for the connection to time out. Until that happens, any connections you make to hyak will silently hang. If your ssh connections to hyak are silently hanging but your Internet connection seems good, look for ssh processes running on your local machine with:
| |
| | |
| ps ax|grep klone
| |
| | |
| If you find any, kill them with <code>kill '''<PROCESSID>'''</code>. Once that is done, you should have no problem connecting to Hyak.
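A minimal sketch of the cleanup described above, assuming the <code>ControlMaster</code> config shown earlier. The helper name <code>klone_ssh_pids</code> is made up for illustration; it reads <code>ps ax</code> output on stdin and prints the PIDs of ssh processes whose command line mentions klone, skipping the <code>grep</code> process itself. OpenSSH can also close a shared master connection directly with <code>ssh -O exit klone</code>.

```shell
# Hypothetical helper: print PIDs of local ssh processes that mention klone.
# Expects `ps ax` style input (PID TTY STAT TIME COMMAND) on stdin.
klone_ssh_pids() {
    awk '$5 ~ /^(.*\/)?ssh:?$/ && /klone/ { print $1 }'
}

# Typical use on your laptop:
#   ps ax | klone_ssh_pids               # list the stale connections
#   ps ax | klone_ssh_pids | xargs kill  # kill them
#   ssh -O exit klone                    # or ask the master connection to close itself
```

Matching on the command column avoids killing unrelated processes that merely have "klone" somewhere in their arguments.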
| |
| | |
| | |
| == Setting up your Environment ==
| |
| The recommended way to manage software for your research projects on Klone is to use [https://apptainer.org/docs/user/main/quick_start.html Apptainer containers] (formerly known as Singularity). At first, you probably do not need to know much about containers because we maintain a shared setup described below. However, before getting to work on Klone, you'll need to set up an environment that provides our containerized commands and a few other conveniences. You do this by creating the following <code>.bashrc</code> file in your home directory (i.e., <code>/mmfs1/home/{your_username}</code>) where you land when you connect to klone.
| |
|
| |
|
| === Initial .Bashrc === | | === Initial .Bashrc === |
| Before we get started using our apptainer package on klone, we need to start with a <code>.bashrc</code>. Using a text editor (nano is a good choice if you don't already have a preference), create your <code>.bashrc</code> by pasting in the following code. Then run the command <code>source ~/.bashrc</code> to run the .bashrc and enable the environment. | | Before we get started using our singularity package on klone, we need to start with a <code>.bashrc</code>. |
|
| |
|
| <syntaxhighlight language='bash'> | | <syntaxhighlight language='bash'> |
| # .bashrc | | # .bashrc |
| | | # Stuff that's in there already that you need for working with the cluster. |
| export LOGIN_NODE=$(hostname | grep -q '^klone-login01' ; echo $?)
| | # Add the following two lines |
| export SBATCH_EXPORT=BASH_ENV='~/.bashrc'
| |
| export SLURM_EXPORT_ENV=BASH_ENV='~/.bashrc'
| |
| | |
| if [ -f ~/.bash_aliases ]; then
| |
| . ~/.bash_aliases
| |
| fi
| |
| | |
| # User specific environment
| |
| if ! [[ "$PATH" =~ "$HOME/.local/bin:$HOME/bin:" ]]
| |
| then
| |
| PATH="$HOME/.local/bin:$HOME/bin:$PATH"
| |
| fi
| |
| | |
| export PATH
| |
| | |
| # Source global definitions | |
| if [ -f /etc/bashrc ]; then
| |
| . /etc/bashrc
| |
| fi
| |
| | |
| source "/gscratch/comdata/env/cdsc_klone_bashrc"
| |
| | |
| if [[ "$LOGIN_NODE" == 0 ]]; then
| |
| :
| |
| else
| |
| | |
| # Uncomment the following line if you don't like systemctl's auto-paging feature:
| |
| # export SYSTEMD_PAGER=
| |
| | |
| # User specific aliases and functions
| |
| umask 007 | | umask 007 |
| export APPTAINER_BIND="/gscratch:/gscratch,/mmfs1:/mmfs1,/gpfs:/gpfs,/sw:/sw,/usr:/kloneusr,/bin:/klonebin" | | module load singularity |
| | | export SINGULARITY_BIND="/gscratch:/gscratch,/mmfs1:/mmfs1,/xcatpost:/xcatpost,/gpfs:/gpfs,/sw:/sw" |
| export OMP_THREAD_LIMIT=40
| | alias big_machine="srun -A comdata -p compute-bigmem --time=6:00:00 -c 40 --pty bash -l" |
| export OMP_NUM_THREADS=40
| | alias huge_machine="srun -A comdata -p compute-hugemem --time=6:00:00 -c 40 --pty bash -l" |
|
| |
| export PATH="$PATH:/gscratch/comdata/users/$(whoami)/bin:/gscratch/comdata/local/spark:/gscratch/comdata/local/bin"
| |
| source "/gscratch/comdata/users/nathante/spark_env.sh"
| |
| export _JAVA_OPTIONS="-Xmx362g"
| |
| fi
| |
| </syntaxhighlight> | | </syntaxhighlight> |
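The <code>LOGIN_NODE</code> line in the file above stores the exit status of <code>grep -q</code>, so the variable is <code>0</code> on the login node and non-zero elsewhere. Everything in the <code>else</code> branch therefore only runs on compute nodes. A small sketch of the same check written as a function (the name <code>is_login_node</code> is an illustrative assumption, not part of our shared setup; <code>uname -n</code> is used here as an equivalent of <code>hostname</code>):

```shell
# Succeeds (exit status 0) when the given hostname starts with klone-login01,
# mirroring the grep used for LOGIN_NODE in the .bashrc.
is_login_node() {
    printf '%s\n' "$1" | grep -q '^klone-login01'
}

if is_login_node "$(uname -n)"; then
    echo "login node: skipping compute-only setup"
else
    echo "not the login node: compute setup would run here"
fi
```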
|
| |
|
| ==Connect to a Compute Node== | | == Installing singularity on your local computer == |
| When you first SSH into Klone, you will be on a login node. Before you can do computational work or use software installed in our containers (see below), you will need to log into a compute node from the login node. Once your <code>~/.bashrc</code> file is set up and sourced, you can do so by running a SLURM job or by using one of the aliases described in https://wiki.communitydata.science/CommunityData:Hyak_tutorial#Interactive_Jobs.
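As a rough sketch of what an interactive job request looks like, the function below only prints an <code>srun</code> command so you can inspect it before running it. The account (<code>comdata</code>), partition name, core count, and time limit come from aliases we have used in the past and are assumptions here; check the tutorial linked above for the current ones. The function name <code>interactive_cmd</code> is hypothetical.

```shell
# Print (do not run) an srun command for an interactive shell on a compute node.
# $1 = partition (e.g. compute-bigmem), $2 = wall time in hours
interactive_cmd() {
    printf 'srun -A comdata -p %s --time=%s:00:00 -c 40 --pty bash -l\n' "$1" "$2"
}

interactive_cmd compute-bigmem 6
# To actually start the session:  eval "$(interactive_cmd compute-bigmem 6)"
```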
| | You might find it more convenient to develop your singularity container on your local machine. You'll want singularity version 3.4.2, which is the version installed on klone. Follow [https://sylabs.io/guides/3.5/admin-guide/installation.html these instructions] for installing singularity on your local linux machine. |
| | |
| | == Creating a singularity container == |
|
| |
|
| == About Containers ==
| | Our goal is to write a singularity definition file that will install the software that we want to work with. The definition file contains instructions for building a more reproducible environment. For example, the file <code>cdsc_base.def</code> contains instructions for installing an environment based on debian 11 (bullseye). Once we have the definition file, we just have to run: |
| | |
| We use [https://apptainer.org/docs/user/latest/index.html Apptainer] (formerly known as, and sometimes still referred to as, Singularity) containers to install software on klone. Klone provides a very minimal operating system, so without these containers, installing software can be quite labor-intensive.
| |
| Our goal has been to make using software installed through apptainer as seamless as possible. For the most part, once you have your environment configured as above, you shouldn't have to think about the containers unless you need to install something new.
| |
| | |
| We created commands (e.g., <code>python3</code>, <code>Rscript</code>, <code>jupyter-console</code>) that run the containerized version of the program. The full list of such commands is in <code>/gscratch/comdata/containers/bin</code>.
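To check whether a command you are about to run is one of the containerized wrappers, you can look at where it resolves on your <code>PATH</code>. A small sketch, assuming the wrapper directory named above; the function name <code>is_containerized</code> is made up for this example:

```shell
wrapper_dir=/gscratch/comdata/containers/bin

# Succeeds if the named command resolves to a file inside $wrapper_dir.
is_containerized() {
    case "$(command -v "$1")" in
        "$wrapper_dir"/*) return 0 ;;
        *) return 1 ;;
    esac
}

if is_containerized python3; then
    echo "python3 runs inside a container"
else
    echo "python3 is not one of the container wrappers"
fi
```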
| |
| | |
| Importantly, installing packages in R, Python (e.g., using pip), or other programming languages should usually work normally because the containers already have the most common dependencies. Installing packages this way will not update the container. Instead, the packages will be installed in your user directory. This is desirable so that different container users do not break each other's environments. An installation may fail because it requires a dependency that is missing from the operating system. If this happens, you can try to add the dependency to the container as described below. If this seems challenging or complicated, or you need many changes to the container or changes you don't understand, reach out to the IT team.
| |
| | |
| We will use multiple different apptainer containers for different applications to avoid accidentally breaking existing versions of packages during upgrades. We want containers that include "soft dependencies" that R or Python libraries might want.
| |
| | |
| == To make a new container alias ==
| |
| For example, let's say you want to make a command to run <code>jupyter-console</code> for interactive python work and let's say you know that you want to run this from the <code>cdsc_python.sif</code> container located in <code>/gscratch/comdata/containers/cdsc_python</code>.
| |
| | |
| 1. Ensure that the software you want to execute is installed in the container. Test this by running <code> apptainer exec /gscratch jupyter-console</code>.
| |
| | |
| 2. Create an executable file in /gscratch/comdata/containers/bin. The file should look like:
| |
| <syntaxhighlight lang='bash'>
| |
| #!/usr/bin/env bash
| |
| | |
| apptainer exec /gscratch/comdata/containers/cdsc_python/cdsc_python.sif jupyter-console "$@"
| |
| | |
| </syntaxhighlight>
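The two steps above can be scripted. The following is a sketch only: <code>make_wrapper</code> is a hypothetical helper, and it assumes you have write access to the target directory. The generated script forwards its arguments to the containerized program with <code>"$@"</code>.

```shell
# Hypothetical helper: write an executable wrapper that runs a command
# from a given container.
# $1 = directory for the wrapper, $2 = path to the .sif file, $3 = command name
make_wrapper() {
    bin_dir=$1
    sif=$2
    cmd=$3
    cat > "$bin_dir/$cmd" <<EOF
#!/usr/bin/env bash
apptainer exec $sif $cmd "\$@"
EOF
    chmod +x "$bin_dir/$cmd"
}

# Example (paths taken from the text above):
#   make_wrapper /gscratch/comdata/containers/bin \
#       /gscratch/comdata/containers/cdsc_python/cdsc_python.sif jupyter-console
```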
| |
| | |
| == Installing apptainer on your local computer ==
| |
| You might find it more convenient to develop your apptainer container on your local machine. You'll want apptainer version 3.4.2, which is the version installed on klone. Follow [https://apptainer.org/docs/user/latest/quick_start.html these instructions] for installing apptainer on your local linux machine.
| |
| | |
| == Creating an apptainer container ==
| |
| | |
| Our goal is to write an apptainer definition file that will install the software that we want to work with. The definition file contains instructions for building a more reproducible environment. For example, the file <code>cdsc_base.def</code> contains instructions for installing an environment based on debian 11 (bullseye). Once we have the definition file, we just have to run: | |
| | |
| '''NOTE:''' For some reason building a container doesn't work on the <code>/gscratch</code> filesystem. Instead build containers on the <code>/mmfs1</code> filesystem and then copy them to their eventual homes on <code>/gscratch</code>.
| |
|
| |
|
| <syntaxhighlight language='bash'> | | <syntaxhighlight language='bash'> |
| apptainer build --fakeroot cdsc_base.sif cdsc_base.def | | singularity build --fakeroot cdsc_base.sif cdsc_base.def |
| </syntaxhighlight> | | </syntaxhighlight> |
|
| |
|
| Run this on a klone compute node to create the apptainer container <code>cdsc_base.sif</code>. This can take quite a while to run as it downloads and installs a lot of software! | | Run this on a klone compute node to create the singularity container <code>cdsc_base.sif</code>. This can take quite a while to run as it downloads and installs a lot of software! |
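Because building does not work directly on <code>/gscratch</code>, the build and the copy can be combined. This is only a sketch: <code>build_and_install</code> and the <code>APPTAINER_CMD</code> override variable are hypothetical names introduced here, and the directories you pass in are up to you.

```shell
# Hypothetical helper: build a .sif in one directory, then copy it elsewhere.
# $1 = definition file (e.g. cdsc_base.def)
# $2 = build directory (somewhere on /mmfs1)
# $3 = destination directory (the container's eventual home on /gscratch)
build_and_install() {
    def_file=$1
    build_dir=$2
    dest_dir=$3
    name=$(basename "$def_file" .def)
    # APPTAINER_CMD is an illustrative override, useful for dry runs.
    ${APPTAINER_CMD:-apptainer} build --fakeroot "$build_dir/$name.sif" "$def_file" &&
        cp "$build_dir/$name.sif" "$dest_dir/"
}

# Example:
#   build_and_install cdsc_base.def "$HOME/builds" /gscratch/comdata/containers/cdsc_base
```

The copy only happens if the build succeeds, so a failed build never overwrites a working container.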
|
| |
|
| You can start a shell in the container using: | | You can start a shell in the container using: |
|
| |
|
| <syntaxhighlight language='bash'> | | <syntaxhighlight language='bash'> |
| apptainer shell cdsc_base.sif
| | singularity shell cdsc_base.sif |
| </syntaxhighlight> | | </syntaxhighlight> |
|
| |
|
Line 127: |
Line 42: |
|
| |
|
| <syntaxhighlight language='bash'> | | <syntaxhighlight language='bash'> |
| apptainer exec cdsc_base.sif echo "my command"
| | singularity exec cdsc_base.sif echo "my command" |
| </syntaxhighlight> | | </syntaxhighlight> |
|
| |
|
Line 135: |
Line 50: |
|
| |
|
| <syntaxhighlight language='bash'> | | <syntaxhighlight language='bash'> |
| apptainer build --sandbox cdsc_base_sandbox cdsc_base.sif
| | singularity build --sandbox cdsc_base_sandbox cdsc_base.sif |
| </syntaxhighlight> | | </syntaxhighlight> |
|
| |
|
| You might run into trouble with exceeding space in your temporary file path. If you do, run | | You might run into trouble with exceeding space in your temporary file path. If you do, run |
| <syntaxhighlight language='bash'> | | <syntaxhighlight language='bash'> |
| export APPTAINER_TMPDIR=/my/large/tmp | | export SINGULARITY_TMPDIR=/my/large/tmp |
| export APPTAINER_CACHEDIR=/my/large/apt_cache | | export SINGULARITY_CACHEDIR=/my/large/apt_cache |
| export APPTAINER_LOCALCACHEDIR=/my/large/apt_cache | | export SINGULARITY_LOCALCACHEDIR=/my/large/apt_cache |
| </syntaxhighlight> | | </syntaxhighlight> |
| before running the build. | | before running the build. |
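The directories named in these variables need to exist before the build starts. A small sketch that creates them and sets the two main variables in the current shell; <code>use_build_scratch</code> and the <code>tmp</code>/<code>cache</code> subdirectory names are assumptions made for this example. Note that <code>export</code> is a shell builtin, so it has to run in the same shell that later runs the build.

```shell
# Hypothetical helper: point apptainer's temporary and cache directories
# at a location with plenty of free space.
# $1 = a large scratch directory
use_build_scratch() {
    scratch=$1
    mkdir -p "$scratch/tmp" "$scratch/cache" || return 1
    export APPTAINER_TMPDIR="$scratch/tmp"
    export APPTAINER_CACHEDIR="$scratch/cache"
}

# Example:
#   use_build_scratch /my/large
#   apptainer build --fakeroot cdsc_base.sif cdsc_base.def
```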
Line 160: |
Line 75: |
| == Spark == | | == Spark == |
|
| |
|
| To set up a spark cluster using apptainer, the first step is to "run" the container on each node in the cluster: | | To set up a spark cluster using singularity, the first step is to "run" the container on each node in the cluster: |
|
| |
|
| <syntaxhighlight lang='bash'> | | <syntaxhighlight lang='bash'> |
| # on the first node | | # on the first node |
| apptainer instance start --fakeroot cdsc_base.sif spark-boss
| | singularity instance start --writable-tmpfs cdsc_base.sif spark-boss |
| export SPARK_BOSS=$(hostname)
| | # on the first worker node |
| # on the first worker node (typically same as boss node) | | singularity instance start --writable-tmpfs cdsc_base.sif spark-worker-1 |
| apptainer instance start --fakeroot cdsc_base.sif spark-worker-1
| |
| # second worker node | | # second worker node |
| apptainer instance start --fakeroot cdsc_base.sif spark-worker-2
| | singularity instance start --writable-tmpfs cdsc_base.sif spark-worker-2 |
| </syntaxhighlight> | | </syntaxhighlight> |
|
| |
|
Line 175: |
Line 89: |
|
| |
|
| <syntaxhighlight lang='bash'> | | <syntaxhighlight lang='bash'> |
| apptainer exec instance://spark-boss /opt/spark/sbin/start-master.sh
| | singularity exec instance://spark-boss start_spark_boss.sh |
| | | singularity exec instance://spark-worker-1 start_spark_worker.sh |
| apptainer exec instance://spark-worker-1 /opt/spark/sbin/start-worker.sh spark://$SPARK_BOSS:7077
| |
| </syntaxhighlight> | | </syntaxhighlight> |
|
| |
|
| That should be it. Though in practice it might make more sense to have special containers for the spark boss and workers. | | That should be it. Though in practice it might make more sense to have special containers for the spark boss and workers. |
|
| |
|
| You can now submit spark jobs by running <code>spark-submit.sh</code>.
| | == cdsc_base.def == |
| | <syntaxhighlight language='bash'> |
| | Bootstrap: library |
| | from: debian:bullseye |
| | |
| | %post |
| | echo "deb http://mirror.keystealth.org/debian bullseye main contrib" > "/etc/apt/sources.list" |
| | apt update && apt upgrade -y |
| | apt install -y gnupg curl |
| | curl -O https://downloads.apache.org/spark/KEYS |
| | curl -O https://downloads.apache.org/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz.asc |
| | curl -O https://mirror.jframeworks.com/apache/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz |
| | gpg --import KEYS |
| | ls |
| | gpg --verify spark-3.1.1-bin-hadoop3.2.tgz.asc spark-3.1.1-bin-hadoop3.2.tgz |
| | rm KEYS |
| | export JAVA_HOME=/usr/lib/jvm/default-java |
| | tar xvf spark-3.1.1-bin-hadoop3.2.tgz |
| | mv spark-3.1.1-bin-hadoop3.2/ /opt/spark |
| | curl -O https://mirror.jframeworks.com/apache/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz |
| | curl -O https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz.asc |
| | curl -O https://downloads.apache.org/hadoop/common/KEYS |
| | gpg --import KEYS |
| | ls |
| | gpg --verify hadoop-3.3.0.tar.gz.asc hadoop-3.3.0.tar.gz |
| | tar xvf hadoop-3.3.0.tar.gz |
| | mv hadoop-3.3.0/ /opt/hadoop |
| | export HADOOP_HOME=/opt/hadoop |
| | |
| | apt install -y libopenblas-base |
| | apt install -y r-base r-recommended emacs vim python3-sklearn jupyter moreutils julia default-jdk git curl meld xauth python3-venv python3-pip apt-utils ncdu |
| | apt clean |
| | mkdir mmfs1 |
| | mkdir gscratch |
| | mkdir xcatpost |
| | mkdir gpfs |
| | mkdir sw |
| | rm hadoop-3.3.0.tar.gz hadoop-3.3.0.tar.gz.asc KEYS spark-3.1.1-bin-hadoop3.2.tgz spark-3.1.1-bin-hadoop3.2.tgz.asc |
| | |
| | |
| | %environment |
| | export JAVA_HOME=/usr/lib/jvm/default-java |
| | export HADOOP_HOME=/opt/hadoop |
| | export LC_ALL=C |
| | export SPARK_HOME=/opt/spark |
| | export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin |
| | export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native |
|
| |
|
| <syntaxhighlight lang='bash'>
| |
| # replace n3078 with the master hostname
| |
| apptainer exec instance://spark-boss /opt/spark/bin/spark --master spark://n3078.hyak.local:7077
| |
| </syntaxhighlight> | | </syntaxhighlight> |
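The master URL used above has the form <code>spark://HOST:7077</code>. Rather than editing the hostname by hand, it can be assembled from a variable. This is a sketch under assumptions: <code>spark_master_url</code> is a hypothetical helper, and the name reported by <code>uname -n</code> may be shorter than the fully qualified name the master is listening on, so check it against the master's log.

```shell
# Hypothetical helper: build a Spark master URL from a hostname.
# With no argument, it falls back to this machine's node name.
spark_master_url() {
    printf 'spark://%s:7077\n' "${1:-$(uname -n)}"
}

spark_master_url n3078.hyak.local
# On the boss node you could store it for later use:
#   SPARK_MASTER_URL=$(spark_master_url "$SPARK_BOSS")
```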
|
| |
| Nate's working on wrapping the above nonsense in friendlier scripts.
| |