CommunityData:TACC

From CommunityData

TACC (the Texas Advanced Computing Center) is the high-performance computing center at the University of Texas at Austin. It has several different resources. We are likely to use Stampede3, which is an HPC resource similar to Hyak.

Stampede3 complements Hyak in two significant ways. First, it is a time-sharing based system instead of a "condo": we have a budget to spend on compute jobs. A major advantage of this model is the ability to run larger jobs that use many nodes; for example, we can start up a Spark cluster large enough to fit a dataset in memory. A second advantage is that Stampede3 has nodes with 4 NVIDIA H100 GPUs having 96GB VRAM each, which is great for running LLMs, fitting statistical models, and other computations that can benefit from a GPU. Stampede3's disadvantage is that jobs have short maximum wall times (24 or 48 hours, depending on the type of node). Jobs that run longer than that need to implement checkpointing (see below).

There are several different types of nodes on Stampede3 with different amounts of CPU and RAM, each costing a different number of SUs (service units, the allocation currency). The SPR nodes (for CPU-heavy workloads) and ICX nodes (for more RAM-heavy tasks) tend to be the best value. See the Stampede3 user guide for the details.

The allocation is renewable and expandable, but we have to do increasing amounts of paperwork and come under increasing scrutiny as it grows. If we do good work and are good stewards of the resource, we should have no shortage of compute.

Create an Account[edit]

You can create an account at [1]. Then message Nate to get added to an allocation.

Access[edit]

If you add the following to your ~/.ssh/config file:

    Host tacc stampede3.tacc.utexas.edu
        User [your username]
        HostName stampede3.tacc.utexas.edu
        ControlPath ~/.ssh/master-%r@%h:%p
        ControlMaster auto
        ControlPersist yes
        Compression yes

you should be able to connect just by running ssh tacc from the terminal. Note that TACC requires two-factor authentication (2FA) when you log in.

TACC Filesystems[edit]

TACC users will work with 4 distributed filesystems:

  • Home is persistent, not fast, and backed up. Each user has a 15GB allocation. This is a good place to install software, code, and configuration. You can refer to your home directory in scripts using the $HOME environment variable. You can quickly navigate to your home directory using the cdh alias.
  • Work is persistent and fast. Each project has a 1TB allocation. This is a good place to store data that you are actively working with and need to persist. Use the $WORK environment variable and the cdw alias.
  • Corral is persistent, slower, but larger. It is also inaccessible from compute nodes, so you need to copy data from Corral to Work or Scratch before working with it. The environment variable is $CORRAL.
  • Scratch is fast and unlimited, but unaccessed files are automatically removed after 10 days. Use it for intermediate stages in a data analysis pipeline or for data that is too large for Work. The environment variable is $SCRATCH and the command is cds.
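As a concrete illustration of the filesystem guidance above, a Python script can route its intermediate files to $SCRATCH rather than $HOME. This sketch falls back to the system temp directory when $SCRATCH is unset (e.g., on your laptop); the myproject directory name is just an example:

```python
import os
import tempfile

# Prefer $SCRATCH for intermediate files; fall back to the system
# temp directory when running somewhere other than TACC.
scratch = os.environ.get("SCRATCH", tempfile.gettempdir())
workdir = os.path.join(scratch, "myproject", "tmp")  # "myproject" is illustrative
os.makedirs(workdir, exist_ok=True)

# Route this process's temporary files to the scratch filesystem.
tempfile.tempdir = workdir
with tempfile.NamedTemporaryFile(dir=workdir, delete=False, suffix=".tmp") as f:
    f.write(b"intermediate pipeline data")
    path = f.name
```

Remember that Scratch is purged, so copy anything you want to keep to Work or Corral.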

Note on set-up[edit]

When you run scripts, many temporary files, logs, caches, etc. will be written to $HOME by default, where you will run out of space and things will crash. Set them up to save to $SCRATCH instead, as it is larger (though not persistent). So far, Sohyeon has had to set the following:

export XDG_CACHE_HOME=$SCRATCH/.cache

export IPYTHONDIR=$SCRATCH/.ipython

export DUCKDB_TMPDIR=$SCRATCH/tmp

export SLURM_STATE_SAVE_LOCATION=$SCRATCH/.slurm

General good practice as you start a session[edit]

Use tmux sessions so you can always get back to the login node. Create additional sessions for jobs or whatever else you want to do. For example, Sohyeon usually has:

tmux new -s login
tmux new -s h100

In the h100 tmux session, run idev -p h100 -t 48:00:00. Once the node starts, you can run a script like so: uv run ipython3 <python.script>

Containers on TACC[edit]

Similar to Hyak, TACC supports the Apptainer containerization system for installing software. Nate has containers with modern versions of python and R in /work2/10114/nathante/stampede3/containers. Putting /work2/10114/nathante/stampede3/containers/bin in your $PATH will let you use the containerized programs in that directory. This is particularly recommended if you're going to use Python or the GPUs.

Jupyter notebooks on TACC[edit]

For interactive usage on a single node, you can use the TACC Analysis platform to easily get a Jupyter notebook.

Running Long Jobs on Stampede 3[edit]

Stampede 3 has a lower maximum wall time than Hyak, typically 2 days. This means that if you have a job that requires more time to complete, you need to checkpoint and resume it. The key challenge in checkpoint/resume is saving your program's state and then resuming from that state. For example, if you are running a slow iterative algorithm, you can save the state after each iteration. Then when your program runs again, it can load that state and resume from the last iteration. Depending on your problem and algorithm, checkpointing might be more or less difficult.
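The save-after-each-iteration pattern can be sketched as follows. This is a minimal illustration, not anyone's actual pipeline; the file name and state contents are placeholders:

```python
import json
import os

STATE_FILE = "checkpoint.json"  # illustrative name

def load_state():
    """Resume from the last saved state, or start fresh."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"iteration": 0, "total": 0, "done": False}

def save_state(state):
    # Write to a temp file and rename so a mid-write crash
    # cannot leave a corrupt checkpoint behind.
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, STATE_FILE)

state = load_state()
N_ITERATIONS = 10  # stand-in for a slow iterative algorithm
while state["iteration"] < N_ITERATIONS:
    state["total"] += state["iteration"]  # one "slow" step of work
    state["iteration"] += 1
    save_state(state)  # checkpoint after every iteration
state["done"] = True
save_state(state)
```

If the job is killed mid-run, the next run picks up at the last completed iteration instead of starting over.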

Checkpoint/resume is relatively easy to make work with Slurm. Wrap the main part of your program in a check to see whether the job is incomplete: load the saved state if it is incomplete, and exit otherwise. When your job finishes in an incomplete state, have it run a script that submits a new version of itself to the Slurm scheduler, using Slurm's -d/--dependency flag to make sure the new job doesn't start until the current one is totally complete.
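The resubmission step can be sketched like this. The script name is a placeholder, and the sbatch call only fires when actually running under Slurm; afterany is the dependency type that waits for the current job to terminate for any reason:

```python
import os
import subprocess

def resubmit_command(script="my_job.sbatch"):
    """Build the sbatch command that queues the next run of this job,
    starting only after the current job has fully terminated."""
    job_id = os.environ.get("SLURM_JOB_ID")
    cmd = ["sbatch"]
    if job_id:
        # afterany: the new job starts once this one ends for any reason.
        cmd.append(f"--dependency=afterany:{job_id}")
    cmd.append(script)
    return cmd

job_incomplete = True  # in practice, inspect your checkpoint state
if job_incomplete and os.environ.get("SLURM_JOB_ID"):
    subprocess.run(resubmit_command())  # only runs on the cluster
```

Since the new job waits for the old one, at most one copy of the job runs at a time, and the chain stops as soon as a run exits without resubmitting.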

Using a Spark Cluster on Stampede3[edit]

Spark is currently the recommended approach for most kinds of big data processing on TACC ("big data" means the data doesn't fit in memory on a single node). It's pretty straightforward to run Spark on a single node: just install and load the pyspark package. Running Spark on a single node is enough for tasks like taking samples from larger-than-memory datasets, and even running some aggregate queries. However, the real power of Spark is running it as a cluster to distribute work over many nodes at once. This is a great fit for TACC's model, which encourages shorter jobs that scale horizontally (using many nodes in parallel) instead of long jobs. This section explains how to make this work.

Nate recorded a video demonstrating how to use the Spark configuration and submit your jobs. Watch it here.

Spark Cluster Configuration[edit]

The first step is to create a configuration for your Spark cluster. To get started, you can copy Nate's configuration at /work2/10114/nathante/stampede3/spark_conf to your own /work2/YOUR_NUMBERS/YOUR_USERNAME/stampede3/spark_conf directory. Then edit your .bashrc and add the lines:

export SPARK_CONF_DIR=/work2/YOUR_NUMBERS/YOUR_USERNAME/stampede3/spark_conf

export SPARK_HOME=/work2/10114/nathante/stampede3/spark

export JAVA_HOME=/work2/10114/nathante/stampede3/java_jdk

Don't forget to replace YOUR_NUMBERS and YOUR_USERNAME so that the paths are correct.

Note: We each need our own Spark configuration in case two different people want to run Spark clusters at the same time; otherwise, we could overwrite each other's configurations. $SPARK_CONF_DIR is an environment variable that tells Spark where to look for the configuration. $SPARK_HOME is where pyspark will look for the Spark executables. $JAVA_HOME is where pyspark will look for the Java runtime with which to run Spark. These steps give everyone the same versions of Spark and Java, but their own configurations.

Spark-env.sh files for Node Types[edit]

As described above, Stampede3 has several different node types. Three of these are appropriate for Spark clusters: the icx, skx, and spr nodes. You'll find corresponding icx-spark-env.sh, skx-spark-env.sh, and spr-spark-env.sh files in $SPARK_CONF_DIR. Spark uses the settings in spark-env.sh, which is a symbolic link to the file for the type of node you'll use for your cluster. (To see which config is active, run ls -l $SPARK_CONF_DIR/spark-env.sh.) To change node types, remove spark-env.sh and replace it with a new symbolic link. For example, to make a cluster out of skx nodes:

cd $SPARK_CONF_DIR

rm spark-env.sh

ln -s skx-spark-env.sh spark-env.sh


Note: The default configurations in $SPARK_CONF_DIR are conservative, allocating only about 50% of the system's memory to Spark executors. Spark jobs sometimes need memory left over for I/O operations, particularly when creating large parquet files, as well as other things. This conservative default decreases the chance your job will fail, which can be annoying, time-consuming, and expensive. You can increase the memory available to the executors via SPARK_EXECUTOR_MEMORY in spark-env.sh. This is worthwhile for large jobs that won't need much overhead (e.g., a computationally and memory intensive job with a small output).

Note: There are other settings you might tweak in the configuration to optimize Spark's performance. Such optimizations often benefit some types of workloads and not others. Fine-tuning Spark is unlikely to be the most productive use of your time, but if you're interested, feel free to experiment.

Choosing a Node Type[edit]

If they are available, icx nodes are recommended for Spark because they are modern CPUs with a balanced ratio of memory to cores. The skx nodes are also a good choice, even though the CPUs are smaller and older, because they have a good memory/core balance. Moreover, you can run jobs that use a great many of them (up to 256! But that will use a lot of our budget; talk to Nate if you need more than 8 skx nodes at once). The spr nodes are very fast, with many cores and high-performance memory, but they don't have as much memory. They might be useful in some cases for running CPU-intensive algorithms within Spark. Spark is very hungry for memory and tends to become limited by disk I/O rather than memory bandwidth or CPU: the more memory available to Spark, the less it has to spill to disk.

Starting the Cluster and Submitting Your Script[edit]

Now that we have configured Spark, we are ready to start the cluster and run jobs.

Note: It is not recommended to run clusters through a Jupyter notebook or other interactive workflow. Instead, develop and test your Spark code on a sample of your data, and run a cluster only when you're ready to scale out to the entire large dataset. You'll need to connect via ssh before proceeding.

Here are the steps to start the cluster and run your pyspark script on it.

  1. Run tmux so that your cluster stays alive if your laptop disconnects from Stampede3.
  2. Use idev to check out nodes. For example, to check out 2 icx nodes you can run idev -N 2 -p icx -t 48:00:00. When the job begins, you'll land on the node that will be the Spark "master" node. This node runs the "driver" process, which coordinates the cluster; it also runs a worker just like the other nodes.
  3. Next, we need to tell Spark which nodes to use. The slurm_workers.sh script makes a list of the nodes that have been assigned to your job. Spark reads the $SPARK_CONF_DIR/workers file to know where to start the executors. Run slurm_workers.sh > workers to update the workers file with the nodes you have checked out.
  4. Start the cluster! Run $SPARK_HOME/sbin/start-all.sh
  5. Finally, you're ready to run your pyspark script: $SPARK_HOME/bin/spark-submit --master spark://$(hostname):7077 my_script.py
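For reference, the workers file is just one hostname per line for each node in your job. A rough, hypothetical Python equivalent of what slurm_workers.sh produces, assuming it expands the compact $SLURM_JOB_NODELIST that Slurm sets (the real script may instead call scontrol show hostnames), looks like:

```python
import re

def expand_nodelist(nodelist):
    """Expand a compact Slurm nodelist like 'c303-[001-003],c304-005'
    into individual hostnames. Handles the common bracket ranges only."""
    nodes = []
    # Match each node spec: a prefix, optionally followed by [ranges].
    for part in re.findall(r"[^,\[]+(?:\[[^\]]*\])?", nodelist):
        m = re.match(r"(.*)\[([^\]]*)\]", part)
        if not m:
            nodes.append(part)  # plain hostname, no brackets
            continue
        prefix, ranges = m.groups()
        for r in ranges.split(","):
            if "-" in r:
                lo, hi = r.split("-")
                width = len(lo)  # preserve zero-padding, e.g. 001
                nodes.extend(f"{prefix}{i:0{width}d}"
                             for i in range(int(lo), int(hi) + 1))
            else:
                nodes.append(prefix + r)
    return nodes
```

Writing "\n".join(expand_nodelist(...)) to $SPARK_CONF_DIR/workers would give Spark the same one-hostname-per-line file that step 3 creates.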

Note: You can work with the Spark cluster interactively by installing pyspark in your environment via pip or uv and then running pyspark --master spark://$(hostname):7077.

Monitoring your jobs[edit]

When your Spark script is running, you'll see lots of information output in your terminal. You'll likely want to know how much progress your job is making, whether it is stuck, and how long it might take to finish. With experience you can glean that information from the terminal output, but it's better to use Spark's web UI for this.

  1. Start the chromium browser on the master node by running /work2/10114/nathante/stampede3/containers/bin/chromium (if you put /work2/10114/nathante/stampede3/containers/bin/ on your PATH by following the container instructions above, you can just run chromium).
  2. Open localhost:8080 in the browser.

Note: If chromium gives an error like ERROR:chrome/browser/process_singleton_posix.cc:358, that's because it was killed before it could remove its lockfile. You can just remove it manually via rm ~/.config/chromium/SingletonLock.

Stopping the cluster[edit]

Don't waste resources! When your job is done, free up your nodes for someone else.

$SPARK_HOME/sbin/stop-all.sh

exit

To automatically stop the cluster and free your nodes after your job succeeds, you can run it all on one line:

$SPARK_HOME/bin/spark-submit --master spark://$(hostname):7077 MY_SCRIPT.py && $SPARK_HOME/sbin/stop-all.sh && exit

Using the GPUs on Stampede3[edit]

The H100 nodes on Stampede3 have 4 NVIDIA H100 GPUs each. That's pretty impressive, and it's possible to use 4 nodes at once (which might be useful if you're fine-tuning an LLM or want to run a really, really big one).

Actually getting them working takes a little bit of doing. Here is a list of steps that ought to work for setting up vllm, a good system for running large language models.

Setting environment variables for uv and vllm[edit]

Add the following to your .bashrc:

export UV_CACHE_DIR=$SCRATCH/.cache

export TRANSFORMERS_CACHE=$SCRATCH/transformers_cache

export HF_HOME=$TRANSFORMERS_CACHE

export XDG_CACHE_HOME=$SCRATCH/.cache

Then reload:

exec bash

Install `uv` in your user site-packages[edit]

Note: Fortunately, TACC has recently installed Python 3.12, so it isn't necessary to use a container anymore. Containers are still sometimes needed to install other software, including some Python packages, but they add some complexity. If you're just starting out with HPC, it's fine not to use a container at first; you can just use uv.

uv is a modern package-management system for Python. You can use the older pip system to install it into your user site-packages:

pip3 install -U uv

To be able to run uv just by typing uv, use an alias. Create a file ~/.bash_aliases and add the line

alias uv="python3 -m uv"

You can do this just by running: echo alias uv=\"python3 -m uv\" >> ~/.bash_aliases

Now, restart your terminal session by typing exec bash.

Create a python virtual environment for your project[edit]

Note: A good workflow is to develop code locally and then test it on the HPC. This helps because: (1) you can use your favorite editor locally instead of working with limited tools like Jupyter, terminal editors, or a GUI over the network; (2) developing directly on HPC nodes wastes compute resources; (3) particularly with the H100 nodes on TACC, you might have to wait (sometimes over a day) for an available node. uv is useful here because it creates pyproject.toml and uv.lock files. If you check these into git and sync them to the HPC, uv will make sure that the Python project on your laptop and on the HPC use the same package versions.

Note: Given the filesystem situation described above, you will normally work with large data objects on the $SCRATCH filesystem, and copy your datasets and results to $CORRAL. Make a new directory (e.g., using mkdir) to use for the following steps.

You can create a virtual environment using uv with this command:

uv init

It is recommended to add `uv` as a dev dependency to the virtual environment and then activate it:

uv add --dev uv

source .venv/bin/activate

Install vllm into your virtual environment[edit]

  1. Get an H100 node by running idev -p h100 -t 48:00:00
  2. Run module load gcc/13.2.0; module load cuda to put the Nvidia CUDA compilers (nvc and nvcc) on your $PATH.
  3. Navigate to the directory where you created your virtual environment and install vllm by running uv pip install vllm[flashinfer] --torch-backend=auto --no-cache
  4. Test your installation by seeing if you can run the gpt-oss-20B model in vllm: uv run vllm serve openai/gpt-oss-20b --async-scheduling
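Once vllm serve is running, it exposes an OpenAI-compatible HTTP API (by default on port 8000). As a sketch, here is a stdlib-only client that builds a chat-completion request against that server; the prompt and max_tokens value are illustrative, and the actual network call is shown in a comment because it only works on the node where the server is running:

```python
import json
import urllib.request

def build_chat_request(prompt, model="openai/gpt-oss-20b", max_tokens=128):
    """Build an OpenAI-style chat completion request for a local vllm server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Say hello.")
# On the H100 node, with `vllm serve` running, you would then do:
#   with urllib.request.urlopen(req) as resp:
#       reply = json.loads(resp.read())
#       print(reply["choices"][0]["message"]["content"])
```

Because the API is OpenAI-compatible, the official openai Python client pointed at http://localhost:8000/v1 works as well.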