PyTorch on Frontier
PyTorch is a Python library that pairs well with HPC resources and facilitates building deep learning (DL) projects. PyTorch emphasizes the flexibility and readability of Python and allows DL models to be expressed in a similar manner. Compared to other frameworks and libraries, it is one of the more “beginner friendly” ML/DL packages due to its dynamic and familiar “Pythonic” nature. PyTorch is also well suited to GPU workloads because of its strong GPU acceleration support. On Frontier, PyTorch is able to take advantage of the many AMD GPUs available on the system.
This guide outlines installation and running best practices for PyTorch on Frontier.
Installing PyTorch
In general, installing either the “stable” or “nightly” wheels of PyTorch>=2.1.0 listed on PyTorch’s website works well on Frontier. When navigating the install instructions on the website, make sure to select “Linux”, “Pip”, and “ROCm” for accurate install instructions. Let’s follow those instructions to install a stable wheel of torch.
First, load your modules:
module load PrgEnv-gnu/8.5.0
module load miniforge3/23.11.0-0
module load rocm/6.1.3
module load craype-accel-amd-gfx90a
Next, create and activate a conda environment in which we will install torch:
conda create -p /path/to/my_env python=3.11 -c conda-forge
source activate /path/to/my_env
Finally, install PyTorch:
pip3 install torch torchvision torchaudio --index-url
You should now be ready to use PyTorch on Frontier!
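A quick way to confirm the install is to ask PyTorch what it sees. The following is a minimal sketch rather than an official check; the helper name describe_torch_install is hypothetical. It relies on two facts: ROCm builds populate torch.version.hip, and AMD GPUs are exposed through the torch.cuda API. On a login node the GPU fields will normally report no devices, so run it inside a compute job.

```python
import importlib.util


def describe_torch_install():
    """Summarize the PyTorch build visible to this Python interpreter."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}
    import torch

    return {
        "installed": True,
        "version": torch.__version__,
        # torch.version.hip is a version string on ROCm builds, None otherwise
        "hip": getattr(torch.version, "hip", None),
        # ROCm devices are reported through the torch.cuda interface
        "gpu_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }


print(describe_torch_install())
```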
For older or more specific wheels to install, take a look at these links:
However, note that older versions of the PyTorch pre-compiled wheels are less likely to work properly on Frontier (especially versions older than v2.1.0). Users interested in older versions of PyTorch, or those needing special configurations, may need to install PyTorch from source instead. If you need to install from source, take a look at AMD’s PyTorch+ROCm fork on GitHub: . If you’re having trouble installing from source, feel free to submit a ticket to .
Optional: Install mpi4py
Although mpi4py isn’t required in general (you can accomplish the same task using system environment variables), it is a nice convenience when you need to set various MPI parameters while using PyTorch for distributed training.
This is taken from our Installing mpi4py and h5py guide:
MPICC="cc -shared" pip install --no-cache-dir --no-binary=mpi4py mpi4py
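For the environment-variable alternative mentioned above, one possible approach is to read the rank information that Slurm exports to every task launched by srun. This is only a sketch: the helper name ranks_from_slurm is hypothetical, and it assumes the standard Slurm variables SLURM_NTASKS, SLURM_PROCID, and SLURM_LOCALID are present (it falls back to a single-task layout when they are not).

```python
import os


def ranks_from_slurm(env=None):
    """Return (world_size, global_rank, node_local_rank) from Slurm variables."""
    env = os.environ if env is None else env
    world_size = int(env.get("SLURM_NTASKS", 1))       # total tasks in the job step
    global_rank = int(env.get("SLURM_PROCID", 0))      # rank across all nodes
    node_local_rank = int(env.get("SLURM_LOCALID", 0)) # rank within this node
    return world_size, global_rank, node_local_rank


print(ranks_from_slurm())
```

These values play the same role as comm.Get_size() and comm.Get_rank() in mpi4py.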
The example below uses mpi4py.
Example Usage
We adapted the DDP tutorial to work with SLURM and mpi4py, and to use 1 GPU per MPI task.
Utilizing all the GPUs on the node in this manner means there will be 8 tasks per node.
Because we are enforcing 1 GPU per task, each MPI task only sees device 0 in PyTorch.
Even if the physical GPU ID on Frontier is different, and even though there are 8 GCDs (GPUs) on a node, the torch device in this case is still 0 due to a task only being mapped to one GPU.
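The device numbering described above can be illustrated with plain arithmetic. This is a sketch with a hypothetical helper name, using the same modulo rule as the script in this section: the torch device index is the global rank modulo the number of GPUs the task can see.

```python
def torch_device_index(global_rank, visible_gpus):
    """Device index a task uses, given how many GPUs are visible to it."""
    return global_rank % visible_gpus


# 2 nodes x 8 tasks, each task bound to exactly one GPU: every index is 0
one_gpu_per_task = [torch_device_index(rank, 1) for rank in range(16)]
print(one_gpu_per_task)

# For contrast: if each task could see all 8 GCDs, indices would cycle 0..7
all_gpus_visible = [torch_device_index(rank, 8) for rank in range(16)]
print(all_gpus_visible)
```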
The adapted script is below:
from mpi4py import MPI  # import mpi4py before torch
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

import torch.multiprocessing as mp
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP
import torch.distributed as dist
import os


class MyTrainDataset(Dataset):
    def __init__(self, size):
        self.size = size
        self.data = [(torch.rand(20), torch.rand(1)) for _ in range(size)]

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        return self.data[index]


class Trainer:
    def __init__(
        self,
        model: torch.nn.Module,
        train_data: DataLoader,
        optimizer: torch.optim.Optimizer,
        save_every: int,
        snapshot_path: str,
        local_rank: int,
        global_rank: int,
    ) -> None:
        self.local_rank = local_rank
        self.global_rank = global_rank
        self.model = model.to(self.local_rank)
        self.train_data = train_data
        self.optimizer = optimizer
        self.save_every = save_every
        self.epochs_run = 0
        self.snapshot_path = snapshot_path
        if os.path.exists(snapshot_path):
            print("Loading snapshot")
            self._load_snapshot(snapshot_path)

        self.model = DDP(self.model, device_ids=[self.local_rank])

    def _load_snapshot(self, snapshot_path):
        loc = f"cuda:{self.local_rank}"
        snapshot = torch.load(snapshot_path, map_location=loc)
        self.model.load_state_dict(snapshot["MODEL_STATE"])
        self.epochs_run = snapshot["EPOCHS_RUN"]
        print(f"Resuming training from snapshot at Epoch {self.epochs_run}")

    def _run_batch(self, source, targets):
        self.optimizer.zero_grad()
        output = self.model(source)
        loss = F.cross_entropy(output, targets)
        loss.backward()
        self.optimizer.step()

    def _run_epoch(self, epoch):
        b_sz = len(next(iter(self.train_data))[0])
        print(f"[GPU{self.global_rank}] Epoch {epoch} | Batchsize: {b_sz} | Steps: {len(self.train_data)}")
        self.train_data.sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for source, targets in self.train_data:
            source = source.to(self.local_rank)
            targets = targets.to(self.local_rank)
            self._run_batch(source, targets)

    def _save_snapshot(self, epoch):
        snapshot = {
            "MODEL_STATE": self.model.module.state_dict(),
            "EPOCHS_RUN": epoch,
        }
        torch.save(snapshot, self.snapshot_path)
        print(f"Epoch {epoch} | Training snapshot saved at {self.snapshot_path}")

    def train(self, max_epochs: int):
        for epoch in range(self.epochs_run, max_epochs):
            self._run_epoch(epoch)
            # With 1 GPU per task every task has local_rank 0, so use the
            # global rank to make sure only one task writes the snapshot.
            if self.global_rank == 0 and epoch % self.save_every == 0:
                self._save_snapshot(epoch)


def load_train_objs():
    train_set = MyTrainDataset(2048)  # load your dataset
    model = torch.nn.Linear(20, 1)  # load your model
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    return train_set, model, optimizer


def prepare_dataloader(dataset: Dataset, batch_size: int):
    return DataLoader(
        dataset,
        batch_size=batch_size,
        pin_memory=True,
        shuffle=False,  # shuffling is handled by the DistributedSampler
        sampler=DistributedSampler(dataset),
    )


def main(save_every: int, total_epochs: int, batch_size: int, local_rank: int, global_rank: int, snapshot_path: str = "snapshot.pt"):
    dataset, model, optimizer = load_train_objs()
    train_data = prepare_dataloader(dataset, batch_size)
    trainer = Trainer(model, train_data, optimizer, save_every, snapshot_path, local_rank, global_rank)
    trainer.train(total_epochs)
    dist.destroy_process_group()


if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description='simple distributed training job')
    parser.add_argument('total_epochs', type=int, help='Total epochs to train the model')
    parser.add_argument('save_every', type=int, help='How often to save a snapshot')
    parser.add_argument('--batch_size', default=32, type=int, help='Input batch size on each device (default: 32)')
    parser.add_argument("--master_addr", type=str, required=True)
    parser.add_argument("--master_port", type=str, required=True)
    args = parser.parse_args()

    num_gpus_per_node = torch.cuda.device_count()
    print("num_gpus_per_node = " + str(num_gpus_per_node), flush=True)

    # Take the world size and rank from MPI
    comm = MPI.COMM_WORLD
    world_size = comm.Get_size()
    global_rank = rank = comm.Get_rank()
    local_rank = int(rank) % int(num_gpus_per_node)  # local_rank and device are 0 when using 1 GPU per task
    backend = None  # let PyTorch pick its default backend
    os.environ['WORLD_SIZE'] = str(world_size)
    os.environ['RANK'] = str(global_rank)
    os.environ['LOCAL_RANK'] = str(local_rank)
    os.environ['MASTER_ADDR'] = str(args.master_addr)
    os.environ['MASTER_PORT'] = str(args.master_port)
    os.environ['NCCL_SOCKET_IFNAME'] = 'hsn0'

    dist.init_process_group(
        backend=backend,
        #init_method="tcp://{}:{}".format(args.master_addr, args.master_port),
        rank=rank,
        world_size=world_size,
    )

    torch.cuda.set_device(local_rank)

    main(args.save_every, args.total_epochs, args.batch_size, local_rank, global_rank)
To run the Python script, an example batch script is given below:
#!/bin/bash
#SBATCH -A <project_id>
#SBATCH -J ddp_test
#SBATCH -o logs/ddp_test-%j.o
#SBATCH -e logs/ddp_test-%j.e
#SBATCH -t 00:05:00
#SBATCH -p batch
#SBATCH -N 2
# Only necessary if submitting like: sbatch --export=NONE ... (recommended)
# Do NOT include this line when submitting without --export=NONE
unset SLURM_EXPORT_ENV
# Load modules
module load PrgEnv-gnu/8.5.0
module load rocm/6.1.3
module load craype-accel-amd-gfx90a
module load miniforge3/23.11.0-0
# Activate your environment
source activate /path/to/my_env
# Get address of head node
export MASTER_ADDR=$(hostname -i)
# Needed to bypass MIOpen, Disk I/O Errors
export MIOPEN_USER_DB_PATH="/tmp/my-miopen-cache"
# Run script
srun -N2 -n16 -c7 --gpus-per-task=1 --gpu-bind=closest python3 -W ignore -u ./ 2000 10 --master_addr=$MASTER_ADDR --master_port=3442
As mentioned on our Python on OLCF Systems page, submitting batch scripts like below is recommended when using conda environments:
sbatch --export=NONE
After running the script, you will have successfully used PyTorch to train on 16 different GPUs for 2000 epochs and save a training snapshot. Depending on how long PyTorch takes to initialize, the script should complete in 10-20 seconds. If the script is able to utilize any cache (e.g., if you ran the script again in the same compute job), then it should complete in approximately 5 seconds.
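The "Steps" value printed by each rank in this run follows from how the dataset is sharded. As a rough sketch (the helper name is hypothetical), assuming the default behavior of DistributedSampler and DataLoader, where neither drops the last partial shard or batch, each rank gets an equal share of the dataset and then splits it into batches:

```python
import math


def steps_per_epoch(dataset_size, world_size, batch_size):
    """Batches each rank processes per epoch when partial batches are kept."""
    samples_per_rank = math.ceil(dataset_size / world_size)
    return math.ceil(samples_per_rank / batch_size)


# 2048 samples, 16 GPUs, default batch size of 32 -> 128 samples and 4 steps per rank
print(steps_per_epoch(2048, 16, 32))
```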
Best Practices
Master Address and Sockets
We highly recommend setting MASTER_ADDR when assigning host addresses:
export MASTER_ADDR=$(hostname -i)
There are different master ports you can use, but we typically recommend using port 3442 for MASTER_PORT:
export MASTER_PORT=3442
Setting the variables above is of utmost importance when using multiple nodes.
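Because a missing or malformed value tends to surface only as a hang or an obscure connection error, it can help to validate these variables in Python before initializing torch.distributed. The following is a minimal sketch with a hypothetical helper name, not part of the documented workflow:

```python
import os


def rendezvous_settings(env=None):
    """Return (MASTER_ADDR, MASTER_PORT) or raise if either is unusable."""
    env = os.environ if env is None else env
    missing = [name for name in ("MASTER_ADDR", "MASTER_PORT") if not env.get(name)]
    if missing:
        raise RuntimeError("Set " + ", ".join(missing) + " before initializing torch.distributed")
    port = int(env["MASTER_PORT"])
    if not 1 <= port <= 65535:
        raise ValueError(f"MASTER_PORT out of range: {port}")
    return env["MASTER_ADDR"], port


# Example with explicit values instead of the real environment
print(rendezvous_settings({"MASTER_ADDR": "10.0.0.1", "MASTER_PORT": "3442"}))
```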
Please avoid using torchrun if possible.
It is recommended to use srun to handle the task mapping instead.
On Frontier, the use of torchrun significantly degrades the performance of your code.
Initial tests have shown that a script which normally runs on the order of 10 seconds can take up to 10 minutes to run when using torchrun, which is over an order of magnitude worse!
Additionally, nesting torchrun within srun (i.e., srun torchrun ...) does not help, as the two task managers will clash.
Environment Location
Where your PyTorch environment is stored on Frontier makes a big difference in performance.
Although NFS locations avoid purge policies, environments stored on NFS (e.g., /ccs/home/ or /ccs/proj/) initialize and run PyTorch slower than other locations.
Storing your environment on Lustre (Orion) is faster than NFS, but it can still be slow to initialize (especially at scale).
It is highly recommended to move your environment to the NVMe using sbcast.
Although using sbcast introduces some overhead, in the long run it is much faster at initializing PyTorch and other libraries in general.
More information on how to use sbcast and conda-pack to move your environment to the NVMe can be found on our Sbcast Conda Environments guide.
In a nutshell: NVMe > Orion >> NFS.
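To see which tier the active environment is running from, one option is to inspect sys.prefix at the start of a job. This is a sketch only: the helper name is hypothetical, the NFS prefixes come from the text above, and the /lustre/orion/ and /mnt/bb/ prefixes for Orion and the node-local NVMe are assumptions that should be checked against your own paths.

```python
import sys

# (path prefix, storage label); the Orion and NVMe prefixes are assumptions
STORAGE_PREFIXES = (
    ("/ccs/home/", "NFS"),
    ("/ccs/proj/", "NFS"),
    ("/lustre/orion/", "Orion (Lustre)"),
    ("/mnt/bb/", "NVMe"),
)


def classify_env_location(prefix=None):
    """Label the storage tier that holds the given environment prefix."""
    prefix = sys.prefix if prefix is None else prefix
    path = prefix.rstrip("/") + "/"
    for root, label in STORAGE_PREFIXES:
        if path.startswith(root):
            return label
    return "unknown"


print(classify_env_location())
```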
The AWS-OFI-RCCL plugin enables using libfabric as a network provider while running AMD’s RCCL based applications. This plugin can be built and used by common ML/DL libraries like PyTorch to increase performance when running on AMD GPUs.
To build the plugin on Frontier (using ROCm 5.7.1 as an example):
# ROCm version used by the commands below (5.7.1 in this example)
rocm_version=5.7.1
# Load modules
module load PrgEnv-gnu/8.5.0
module load rocm/$rocm_version
module load craype-accel-amd-gfx90a
module load gcc-native/12.3
module load cray-mpich/8.1.28
# Download the plugin repo
git clone --recursive
cd aws-ofi-rccl
# Build the plugin
export LD_LIBRARY_PATH=/opt/rocm-$rocm_version/hip/lib:$LD_LIBRARY_PATH
CC=hipcc CFLAGS=-I/opt/rocm-$rocm_version/rccl/include ./configure \
--with-libfabric=$libfabric_path --with-rccl=/opt/rocm-$rocm_version --enable-trace \
--prefix=$PLUG_PREFIX --with-hip=/opt/rocm-$rocm_version/hip --with-mpi=$MPICH_DIR
make install
# Reminder to export the plugin to your path
echo "Add the following line in the environment to use the AWS OFI RCCL plugin"
RCCL library location varies based on ROCm version.
Before 6.0.0:
After 6.0.0:
Once the plugin is installed, you must include it in your LD_LIBRARY_PATH when running applications to use it:
More information about RCCL, the plugin, and profiling its effect on Frontier applications can be found here.
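A simple sanity check before launching a job is to confirm that the plugin library can actually be found on LD_LIBRARY_PATH. The sketch below uses a hypothetical helper and assumes the plugin's shared library is named librccl-net.so; verify the file name in your own install prefix before relying on it.

```python
import os


def find_in_library_path(filename, library_path=None):
    """Return the first match for filename on LD_LIBRARY_PATH, or None."""
    if library_path is None:
        library_path = os.environ.get("LD_LIBRARY_PATH", "")
    for directory in library_path.split(os.pathsep):
        candidate = os.path.join(directory, filename)
        if directory and os.path.isfile(candidate):
            return candidate
    return None


# "librccl-net.so" is an assumed file name for the plugin library
print(find_in_library_path("librccl-net.so"))
```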
Environment Variables
When running with the NCCL (RCCL) backend, there are specific environment variables that you should test to see how they affect your application’s performance. Some variables to try are:
NCCL_NET_GDR_LEVEL=3 # Can improve performance, but remove this setting if you encounter a hang/crash.
NCCL_ALGO=TREE or RING # May see performance difference with either setting. (should not need to use this, but can try)
NCCL_CROSS_NIC=1 # On large systems, this NCCL setting has been found to improve performance
NCCL_DEBUG=info # For debugging only (warning: generates a large amount of messages)
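These variables can be exported in the batch script, or set from Python as long as that happens before the process group is initialized. The following sketch uses a hypothetical helper that applies a set of trial values without overriding anything already exported by the job script, which makes it easier to compare settings between runs:

```python
import os

# Trial values taken from the list above; adjust or remove as needed
CANDIDATE_SETTINGS = {
    "NCCL_NET_GDR_LEVEL": "3",
    "NCCL_CROSS_NIC": "1",
}


def apply_nccl_settings(settings, env=None):
    """Set each variable unless it is already defined; return what was set."""
    env = os.environ if env is None else env
    applied = {}
    for name, value in settings.items():
        if name not in env:  # values from the batch script take precedence
            env[name] = str(value)
            applied[name] = env[name]
    return applied


# Demonstrated on a plain dict; pass no env argument to modify os.environ
demo_env = {"NCCL_CROSS_NIC": "0"}
print(apply_nccl_settings(CANDIDATE_SETTINGS, demo_env))
```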
PyTorch Geometric
PyTorch Geometric (also known as PyG or torch_geometric) is a library built upon PyTorch to easily write and train Graph Neural Networks (GNNs).
Assuming you already have a working PyTorch installation (see above), install instructions for the torch_geometric suite of libraries on Frontier are provided below:
# Activate your virtual environment
source activate /path/to/my_env
# Install some build tools
pip install ninja packaging
# Install PyG libraries (latest tested versions in comments)
MAX_JOBS=16 pip install torch-geometric # v2.6.1
MAX_JOBS=16 pip install torch-cluster # v1.6.3
MAX_JOBS=16 pip install torch-spline-conv # v1.2.2
git clone --recursive # v0.6.18
cd pytorch_sparse
CC=gcc CXX=g++ MAX_JOBS=16 python3 setup.py bdist_wheel
pip install dist/*.whl
cd ..
git clone --recursive # v2.1.2
cd pytorch_scatter
CC=gcc CXX=g++ MAX_JOBS=16 python3 setup.py bdist_wheel
pip install dist/*.whl
cd ..
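After the builds finish, a quick way to see which pieces of the suite are visible to the environment is to look up each module without importing it. This is a sketch with a hypothetical helper name; the module names are the import names of the five packages installed above.

```python
import importlib.util

PYG_MODULES = (
    "torch_geometric",
    "torch_cluster",
    "torch_spline_conv",
    "torch_sparse",
    "torch_scatter",
)


def installed_pyg_modules(names=PYG_MODULES):
    """Map each module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, found in installed_pyg_modules().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

Locating a module does not prove its compiled extensions load correctly, so an actual import inside a compute job is still a worthwhile follow-up.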
MPICH mpi4py Errors
If you see MPICH error messages indicating that a given rank isn’t confined to a single NUMA node or domain, like this:
MPICH ERROR: Unable to use a NIC_POLICY of 'NUMA'. Rank 4 is not confined to a single NUMA node. There are 4 numa_nodes detected (rc=0).
MPICH ERROR [Rank 0] [job id 2853270.0] [Fri Dec 13 13:41:36 2024] [frontier05084] - Abort(2665871) (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIDI_CRAY_ofi_get_nic_index(1801)....: OFI invalid value for environment variable
and you are sure you are mapping your cores correctly via srun, try importing mpi4py before torch.
A recent update in PyTorch broke importing mpi4py after torch.
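In a large code base it is not always obvious which import happens first. One way to check, sketched below with a hypothetical helper, relies on sys.modules preserving insertion order (dictionaries are ordered in Python 3.7+), so the position of each top-level package reflects when it was first imported.

```python
import sys


def imported_before(first, second, module_names=None):
    """True if `first` was imported before `second`, or `second` is absent."""
    names = list(sys.modules) if module_names is None else list(module_names)
    if second not in names:
        return True  # nothing to conflict with yet
    if first not in names:
        return False  # `second` is loaded but `first` is not
    return names.index(first) < names.index(second)


# Place this check after your imports to catch an accidental reordering
if not imported_before("mpi4py", "torch"):
    print("Warning: torch was imported before mpi4py")
```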
If you still see these errors, please contact
for other workarounds (because it’s likely not a PyTorch issue).
Proxy Settings
By default, the compute nodes are closed off from the internet. If you need access for certain use-cases (e.g., need to download a checkpoint or pre-trained model) you can go through our proxy server. Set these environment variables in your batch script if needed:
export all_proxy=socks://
export ftp_proxy=
export http_proxy=
export https_proxy=
export no_proxy='localhost,,*'
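If a download fails inside a job, a first thing to rule out is that one of these variables never reached the Python process. The sketch below (hypothetical helper name) only reports which of the variables listed above are unset or empty; it does not know or set the proxy addresses themselves.

```python
import os

PROXY_VARS = ("all_proxy", "ftp_proxy", "http_proxy", "https_proxy", "no_proxy")


def missing_proxy_vars(env=None):
    """Return the proxy variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in PROXY_VARS if not env.get(name)]


print("Unset proxy variables:", missing_proxy_vars())
```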
c10d Socket Warnings
When using PyTorch and DDP, you may get warning messages like this:
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to []:3442
(errno: 97 - Address family not supported by protocol).
Messages like the one above are harmless and do not affect PyTorch+DDP when you’re using the NCCL/RCCL backend. Context: After PyTorch v1.x, when using tcp to initialize PyTorch DDP, the default is to use IPv6 addresses; PyTorch falls back to IPv4 if IPv6 does not work.
Dataset Cache
The default cache directory is in your $HOME directory, so you may run into quota issues if datasets get too large or if you have multiple datasets cached at that location.
Some packages let you indicate where you want your dataset cache to be stored.
For example, to manage your Hugging Face cache, you can change it from the default ~/.cache/huggingface/datasets by setting:
export HF_DATASETS_CACHE="/path/to/another/directory"
It is recommended to move your cache directory to another location if you’re seeing quota issues; however, if you store your cache directory on Orion, be mindful that data stored on Orion is subject to purge policies if data is not accessed often.
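The same variable can also be set from inside a Python script, provided it is set before the Hugging Face datasets library is imported. The following is a minimal sketch with a hypothetical helper name and a placeholder path:

```python
import os


def set_hf_datasets_cache(path, env=None):
    """Point HF_DATASETS_CACHE at `path` and return the resolved location."""
    env = os.environ if env is None else env
    env["HF_DATASETS_CACHE"] = os.path.abspath(os.path.expanduser(path))
    return env["HF_DATASETS_CACHE"]


# Demonstrated on a plain dict; omit the second argument to change os.environ,
# and do so before importing the datasets library.
demo_env = {}
print(set_hf_datasets_cache("/path/to/another/directory", demo_env))
```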
