IBM Watson Machine Learning CE¶
Getting Started¶
IBM Watson Machine Learning Community Edition is provided on Summit
through the module ibm-wml-ce. This module includes a license for IBM
Distributed Deep Learning (DDL), allowing execution across up to 954 nodes.
To access the IBM WML CE packages, use the module load command:
module load ibm-wml-ce
This will activate a conda environment which is pre-loaded with the following packages, and their dependencies:
| IBM WML CE Version | ibm-wml-ce/1.6.1                   | ibm-wml-ce/1.6.2                  |
|--------------------|------------------------------------|-----------------------------------|
| Package            | IBM DDL 1.4.0                      | IBM DDL 1.5.0                     |
|                    | TensorFlow 1.14                    | TensorFlow 1.15                   |
|                    | PyTorch 1.1.0                      | PyTorch 1.2.0                     |
|                    | Caffe (IBM-enhanced) 1.0.0         | Caffe (IBM-enhanced) 1.0.0        |
|                    | Horovod @9f87459 (IBM-DDL Backend) | Horovod v0.18.2 (IBM-DDL Backend) |
| Complete List      | 1.6.1 Software Packages            | 1.6.2 Software Packages           |
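If a specific release is needed, the module version can be given explicitly, and conda can then be used to inspect the active environment. A minimal sketch (the grep pattern is only illustrative):

$ module load ibm-wml-ce/1.6.2
(ibm-wml-ce-1.6.2) $ conda list | grep -iE "ddl|tensorflow|pytorch|horovod"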
Running DDL Jobs¶
IBM DDL provides a tool called ddlrun which facilitates the launching of
DDL jobs on Summit. When called inside of a bsub script, ddlrun will
automatically distribute the training job across all the compute hosts in
the reservation. ddlrun automatically selects default arguments optimized
for performance.
Basic DDL BSUB Script¶
The following bsub script will run a distributed Tensorflow resnet50 training job across 2 nodes.
#BSUB -P <PROJECT>
#BSUB -W 0:10
#BSUB -nnodes 2
#BSUB -q batch
#BSUB -J ddl_test_job
#BSUB -o /ccs/home/<user>/job%J.out
#BSUB -e /ccs/home/<user>/job%J.out
module load ibm-wml-ce
ddlrun python $CONDA_PREFIX/tf_cnn_benchmarks/tf_cnn_benchmarks.py --variable_update=ddl --model=resnet50
bsub is used to launch the script as follows:

bsub script.bash

For more information on bsub and job submission please see: Running Jobs.
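After submission, standard LSF commands can be used to check on the job. A sketch (the job name matches the #BSUB -J line in the script above, and <jobid> stands for the ID that bsub prints):

$ bjobs -J ddl_test_job
$ less /ccs/home/<user>/job<jobid>.out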
Troubleshooting Tips¶
Full command¶
The output from ddlrun includes the exact command used to launch the
distributed job. This is useful if a user wants to see exactly what ddlrun
is doing. The following shows the beginning of the output from the above script:
$ module load ibm-wml-ce
(ibm-wml-ce-1.6.1) $ ddlrun python $CONDA_PREFIX/tf_cnn_benchmarks/tf_cnn_benchmarks.py --variable_update=ddl --model=resnet50
+ /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/xl-16.1.1-3/spectrum-mpi-10.3.0.1-20190611-aqjt3jo53mogrrhcrd2iufr435azcaha/bin/mpirun \
-x LSB_JOBID -x PATH -x PYTHONPATH -x LD_LIBRARY_PATH -x LSB_MCPU_HOSTS -x NCCL_LL_THRESHOLD=0 -x NCCL_TREE_THRESHOLD=0 \
-disable_gdr -gpu --rankfile /tmp/DDLRUN/DDLRUN.xoObgjtixZfp/RANKFILE -x "DDL_OPTIONS=-mode p:6x2x1x1 " -n 12 \
-mca plm_rsh_num_concurrent 12 -x DDL_HOST_PORT=2200 -x "DDL_HOST_LIST=g28n14:0,2,4,6,8,10;g28n15:1,3,5,7,9,11" bash \
-c 'source /sw/summit/ibm-wml-ce/anaconda-base/etc/profile.d/conda.sh && conda activate /sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.1-1 \
> /dev/null 2>&1 && python /sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.1-1/ddl-tensorflow/examples/mnist/mnist-env.py'
...
Verbose mode¶
Using the verbose flag (-v) with ddlrun displays much more debugging
information. This should be the first step when troubleshooting errors
launching a distributed job.
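For example, the benchmark job from the BSUB script above could be launched in verbose mode as follows; the additional diagnostic output appears alongside the normal job output:

ddlrun -v python $CONDA_PREFIX/tf_cnn_benchmarks/tf_cnn_benchmarks.py --variable_update=ddl --model=resnet50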
Setting up Custom Environments¶
The IBM-WML-CE conda environment is read-only. Therefore, users cannot
install any additional packages that may be needed. If users need any
additional conda or pip packages, they can clone the IBM-WML-CE conda
environment into their home directory and then add any packages they need.
$ module load ibm-wml-ce
(ibm-wml-ce-1.6.2) $ conda create --name cloned_env --clone ibm-wml-ce-1.6.2
(ibm-wml-ce-1.6.2) $ conda activate cloned_env
(cloned_env) $
By default this should create the cloned environment in
/ccs/home/${USER}/.conda/envs/cloned_env.

To activate the new environment you should still load the module first. This
will ensure that all of the conda settings remain the same.
$ module load ibm-wml-ce
(ibm-wml-ce-1.6.2) $ conda activate cloned_env
(cloned_env) $
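With the clone active, additional packages install into the clone rather than the read-only base environment. A sketch (the package names here are examples only, not part of the base environment):

(cloned_env) $ conda install -y scikit-learn
(cloned_env) $ pip install tqdm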
To use Horovod with the IBM DDL backend in a cloned environment, the user must
pip install Horovod using the following command:
HOROVOD_CUDA_HOME="${CONDA_PREFIX}" HOROVOD_GPU_ALLREDUCE=DDL pip install --no-cache-dir git+https://github.com/horovod/horovod.git@9f87459ead9ebb7331e1cd9cf8e9a5543ecfb784
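After the build completes, a quick sanity check is to confirm that Horovod imports and initializes inside the clone. A sketch, assuming it is run under ddlrun from within an LSF job allocation (ddlrun needs the allocation information):

(cloned_env) $ ddlrun python -c "import horovod.tensorflow as hvd; hvd.init(); print(hvd.rank(), 'of', hvd.size())"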
Best DDL Performance¶
Most users will get good performance using LSF basic job submission and
specifying the node count with -nnodes N. However, users trying to squeeze
out the final few percent of performance can use the following techniques.
Reserving Whole Racks¶
When making node reservations for DDL jobs, it is best to reserve nodes in a rack-contiguous manner. IBM DDL optimizes communication with knowledge of the node layout.
In order to instruct BSUB to reserve nodes in the same rack, expert mode must
be used (-csm y), and the user needs to explicitly specify the reservation
string. For more information on expert mode see: Easy Mode vs. Expert Mode.

The following BSUB arguments and reservation string instruct bsub to
reserve 2 compute nodes within the same rack:
#BSUB -csm y
#BSUB -n 85
#BSUB -R 1*{select[((LN)&&(type==any))]order[r15s:pg]span[hosts=1]cu[type=rack:pref=config]}+84*{select[((CN)&&(type==any))]order[r15s:pg]span[ptile=42]cu[type=rack:maxcus=1]}
-csm y enables ‘expert mode’.

-n 85 requests the total number of slots, as -nnodes is not compatible with
expert mode.
We can break the reservation string down to understand each piece.
The first term is needed to include a launch node in the reservation.
1*{select[((LN)&&(type==any))]order[r15s:pg]span[hosts=1]cu[type=rack:pref=config]}
The second term specifies how many compute slots and how many racks.
+84*{select[((CN)&&(type==any))]order[r15s:pg]span[ptile=42]cu[type=rack:maxcus=1]}
- Here the 84 slots represent 2 compute nodes; each compute node has 42 compute slots.
- The maxcus=1 specifies that the nodes can come from at most 1 rack.
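The same pattern scales to other node counts within a single rack. As a sketch, a full rack of 18 nodes would need 18 x 42 = 756 compute slots plus 1 launch slot, still constrained to one rack:

#BSUB -csm y
#BSUB -n 757
#BSUB -R 1*{select[((LN)&&(type==any))]order[r15s:pg]span[hosts=1]cu[type=rack:pref=config]}+756*{select[((CN)&&(type==any))]order[r15s:pg]span[ptile=42]cu[type=rack:maxcus=1]}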
Best DDL Arguments¶
Summit is comprised of 256 racks of 18 nodes with 6 GPUs each. For more information about the hardware of Summit please see: System Overview.
DDL works best with topological knowledge of the cluster:

GPUs per Node X Nodes per Rack X Racks Per Aisle X Aisles

Some of this information can be acquired automatically, but some has to be
specified by the user.

To get the best performance, reservations should be made in multiples of 18,
and the user should pass topology arguments to ddlrun.
- --nodes 18 informs DDL that there are 18 nodes per rack. Specifying 18 nodes per rack gave the best performance in preliminary testing, but it may be that logically splitting racks in half (--nodes 9) or logically grouping racks (--nodes 36) could lead to better performance on other workloads.
- --racks 4 informs DDL that there are 4 racks per aisle. Summit is a fat tree, but preliminary testing showed that grouping racks into logical aisles of 4 racks gave the best performance.
- --aisles 2 informs DDL that there are 2 total aisles.
- Nodes X Racks X Aisles must equal the total number of nodes in the LSF reservation.
If running on 144 nodes (18 X 4 X 2), the following ddlrun command should
give good performance.
ddlrun --nodes 18 --racks 4 --aisles 2 python script.py
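For other reservation sizes the three factors just need to multiply out to the total node count. As a sketch, a 36-node job could be described as two full racks in a single logical aisle (other factorizations that satisfy the product rule are equally valid):

ddlrun --nodes 18 --racks 2 --aisles 1 python script.py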
For more information on ddlrun, please see: DDLRUN.
Example¶
The following graph shows the scaling performance of the tf_cnn_benchmarks
implementation of the Resnet50 model running on Summit during initial
benchmark testing.

Figure 1. Performance Scaling of IBM DDL on Summit¶
The following LSF script can be used to reproduce the results for 144 nodes:
#BSUB -P <PROJECT>
#BSUB -W 1:00
#BSUB -csm y
#BSUB -n 6049
#BSUB -R "1*{select[((LN) && (type == any))] order[r15s:pg] span[hosts=1] cu[type=rack:pref=config]}+6048*{select[((CN) && (type == any))] order[r15s:pg] span[ptile=42] cu[type=rack:maxcus=8]}"
#BSUB -q batch
#BSUB -J <PROJECT>
#BSUB -o /ccs/home/user/job%J.out
#BSUB -e /ccs/home/user/job%J.out
module load ibm-wml-ce
ddlrun --nodes 18 --racks 4 --aisles 2 python $CONDA_PREFIX/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --variable_update=horovod \
--model=resnet50 \
--num_gpus=1 \
--batch_size=256 \
--num_batches=100 \
--num_warmup_batches=10 \
--data_name=imagenet \
--allow_growth=True \
--use_fp16
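To reproduce a data point from the graph, the script can be submitted with bsub and the throughput read from the job log once it completes. A sketch (script.bash is a placeholder name for the script above, and "total images/sec" is the summary line that tf_cnn_benchmarks normally prints):

bsub script.bash
grep "total images/sec" /ccs/home/user/job<jobid>.out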