
Score-P

Overview

The goal of Score-P is to simplify the analysis of high-performance computing software and enable developers to find performance problems. The Score-P measurement infrastructure is a highly scalable and easy-to-use tool suite for profiling, event tracing, and online analysis of HPC applications. Score-P supports analyzing C, C++, and Fortran applications that make use of multi-processing (MPI, SHMEM), thread parallelism (OpenMP, Pthreads), and accelerators (HIP, CUDA, OpenCL, OpenACC, OpenMP offload), as well as combinations of these. It works in combination with Scalasca, Vampir, TAU, and Periscope, generating traces in OTF2 format (Vampir, Scalasca, TAU) and profiles in CUBE4 and TAU formats.

See https://www.vi-hps.org/projects/score-p/ for details about Score-P.

Usage

Steps in a typical Score-P workflow on an OLCF machine (e.g., Frontier or Andes) are listed below; a condensed command sketch follows the list. (More info about OLCF Systems.)

  1. Connect to an OLCF system: ssh <user_id>@<machine>.olcf.ornl.gov

  2. Instrument your code with Score-P

  3. Perform a measurement run with profiling enabled

  4. Perform a profile analysis with CUBE or cube_stat

  5. Use scorep-score to define a filter

  6. Perform a measurement run with tracing enabled and the filter applied

  7. Perform in-depth analysis on the trace data with Vampir
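
As a quick reference, the workflow condenses to a handful of commands. The sketch below uses Frontier module and file names from the rest of this page; the batch script names and the application source are placeholders.

$ module unload darshan-runtime
$ module load scorep-gcc-amd otf2 cubew
$ scorep mpicc -o myapp myapp.c                    # instrument while compiling and linking
$ sbatch submit_profile.sbatch                     # profiling run (SCOREP_ENABLE_PROFILING=true)
$ scorep-score -r <experiment dir>/profile.cubex   # estimate trace size, then write scorep.filter
$ sbatch submit_trace.sbatch                       # tracing run with SCOREP_ENABLE_TRACING=true and
                                                   # SCOREP_FILTERING_FILE=scorep.filter
# Analyze <experiment dir>/traces.otf2 with Vampir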

Instrumentation

To instrument your code, recompile it with the Score-P instrumentation command (scorep) prepended to your compile and link statements. In most cases the Score-P instrumenter is able to automatically detect the programming paradigm from the set of compile and link options given to the compiler. Some cases may, however, require additional options in the compile statement, e.g., for ROCm HIP or CUDA instrumentation.

Note

You will need to unload the darshan-runtime module if it is loaded. In some instances you may need to unload the xalt and xl modules.

$ module unload darshan-runtime
# If needed (see note above):
# $ module unload xalt xl

Get a list of available versions of Score-P and select the one you want to use.

# Find available scorep modules
$ module spider scorep
...
# Returned choices on Frontier: scorep-gcc-amd, scorep-amd, scorep-amdclang, scorep-cray

# Unload the darshan-runtime module if it is loaded
$ module unload darshan-runtime

# Load the desired version, e.g.: Score-P with GNU compiler
$ module load scorep-gcc-amd

# If you want to see how scorep was configured
# module show scorep-gcc-amd

# Getting information on scorep flags
$ scorep --help

# Info about the compiler wrappers (e.g. scorep-amdclang, scorep-mpicc, etc.)
$ scorep-wrapper --help

Below are some basic examples of the different ways to instrument your code with Score-P. The examples use the GNU compiler, but the same principles apply to other compilers (e.g., Intel, Cray, etc.).

# Prepend compile and link commands with scorep
# scorep <COMPILER/LINKER> mytestcode.ext
# scorep <scorep-options> <COMPILER/LINKER> <options> mytestcode.ext

# For C (similar for C++, Fortran)
$ scorep gcc -c test.c
$ scorep gcc -o test test.o

# For Fortran
$ scorep gfortran -c test.f90
$ scorep gfortran -o test test.o

# For MPI (using mpicc, mpiCC, mpifort as needed)
$ module load <mpi-module>
$ scorep mpicc -c test.c   # Use -fopenmp and -pthread as needed
$ scorep mpicc -o test test.o

# For GPU code (ROCM, CUDA, etc)
# The scorep wrapper will usually autodetect the GPU compiler,
# but you can force it with --hip, --cuda, etc
# Using ROCm HIP
$ scorep --hip hipcc -L${OLCF_ROCM_ROOT}/lib -c test.c
# Using CUDA
$ scorep --cuda --user nvc++ -cuda -L${OLCF_CUDA_ROOT}/lib64 -c test.c
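
The link step must also be prefixed with scorep so that the Score-P measurement libraries are linked into the executable; a minimal sketch (object and executable names are illustrative):

# Linking an instrumented HIP executable
$ scorep --hip hipcc test.o -o test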

CMake / Autotools Instrumentation

CMake and Autotools based build systems run a number of small configuration tests to probe the system, and these tests will often fail if scorep is prefixed as above. To get around this, use the provided scorep-wrapper scripts (e.g., scorep-gcc, scorep-mpicc) together with the variable SCOREP_WRAPPER=off. This switches the wrapper's instrumentation off at configuration time, but scorep is still used when the application itself is built. For more detailed information on using Score-P with CMake or Autotools, see the Score-P documentation.

For CMake and Autotools based builds, run configure in the following way:

# Get information on the scorep-wrapper scripts
$ scorep-wrapper --help

# Example for CMake build generation with GNU compiler-wrappers
$ SCOREP_WRAPPER=off cmake .. \
     -DCMAKE_C_COMPILER=scorep-gcc \
     -DCMAKE_CXX_COMPILER=scorep-g++ \
     -DCMAKE_Fortran_COMPILER=scorep-ftn

# Example for autotools with GNU compiler-wrappers
$ SCOREP_WRAPPER=off ../configure \
     CC=scorep-gcc \
     CXX=scorep-g++ \
     FC=scorep-ftn \
     --disable-dependency-tracking
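
After configuring, build as usual; scorep is invoked by the wrapper scripts at build time. Additional options can be passed through the wrapper variables described in scorep-wrapper --help. A sketch:

# Build with instrumentation (the wrappers call scorep automatically)
$ make

# Pass extra Score-P or compiler options through the wrappers at build time
$ make SCOREP_WRAPPER_INSTRUMENTER_FLAGS="--user" \
       SCOREP_WRAPPER_COMPILER_FLAGS="-O3"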

Makefile Instrumentation

Setting a variable such as PREP = scorep within a Makefile simplifies enabling and disabling instrumentation when using make. Additional Score-P options can be added to the PREP variable, e.g. --hip. To disable the instrumentation, simply set the PREP variable to an empty string. Below is an example of a Makefile that uses Score-P with ROCm HIP; a usage example follows the Makefile.

## Makefile for Score-P with ROCm HIP

CC = hipcc
CFLAGS =
PREP = scorep --hip

INCLUDES  = -I<Path to Includes>/include # if needed
LIBRARIES = -L<Path to Libraries>/lib64 # if needed

test: test.o
   $(PREP) $(CC) $(CFLAGS) $(LIBRARIES) test.o -o test

test.o: test.c
   $(PREP) $(CC) $(CFLAGS) $(INCLUDES) -c test.c

.PHONY: clean

clean:
   rm -f test *.o
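
With the PREP variable in place, instrumentation can be toggled from the make command line (note that make requires recipe lines to be indented with a tab character):

# Build with Score-P instrumentation (PREP as defined in the Makefile)
$ make

# Build without instrumentation by overriding PREP with an empty string
$ make PREP=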

Measurement

Once the code has been instrumented, it is time to begin the measurement runs. The measurement infrastructure gathers information during execution and stores it for later analysis.

By default, Score-P runs with profiling enabled and tracing disabled. Measurement behavior is configured via environment variables, and the default values can be checked using the scorep-info command. The environment variables can be set in your batch script or interactively.

# Environment variable examples
$ export SCOREP_ENABLE_TRACING=true

# Check what current Score-P environment variables are set:
$ scorep-info config-vars --full
# Output (truncated) of scorep-info config-vars --full
SCOREP_ENABLE_PROFILING
Description: Enable profiling
      Type: Boolean
      Default: true

SCOREP_ENABLE_TRACING
Description: Enable tracing
      Type: Boolean
      Default: false

SCOREP_VERBOSE
Description: Be verbose
      Type: Boolean
      Default: false
 .....

Profiling

To generate a profile of your instrumented code on a compute node, you will first need a node allocation via a batch script or an interactive job. Additionally, you will need to load the otf2 and cubew modules.

$ module load otf2
$ module load cubew

For more information on launching batch jobs on Frontier, please see the Running Jobs section of the Frontier User Guide. Here is an example batch script to run a profiling measurement on Frontier:

Example Batch Script for Frontier

#!/bin/bash
#SBATCH -A ABC123       # Project account
#SBATCH -t 1:00:00      # Walltime
#SBATCH -p batch        # Queue
#SBATCH -N 1            # Number of nodes
#SBATCH -J MyJobName    # Job Name
#SBATCH -o %x-%j.out    # Job output file

cd <path to my ScoreP instrumented executable>

export SCOREP_ENABLE_PROFILING=true
export SCOREP_ENABLE_TRACING=false
export SCOREP_EXPERIMENT_DIRECTORY=executable_scorep_outdir

srun -n 1 ./<executable>

By default, the output files generated by a successful profiling run are placed in a directory named scorep-yyyymmdd_hhmm_uniqueid. A preferred directory name can be set using the SCOREP_EXPERIMENT_DIRECTORY environment variable. After the profiling run, the directory will contain a file named profile.cubex, which can be analyzed with Cube, a presentation tool developed by the Scalasca project.
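
As a first look at the profile, you can open it in the Cube GUI or print a text summary with cube_stat. The commands below are a sketch; the exact module names and cube_stat options may differ on a given system, so check the tools' help output:

# Text summary of the most time-consuming regions (options are illustrative)
$ cube_stat -m time -t 10 executable_scorep_outdir/profile.cubex

# Or open the profile in the Cube GUI (requires X forwarding or a local copy of the file)
$ cube executable_scorep_outdir/profile.cubex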

For a more detailed description of profiling measurements with Score-P, please visit the Score-P profiling documentation.

Tracing

Since tracing measurements produce significantly more output data than profiling, we need to define a filter that removes the most frequently visited calls from the measurement. scorep-score, a tool provided by Score-P, estimates the size of the trace file (OTF2) based on information obtained from the profile's .cubex file.

To gather the information needed to design a filter file, first run scorep-score on the generated profile:

$ scorep-score -r <profile cube dir>/profile.cubex

Output scorep-score generated example:

Estimated aggregate size of event trace:                   40GB
Estimated requirements for largest trace buffer (max_buf): 10GB
Estimated memory requirements (SCOREP_TOTAL_MEMORY):       10GB
(warning: The memory requirements can not be satisfied by Score-P to avoid
intermediate flushes when tracing. Set SCOREP_TOTAL_MEMORY=4G to get the
maximum supported memory or reduce requirements using USR regions filters.)

Flt type      max_buf[B]         visits  time[s]  time[%]  time/visit[us]      region
     ALL  10,690,196,070  1,634,070,493  1081.30    100.0            0.66         ALL
     USR  10,666,890,182  1,631,138,069   470.23     43.5            0.29         USR
     OMP      22,025,152      2,743,808   606.80     56.1          221.15         OMP
     COM       1,178,450        181,300     2.36      0.2           13.04         COM
     MPI         102,286          7,316     1.90      0.2          260.07         MPI

     USR   3,421,305,420    522,844,416   144.46     13.4            0.28  matmul_sub
     USR   3,421,305,420    522,844,416   102.40      9.5            0.20  matvec_sub

The first line of the output gives an estimate of the total size of the trace, aggregated over all processes. This information is useful for estimating the disk space required. In the given example, the estimated total size of the event trace is 40GB. The second line gives an estimate of the memory required by a single process for its trace. Since flushes heavily disturb measurements, the memory that Score-P reserves on each process at application start must be large enough to hold that process's trace in memory in order to avoid flushes at runtime.

In addition to the trace itself, Score-P requires some memory to maintain internal data structures, so it also provides an estimate of the total amount of memory required on each process. The memory size per process that Score-P reserves is set via the environment variable SCOREP_TOTAL_MEMORY; in the given example the estimated per-process requirement is about 10GB. When defining a filter, it is recommended to exclude short, frequently called functions from measurement, since they require a lot of buffer space (reflected by a high value in the max_buf column) and incur a high measurement overhead. MPI functions and OpenMP constructs cannot be filtered. Thus, it is usually a good approach to exclude regions of type USR, starting at the top of the list, until you have reduced the trace to your needs. The example below excludes the functions matmul_sub and matvec_sub from the measurement:

$ cat scorep.filter
SCOREP_REGION_NAMES_BEGIN
 EXCLUDE
   matmul_sub
   matvec_sub
SCOREP_REGION_NAMES_END

One can check the effects of the filter by re-running the scorep-score command with the new filter file.

$ scorep-score <profile cube dir>/profile.cubex -f scorep.filter

Now you are ready to submit a batch job that runs your instrumented code with tracing enabled. To perform a tracing measurement, set the environment variable SCOREP_ENABLE_TRACING to true. To apply the filter to your measurement run, point the environment variable SCOREP_FILTERING_FILE at your filter file.

$ export SCOREP_ENABLE_TRACING=true
$ export SCOREP_FILTERING_FILE=scorep.filter
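
Putting this together, a tracing batch script for Frontier might look like the following. It mirrors the profiling script above; the account, paths, and executable are placeholders, profiling is switched off so only the trace is collected, and SCOREP_TOTAL_MEMORY is set to the 4G suggested by the scorep-score output (adjust it to the filtered estimate for your code):

#!/bin/bash
#SBATCH -A ABC123       # Project account
#SBATCH -t 1:00:00      # Walltime
#SBATCH -p batch        # Queue
#SBATCH -N 1            # Number of nodes
#SBATCH -J MyJobName    # Job Name
#SBATCH -o %x-%j.out    # Job output file

cd <path to my ScoreP instrumented executable>

export SCOREP_ENABLE_PROFILING=false
export SCOREP_ENABLE_TRACING=true
export SCOREP_FILTERING_FILE=scorep.filter
export SCOREP_TOTAL_MEMORY=4G
export SCOREP_EXPERIMENT_DIRECTORY=executable_scorep_trace_outdir

srun -n 1 ./<executable>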

This measurement will generate trace files of the form traces.otf2. The .otf2 files can be analyzed with the Vampir visualization tool, which provides a GUI for analyzing and understanding large trace files generated with Score-P.

OLCF Vampir Documentation gives more details on how to use Vampir on OLCF systems.

Note

Small trace files can be downloaded and viewed locally on your machine if you have the Vampir client installed, or they can be viewed from the OLCF machine using remote X display.

For large trace files, it is strongly recommended to run vampirserver on the OLCF machine, reverse-connected to a Vampir client on your local machine. See the Vampir Tunneling to VampirServer section for more details.

Manual Instrumentation

In addition to automatically profiling and tracing functions, there is also a way to manually instrument a specific region in the source code. To do this, you will need to add the --user flag to the scorep command when compiling:

$ scorep --user gcc -c test.c
$ scorep --user gcc -o test test.o

Now you can manually instrument Score-P within the source code as seen below:

// C/C++ example
#include <scorep/SCOREP_User.h>

void foo() {
   SCOREP_USER_REGION_DEFINE(my_region)
   SCOREP_USER_REGION_BEGIN(my_region, "foo", SCOREP_USER_REGION_TYPE_COMMON)
   // do the work of foo here
   SCOREP_USER_REGION_END(my_region)
}

! Fortran example
#include <scorep/SCOREP_User.inc>

subroutine foo
   SCOREP_USER_REGION_DEFINE(my_region)
   SCOREP_USER_REGION_BEGIN(my_region, "foo", SCOREP_USER_REGION_TYPE_COMMON)
   ! do the work of foo here
   SCOREP_USER_REGION_END(my_region)
end subroutine foo

In this case, “my_region” is the handle name of the region, which must be declared with SCOREP_USER_REGION_DEFINE. “foo” is the string containing the region's unique name (this is the name that will show up in Vampir), and SCOREP_USER_REGION_TYPE_COMMON identifies the type of the region. Note the header files in the examples above, which make the Score-P macros available: SCOREP_User.h for C/C++ and SCOREP_User.inc for Fortran. See the Score-P User Adapter page for more user configuration options.
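
For Fortran, the macros from SCOREP_User.inc are C preprocessor macros, so the instrumented source file must be run through the preprocessor. A minimal sketch assuming gfortran and an illustrative file name (use an uppercase .F90 suffix or an explicit -cpp flag to enable preprocessing):

$ scorep --user gfortran -cpp -c foo.f90
$ scorep --user gfortran -o foo foo.o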

Below are some examples of manually instrumented regions using phase and loop types:

#include <scorep/SCOREP_User.h>

SCOREP_USER_REGION_DEFINE(sum_hdl)
SCOREP_USER_REGION_BEGIN(sum_hdl, "sum", SCOREP_USER_REGION_TYPE_PHASE)
if (x < 1){
   //do calculation
}
else{
   //do other calculation
}
SCOREP_USER_REGION_END(sum_hdl)

#include <scorep/SCOREP_User.h>

SCOREP_USER_REGION_DEFINE(calculation_hdl)
SCOREP_USER_REGION_BEGIN(calculation_hdl, "my_calculations", SCOREP_USER_REGION_TYPE_LOOP)
#pragma omp parallel for ...
   for (int i=0; i <num; i++){
      //do calculation
   }
SCOREP_USER_REGION_END(calculation_hdl)

The regions “sum” and “my_calculations” in the above examples would then be included in the profiling and tracing runs and can be analyzed with Vampir. For more details, refer to the Advanced Score-P training in the OLCF Training Archive.

Score-P and Vampir Demo

Please see the 30-minute video, 2023 Trace-Based Performance Analysis with Score-P + Vampir, for a brief introduction to Vampir and Score-P. This recording is from the Frontier Training Workshop, August 25th, 2023, presented by Bill Williams, TU Dresden.

You can watch the video here: https://vimeo.com/858484450
