Training

Introduction to Supercomputing, part II

If you need to perform many calculations, or analyses that are too big for your own system, clusters and supercomputers provide the computing power you need. In this course, you will learn to work with the national supercomputer Snellius.

Snellius
Date: 15 Apr 2024
Time: 9:00 to 14:00
Location: Neuron 0.262, TU/e Campus, Eindhoven

What will you learn in this course?

This course is a follow-up to Introduction to Supercomputing, Part I. It takes a deeper dive into the use of supercomputers, with a special focus on efficiency, good practices, and a very practical, hands-on approach.

The format of this course includes the following modules:

  • Fundamentals of performance analysis. This introductory technical presentation covers high-performance hybrid systems at an abstract level, including the architecture and configuration of the system. The aim is to build an understanding of HPC complexity before delving into performance analysis models. Special attention is paid to the roofline model.
    • Abstract modelling of hybrid supercomputers. An abstract modelling approach for hybrid supercomputers, condensing their complexity into three key parameters: peak performance, memory bandwidth, and network bandwidth.
    • Performance analysis. An overview of different performance models, going deeper into the specifics of the roofline model.
    • The roofline model. A description of the roofline model and its practical application, with clear explanations and demonstrations.
  • File systems. This practical session covers the proper use of file systems on HPC systems, especially on Snellius.
  • Slurm hybrid tasks. Slurm, a widely used job scheduler for high-performance computing (HPC) systems, was introduced at a basic level in Part I. This module covers the specific resource-allocation parameters for hybrid jobs with shared and distributed memory.
    • Nodes, cores and tasks. This segment covers the fundamental concepts of nodes, cores and tasks, and highlights their role within the context of HPC systems.
    • Bindings. The concept of bindings is explored, providing insight into how tasks are associated with specific resources, improving participants' understanding of resource allocation mechanisms.
    • Hands-on. We will run the vector optics kernel with multiple configurations using a set of scripts.
  • QCG-PilotJob. In some cases, users need to run a large number of lightweight cases. However, supercomputer nodes are powerful, and only relatively large allocations are possible. For example, the smallest possible allocation on Snellius is a quarter of a node: 32 cores and 64 GB of memory. Job concurrency is a common strategy to launch multiple lightweight jobs efficiently within such a large allocation.
    • Fundamentals of job concurrency. This segment discusses the basic principles underlying job concurrency. Job concurrency is a methodological approach that allows the simultaneous execution of multiple smaller jobs within a larger allocated partition. The goal is to optimise resource utilisation and improve efficiency in scenarios where lighter tasks are executed on nodes designed for heavier workloads.
    • Hands-on QCG-PilotJob. This session gives participants practical experience with the QCG-PilotJob framework, including strategies and techniques for using job concurrency to launch and manage many lightweight jobs within large node allocations.

Prerequisites

Participation in the course Introduction to Supercomputing, Part I

The language of instruction is English.
