Supercomputers

hardware, supercomputer, code

Apart from learning about computational neuroscience, I’ve learned a little bit about computing today as well. Coming from particle physics, I was used to using supercomputers for simulations. Of course, when using a supercomputer, you don’t run programs yourself. You define your workload or job: which program to run, in which directory, with which data, what kind of resources you need and how many, how much memory, and the expected runtime, at minimum. Then you submit this definition to a job or workload scheduler, which prioritizes all the jobs submitted to it (usually FIFO, I guess?) so that you and all the other users make optimal use of the resources in a fair manner.

Such a supercomputer, or cluster, is then interfaced with primarily through the scheduler. I thought computing cluster equalled qsub/qstat/qdel (because in particle physics, I never encountered another system), but it turns out that’s only one of many schedulers (the Portable Batch System; see here for a typical particle-physics-oriented user guide). My new supercomputer at the Jülich Supercomputing Centre, actually a set of a few clusters, uses Slurm.
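With Slurm, the job definition from above becomes a batch script. As a rough sketch (the resource numbers, file names and the program are placeholders, not actual JSC settings):

```bash
#!/bin/bash
#SBATCH --job-name=my_simulation      # name shown in the queue
#SBATCH --nodes=2                     # how many nodes you need
#SBATCH --ntasks-per-node=24          # processes (MPI ranks) per node
#SBATCH --mem=4G                      # memory per node
#SBATCH --time=01:00:00               # expected runtime (walltime limit)
#SBATCH --output=job_%j.out           # where stdout/stderr go, %j = job id

srun ./my_simulation input.dat        # srun launches the tasks on the allocated nodes
```

You submit it with sbatch, check on it with squeue and cancel it with scancel, the Slurm counterparts of qsub, qstat and qdel.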

The JSC does not only do research with supercomputers, but also on them: what are good ways to use these machines? I guess that’s why they use the more recent scheduler Slurm. Research also goes into how to use an increasingly diverse computational landscape: next to CPUs, GPUs are now common for (certain) computational workloads. A sort of intermediate attempt is perhaps the Xeon Phi, which is in essence (currently) a 64-72 core Atom with AVX/AVX2/AVX-512 units, 4-fold hyperthreading, and a small amount of fast integrated RAM, leading to thread counts somewhere between what you find on a regular CPU and a regular GPU. (I don’t know if these CPUs downclock under sustained AVX use. Update: they do, too.) Then there are some ARM-based supercomputers and of course more specialized coprocessors. This leads to the problem of knowing 1) how to use all these different types of processors and 2) how to schedule work on such diverse architectures. This field is called heterogeneous computing, or heterogeneous system architecture (HSA).

Another thing that makes life simple for particle physicists is the trivially parallelizable nature of (most?) simulations: shooting billions of particles in hundreds of configurations is a ton of independent runs, so you can easily split these over a number of cores and merge the outputs in post-processing. In other kinds of simulations, runs may not be (that) independent, because certain quantities accumulate or generate feedback. You can think of materials science simulations and, of course, neuroscientific simulations.
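To make “trivially parallelizable” concrete, here is a minimal C++ sketch of that split-and-merge pattern; run_configuration is a made-up stand-in for one independent simulation run:

```cpp
#include <algorithm>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical stand-in for one independent run: in reality this would
// shoot particles through one detector configuration.
double run_configuration(int config_id) {
    return static_cast<double>(config_id);  // dummy result
}

int main() {
    const int n_configs = 100;
    std::vector<double> results(n_configs);

    const unsigned n_threads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        // Each thread takes every n_threads-th configuration; no communication
        // is needed because the runs are independent.
        workers.emplace_back([&results, t, n_threads, n_configs] {
            for (int c = static_cast<int>(t); c < n_configs; c += static_cast<int>(n_threads))
                results[c] = run_configuration(c);
        });
    }
    for (auto& w : workers) w.join();

    // "Post-processing": merge the independent outputs into one number.
    const double total = std::accumulate(results.begin(), results.end(), 0.0);
    (void)total;
}
```

The only “communication” is that each run writes to its own slot of the results vector; the merge happens entirely afterwards.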

A common tool for multithreaded computation is OpenMP. This older project is all about parallel execution of the kind found in particle physics. Efficient use of heterogeneous systems is a bit trickier, however: because the execution pipelines are not identical, the way you break up your problem must be handled differently. You probably want to break up your computation according to how well suited each part is to a particular architecture: massively parallel, memory-light tasks go to the GPU; hard-to-break-down, memory-intensive tasks go to a CPU. For such computations, you probably also need to bring some scheduling logic into your program: if a GPU is present, how will we use it? Whether we’re assigned 4 cores, 40 or 400, and whether those cores have SSE4.x, AVX or AVX2, raises the same question.
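A minimal OpenMP sketch of that “parallel execution” style, which also shows the program asking how many threads it actually got (compile with something like g++ -fopenmp):

```cpp
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    // Did the scheduler give us 4 cores, 40 or 400? The loop below adapts
    // automatically to whatever omp_get_max_threads() reports.
    std::printf("running with up to %d threads\n", omp_get_max_threads());

    const int n = 1000000;
    std::vector<double> x(n, 1.0);
    double sum = 0.0;

    // The independent iterations are divided over the available threads;
    // reduction(+:sum) merges the per-thread partial sums at the end.
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; ++i)
        sum += 2.0 * x[i];

    std::printf("sum = %f\n", sum);
}
```

Whether the cores have SSE4.x, AVX or AVX2 is then largely the compiler’s problem (via flags like -march=native); deciding what to hand off to a GPU is not something this simple pragma decides for you.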

Thus, man invented the Message Passing Interface, a standard for communication between processes. If we write a program from the ground up on the assumption that not only will there be various execution units, but also that they may differ in type, we make sure to think about how we might parallelize the work at every step of the program. MPI is one of the ways to do it. How does MPI work? Thanks to Stack Overflow:

As to your question: processes are the actual instances of the program that are running. MPI allows you to create logical groups of processes, and in each group, a process is identified by its rank. This is an integer in the range [0, N-1] where N is the size of the group. Communicators are objects that handle communication between processes. An intra-communicator handles processes within a single group, while an inter-communicator handles communication between two distinct groups.

By default, you have a single group that contains all your processes, and the intra-communicator MPI_COMM_WORLD that handles communication between them. This is sufficient for most applications, and does blur the distinction between process and rank a bit. The main thing to remember is that the rank of a process is always relative to a group. If you were to split your processes into two groups (e.g. one group to read input and another group to process data), then each process would now have two ranks: the one it originally had in MPI_COMM_WORLD, and one in its new group.
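In code, those concepts look roughly like the following minimal sketch (compile with mpicxx, run with e.g. mpirun -n 4); the split into an “input” group and a “compute” group just mirrors the example from the quote:

```cpp
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);  // my rank in the default group
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);  // total number of processes

    // Split all processes into two groups, as in the quoted example:
    // color 0 = "read input", color 1 = "process data".
    const int color = (world_rank < world_size / 2) ? 0 : 1;
    MPI_Comm group_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &group_comm);

    int group_rank;
    MPI_Comm_rank(group_comm, &group_rank);  // my second rank, relative to the new group

    std::printf("world rank %d of %d -> group %d, group rank %d\n",
                world_rank, world_size, color, group_rank);

    MPI_Comm_free(&group_comm);
    MPI_Finalize();
}
```

Every process ends up with two ranks, one in MPI_COMM_WORLD and one in its new group, exactly as the answer describes.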

So, these are tools to help break down the computation into a topology that seems most efficient to you. How that is done, I imagine, is a topic for much debate. Thus far the theory as I understand it. Now it’ll be time to understand how MPI is used in Arbor, a simulation tool I’ll be working on. A colleague actually gave a lecture roughly following the task management system in Arbor: see the video here. I’ll end the post here; I’m sure it won’t be the last on the subject!

Small update: NVIDIA call their parallel programming model SIMT - “Single Instruction, Multiple Threads”. Two other, related parallel programming models are SIMD - “Single Instruction, Multiple Data” - and SMT - “Simultaneous Multithreading”. Each model exploits a different source of parallelism: