HDF5-parallel performance

I have used HDF5 in serial and parallel on NERSC machines for several years now, and overall, I think it's a great tool. It provides some much-needed infrastructure for managing scientific data. The fact that multidimensional data is supported natively makes it much easier to work with my 3D data. Especially for small tasks, using the h5py interface is super slick. The HDF5 C API can be quite verbose (creating extra property lists, constructing extra objects for hyperslab selection, etc.), but I would say the advantages are still worth it.

Of course, the convenience of the HDF5 infrastructure is going to come with some performance cost. Reading and writing structured data is going to be more expensive than flat binary files no matter what, right? And I assumed for a long time that this overhead would be noticeable, but small enough to still be worth it. At one point, I noticed that our analysis runs on large grids (2048^3 and larger) were spending the majority of their time doing IO, and I decided to test HDF5 performance against binary flat files. It's true that in serial, the HDF5 overhead is marginal, but the parallel features can be very slow. I want to be clear that this could just be me using HDF5-parallel incorrectly, and this test is not a good reason to write off HDF5-parallel. However, I spent a while trying to get this right and at some point, I had to scrap it and start over.

Test setup

My most common HDF5 use is reading and writing 3D gridded data. In terms of data structures beyond a flat array, this is about as simple as it gets and allows for plenty of optimization. In fact, for performance reasons, we usually store the gridded data as a flat array and use the 3D structure only for indexing, e.g. i = (ix * ny + iy) * nz + iz for a grid of shape (nx, ny, nz) (see this excellent writeup of multidimensional arrays in C/C++ for more info).
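
To make that concrete, here is a minimal sketch of the layout for a cubic grid of side n. The flat_index helper is just for illustration, not something from our code.

    #include <stdlib.h>

    /* Row-major flat index into an n x n x n grid: ix varies slowest, iz fastest. */
    static inline size_t flat_index(size_t ix, size_t iy, size_t iz, size_t n)
    {
        return (ix * n + iy) * n + iz;
    }

    int main(void)
    {
        size_t n = 256;
        double *grid = malloc(n * n * n * sizeof *grid);
        if (!grid) return 1;
        grid[flat_index(3, 4, 5, n)] = 1.0;  /* same cell that grid[3][4][5] would name */
        free(grid);
        return 0;
    }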

Typically, we use double arrays during calculations and write to disk in single precision for the space savings (a nice rule of thumb is that a float 1024^3 grid takes 4 GB). So in serial, we want a function that takes a double array of size n^3 and writes a binary blob of floats of size n^3. A simple way to do this is to create a float buffer of size n^2, then copy and write one slab at a time, n times in total. In HDF5, we simply create the dataset with a float type and a dataspace of shape (n, n, n), then call H5Dwrite with the double array and let the library handle the conversion. In parallel, we have a few more choices to make.
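
Roughly, the two serial writers look like the sketch below, assuming a cubic grid of side n. The function names and the "grid" dataset name are placeholders rather than the actual test code, and error checking is omitted.

    #include <stdio.h>
    #include <stdlib.h>
    #include <hdf5.h>

    /* Custom flat-file write: convert n^2 doubles at a time to float and fwrite. */
    static void write_flat(const char *path, const double *grid, size_t n)
    {
        FILE *fp = fopen(path, "wb");
        float *buf = malloc(n * n * sizeof *buf);
        for (size_t ix = 0; ix < n; ix++) {
            for (size_t j = 0; j < n * n; j++)
                buf[j] = (float)grid[ix * n * n + j];
            fwrite(buf, sizeof *buf, n * n, fp);
        }
        free(buf);
        fclose(fp);
    }

    /* HDF5 write: float dataset on disk, double array in memory; HDF5 converts. */
    static void write_hdf5(const char *path, const double *grid, size_t n)
    {
        hsize_t dims[3] = { n, n, n };
        hid_t file  = H5Fcreate(path, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t space = H5Screate_simple(3, dims, NULL);
        hid_t dset  = H5Dcreate2(file, "grid", H5T_IEEE_F32LE, space,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, grid);
        H5Dclose(dset);
        H5Sclose(space);
        H5Fclose(file);
    }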

In parallel, we are still usually dealing with a single grid, but the grid is distributed over MPI ranks or nodes. The simplest distribution splits the grid over the first axis, sometimes called a slab distribution. This means that each MPI rank holds a part of the grid with shape (n / mpi_size, n, n) at offset (mpi_rank * n / mpi_size, 0, 0). Of course, this makes the IO a bit more complicated. In HDF5, we can write each slab by selecting the matching hyperslab in the dataset and taking advantage of collective parallel writes. For the flat file, a simple but slow method is to order the writes sequentially, looping over ranks. I also wrote a more complicated version that writes to multiple files in parallel to get something close to best-case performance.
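
The hyperslab approach looks roughly like the sketch below. It assumes the slab decomposition described above, that n divides evenly over the ranks, and that the local slab has already been converted to a float buffer; the identifiers are illustrative, not the actual test code, and error checking is omitted.

    #include <mpi.h>
    #include <hdf5.h>

    /* Each rank writes its slab of shape (n / mpi_size, n, n) into one shared
     * dataset using a hyperslab selection and a collective transfer. */
    static void write_hdf5_parallel(const char *path, const float *local_slab,
                                    hsize_t n, int mpi_rank, int mpi_size)
    {
        hsize_t local_n   = n / mpi_size;            /* assumes n % mpi_size == 0 */
        hsize_t dims[3]   = { n, n, n };             /* full grid on disk */
        hsize_t count[3]  = { local_n, n, n };       /* this rank's slab shape */
        hsize_t offset[3] = { mpi_rank * local_n, 0, 0 };

        /* Open the file collectively through MPI-IO. */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        hid_t file = H5Fcreate(path, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
        H5Pclose(fapl);

        hid_t filespace = H5Screate_simple(3, dims, NULL);
        hid_t dset = H5Dcreate2(file, "grid", H5T_IEEE_F32LE, filespace,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* Select this rank's block in the file and describe the memory layout. */
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
        hid_t memspace = H5Screate_simple(3, count, NULL);

        /* Collective write of the slab. */
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
        H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, dxpl, local_slab);

        H5Pclose(dxpl);
        H5Sclose(memspace);
        H5Sclose(filespace);
        H5Dclose(dset);
        H5Fclose(file);
    }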

The IO functions are a bit too long to include inline, but I pushed the test code to github if you would like to know more. Note that I'm only testing write performance here, so the results may not carry over to reads.

Test results

I ran these tests on NERSC's Edison machine, using the cray-hdf5-parallel/1.8.13 module (the Cray build of HDF5 1.8.13 with parallel enabled), in my scratch directory.

First, the serial test gave the expected result that HDF5 was just a bit slower than my custom write. I ran with 256^3 and 512^3 grids and 10 samples. This table shows the walltime per call, averaged over the samples:

Grid    Custom (s)  HDF5 (s)
256^3   0.162       0.205 (1.27x)
512^3   1.099       1.328 (1.21x)

So the HDF5 version is about 25% slower. In production, I think this difference would be noticeable but not significant, and definitely worth it to avoid dealing with flat binary files later on.

When we switch to the parallel version, though, HDF5 hits the brakes hard. In this case, I ran 1024^3 and 2048^3 grids, distributed so that the data volume per rank matched the serial tests (equivalent to 256^3 or 512^3 per rank).

Grid (ranks)         Custom fast (s)  Custom slow (s)  HDF5 (s)
1024^3 (8 ranks)     1.880            7.486 (4x)       94.281 (50x)
1024^3 (64 ranks)    0.568            9.636 (17x)      45.730 (81x)
2048^3 (64 ranks)    3.771            59.194 (16x)     218.868 (58x)
2048^3 (512 ranks)   4.641            77.007 (17x)     134.635 (30x)

The 'fast' custom implementation writes to multiple files simultaneously (up to 32), which might not be a fair comparison. However, the 'slow' implementation writes to a single file sequentially and can't take advantage of parallel IO. If HDF5 were using the worst IO strategy and writing sequentially, we would expect it to be just a bit slower than the slow custom result, as in the serial case. At the same time, HDF5-parallel has the advantage of MPI-IO with collective writes. I expected the HDF5 version to land somewhere between the slow and fast custom versions, but instead, HDF5-parallel performs roughly 2 - 13x worse than the custom worst case. Another thing to note is that my 'fast' implementation isn't even that fast, since I'm no IO expert. The peak IO rate on the Edison scratch file system is about 30 GB/s, which other applications are able to achieve, yet my implementation isn't even hitting 10 GB/s.

Again, I really hope I'm using HDF5-parallel incorrectly and an expert can achieve performance closer to what I expected. I still prefer using HDF5 for my data when I don't need good parallel performance.

The real performance hit

Unfortunately, I spent a lot of time learning how to use HDF5-parallel and getting it to work, and in the end, I had to scrap my work and start over. I could spend more time posting on forums and contacting HDF5 developers, but as other time-starved engineers know, I just had to call it at some point.

The "don't reinvent the wheel" proverb is tossed around a lot in the programming community, but I've been burned enough times to rarely give this advice. There are high costs to adopting new tools that are often ignored. In my experience, there are at least four reasons outsourcing something to a third- party tool is risky:

  1. Learning the tool might take longer than just doing it yourself. I spent a lot of time going through the HDF5 tutorials and API docs and debugging my calls into the HDF5 API. I would guess it took over 10 times longer than whipping up the flat-file IO functions.
  2. You don't learn anything about the problem under the hood. The time I spent learning HDF5 went specifically into the HDF5 API and its best practices; I didn't get to learn about general IO performance and tuning or the underlying tech (MPI-IO, B-trees, etc.).
  3. The tool might perform poorly or turn out not to work for your specific problem. I couldn't get HDF5-parallel to perform well for my application and had to start over.
  4. A new dependency can be a burden if users need to install it separately. HDF5-parallel requires an MPI compiler with MPI-IO support, typically only available on HPC systems. This meant adding #ifdef blocks to separate out the HDF5-parallel-specific API calls so I could still build on my laptop (see the sketch after this list). Not a big deal, but definitely a maintenance pain.

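To illustrate item 4, the #ifdef separation looks roughly like the sketch below. The USE_PARALLEL_HDF5 macro and the function name are made up for this example, not the actual build flag or code.

    #include <hdf5.h>

    #ifdef USE_PARALLEL_HDF5   /* hypothetical flag, defined only for the HPC build */
    #include <mpi.h>
    #endif

    /* Open the output file either through MPI-IO or plain serial HDF5. */
    hid_t open_output_file(const char *path)
    {
    #ifdef USE_PARALLEL_HDF5
        /* Parallel build: MPI-IO file access; H5Pset_fapl_mpio only exists here. */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        hid_t file = H5Fcreate(path, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
        H5Pclose(fapl);
        return file;
    #else
        /* Laptop build: serial HDF5 with default file access. */
        return H5Fcreate(path, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    #endif
    }
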
I'm lucky because HDF5 is fairly popular and won't be going anywhere for a while, so I'm sure I will get to use these skills in the future. For now, I'm using custom binary files when I need parallel performance and still converting to HDF5 down the line. But in general, adopting the wrong tool can be a real headache.