HDF5 Java Parallel

java
hdf5
tiff
performance
multi-threading
#1

Hi @joshmoore @ctrueden @wolny @emilmelnikov @maarzt,

I have a recurring scenario where I have 100 hdf5 files, each 2GB in size. I need to re-save each of them to Tiff. Since all relevant methods in the hdf5 library are public static synchronized, I cannot do this in parallel within a single JVM. The only idea I had was to write an ImageJ command that does the re-saving for one stack and then, from the “master” ImageJ, spawn processes like ImageJ.exe -run ConvertHdf5ToTiff filename. I guess this would launch as many JVMs, allowing the hdf5 methods to run in parallel.
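In code, I imagine the spawning would look roughly like this (an untested sketch; ConvertHdf5ToTiff is the hypothetical command from above, and the paths would need adjusting):

    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class SpawnConverters {
        public static void main(String[] args) throws IOException, InterruptedException {
            File[] h5Files = new File("/data/stacks").listFiles((dir, name) -> name.endsWith(".h5"));
            int maxParallel = Runtime.getRuntime().availableProcessors();
            List<Process> running = new ArrayList<>();
            for (File f : h5Files) {
                // Each child is a separate JVM, so the static synchronized
                // HDF5 methods of one child cannot block the others.
                running.add(new ProcessBuilder("ImageJ.exe", "-run",
                        "ConvertHdf5ToTiff", f.getAbsolutePath()).inheritIO().start());
                if (running.size() == maxParallel)
                    running.remove(0).waitFor(); // crude throttle: wait for the oldest child
            }
            for (Process p : running)
                p.waitFor();
        }
    }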

My question is whether this would work, and whether there are alternative ways to achieve this aim.

For example, I think that in Imaris they hacked the C implementation of the HDF5 library to be able to have multiple HDF5 libraries with different names (thereby circumventing the static synchronized issues, which seem to exist even at the C level). That’s why, in fact, Imaris is super fast at creating HDF5 files!

Thanks a lot for any ideas!

#2

My understanding is that reading from persistent storage with multiple threads varies widely in effectiveness, depending on the type of storage (e.g. SSD vs HDD vs RAM disk). It may speed things up a lot, or you may end up being I/O-bound the whole time and actually slow things down by proliferating lots of expensive seek operations.

Should be straightforward to try it, though, and see how performance changes.
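One way to probe this without involving HDF5 at all is to time raw reads of the same files with one vs. several threads. A plain-JDK sketch (paths and thread counts are up to you):

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ReadThroughputProbe {

        // Streams one file, discarding the data; returns the byte count.
        static long drain(Path p) throws IOException {
            byte[] buf = new byte[1 << 20];
            long total = 0;
            try (InputStream in = Files.newInputStream(p)) {
                for (int r; (r = in.read(buf)) > 0; ) total += r;
            }
            return total;
        }

        // Reads all files with the given number of threads; returns wall time in ms.
        // Compare timeReads(files, 1) against timeReads(files, 8) on your storage.
        static long timeReads(List<Path> files, int threads) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            long t0 = System.nanoTime();
            List<Future<Long>> pending = new ArrayList<>();
            for (Path p : files) pending.add(pool.submit(() -> drain(p)));
            for (Future<Long> f : pending) f.get();
            pool.shutdown();
            return (System.nanoTime() - t0) / 1_000_000;
        }
    }

(Beware the OS page cache when benchmarking: a second run over the same files may be served from RAM.)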

My understanding of HDF5 is that it can handle parallel reads but not parallel writes. But maybe that is not true of the Java implementation? Which Java implementation are you using? The official one, or the cisd:jhdf5 one?

CC @axtimwalde @hanslovsky @tpietzsch who have more experience in this area.

#3

The .n5 data format comes to mind when thinking about writing to hdf5 in parallel:

For our cluster processing, @tpietzsch worked around this by writing to separate .h5 files and combining all of them with an .xml.

2 Likes
#4

Even if you are CPU-bound, the problem is embarrassingly parallel (assuming that the number of HDF5 files >= the number of your CPU cores and all HDF5 files contain roughly the same amount of data), so using multiple processes should be about as fast as using multiple threads.

The only downside (?) of this approach is somewhat poor (to my knowledge) multiprocessing support in Java.

Another, more exotic, approach is to use a special classloader that would load a “new” instance of java.lang.Class for each task.

1 Like
#5

@Christian_Tischer: can you point to example code? I’m forgetting which library you’re using. I’d very much agree that we want a library with which it is possible to write to different files from separate threads.

~J

#6

@ctrueden @joshmoore
I am using cisd:jhdf5 as a dependency, which internally ships ncsa.hdf.hdf5lib. Sometimes I use the jhdf5 API, sometimes the native ncsa.hdf.hdf5lib API; the issue is the same for both (see also here: BigDataViewer Hdf5 Performance, where @tpietzsch states that he also thinks that even reading is non-parallel).
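For reference, the two ways I call the library look roughly like this (a sketch from memory; the file and dataset names are made up):

    import ch.systemsx.cisd.hdf5.HDF5Factory;
    import ch.systemsx.cisd.hdf5.IHDF5Reader;
    import ncsa.hdf.hdf5lib.H5;
    import ncsa.hdf.hdf5lib.HDF5Constants;

    public class TwoApis {
        public static void main(String[] args) throws Exception {
            // High-level jhdf5 API:
            IHDF5Reader reader = HDF5Factory.openForReading("stack.h5");
            float[] data = reader.readFloatArray("/volume");
            reader.close();

            // Low-level bundled ncsa.hdf.hdf5lib API:
            int fileId = H5.H5Fopen("stack.h5",
                    HDF5Constants.H5F_ACC_RDONLY, HDF5Constants.H5P_DEFAULT);
            // ... H5Dopen / H5Dread etc. ...
            H5.H5Fclose(fileId);
            // Either way, every call funnels through the same
            // synchronized static native methods of the H5 class.
        }
    }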

@ctrueden When you say “the official one”, which one do you have in mind? @joshmoore and I struggled (and failed) during our last mini-hackathon to find an “official” and up-to-date hdf5 library for java :frowning:

I think one could maybe split the topic into three questions (at least for me):

  1. What is the official hdf5 library for java?
  2. Can one find a parallel hdf5 implementation for java?
  3. Assuming that no current implementation is parallel, is launching multiple instances of ImageJ the only hack, or are there other hacks?
1 Like
#7

Yes, with a cluster it is not a problem, because one can just spawn many jobs. I can do that here at my institute, where we have a cluster, but my code is also used by people without a cluster…

#8

Sounds like the kind of hack that the Imaris people are using. Do you know how to do this?

#9

Starting to dig from e.g. https://github.com/tischi/fiji-plugin-bigDataProcessor2/blob/fec8bc452d3c9f0688a2b42b60820b44d8be3871/src/main/java/de/embl/cba/bdp2/saving/FastHDF5StackWriter.java I can definitely see how http://svnsis.ethz.ch/repos/cisd/jhdf5/trunk/source/java/ch/systemsx/cisd/hdf5/HDF5BaseWriter.java would lead to this. @Christian_Tischer, have you tried any of the other HDF5 APIs we discussed, to see whether the lower-level ones avoid the multi-file synchronization issue?

Another, more exotic, approach is to use a special classloader that would load a “new” instance of java.lang.Class for each task.

Do you know how to do this?

To use this, you’ll need to be careful with return types. E.g., two threads would each have their own version of the class HDF5BaseWriter and would therefore (likely) be able to do two things at the same time. However, higher-level code can’t treat objects of the same class from those two threads as the same type; in other words, it can get confusing. If you can find a place where you are only returning (e.g.) byte[], then it should be fine.
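To sketch the pattern (my.pkg.Converter and the jar paths are made up): each worker gets its own classloader, invokes the converter reflectively, and only byte[] crosses the boundary:

    import java.lang.reflect.Method;
    import java.net.URL;
    import java.net.URLClassLoader;
    import java.nio.file.Paths;

    public class IsolatedWorkers {
        public static void main(String[] args) throws Exception {
            URL[] cp = { Paths.get("jhdf5.jar").toUri().toURL(),
                         Paths.get("my-converter.jar").toUri().toURL() };
            for (String file : args) {
                new Thread(() -> {
                    // parent = null: each loader defines its own copies of all
                    // classes on cp, including (a copy of) HDF5BaseWriter.
                    try (URLClassLoader loader = new URLClassLoader(cp, null)) {
                        Class<?> c = loader.loadClass("my.pkg.Converter");
                        Method m = c.getMethod("convert", String.class);
                        // Only byte[] crosses the boundary, so the caller never
                        // handles objects of the duplicated classes.
                        byte[] tiffBytes = (byte[]) m.invoke(null, file);
                        // ... write tiffBytes to disk ...
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }).start();
            }
        }
    }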

~J.

2 Likes
#10

Thanks for digging :slight_smile:

Nope, and actually I would not know where to start looking… Did you find other promising HDF5 APIs? Anyway, I think they all end up in the same method:

public synchronized static native int H5Dread(int fid, int filetype, int memtype, int memspace, Object data);

I’m pasting some text from here: https://support.hdfgroup.org/ftp/HDF5/releases/HDF-JAVA/hdfjni-3.3.2/src/hdfjava-3.3.2-javadoc/hdf5_java_doc/hdf/hdf5lib/H5.html#H5Aread(int,%20int,%20byte[])

General Rules for Passing Arguments and Results
In general, arguments passed IN to Java are the analogous basic types, as above. The exception is for arrays, which are discussed below.

The return value of Java methods is also the analogous type, as above. A major exception to that rule is that all HDF functions that return SUCCEED/FAIL are declared boolean in the Java version, rather than int as in the C. Functions that return a value or else FAIL are declared the equivalent to the C function. However, in most cases the Java method will raise an exception instead of returning an error code. See Errors and Exceptions below.

Java does not support pass by reference of arguments, so arguments that are returned through OUT parameters must be wrapped in an object or array. The Java API for HDF consistently wraps arguments in arrays.

For instance, a function that returns two integers is declared:

   herr_t HDF5dummy( int *a1, int *a2)

For the Java interface, this would be declared:
public synchronized static native int HDF5dummy(int[] args);
where a1 is args[0] and a2 is args[1], and would be invoked:
H5.HDF5dummy(args);

All the routines where this convention is used will have specific documentation of the details, given below.

Arrays

HDF5 needs to read and write multi-dimensional arrays of any number type (and records). The HDF5 API describes the layout of the source and destination, and the data for the array passed as a block of bytes, for instance,

  herr_t H5Dread(int fid, int filetype, int memtype, int memspace,
  void * data);

where “void *” means that the data may be any valid numeric type, and is a contiguous block of bytes that is the data for a multi-dimensional array. The other parameters describe the dimensions, rank, and datatype of the array on disk (source) and in memory (destination).

For Java, this “ANY” is a problem, as the type of data must always be declared. Furthermore, multidimensional arrays are definitely not laid out contiguously in memory. It would be infeasible to declare a separate routine for every combination of number type and dimensionality. For that reason, the HDFArray class is used to discover the type, shape, and size of the data array at run time, and to convert to and from a contiguous array of bytes in C order.

The upshot is that any Java array of numbers (either primitive or sub-classes of type Number) can be passed as an “Object”, and the Java API will translate to and from the appropriate packed array of bytes needed by the C library. So the function above would be declared:

public synchronized static native int H5Dread(int fid, int filetype, int memtype, int memspace, Object data);
#11

…which likely all just means that HDF5BaseWriter is an example of trying to work around that root bottleneck, and class loaders are not going to help.

#12

Could you elaborate a bit on this? How did they try to work around it? And do you think they managed? I tried to read the code, but it is beyond me.

#13

I think @joshmoore is speculating based on the fact that they declared the native method static synchronized. In other words: they probably had a good reason for doing that.

At this point it probably makes most sense to ask the HDF5 community directly.

1 Like
#14

Something like this: https://stackoverflow.com/questions/14257357/loading-a-class-twice-in-jvm-using-different-loaders

However, I wouldn’t try to do that unless absolutely necessary, and also I don’t know how classloaders interact with JNI.
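A minimal sketch of the idea from that answer (the jar path is a placeholder; HDF5FactoryProvider is just an arbitrary class from jhdf5):

    import java.net.URL;
    import java.net.URLClassLoader;

    public class LoadTwice {
        public static void main(String[] args) throws Exception {
            URL[] cp = { new URL("file:/path/to/jhdf5.jar") };
            // Two loaders with no common parent for this classpath each
            // define their own java.lang.Class for the same class file,
            // and therefore their own static state.
            ClassLoader a = new URLClassLoader(cp, null);
            ClassLoader b = new URLClassLoader(cp, null);
            Class<?> ca = a.loadClass("ch.systemsx.cisd.hdf5.HDF5FactoryProvider");
            Class<?> cb = b.loadClass("ch.systemsx.cisd.hdf5.HDF5FactoryProvider");
            System.out.println(ca == cb); // false
        }
    }

(One known wrinkle regarding JNI: as far as I know, the JVM refuses to load the same native library into more than one classloader, which may be exactly where this approach breaks down.)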

If there is really no way to use the HDF5 library from multiple threads, doing it with multiprocessing should be fine: the processes will be created only once, and the execution time will dominate the process-creation time (at least on Linux, where forking a new process is cheap unless you have a very large heap).

1 Like
#15

I assume something that interacts with the native library itself will be needed. E.g. https://stackoverflow.com/questions/33040652/how-to-create-multiple-instances-of-the-same-library-with-jna
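Along the lines of that answer, the trick would be to copy the native library under distinct file names and load each copy separately, e.g. with JNA (an untested sketch: the interface maps only one real C function, herr_t H5open(void), and all paths are made up):

    import com.sun.jna.Library;
    import com.sun.jna.Native;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class MultiHdf5 {
        // Drastically trimmed JNA mapping of the HDF5 C library.
        public interface Hdf5 extends Library {
            int H5open(); // herr_t H5open(void)
        }

        public static void main(String[] args) throws Exception {
            // Copies with different names are treated by the dynamic linker
            // as unrelated libraries, each with its own global state.
            Path src = Paths.get("/usr/lib/libhdf5.so");
            Path a = Files.copy(src, Paths.get("/tmp/libhdf5_a.so"), StandardCopyOption.REPLACE_EXISTING);
            Path b = Files.copy(src, Paths.get("/tmp/libhdf5_b.so"), StandardCopyOption.REPLACE_EXISTING);

            Hdf5 lib1 = Native.load(a.toString(), Hdf5.class);
            Hdf5 lib2 = Native.load(b.toString(), Hdf5.class);
            lib1.H5open();
            lib2.H5open(); // independent copy, independent locks/state
        }
    }

Whether the synchronized Java wrappers could then be avoided is another question; you’d essentially be rebuilding the binding on top of this.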

1 Like
#16

I cross-posted at https://forum.hdfgroup.org/t/multithreaded-writing-of-multiple-files-in-java/5792

1 Like
#17

Maybe you can also ask them for a recent version of a Java hdf5 library?
:slight_smile:

#18

I assume you just did.

2 Likes