Memory leak with Bioformats in Matlab (tested on 5.9.2)

bio-formats
matlab

#1

Hello,
I have a weird memory leak problem when using Bioformats 5.9.2 in Matlab (R2018b). Accessing the readers multiple times causes a “java.lang.OutOfMemoryError: Java heap space” error.
Depending on Matlab's Java heap memory setting, the error appears earlier (for smaller heaps: step 22 in the attached example with a 256 MB heap) or later (for larger heaps: step 44 with a 512 MB heap).

The script below

  1. generates two test images (16000 x 16000 pixels)
  2. saves them as TIF
  3. generates an array of bio-format readers for these 2 images: 25 sets of the two files (= 50 virtual stacks, one reader each)
  4. runs 2 tests:
    Test 1: in a loop, loads a 1500 x 1500 fragment from each of these 50 readers
    demo:
    http://www.biocenter.helsinki.fi/~ibelev/temp/misc/test1.mp4
    Test 2: in a loop, loads 1500 x 1500 areas 50 times, from readers 1 and 2 only
    demo:
    http://www.biocenter.helsinki.fi/~ibelev/temp/misc/test2.mp4

Test 1 causes the error, while Test 2 works fine. It looks like the readers do not release memory after obtaining data from the files. Could this be fixed?
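For anyone reproducing this, the heap can be watched while the loop runs. The snippet below is only a monitoring sketch that uses the standard java.lang.Runtime class from inside Matlab; it is not part of bio_test.m, and the variable names are made up for illustration.

```matlab
% Report the current JVM heap usage from inside Matlab (values in MB).
rt = java.lang.Runtime.getRuntime();
maxMB  = double(rt.maxMemory()) / 2^20;                         % heap limit from the Matlab Java heap setting
usedMB = double(rt.totalMemory() - rt.freeMemory()) / 2^20;     % heap currently in use
fprintf('Java heap: %.0f MB used of %.0f MB max\n', usedMB, maxMB);
```

Calling this after every bfGetPlane in Test 1 should show the used value climbing towards the limit before the error is thrown.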

A demo script is attached, details inside.

bio_test.m (2.4 KB)


#2

Hi @Ilya_Belevich,

thanks for supplying a comprehensive minimal script reproducing your issue.

Once you do not need an initialized Bio-Formats reader anymore, you want to call the IFormatReader.close() API in order to reduce Java memory usage.

In the case of your script, calling

bioreaders{readerId}.close();

after the call to bfGetPlane was sufficient to let Test 1 complete successfully with a 256 MB heap locally (R2017b). Let us know if that works for you.
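As a rough sketch of that pattern, reading and closing can be put in one small helper. The function name readTileAndClose is hypothetical; bfGetPlane and close() are the calls mentioned above, and onCleanup is plain Matlab.

```matlab
function img = readTileAndClose(reader, x, y, w, h)
% Read a w-by-h tile with top-left corner (x, y) from plane 1 of an
% initialized Bio-Formats reader, then release the reader's resources.
cleanup = onCleanup(@() reader.close());  % runs even if bfGetPlane throws
img = bfGetPlane(reader, 1, x, y, w, h);
end
```

Note that after this call the reader is closed, so it needs a new setId before it can be read from again.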

Best,
Sebastien


#3

Hi @s.besson
thank you for the suggestion! Unfortunately it is not an option, because I need to access those readers again later to read different areas. Initializing them again would take too much time :frowning:

Best regards,
Ilya


#4

Hi Ilya,

if the reader initialization time is your main bottleneck, you might want to look into caching the initialized readers using the Memoizer API as described here.

Doing so would allow you to immediately close the readers right after loading planes and free up resources while being able to quickly reinitialize the reader later in your process.
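A minimal sketch of that workflow, assuming the Bio-Formats Matlab toolbox is on the path: wrap a fresh reader in a Memoizer, initialize it, read, and close straight away. The helper name readTileMemoized is made up; the Memoizer(reader, minimumElapsed) constructor is the one used elsewhere in this thread, with 0 meaning a memo file is written regardless of how long initialization took.

```matlab
function img = readTileMemoized(filename, x, y, w, h)
% Open FILENAME through a Memoizer, read one tile from plane 1, and close.
r = loci.formats.Memoizer(bfGetReader(), 0);
cleanup = onCleanup(@() r.close());  % free the reader's memory on exit
r.setId(filename);                   % full parse on first use, memo file afterwards
img = bfGetPlane(r, 1, x, y, w, h);
end
```

The first call for a given file pays the full initialization cost; later calls should be able to load the cached state instead.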

Best,
Sebastien


#5

Hi Sebastien,
thank you for your suggestion! I can test that, but there are two issues:

  1. I think the point is that there is a memory leak that technically should not be there. For example, I can read a tif file with a normal reader an unlimited number of times without any memory issues, but with the bio-formats reader I start to lose memory quite quickly. I may be wrong, but it looks like the bio-formats reader does not clear the data variable holding the acquired image block after use. Maybe that was intentional, but in that case there should be a method to purge the used memory.
  2. I have concerns about performance in this case, but that can be tested.

So to make Memoizer work I have to wrap bfGetReader inside it. I replaced this part of the code with:

fprintf('Generating the Memoizer wrappers...');
readerId = 1;
for setId = 1:maxNoSets
    for fileId = 1:numel(filenames)
        % wrap a fresh reader; 0 ms threshold so a memo file is always written
        bioreaders{readerId} = loci.formats.Memoizer(bfGetReader(), 0);
        bioreaders{readerId}.setId(filenames{fileId});  % initialize the reader
        bioreaders{readerId}.setSeries(0);  % set series
        bioreaders{readerId}.close();       % reader is uninitialized again after this
        readerId = readerId + 1;
    end
end
fprintf('done\n');

However, how can I get the image now? bfGetPlane does not work with the Memoizer:

img = bfGetPlane(bioreaders{i}, 1, Xlim(1), Ylim(1), Xlim(2)-Xlim(1)+1, Ylim(2)-Ylim(1)+1);

I attached the updated script bio_test_memo.m (1.8 KB)

Best regards,
Ilya


#6

Dear Ilya,

I think the bfGetPlane problem is not specific to Memoizer per se but rather to the fact the API needs to be used with an initialized reader.
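To illustrate the point about initialization, here is a small guard that could be run before bfGetPlane. It is only a sketch: ensureInitialized is a hypothetical helper name, while getCurrentFile() and setId() are part of the IFormatReader interface.

```matlab
function r = ensureInitialized(r, filename)
% Call setId only when the reader has no file open, e.g. after close().
if isempty(r.getCurrentFile())   % empty when no file is currently set
    r.setId(filename);
end
end
```

With a Memoizer-wrapped reader, the setId call here should be served from the memo file once it exists.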

Please find below a modified version of bio_test_memo.m which should work and should report some performance metrics for each phase of reader initialization + plane loading, so that you can assess whether it does the right job for you.

bio_test_memo.m (1.8 KB)

Re memory issues, ultimately the IFormatReader.close() call is responsible for clearing all the variables created during a reader lifetime (initFile as well as calls to openBytes). That being said, it is very possible there are intermediate places in the code where the amount of used memory could be reduced. In the context of Bio-Formats 6, we have already reviewed our readers for obvious file leaks. We will take another look at MinimalTiffReader and TiffParser and get back to you if we have any obvious candidates.

Best,
Sebastien


#7

Hi Sebastien,
thank you for the code, I will check it on Monday!

Best regards,
Ilya


#8

I will vouch that Memoizer significantly improves performance. Another idea I had here was whether saving as OME-TIFF might also help decrease the initialization time of the readers.

I also wonder if the problem may not be in Bioformats itself, but how MATLAB handles Java classes and garbage collection. You may want to look into something like this:

You may also want to consider using the MATLAB static java path rather than the dynamic path:
https://www.mathworks.com/help/matlab/matlab_external/static-path.html
Classes are loaded in MATLAB differently using the static java path than the dynamic path that you are currently using to load Bioformats, which may help garbage collection.
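As a sketch of how the jar could be moved to the static path: MATLAB reads a javaclasspath.txt file from the preferences folder at startup, so appending the jar location there is enough. The helper name addToStaticJavaPath and its optional second argument are my own invention for illustration, and the jar location has to be adapted to the local install.

```matlab
function addToStaticJavaPath(jarPath, cpFile)
% Append JARPATH to MATLAB's static Java class path file if not yet listed.
% CPFILE defaults to javaclasspath.txt in the preferences folder.
if nargin < 2
    cpFile = fullfile(prefdir, 'javaclasspath.txt');
end
existing = '';
if exist(cpFile, 'file') == 2
    existing = fileread(cpFile);
end
if isempty(strfind(existing, jarPath))
    fid = fopen(cpFile, 'a');
    if fid < 0
        error('addToStaticJavaPath:open', 'Cannot open %s', cpFile);
    end
    if ~isempty(existing) && existing(end) ~= sprintf('\n')
        fprintf(fid, '\n');          % keep one entry per line
    end
    fprintf(fid, '%s\n', jarPath);
    fclose(fid);
end
end
```

MATLAB has to be restarted for the change to take effect, and whether it actually helps with the heap problem would need to be tested.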


#9

Thank you for the suggestions. I’ve made some tests, and the results indicate that using Memoizer solves the Java heap memory problem at the cost of about a 10% performance loss.
I was reading blocks [1500x1500 pixels] from 50 consecutive images [16k x 16k pixels]. The average reading time for the standard bio-format readers was 0.145879 sec / image, while for Memoizer it was 0.160358 sec / image.
It is slower, but I consider it acceptable. Here are the two test scripts, bio_test_std_readers.m (1.5 KB) and bio_test_memo.m (1.7 KB)
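For reference, the "about 10%" figure follows directly from the two per-image timings; the variable names below are just for illustration.

```matlab
% Per-image read times reported above (seconds per image).
tStandard = 0.145879;
tMemoizer = 0.160358;
% Relative slowdown of the Memoizer-based read, in percent.
slowdownPct = 100 * (tMemoizer - tStandard) / tStandard;
fprintf('Memoizer is %.1f%% slower per image\n', slowdownPct);
```

This comes out at just under 10%.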

@s.besson
1. Does Memoizer remember the series number (I defined it inside the Memoizer init loop)

r = loci.formats.Memoizer(bfGetReader(), 0);
r.setId(filenames{fileId});
r.setSeries(0);
r.close();

or does it always have to be defined inside the image acquisition loop after ‘bioreaders.setId(filenames{i});’?

2. In my test example (bio_test_memo.m) above, a new reader is created from Memoizer to read each image, while in your example these readers are stored inside the bioreaders cell array. Is there any benefit in keeping the instances of each reader? I may potentially want to access those files again and again.

@markkitt
The initialization performance in my case is tolerable; the problem is the memory leak. I think the function that you’ve suggested may help with it, but the difficulty is finding the particular moment when I need to call it. One option would be to put the “img = bfGetPlane(…)” call inside a try/catch construct, but I would prefer a cleaner way.

Best regards,
Ilya


#10

Are the Memoizer performance measurements for the first time the file is opened or for subsequent opens? I would expect slightly slower performance the first time you run the script and much faster performance on subsequent runs.


#11

@markkitt
these are the numbers for the subsequent reads; the initialization performance is better for the standard reader, but not terribly.
These are results for initialization of 50 tif images:
Memoizer: in 3.21 seconds
Standard: in 2.61 seconds


#12

Hi @s.besson,
can you please comment on these notes:

1. Does Memoizer remember the series number (I defined it inside the Memoizer init loop)

r = loci.formats.Memoizer(bfGetReader(), 0);
r.setId(filenames{fileId});
r.setSeries(0);
r.close();

or does it always have to be defined inside the image acquisition loop after ‘bioreaders.setId(filenames{i});’?

2. In my test example (bio_test_memo.m) above, a new reader is created from Memoizer to read each image, while in your example these readers are stored inside the bioreaders cell array. Is there any benefit in keeping the instances of each reader? I may potentially want to access those files again and again.

Thank you!


#13

Hi @Ilya_Belevich,

answering your 2 questions:

1- the memo file on disk is created as part of the initial setId call, i.e. before any series number has been set, so the series defaults to zero. Calling r.setSeries(0); after r.setId(filenames{fileId}); is therefore a no-op. In general, though, the caller needs to set the correct series index, if different from zero, after initializing the reader (whether the initialization was loaded from a cached file or not).
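In code, that means always calling setSeries after setId when a non-zero series is wanted. A minimal sketch, with openSeries as a hypothetical helper name and a zero-based series index as in the Java API:

```matlab
function r = openSeries(filename, seriesIndex)
% Initialize a Memoizer-wrapped reader and select a zero-based series.
r = loci.formats.Memoizer(bfGetReader(), 0);
r.setId(filename);                       % series is 0 right after this call
if seriesIndex < 0 || seriesIndex >= r.getSeriesCount()
    r.close();                           % do not leave a half-used reader open
    error('openSeries:badIndex', 'Series index %d is out of range', seriesIndex);
end
r.setSeries(seriesIndex);                % must be repeated after every setId
end
```

The caller is responsible for closing the returned reader once the planes have been read.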

2- I think I had created a cell array as per the structure in the initial script. As the script stands, there is indeed no value in sharing the readers outside the loop. Creating a local reader per iteration of the loop as in your bio_test_memo.m script makes complete sense.

Best,
Sebastien


#14

@s.besson
Thank you Sebastien!
one other quick question:
“the memo file on disk is created” - where are those files stored? Do I need to delete them after closing the dataset?


#15

The location where memo files are stored largely depends on how the memoizer was constructed.

By default, the cache files are named .<filename>.bfmemo where <filename> is the full name of the file passed to setId. With the base constructor, they will be stored in the same folder as the initialized file:

sbesson@ls30630:~ $ ls -alh /tmp/test/
total 88
drwxr-xr-x   4 sbesson  wheel   128B 23 Jan 08:39 .
drwxrwxrwt  32 root     wheel   1.0K 23 Jan 08:39 ..
-rw-r--r--   1 sbesson  wheel    40K 23 Jan 08:39 .test.fake.bfmemo
-rw-r--r--   1 sbesson  wheel     0B 23 Jan 08:39 test.fake
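Based on the naming shown in the listing above, the default memo file location can be derived from the image path with plain string handling. This is a sketch only; defaultMemoPath is a hypothetical helper name and it covers just the same-folder case.

```matlab
function memoPath = defaultMemoPath(filename)
% Return the same-folder memo file path: <folder>/.<name><ext>.bfmemo
[folder, name, ext] = fileparts(filename);
memoPath = fullfile(folder, ['.' name ext '.bfmemo']);
end
```

If the cache files are not wanted afterwards, something like `m = defaultMemoPath(f); if exist(m, 'file'), delete(m); end` could be used to clean up.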

It is possible to store these cache files under a given top-level folder using another constructor. For instance, this strategy is the one used by OMERO to manage Bio-Formats cache files.
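From Matlab, that would look roughly like the sketch below, which uses the Memoizer constructor that takes a java.io.File directory as third argument. The helper name memoizerWithCacheDir and the cache folder are placeholders to adapt.

```matlab
function r = memoizerWithCacheDir(cacheDir)
% Create a Memoizer that keeps its memo files under CACHEDIR instead of
% next to the image files.
if ~exist(cacheDir, 'dir')
    mkdir(cacheDir);                     % the cache folder must exist
end
r = loci.formats.Memoizer(bfGetReader(), 0, java.io.File(cacheDir));
end
```

The returned reader is used like any other: setId, read planes, close. Keeping the memo files in one folder also makes them easier to delete in one go.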

Best,
Sebastien


#16

Thank you once again!