PixelData threads and pyramid generation issues

Dear all,

TL;DR: pyramid generation has been breaking spectacularly for us. We think we’ve found a solution that works, but we don’t understand it completely.

We’ve been having a heck of a time lately with pyramid generation - in particular, with large-plane, multi-series LIF files. This first showed up after a large import that included a few very large (20-70 GB) LIF files, generating something like 100 new (large-plane, multi-channel) images. A few days after the import, a user reached out to say their files had no thumbnails and couldn’t be opened in iviewer. Sure enough, those files had no pyramids: PixelData had logged activity generating exactly 4 pyramids right after the import (matching the number of PixelData threads we had configured) and absolutely nothing since. Most of the images imported since that day have not had pyramids created.

We did our research and read over every forum topic and discussion on similar issues in the past, but those did not do much to improve our understanding of the problem, so I’m writing this to try and put all the info we have in one place - hopefully it helps someone in the future!

We have since been trying to pinpoint what the problem actually is. Here are facts about the issue and things we have tried:

  1. Restarting the server (or restarting just PixelData-0 via omero admin ice - see the sketch after this list) nudges it back into life: a few pyramids are generated and then it stops again. Most of the time PixelData stops logging activity after one “batch” of pyramids, sometimes it gets as far as two, but then it stops again;

  2. We have reproduced the same behavior in our dev server, where ManagedRepository is on a different kind of storage altogether;

  3. The start of these issues coincides with moving Pixels and most of ManagedRepository to a NFS mount, but given item #2 I don’t think there’s causality here;

  4. Sometimes the stoppage is preceded by something like ome.conditions.SessionTimeoutException: Session (started=2021-02-26 15:38:03.723, hits=13, last access=2021-02-26 15:38:03.81) exceeded timeToIdle (600000) by 1638496 ms appearing on the PixelData log, but sometimes it isn’t;

  5. After processes get “stuck”, we have seen things like exporting a PDF from OMERO.Figure also stop working (with a “No Processor Available” pop-up error), but this hasn’t been reliably reproducible;

  6. Starting/completing a new import does not “nudge” PixelData-0 into action. If it is in its “stuck” state, it does not generate pyramids for new imports;

  7. We have not found any issues with the files per se. When nudged, PixelData CAN generate pyramids from pretty much any file there; it just refuses to after a while.
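For reference, the “nudge” in item 1 is just a restart of the PixelData service. A minimal sketch, assuming the service is registered under the default name PixelData-0 (adjust to your deployment):

# restart only the PixelData service rather than the whole server
omero admin ice server stop PixelData-0
omero admin ice server start PixelData-0

# or, more bluntly
omero admin restart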

Our working theory is that PixelData threads are timing out (and somehow dying) when generating very large pyramids, so subsequent pyramids that need to be generated just aren’t. We have bumped up the number of PixelData threads (and min_threads accordingly) and substantially increased omero.sessions.timeout (roughly as sketched after the questions below) - that seems to have solved it for the moment. There are still a lot of things we do not understand:

  1. Some pyramids take longer than the (new, extended) timeout yet still complete without issue and don’t make PixelData get “stuck”. Does it only stop completely if ALL threads time out?

  2. Are these threads obeying omero.sessions.timeout? If so, why? Should there be separate timeouts for regular sessions and these threads?

  3. Are there substantial downsides to the changes we’ve made? We have already seen an increase in the number of live sessions (particularly Public ones), but it has not been an issue so far.

  4. What is a “normal” amount of time to generate pyramids for, say, a 15k x 20k 4-channel image? We’re seeing it take almost two hours (!!!), and our impression is that it is severely limited by write speed on our storage (for example, increasing the number of PixelData threads resulted in a similar increase in time to completion). Is there anything we can do about it?
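To be concrete, the configuration changes mentioned above amount to something along these lines (illustrative values only, not necessarily the ones we settled on; a restart is needed for them to take effect):

omero config set omero.pixeldata.threads 10
omero config set omero.threads.min_threads 15
omero config set omero.sessions.timeout 6000000
omero admin restart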

We would really like to understand the problem better to make sure it doesn’t come up again, so any feedback is appreciated!

Hey Erick,

thanks for all the sleuthing!

In addition, have you observed any of the processes, specifically PixelData-0, with jstack in the “hung” state? Have you seen OutOfMemory exceptions anywhere?

To your questions:

  1. If I understand you correctly, I imagine you are running into All pixel data threads hang per batch until all complete · Issue #122 · ome/omero-server · GitHub, i.e. a batch must complete on all threads before the next one starts.
  2. Threads should be independent of omero.sessions.timeout so I find it interesting that you said bumping the limit produced better results (?).
  3. In general, no, no downside. But if the issue is memory or other resource exhaustion, having more threads at the same time could be an issue.
  4. “It depends”. It might be worth experimenting with pyramid generation outside of OMERO in order to give you a feel for that.

Or indeed, is converting to OME-TIFF with bioformats2raw and raw2ometiff an option?
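Roughly, that two-step conversion looks like the following (just a sketch; the intermediate directory name is arbitrary). The resulting pyramidal OME-TIFF can be imported as-is, so OMERO shouldn’t need to generate its own pyramid for it:

# LIF -> intermediate Zarr-based pyramid
bioformats2raw input.lif /tmp/input_pyramid
# intermediate -> pyramidal OME-TIFF
raw2ometiff /tmp/input_pyramid output.ome.tiff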

~Josh

Just to mention: in theory, big LIF files come with .lifext files which already contain pyramids. It would be so much easier if Leica put some effort into making Bio-Formats read these directly.


We haven’t done any jstacking, to be fair - no OutOfMemory exceptions anywhere. For the specific points:

  1. Yep, that part we understood - this should probably be part of the timeout question. Some batches contain one or two files that definitely go over the timeout limit, but it seems like as long as one pyramid creation in the batch ends before timeout, it doesn’t hang everything and a new batch starts when the current one ends.
  2. We are also very puzzled by that, but it was pretty clear that this is what sorted it out, both in our dev server and in production. The fact that we got SessionTimeoutExceptions in our PixelData log is also not what we expected to see.
  3. We’re not TOO too worried about resources - no OOM anywhere, we have plenty of cores to throw at this and so on. Other than the concurrent writes when creating pyramids being slower, we’ve seen nothing that worries us. We were just a bit concerned about upping timeout causing unintended consequences and wanted to double-check!
  4. We might give pyramid generation outside OMERO a try and see how it goes.

OME-TIFF conversion is a possibility, but I think at that point I’d just tell the users to export their data as OME-TIFFs to begin with.

Hey @erickratamero.

If you have a chance, the jstack would interest me especially if you’ve seen no OutOfMemory exceptions. But investigating the timeouts will likely require being able to reproduce it locally anyway…

Let us know how things go.
~Josh

We’ve got our best person (i.e. @mellertd ) on the case now. He’s been doing A LOT of testing, and when we have some more results we’ll report back!


Hello @joshmoore. I have done a ridiculous amount of testing on this odd issue. Apologies for this monster post, but I took a somewhat experimental approach, so there are a lot of data here :)

Before the data dump, I will say that I am happy to continue doing some testing with jstack, but I will need some guidance as to what exactly I should be looking for. Do I run it after the PixelData service stalls, to look for stuck threads? It seems that jstack just gives a snapshot of the current state, and I am not quite sure how I would use it to help diagnose :(

I’ll start with the TLDR – I am fairly certain that omero.sessions.timeout matters, but only for a very specific case, as you will see. Anyway, without further ado:

Testing setup

Server: Using our Dev server – so not much user activity, if any, outside of these specific tests.

48 CPUs (24 cores)

Data: Two lif files. Each contains many images, each image is large enough to require a pyramid.

lif #1: 31 images
lif #2: 13 images

Variables:

omero.sessions.timeout
omero.pixeldata.threads
omero.threads.min_threads

Approach: omero admin restart before each attempt. Import with CLI (omero import *.lif)

Summary of results: In general, it seems that omero.sessions.timeout does matter, but only when omero.threads.min_threads does not exceed omero.pixeldata.threads, and only when both files are imported in a single import.

Example exception, when pyramid generation fails:

2021-03-09 12:50:06,527 ERROR [ ome.services.pixeldata.PixelDataThread] (1-thread-1) ExceptionException!
ome.conditions.SessionTimeoutException: Session (started=2021-03-09 11:50:32.006, hits=842, last access=2021-03-09 12:04:08.096) exceeded timeToIdle (600000) by 2158423 ms
    at ome.services.sessions.state.SessionCache.getDataNullOrThrowOnTimeout(SessionCache.java:470) ~[omero-server.jar:5.6.0]
    at ome.services.sessions.state.SessionCache.getSessionContext(SessionCache.java:368) ~[omero-server.jar:5.6.0]
    at ome.services.sessions.state.SessionCache.getSessionContext(SessionCache.java:353) ~[omero-server.jar:5.6.0]
    at ome.security.basic.CurrentDetails.login(CurrentDetails.java:164) ~[omero-server.jar:5.6.0]
    at ome.services.util.Executor$Impl.execute(Executor.java:439) ~[omero-server.jar:5.6.0]
    at ome.services.util.Executor$Impl.execute(Executor.java:392) ~[omero-server.jar:5.6.0]
    at ome.services.pixeldata.PixelDataThread.go(PixelDataThread.java:302) ~[omero-server.jar:5.6.0]
    at ome.services.pixeldata.PixelDataThread.access$000(PixelDataThread.java:51) ~[omero-server.jar:5.6.0]
    at ome.services.pixeldata.PixelDataThread$1.call(PixelDataThread.java:250) ~[omero-server.jar:5.6.0]
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[na:na]
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[na:na]
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[na:na]
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[na:na]
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[na:na]
    at java.base/java.lang.Thread.run(Thread.java:834) ~[na:na]

Specific testing scenarios:

Test scenario #1 - WORKS
Files: Both lifs

Config:
omero.sessions.timeout=6000000
omero.pixeldata.threads=10
omero.threads.min_threads=15

Purpose: Initial configuration that we got to work

Result: Successful import, successful pyramid generation

Test scenario #2 - WORKS
Files: Both lifs

Config:
omero.pixeldata.threads=10
omero.threads.min_threads=15

Purpose: Test whether timeout setting was needed in Test scenario 1.

Result: Successful import, successful pyramid generation

Test scenario #3 (Tested 2x) - FAILS
Files: Both lifs

Config:
omero.pixeldata.threads=10
omero.threads.min_threads=10

Purpose: Test whether it is necessary for omero.threads.min_threads to exceed omero.pixeldata.threads

Result: Successful import, pyramid generation fails after generation of many but not all pyramids

Test scenario #4 (Tested 2x) - WORKS
Files: Both lifs

Config:
omero.sessions.timeout=6000000
omero.pixeldata.threads=10
omero.threads.min_threads=10

Purpose: Test whether raising omero.sessions.timeout prevents error encountered in scenario 3.

Result: Successful import, successful pyramid generation

Test scenario #5 - WORKS
Files: Both lifs

Config:
omero.pixeldata.threads=10
omero.threads.min_threads=11

Purpose: Test whether omero.sessions.timeout is necessary when omero.threads.min_threads exceeds omero.pixeldata.threads by 1.

Result: Successful import, successful pyramid generation

Test scenario #6 - WORKS
Files: Both lifs

Config:
omero.pixeldata.threads=20
omero.threads.min_threads=21

Purpose: Test whether omero.sessions.timeout is necessary when omero.threads.min_threads exceeds omero.pixeldata.threads by 1, at a higher pixeldata thread count than scenario 5

Result: Successful import, successful pyramid generation

Test scenario #7 (Tested 2x) - FAILS
Files: Both lifs

Config:
omero.pixeldata.threads=20
omero.threads.min_threads=20

Purpose: Test whether omero.sessions.timeout is necessary when omero.threads.min_threads matches omero.pixeldata.threads, at a higher pixeldata thread count than scenario 3

Result: Successful import, pyramid generation fails after generation of many but not all pyramids

Test scenario #8 - WORKS
Files: Both lifs

Config:
omero.sessions.timeout=6000000
omero.pixeldata.threads=20
omero.threads.min_threads=20

Purpose: Test whether raising omero.sessions.timeout rescues pyramid generation as in scenario 7

Result: Successful import, successful pyramid generation

Test scenario #9 - WORKS
Files: Only file #2 (13 images)

Config:
omero.pixeldata.threads=20
omero.threads.min_threads=20

Purpose: Test whether the number of total images being imported from one import command seems to affect this.

Result: Successful import, successful pyramid generation

Test scenario #10 - WORKS
Files: Two separate, simultaneous imports of file #2 (26 images, 2 imports of 13 images each)

Config:
omero.pixeldata.threads=20
omero.threads.min_threads=20

Purpose: Test whether the number of total images being imported from one import command seems to affect this (vs. total number of simultaneous images being imported by OMERO and queued for pyramid generation)

Result: Successful import, successful pyramid generation

Test scenario #11 - WORKS
Files: Two separate copies of file #2, imported in one import command (26 images total)

Config:
omero.pixeldata.threads=20
omero.threads.min_threads=20

Purpose: Test whether the number of total images being imported from one import command seems to affect this (vs. total number of simultaneous images being imported by OMERO and queued for pyramid generation)

Result: Successful import, successful pyramid generation

Test scenario #12 - WORKS
Files: File #1 only, 31 images from one file

Config:
omero.pixeldata.threads=20
omero.threads.min_threads=20

Purpose: Test whether it matters if one file is producing more images than pixeldata threads

Result: Successful import, successful pyramid generation


:clap: You’re likely to get a trophy.

Understood. I’ve never found an easy way to explain how and/or write a parser to diagnose via jstacks. Often, you’re looking for things that “keep doing the same thing” but that requires filtering out the things that are supposed to be doing the same thing.
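If you do want to give it a try, the mechanics are simple; the interpretation is the hard part. A rough sketch, assuming the PixelData-0 JVM can be found by name with pgrep (otherwise grab the PID from ps/top):

# take a few thread dumps a minute apart while PixelData-0 is "stuck"
PID=$(pgrep -f PixelData-0)
for i in 1 2 3; do
  jstack "$PID" > "pixeldata-jstack-$i.txt"
  sleep 60
done
# threads sitting in the same stack in every snapshot are the interesting ones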

This (along with all your proof) does point a strong finger. My interpretation for now would be that the thread pool is being depleted so that the login logic can no longer progress. I’m not sure what is doing the depleting though! A workaround at the server level would be to just always bump min_threads to be sufficient. What are the dimensions of each image in the LIF? I wonder if a fake file would let me reproduce…

N.B. I assume two File #1’s would also fail with threads == min_threads.

~Josh


FWIW this post triggered me to check through my logs thoroughly again:

We can also see those errors (our config has omero.threads.min_threads == omero.pixeldata.threads) from time to time. I had assumed our user reports of pyramid creation errors all stemmed from the batchwise multithreading issue (All pixel data threads hang per batch until all complete · Issue #122 · ome/omero-server · GitHub), but apparently we also run into this error, which then halts the PixelData process entirely.
For us, this occurs for large images in general from time to time, very much independent of the imported file format (we have seen it happen with .czi, .lsm and .tiff).

So +1 reproduced for me, if you need any diagnostics I’m also happy to help.

//
Julian


Further preliminary observation from our logs after bumping omero.threads.min_threads from 8 (== omero.pixeldata.threads) to 10:
It seems like the processing speed of our images in general went up, so not only does the process no longer stop, but each image also seems to process faster than before. I’ll watch this over the weekend (there’s a big backlog of images to process) and then I’ll try to come up with a reproducible example.
But if this holds true it would suggest to me that background tasks interfere with the pyramid threads if omero.threads.min_threads == omero.pixeldata.threads and therefore slow down the generation in general.

//
Julian


Thanks for the help in sleuthing here, everyone. Very valuable! ~J.

Each image in the lif is ~10K, ~20K, 1, 4, 1 (XYZCT).
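@joshmoore, in case it helps with reproducing: a Bio-Formats “fake” file along these lines should roughly approximate lif #1 (parameters are encoded in the filename; the pixel type is my guess rather than something I checked against the real file):

# empty file whose name encodes the synthetic image parameters
touch "lif1&series=31&sizeX=10000&sizeY=20000&sizeZ=1&sizeC=4&sizeT=1&pixelType=uint16.fake"
omero import "lif1&series=31&sizeX=10000&sizeY=20000&sizeZ=1&sizeC=4&sizeT=1&pixelType=uint16.fake"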

These *.lif files cause all sorts of problems beyond the pyramid issues, by the way; I may start a separate thread. In short, even when pyramids are built, thumbnails take a very long time to load, and you can’t see any thumbnails in a dataset until all of the LIF thumbnails load (which, again, takes several minutes).

What’s very weird is that the thumbnail has clearly been rendered, but there is still no thumbnail in the middle pane (it’s been 15 minutes on this particular dataset). See here:

Edit: I got the thumbnails to load by individually selecting each image, waiting for the thumbnail to appear in the preview pane, then refreshing the page after doing this for all images in the dataset.

Hi Dave,
The middle pane of the webclient loads thumbnails in batches, which is more intensive when new thumbnails have to be calculated, and it can fail if one or more thumbnails are missing or if an image is still importing or creating pyramids.
You might find that refreshing after the images are viewable will load the thumbnails.
Certainly if you “Save” the rendering settings in the right panel then the thumbnail for that image should update in the centre.

Will