Dear all,
TL;DR: pyramid generation has been breaking spectacularly for us. We think we’ve found a solution that works, but we don’t understand it completely.
We’ve been having a heck of a time lately with pyramid generation, in particular with large-plane, multi-series LIF files. This first showed up recently, after a large import of a few very large (20-70 GB) LIF files that generated something like 100 new (large-plane, multi-channel) images. A few days after the import, a user reached out to say their files had no thumbnails and couldn’t be opened in iviewer. Sure enough, those images had no pyramids: the PixelData log showed exactly 4 pyramids being generated right after the import (matching the number of PixelData threads we had set) and absolutely nothing since. Most of the images imported since that day have not had pyramids created.
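For anyone wanting to check their own server, here is a minimal sketch of one way we think you could list images that ought to have a pyramid but don’t. The 3192 x 3192 threshold is the documented default (`omero.pixeldata.max_plane_width`/`max_plane_height`), and the database name and `/OMERO/Pixels` path are assumptions to adjust for your setup:

```
# Sketch only: Pixels larger than the default 3192 x 3192 limit should end up with a
# <id>_pyramid file in the binary repository; report the ones that don't have one yet.
psql -d omero -At -c "SELECT id FROM pixels WHERE sizex > 3192 OR sizey > 3192;" |
while read -r pid; do
    [ -e "/OMERO/Pixels/${pid}_pyramid" ] || echo "Pixels:${pid} has no pyramid yet"
done
```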
We did our research and read over every forum topic and discussion we could find on similar issues, but they did not do much to improve our understanding of the problem, so I’m writing this to try and put all the info we have together in one place. Hopefully it helps someone in the future!
We have since been trying to pinpoint what the problem actually is. Here are the facts about the issue and the things we have tried:
- Restarting the server (or restarting just PixelData-0 via `omero admin ice`; see the sketch after this list) nudges it back into life: a few pyramids are generated and then it stops again. Most of the time PixelData stops logging activity after one “batch” of pyramids; sometimes it gets as far as two before going quiet again;
- We have reproduced the same behavior on our dev server, where ManagedRepository is on a different kind of storage altogether;
- The start of these issues coincides with moving Pixels and most of ManagedRepository to an NFS mount, but given item #2 I don’t think there is causality here;
- Sometimes the stoppage is preceded by something like `ome.conditions.SessionTimeoutException: Session (started=2021-02-26 15:38:03.723, hits=13, last access=2021-02-26 15:38:03.81) exceeded timeToIdle (600000) by 1638496 ms` appearing in the PixelData log, but sometimes it isn’t;
- After processes get “stuck”, we have also seen exporting a PDF from OMERO.Figure stop working (with a “No Processor Available” pop-up error), but this hasn’t been reliably reproducible;
- Starting/completing a new import does not “nudge” PixelData-0 into action: if it is in its “stuck” state, it does not generate pyramids for new imports;
- We have not found any issues with the files themselves. When nudged, PixelData CAN generate pyramids from pretty much any file there; it just refuses to after a while.
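For reference, the “nudge” in the first item above is roughly the following, restarting only the PixelData-0 service rather than the whole server (the log path assumes `OMERODIR` points at the server install directory):

```
# Restart just the pyramid-generation service
omero admin ice server stop PixelData-0
omero admin ice server start PixelData-0
# Then watch its log to see whether it picks work up again
tail -f "$OMERODIR/var/log/PixelData-0.log"
```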
Our working theory is that PixelData threads are timing out (and somehow dying) when generating very large pyramids, and subsequent pyramids that need to be generated just aren’t. We have bumped up the number of PixelData threads (and `omero.threads.min_threads` accordingly) and substantially increased `omero.sessions.timeout`; that seems to have solved it for the moment (the changes are sketched below).
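Concretely, the changes were along these lines (the values here are illustrative rather than a recommendation, and `omero.sessions.timeout` is in milliseconds):

```
# Illustrative values only - tune to your own hardware and workload
omero config set omero.pixeldata.threads 8        # we previously had 4
omero config set omero.threads.min_threads 10     # bumped alongside the PixelData threads
omero config set omero.sessions.timeout 3600000   # 1 hour in ms; the default is 600000
omero admin restart
```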
There are still a lot of things we do not understand:
- Some pyramids take longer than the (new, extended) timeout, yet still complete without issue and don’t make PixelData get “stuck”. Does it only stop completely if ALL threads time out?
- Are these threads really bound by `omero.sessions.timeout`? If so, why? Should there be separate timeouts for regular user sessions and for these threads?
- Are there substantial downsides to the changes we’ve made? We have already seen an increase in the number of live sessions (particularly Public ones), but it has not been an issue so far.
- What is a “normal” amount of time to generate pyramids for, say, a 15k x 20k 4-channel image? We’re seeing it take almost two hours (!!!), and our impression is that it is severely limited by write speed on our storage (for example, increasing the number of PixelData threads resulted in a similar increase in time to completion). Is there anything we can do about it? (A rough back-of-envelope follows this list.)
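On that last question, here is a rough back-of-envelope plus a crude way to watch effective write speed. It assumes 16-bit pixels and a single z/t plane per channel, and the `/OMERO/Pixels` path is again an example:

```
# 15000 x 20000 x 4 channels x 2 bytes ~= 2.4 GB at full resolution; the downsampled
# pyramid levels add roughly another third on top (1/4 + 1/16 + ... ~= 1/3).
echo $(( 15000 * 20000 * 4 * 2 )) bytes at full resolution

# Watching the pyramid file grow gives a crude read on effective write speed to the Pixels mount.
watch -n 60 'ls -lh /OMERO/Pixels/*_pyramid 2>/dev/null | tail -n 5'
```

If that estimate is in the right ballpark, two hours corresponds to well under 1 MB/s of effective pyramid writing, so either the storage really is that slow for this tile-by-tile access pattern or the time is going elsewhere (e.g. reading the original file through Bio-Formats).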
We would really like to understand the problem better to make sure it doesn’t come up again, so any feedback is appreciated!