Question on OMERO's "BioFormatsCache" directory

Dear @OMETeam,

as we are currently in the process of maintaining our OMERO installation :wink:, I was wondering about the I/O impact of where the BioFormatsCache directory is located.

Currently it is on the same storage as the ManagedRepository, i.e. on GPFS in our case. Performance-wise, I would expect a better experience if the cache were on a fast local volume (at least SSD, or rather NVMe).
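
To make it concrete, what I have in mind is essentially just repointing the memoizer directory, roughly like this (a sketch assuming omero.pixeldata.memoizer.dir is the relevant server property; the path is only an example):

# rough sketch - assumes omero.pixeldata.memoizer.dir controls the memo file
# location and that /nvme/BioFormatsCache is a suitable local path
omero config set omero.pixeldata.memoizer.dir /nvme/BioFormatsCache
omero admin restart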

Looking at it the other way round, what would be the “risk” of losing the cache’s contents? From my understanding I wouldn’t bother backing it up at all, as it should be re-generated on demand (or at least upon a service restart).

Any input is highly appreciated!

Thanks,
~Niko

2 Likes

Moin, Niko.

The amount of data being read is usually not huge (though a histogram of log sizes of your files might be interesting), but it does happen frequently.

Here the metric is how long it takes to regenerate your files. In the case of IDR, for example, there are pathological filesets which take 20 minutes; if that’s the case for you, it will be painful. For most filesets, though, regeneration is relatively fast and transparent: the first user will see a hang and subsequent accesses will be faster. However, if you happen to get many simultaneous requests with no cache, you’ll have a bumpy but recoverable ride.

You might want to test how long a full re-caching takes. See glencoesoftware/omero-ms-image-region (OMERO image region Vert.x asynchronous microservice server, https://github.com/glencoesoftware/omero-ms-image-region) for a fully out-of-OMERO process for generating them.
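
For a rough number, you could simply wrap the regeneration script in time, e.g. (a sketch; the flags mirror the invocation that appears further down this thread, so verify them against the script’s own usage):

# sketch: time a full regeneration run into a throw-away directory
time ./regen-memo-files.sh \
    --jobs 1 \
    --cache-options /tmp/BioFormatsCache-test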

~Josh

2 Likes

Moin Josh!

Running the following command(s)

# bucket file sizes into powers of two and count the files per bucket
find . -type f -print0 | \
    xargs -0 ls -l | \
    awk '{size[int(log($5)/log(2))]++}END{for (i in size) printf("%10d %3d\n", 2^i, size[i])}' | \
    sort -n

gives the following (on our production server):

     16384 5911
     32768 162297
     65536 29011
    131072 52685
    262144 20748
    524288 15337
   1048576 9785
   2097152 3504
   4194304 690
   8388608 3124
  16777216  47
  33554432   5
  67108864   2
 268435456   2
 536870912   1

All in all ~120 GB, which doesn’t seem like much to me, given that our ManagedRepository on this machine is now beyond 220 TB…
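
For reference, the overall size is just a quick du away:

# run inside the BioFormatsCache directory to get the total size
du -sh .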

That’s something I wanted to try anyway.

Nevertheless, a bumpy but recoverable ride (with nothing to do on the admin side) is an acceptable risk (for us), given that it’s only relevant for the (unlikely) case of a full crash of the local RAID storage.

Thanks for the input!

~Niko

1 Like

Another follow-up question regarding the cache…

In the official upgrade instructions it is stated that

cached Bio-Formats memoization files created at import time will be invalidated by the server upgrade

If this is true, my interpretation is that we would need to go through the re-generation step anyway (or is this just a leftover from older upgrade instructions, referring to version changes that included specific Bio-Formats upgrades which invalidated the cache…?).

If so, I’m wondering whether (and how) this can be done “upfront” - in other words, preparing a new BioFormatsCache directory before actually shutting down the “old” instance and performing the other upgrade steps - especially if this turns out to be a multi-hour step.

Currently I’m running the regen-memo-files.sh script on our test server. That one is already on 5.6.3, so I’m wondering how to specify which OMERO version to use in case I’d be doing this “pre-upgrade” on our production machine. The config.yaml contains nothing but the DB connection that is used to assemble the file list (admittedly, there is omero.script_repo_root, but I couldn’t find any script that looks like it is part of that procedure). Is it built (hard-coded) into the MemoRegenerator?

Or am I missing something entirely here?

Puzzled,
~Niko

For significant bumps of Bio-Formats, yes, this is still true, and should be mentioned in the release notes.

That should be what the regen scripts get you.

That would be in the build files for the repos (see the build.gradle links below).

~J.

1 Like

Hi Josh,

Would be great to know somehow when it’s necessary and when not :slight_smile:

That’s what I assumed; I was just confused by the version stuff.
So that means running the regen script to create the cache off-site, then performing the OMERO upgrade and moving the previously generated cache into the right location will get the job done, right?
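
In shell terms, I’m picturing the final swap roughly like this (just a sketch with example paths; /OMERO standing in for omero.data.dir in our layout):

# post-upgrade swap (example paths; /OMERO = omero.data.dir)
mv /OMERO/BioFormatsCache /OMERO/BioFormatsCache.pre-upgrade
mv /export01/BioFormatsCache-rebuild /OMERO/BioFormatsCache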

Yes and no, unfortunately. I can’t make up my mind about what omero-blitz:5.5.6 means in terms of compatibility regarding OMERO and/or Bio-Formats :exploding_head: - sorry :confused:

Now, for the even more confusing part…

To prepare the cache without affecting our users too much, I did a test run on our production machine using the following command (restricting it to a single job to keep the load on the production machine low):

./regen-memo-files.sh \
    --jobs 1 \
    --cache-options /export01/BioFormatsCache-rebuild

The process took slightly more than 45 minutes, which is a very acceptable time frame. However, unlike on our test server, the specified directory is completely empty afterwards :pleading_face: :astonished: - did I do something wrong here?

Thanks,
~Niko

Sorry, with “should” I meant, “it is”:

The Bio-Formats Memoizer cache will not be invalidated on upgrade from OMERO.server 5.6.2.

Yeah, that’s all definitely more trouble than we’d like.

omero-blitz/build.gradle at 9a0bfb76167d21137548b192f4f9fe54be067e3c (ome/omero-blitz)
→ omero-server/build.gradle at 75c82f40071a9222b4f2bbfd039951e2bcc251e5 (ome/omero-server)
→ omero-renderer/build.gradle at 9baa98e231a582011c9abf18fe7ed74917746bbd (ome/omero-renderer)
→ omero-romio/build.gradle at 53553686cdf96eae8a2cbc1bb0c0c5b72b09e9c4 (ome/omero-romio)
→ omero-common/build.gradle at 0389efd1313af5228343d8781ccfe2e249bed5dd (ome/omero-common)
→ omero-model/build.gradle at fa9d4a4431691b2efa4bcf13d3344fe8cb6778f7 (ome/omero-model)
→ formats-gpl 6.3.1
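
If you want to verify the last link of that chain yourself, something along these lines should do it (a sketch; the commit is the one linked above):

# check which formats-gpl (Bio-Formats) version omero-model pins
git clone https://github.com/ome/omero-model.git
cd omero-model
git checkout fa9d4a4431691b2efa4bcf13d3344fe8cb6778f7
grep formats-gpl build.gradle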

Uh. I don’t know if you did something wrong or if something is just broken.
~J.

I’ll re-run and see what happens.

1 Like

Unfortunately, it’s all the same. The process took ~45 minutes with nothing in the results (--cache-options) directory.

I’ll have to see if I can find some time to debug this more.

If not, what would be the fallback strategy regarding our 5.4.10 → 5.6.3 upgrade?

  • leave the existing BioFormatsCache directory untouched?
  • wipe it and wait for it to be repopulated on-the-fly?

Thanks,
Niko

Hi @ehrenfeu,

In most instances, leaving the BioFormatsCache untouched is fine. Individual memo files should be accessed, invalidated and regenerated on-the-fly as the data gets accessed. This is the strategy used for the production servers at the University of Dundee, including the OME demo server and the SLS production research instance and likely many other instances.

One advantage of wiping out or renaming the old cache directory would be to save the time associated with opening, invalidating and deleting the old files. But I think in most cases this is not the bottleneck. It might also be useful if you are interested in measuring how many filesets have been re-accessed/regenerated post-upgrade.
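
If you go the renaming route, something along these lines would give you a rough number (a sketch; /OMERO/BioFormatsCache is an example location, adjust to your omero.data.dir):

# before the upgrade: move the old cache aside
mv /OMERO/BioFormatsCache /OMERO/BioFormatsCache.pre-upgrade
# some time after the upgrade: count the memo files regenerated since then
find /OMERO/BioFormatsCache -type f | wc -l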

1 Like

Thanks @s.besson, this resolves my remaining doubts - highly appreciated!

Guess we have a green light :vertical_traffic_light: to start the upgrade tomorrow! :partying_face:

2 Likes

Glad it clears up the remaining uncertainties. As always, let us know how things go!

1 Like