OMERO keeps crashing

Hi,

we’re currently experiencing massive problems with our OMERO server when importing files.
Imports constantly fail after a few files and the OMERO server stops responding. Only after a restart are we able to use it again.

OMERO version is 5.5.1
OMERO.insight version is 5.5.3

Logs are attached below.
From what I could see, it seems to be a Postgres problem.

master.err.txt (202.1 KB) Indexer-0.log.txt (1.3 MB) PixelData-0.log.txt (483.7 KB)

Any ideas?

Best,
Gebhard

Hi @Gebhard,

Why do you suspect Postgres? Is it because of these lines in master.err?

Jul 29, 2019 9:06:49 AM org.postgresql.Driver connect
SEVERE: Connection error: 
org.postgresql.util.PSQLException: Connection to localhost:5432 refused.

If so, those errors would prevent OMERO from starting at all. If you are able to log in but then OMERO disconnects on import, I’d be more inclined to suspect resource exhaustion. How much memory do your servers have?

bin/omero admin jvmcfg

Also, what type of files are you importing and how large are they?

Cheers,
~Josh

Hi @joshmoore,

I didn’t have time to take a close look, but the Postgres errors were the first thing that caught my eye.

Here’s what bin/omero admin jvmcfg gave me:

blitz=-Xmx3792m -XX:MaxPermSize=1g -XX:+IgnoreUnrecognizedVMOptions
indexer=-Xmx2528m -XX:MaxPermSize=1g -XX:+IgnoreUnrecognizedVMOptions
pixeldata=-Xmx3792m -XX:MaxPermSize=1g -XX:+IgnoreUnrecognizedVMOptions
repository=-Xmx2528m -XX:MaxPermSize=1g -XX:+IgnoreUnrecognizedVMOptions

The server runs on a VM with 16 cores and 24 GB of RAM.

We’re importing .CZI files from our AxioScan.
We sequentially run import batches of 20-50 files, ranging from 200 MB to 5 GB each.

Best,
Gebhard

Hi @Gebhard,

I’d suggest starting by increasing the memory for the Blitz process. 8 GB might be a good starting point, depending on what else is running on this machine. See https://docs.openmicroscopy.org/omero/5.5.1/sysadmins/server-performance.html#examples for info on how to do so.
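
Concretely, that should amount to something along these lines (just a sketch, assuming the standard bin/omero CLI; the heap_size value is in megabytes, matching what jvmcfg reports):

bin/omero config set omero.jvmcfg.heap_size.blitz 8000
bin/omero admin restart
bin/omero admin jvmcfg    # check the resulting -Xmx values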

All the best,
~Josh

@joshmoore, I changed the settings to this:

blitz=-Xmx8000m -XX:MaxPermSize=1g -XX:+IgnoreUnrecognizedVMOptions # Settings({'system_memory': '24000', 'heap_size': '8000'})
indexer=-Xmx8000m -XX:MaxPermSize=1g -XX:+IgnoreUnrecognizedVMOptions # Settings({'system_memory': '24000', 'heap_size': '8000'})
pixeldata=-Xmx8000m -XX:MaxPermSize=1g -XX:+IgnoreUnrecognizedVMOptions # Settings({'system_memory': '24000', 'heap_size': '8000'})
repository=-Xmx8000m -XX:MaxPermSize=1g -XX:+IgnoreUnrecognizedVMOptions # Settings({'system_memory': '24000', 'heap_size': '8000'})

However, I’m wondering why this didn’t happen before. We’ve only been having this problem since the update to 5.5.0.

Best,
Gebhard

You probably only need to go up to 8 GB for the Blitz process; the others can be reduced again later, which saves you another restart for now.

Hmm… that’s a good question. If you were previously doing these imports on 5.4 with the previous, lower memory settings and not seeing these issues, then something is fishy.

~J.

It’s still happening…

I’ve tested it with two types of our typical files (in both cases still CZI files from the AxioScan):

  1. Brain slices with 5 Scenes per file, about 3GB per file
  2. Spinal cord slices, 8-12 Scenes per file, 100 - 300 MB per file

I’ve watched the upload procedure and noticed the following:

  1. The import process (reading metadata, generating thumbnails, etc.) seems to be much slower for the smaller files than for the 3 GB files

  2. The whole process got slower over time, up to the point where nothing seemed to be happening anymore. Then single imports started to fail, and finally we got a message that the server was not responding and the import had been aborted.

  3. I’ve watched the CPU and memory utilization with htop on the server.

  • RAM usage was around 8 GB (as it was before the changes to the Java config)
  • At some point, CPU usage on the server dropped. I noticed this around the same time we had the impression that the import was no longer making progress. Where previously almost all 16 cores were busy, only 2-3 cores were being used.

Gebhard

Could you submit your Blitz-0.log too please? (It might compress well.) If you can let us know when a problematic import starts, that will help us find it in the log.

@mtbc, sure.

I had to truncate the Blitz log a bit, so it only covers the last 3 days. I’ve also included the other logs in which I found some errors.

Blitz-0.zip (12.5 MB) logs.zip (381.9 KB)

Ah, great, thank you. The DatabaseBusyExceptions are interesting. Following https://docs.openmicroscopy.org/omero/5.5.1/sysadmins/server-performance.html, have you tried increasing omero.db.poolsize substantially to see if that helps? (Though I don’t know why the upgrade to 5.5 would affect that.) If there are any other configuration settings that could be relevant, feel free to ask about them or share them for comment.
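
As a rough sketch (the value is only illustrative, and the psql call assumes a local Postgres reachable as the postgres superuser; the pool size must stay below Postgres’ max_connections):

bin/omero config set omero.db.poolsize 50
bin/omero admin restart
sudo -u postgres psql -c "SHOW max_connections;"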

omero.db.poolsize is set to 100, but it didn’t help.

And what is Postgres’ max_connections set to? ~J

max_connections = 100
shared_buffers = 512MB

These settings have been working fine in the past. I don’t see why they would cause problems now…?

Is this a dedicated OMERO server, or are there other applications running? Have you checked the system logs in case there’s a low-level problem, such as hardware or kernel errors? Has your operating system or any applications, including Postgres, been patched recently, and if so, did that coincide with these problems?

Could you also upload your postgres logs?

Here are the postgres logs I could find.

postgresql-9.6-main-2.txt (4.3 KB) postgresql-9.6-main-1.txt (3.6 KB)

The server is dedicated to OMERO; even OMERO.web is running on a different VM.
I did run a dist-upgrade after I noticed the problems, but it didn’t change anything.

I’ve also searched the syslog and the kernel log for “Error”, “Critical”, and “Warning”, but didn’t find anything. The only thing that popped out was a truckload of messages like this:

ipmi-sensors:20670 map pfn expected mapping type uncached-minus for [mem 0xbfeef000-0xbfeeffff], got write-back
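
(For reference, the search was roughly along these lines; log paths as on our Debian/Ubuntu VM:)

grep -iE "error|critical|warning" /var/log/syslog /var/log/kern.log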

There is certainly some per-file overhead in import, which penalizes formats with many small files. I’m afraid we still have little idea why your server is ailing, so offhand I am trying to think of what might yield clues.

Was anything else upgraded at the same time as OMERO? Are you importing using Insight or the CLI? (Though they should be similar in how they affect the server for import, I think the CLI should avoid the issue fixed in https://github.com/ome/omero-gateway-java/pull/16 .) Are you adding any interesting non-default import options? Could you attach the output of “bin/omero admin diagnostics” and “bin/omero config get”, just in case we notice anything that could be a clue? Perhaps even ulimit -a for the OMERO server user is worth a look.

When your server is having trouble, could you use the JDK’s jstack tool to capture the server state and attach that output here? Probably the Blitz process is the only relevant one. I’m also wondering about watching the DB activity: something with netstat or lsof to look at the connections, watching the pg_stat_activity table, or using logback.xml to bump org.hibernate.SQL up to DEBUG to see what is different when things go wrong. Use whatever you are confident with, but with luck the jstack alone might tell us plenty.
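
For instance, something along these lines (the Blitz PID is a placeholder, and the psql invocation assumes a local postgres superuser):

# thread dump of the stuck Blitz process
jstack <blitz-pid> > blitz-jstack.txt

# what the DB connections are doing at that moment
sudo -u postgres psql -c "SELECT state, wait_event_type, count(*) FROM pg_stat_activity GROUP BY 1, 2;"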

It may yet be that the storage backend isn’t the problem, but I do concur that it sounds like some resource exhaustion issue, perhaps with the server’s threads or Ice callbacks to the client.

We’ve just noticed two things.

  1. Upload is getting slower over time
  2. The problem seems to occur mostly with small files (<200 MB), when about 10 upload/import processes are running simultaneously. Just now we’re running an import with files mostly >1 GB and, except for slowing down after some 30+ files, it is still running.

Could it be some sort of concurrency / deadlock / resource issue (e.g. releasing DB handles, finalizing transactions, entering critical sections, …) when “too many” import threads are accessing the DB simultaneously?

To answer your questions:

I ran a dist-upgrade after we noticed the issues, but it didn’t help.

We are using OMERO.insight; the CLI and biologists don’t mix well.
The only parameters we set are the owner (not necessarily the person uploading the files), the project and the dataset.

bin/omero admin diagnostics: omero-diag.txt (3.6 KB)
bin/omero config get: omero-config-get.txt (1.5 KB)
ulimit -a: ulimit.txt (708 Bytes)

Since it’s currently running with some larger files, I’ll post the jstack when we encounter the next crash.

OMERO.insight 5.5.5 may fix the issue for you, as it includes the PR I mentioned above. Given the severity of the issue at your site, it’s definitely worth a try, though I appreciate that it might be difficult to prevent earlier versions of Insight from connecting and slowing your server. If you can find a way to test that hypothesis, please do let us know how it goes.

Incidentally, while I don’t think it’s related to this issue, I noticed that your ulimit -n seems a bit low; OMERO.server can be quite greedy for file descriptors. On our default server deployments we normally set the open file limit to 65,535.
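
For example, in /etc/security/limits.conf that could look roughly like this (assuming the server runs as a user called omero-server; a systemd unit would instead use LimitNOFILE=65535):

omero-server  soft  nofile  65535
omero-server  hard  nofile  65535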

OK, I upgraded to OMERO.insight 5.5.6 and set ulimit -n to 65535.
I’ll let you know whether it helped when we have the next batch of files to upload.