OMERO.server hangs intermittently after image import with OMERO.insight

Hi everyone,

our OMERO.server hangs now and then after importing big (> 1 GB) image files via OMERO.insight. The issue is not caused by one specific image; it occurs more or less randomly when uploading many big images. Sometimes it happens during the upload itself, sometimes during the processing step.

What does “hanging” mean for us: the web interface can be reached, but logging in is no longer possible (the login takes ages and then fails with the error “405 Not allowed”). The OMERO.insight client doesn’t throw any error during the event, and I can’t see anything unusual in the server log files either. After some time (mostly hours) the server manages to fix itself; the master.err log then shows output like the following:

-! 3/23/20 17:05:58:083 warning: Blitz-0-Ice.ThreadPool.Server-79: dispatch exception:
   identity: session-161b15c0-34bd-41c5-8d03-b16ec285a983/3132c665-9e00-4529-abf7-adcaff352b2d
   facet:
   operation: getByName
   remote host: XXX.XXX.XXX.XXX remote port: 42462
   ome.conditions.SessionTimeoutException: Session (started=2020-03-23 12:19:59.357, hits=5384, last access=2020-03-23 12:54:55.072) exceeded timeToIdle (600000) by 14462968 ms
        at ome.services.sessions.state.SessionCache.getDataNullOrThrowOnTimeout(SessionCache.java:470)
        at ome.services.sessions.state.SessionCache.getSessionContext(SessionCache.java:368)
        at ome.services.sessions.state.SessionCache.getSessionContext(SessionCache.java:353)
        at ome.services.sessions.SessionManagerImpl.getEventContext(SessionManagerImpl.java:887)
        at ome.security.basic.BasicSecuritySystem.loadEventContext(BasicSecuritySystem.java:369)
        at ome.security.basic.EventHandler.doLogin(EventHandler.java:210)
        at ome.security.basic.EventHandler.invoke(EventHandler.java:146)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
        at org.springframework.orm.hibernate3.HibernateInterceptor.invoke(HibernateInterceptor.java:119)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
        at org.springframework.transaction.interceptor.TransactionInterceptor$1.proceedWithInvocation(TransactionInterceptor.java:99)
        at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:282)
        at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:96)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
        at ome.tools.hibernate.ProxyCleanupFilter$Interceptor.invoke(ProxyCleanupFilter.java:249)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
        at ome.services.util.ServiceHandler.invoke(ServiceHandler.java:121)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
        at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:213)
        at com.sun.proxy.$Proxy80.doWork(Unknown Source)
        at ome.services.util.Executor$Impl.execute(Executor.java:447)
        at ome.services.util.Executor$Impl.execute(Executor.java:392)
        at ome.services.blitz.impl.ServiceFactoryI.isGuest(ServiceFactoryI.java:571)
        at ome.services.blitz.impl.ServiceFactoryI.disallowGuest(ServiceFactoryI.java:585)
        at ome.services.blitz.impl.ServiceFactoryI.getByName(ServiceFactoryI.java:438)
        at ome.services.blitz.impl.ServiceFactoryI.getByName(ServiceFactoryI.java:430)
        at omero.api._ServiceFactoryTie.getByName(_ServiceFactoryTie.java:143)
        at omero.api._ServiceFactoryDisp.___getByName(_ServiceFactoryDisp.java:990)
        at omero.api._ServiceFactoryDisp.__dispatch(_ServiceFactoryDisp.java:1275)
        at IceInternal.Incoming.invoke(Incoming.java:221)
        at Ice.ConnectionI.invokeAll(ConnectionI.java:2536)
        at Ice.ConnectionI.dispatch(ConnectionI.java:1145)
        at Ice.ConnectionI.message(ConnectionI.java:1056)
        at IceInternal.ThreadPool.run(ThreadPool.java:395)
        at IceInternal.ThreadPool.access$300(ThreadPool.java:12)
        at IceInternal.ThreadPool$EventHandlerThread.run(ThreadPool.java:832)
        at java.lang.Thread.run(Thread.java:748)

[... and a few more dispatch exceptions looking almost the same ...]

The Indexer-0.log showed:

2020-03-23 17:05:57,792 ERROR [        ome.services.util.ServiceHandler] (1-thread-1) Method interface ome.services.util.Executor$Work.doWork invocation took 16891001

Afterwards the server is available again. The problematic image does not get imported, but everything else works again without any human intervention.

We cannot reproduce the issue reliably, but we think it is related to uploads of many big images. Since there seems to be no error message during the event, the analysis is quite difficult. I was wondering whether it could be due to resource issues. Maybe not enough memory on the Linux server? Database connectivity issues? Has anyone had a similar issue?

We are on server version 5.5.1 now, but we already had a similar incident with the earlier version 5.4.3. Since images keep getting bigger, the issue has become more important for us.

================================================================================
OMERO Diagnostics (admin) 5.5.1-ice36-b122
================================================================================
Commands:   java -version                  1.8.0     (/usr/bin/java)
Commands:   python -V                      2.7.5     (/usr/bin/python)
Commands:   icegridnode --version          3.6.5     (/usr/bin/icegridnode)
Commands:   icegridadmin --version         3.6.5     (/usr/bin/icegridadmin)
Commands:   psql --version                 9.4.26    (/usr/bin/psql)
Commands:   openssl version                1.0.2     (/usr/bin/openssl)

Any hints are much appreciated.
Thanks, Anna

Sorry to hear you’re running into ongoing issues here. I wonder what your server is doing while it hangs. omero admin diagnostics reports the PID of each active server process. From the standard Java Development Kit you can run jstack with that PID as the argument to get a printout of the current stack traces for that server process. I would suggest that, during the hang, you use top or similar to see which OMERO processes are eating a lot of CPU and/or memory, then run jstack a few times on the offender to get a sample over time of what it is doing. We’d be happy to look at a collection of such captures to help you figure it out; you can simply zip them and upload them to http://qa.openmicroscopy.org.uk/qa/upload/. If you could also send us a problematic sample image (public is best, let us know if not) or otherwise make it available to us, then we can try it on a test server at OME. I would be surprised if indexer activity prevented login; that may just be a separate issue.
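
For example, something along these lines during a hang (just a sketch; the PID and the grep pattern are placeholders to adapt to your own output):

# find the PIDs of the running server processes
omero admin diagnostics | grep -i pid

# see which processes are busy right now
top -b -n 1 | head -n 30

# capture a few stack-trace samples of the suspect JVM over time
# (replace 12345 with the PID reported by the diagnostics)
for i in 1 2 3 4 5; do
    jstack 12345 > jstack-$(date +%H%M%S).txt
    sleep 30
done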

Hi Mark,

thanks for the ideas on how to analyse this further. I will try to produce some jstack output next week and get back to you afterwards.

Have a nice weekend!

Hi @mtbc,

we reproduced the issue today to get some more debug output. Unfortunately, the OMERO processes all seem fine to me, and there is no excessive resource usage on the system by the OMERO processes either. But what I have seen now is that our NFS share has been hanging. We store our binary repository and pgsql data on a mounted NFS share, and even in the shell I couldn’t access this share anymore. I’ve seen the hint about proper locking in your documentation, but to be honest I have no clue how I can influence this. Are there specific mount options I need to be aware of (e.g. hard/soft, timeout)? Something that needs to be configured on the NFS server? Any more details about “OMERO best practices” with the binary repository on NFS shares would be much appreciated. In addition I will contact our NFS server admins and check whether our NFS client version might be causing issues.
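
Just so I’m asking about the right thing: a mount entry like the one below is roughly what I have in mind (the export path, version and option values are only placeholders, not our actual configuration):

# /etc/fstab - illustrative NFS mount for the binary repository
# "hard" makes the client retry indefinitely instead of returning I/O errors;
# timeo (in tenths of a second) and retrans control the retry behaviour
nfsserver:/export/omero  /OMERO  nfs  rw,hard,vers=4.1,timeo=600,retrans=2,noatime  0  0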

Thanks in advance
Anna

Hi Anna,

Did you manage to get any stack traces during the event yesterday? Seeing those might help us understand what’s blocking. Additionally, could you upload the Blitz-0.log for the time period, the import log (from under /OMERO/ManagedRepository), and master.err and master.out if they contain anything?

If OMERO starts and permits any import to proceed, then I wouldn’t be immediately concerned by the fact that it’s on NFS. Obviously, if it’s the copying of the data to the NFS which is hanging, then we will need to re-review that.

Having PostgreSQL on NFS, however, isn’t something I’ve ever been comfortable doing. It might be worth looking in the PostgreSQL log file to see if there’s anything related.

It’s unlikely that the database will grow significantly soon so your best course of action may be to move it to a local filesystem if you have the space available.
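
The move itself is mostly a matter of stopping everything, copying the data directory, and pointing PostgreSQL at the new location. Very roughly, with the paths and service name only as placeholders for your setup (and please take a backup first):

# stop OMERO and PostgreSQL before touching the data directory
omero admin stop
systemctl stop postgresql

# copy the data directory from the NFS mount to a local disk, preserving permissions
rsync -a /OMERO-nfs/pgdata/ /var/lib/pgsql/data/

# update data_directory in postgresql.conf (or the service's PGDATA) to the new path,
# then bring everything back up
systemctl start postgresql
omero admin start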

:+1:

~Josh

Hi Josh,

I have a log & jstack package for you: omero_logs_20200331.zip (373.8 KB). The logs from ManagedRepository might indeed be interesting. I didn’t touch the system to fix the incident, so the logs should show the self-recovery. The system was unavailable this time from 16:17 until 19:37, during which the Blitz-0 log is basically empty. That’s what we already observed before. During the test import the first three images succeeded, but then the process stopped at that point.

Our internal disks have enough space to host the pgsql data dir, so I will move it tomorrow and re-test the issue. I will let you know afterwards.

Cheers, Anna

Edit: I haven’t found any error message in the system or database logs so far.

Hi Anna,

Indeed, from ManagedRepository/16-11-08.408.log:

2020-03-31 16:17:47,303 DEBUG [    o.s.blitz.repo.ManagedImportRequestI] (1-thread-2) Plane:399 X:0 Y:0 TileWidth:2048 TileHeight:2048 arrayBuf.length:8388608
2020-03-31 16:17:47,517 DEBUG [    o.s.blitz.repo.ManagedImportRequestI] (1-thread-2) Plane:400 X:0 Y:0 TileWidth:2048 TileHeight:2048 arrayBuf.length:8388608
2020-03-31 19:37:45,038 DEBUG [    o.s.blitz.repo.ManagedImportRequestI] (1-thread-2) Plane:401 X:0 Y:0 TileWidth:2048 TileHeight:2048 arrayBuf.length:8388608
2020-03-31 19:37:45,771 DEBUG [    o.s.blitz.repo.ManagedImportRequestI] (1-thread-2) Plane:402 X:0 Y:0 TileWidth:2048 TileHeight:2048 arrayBuf.length:8388608

Various IO “stuff” proceeds, then hangs for over 3 hours, then continues. This likely leads to your sessions timing out, which may in turn have caused some of the downstream exceptions.

From Blitz-0.after:

"Blitz-0-Ice.ThreadPool.Server-403" #63750 prio=5 os_prio=0 tid=0x00007fb8c0008080 nid=0x304c waiting for monitor entry [0x00007fb85ccbc000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at ch.qos.logback.core.AppenderBase.doAppend(AppenderBase.java:64)
        - waiting to lock <0x00000000b19a1cf0> (a ome.util.logback.SiftingWhileRejectingDefaultAppender)
        at ch.qos.logback.core.spi.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:48)
        at ch.qos.logback.classic.Logger.appendLoopOnAppenders(Logger.java:273)

makes me worry that your logging was hanging as well. Are the logs on NFS, too?
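
A quick way to check is to ask which filesystem actually backs the log directory, e.g. (the path here is just an example; use your own OMERODIR):

# show the mount point and filesystem type behind the log directory
findmnt -T /opt/omero/server/OMERO.server/var/log
# or
df -hT /opt/omero/server/OMERO.server/var/log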

There are also some threads waiting on PostgreSQL:

"Blitz-0-Ice.ThreadPool.Server-433" #63783 prio=5 os_prio=0 tid=0x00007fb8d00d6bd0 nid=0x3198 waiting for monitor entry [0x00007fb85ada5000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at bitronix.tm.resource.common.XAPool.getConnectionHandle(XAPool.java:161)
        - waiting to lock <0x00000000b199cf48> (a java.lang.Object)
        at bitronix.tm.resource.common.XAPool.getConnectionHandle(XAPool.java:150)
        at bitronix.tm.resource.jdbc.PoolingDataSource.getConnection(PoolingDataSource.java:

but it certainly doesn’t seem to be the only thing hanging.

If you can, I’d suggest also getting your logs off of NFS if that’s the case so that we can perhaps track more of what’s going on, but it does look like there may be significant issues with your NFS mount.

Let us know how it goes.
~Josh

Hi Josh,

the OMERO internal logs such as Blitz-0.log have always been on internal disks, so the gap in log messages during the incident cannot be caused by connectivity issues. I moved the pgsql database to an internal disk yesterday and could still reproduce the incident (downtime roughly from 18:16 until 21:56). At least this helped to produce some new error messages: omero_logs_20200402.zip (312.1 KB)

Since the system itself now produced the following messages, I’m back to suspecting resource issues (IO settings, memory/swap capacity):

Apr  2 18:28:37 localhost kernel: INFO: task java:12032 blocked for more than 120 seconds.
Apr  2 18:28:37 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr  2 18:28:37 localhost kernel: java            D ffffffff816a76b0     0 12032   3846 0x00000080
Apr  2 18:28:37 localhost kernel: ffff880191af39b0 0000000000000082 ffff8800b36e5ee0 ffff880191af3fd8
Apr  2 18:28:37 localhost kernel: ffff880191af3fd8 ffff880191af3fd8 ffff8800b36e5ee0 ffff88023fc96cc0
Apr  2 18:28:37 localhost kernel: 0000000000000000 7fffffffffffffff ffff88023ff98488 ffffffff816a76b0
Apr  2 18:28:37 localhost kernel: Call Trace:
Apr  2 18:28:37 localhost kernel: [<ffffffff816a76b0>] ? bit_wait+0x50/0x50
Apr  2 18:28:37 localhost kernel: [<ffffffff816a9589>] schedule+0x29/0x70
Apr  2 18:28:37 localhost kernel: [<ffffffff816a7099>] schedule_timeout+0x239/0x2c0
Apr  2 18:28:37 localhost kernel: [<ffffffff810ce55e>] ? dequeue_task_fair+0x41e/0x660
Apr  2 18:28:37 localhost kernel: [<ffffffff810c7705>] ? sched_clock_cpu+0x85/0xc0
Apr  2 18:28:37 localhost kernel: [<ffffffff816a76b0>] ? bit_wait+0x50/0x50
Apr  2 18:28:37 localhost kernel: [<ffffffff816a8c0d>] io_schedule_timeout+0xad/0x130
Apr  2 18:28:37 localhost kernel: [<ffffffff816a8ca8>] io_schedule+0x18/0x20
Apr  2 18:28:37 localhost kernel: [<ffffffff816a76c1>] bit_wait_io+0x11/0x50
Apr  2 18:28:37 localhost kernel: [<ffffffff816a71e5>] __wait_on_bit+0x65/0x90
Apr  2 18:28:37 localhost kernel: [<ffffffff816a76b0>] ? bit_wait+0x50/0x50
Apr  2 18:28:37 localhost kernel: [<ffffffff816a7291>] out_of_line_wait_on_bit+0x81/0xb0
Apr  2 18:28:37 localhost kernel: [<ffffffff810b19e0>] ? wake_bit_function+0x40/0x40
Apr  2 18:28:37 localhost kernel: [<ffffffffc05a6db3>] nfs_wait_on_request+0x33/0x40 [nfs]
Apr  2 18:28:37 localhost kernel: [<ffffffffc05abe93>] nfs_updatepage+0x153/0x8d0 [nfs]
Apr  2 18:28:37 localhost kernel: [<ffffffffc059b381>] nfs_write_end+0x141/0x350 [nfs]
Apr  2 18:28:37 localhost kernel: [<ffffffff81182549>] generic_file_buffered_write+0x189/0x2a0
Apr  2 18:28:37 localhost kernel: [<ffffffff81184902>] __generic_file_aio_write+0x1e2/0x400
Apr  2 18:28:37 localhost kernel: [<ffffffff81184b79>] generic_file_aio_write+0x59/0xa0
Apr  2 18:28:37 localhost kernel: [<ffffffffc059a7eb>] nfs_file_write+0xbb/0x1e0 [nfs]
Apr  2 18:28:37 localhost kernel: [<ffffffff812001ed>] do_sync_write+0x8d/0xd0
Apr  2 18:28:37 localhost kernel: [<ffffffff81200cad>] vfs_write+0xbd/0x1e0
Apr  2 18:28:37 localhost kernel: [<ffffffff81201abf>] SyS_write+0x7f/0xe0
Apr  2 18:28:37 localhost kernel: [<ffffffff816b5089>] system_call_fastpath+0x16/0x1b

But I haven’t yet been able to look at the NFS client & server in detail, so that might still provide some new findings. In any case I will perform further tests during the next two weeks and let you know if I can solve it.
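
On the client side I plan to start with the standard nfs-utils tooling, something like the following (nothing OMERO-specific, just what I intend to run):

# list the active NFS mounts and their options
mount -t nfs,nfs4

# per-operation client statistics (retransmissions, timeouts)
nfsstat -c

# per-mount I/O latency statistics, sampled every 5 seconds
nfsiostat 5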

Cheers, Anna

Understood, but then all the more unusual that the outage would be so complete. Is this a virtual machine?

:thumbsup: Keep us posted.
~Josh

Hi Josh,

yes, it is a virtual machine. During the event the memory is not fully used, but a large amount of the swap is. I was wondering whether tuning the following parameters might already improve the situation, rather than simply increasing RAM & swap:

# sysctl -a|egrep '^vm.(dirty|swap|over)'
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 30
vm.dirty_writeback_centisecs = 500
vm.overcommit_kbytes = 0
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
vm.swappiness = 30

Do you have any recommendations for running OMERO on VMs?

In any case I will try different settings for the above parameters, and probably also increase RAM and swap, to see whether the issue can still be reproduced.
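
Just to illustrate the kind of change I have in mind (these values are only a first guess on my side, not a recommendation from anyone):

# write dirty pages out earlier and in smaller batches to avoid long writeback stalls
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10

# prefer keeping application pages in RAM over swapping them out
sysctl -w vm.swappiness=10

# persist the settings across reboots
cat > /etc/sysctl.d/90-vm-tuning.conf <<'EOF'
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
vm.swappiness = 10
EOF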

Have a nice Easter weekend!
Anna

Sounds good. We don’t have any immediate ideas, but it sounds like something pretty substantial is happening for a period of time. It may be strictly at the storage layer rather than the VM itself, but local logs with a three-hour gap are certainly surprising.

Done and done. :smile: Hope yours was relaxing as well!
~Josh