Omero.matlab disconnecting upon synchronising events on server

Hello,

I operate an omero server and we process the data using omero.matlab. We currently have the latest 5.5.3 client for matlab and server 5.6.3 (diagnostics below). From time to time, matlab loses the connection to omero. It appears to coincide with events about synchronizing and listing session (I don’t know exactly what it means). I put below an excerpt of the log and attach a longer Blitz log around the time of the event. Is there a way to fix that issue as it crashes the running analyses (That lasts hours up to days).

Thank you

Best

Jacques

------------------------- log excerpts ---------------------------
2021-02-04 17:01:03,465 INFO  [                 org.perf4j.TimingLogger] (Server-723) start[1612454463463] time[2] tag[omero.call.success.ome.services.RawPixelsBean.setPixelsId]
2021-02-04 17:01:03,465 INFO  [        ome.services.util.ServiceHandler] (Server-723)  Rslt:    null
2021-02-04 17:01:03,467 INFO  [        ome.services.util.ServiceHandler] (Server-717)  Meth:    interface ome.api.RawPixelsStore.getPlane
2021-02-04 17:01:03,467 INFO  [        ome.services.util.ServiceHandler] (Server-717)  Args:    [0, 0, 2881]
2021-02-04 17:01:03,469 INFO  [         ome.security.basic.EventHandler] (Server-717)  Auth:    user=1652,group=3,event=null(User),sess=18da3b66-d13e-459a-8a58-1654068814e2
2021-02-04 17:01:03,475 INFO  [                 org.perf4j.TimingLogger] (Server-717) start[1612454463467] time[7] tag[omero.call.success.ome.services.RawPixelsBean.getPlane]
2021-02-04 17:01:03,475 INFO  [        ome.services.util.ServiceHandler] (Server-717)  Rslt:    [1, 118, 1, ... 1 more]
2021-02-04 17:02:37,332 INFO  [ome.services.sessions.state.SessionCache] (1-thread-2) Synchronizing session cache. Count = 9
2021-02-04 17:02:37,332 INFO  [        ome.services.util.ServiceHandler] (1-thread-2)  Executor.doWork -- ome.services.sessions.SessionManagerImpl.reload[ea045082-fc3e-40ea-9726-fd23f69b13d3]
2021-02-04 17:02:37,332 INFO  [        ome.services.util.ServiceHandler] (1-thread-2)  Args:    [null, InternalSF@1857451279]
2021-02-04 17:02:37,336 INFO  [         ome.security.basic.EventHandler] (1-thread-2)  Auth:    user=0,group=0,event=null(Sessions),sess=5fb04c63-3726-4667-afc7-1c37ab7bc720
2021-02-04 17:02:37,343 INFO  [                 org.perf4j.TimingLogger] (1-thread-2) start[1612454557332] time[11] tag[omero.call.success.ome.services.sessions.SessionManagerImpl$6.doWork]
--------------------------------- omero diagnostics ---------------------------------

================================================================================
OMERO Diagnostics (admin) 5.8.0
================================================================================

Commands:   java -version                  11.0.8    (/usr/bin/java)
Commands:   python -V                      3.7.3     (/usr/local/OMERO/OMERO.server-Current/venv3/bin/python)
Commands:   icegridnode --version          3.6.5     (/opt/ice-3.6.5-0.1.0/bin/icegridnode)
Commands:   icegridadmin --version         3.6.5     (/opt/ice-3.6.5-0.1.0/bin/icegridadmin)
Commands:   psql --version                 11.9      (/usr/bin/psql)
Commands:   openssl version                1.1.1     (/usr/bin/openssl)

Server:     icegridnode                    running
Server:     Blitz-0                        active (pid = 9321, enabled)
Server:     DropBox                        inactive (disabled)
Server:     FileServer                     inactive (disabled)
Server:     Indexer-0                      active (pid = 9346, enabled)
Server:     MonitorServer                  inactive (disabled)
Server:     OMERO.Glacier2                 active (pid = 9352, enabled)
Server:     OMERO.IceStorm                 active (pid = 9355, enabled)
Server:     PixelData-0                    active (pid = 9357, enabled)
Server:     Processor-0                    active (pid = 9367, enabled)
Server:     Tables-0                       active (pid = 9360, enabled)
Server:     TestDropBox                    inactive (enabled)

Log dir:    /usr/local/OMERO/OMERO.server-Current/var/log exists
Log files:  Blitz-0.log                    240.6 MB      errors=0    warnings=468
Log files:  Blitz-0.log.1                  524.3 MB      errors=3    warnings=535
Log files:  Blitz-0.log.2                  524.3 MB      errors=61   warnings=4376
Log files:  Blitz-0.log.3                  524.3 MB      errors=34   warnings=2070
Log files:  Blitz-0.log.4                  524.3 MB      errors=46   warnings=210
Log files:  Blitz-0.log.5                  524.3 MB      errors=18   warnings=141
Log files:  Blitz-0.log.6                  524.3 MB      errors=33   warnings=79
Log files:  Blitz-0.log.7                  524.3 MB      errors=40   warnings=361
Log files:  Blitz-0.log.8                  524.3 MB      errors=55   warnings=183
Log files:  Blitz-0.log.9                  524.3 MB      errors=13   warnings=95
Log files:  DropBox.log                    n/a
Log files:  FileServer.log                 n/a
Log files:  Indexer-0.log                  12.8 MB       errors=0    warnings=10
Log files:  MonitorServer.log              n/a
Log files:  PixelData-0.log                1.1 MB        errors=42   warnings=44
Log files:  Processor-0.log                506.5 KB      errors=1    warnings=17
Log files:  Processor-0.log.1              5.0 MB        errors=3    warnings=197
Log files:  Processor-0.log.2              5.0 MB        errors=0    warnings=207
Log files:  Tables-0.log                   9.6 KB        errors=0    warnings=11
Log files:  TestDropBox.log                n/a
Log files:  master.err                     14.6 KB       errors=0    warnings=2
Log files:  master.out                     empty
Log files:  Total size                     4985.48 MB


Environment:OMERO_HOME=(unset)
Environment:OMERODIR=/usr/local/OMERO/OMERO.server-Current
Environment:OMERO_NODE=(unset)
Environment:OMERO_MASTER=(unset)
Environment:OMERO_USERDIR=(unset)
Environment:OMERO_TMPDIR=(unset)
Environment:PATH=/usr/local/OMERO/OMERO.server-Current/venv3/bin:/opt/ice-3.6.5-0.1.0/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/usr/local/OMERO/OMERO.server-Current/lib/omero-csv-annotation-tool/bin
Environment:PYTHONPATH=(unset)
Environment:ICE_HOME=/opt/ice-3.6.5-0.1.0
Environment:LD_LIBRARY_PATH=(unset)
Environment:DYLD_LIBRARY_PATH=(unset)

OMERO SSL port:4064
OMERO TCP port:4063
OMERO data dir:'/OMERO'                       Exists? True      Is writable? True
OMERO temp dir:'/home/cedre-agent/omero/tmp'  Exists? True      Is writable? True   (Size: 4622109657)

JVM settings: Blitz-${index}                -Xmx20000m -XX:MaxPermSize=1g -XX:+IgnoreUnrecognizedVMOptions
JVM settings: Indexer-${index}              -Xmx10000m -XX:MaxPermSize=1g -XX:+IgnoreUnrecognizedVMOptions
JVM settings: PixelData-${index}            -Xmx20000m -XX:MaxPermSize=1g -XX:+IgnoreUnrecognizedVMOptions
JVM settings: Repository-${index}           -Xmx5000m -XX:MaxPermSize=1g -XX:+IgnoreUnrecognizedVMOptions

Jar:        lib/server/formats-api.jar     Bio-Formats API      6.5.1   7 July 2020     6f50e4d52c9d96112635fd8b2dde737f31041cf0
Jar:        lib/server/formats-bsd.jar     BSD Bio-Formats readers and writers  6.5.1   7 July 2020     6f50e4d52c9d96112635fd8b2dde737f31041cf0
Jar:        lib/server/formats-gpl.jar     Bio-Formats library  6.5.1   7 July 2020     6f50e4d52c9d96112635fd8b2dde737f31041cf0
Jar:        lib/server/ome-codecs.jar      OME Codecs   0.3.0   3 March 2020    8287f33d3bff3d08a9925fa45e0b225e27d71fe0
Jar:        lib/server/ome-common.jar      OME Common Java      6.0.4   21 October 2019 650939cfa026e33d2f38c1f56c74715dfb44b974
Jar:        lib/server/ome-jai.jar         OME JAI      0.1.0   28 June 2017    96ed34bf59f5ba3b51e3d60e342f30962a46c292
Jar:        lib/server/ome-mdbtools.jar    MDB Tools (Java port)        5.3.2   7 August 2018   68ffca2a95750cbb96f0d11785851aa4a554c0a4
Jar:        lib/server/ome-poi.jar         OME POI      5.3.3   6 August 2018   3887f4b2b21e195fa76ec4378858f6278aed5dcd
Jar:        lib/server/ome-xml.jar         OME XML library      6.1.0   7 April 2020    8f490f5987062978fafe77a3e35f7bd24d211281
Jar:        lib/server/omero-blitz.jar     jar  5.5.7
Jar:        lib/server/omero-common.jar    jar  5.5.6
Jar:        lib/server/omero-gateway.jar   jar  5.6.4
Jar:        lib/server/omero-model.jar     jar  5.6.1
Jar:        lib/server/omero-renderer.jar  jar  5.5.6
Jar:        lib/server/omero-romio.jar     jar  5.6.1
Jar:        lib/server/omero-server.jar    jar  5.6.0

Blitz-0.log_2021-02-04_17-01-*.log.zip (5.8 KB) Blitz-0.log_2021-02-04_15-48-*.log.zip (35.0 KB)

Hi @jacques2020,

I’ll take a look a look through the logs but I suspect the issue may be due to “That lasts hours up to days”. Is there any correlation between how long one runs and whether it crashes? When one crashes do all crash? Can you share the code you’re using?

All the best,
~Josh

Hi Josh,
the code is long and complex. Indeed the longer the code runs, the higher the chance to get the issue but nothing systematic. The same code with the same parameters does not stop always at the same place, and may or may not disconnect. Nothing really reproductible.
Thank you for your help

Best regards

Jacques

Hi Jacques,

perhaps then just the connection & keep alive code?

I didn’t see anything in the Blitz files you uploaded. Do you see anything relevant in the master.out or master.err files? Also, any errors in the postgres logs?

Thanks for the info,
~Josh

Hi Josh,

thank you for your answer. I did a kind of minimal code planned to run against OMERO.matlab. I added only the m file I modified. There is indeed a reconnect workaround to diminish the issue as far as possible.
Thank you for your help
Best
Jacques
exemple_code.zip (28.7 KB)

Hi Josh,

did the log and exemple code evoked some possible improvements to avoid omero disconnecting ?
Thank you for your help.
Best
Jacques

Hi Jacques
I ran the code but did not have any disconnect. I probably did not run for long enough. I have to vpn in in order to run matlab so it is not ideal.

I will check again to see if I can find anything

Cheers

Jmarie

Hi Jean-Marie,
Thank you for investigating. I updated the server and computer running it on Wednesday and will let you know if I got the issue again. My best clue is the message

in the log that occurs at the same time than the disconnecting. But I don’t know what it means. Could it be that this sync of cache would be ressource intensive and hang the server for a few tens of seconds ? I put a jconsole to see the state of blitz because I failed to configure jmx to connect to our Munin or even jconsole (If you have some more docs on top of the ten lines in the grid configuration page, I am interested). Anyway, I will keep you posted if I have the issue again. Do you see other monitoring/debugging steps that would be sensible. Note that usually, we are about 10 people to connect with multiple connections for each of us.
Thank you
Best

Hi,

I am somewhat progressing. I still got the issue and found nothing on server side. The jconsole looks normal (Attached). But I noticed that preceding the issue, matlab is unresponsive doing a computation and miss a keep alive time. 11 minutes elapse between the “correct” keep alive and the one detecting the disconnecting. It seems not related to garbage collection in java, which time remain small. It happens sometimes for some function and is likely not a worry if omero were not disconnecting. Thus, is there a way to increase the delay before the server is disconnected ? I already put omero.sessions.timeout to 3600000 but it does not make the trick. Would you have some suggestions ?
Thank you for your help
Best
Jacques



Hi Jacques

Could it be that your code created too many sessions?
This is maybe something to check
Cheers

Jmarie

Hi Jean-Marie,
I checked for that but we have a very moderate number. Currently, there is 9 (if I count the one listed by omero sessions who logging in as root.

However, in the meantimes, I can I make omero requests keep alive signal less often than every minute ?
Best
Jacques

Hi Jacques

You pass a value to the keepAlive method to indicate how often you want the check to happen. The default is 60 seconds, passing 120 for example will ping every 2 minute.

Cheers

Jmarie

Hi Jean-Marie,
Two points:

  1. I guess I am confused about the keepAlive. I understand that I can change how often I test the connection from client-side. But from server-side, how can I set the time before the server close unilaterally the connection? is it the omero.sessions.timeout ? But then I don’t understand why the connection is interrupted after 11 min without keep alive.

  2. reviewing my code once again, I realise that we do maybe something bad. I used to read that non secured session was faster for transactions with the server. Because the computing system and the server are on the same rack, we opted for non secured. But I found no way to create a non-secured session straight from the beginning. In contrast, we classically create a secure session and then create the non-secure on top of it using

    client_ = secure_client_.createClient(false); 
    session_ = client_.getSession();  

And the keep-alive is done on client_ not on secure_client_. But then, I don’t understand why it ran smoothly for years and only disconnected in the last months…
I am anyway retrying with sending keepAlive on both clients. I will keep you posted.

Thank you for your help
Best
Jacques

Hi,
I did the test and it improved the situation. After 24h running, about 60% of the connection timed out while previously a couple of hours was sufficient for 100% disconnecting. Still, there is the puzzling disconnecting. Is there a way to get the server more verbose on the disconnecting reasons ?
Thank you for your help
Best
Jacques