PSQLException when uploading large datasets with the OMERO CLI

Dear all,

We are currently trying to upload several datasets using the OMERO CLI from our cluster. The datasets all use the PerkinElmer Operetta format and range from 2 GB (1.5k files) to 80 GB (50k files). The upload works fine for smaller datasets, but for large datasets we receive an ‘org.postgresql.util.PSQLException: This connection has been closed’ error at the end of the upload (the full error log is attached).

2020-08-13 00:33:18,606 43731592   [l.Client-6] INFO   ormats.importer.cli.LoggingImportMonitor - METADATA_PROCESSED Step: 4 of 5  Logfile: 104767
2020-08-13 00:33:18,653 43731639   [l.Client-5] INFO   ormats.importer.cli.LoggingImportMonitor - OBJECTS_RETURNED Step: 5 of 5  Logfile: 104767
2020-08-13 00:48:53,162 44666148   [l.Client-6] ERROR     ome.formats.importer.cli.ErrorHandler - INTERNAL_EXCEPTION: /gpfs/scratch/bknorlei/spheroids/data2/spheroids/Index.idx.xml
java.lang.RuntimeException: Failure response on import!
Category: ::omero::grid::ImportRequest
Name: run-fail
Parameters: {stacktrace=ome.conditions.InternalException:  Wrapped Exception: (org.springframework.transaction.TransactionSystemException):
Could not roll back JDBC transaction; nested exception is org.postgresql.util.PSQLException: This connection has been closed.

The data seems to have been uploaded correctly and is visible in OMERO, but we are concerned that some steps are not properly executed during the upload.

Are there any timeouts in OMERO that can be modified, or mechanisms to keep the connection open? Any help or other ideas about the source of the problem would be greatly appreciated.
Kind regards and thanks!
Benjamin

omero_upload_error.txt (9.0 KB)

Hi @Benjamin_Knorlein,

Odd. The CLI log shows that the import progressed through the most onerous stages. I’m surprised the connection was closed at this point. This happens for all the larger datasets?

Could you also include the log from:

omero download 104767 -

It may also be necessary to take a look at the PostgreSQL log as well as the Blitz log to figure out who closed the connection and why.
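
For example, something along these lines on the OMERO server (just a sketch; var/log is relative to your OMERO.server directory and the exact message may differ):

# search the Blitz log around the time of the failure
grep -n -B 2 -A 10 "connection has been closed" var/log/Blitz-0.log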

Hi @joshmoore,
Thanks for the fast reply. Yes, it only happens for large datasets.

I tried to download the log, but the file is no longer present because I deleted the dataset in OMERO. I will upload the data again and attach the new logs, as well as the PostgreSQL and Blitz logs from the server, tomorrow morning.
Thanks,
Ben


And yes, it happened for all the larger datasets I tried. The data was acquired with the same instrument as the smaller datasets.

Hi @joshmoore,

I uploaded the following files in a zip here:
import_log.txt - The output from the import with verbose set to debug
download_log.txt - The output from omero download 267088 -
Blitz-0.log
master.err

The timestamps in the import_log and the other files differ because the machines' clocks are 4 hours apart. The error occurred at around 2:29 in the import_log; in the other files it is at around 6:29.

I did not find an SQL log file on the OMERO server machine. Which log file exactly are you looking for?

Thanks and best regards,
Ben

I looked through the other files and unfortunately I haven’t yet seen any clear reason for the disconnection.

This would be on the Postgres server (prschpgdbcit.services.brown.edu) and likely under the pg_log directory, perhaps something like /var/lib/pgsql/9.6/data/pg_log. The query

select format('%s/%s', 
    current_setting('data_directory'),
    current_setting('log_directory'));

will tell you where it is.
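
For instance, run non-interactively from a shell (the host and user here are just examples for your setup):

psql -h prschpgdbcit.services.brown.edu -U postgres -At \
  -c "select format('%s/%s', current_setting('data_directory'), current_setting('log_directory'));"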

~Josh

Depending on your platform it might also be worth taking a look somewhere like /var/log/postgresql/, or trying find /var/log/ -name 'postgresql*.log'.

The only entries in our log file were:

2020-09-02 21:31:00.100 EDT [66699] LOG:  could not receive data from client: Connection reset by peer
2020-09-02 21:31:00.100 EDT [66699] LOG:  unexpected EOF on client connection with an open transaction

At the moment I have no idea what the issue could be. Are there any settings in OMERO I can tune, e.g. timeouts or memory? Any advice would be highly appreciated.

PS: The PostgreSQL log is in EDT while the OMERO logs are in UTC.

Best,
Ben

Goodness, I’m baffled. How is the connection between the OMERO server and the database server set up? There’s nothing interesting about that connection below the OMERO/SQL level, right?

As far as I know there is nothing special. We are using the omero-server Docker image with slight modifications and a PostgreSQL 12 database. The container runs on a VM. I will reach out to the database team to see if there is anything special, but I doubt it.

Hi @joshmoore and @mtbc

We found some Spring parameters we would like to test to see if they solve the problem. Is there an easy way to set these parameters for the OMERO database connection? E.g.:

 spring.datasource.testOnBorrow=true
 spring.datasource.validationQuery=SELECT 1

Thanks,
Ben

Hi @Benjamin_Knorlein,

Unfortunately it looks like the datasource properties aren’t all configurable via omero config set. See:

SELECT 1 should be used already, but I’d need to check the debug output to see whether testOnBorrow happens to be set.

It’s possible to modify that file in your omero-server.jar, but I don’t have a recipe for doing that to hand, and it would probably be best done on a test server. I’ll assume you want to proceed and try to come up with something.

All the best,
~Josh

Thanks. I will have a look at it.
We are still on our test server and have not gone into production, so if there is anything you think might be worth testing, let me know. Being able to transfer large datasets from the Opera Phenix is crucial for rolling out the service in our case.

Thanks and all the best,
Ben

Hi @Benjamin_Knorlein,

I managed to play around with this some. The recipe for modifying the file above is:

cp lib/server/omero-server.jar omero-server.bak
unzip -d server-jar lib/server/omero-server.jar ome/services/datalayer.xml
# now edit server-jar/ome/services/datalayer.xml

Then for every edit that you want to test repeat this:

(cd server-jar; zip ../lib/server/omero-server.jar ome/services/datalayer.xml)
OMERODIR=$PWD ../venv3/bin/omero admin ice server stop Blitz-0

Looking at the properties you listed, I think they are for a different connection pool. Some options you might try include:

  • “tcpKeepAlive”, which sounds like it may do what you need
  • “loggerLevel”, which dumps a lot of output (surprisingly, to var/log/master.err)

More are available under https://jdbc.postgresql.org/documentation/head/connect.html

  <bean id="psqlDataSourceProperties" class="org.springframework.beans.factory.config.PropertiesFactoryBean">
    <property name="properties">
      <props>
        <prop key="driverClassName">${omero.db.driver}</prop>
        <!-- next line was modified -->
        <prop key="url">jdbc:postgresql://${omero.db.host}:${omero.db.port}/${omero.db.name}?tcpKeepAlive=true&amp;loggerLevel=TRACE</prop> 
        <prop key="user">${omero.db.user}</prop>
        <prop key="password">${omero.db.pass}</prop>
      </props>
    </property>
  </bean>
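
After restarting Blitz, one quick sanity check (just a sketch) is to watch var/log/master.err for the extra driver output mentioned above:

tail -f var/log/master.err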

Hope that helps,
~Josh

Great. Thanks Josh.

I also tried another upload with more logging on the PostgreSQL server, and some things puzzle me a little.

It seems to me that everything transfers fine, so I do not know why the error occurs. If you look at the logs, all the metadata is inserted until around 21:55, which is when the import log reports METADATA_IMPORTED Step: 1 of 5. After that it seems to be idle until the error occurs on the PostgreSQL server at 23:42, which is directly followed by some other commands:

[119416] [5f5abe00.1d278] [2020-09-10 23:42:12.612 EDT] [4187796]: LOG:  could not receive data from client: Connection reset by peer
[119416] [5f5abe00.1d278] [2020-09-10 23:42:12.612 EDT] [4187796]: LOG:  unexpected EOF on client connection with an open transaction
[96420] [5f5a89f8.178a4] [2020-09-10 23:42:30.004 EDT] [0]: LOG:  execute S_22: DELETE FROM _current_admin_privileges WHERE transaction = $1
[96420] [5f5a89f8.178a4] [2020-09-10 23:42:30.004 EDT] [0]: DETAIL:  parameters: $1 = '4187799'
[96420] [5f5a89f8.178a4] [2020-09-10 23:42:30.004 EDT] [4187808]: LOG:  execute S_22: DELETE FROM _current_admin_privileges WHERE transaction = $1
[96420] [5f5a89f8.178a4] [2020-09-10 23:42:30.004 EDT] [4187808]: DETAIL:  parameters: $1 = '4187798'

These commands seem to be related to the cleanup from LightAdminPrivilegesCleanup, but I am not 100% sure. After that it seems to be idle again until about 02:35, when it starts to write the pixel data. The exception itself is then reported by the import job at 03:09, but nothing on the PostgreSQL server side shows a problem. Is anything still supposed to happen at that point, or is it only reporting the earlier error at the end of the import during cleanup?

I will play around with the settings and the instructions you sent, but I was wondering whether you have an idea why the exception is only reported at the end, and whether you think it even matters. If it does not matter, is there a way to check that the data was transmitted correctly, and a way to suppress the error message?

You can find the detailed logs here. For the SQL logs from Thursday I only copied the end of the log, as the full one was huge. If you want the full log, you can download it here.

Thanks again so much for your help.
All the best,
Ben

Hi @joshmoore
Good news: it seems the tcpKeepAlive option worked. I am going to run more tests, and if no problems come up I will mark the issue as solved. Do you think there is a chance that future releases could make this option configurable via OMERO parameters? Right now we can replace the datalayer.xml, but that will be harder to maintain across upgrades.

Thanks so much again for your help.
All the best,
Ben

Whew, very glad to hear it. I didn’t have any great next steps.

Definitely doable. It may be a couple of weeks until a release though. I have some small hope that one can embed the value into the omero.db.name. I’ll give it a try.

Thanks for letting us know. ~J.

EDIT: indeed, the following seems to do the trick:

$ omero config get omero.db.name
omero?tcpKeepAlive=true&loggerLevel=TRACE

though you’ll likely need to update this with the next release. I imagine the proper fix will be to have something like:

omero.db.url=jdbc:postgresql://${omero.db.host}:${omero.db.port}/${omero.db.name}?${omero.db.properties}

and then you could set:

omero.db.properties=tcpKeepAlive=true&loggerLevel=TRACE

(i.e. no ? at the beginning).
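
In other words, something like this (a sketch of the two forms; the second only works if such a key is added in a future release):

# current workaround: append the JDBC properties to the database name
omero config set omero.db.name 'omero?tcpKeepAlive=true&loggerLevel=TRACE'

# hypothetical future form, if an omero.db.properties key is introduced as above
omero config set omero.db.properties 'tcpKeepAlive=true&loggerLevel=TRACE'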

Hi @joshmoore,

Some bad news: I removed the logging from the connection options and received the old error. So I checked the logs from the run that I thought had passed, and it turns out it hit another exception but did not report it as an error, which is why I missed it. In that case it did not even start the metadata upload, but threw the error shortly after FILESET_UPLOAD_END.

2020-09-12 03:13:09,509 41760494   [1-thread-1] DEBUG  me.formats.importer.util.ClientKeepAlive - pinging
2020-09-12 03:13:09,511 41760496   [1-thread-1] DEBUG      ome.formats.OMEROMetadataStoreClient - KeepAlive ping.
2020-09-12 03:13:23,837 41774822   [2-thread-1] INFO   ormats.importer.cli.LoggingImportMonitor - FILESET_UPLOAD_END
2020-09-12 03:18:09,511 42060496   [1-thread-1] DEBUG  me.formats.importer.util.ClientKeepAlive - pinging
2020-09-12 03:21:21,755 42252740   [1-thread-1] DEBUG      ome.formats.OMEROMetadataStoreClient - KeepAlive failed.
2020-09-12 03:21:21,757 42252742   [1-thread-1] ERROR  me.formats.importer.util.ClientKeepAlive - Exception while executing ping(), logging Connector out: 
java.lang.RuntimeException: Ice.TimeoutException
	at ome.formats.OMEROMetadataStoreClient.ping(OMEROMetadataStoreClient.java:845) ~[omero-blitz.jar:5.5.7]
	at ome.formats.importer.util.ClientKeepAlive.run(ClientKeepAlive.java:77) ~[omero-blitz.jar:5.5.7]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_111]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_111]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_111]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_111]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_111]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_111]
	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_111]
Caused by: Ice.TimeoutException: null
	at IceInternal.AsyncResultI.__wait(AsyncResultI.java:276) ~[ice.jar:na]
	at omero.api.ServiceFactoryPrxHelper.end_keepAllAlive(ServiceFactoryPrxHelper.java:5963) ~[omero-blitz.jar:5.5.7]
	at omero.api.ServiceFactoryPrxHelper.keepAllAlive(ServiceFactoryPrxHelper.java:5846) ~[omero-blitz.jar:5.5.7]
	at omero.api.ServiceFactoryPrxHelper.keepAllAlive(ServiceFactoryPrxHelper.java:5833) ~[omero-blitz.jar:5.5.7]
	at ome.formats.OMEROMetadataStoreClient.ping(OMEROMetadataStoreClient.java:838) ~[omero-blitz.jar:5.5.7]
	... 8 common frames omitted
2020-09-12 03:21:21,757 42252742   [1-thread-1] DEBUG        ome.formats.importer.ImportLibrary - Initializing log out
2020-09-12 03:21:21,761 42252746   [1-thread-1] DEBUG        ome.formats.importer.ImportLibrary - Shutting down fileset thread pool.
2020-09-12 03:21:21,762 42252747   [1-thread-1] DEBUG        ome.formats.importer.ImportLibrary - Shutting down upload thread pool.
2020-09-12 03:21:21,762 42252747   [1-thread-1] DEBUG      ome.formats.OMEROMetadataStoreClient - Logout called, shutting keep alive down.
2020-09-12 03:21:21,763 42252748   [1-thread-1] DEBUG      ome.formats.OMEROMetadataStoreClient - keepalive shut down.
2020-09-12 03:21:21,821 42252806   [1-thread-1] DEBUG      ome.formats.OMEROMetadataStoreClient - Connection lost; cannot close 299ec60b-29e5-48a5-9eb8-12185c348f7a/09c5f6c5-7610-4020-89d0-5018e8e30c8comero.api.RawFileStore -t -e 1.1:tcp -h 192.168.208.3 -p 38688 -t 60000
2020-09-12 03:21:21,822 42252807   [2-thread-1] ERROR        ome.formats.importer.ImportLibrary - Error on import
Ice.ConnectionLostException: java.io.IOException: Connection reset by peer
	at IceInternal.AsyncResultI.__wait(AsyncResultI.java:276) ~[ice.jar:na]
	at Ice.ObjectPrxHelperBase.__end(ObjectPrxHelperBase.java:2646) ~[ice.jar:na]
	at omero.cmd.HandlePrxHelper.end_close(HandlePrxHelper.java:441) ~[omero-blitz.jar:5.5.7]
	at omero.cmd.HandlePrxHelper.close(HandlePrxHelper.java:346) ~[omero-blitz.jar:5.5.7]
	at omero.cmd.HandlePrxHelper.close(HandlePrxHelper.java:336) ~[omero-blitz.jar:5.5.7]
	at ome.formats.importer.ImportLibrary.importImage(ImportLibrary.java:708) ~[omero-blitz.jar:5.5.7]
	at ome.formats.importer.ImportLibrary$1.call(ImportLibrary.java:354) [omero-blitz.jar:5.5.7]
	at ome.formats.importer.ImportLibrary$1.call(ImportLibrary.java:328) [omero-blitz.jar:5.5.7]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_111]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_111]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_111]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_111]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_111]
	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_111]
Caused by: java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.8.0_111]
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[na:1.8.0_111]
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[na:1.8.0_111]
	at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[na:1.8.0_111]
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[na:1.8.0_111]
	at IceInternal.StreamSocket.read(StreamSocket.java:189) ~[ice.jar:na]
	at IceSSL.TransceiverI.read(TransceiverI.java:179) ~[ice.jar:na]
	at Ice.ConnectionI.read(ConnectionI.java:2825) ~[ice.jar:na]
	at Ice.ConnectionI.message(ConnectionI.java:844) ~[ice.jar:na]
	at IceInternal.ThreadPool.run(ThreadPool.java:395) ~[ice.jar:na]
	at IceInternal.ThreadPool.access$300(ThreadPool.java:12) ~[ice.jar:na]
	at IceInternal.ThreadPool$EventHandlerThread.run(ThreadPool.java:832) ~[ice.jar:na]
	... 1 common frames omitted
2020-09-12 03:21:21,822 42252807   [2-thread-1] INFO         ome.formats.importer.ImportLibrary - Exiting on error
2020-09-12 03:21:21,837 42252822   [1-thread-1] DEBUG      ome.formats.OMEROMetadataStoreClient - Connection lost; cannot close 299ec60b-29e5-48a5-9eb8-12185c348f7a/a3ad3e8f-d7d7-4e0a-83f8-040c25e26495omero.api.RawPixelsStore -t -e 1.1:tcp -h 192.168.208.3 -p 38688 -t 60000
2020-09-12 03:21:21,856 42252841   [1-thread-1] DEBUG      ome.formats.OMEROMetadataStoreClient - Connection lost; cannot close 299ec60b-29e5-48a5-9eb8-12185c348f7a/3d8aab51-26e2-46bf-ac82-07bf13f5c5c3omero.api.ThumbnailStore -t -e 1.1:tcp -h 192.168.208.3 -p 38688 -t 60000
2020-09-12 03:21:21,871 42252856   [1-thread-1] DEBUG      ome.formats.OMEROMetadataStoreClient - Connection lost; cannot close 299ec60b-29e5-48a5-9eb8-12185c348f7a/068a7918-7351-4cf5-bffe-38a4e7b7667aomero.api.MetadataStore -t -e 1.1:tcp -h 192.168.208.3 -p 38688 -t 60000
2020-09-12 03:21:21,872 42252857   [1-thread-1] DEBUG      ome.formats.OMEROMetadataStoreClient - closing client session.

==> Summary
55203 files uploaded, 0 filesets created, 0 images imported, 0 errors in 11:42:00.381
2020-09-12 03:21:21,993 42252978   [      main] DEBUG      ome.formats.OMEROMetadataStoreClient - closing client session.
2020-09-12 03:21:21,993 42252978   [      main] DEBUG      ome.formats.OMEROMetadataStoreClient - client closed.
Using session for bknorlei@pricaimcit.services.brown.edu:4064. Idle timeout: 2880 min. Current group: default

I will try again with more logging and will also try to import the data directly from the OMERO server, to make sure it is not an issue with importing from the cluster. Any other ideas I can try?

Best,
Ben

Ah, sorry to hear it.

The fact that it did not register as an error is certainly worrisome.

If it failed earlier, that makes me worry about the impact of tcpKeepAlive, which means we might need to hold off on https://github.com/ome/omero-model/pull/60. (If it had failed later, I would have suspected that we’d progressed from one timeout to the next.)

The only thing I can think of currently is to try to figure out precisely what the server is doing during the long waits. The two best methods of doing that are:

jstack $(omero admin ice server pid Blitz-0)

as well as the output of

select * from pg_stat_activity;

when connected to postgres.
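
For the long quiet periods, something along these lines on the OMERO server could capture snapshots over time (a rough sketch; the interval is arbitrary and it assumes omero and jstack are on the PATH):

# take a Blitz thread dump every 5 minutes during the quiet phase
while true; do
  jstack "$(omero admin ice server pid Blitz-0)" > "jstack_$(date +%Y%m%d_%H%M%S).txt"
  sleep 300
done

The pg_stat_activity query can be captured the same way on the database server, e.g. via psql -c.
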
~Josh

Hi @joshmoore,

I made some progress, but am still a little bit confused about everything.

First of all, the failure with the new connection options was not caused by tcpKeepAlive but by the increased logging enabled via loggerLevel=TRACE. I assume the additional debug output slowed things down so much that the Ice connection timed out. So the ability to configure the connection more flexibly might still be interesting, but it did not solve our problem.

Now to the problem at hand: I checked all the log files again and realized that the error listed in the log file from the server (the one you can download with omero download) shows a different exception than the output of omero import. This was already the case for the log I attached in the previous message, but I somehow missed it; I assumed it showed the same information and have no idea why the import output showed a PSQL error. Sorry for not noticing this earlier.

The omero download log shows an error that seems to be related to the post-processing of the images (postProcess_async), which makes much more sense, since all the data and metadata appear to have been uploaded to the server correctly. (omero_postprocess_exception.txt (23.7 KB))

I therefore tested the upload again with all additional steps disabled via ‘--skip all’, and the upload succeeded. I assume the problem is related to thumbnail generation and will test whether it still works when disabling only the thumbnails with ‘--skip thumbnails’.
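
For reference, these are roughly the two invocations being compared (the path is a placeholder for the Operetta Index.idx.xml):

# all optional post-import steps disabled - this succeeded
omero import --skip all /path/to/spheroids/Index.idx.xml

# next test: skip only thumbnail generation
omero import --skip thumbnails /path/to/spheroids/Index.idx.xml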

Do you have an idea what could be going on and whether we can fix the thumbnail generation? Skipping the thumbnails might be a workaround for us, as they will be generated when requested through the web page. However, when I tested that yesterday, the web interface/server became unresponsive for about an hour while it was generating the thumbnails.

I also found this on Trello, which seems somewhat related, especially as we use plates: https://trello.com/c/9PNdsw2b/255-parallelize-thumbnail-generation

All the best and thanks a lot,
Ben