Avoiding/recovering from Omero PostgreSQL DB corruption

As I mentioned in a couple of posts previously I’ve been experimenting with running Omero on AWS. So far we went for the simplest solution that we could think of. We took the Docker Compose example and translated it to use AWS Fargate. That mostly worked really nicely and let us get a test system up and running pretty quickly.

Let me prefix the rest of the discussion by stating that most likely we will consider moving the PostgreSQL container so that it uses the AWS RDS managed service which we hope means no DB maintenance. We may also rearrange the containers so that we have 2-3 tasks, one for each container, rather than one task running all three containers as is the case now.

Now when we wanted to add a new environment variable to the OMERO Server container we allowed Fargate to shutdown all containers and launch new ones. Either through misconfiguration (originally we allowed new instances pointing to the same filesystem directories hosted on EFS to launch while the old ones were still running), or through the container being killed too quickly (we didn’t do a graceful shutdown) I believe that the PostgreSQL DB became corrupted.

This manifests itself, by the Omero Web login page being available, but login being denied. Checking the PostgreSQL logs we see a stream of repeated logs as below:

2020-09-02 11:09:32.001 UTC [439] ERROR:  MultiXactId 2914 has not been created yet -- apparent wraparound
2020-09-02 11:09:32.001 UTC [439] CONTEXT:  SQL statement "SELECT 1 FROM ONLY "public"."experimenter" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
2020-09-02 11:09:32.001 UTC [439] STATEMENT:  insert into session (id,permissions,timetoidle,timetolive,started,closed,defaulteventtype,uuid,owner,node) values ($1,-35,$2,$3,$4,null,$5,$6,$7,$8)

I tried to login to the DB and run:
VACUUM FREEZE

In the end I gave up and created a new DB instance starting with a blank filesystem.

Let me postfix that by stating that I’m aware that just killing the container is probably not the recommended way to handle a shutdown. However, I’d like to expect that most of the time this shouldn’t result in data loss. I don’t want to have to restore from backups everytime AWS needs to kill my Fargate containers…

One final question. In case I had already got to the point of establishing a backup system for the DB and the filesystem. How do I make sure that they are time consistent. E.g. I realised that there were some stale thumbnails left after the DB was erased. Which resulted in new images uploaded being associated with old thumbnails!

As for the last point, omero admin cleanse /OMERO solved the thumbnail discrepency for me!

1 Like

Hi @jfkotw.

This isn’t a corruption state that I’ve ever run into, but there are certainly examples online, e.g. https://postgrespro.com/list/thread-id/2380690

If this were a production DB, it’d make sense to try to determine more about the corruption, but as is, I understand that this is more just an issue of lacking confidence of PG on docker, right?

If this is the case, then I could definitely see this leading to catastrophic corruption, especially on NFS since the locking semantics are not always guaranteed.

If you do decide to keep PostgreSQL on docker, you likely want to look into some for of high-availability configuration. One option might be https://www.crunchydata.com/

All the best,
~Josh

Nice hint. I just deleted the thumbnails directory which also seemed to work.

Yes, it goes in that direction. Since we’re more or less fixed on using AWS, we will most likely use their RDS solution which should allow us to figure out the HA options for PostgreSQL. It’s perhaps rather that I’d like to have that warm fuzzy feeling that killing the Omero Server container more or less randomly isn’t going to cause us big headaches in the future.

At the moment I think that in this case it was related to having two instances connecting to the DB which I guess we’ve fixed now.

Understood. :slightly_smiling_face:

I don’t think we’ve ever experienced any data corruption due to abrupt shutdown. Power outages, segfaults, even a catastrophic disk failure recently where the virtual disk was just gone have all been recoverable, since changes to the data directory stop abruptly as well. The only case I remember similar to yours was where writing still occurred due to silent errors. Much more dangerous.

Just to be clear, I don’t think having two OMERO instances connecting to one DB could cause this level of issue. Having two Postgres instances running against the same pgdata directory is a recipe for disaster!

~Josh

That is a good point, and is probably what I really meant. Thanks for clarifying.

1 Like