As I mentioned in a couple of previous posts, I’ve been experimenting with running OMERO on AWS. So far we have gone for the simplest solution we could think of: we took the Docker Compose example and translated it to run on AWS Fargate. That mostly worked really nicely and let us get a test system up and running pretty quickly.
Let me preface the rest of the discussion by stating that we will most likely move the PostgreSQL container to the AWS RDS managed service, which we hope means no DB maintenance. We may also rearrange the containers so that we have 2-3 tasks, one for each container, rather than one task running all three containers as is the case now.
Now, when we wanted to add a new environment variable to the OMERO Server container, we allowed Fargate to shut down all containers and launch new ones. Either through misconfiguration (originally we allowed new instances, pointing to the same filesystem directories hosted on EFS, to launch while the old ones were still running) or through the container being killed too quickly (we didn’t do a graceful shutdown), I believe that the PostgreSQL DB became corrupted.
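For what it's worth, here is a minimal sketch of the two ECS settings that seem relevant to both suspected causes, written as plain Python dictionaries in the shape that the boto3 ECS client accepts. It assumes that a deployment with maximumPercent=100 and minimumHealthyPercent=0 makes ECS stop the old task before starting the new one, and that a longer stopTimeout gives PostgreSQL more time between the stop signal and the forced kill (Fargate caps this at 120 seconds). The cluster, service and task definition names are hypothetical, and I have not verified this against our setup.

```python
# Sketch only: builds request fragments and does not call AWS.
# All names passed in (cluster, service, task definition) are hypothetical.

FARGATE_MAX_STOP_TIMEOUT = 120  # seconds; upper limit for stopTimeout on Fargate


def postgres_container_overrides(stop_timeout=120):
    """Fragment to merge into the PostgreSQL container definition."""
    if stop_timeout > FARGATE_MAX_STOP_TIMEOUT:
        raise ValueError("stopTimeout cannot exceed 120 seconds on Fargate")
    return {"stopTimeout": stop_timeout}


def single_instance_update(cluster, service, task_definition):
    """Keyword arguments for ecs.update_service that avoid overlapping tasks.

    maximumPercent=100 prevents a second task from starting alongside the
    old one; minimumHealthyPercent=0 allows the old task to be stopped first.
    """
    return {
        "cluster": cluster,
        "service": service,
        "taskDefinition": task_definition,
        "deploymentConfiguration": {
            "maximumPercent": 100,
            "minimumHealthyPercent": 0,
        },
    }


if __name__ == "__main__":
    kwargs = single_instance_update("omero-cluster", "omero-service", "omero-task:2")
    print(kwargs)
    # With boto3 installed and credentials configured, this could be applied as:
    #   boto3.client("ecs").update_service(**kwargs)
```

The trade-off of this assumed configuration is a short outage on every redeploy, since the old task is stopped before the new one is healthy.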
This manifests itself as the OMERO Web login page being available but login being denied. Checking the PostgreSQL logs, we see a stream of repeated entries like the ones below:
2020-09-02 11:09:32.001 UTC  ERROR: MultiXactId 2914 has not been created yet -- apparent wraparound
2020-09-02 11:09:32.001 UTC  CONTEXT: SQL statement "SELECT 1 FROM ONLY "public"."experimenter" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
2020-09-02 11:09:32.001 UTC  STATEMENT: insert into session (id,permissions,timetoidle,timetolive,started,closed,defaulteventtype,uuid,owner,node) values ($1,-35,$2,$3,$4,null,$5,$6,$7,$8)
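Since the web login page stays up while the DB is in this state, it might be worth scanning the PostgreSQL log for this specific error. Below is a small sketch that extracts the offending MultiXactId values from log text. The regular expression is derived only from the log excerpt above, and the function name is made up for illustration.

```python
import re

# Pattern based solely on the log excerpt in this post.
MULTIXACT_ERROR = re.compile(
    r"ERROR:\s+MultiXactId (\d+) has not been created yet -- apparent wraparound"
)


def find_multixact_errors(log_text):
    """Return the MultiXactId values reported in wraparound errors."""
    return [int(match.group(1)) for match in MULTIXACT_ERROR.finditer(log_text)]


if __name__ == "__main__":
    sample = (
        "2020-09-02 11:09:32.001 UTC  ERROR: MultiXactId 2914 has not been "
        "created yet -- apparent wraparound"
    )
    print(find_multixact_errors(sample))
```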
I tried to log in to the DB and run:
In the end I gave up and created a new DB instance starting with a blank filesystem.
Let me add that I’m aware that just killing the container is probably not the recommended way to handle a shutdown. However, I’d expect that most of the time this shouldn’t result in data loss. I don’t want to have to restore from backups every time AWS needs to kill my Fargate containers…
One final question: suppose I had already got to the point of establishing a backup system for the DB and the filesystem. How do I make sure that the two are time-consistent? For example, I realised that some stale thumbnails were left behind after the DB was erased, which resulted in newly uploaded images being associated with old thumbnails!
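To make the question more concrete, here is a sketch of one way the two backups could be matched up at restore time: each backup is labelled with the UTC time it was taken, and a DB dump is only paired with a filesystem snapshot taken within a small tolerance of it. This is an assumption about how such a scheme might work, not something OMERO or AWS provides. The function name and the 5 minute tolerance are made up, and real consistency would presumably still need writes to be paused while both backups are taken.

```python
from datetime import datetime, timedelta, timezone


def latest_consistent_pair(db_backups, fs_backups, tolerance=timedelta(minutes=5)):
    """Return the newest (db_time, fs_time) pair taken within `tolerance`.

    Both arguments are iterables of timezone-aware datetimes. Returns None
    when no DB backup has a filesystem backup close enough in time.
    """
    best = None
    for db_time in db_backups:
        for fs_time in fs_backups:
            if abs(db_time - fs_time) <= tolerance:
                candidate = (db_time, fs_time)
                # Prefer the pair whose later member is most recent.
                if best is None or max(candidate) > max(best):
                    best = candidate
    return best


if __name__ == "__main__":
    t0 = datetime(2020, 9, 1, 2, 0, tzinfo=timezone.utc)
    db = [t0, t0 + timedelta(days=1)]
    fs = [t0 + timedelta(minutes=2), t0 + timedelta(days=1, hours=1)]
    # The second-day backups are an hour apart, so only the first day pairs up.
    print(latest_consistent_pair(db, fs))
```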