OMERO server administration: find and cancel user tasks

Hi,

I’m supporting the system administration of a growing OMERO installation. We run version 5.6.1 with a three-server setup, i.e., one server each for OMERO.server, OMERO.web, and the database. We have users from different groups that use OMERO to store data from various instruments and run image analysis tasks using the web client, scripts, etc.

We have had a couple of instances where a user’s task has produced quite a lot of load on the OMERO.server for a significant amount of time without coming to an end. The problematic process dramatically limited the resources of the app server for other users. Also, the user did not seem to notice that something was wrong, which made it difficult for us to trace the issue back. From the top command, we were able to get the PID of the process in question, but we have not been able to identify the actual task/job and stop it without restarting the server.

This brings me to my questions:

  1. From a system administration point of view, I would like to be able to see which user is running what process at the moment. What is the best approach for this? I would like to be able to do it on OMERO.server itself and not via OMERO.web if possible.

  2. I would also like to be able to stop a rogue process of which I know the PID without having to restart the entire server. Is this possible? And if yes, how? Because of the number of users, finding and contacting the user whose job is running is not always a solution.

I’m fairly new to administering OMERO and this is my first forum post. I hope that I have given all relevant information. If not, please let me know. So far, I have not found the information that I’m looking for in the documentation or the forum.

Best regards,
Robert

Hi @rschulz,

Thanks for the great question. Unfortunately, there isn’t an easy API for doing what you want, but …

omero sessions who

will tell you which users are logged in, their session ID, when their session was created, from where, and how many methods they’ve called, but not what they are currently doing. Java provides a number of tools like jstack, e.g.:

jstack $(omero admin ice server pid Blitz-0)

to see what’s going on within the server, but they aren’t resolved to the per-user level.
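
One way to narrow a jstack dump down to the thread that is using the CPU is to match native thread IDs: `top -H -p <pid>` lists per-thread CPU usage by decimal thread ID, and HotSpot's jstack prints the same ID in hex as `nid=0x...`. Below is a minimal Python sketch of that matching. It assumes a HotSpot JVM on Linux that prints `nid` in hex (as Java 8 and 11 do; newer JDKs may print it differently), and the function names are illustrative. The result is a thread name and stack, which still has to be related to a user by hand.

```python
import re
import subprocess

# Header line of a thread entry in HotSpot jstack output, e.g.
# "some-thread" #45 prio=5 os_prio=0 tid=0x... nid=0x1a2b runnable [0x...]
THREAD_HEADER = re.compile(r'^"(?P<name>.*)".*\bnid=0x(?P<nid>[0-9a-fA-F]+)')


def parse_jstack(dump):
    """Map native thread id (decimal) -> (thread name, list of stack lines)."""
    threads = {}
    current = None
    for line in dump.splitlines():
        match = THREAD_HEADER.match(line)
        if match:
            current = (match.group("name"), [])
            threads[int(match.group("nid"), 16)] = current
        elif not line.strip():
            current = None  # a blank line ends the current thread entry
        elif current is not None:
            current[1].append(line.strip())
    return threads


def describe_thread(server_pid, native_tid):
    """Run jstack on the server process and return the entry for one native thread id."""
    dump = subprocess.run(
        ["jstack", str(server_pid)],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_jstack(dump).get(native_tid)
```

For example, if `top -H -p $(omero admin ice server pid Blitz-0)` shows one thread ID at the top of the list, passing the server PID and that thread ID to `describe_thread` returns the matching thread name and stack frames, or `None` if the thread has already exited.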

Most user actions are separate threads rather than individual processes. Once a thread has started in Java, there’s no general way to interrupt it. You can try killing the given user’s session, which will prevent future method calls. If the long-running action is accessing Postgres, you can also kill the database connection, which is a separate connection.
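
On the Postgres side, the standard `pg_stat_activity` view and the `pg_terminate_backend()` function can be used to find and end a long-running query. Here is a rough Python sketch, assuming a DB-API connection (for instance from psycopg2) opened with sufficient privileges, and assuming the OMERO database is named `omero`. The function names and the ten-minute threshold are illustrative, not OMERO conventions. Note that the `usename` column is the database role, which is not the same thing as the OMERO user, so the query text is usually the more useful clue.

```python
# Active queries in one database that have been running longer than N minutes.
LONG_RUNNING_SQL = """
    SELECT pid, usename, now() - query_start AS runtime, query
    FROM pg_stat_activity
    WHERE datname = %s
      AND state = 'active'
      AND now() - query_start > %s * interval '1 minute'
    ORDER BY runtime DESC
"""


def long_running_queries(conn, dbname="omero", minutes=10):
    """Return (pid, role, runtime, query) rows for queries active longer than `minutes`."""
    cur = conn.cursor()
    try:
        cur.execute(LONG_RUNNING_SQL, (dbname, minutes))
        return cur.fetchall()
    finally:
        cur.close()


def terminate_backend(conn, pid):
    """Ask Postgres to terminate one backend; True means the signal was sent."""
    cur = conn.cursor()
    try:
        cur.execute("SELECT pg_terminate_backend(%s)", (pid,))
        return bool(cur.fetchone()[0])
    finally:
        cur.close()
```

A gentler first step is `pg_cancel_backend(pid)`, which cancels only the current query and leaves the connection open. Either way, it is worth reading the `query` column carefully before terminating anything, since the server's own connections show up in the same list.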

~Josh

Dear Josh,

thank you very much for your quick reply.

omero sessions who is already a big help for getting the session UUID and, if necessary, logging out the user. Thanks.

Since a running Java thread cannot be interrupted, am I correct in concluding that the only solutions are to either wait for the job to finish or restart the entire server?
Restarting the OMERO server seems problematic to me in production because there could still be a few users who are able to use the system (e.g., for analysis or uploading data) despite the rogue process that I would like to stop. The processes of these users would be interrupted when I restart the server. Is there a best-practice recommendation for admins in such cases to avoid or minimize possible data loss or corruption?

Best regards,
Robert

If you can’t identify the postgres connection and terminate that, then yes.

Not in a generic sense. What would help is understanding which user interactions (i.e. threads) are causing problems. Do you have any insight there?

~J

P.S. It occurs to me that it should be possible to write Java code that inspects the state of Java threads, minimally via JMX, if you know someone who would like to attempt that. It’s not guaranteed, though, that what the thread is doing would be cancellable without modification to the OMERO server code (e.g. by checking a flag from within a for-loop).

Dear Josh,

thanks again for the quick reply.

Unfortunately, it was not possible for me to trace the issue in detail back then. Therefore, I don’t know what interaction could have caused it. I might still have a backup of the Blitz log, but it is not easy for me to find the problem in there given the large number of messages that are written to the log.

In one instance, though, it would not have been possible for me to use “omero sessions who”, because the load on the server was so high that I could not create a local session on the OMERO server (the login just timed out). That is why I have now written a simple Python script that parses the Blitz log to determine who logged in and how, as an alternative/backup to “omero sessions who”.

I think that I will have to wait until this issue occurs again to do a better trace, but hopefully that won’t be soon. I will keep the JMX option in mind. I think that it would be great to have an API (or an omero sessions option) that allows the admin to see the current process of a user. The ability to interrupt the process might be difficult to realize, but showing the current process of a user might be easier, since the information is more or less written to the Blitz log and so is available in some form (if I’m not mistaken).

Best regards,
Robert

Thanks for the input, Robert. I’ve filed https://github.com/ome/omero-server/issues/118 to capture the idea. Of course, let us know if you continue having issues.

All the best,
~Josh