Clean up OMERO repository and database

Hi,

from time to time I ask our users to review the data they store on our server and clean it up if possible, e.g. archive data that belong to past projects.

After one of our users deleted some big files in the past days, I looked into his folder in ManagedRepository/ and realized that many files which are no longer visible in OMERO still remain on disk.

I was curious and searched the database a bit. I queried the database (“select name,fileset,series from image where owner_id=XX;”), and found that the images are still in the database (which makes me wonder why they don’t show up in OMERO?).

I was looking for a way to do some cleanup and found “omero admin cleanse”. It sounded like what I was looking for, so I tried it (with --dry-run), but unfortunately it stops with a ValidationException (/user_XX/2014-09/02/ does not exist).

So my question is:
How can I properly clean up

  • the file system, i.e. make sure files are deleted which were deleted in OMERO by users
  • the database, which obviously still contains entries that have been deleted

Thank you very much in advance for any suggestions and best wishes,

Benjamin

Hi @bene.schmid!

If they are in Postgres, then you must be able to find them somewhere. Perhaps in “Orphaned images”?

:+1: but it will only clean files that are not in the database.

This is certainly odd. Can you show us more of the output?

This is certainly the purpose of cleanse.

As far as I know, this can’t happen. Did the user possibly delete the dataset but the image was multiply linked from somewhere else? What does OMERO.web show if you go to http://YOURSERVER/webclient/?show=image-IMAGEID ?

~Josh

Hi @joshmoore,

thanks for your reply.
You are completely right. This particular user is part of several groups, and the images that I couldn’t find in OMERO belong to the “non-standard” group. I found them now, so this part is resolved, which is great.

So if I get cleanse to work, everything should be perfect. Please find the full output below. I guess it means that folders have been deleted manually. Maybe there is a way to just ignore the exception (since the folder is not there, there’s no need to clean it up) and continue instead of aborting?

Thanks a lot, I really appreciate it,
Benjamin

Traceback (most recent call last):
  File "bin/omero", line 118, in <module>
    rv = omero.cli.argv()
  File "/omero/omerowebenv/lib/python3.6/site-packages/omero/cli.py", line 1750, in argv
    cli.invoke(args[1:])
  File "/omero/omerowebenv/lib/python3.6/site-packages/omero/cli.py", line 1188, in invoke
    stop = self.onecmd(line, previous_args)
  File "/omero/omerowebenv/lib/python3.6/site-packages/omero/cli.py", line 1265, in onecmd
    self.execute(line, previous_args)
  File "/omero/omerowebenv/lib/python3.6/site-packages/omero/cli.py", line 1347, in execute
    args.func(args)
  File "/omero/omerowebenv/lib/python3.6/site-packages/omero/cli.py", line 666, in _check_admin
    return func(*args, **kwargs)
  File "/omero/omerowebenv/lib/python3.6/site-packages/omero/plugins/admin.py", line 1915, in cleanse
    dry_run=args.dry_run)
  File "/omero/omerowebenv/lib/python3.6/site-packages/omero/util/cleanse.py", line 254, in cleanse
    delete_empty_dirs(proxy, root, client, dry_run)
  File "/omero/omerowebenv/lib/python3.6/site-packages/omero/util/cleanse.py", line 262, in delete_empty_dirs
    is_empty_dir(repo, '/', False, to_delete)
  File "/omero/omerowebenv/lib/python3.6/site-packages/omero/util/cleanse.py", line 294, in is_empty_dir
    is_empty_dir(repo, subdirectory, may_delete_subdir, empty_subdirs):
  File "/omero/omerowebenv/lib/python3.6/site-packages/omero/util/cleanse.py", line 294, in is_empty_dir
    is_empty_dir(repo, subdirectory, may_delete_subdir, empty_subdirs):
  File "/omero/omerowebenv/lib/python3.6/site-packages/omero/util/cleanse.py", line 294, in is_empty_dir
    is_empty_dir(repo, subdirectory, may_delete_subdir, empty_subdirs):
  File "/omero/omerowebenv/lib/python3.6/site-packages/omero/util/cleanse.py", line 289, in is_empty_dir
    for entry in repo.listFiles(directory):
  File "/omero/omerowebenv/lib/python3.6/site-packages/omero_Repositories_ice.py", line 427, in listFiles
    return _M_omero.grid.Repository._op_listFiles.invoke(self, ((path, ), _ctx))
omero.ValidationException: exception ::omero::ValidationException
{
    serverStackTrace =
    serverExceptionClass =
    message = /jschmidpeter_452/2014-09/02/ does not exist
}

:+1:

Thanks. I see what’s throwing now, but can you include your arguments to the method as well as any output it gave beforehand? I’m wondering, for example, if this was the first thing it detected, or were some methods handled fine. ~J

This is how I called it:

bin/omero admin cleanse --dry-run /srv/omero_data/ > cleanse.dryrun.txt

The output is rather huge. I shortened it:

Reconciling OMERO data directory...
 /srv/omero_data/Pixels
   \_ /srv/omero_data/Pixels/Dir-046/46106_pyramid (keep)
   \_ /srv/omero_data/Pixels/Dir-046/46110_pyramid (keep)
   \_ /srv/omero_data/Pixels/Dir-046/46649_pyramid (keep)
   \_ ...
   \_ /srv/omero_data/Pixels/Dir-173/173142_pyramid (keep)
Reconciling OMERO data directory...
 /srv/omero_data/Files
   \_ /srv/omero_data/Files/Dir-046/46933 (keep)
   \_ ...
   \_ /srv/omero_data/Files/Dir-173/173330 (keep)
Reconciling OMERO data directory...
 /srv/omero_data/Thumbnails
   \_ /srv/omero_data/Thumbnails/855 (keep)
   \_ /srv/omero_data/Thumbnails/436 (keep)
   \_ /srv/omero_data/Thumbnails/311 (keep)
   \_ /srv/omero_data/Thumbnails/507 (keep)
   \_ /srv/omero_data/Thumbnails/954 (keep)
   \_ /srv/omero_data/Thumbnails/660 (keep)
   \_ /srv/omero_data/Thumbnails/292 (keep)
   \_ /srv/omero_data/Thumbnails/Dir-046/46739 (keep)
   \_ /srv/omero_data/Thumbnails/Dir-046/46673 (keep)
   \_ ...
   \_ /srv/omero_data/Thumbnails/622 (keep)
Cleansing context: 0 files (0 bytes)
Removing empty directories from...
 /srv/omero_data/ManagedRepository

Thanks again,
Bene

Hmm… nothing occurs to me, @bene.schmid. Do you know anything that may be different about that directory? The only course of action I can think of at this point is to try to give you a patch for cleanse.py and see if we can find a workaround.

~J

Thanks for looking into this, @joshmoore.

I can see in https://github.com/ome/omero-py/blob/0546d34066a42aaf3db07de0f6a8557a78892dfc/src/omero/util/cleanse.py#L249 that deleting empty folders in ManagedRepository is the final step, so maybe this is less important.

As far as I understand the code, cleanse doesn’t look for orphan files under ManagedRepository, though. Is this true? By orphan I mean files which are no longer in the database. Is there a way to clean up the ManagedRepository as well?

Thanks a lot and best wishes,
Bene

Hi @joshmoore,

I wrote a small, probably really inefficient bash script to identify files under ManagedRepository which are no longer in the database:

for f in `find . -mindepth 4 -maxdepth 4 -type d`; do
  f="${f:2}/";
  a=`psql -t -A --field-separator ' ' \
          -U postgres -d omero_database \
          -c "select fileset.id,fileset.templateprefix,image.id \
                from fileset inner join image on fileset.id=image.fileset \
                where fileset.templateprefix like '$f' limit 1;"`
  echo "$f -> $a"
done

and redirected the output in a text file.

Then I filter the contents with

cat orphans.txt | grep  -v '\-> [0-9]'

Many of the resulting entries are empty folders, but there are also many folders with image data in them.

Because I’m not into the OMERO data model enough, I’d really appreciate your confirmation that the above query guarantees that those files are not used by OMERO any more, and that it is safe to delete them.
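For readers who prefer to do this comparison outside a shell loop, the same idea can be sketched in Python: list the directories four levels below ManagedRepository and report those whose prefix the database does not know. This is only a sketch under assumptions. `known_prefixes` is a hypothetical input (for example the `templateprefix` values exported with psql), the function names are made up for illustration, and a directory in the result is a candidate for closer inspection, not proof that it is safe to delete.

```python
import os


def dirs_at_depth(repo_root, depth=4):
    """Return directories exactly `depth` levels below repo_root as
    relative paths with a trailing slash, e.g. 'user_1/2020-09/24/10-00-00.000/'.
    This mirrors `find . -mindepth 4 -maxdepth 4 -type d`."""
    repo_root = os.path.abspath(repo_root)
    found = []
    for current, dirs, _files in os.walk(repo_root):
        rel = os.path.relpath(current, repo_root)
        level = 0 if rel == "." else rel.count(os.sep) + 1
        if level == depth:
            found.append(rel.replace(os.sep, "/") + "/")
            dirs[:] = []  # prune: do not descend below the target depth
    return sorted(found)


def unreferenced(disk_dirs, known_prefixes):
    """Return the directories whose prefix is not among known_prefixes.
    known_prefixes is assumed to hold templateprefix values from the database."""
    known = set(known_prefixes)
    return [d for d in disk_dirs if d not in known]


# Hypothetical usage, run against the ManagedRepository directory:
#   candidates = unreferenced(dirs_at_depth("."), prefixes_from_database)
```

Doing the comparison with exact string matches avoids the pattern semantics of `like`, where an underscore in a user name matches any single character.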

Thanks a lot and best wishes,
Bene

Hmm… I would expect empty directories to be deleted. I imported and then immediately deleted an image. Running cleanse --dry-run I see:

Removing empty directories from...
 /OMERO/ManagedRepository
   \_ /OMERO/ManagedRepository/root_0/2020-09/30/ (remove)

Removing these would of course be fine.

This worries me. The only data loss that I know of (or am remembering) came from an attempt to cleanse /OMERO and the cleanse script is the product of lots of checks.

I think it would be easier to catch the validation exception and see if your dry-run completes more or less naturally. If it throws an exception for every directory, then something else is going on.

~J.

Hi @joshmoore,

well, if it worries you, it worries me even more :wink:

I fully agree with you, but as I mentioned in my second to last post, my impression was that cleanse.py is not cleaning the ‘ManagedRepository’ folder anyway, is it?
At least it says

SEARCH_DIRECTORIES = {
    'Pixels': 'Pixels',
    'Files': 'OriginalFile',
    'Thumbnails': 'Thumbnail'
}

so I’m missing an entry 'ManagedRepository' there, no?
(https://github.com/ome/omero-py/blob/0546d34066a42aaf3db07de0f6a8557a78892dfc/src/omero/util/cleanse.py#L51)

Cheers, Bene

Not exactly. The ManagedRepository logic was added in a separate code block with a different strategy.

(The reason for this is that for the other directories there’s no service which provides a listing.)

And this is the code block that’s failing. The is_empty_dir method is called recursively, and when it reaches /jschmidpeter_452/2014-09/02/ it fails. This is maybe because one directory got deleted and another didn’t, but that’s what I haven’t been able to figure out or reproduce.

If you change the beginning of is_empty_dir to this:

def is_empty_dir(repo, directory, may_delete_dir, to_delete):
    empty_subdirs = []
    is_empty = True

    try:
        entries = repo.listFiles(directory)
    except omero.ValidationException:
        print("Failed to query: %s" % directory)
        return False  # EARLY EXIT

    for entry in entries:

then hopefully your dry-run will at least tell us how many of these issues you have.
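To make the effect of that early exit easier to follow, here is a self-contained sketch of the same pattern: a recursive emptiness check that records directories it cannot list and carries on instead of aborting. It is not the code from cleanse.py. `FakeRepo`, `ListingError` and `find_empty_dirs` are hypothetical stand-ins (the exception plays the role of omero.ValidationException), and the real method takes different arguments.

```python
class ListingError(Exception):
    """Stand-in for omero.ValidationException in this sketch."""


class FakeRepo:
    """Hypothetical in-memory repository mapping a directory to its entries.
    Entries ending in '/' are treated as subdirectories."""

    def __init__(self, tree, broken=()):
        self.tree = tree
        self.broken = set(broken)

    def list_files(self, directory):
        if directory in self.broken:
            raise ListingError("%s does not exist" % directory)
        return self.tree.get(directory, [])


def find_empty_dirs(repo, directory, empty, failed):
    """Return True if no file exists anywhere below `directory`.
    Empty directories are appended to `empty`, unlistable ones to `failed`."""
    try:
        entries = repo.list_files(directory)
    except ListingError:
        failed.append(directory)
        return False  # treat as non-empty so no parent gets marked for removal
    is_empty = True
    for entry in entries:
        if entry.endswith("/"):
            # Always recurse, so every subtree is inspected and reported.
            if not find_empty_dirs(repo, entry, empty, failed):
                is_empty = False
        else:
            is_empty = False
    if is_empty:
        empty.append(directory)
    return is_empty
```

Returning False for an unlistable directory is the cautious choice: its parents are then never reported as empty, so one failure cannot cause anything above it to be removed.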

~Josh

Dear @joshmoore

I commented out the SEARCH_DIRECTORIES entries to speed it up a bit, and then ran your code. It fails to query directories of 10 users (for some of them multiple subdirectories).
Afterwards, it says it would remove ~2700 directories. I haven’t checked yet whether these are all empty or if there are some which still contain data.

Thanks for your help and best wishes,
Bene

Hi again,

Just checked and all the directories found by cleanse.py are empty.
This worries me now since I suspect that there remain files under ‘ManagedRepository’ which are no longer referenced in OMERO.
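A check like this can be scripted along the following lines. The sketch assumes the dry-run output was saved to a text file, that removal candidates appear as lines of the form `\_ PATH (remove)` as in the excerpt earlier in the thread, and that PATH is a path on the local file system. The function names are hypothetical.

```python
import os
import re

# Matches lines such as "   \_ /OMERO/ManagedRepository/root_0/2020-09/30/ (remove)"
REMOVE_LINE = re.compile(r"^\s*\\_ (?P<path>.+) \(remove\)\s*$")


def paths_marked_for_removal(dry_run_text):
    """Extract the paths that the dry-run output marks with '(remove)'."""
    paths = []
    for line in dry_run_text.splitlines():
        match = REMOVE_LINE.match(line)
        if match:
            paths.append(match.group("path"))
    return paths


def contains_files(directory):
    """Return True if any non-directory entry exists at any depth below directory."""
    for _current, _dirs, files in os.walk(directory):
        if files:
            return True
    return False


def non_empty(paths):
    """Return the listed directories that still hold files somewhere below them."""
    return [p for p in paths if os.path.isdir(p) and contains_files(p)]


# Hypothetical usage:
#   with open("cleanse.dryrun.txt") as handle:
#       print(non_empty(paths_marked_for_removal(handle.read())))
```

An empty result means every directory marked for removal is empty on disk.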

I checked the output from my shell script mentioned in a previous post, picked a user and manually checked his OMERO account. He is a member of a single group, and there is exactly one file (1.1 GB) visible under ‘Data’ in OMERO.web, with no attachments. However, his folder under ‘ManagedRepository’ contains a lot of files, about 150 GB overall. I’d love to learn why cleanse.py isn’t picking those up.

Just to explain why I’m so insistent:
Our server has been running for several years, and data is accumulating very fast. Over time, scientists and whole groups leave the university or academia, so the storage space they occupy on our servers should be freed for new projects. Over the years, this old data might well make up a substantial amount of storage.

Thanks again for bearing with me… :wink:

Hi again,

Some more information:
Querying the ‘image’ table shows that there is only this one image:

 select * from image where owner_id=6502;

However, querying the fileset table:

 select * from fileset where owner_id=6502;

shows all the filesets for which files remain in ManagedRepository.

I also double-checked:

select * from image where fileset=101754

but there’s no result.

So in summary, the database references filesets and fileset entries, but they don’t belong to images, and therefore don’t show up in OMERO clients.
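The relationship described here (a fileset row with no image row pointing at it) can be expressed as a single LEFT JOIN. The sketch below demonstrates it on a tiny in-memory SQLite database so that it runs anywhere. The table and column names (`fileset.id`, `fileset.templateprefix`, `fileset.owner_id`, `image.fileset`) are the ones used in the queries above, but the tables are heavily simplified stand-ins, the rows are invented sample data, and a real OMERO database is PostgreSQL with many more columns.

```python
import sqlite3

# Simplified stand-in tables with invented sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fileset (id INTEGER PRIMARY KEY, templateprefix TEXT, owner_id INTEGER);
    CREATE TABLE image (id INTEGER PRIMARY KEY, name TEXT, fileset INTEGER, owner_id INTEGER);
    INSERT INTO fileset VALUES (1, 'user_1/2020-09/24/10-00-00.000/', 1);
    INSERT INTO fileset VALUES (2, 'user_1/2020-09/24/11-00-00.000/', 1);
    INSERT INTO image VALUES (10, 'a.tif', 1, 1);
""")

# Filesets of one owner for which no image row references the fileset.
QUERY = """
    SELECT fileset.id, fileset.templateprefix
      FROM fileset LEFT JOIN image ON image.fileset = fileset.id
     WHERE fileset.owner_id = ? AND image.id IS NULL
     ORDER BY fileset.id
"""


def filesets_without_images(connection, owner_id):
    """Return (id, templateprefix) for each fileset of owner_id that has no image."""
    return connection.execute(QUERY, (owner_id,)).fetchall()


print(filesets_without_images(conn, 1))
# prints [(2, 'user_1/2020-09/24/11-00-00.000/')]
```

Fileset 1 is referenced by image 10 and is filtered out. Fileset 2 has no image and is reported.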

BTW: I checked the import log, there was no error importing these files, as far as I could see.

Under which circumstances can this happen?

Best wishes

I can’t think of any normal way to end up with successful imports causing filesets without images. Can you reproduce this workflow with a new import, perhaps by reusing files from an existing one, maybe then deleting it or whatever? If we could find how to make it happen we might be able to closely instrument each step.

So, for these filesets that have entries but not images, what are the files? E.g., for fileset 1234,

SELECT file.path || file.name
    FROM originalfile AS file, filesetentry AS entry
    WHERE file.id = entry.originalfile AND entry.fileset = 1234;

… and do these still exist on disk?
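Checking the query results against the disk can be scripted roughly as follows. This sketch assumes that the values returned by the query are paths relative to the ManagedRepository directory. `repo_root` and the function name are hypothetical.

```python
import os


def check_on_disk(repo_root, relative_paths):
    """Split relative_paths into files that exist below repo_root and files that do not.

    Returns (present, missing): present maps each existing path to its size
    in bytes, missing lists the paths with no regular file on disk."""
    present, missing = {}, []
    for rel in relative_paths:
        full = os.path.join(repo_root, rel)
        if os.path.isfile(full):
            present[rel] = os.path.getsize(full)
        else:
            missing.append(rel)
    return present, missing


# Hypothetical usage, with paths_from_query holding the query output:
#   present, missing = check_on_disk("/srv/omero_data/ManagedRepository", paths_from_query)
#   print(sum(present.values()), "bytes still on disk;", len(missing), "files missing")
```

Summing the sizes in `present` gives a rough idea of how much space the imageless filesets occupy.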

omero fs sets --without-images lists filesets without images but these ordinarily result from a failed import. Generally a failure is indeed fairly obvious toward the end of the import log though one could double-check Blitz-0.log from around that same time.

Some clues probably do result from finding out what these remaining files actually are. For instance, taking a simple example, the import logs themselves are files in the managed repository that are not indexed via images (but that should be deleted when the fileset is deleted).

Hi @mtbc,

Here’s what omero fs sets --without-images gives me:

 #  | Id     | Prefix                                       | Images | Files | Transfer
----+--------+----------------------------------------------+--------+-------+----------
 0  | 111751 | yariza_6102/2020-09/24/15-47-37.745/         | 0      | 1     |
...
(25 rows, starting at 0 of approx. 71722)
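With tens of thousands of rows, it may help to turn this table into a list of fileset IDs. The sketch below parses text in the layout shown above (a header row, cells separated by `|`, and a separator line built from `-` and `+`). It assumes that layout stays the same for every row and that no prefix contains a `|` character. The function names are hypothetical.

```python
def parse_fs_sets(table_text):
    """Parse pipe-separated table text into a list of dicts keyed by the header cells."""
    rows = []
    header = None
    for line in table_text.splitlines():
        if "|" not in line:
            continue  # separator line, "..." or the row-count footer
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells  # the first pipe-separated line is the header
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def imageless_ids(rows):
    """Return the Id column as integers for rows whose Images column is 0."""
    return [int(row["Id"]) for row in rows if row["Images"] == "0"]
```

The resulting IDs can then be fed into the per-fileset query above or into a reviewed delete step.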

Your query above gives me for that fileset:

SELECT file.path || file.name
    FROM originalfile AS file, filesetentry AS entry
    WHERE file.id = entry.originalfile AND entry.fileset = 111751;
                      ?column?
----------------------------------------------------
 yariza_6102/2020-09/24/15-47-37.745/Experiment.lif
(1 row)

The file exists:

 ls -lh yariza_6102/2020-09/24/15-47-37.745/Experiment.lif
-rw-r--r-- 1 omero omero 17G Sep 24 17:52 yariza_6102/2020-09/24/15-47-37.745/Experiment.lif

Is it safe to post an excerpt of Blitz-0.log? Should I put it on pastebin? For me, there’s no obvious error, but of course I appreciate if you have a look at it.

Thank you very much,
Benjamin

It’s probably safe, yes, but if you’re not sure, feel free to just zip it up and mention this thread and the timepoint to look at when uploading to http://qa.openmicroscopy.org.uk/qa/upload/ or private message on this forum to pass some kind of download link. If you could also include the import log for that file, that’d be great.

So if you import this Experiment.lif now, what do you get: does the new fileset have any images?

Hi @mtbc,

I just imported the file successfully, and in the webclient, I see 13 new images, the same that I get with

 select image.id,image.name  from image inner join fileset on image.fileset=fileset.id where templateprefix like 'bschmid_2862/2020-10/05/14-21-27.959/';
   id   |            name
--------+----------------------------
 256413 | Experiment.lif [Series070]
 256412 | Experiment.lif [Series067]
 256411 | Experiment.lif [Series061]
 256410 | Experiment.lif [Series059]
 256409 | Experiment.lif [Series057]
 256408 | Experiment.lif [Series053]
 256407 | Experiment.lif [Series051]
 256406 | Experiment.lif [Image047]
 256405 | Experiment.lif [Series044]
 256404 | Experiment.lif [Series043]
 256403 | Experiment.lif [Image018]
 256402 | Experiment.lif [Image011]
 256401 | Experiment.lif [Image010]
(13 rows)

I’ll now try to delete them again, and see whether the file is then gone or not.

Best wishes

Hmm, after deleting the images in OMERO, the original file is gone, so no answer unfortunately.

@joshmoore: I received the logs and put them in the usual place.

However, I’m not sure what to make of them. There’s nothing obvious stopping the import. It’s as if it simply stopped partway for some reason: it didn’t even get as far as creating the images in OMERO, which is why there aren’t any. If you compare with the import log for your subsequent successful import, I’m guessing you’ll see rather more happening, with processing steps, further INSERTs, etc. I don’t suppose the earlier import coincides with some known period of network or server difficulties? If we find a normal workflow that reproduces this issue then that’ll certainly aid investigation! It could be that it is simply what happens if the client is terminated partway, at the right moment.

In the meantime, if you want to delete imageless filesets safely, simply omero delete Fileset:111751 (or whatever) should rid you of those files. (It probably takes an admin cleanse to clean empty directories up.) Options like --report and --dry-run may be of interest.
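For many filesets, the commands can be generated rather than typed. The sketch below only builds the argument lists and does not run anything, so the result can be reviewed first. It uses the `omero delete Fileset:ID` form and the `--dry-run` and `--report` options mentioned above. The function name is hypothetical, and the exact option behaviour is best confirmed with the CLI help of the installed version.

```python
def delete_commands(fileset_ids, dry_run=True, report=True):
    """Build one 'omero delete Fileset:<id>' argument list per fileset ID.

    Nothing is executed here. dry_run defaults to True so that the generated
    commands are harmless unless the caller explicitly switches it off."""
    commands = []
    for fileset_id in fileset_ids:
        command = ["omero", "delete", "Fileset:%d" % int(fileset_id)]
        if dry_run:
            command.append("--dry-run")
        if report:
            command.append("--report")
        commands.append(command)
    return commands


for command in delete_commands([111751]):
    print(" ".join(command))
# prints: omero delete Fileset:111751 --dry-run --report
```

Once the dry-run reports look right, each list could be passed to `subprocess.run` with `dry_run=False`.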