Parallel imports into datasets created using regular expressions

Hello,
I have been experimenting with parallel import. I am pleased to see an increase in the import speed.
I wanted to see how it would work with a more complex directory structure.

I have some test images in directories with a structure like:

|-images
    |-day1
        |-am
            |-d1im1.tif
    |-d1im2.tif
    |-day2
        |-d2im1.tif
        |-d2im2.tif

If I import my images without the parallel option using
omero import images -T "regex:+name:^.*images/(?<Container1>.*?)"
I get what I want, which is three datasets called day1, day1/am and day2 with the appropriate images in each one. If I add the options for parallel import using
omero import images -T "regex:+name:^.*images/(?<Container1>.*?)" --parallel-fileset 2 --parallel-upload 2
I get four datasets called day1, day1, day2 and day2, with one image in each. I can sort of see why it would do this.

I tried predefining the datasets day1, day1/am and day2, and then running the parallel import. This worked better, with most images ending up in the right dataset, except d1im2.tif, which was in day1/am instead of day1.

I realise that parallel importing is still experimental, and I may be asking too much of it, but I wondered if anyone had experience trying something similar?

Hi @Laura190,

thanks for the feedback and the clear reproducible scenario.

On the datasets duplication, what you describe is indeed a known limitation of the current implementation as mentioned in the reference documentation. The --parallel-fileset option currently does not work in a thread-safe manner when import targets are specified. The current workaround is to pre-create the datasets as you did. This is also the strategy the IDR team is using to load the images from the Human Protein Atlas.

I am surprised that the images get wrongly linked after pre-creating the datasets. I will have a go at reproducing your set-up. Can you confirm your original layout looks as follows, with d1im2.tif under day1 rather than at the root of the images folder?

|-images
    |-day1
        |-am
            |-d1im1.tif
        |-d1im2.tif
    |-day2
        |-d2im1.tif
        |-d2im2.tif

Best,
Sebastien

Hello Sebastien,
Thank you for looking at this. Yes, d1im2.tif is under day1, rather than at the root of the images folder. I realise that this is an “extreme case”; I just wanted to see how far I could currently push it, in case one of our users did something like this. I did try the simpler case first:

|-images
    |-day1
        |-d1im1.tif
        |-d2im1.tif
    |-day2
        |-d1im1.tif
        |-d2im1.tif

which worked fine.

I would be interested to hear about the strategy used to pre-create the datasets. My current thought is to assign the folder names to variables, pass these to omero obj, and then run the omero import command.
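Something along these lines is what I had in mind — a rough Python sketch that walks the image tree and builds one `omero obj new Dataset` command per sub-directory (using the path relative to the root as the dataset name; this naming convention is just my assumption based on the example above):

```python
import os

def dataset_commands(root):
    """Walk the image tree and build one 'omero obj new Dataset' command
    per sub-directory, naming each dataset after its path relative to root."""
    commands = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        if rel == ".":
            continue  # skip the root folder itself
        # normalise to forward slashes so names match the regex targets
        commands.append("omero obj new Dataset name=%s" % rel.replace(os.sep, "/"))
    return commands

# For the layout above this would produce:
#   omero obj new Dataset name=day1
#   omero obj new Dataset name=day1/am
#   omero obj new Dataset name=day2
```

I would then run these commands before launching the parallel import with the same regex target.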

Thanks,
Laura

Hi @Laura190,

I have spent additional time investigating your original issue and I was able to reproduce the incorrect image-dataset assignment. I can confirm the second import scenario works as expected if --parallel-fileset is not passed or if --parallel-fileset 2 is passed. However, when higher values are passed, more images get imported into the day1 dataset.

This is most likely a bug in the import library that arises when combining --parallel-fileset and regex import targets. It looks like the import target is determined using one (the first?) of the paths concurrently imported, and this target is then used across all parallel threads. I have opened https://github.com/ome/omero-blitz/issues/95 to capture this issue.

Our experience of using --parallel-fileset is limited to bulk imports with specified dataset targets, where the parallelization happens within each dataset - see here for the import configuration and here for a typical list of import paths and targets.
In this case, all datasets are explicitly named in the TSV file, so it is relatively straightforward to create a wrapper script that digests this file and converts it into a list of omero obj new commands.
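As a rough illustration of such a wrapper (the column name "target" and the `Dataset:name:` target syntax are assumptions here, not a description of the actual IDR files), in Python:

```python
import csv

def tsv_to_dataset_commands(tsv_path, target_column="target"):
    """Read a bulk-import TSV and emit one 'omero obj new Dataset' command
    per distinct dataset target. Targets are assumed to be of the form
    'Dataset:name:<name>'; the column name is hypothetical."""
    prefix = "Dataset:name:"
    names = []
    with open(tsv_path) as f:
        for row in csv.DictReader(f, delimiter="\t"):
            target = row[target_column]
            if target.startswith(prefix):
                name = target[len(prefix):]
                if name not in names:  # keep first occurrence only
                    names.append(name)
    return ["omero obj new Dataset name=%s" % n for n in names]
```

Running the resulting commands before the bulk import ensures every dataset already exists, so the parallel threads only need to link images into them.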

Best,
Sebastien