Scikit image turn off output option?

In my amazing bad pursuit of muddling my way through image analysis with dask.

I have found that vectorized functions with a dask array for element-wise work amazingly fast because I couldn’t get a contrast stretch with scikit image to work.

I then have tried a median filter and this also didn’t work, I have tried with map_blocks and this just caused a memory issue. I stopped this after it did nothing for 1.5hrs …0% . I put an issue into dask and they suggested the functions in these libraries may be expanding the memory to sizes that it can’t control. So one request is, is there a way to know what intermediary steps that a function will take in scikit image?

When I tried with dask_image I used it as a drop in for my function and it caused a similar problem when it came to dropping in the dask_image version of median and guassian filter into a function it didnt work. When I did it outside of a function it worked perfectly.


The docs say it excludes the output parameter and is a wrapped version of scipy.ndimage.median filter excluding the output parameter. Is there a way to do the same thing with scikit image (not in the base code, thats too much work for you guys) but rather the same code used for me? Unless the dask version isnt wrapped and it is a copy omitting. I am hoping it is calling ndimage but has some little magic to say ‘forget about output’?

@Sh4zKh4n I recommend that you provide a complete failing example as code (not screenshots) that people can run on their own machine. StackOverflow has a good article on this.

This makes it much easier for people to debug your use case rather than try to think about it just from reading it, which is hard, or have to abstract away the problem.

In your case, I agree with Matt Rocklin that it seems the ndimage median filter is the failing function. Unfortunately, ndimage just does not do great for large kernel sizes. For example:

import numpy as np
from scipy import ndimage as ndi


def median3d(image):
    return ndi.median_filter(image, size=25)


res = median3d(
    np.random.randint(0, 256, size=(285, 245, 310)).astype(np.uint8)
)

ie running your code for just one chunk is taking forever on my machine (3min and counting), using up “only” 2GB of RAM. There is an open issue in SciPy about this excessive RAM usage, but it is really quite a difficult one to fix, which is why it remains open 5 years later.

Now, the thing I totally don’t understand is why dask_image.ndfilters.median_filter is working at all for you, since literally all it is doing is wrapping ndimage.median_filter in a map_overlap call. You can see the source code and find that it boils down to a map_blocks call just like yours. I can also reproduce that behaviour as follows:

import numpy as np
from dask_image.ndfilters import median_filter
from dask.diagnostics import ProgressBar

pbar = ProgressBar()
pbar.register()

arr = da.random.randint(
    0, 256, size=(10260, 1225, 1550),
    dtype=np.uint8,
    chunks=(285, 245, 310)
)

smoothed = median_filter(arr, size=25)
smoothed.to_zarr('smoothed.zarr')
[#                      ] | 3% Completed |  3min 36.8s

But, it does eventually start using too much RAM and spilling to disk, and brings my computer to a standstill…

On that I really must say I am stumped.

Things I would try next in your situation:

  • try using the dask.distributed scheduler and some local workers. By default, dask.array uses the threaded scheduler, which as I understand it will request as much RAM as the OS will give, and the OS will give more than it has and spill to disk, which is inefficient. Distributed gives you fine control over the resources you allocate to each worker. We saw that a single chunk takes up about 2GB of RAM to process, so you should be good with e.g. 3 workers with 5GB each (it’s good to have a bit of overhead space!).
  • We have an open PR for 3D rank filters in scikit-image that should not suffer from the ndimage excessive RAM issue here. The PR needs review but as I understand it it’s ready to go, so you should be able to build that branch locally and try it. It should be much faster than ndimage for large kernels such as the one you are using.

Again, I totally don’t understand why dask_image median_filter works at all, given that I can’t process even a single chunk with the SciPy version. That is something to discuss in the dask_image issue — I’ll add this comment there.

2 Likes

@jni

I think this one is best to carry on dask_image instead of repeating the discussion. Thanks for the link on MVCP , Ill have a read and try to do a bit better next time. As you may have guessed I am someone who throws themselves into something to learn how to do stuff. I end up making mistakes but also finding a solution sometimes.