Zarr file format and calculation of histograms

I just tried to get help on this subject but it appears to be a weird one. I can't use dask or xarray to calculate the histogram: neither will unpack each chunk and calculate per chunk, and scikit-image's histogram works but the whole file gets unpacked before anything is done.

Anyone here have any thoughts? I know it's an early project, but it's one to watch if you're working on imaging and large data sets.

Hi @Sh4zKh4n, could you please share a standalone code example here? It would make it much easier for scikit-image maintainers (hi! I'm one of them). You could write a script creating a large array, saving it as a zarr file, then add the commands you have been using to try to compute the histogram. It's on scikit-image's roadmap to have better/more integrated support for dask arrays, but concrete examples really help. Thanks!

Hi @emmanuelle (by the way, I'm a big fan of your work with scikit-image). It was a uint16 file that's about 280GB in size as a TIFF; as a zarr file this reduced to 100GB. Unfortunately, I'm not sure whether this is more an issue of the large size of my data or just my laptop. The unpacking seems to bottleneck in this format when passing it through a function. I tried using dask.delayed at one point and saw on the graph that there's a bottleneck at bringing the data together. My hard drive filled up very quickly. In fact I've stopped using dask. I'm going back to just looking at smaller ROIs.

I was trying to get the histogram because I had stitched the volumes together, but it had unfortunately not ignored the black pixels during blending. I tried to redo it but couldn't get it to work. With a smaller ROI histogram I could see that a certain number of pixels had been shifted by exactly half a bin number. I wanted to try to rescale the histograms and then match that against an unadulterated volume, if that makes sense. I just can't work out how to split the array and have the pieces indexed without resaving; I think that's my lack of skill in Python. np.array_split splits the array, but the pieces aren't indexed so that I can work on them individually. I'm guessing that would involve a loop.
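Roughly what I mean (a toy example with made-up sizes, not my real data):

import numpy as np

# toy stand-in for one of my volumes (not the real data or size)
vol = np.random.randint(0, 2**16, size=(8, 256, 256), dtype=np.uint16)

# np.array_split gives the pieces; enumerate keeps an index for each one
# so they can be worked on individually without resaving anything
for i, piece in enumerate(np.array_split(vol, 4, axis=0)):
    hist, _ = np.histogram(piece, bins=256, range=(0, 65535))
    print(i, piece.shape, hist.sum())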

I think on smaller files this would be fine, but I haven't got time at the moment to investigate and work out storage vs RAM requirements for these data sets when attempting to parallelise them. I had hoped to be able to chunk the data and work off the blocks without explicitly looping over volumes and then concatenating.

I will work on smaller volumes and come back to this another time. I think the zarr format is still great, just not for a naive user like me. I'm going back to the raw files and np.memmap on smaller regions. I'll work on the larger files as I become more experienced. I have a bad habit of trying the hardest thing first and just getting into a mess.

Edit
I will put some code here when I get a chance. It might be next week, but I think it's hard to recreate unless you have a large file.

@Sh4zKh4n I wouldn’t give up on using dask yet, as it is truly wonderful and seems to have a very bright future. In your case, can you use dask.array.histogram, which mirrors the numpy.histogram function? If not, you could do something like:

import numpy as np
from dask import array as da


arr = da.from_zarr('path/to/file.zarr')
min_value = da.min(arr).compute()
max_value = da.max(arr).compute()


def block_histogram(block):
    # np.histogram lets us pin the bin range explicitly, so every chunk
    # is binned identically and the per-chunk counts can simply be summed
    counts, _ = np.histogram(block, bins=256, range=(min_value, max_value))
    return counts.reshape((1, 1, -1))  # one row of 256 counts per chunk


histograms = arr.map_blocks(block_histogram, chunks=(1, 1, 256), dtype=np.int64)
# collapse the block grid, then fold the rows into a single 256-bin histogram
histogram_total = (
    histograms.sum(axis=(0, 1)).compute().reshape((-1, 256)).sum(axis=0)
)

I've used np.histogram inside the block function so the bin range can be pinned explicitly. The axis handling and the reshaping might need tweaking for your data, but I think the above strategy should work and perform quite well, assuming reasonable chunking of your .zarr file.

I hope this helps!


@jni

That's a great idea! I did finally get a histogram out last night (and realised that @emmanuelle was suggesting something similar too; it was late at night when I read it). I had been using da.histogram, but the issue with this and the voluminous data was that the result can't fit into memory: the zarr file still gets unpacked and a memory timeout occurs. I worked around it by not keeping the histogram data in memory; I saved it to a mmap.mmap file instead. That works! It means the histogram data is not kept in memory, which means the dask graph is not bottlenecked at bringing the data together, and the data is saved on the fly.

The good thing about this method is that the RAM is freed up, and so is the dask worker. The histogram can then be plotted and reviewed any time someone wants to look at it. It's the same idea as da.to_zarr, or writing to any other format instead of calling compute (I kind of stole the idea from dask!).
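It was roughly along these lines (a simplified sketch of the idea, not my exact code; the file names and 256 bins are placeholders, and I'm showing np.memmap here rather than raw mmap.mmap because it's easier for arrays):

import numpy as np
from dask import array as da

imgs = da.from_zarr('data.zarr')            # made-up path
n_blocks = int(np.prod(imgs.numblocks))

# on-disk buffer for the per-block counts, so nothing large stays in RAM
counts = np.memmap('histograms.dat', dtype=np.int64,
                   mode='w+', shape=(n_blocks, 256))

for i, block in enumerate(imgs.to_delayed().ravel()):
    h, _ = np.histogram(block.compute(), bins=256, range=(0, 65535))
    counts[i] = h                           # written straight to the file
counts.flush()

histogram_total = counts.sum(axis=0)        # full histogram, read back from disk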

I will come back to dask, but I need to make sure the algorithms work before putting them into dask. I'm just worried that I have pushed my laptop quite far with the computation; she might need a clean-up before continuing… dask.delayed seems to work better for this. As a noob user on the forum, as you may have noticed from my posts, I'll keep everyone posted (annoyingly so).

The only thing I've noticed is that, in proper dask use, there is the possibility to optimise the graph computation. I can see the computation graph, but even for a small ROI it's pretty complex, so it's difficult to optimise/cull. When I tried dask.delayed, even separating the above function into separate functions and then calling each one separately was a massive improvement; I saw that in the best-practices docs and it does help a lot. I'll have a go at the above this weekend.
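Roughly what I mean by splitting it up (a toy sketch with made-up stand-in functions, not my actual pipeline):

import dask
import numpy as np


# instead of one big delayed function that loads, rescales and bins in one
# go, small separate delayed steps give dask a much simpler graph to handle
@dask.delayed
def load(seed):
    rng = np.random.default_rng(seed)
    return rng.integers(0, 2**16, size=(64, 64), dtype=np.uint16)


@dask.delayed
def rescale(block):
    return block // 2  # stand-in for the real intensity correction


@dask.delayed
def block_histogram(block):
    return np.histogram(block, bins=256, range=(0, 65535))[0]


result = block_histogram(rescale(load(0))).compute()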

Update 7th June 2020 (with a solution for not big but annoyingly medium data):

So unfortunately I got a bit ill around this time and it took a while for me to feel better. Then the pandemic happened and I was on full childcare duty. I tried lots of different options, and ultimately it looked like the following:

1) Just using x.compute() filled the memory very quickly.
2) Using distributed with map filled the hard drive with spill-to-disk and took hours, then hung and crashed without doing anything, because (I'm guessing here, based on the graph and the dask API) it creates a sub-histogram array for every chunk, which all need to be merged at some point.
3) The chunking of my data was sub-optimal, so the number of tasks was massive; even when I improved the chunking I still couldn't compute a histogram.

In the end I looked for a dynamic way of updating the histogram data, and I used zarr to do it, by computing straight into it, since it allows concurrent reads and writes. As a reminder: my data is a zarr array in 3 dims (x, y, z), about 300GB uncompressed and about 100GB compressed, and it's 16-bit unsigned. On my 4-year-old laptop with 16GB of RAM, the following worked:

imgs = da.from_zarr(...)  # the ~300GB uncompressed (~100GB compressed) zarr array
imgs2 = imgs.rechunk((a, b, c))  # individual chunk size per dimension
h, bins = da.histogram(imgs2, bins=255, range=[0, 65535])  # 255 bins over the uint16 range
da.to_zarr(h, "histogram.zarr")  # compute straight into a zarr file on disk
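For reference, the progress bar is dask's ProgressBar from dask.diagnostics; wrapping the to_zarr call in it looks roughly like this (assuming the local scheduler):

from dask.diagnostics import ProgressBar

with ProgressBar():
    da.to_zarr(h, "histogram.zarr")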

With the progress bar running alongside the process, getting the histogram from file took:

[########################################] | 100% Completed | 18min 47.3s

Which I don't think is too bad for a 300GB array. Hopefully this helps someone else as well. Thanks for your earlier help @jni and @emmanuelle. Knowing the time it takes to do a histogram for a data set this big has to be helpful for others too, I think.
