I’ve been working on using dask for bigger-than-memory data analysis for a while now, and I’ve started to get somewhere: I can now do pixel manipulation with dask and zarr on a >300GB float64 data set. It’s promising, but early days. Hopefully I will have some luck in the future. I do wonder if there are others who are trying to do analysis of large volumes of data. One thing I notice is that the implementations people currently use are built around da.map_blocks. That is great if you are working with data sets that come as tiff slices, or if you have a filter that may need to share functions. The issue I have learnt, though, is that if you have a contiguous file saved in Zarr, the delayed function doesn’t really help and map_blocks creates a loop; depending on the chunks you have, this can create massive overhead in duplicate results. Even a simple contrast stretch needs to be done sequentially to get the processing completed.
So my question is: is there anyone else who is interested in sharing thoughts and practices on using dask for image processing on a single computer when you have a bigger-than-memory dataset? It would be great if we could start a new tag to share these practices.
Examples could include handling uint8 to float conversion and the resulting change in data size (I haven’t worked this out myself, but I continue to use zarr).
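To make the data size point concrete, here is a minimal sketch, assuming a small random uint8 stack as a stand-in for real data (the array shape, chunk sizes and variable names are made up for illustration). It only shows how the dtype conversion changes the in-memory size that dask reports; it is not a worked-out solution for the zarr side.

```python
import numpy as np
import dask.array as da

# Hypothetical small stand-in for a large uint8 image stack.
rng = np.random.default_rng(0)
stack_u8 = da.from_array(
    rng.integers(0, 256, size=(8, 64, 64), dtype=np.uint8),
    chunks=(2, 64, 64),
)

# astype is lazy: only the task graph and metadata change here.
stack_f32 = stack_u8.astype(np.float32)
stack_f64 = stack_u8.astype(np.float64)

# nbytes is derived from shape and dtype, so no data is computed.
bytes_u8 = stack_u8.nbytes
bytes_f32 = stack_f32.nbytes
bytes_f64 = stack_f64.nbytes

print(f"uint8:   {bytes_u8} bytes")
print(f"float32: {bytes_f32} bytes ({bytes_f32 // bytes_u8}x)")
print(f"float64: {bytes_f64} bytes ({bytes_f64 // bytes_u8}x)")
```

If float32 precision is enough for the calculation, it halves the footprint compared with float64, which matters a lot on a 16GB machine.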
Maybe even nuances: for example, I have learnt that using ufuncs and breaking a computation down into steps is faster than doing it in one line, and that it helps to avoid map_blocks for pixel calculations. Breaking the computation down into separate lines of code has improved my contrast stretch (computation and save) from 1hr 20mins to 1hr, which is a big reduction comparatively. This isn’t even using dask distributed, just the native dask chunking.
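As a rough illustration of what I mean by breaking the computation into steps, here is a minimal sketch of a min-max contrast stretch written with plain dask array operations instead of map_blocks. The random test image, its chunking and the variable names are assumptions for the example, not my actual pipeline, and I am not claiming this exact version reproduces the timings above.

```python
import numpy as np
import dask
import dask.array as da

# Hypothetical test image; real data would come from a zarr store.
rng = np.random.default_rng(0)
image = da.from_array(rng.random((256, 256)) * 1000.0, chunks=(64, 64))

# Step 1: compute both reductions in one pass over the data.
lo, hi = dask.compute(image.min(), image.max())

# Step 2: elementwise operations, one per line. Each is lazy and
# is applied chunk by chunk when the result is finally computed.
shifted = image - lo
scaled = shifted / (hi - lo)
stretched = da.clip(scaled, 0.0, 1.0)

# Step 3: trigger the computation (small enough to hold in memory here).
result = stretched.compute()
print(result.min(), result.max())
```

For a bigger-than-memory array, the last step would write to disk rather than call compute(), for example with da.to_zarr, so that only a few chunks are in memory at a time.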
Just thought I’d see if anyone was interested in sharing practices and what they have achieved. I’ll share some code later, once I see whether my attempt at a contrast stretch works by going back to basics.
I should say I am running all of this on a laptop with 16GB RAM and an i7 processor running at 3.39GHz. So an OK laptop: not new any more, about average, and without a massive RAM capacity.