Training StarDist with gputools support

Hi all!

So I’ve struggled a lot to get StarDist training to work with gputools. It’s worth it for us, as we see a speed improvement of about 2x when training a new StarDist model.

The issue is that it’s rather convoluted to install on Windows 10. I actually had it working on one machine, but the same protocol failed on a newer installation.

So I made a 20-minute video on how to install TensorFlow with GPU support in general, and gputools support for StarDist in particular.

The summarized documentation is here
https://c4science.ch/w/bioimaging_and_optics_platform_biop/computers-servers/software/tensorflow-gpu/miniconda/

This is mostly useful to me, but hopefully it can also help anyone else who thinks this is a good idea.

All the best

Oli

PS: Feel free to go and check out the cool webinar that @mweigert, @uschmidt83, @superresolusian and I participated in, explaining StarDist and showing how to use it.

12 Likes

Hi Oliver,
I do not have a GPU; my computer only has a CPU. Can I install StarDist under these conditions?

Thanks for the hard work, Oli!

I’m a bit amazed that you ‘only’ see a 2 or 3-fold speed-up on the GPU. I consistently observe improvements of 10 to 15-fold (seen on different machines, both for training with Tensorflow-gpu & Keras and for Noise2Void in Fiji). For instance, my four-year-old laptop GPU easily outperforms a 24-core Xeon Gold processor.
Would this be inherent to the datasets? Or to the way StarDist is built? Or maybe you just have a really kick-ass CPU?

Best regards,
Bram

2 Likes

Hi @Mariya_Timotey_Mitev Mariya,

You definitely can! The question is: would you want to?
If you want to just ‘use’ StarDist and the pretrained models, you don’t need any of this.
Just Fiji with the right update sites enabled.

If you really want to train your own model on the CPU

Then you would need to install StarDist in a similar way as described above, although it’s actually simpler if you do not have a GPU.
However, as per @mweigert’s and @uschmidt83’s comments during the webinar, you really would need a GPU, because training on the CPU is excruciatingly slow. You would be better off using the Colaboratory notebook that Martin provided.

To install StarDist for training without GPU support

You would just need to install Miniconda and then run something like

conda create -c conda-forge -n stardist-cpu tensorflow==1.12 jupyterlab
conda activate stardist-cpu
pip install stardist

That would install StarDist for CPU and Jupyter Lab.
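As a quick sanity check of the environment (a minimal sketch; just importing the packages and printing their versions confirms the install worked):

# run inside the activated "stardist-cpu" environment
import tensorflow as tf
import stardist

print(tf.__version__)        # should report 1.12.x
print(stardist.__version__)  # confirms the pip install of stardist worked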

Best

Oli

2 Likes

Hey @bramvdbroek,

My bad, I was not sufficiently clear. The 3-fold increase is

running on the GPU with gputools vs. running on the GPU without gputools.

I do clearly get the 10 to 15-fold increase when going from CPU to GPU. What the extra steps outlined above do is give you an additional ~3x speed increase, so we are now maybe 30-fold faster than using the CPU.
But I wasn’t even considering the CPU only scenario as it is so slow.

2 Likes

A post was split to a new topic: How to Optimize StarDist Segmentation

Ah yes, sorry. I remember now, you were talking about this during the excellent NEUBIAS StarDist webinar.
Thanks for clarifying!
Bram

2 Likes

A post was split to a new topic: Export QuPath Annotations for StarDist Training

Thanks for the tutorial, Oli!

For people who are interested, StarDist can do GPU/OpenCL-accelerated computation of the radial distances during training when gputools is installed. This is especially important in 3D and/or if you have a relatively slow CPU with few cores.

In other words, this prevents the computation of the radial distances from becoming a bottleneck during training.
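If you’re using the Python API directly, this is controlled via the use_gpu flag of the model configuration, roughly like so (a 2D example; the parameter values are just placeholders, and gputools_available() simply checks whether gputools can be imported):

from stardist import gputools_available
from stardist.models import Config2D, StarDist2D

# use OpenCL-based computations in the data generator during training (requires gputools)
conf = Config2D(n_rays=32, grid=(2, 2), use_gpu=gputools_available())
model = StarDist2D(conf, name='my_model', basedir='models')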

Best,
Uwe

2 Likes

Thanks for the more detailed explanation!

I have, however, encountered an issue when trying to limit GPU memory, with TF 1.14 specifically, I think…

So far I have not needed this even with gputools enabled, but it might explain some instabilities I’ve had when doing multiple rounds of training. Should I switch back to TF 1.12?
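For context, the call in question is CSBDeep’s limit_gpu_memory utility, used roughly like this (the fraction is just an example value):

from csbdeep.utils.tf import limit_gpu_memory

# cap TensorFlow at ~80% of the GPU memory instead of letting it grab everything
limit_gpu_memory(0.8)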

2 Likes

Hi Oli,

This might be related to https://github.com/CSBDeep/CSBDeep/issues/40 and could be alleviated by downgrading keras:

pip uninstall keras
pip install keras==2.2.5
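Once that’s done, a quick check of the installed version should confirm the downgrade took effect:

import keras
print(keras.__version__)  # should now report 2.2.5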

Let me know if that works,
M

2 Likes

Hello! Awesome, this solved the issue, thank you very much!

Oli

2 Likes

Note that this is already fixed in the repository and will be part of the next release of csbdeep (no schedule for that yet). In the meantime, you could use this instead (if you don’t want to downgrade keras).

Uwe

2 Likes

Thanks Uwe!

Indeed, I noticed that I could run limit_gpu_memory, but then the cuDNN library was not recognized anymore. You saved me hours :slight_smile:

Sorry Martin, I am marking Uwe’s answer as the solution now :sweat_smile:

2 Likes

Hi @uschmidt83,

I have two GPUs, and I would like to run two instances of StarDist training, one on each GPU. Do you have any idea how to do it? I don’t have NVLink to use both in the same process.

Thanks
Mafalda

1 Like

Hi Mafalda,

you can use this at the top of your script / Jupyter notebook (e.g. use 0 or 1 to select a specific GPU):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # must be set before TensorFlow/Keras initializes the GPU

Are you also using gputools to speed up training? I don’t know how to choose the GPU there, but @mweigert should know if you’re interested.

Best,
Uwe

1 Like

Hi Uwe,

thank you for the tip, it works.
But I’m also using gputools, so if it’s possible to improve performance there, that would be good.

Thank you
Mafalda

1 Like

You can dynamically set the GPU/OpenCL device that is used by gputools via

import gputools

# use GPU 0
gputools.init_device(id_platform=0, id_device=0)

# ...

# use GPU 1
gputools.init_device(id_platform=0, id_device=1)

2 Likes

Maybe a side topic/question for @mweigert: is it possible to have two gputools processes running in parallel on two GPUs? PyOpenCL supports that, right?

Yes, in pyopencl one can in principle create multiple devices/queues and then execute programs on each of them (gputools, in contrast, assumes a single global device). Whether those queues execute truly asynchronously, however, is something I never really investigated in detail (but it’s worth looking into).
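A minimal sketch of the pyopencl side of this, just the device/queue setup (it assumes both GPUs are exposed by the first OpenCL platform):

import pyopencl as cl

# pick the first OpenCL platform and list its GPUs
platform = cl.get_platforms()[0]
gpus = platform.get_devices(device_type=cl.device_type.GPU)

# one context and one command queue per GPU; kernels can then be enqueued on either queue
contexts = [cl.Context([dev]) for dev in gpus]
queues = [cl.CommandQueue(ctx) for ctx in contexts]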

1 Like