Multi-GPU processing with imglib2-cache

Hi everyone

I’ve been experimenting with processing large images, cell by cell, with multiple GPUs. After some discussions on Gitter, I’ve implemented a design suggested by @tpietzsch, with additional input from @hanslovsky.

The idea is to store pending jobs in a BlockingQueue with a capacity of numGPUs. For each GPU, a thread is started that monitors the queue. I’ve implemented the monitor loop in a class called GPUQueueMonitor:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.FutureTask;

/**
 * Monitors the job queue for a single GPU: binds its thread to one device,
 * then takes queued jobs and runs them one at a time.
 */
public class GPUQueueMonitor extends Thread {
	private volatile boolean stop;
	private final BlockingQueue<FutureTask<?>> queue;
	private final int deviceNum;

	public GPUQueueMonitor(BlockingQueue<FutureTask<?>> queue, int deviceNum) {
		this.queue = queue;
		this.deviceNum = deviceNum;
	}

	@Override
	public void run() {
		// set the device num for this thread
		YacuDecuRichardsonLucyWrapper.setDevice(deviceNum);

		while (!stop) {
			try {
				FutureTask<?> job = queue.take();
				job.run(); // runs the job and "completes" the future
				System.out.println("finished decon on GPU " + deviceNum);
			} catch (InterruptedException e) {
				// interrupted while waiting for a job: restore the flag and exit
				Thread.currentThread().interrupt();
				break;
			}
		}
	}

	public void setStop(boolean stop) {
		this.stop = stop;
	}
}
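
The pattern hinges on FutureTask semantics: job.run() on the monitor thread completes the future, so a future.get() waiting on another thread unblocks. A minimal standalone sketch of just that handoff, independent of the imaging code:

import java.util.concurrent.FutureTask;

public class FutureTaskHandoffSketch {
	public static void main(String[] args) throws Exception {
		FutureTask<String> job = new FutureTask<>(
				() -> System.out.println("work runs on the worker thread"), "finished");
		new Thread(job::run).start();  // stands in for a GPUQueueMonitor calling job.run()
		System.out.println(job.get()); // blocks until run() is done, then prints "finished"
	}
}

In the real loader (below) the Runnable is the deconvolution call, and the waiting thread is whichever thread imglib2-cache used to invoke the CellLoader.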

The queue, the CellLoader and the DiskCachedCellImg are set up as follows:

// imgF (the input image), deconvolver and factory are created earlier (see the full code)

int numGPUs = 4;

// create a queue to hold the GPU jobs; the max size of the queue is the number of GPUs
BlockingQueue<FutureTask<?>> queue = new ArrayBlockingQueue<>(numGPUs);

// create a list of GPU monitors; each monitor loops and waits for GPU jobs
ArrayList<GPUQueueMonitor> monitors = new ArrayList<>();

// for each GPU, create, keep and start a monitor
for (int i = 0; i < numGPUs; i++) {
	GPUQueueMonitor looper = new GPUQueueMonitor(queue, i + 1);
	monitors.add(looper);
	looper.start();
}

// create a loader; the loader doesn't do the work itself, it wraps the work in a
// FutureTask, puts it in the queue and waits for a GPU thread to run it
CellLoader<FloatType> loader = (SingleCellArrayImg<FloatType, ?> img) -> {
	FutureTask<?> future = new FutureTask<>(() -> {
		deconvolver.compute(imgF, img);
	}, "finished");
	queue.put(future); // blocks until space is free in the queue
	future.get();      // blocks until a GPU thread has processed the job
};

// create a new image using the disk cache factory, passing it the loader defined above
DiskCachedCellImg<FloatType, ?> out = (DiskCachedCellImg<FloatType, ?>) factory.create(
		new long[] { imgF.dimension(0), imgF.dimension(1), imgF.dimension(2) }, new FloatType(), loader,
		options().initializeCellsAsDirty(true));
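
Nothing runs until the cells of out are actually read, so to keep all GPUs busy the loader has to be triggered from several threads at once. A minimal sketch of one way to do that (not part of the linked code, and assuming the usual java.util.concurrent and net.imglib2 imports) is to read disjoint z-blocks of the output from a small thread pool:

// force cell loading from numGPUs threads by reading disjoint z-blocks of "out"
// (exception handling omitted; assumes the enclosing method declares throws Exception)
ExecutorService pool = Executors.newFixedThreadPool(numGPUs);
List<Future<?>> touches = new ArrayList<>();
long zSize = out.dimension(2);

for (int i = 0; i < numGPUs; i++) {
	long zMin = i * zSize / numGPUs;
	long zMax = (i + 1) * zSize / numGPUs - 1;
	RandomAccessibleInterval<FloatType> block = Views.interval(out,
			new long[] { 0, 0, zMin },
			new long[] { out.dimension(0) - 1, out.dimension(1) - 1, zMax });
	// reading the pixels pulls the corresponding cells through the CellLoader
	touches.add(pool.submit(() -> Views.iterable(block).forEach(FloatType::get)));
}

for (Future<?> touch : touches) {
	touch.get(); // wait until every block, and therefore every cell, has been computed
}
pool.shutdown();

Any number of touch threads would work; with more threads than GPUs the extra jobs simply block in queue.put() until a slot frees up.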

Full code is here and here.

My initial testing indicates this is working, though more testing is needed, especially to confirm that setDevice works as expected with Java threads (that is, that the device number stays bound to each Java thread).
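
One sanity check that can be done purely on the Java side is to confirm that each job really executes on the monitor thread that called setDevice, for example by logging the thread name in both places. A rough sketch (not in the linked code; it only verifies the Java-side threading, not what the native library does with the binding):

// in GPUQueueMonitor.run(), right after setDevice:
System.out.println("GPU " + deviceNum + " bound to thread " + Thread.currentThread().getName());

// in the CellLoader lambda, inside the FutureTask:
FutureTask<?> future = new FutureTask<>(() -> {
	System.out.println("decon running on thread " + Thread.currentThread().getName());
	deconvolver.compute(imgF, img);
}, "finished");

If the thread names printed by the jobs always match the bound monitor threads, the jobs are at least landing on the threads that called setDevice.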

One small issue: it seems that when a RandomAccess to the DiskCachedCellImg is created, a synchronized access is done in GuardedStrongRefLoaderRemoverCache, so the first cell is always processed serially, while the remaining cells are processed in parallel (assuming the loader is triggered in parallel).

This is a work in progress. If anyone sees flaws or has suggestions let me know.

Thanks


Thanks for keeping us up-to-date, @bnorthan.
This looks really interesting!