Hi all involved in this discussion,
I must admit I have spent some time (years) thinking about the best way to represent ROIs (mostly 3D). We ended up with this: internally, an object is a list of 3D voxel coordinates; externally, we use any type of labelled image (8, 16 or 32 bits).
Another point is that a labelled image usually contains not one ROI object but many.
This is why we have an additional layer of representation, an ObjectsPopulation, which is basically a list of 3D ROIs; a fairly fast algorithm extracts this population from any labelled image.
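To make the extraction step concrete, here is a minimal sketch of how a population could be built from a labelled image in a single pass. This is not our actual implementation: the function name `extract_population`, the `labels[z][y][x]` indexing, and the convention that label 0 is background are all assumptions made for the illustration.

```python
from collections import defaultdict


def extract_population(labels, background=0):
    """Group voxel coordinates by label in one pass over a 3D labelled image.

    `labels` is a nested list indexed as labels[z][y][x] (assumed layout).
    Returns a dict {label: [(x, y, z), ...]}, one entry per object.
    """
    population = defaultdict(list)
    for z, plane in enumerate(labels):
        for y, row in enumerate(plane):
            for x, value in enumerate(row):
                if value != background:  # assumption: 0 means "no object"
                    population[value].append((x, y, z))
    return dict(population)


# Tiny image with 2 slices, 2 rows and 3 columns, holding objects 1 and 2.
image = [
    [[0, 1, 1],
     [0, 0, 2]],
    [[0, 1, 0],
     [2, 2, 2]],
]
population = extract_population(image)
```

Each voxel is visited exactly once, so the cost grows with the image size and not with the number of objects, which is one plausible reason such an extraction can be fast.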
Finally, an internal list of coordinates is usually quite efficient for processing many objects, but sometimes you need an image (to compute co-localisation, for instance). In that case we use a labelled image that contains only the one object and is restricted to its bounding box (similar to ICY). This internal labelled image stores its offset, so the ROI keeps correct coordinates.
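As a rough illustration of the bounding-box idea, the sketch below converts a list of voxel coordinates into a small labelled image plus an offset, and back again. The names `to_cropped_image` and `to_global_coordinates` are hypothetical, as are the `image[z][y][x]` layout and the use of 0 as background; the real classes may well differ.

```python
def to_cropped_image(voxels, label=1):
    """Rasterise one object into an image restricted to its bounding box.

    `voxels` is a non-empty list of (x, y, z) tuples in global coordinates.
    Returns (image, offset): image is indexed image[z][y][x] and offset is
    the global (x, y, z) position of the image's first voxel.
    """
    xs, ys, zs = zip(*voxels)
    offset = (min(xs), min(ys), min(zs))
    width = max(xs) - offset[0] + 1
    height = max(ys) - offset[1] + 1
    depth = max(zs) - offset[2] + 1
    image = [[[0] * width for _ in range(height)] for _ in range(depth)]
    for x, y, z in voxels:
        # Subtract the offset to move from global to local coordinates.
        image[z - offset[2]][y - offset[1]][x - offset[0]] = label
    return image, offset


def to_global_coordinates(image, offset):
    """Recover global (x, y, z) coordinates from a cropped image and offset."""
    ox, oy, oz = offset
    return [
        (x + ox, y + oy, z + oz)
        for z, plane in enumerate(image)
        for y, row in enumerate(plane)
        for x, value in enumerate(row)
        if value != 0  # assumption: 0 means "outside the object"
    ]


voxels = [(10, 20, 5), (11, 20, 5), (11, 21, 6)]
image, offset = to_cropped_image(voxels)
restored = to_global_coordinates(image, offset)
```

Here the cropped image is only 2 x 2 x 2 voxels regardless of how large the original image was, and adding the offset back gives the original coordinates.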