Deep learning image classifier for multichannel images / what is the best approach?


I am currently training an Inception v3 network to classify images into 2 classes. My input data are fluorescence images containing 5 channels, one for each marker. Since the Inception network only takes 3 input channels, my first approach was to combine the different channels into an RGB image, for example like this: (channel1 + channel4/2, channel2 + channel4/2, channel3). However, my later goal is to try to interpret what the network has learned, so I think it would make more sense to modify the Inception network to accept additional channels; otherwise, how could I disentangle the different channels?
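For reference, the recombination described above could be sketched like this in numpy (the exact mix, the dropped fifth channel, and the clipping are all illustrative choices):

```python
import numpy as np

def combine_to_rgb(img):
    """Combine a 5-channel image (H, W, 5) into a pseudo-RGB image.

    Example recombination from the post (channel 5 is dropped here;
    the exact mix is an arbitrary choice):
      R = ch1 + ch4/2, G = ch2 + ch4/2, B = ch3
    """
    c1, c2, c3, c4 = img[..., 0], img[..., 1], img[..., 2], img[..., 3]
    rgb = np.stack([c1 + c4 / 2, c2 + c4 / 2, c3], axis=-1)
    return np.clip(rgb, 0.0, 1.0)  # keep values in a displayable range
```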

I have been thinking of several approaches and I would like your opinion on which would be the most relevant!

The first approach would consist of modifying only the 3x3x3 kernel of the first convolutional layer of the Inception network (keras-applications/ at bc89834ed36935ab4a4994446e34ff81c0d8e1b7 · keras-team/keras-applications · GitHub,
line 169) to a 5x5x5 or a 5x3x3 kernel (I don't actually know which would make more sense). However:
(1) I have no idea whether this could cause accuracy issues, since everything was designed for 3 channels in the first place
(2) I feel I won't be able later to disentangle the discriminative power of each channel using, for instance, saliency maps, and say "this pixel in this channel is discriminative" rather than just "this combination of pixels across the 5 channels is discriminative"
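For what it's worth, when training from scratch, Keras will happily build InceptionV3 with a 5-channel input, as long as `weights=None` (the pre-trained ImageNet weights require 3 channels); a minimal sketch:

```python
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras import layers, models

# Build InceptionV3 from scratch with a 5-channel input; weights=None
# is required because the pre-trained ImageNet weights expect 3 channels.
base = InceptionV3(weights=None, include_top=False,
                   input_shape=(299, 299, 5), pooling="avg")
outputs = layers.Dense(2, activation="softmax")(base.output)
model = models.Model(base.input, outputs)
```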

The second approach I have been thinking of is to train 5 different Inception networks in parallel (only the feature-extraction part) with 5 different loss functions (one per channel) and combine all the features before feeding everything into a dense classifier. It sounds very heavy, but this way I might be able to disentangle the discriminative power of each of the 5 channels. At the same time, though, I think I would lose the information carried by marker colocalization.
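A toy sketch of this parallel-branch idea with the Keras functional API; the small branches here are stand-ins for real Inception feature extractors, and the per-branch losses mentioned above would be added as extra model outputs:

```python
from tensorflow.keras import layers, models

def small_branch(x, name):
    # toy per-channel feature extractor (stand-in for an Inception branch)
    x = layers.Conv2D(16, 3, activation="relu", padding="same",
                      name=name + "_conv1")(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(32, 3, activation="relu", padding="same",
                      name=name + "_conv2")(x)
    return layers.GlobalAveragePooling2D()(x)

inp = layers.Input(shape=(64, 64, 5))
# one branch per channel: slice a single channel and extract features from it
branches = []
for c in range(5):
    ch = layers.Lambda(lambda t, c=c: t[..., c:c + 1])(inp)
    branches.append(small_branch(ch, f"ch{c}"))
merged = layers.Concatenate()(branches)  # combine all per-channel features
out = layers.Dense(2, activation="softmax")(merged)
model = models.Model(inp, out)
```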

What do you think? Of course, any other approach ideas would be appreciated :slight_smile:


Hello @Fabien_B,
Very interesting question,

I had a similar question for single-channel images: whether it would be possible to cut the 2 extra input channels that I didn't need. It turns out that in that case, duplicating the grayscale into a fake RGB was the solution; otherwise the amputated network has a different structure than the original network, and likewise for the "convolutional volumes" after each layer, so you can't transfer the weights from a pretrained Inception network to an Inception-like network with more or fewer channels than it was designed for.

I would thus expect a similar issue when replacing the input layers with ones accepting 5 channels, unless you add new layers on top of the current network to squeeze the 5 channels into 3 (this can take several layers) to match the expected network dimensions.
This is more or less what you are already doing, except that the new layers might find smarter weights to combine the channels, since they will be trained.

This applies if you want to fine tune a pre-trained network.
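A minimal sketch of this "squeeze" idea: a trainable 1x1 convolution learns how to mix the 5 channels into 3 before an unchanged 3-channel Inception (here built with `weights=None` to avoid a download; for actual fine-tuning you would load the pre-trained weights instead):

```python
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras import layers, models

inp = layers.Input(shape=(299, 299, 5))
# trainable 1x1 convolution that learns how to mix 5 channels into 3,
# so a 3-channel InceptionV3 can be reused unchanged
squeezed = layers.Conv2D(3, 1, padding="same", name="channel_squeeze")(inp)
# in practice you would use weights="imagenet" here; None avoids a download
base = InceptionV3(weights=None, include_top=False, pooling="avg")
features = base(squeezed)
out = layers.Dense(2, activation="softmax")(features)
model = models.Model(inp, out)
```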

However if you train from scratch, assuming you have enough training data, you could do an inception-like network with 5 input channels instead of 3 and train it all by yourself.
However, I believe it's a rather large network, i.e. many parameters to tune, and thus not ideal with little training data (i.e. fewer than 1000 images). A smaller architecture, such as a VGG-like network, might be quicker to converge and spare you the need for fine-tuning.

I might be wrong though; after all, a given architecture usually accepts a range of 2D image dimensions (width x height), so could the same flexibility apply along the channel axis too?

Finally, you can also use dimensionality-reduction methods to turn your 5-channel images into pseudo 3-channel images (e.g. PCA or UMAP). The advantage is that this produces an optimal channel combination that keeps the maximum of the information from the original 5 channels.

See the UMAP documentation: they did a dimensionality reduction to represent colours, from the original 3 RGB values down to only 2 values, so they could display them on a 2D plot (Basic UMAP Parameters — umap 0.5 documentation).

However, these methods are a bit complex if you have never encountered them, so adding layers on top of the current network might be simpler. Still, if you go for it, make sure to use the same dimensionality-reduction step across the different data fractions (training/validation/test sets).

For example with PCA, you should not recompute a PCA for every fraction: either fit it on the training set and use the result to project all fractions, or compute the PCA on the full annotated dataset.
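A sketch of this with scikit-learn, treating each pixel as one 5-valued sample; the function names are made up for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_channel_pca(train_images, n_out=3):
    """Fit a PCA on the training pixels only (each pixel = one 5-valued sample)."""
    pixels = train_images.reshape(-1, train_images.shape[-1])
    return PCA(n_components=n_out).fit(pixels)

def apply_channel_pca(pca, images):
    """Project images of shape (N, H, W, 5) to (N, H, W, n_out) with the SAME pca."""
    n, h, w, c = images.shape
    return pca.transform(images.reshape(-1, c)).reshape(n, h, w, -1)

# hypothetical usage: fit once on the training set, reuse for val/test
# pca = fit_channel_pca(x_train)
# x_train_3 = apply_channel_pca(pca, x_train)
# x_val_3 = apply_channel_pca(pca, x_val)
```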


Thank you for your answer!

You are right; fortunately I have a large training dataset, and I forgot to mention that I am training from scratch! I already got fairly good results by combining 4 channels into 3, but I am looking to input the 5 channels without having to combine them during preprocessing.

The first conv layer in the Inception architecture is 32 filters with a 3x3x3 kernel, so it already kind of squeezes the 3 channels of an RGB image from the very beginning, if I understand correctly how it works. So in principle one just has to change the first conv layer's kernel size (to 5x5x5, for example) to squeeze images with more than 3 channels. However, my fear is that I might not be able to disentangle the discriminative power of each channel later. I am wondering whether there is something smarter I can do, like parallelizing the training of filters on each channel separately and then merging those features without losing the information about the colocalization.
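One way to sketch the "per-channel filters first, then merge" idea without 5 separate networks is a depthwise convolution, where each filter sees exactly one channel, followed by a 1x1 convolution that mixes channels again (this is just an illustration, not the actual Inception stem):

```python
from tensorflow.keras import layers, models

inp = layers.Input(shape=(64, 64, 5))
# depth_multiplier=6 learns 6 filters per channel, each seeing ONE channel
# only, so these early features stay attributable to a single channel...
x = layers.DepthwiseConv2D(3, depth_multiplier=6, padding="same",
                           activation="relu")(inp)
# ...then a 1x1 convolution mixes the per-channel features, recovering
# cross-channel (colocalization) information
x = layers.Conv2D(32, 1, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)
out = layers.Dense(2, activation="softmax")(x)
model = models.Model(inp, out)
```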

Interesting suggestion! I will look into it. However, I am afraid that by reducing dimensionality I won't be able to interpret what the network has learned (but I may be wrong).

I think you are right, the output of the first conv layer should be 32 2D feature maps, right?
So indeed, using either a (Z x X x Y) 5x5x5 or 5x3x3 kernel should work for 5-channel input images.
For the rest of the question, I don't know, but normally the network should notice the relations between the input channels, i.e. the colocalization.

You are right, I forgot to mention that you lose the possibility to relate back to the original channels, since the network will predict on pseudo-RGB images.


Hi @Fabien_B, interesting discussion! I think it really depends what you want to extract from your data. If you train separately on each channel, and only combine the outputs of each separate training, your network won’t be able to exploit complex multi-scale relations between your channels. However I guess you will be able to tell which channel (or combination of channels) is the “most important” in determining the class. If you train using all channels, the initial 5x3x3 convolution “mixes” channels and therefore it will be able to exploit channel relations. In the end, when you calculate e.g. a saliency map you’ll get again a 3D image with 5 channels. I haven’t done this but I imagine that you can then try to extract information from this 3D volume. For example you could do a “argmax” projection, recovering for each 2D pixel, the channel with the highest saliency. Or you could look at pairs of channels and see if you can for example identify positifve/negative correlations in saliency, i.e. sometimes the same pixel in two channels is important, while in other cases, different regions in different channels are important etc. Again those are just ideas!


Thank you for those interesting ideas. I think you are right about the multi-scale relations between the channels; you expressed it better than me.
For the saliency map, I think you only get a 2D image: the input has to be (x, y, 5) and the output will be a 2D map (but I would be interested if there is a way to do otherwise!), so in the end you know the contribution of the combination of the 5 channels for a given pixel, but not the contribution of one channel. But maybe, if I want to know the contribution of a single channel, I could input the channel of interest plus 4 background planes (hiding the information in the 4 other channels); I don't know whether that makes sense from a statistical point of view.
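The masking probe described here could look like this in numpy; whether a constant background is a fair stand-in for a real channel is exactly the open statistical question raised above:

```python
import numpy as np

def mask_all_but(img, keep_channel, fill=0.0):
    """Return a copy of a (H, W, 5) image where every channel except
    `keep_channel` is replaced by a constant background value."""
    masked = np.full_like(img, fill)
    masked[..., keep_channel] = img[..., keep_channel]
    return masked

# hypothetical probe: compare model predictions on the full image vs. each
# single-channel version to estimate the per-channel contribution
# scores = [model.predict(mask_all_but(img, c)[None]) for c in range(5)]
```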

The saliency map should really have the same dimensions as your input. If at the output of your network you obtain a tensor of size 2 (for your two categories), the idea is to find the gradient of the winning category with respect to the input. This tells you which input pixel has the largest influence, i.e. for which pixel the output would change fastest if that pixel changed. This is totally independent of the dimensionality of your input, and in your case the saliency "map" would have multiple channels. This volume is probably not called a "saliency map" itself, and you'd have to do some projection along the channels to get an actual map.
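A minimal TensorFlow sketch of this idea, taking the gradient of the winning class score with respect to the input; `model` is any Keras classifier, and taking the absolute value of the gradient is one common convention:

```python
import tensorflow as tf

def saliency_volume(model, image):
    """Gradient of the winning class score w.r.t. the input.

    `image` has shape (H, W, C); the returned volume has the SAME shape,
    i.e. one saliency value per pixel per channel.
    """
    x = tf.convert_to_tensor(image[None], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        preds = model(x)
        top = tf.reduce_max(preds, axis=-1)  # score of the winning class
    grads = tape.gradient(top, x)
    return tf.abs(grads)[0].numpy()
```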


I know this is not deep learning, but for the classification of multichannel images we always extract features from all channels in order to classify a pixel using all channels.

A second approach is to use a separate model or network per channel.

Yes, you are right, thanks! I was confused because it turns out that the library I was using actually automatically takes the max of the derivative values of the 3 color channels at each pixel.