Machine-Learning Classification: how much data for the training set?


I have a set of 45 movies (covering all conditions, acquired on different dates) and I want to refine my classification of the cell cycle stage in CellProfiler Analyst, using H2B segmentation.

Question: how much of my data should be used as the training set?
In the first runs I used a couple of movies, which worked well enough for a first pass, but I did not necessarily include all siRNA conditions in the training set, and I now need to improve the classification.

I thought I had heard or read this somewhere but can’t find the thread or reference. One fifth of the data? Only controls, or also siRNA-treated movies? All dates?
What is the “golden” rule for ML-based classification?
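
To make the question concrete, here is a minimal sketch of the kind of selection I have in mind: take roughly the one-fifth fraction mentioned above, but make sure every condition and every date contributes at least one movie. The movie names, the condition labels, the 3 x 3 x 5 layout, and the function `pick_training_movies` are all made up for illustration. This is not a CellProfiler Analyst feature, only one possible way of choosing which movies to draw training cells from.

```python
import math
import random
from collections import defaultdict


def pick_training_movies(movies, fraction=0.2, seed=0):
    """Pick about `fraction` of the movies, at least one per (condition, date) group.

    `movies` is a list of (movie_id, condition, date) tuples (hypothetical format).
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible

    # Group movie ids by the combination of condition and acquisition date.
    groups = defaultdict(list)
    for movie_id, condition, date in movies:
        groups[(condition, date)].append(movie_id)

    training = []
    for key in sorted(groups):
        ids = sorted(groups[key])
        # Never skip a group entirely, even if it holds very few movies.
        n_pick = max(1, math.ceil(fraction * len(ids)))
        training.extend(rng.sample(ids, n_pick))
    return sorted(training)


# Hypothetical layout: 45 movies = 3 conditions x 3 dates x 5 movies each.
movies = [
    (f"{cond}_{date}_{i}", cond, date)
    for cond in ("control", "siRNA_A", "siRNA_B")
    for date in ("date1", "date2", "date3")
    for i in range(5)
]

training = pick_training_movies(movies)
print(len(training), "of", len(movies), "movies selected for training")
```

With this made-up layout every (condition, date) group contributes one movie. Whether that is enough, or whether the remaining movies should be kept aside purely for checking the classifier, is exactly what I am unsure about.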

And in the future, as I get more and more movies, what is the best practice: should I increase my training set as the data set grows? Or, if it works, keep it as is and re-apply the rules already identified?

pinging @haesleinhuepf @mweigert @ilastik_team