What is a good reference for estimating training set size for machine learning?


Does anyone know a reference for the percentage of images to use when training a Trainable Weka Segmentation (TWS) classifier? For example, if I have 360 images with a binary classification using the default settings, including the FastRandomForest classifier option, would 10% (36 images) be reasonable?

Thank you kindly!

@iarganda - Do you know of a reference for estimating the training set size in such cases?

Here you can find a nice reference on model evaluation of classifiers (scikit-learn):


In general, however, I would always plot the accuracy of the classifier (on held-out test data) as a function of the training set size to find an optimum.
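To illustrate the idea, here is a minimal sketch of such a plot's underlying numbers using scikit-learn's `learning_curve`. The data here is synthetic (`make_classification` stands in for your real features and labels), and the random forest settings are only placeholders, not TWS's actual configuration:

```python
# Sketch: validation accuracy as a function of training-set size
# (a "learning curve"). Synthetic data stands in for real annotations.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Placeholder dataset: 360 samples, binary labels.
X, y = make_classification(n_samples=360, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # 10% .. 100% of the training fold
    cv=5,
    scoring="accuracy",
)

# Mean cross-validated accuracy at each training-set size; plotting
# these points shows where accuracy starts to plateau.
mean_val = val_scores.mean(axis=1)
for n, acc in zip(sizes, mean_val):
    print(f"{n:4d} training samples -> mean CV accuracy {acc:.3f}")
```

Where the curve flattens out is a reasonable estimate of how much training material is enough for your data.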

It really depends on how homogeneous the dataset is… When using interactive learning you should try to cover the most typical cases but also the complicated ones.


But is there a general way to estimate the number of datasets and tracings needed? Is there a way folks reference this in their manuscripts?

Sorry for the late reply. What I’ve seen is that you can include a study of your model’s performance as a function of the percentage of labels you use. A general estimate is hard because it is a very data-dependent problem…

This is a generic answer to the generic question in the title.
The amount of training material needed to obtain adequate performance depends on the data, on what one considers adequate performance, and on the type of model chosen. In general, the more parameters the model needs to estimate, the more training data is required. You also generally want more samples than features. In the case of imbalanced classes, you may also need to find/annotate more samples of the less abundant classes.
Very often, training is also an iterative process: classes may need to be redefined in light of previous misclassification results. For example, a new ‘artefact’ class may need to be added to catch ‘contaminants’ of other classes.
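On the imbalanced-classes point, one common mitigation (besides annotating more rare-class samples) is to re-weight classes during training. A small sketch with scikit-learn, using synthetic data with an assumed 9:1 imbalance as a stand-in for real annotations:

```python
# Sketch: compensating for imbalanced classes with class weights.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder dataset with a 9:1 class imbalance.
X, y = make_classification(
    n_samples=500, n_features=20, weights=[0.9, 0.1], random_state=0
)
print("samples per class:", np.bincount(y))

# class_weight="balanced" weights each class inversely to its frequency,
# so the rare class is not drowned out by the abundant one.
clf = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)

# Evaluate with a metric sensitive to the minority class, not plain accuracy.
scores = cross_val_score(clf, X, y, cv=5, scoring="balanced_accuracy")
print(f"mean balanced accuracy: {scores.mean():.3f}")
```

Plain accuracy can look deceptively high on imbalanced data (a classifier that always predicts the majority class already scores 90% here), which is why balanced accuracy or per-class recall is the better yardstick.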