I’m afraid there’s no built-in way to do this in QuPath currently. Validating such classifiers meaningfully is hard, particularly when the classifier has been trained interactively by providing hard examples, and when nearby objects have highly correlated features. Together these make the training samples far from representative of the data as a whole, so splitting them into training/validation/test sets wouldn’t give very meaningful results; rather, a separate set of annotations would need to be generated to assess performance.
You may be able to get some of what you want by exporting detection measurements (with their classifications) from QuPath and evaluating them elsewhere (e.g. in R or Python). But I think figuring out what would be meaningful will be quite tricky and application-specific.
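To illustrate the kind of external evaluation I mean, here is a minimal Python/pandas sketch. It assumes you have exported a measurements table from QuPath (typically tab-separated, with the predicted class in a column such as "Classification") and that you have somehow obtained an independent reference label per detection. The "Reference" column here is NOT something QuPath exports; it stands in for ground-truth annotations you would need to create yourself, which is exactly the hard part:

```python
import pandas as pd

# In practice you would load QuPath's exported measurements, e.g.:
#   df = pd.read_csv("detections.tsv", sep="\t")
# Here we mimic just two columns: the classifier's prediction
# ("Classification", a typical QuPath column name) and a hypothetical
# independently annotated ground-truth label ("Reference").
df = pd.DataFrame({
    "Classification": ["Tumor", "Tumor", "Stroma", "Stroma", "Tumor"],
    "Reference":      ["Tumor", "Stroma", "Stroma", "Stroma", "Tumor"],
})

# Confusion matrix: rows = reference label, columns = predicted class
cm = pd.crosstab(df["Reference"], df["Classification"])
print(cm)

# Overall agreement between prediction and reference
accuracy = (df["Classification"] == df["Reference"]).mean()
print(f"Accuracy: {accuracy:.2f}")
```

From the confusion matrix you could go on to per-class precision/recall or a kappa statistic, but whichever metric you choose, its validity still depends entirely on how representative those reference annotations are.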