Editing .h5 files to reduce training set size after labelling?


I’ve labelled a bunch of frames from 4 videos, with NumFramesToPick initially set to 300. I understood this as the total number of frames that would be picked across all videos, but it actually picked 300 frames from each video. I have labelled 100 from each, and would like to use only those for training.

It seems that I could set NumFramesToPick to 100, delete the unlabelled frames from the labeled-data folder, and remove the corresponding lines from each of the CollectedData.csv files. However, these folders also contain CollectedData.h5 files, which are not easily opened (they appear empty when viewed with HDFView, for example). Do I need to edit these as well?
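For what it’s worth, those .h5 files are just pandas DataFrames saved with `to_hdf`, which is why a generic viewer like HDFView shows little useful content. A minimal sketch of editing one, using a synthetic stand-in for the real file (the scorer name “Marcus”, the image paths, and the `df_with_missing` key are assumptions here, not taken from your project):

```python
import numpy as np
import pandas as pd

# Toy DataFrame shaped like DeepLabCut's CollectedData files:
# columns are a (scorer, bodyparts, coords) MultiIndex, rows are image paths.
scorer = "Marcus"  # hypothetical scorer name
cols = pd.MultiIndex.from_product(
    [[scorer], ["nose", "tail"], ["x", "y"]],
    names=["scorer", "bodyparts", "coords"],
)
index = [f"labeled-data/SceneB/img{i:03d}.png" for i in range(5)]
df = pd.DataFrame(np.nan, index=index, columns=cols)
df.iloc[0] = [10.0, 20.0, 30.0, 40.0]   # a fully labelled frame
df.iloc[2, 0:2] = [15.0, 25.0]          # a partially labelled frame

# Keep only rows that carry at least one annotation:
labeled = df.dropna(how="all")
print(len(labeled))  # 2 of the 5 frames have labels

# In a real project you would read and rewrite the actual files, e.g.:
# df = pd.read_hdf("labeled-data/SceneB/CollectedData_Marcus.h5")
# labeled.to_hdf("labeled-data/SceneB/CollectedData_Marcus.h5",
#                key="df_with_missing", mode="w")
# labeled.to_csv("labeled-data/SceneB/CollectedData_Marcus.csv")
```

The key point is that the .csv and .h5 must stay in sync, so any rows dropped from one should be dropped from the other.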

Thanks, and apologies if I’ve missed this somewhere.

Marcus Watson

The correct way would be to use `deeplabcut.dropimagesduetolackofannotation(config)`, which deletes the unlabelled images, and then `deeplabcut.dropannotationfileentriesduetodeletedimages(config)` to remove the entries for the deleted images (I don’t think the first function does that after deleting images). In your case you should use the second function now, since the .h5 probably still contains those entries.
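To make the first step concrete, here is a rough, hypothetical re-implementation sketch of the idea behind it (the function name and details below are my own, not DeepLabCut’s code): delete every image in a labeled-data folder that is not listed in the annotation file.

```python
import os
import tempfile

def drop_images_without_annotation(folder, annotated):
    """Delete .png files in `folder` whose names are not in the
    `annotated` set (image names taken from the CollectedData index).
    Hypothetical sketch of the per-folder logic, not the DLC source."""
    removed = []
    for fname in sorted(os.listdir(folder)):
        if fname.endswith(".png") and fname not in annotated:
            os.remove(os.path.join(folder, fname))
            removed.append(fname)
    return removed

# Demo on a throwaway folder with three empty "images":
with tempfile.TemporaryDirectory() as d:
    for name in ["img000.png", "img001.png", "img002.png"]:
        open(os.path.join(d, name), "w").close()
    removed = drop_images_without_annotation(d, {"img001.png"})
    print(removed)  # ['img000.png', 'img002.png']
```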


OK, thanks for this. I will read the help files for these functions, try them out later, and mark your response as the solution, assuming it works as expected.

Another question: is it perhaps useful to leave some unlabelled images in the set where none of the body parts are in view? I have several images from before the animal has entered the environment, and hence with no labels, but it seems it might be useful for the model to incorporate the positive information that there is nothing of interest in those images. However, my understanding of the deep-learning process may be naive.

As far as I know, the model gets nothing from “not seeing stuff”. To simplify, it’s a maths equation: if there is no X in your equation, there is nothing to find. What pose-estimation and object-detection models do, again very simplified, is many matrix calculations (filter × pixel values of the frame), trying to amplify the most prominent features and find the answer to your question (then there’s weight updating, backpropagation and a lot of stuff I don’t even pretend to understand :smiley: ). From my understanding it does nothing, if not hurts performance: the model still has to perform calculations on empty frames, with no way to check its predictions, since there are no human labels to compare them against, and hence no way to adjust the weights.
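The “filter × pixel values” step above can be sketched in a few lines. This is a toy hand-rolled 2D convolution (my own illustration, not DeepLabCut code): a vertical-edge filter responds strongly where brightness changes, and gives exactly zero everywhere on an empty frame, which is the intuition for why blank images carry no signal.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` and sum the elementwise products
    at each position (a minimal, loop-based 2D convolution)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Image with a vertical edge: left half dark, right half bright.
image = np.zeros((5, 6))
image[:, 3:] = 1.0
# Vertical-edge filter: negative weights on the left, positive on the right.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

response = convolve2d(image, kernel)
print(response.max())  # 3.0 -- the strongest response sits on the edge

empty = convolve2d(np.zeros((5, 6)), kernel)
print(empty.max())  # 0.0 -- an empty frame produces no response at all
```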

Hmmm… having tried these functions, they don’t seem to do anything. E.g. the output of `deeplabcut.dropimagesduetolackofannotation('/home/marcus/DLC/BardTest/BardTest-Marcus-2021-01-24/config.yaml')` is

```
Annotated images: 300 In folder: 300
PROCESSED: /home/marcus/DLC/BardTest/BardTest-Marcus-2021-01-24/labeled-data/SceneB now # of annotated images: 300 in folder: 300
```

However I know for a fact that only about 85 of these images have any annotations at all. Anything I’m doing wrong here?

There is not much that can go wrong with this function. It just checks the .h5 file to see whether a given image file is listed there with annotations and, if not, deletes it from the folder (in a loop over all folders in the project). Can you open the CollectedData_***.h5 and check the listed images and their annotations? A screenshot of the file content would be great.
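One possible explanation for “300 annotated” when only ~85 frames were labelled: the unlabelled frames may still be listed in the .h5 as rows of NaNs, so every row counts as “annotated”. A quick diagnostic sketch, using a synthetic stand-in DataFrame (the `read_hdf` path and scorer name are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# In a real project:
# df = pd.read_hdf("labeled-data/SceneB/CollectedData_Marcus.h5")
# Synthetic stand-in with the same layout:
cols = pd.MultiIndex.from_product(
    [["Marcus"], ["nose"], ["x", "y"]],
    names=["scorer", "bodyparts", "coords"],
)
df = pd.DataFrame(
    [[1.0, 2.0], [np.nan, np.nan], [3.0, 4.0]],
    index=["img000.png", "img001.png", "img002.png"],
    columns=cols,
)

# Rows present in the file vs. rows that actually carry coordinates:
counts = df.notna().sum(axis=1)          # non-missing values per image
print(len(df))                           # 3 rows listed in the file
print(int((counts > 0).sum()))           # 2 rows with real annotations
print(counts[counts == 0].index.tolist())  # ['img001.png'] -- NaN-only rows
```

If the count of NaN-only rows matches your unlabelled frames, that would explain why the drop function considers everything annotated.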