This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Does ImageDataGenerator randomly select validation images? #290

Open
ZheMann opened this issue May 5, 2020 · 2 comments
Labels
image Related to images

Comments

ZheMann commented May 5, 2020

When setting the validation_split parameter to a value larger than 0.0, how does the Keras ImageDataGenerator select the validation images? Are they randomly selected from the input directory, or are the last n samples used, similar to the validation_split parameter for model.fit? More specifically, I'm primarily interested in the following situation: the flow_from_directory method has a shuffle parameter to randomize the data. Is the shuffle applied before or after the ImageDataGenerator splits the input directory into a training and a validation set?

I went through the official Keras and TF pages but they both show the same explanation of validation_split, namely:

validation_split: Float. Fraction of images reserved for validation (strictly between 0 and 1).

I also went through the source code (both Keras and TF) without finding any additional information.

@ZheMann ZheMann added the image Related to images label May 5, 2020
Contributor

Dref360 commented May 12, 2020

This is quite well hidden in the code base, but in the case of flow_from_directory, the split is a percentage per directory.

split: tuple of floats (e.g. `(0.2, 0.6)`) to only take into

QoT commented Oct 12, 2021

Not sure why that is not visible in @Dref360's answer, but the important part is the last sentence:

split: tuple of floats (e.g. (0.2, 0.6)) to only take into
account a certain fraction of files in each directory.
E.g.: segment=(0.6, 1.0) would only account for last 40 percent
of images in each directory.
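To make the docstring concrete, here is a minimal sketch of how such a fractional split could be applied to one directory's file list. It assumes the fractions are turned into slice indices over the sorted file names. The helper name take_split and the example file names are hypothetical and are not part of Keras.

```python
def take_split(filenames, split):
    """Return the files selected by split=(start, stop), given as fractions.

    Hypothetical helper: assumes the fractions become slice indices
    over the sorted file list of a single directory.
    """
    ordered = sorted(filenames)
    n = len(ordered)
    start, stop = int(split[0] * n), int(split[1] * n)
    return ordered[start:stop]


# Ten made-up, zero-padded file names in one class directory.
files = [f"img_{i:02d}.jpg" for i in range(10)]

last_40 = take_split(files, (0.6, 1.0))   # last 40 percent of the directory
first_20 = take_split(files, (0.0, 0.2))  # first 20 percent of the directory

print(last_40)
print(first_20)
```

Under this assumption the selection is deterministic: which files land in a split depends only on the sort order of the names, not on any shuffling.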

Actually, files are ordered with Python's sorted(), so if you name your images consistently (for example, with zero-padded numbers), you can use this feature quite easily. Otherwise you might get an ordering like this:

image_0.jpg
image_1.jpg
image_10.jpg
image_100.jpg
image_1000.jpg
image_1001.jpg
image_1002.jpg
image_1003.jpg
image_1004.jpg
image_1005.jpg
image_1006.jpg
image_1007.jpg
image_1008.jpg
image_1009.jpg
image_101.jpg
image_1010.jpg
image_1011.jpg
image_1012.jpg
image_1013.jpg
image_1014.jpg
image_1015.jpg
image_1016.jpg
image_1017.jpg
image_1018.jpg
image_1019.jpg
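The ordering above comes from plain string comparison, which can be reproduced with sorted() alone. The sketch below also shows one possible way to avoid it by zero-padding the number in each name. The zero_pad helper and the width of 4 are assumptions for illustration only, and the files on disk would have to be renamed accordingly.

```python
import re

# A few unpadded names similar to the listing above.
names = [f"image_{i}.jpg" for i in (0, 1, 2, 10, 100, 1000)]

# Plain string sort: "image_10.jpg" comes before "image_2.jpg".
lexicographic = sorted(names)
print(lexicographic)


def zero_pad(name, width=4):
    """Hypothetical helper: left-pad the first run of digits in a name with zeros."""
    return re.sub(r"\d+", lambda m: m.group().zfill(width), name, count=1)


# With zero-padded numbers, string order matches numeric order.
padded = sorted(zero_pad(n) for n in names)
print(padded)
```

With padded names, a fraction such as the last 40 percent of a directory corresponds to the highest-numbered images, which is usually what one expects.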
