Loading an image dataset from a directory using TensorFlow
This blog discusses three ways to load image data for modelling:
- ImageDataGenerator
- image_dataset_from_directory
- tf.data API
Before using any of the above methods, the images must follow the directory structure below.
File Structure Prerequisite
# Example of file structure
directory <- top level folder
├───train <- training images
│   ├───class_1
│   │       1008104.jpg
│   │       1638227.jpg
│   │       ...
│   └───class_2
│           1000205.jpg
│           1647351.jpg
│           ...
│
└───test <- testing images
    ├───class_1
    │       1001116.jpg
    │       1507019.jpg
    │       ...
    └───class_2
            100274.jpg
            1653815.jpg
            ...
You can check out Daniel’s preprocessing notebook for preparing the data.
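Before loading anything, it can help to sanity-check that the folders really follow this layout. A minimal sketch, assuming the hypothetical paths "directory/train" and "directory/test" (replace them with your own):
import os

# Hypothetical paths - replace with your own train/test directories
for dataset_dir in ["directory/train", "directory/test"]:
    for class_name in sorted(os.listdir(dataset_dir)):
        class_dir = os.path.join(dataset_dir, class_name)
        if os.path.isdir(class_dir):
            # Count the image files inside each class folder
            num_images = len(os.listdir(class_dir))
            print(f"{class_dir}: {num_images} images")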
1.1 ImageDataGenerator
Definition from the docs - Generate batches of tensor image data with real-time data augmentation.
So what’s data augmentation? - We’ll cover this later in the post.
To load in the data from a directory, first an ImageDataGenerator instance needs to be created.
from tensorflow.keras.preprocessing.image import ImageDataGenerator
train_datagen = ImageDataGenerator()
test_datagen = ImageDataGenerator()
Two separate data generator instances are created, one for the training data and one for the test data.
# Creating DataGen instances
train_datagen_10_percent = ImageDataGenerator(rescale=1/255.)
test_datagen_1_percent = ImageDataGenerator(rescale=1/255.)
rescale=1/255. is used to scale the images between 0 and 1, because most deep learning and machine learning models prefer data that is scaled or normalized.
Let’s use the flow_from_directory() method of the ImageDataGenerator instances to load the data. We’ll load both the training and test data at the same time.
# Loading in the data
train_data_10_percent = train_datagen_10_percent.flow_from_directory(directory=train_dir_10_percent,
target_size=(224, 224),
class_mode='categorical',
batch_size=32,
shuffle=True)
test_data_1_percent = test_datagen_1_percent.flow_from_directory(directory=test_dir_1_percent,
target_size=(224, 224),
class_mode='categorical',
batch_size=32)
First, let’s look at the parameters passed to flow_from_directory():
- directory - The directory the images are picked up from.
- batch_size - The images are loaded in batches of 32. Loading all of the train or test images at once might not fit into the machine's memory, so loading the data in batches keeps memory usage manageable. 32 is a good default batch size.
- target_size - The size the images are resized to after being loaded from the directory.
- class_mode - binary for binary classification and categorical for multi-class classification.
- seed - Setting a seed keeps results consistent if we repeat the experiments.
Checking whether train_data_10_percent is a tensor with tf.is_tensor() returns False: flow_from_directory() returns batches of NumPy arrays, not Tensors.
We can check out a single batch using images, labels = train_data_10_percent.next(), and we get an image batch of shape (batch_size, target_height, target_width, channels).
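A minimal sketch of that inspection, assuming the generators defined above; the min/max print also confirms the rescale=1/255. step:
import tensorflow as tf

# Grab a single batch from the training generator
images, labels = train_data_10_percent.next()

print(tf.is_tensor(images))        # False - the batch is a NumPy array
print(images.shape)                # (32, 224, 224, 3)
print(labels.shape)                # (32, number_of_classes)
print(images.min(), images.max())  # roughly 0.0 and 1.0 after rescaling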
Training time: This method of loading data gives the second highest training time of the methods discussed here. For 29 classes with 300 images per class, training on a GPU (Tesla T4) took 2min 9s with a step duration of 71-74ms.
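For reference, the training times in this post come from passing the generator straight to model.fit(). A minimal sketch, assuming a small hypothetical CNN (not the exact model used for the reported timings):
from tensorflow.keras import layers, models

# A small hypothetical CNN for 29 classes
model = models.Sequential([
    layers.Conv2D(16, 3, activation='relu', input_shape=(224, 224, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(29, activation='softmax')
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# The generator can be passed directly to fit()
model.fit(train_data_10_percent,
          epochs=5,
          validation_data=test_data_1_percent)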
1.2 ImageDataGenerator Data Augmentation
Data Augmentation is the technique of randomly tweaking the images in our dataset while they are loaded during training, so the model better handles real-world or unseen data.
We can implement data augmentation in ImageDataGenerator by passing augmentation parameters when creating the instance, as below,
# Creating DataGen instance with augmentation
train_datagen_10_percent_aug = ImageDataGenerator(rescale=1/255.,
                                                  horizontal_flip=True,
                                                  zoom_range=0.2,
                                                  rotation_range=0.2,
                                                  width_shift_range=0.2,
                                                  height_shift_range=0.2)
# Loading in the data
train_data_10_percent_aug = train_datagen_10_percent_aug.flow_from_directory(directory=train_dir_10_percent,
                                                                              target_size=(224, 224),
                                                                              class_mode='categorical',
                                                                              batch_size=32)
There are many options for augmenting the data; let’s explain the ones used above.
- horizontal_flip - Randomly flips images along the horizontal axis
- zoom_range - Randomly zooms into the image by up to the given fraction
- rotation_range - Randomly rotates the image
- width_shift_range - Range of the random horizontal (width) shift
- height_shift_range - Range of the random vertical (height) shift
All other parameters are the same as in 1.1 ImageDataGenerator.
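To see what the augmentation does, here is a minimal sketch that plots an image from the plain generator next to one from the augmented generator (assuming matplotlib is available; since both generators shuffle, the two images are different samples, so this compares augmentation in general rather than the same image):
import matplotlib.pyplot as plt

# One batch from the non-augmented and the augmented generators
images, _ = train_data_10_percent.next()
aug_images, _ = train_data_10_percent_aug.next()

plt.figure(figsize=(8, 4))
plt.subplot(1, 2, 1)
plt.imshow(images[0])
plt.title("Original (rescaled)")
plt.axis("off")
plt.subplot(1, 2, 2)
plt.imshow(aug_images[0])
plt.title("Augmented")
plt.axis("off")
plt.show()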
The advantage of using data augmentation is that it usually gives better results than training without augmentation. But ImageDataGenerator data augmentation increases the training time, because the data is augmented on the CPU and then loaded onto the GPU for training.
Training time: This method of loading data has the highest training time of the methods discussed here. For 29 classes with 300 images per class, training on a GPU (Tesla T4) took 7min 53s with a step duration of 345-351ms.
There’s another way of doing data augmentation, using the tf.keras.layers.experimental.preprocessing layers, which reduces the training time; a sketch is shown below.
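A minimal sketch of that approach, using the (then-experimental) Keras preprocessing layers so the augmentation runs on the GPU as part of the model; the layer choices here mirror the ImageDataGenerator settings above and are illustrative only:
import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing

# Augmentation as model layers - runs on the GPU as part of the model
data_augmentation = tf.keras.Sequential([
    preprocessing.Rescaling(1/255.),
    preprocessing.RandomFlip("horizontal"),
    preprocessing.RandomZoom(0.2),
    preprocessing.RandomRotation(0.2),
    preprocessing.RandomTranslation(height_factor=0.2, width_factor=0.2),
], name="data_augmentation")

# The random layers are only active during training; at inference time
# they pass the images through unchanged (Rescaling stays active).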
Return Type: ImageDataGenerator.flow_from_directory() yields batches of NumPy arrays.
Other methods of ImageDataGenerator, such as flow() and flow_from_dataframe(), are covered in the Keras documentation.
2. image_dataset_from_directory
Let’s check out how to load data using tf.keras.preprocessing.image_dataset_from_directory,
# Loading in the data
train_data_10_percent_idfd = tf.keras.preprocessing.image_dataset_from_directory(directory=train_dir_10_percent,
label_mode='categorical',
image_size=(224,224),
batch_size=32,
shuffle=True)
test_data_1_percent_idfd = tf.keras.preprocessing.image_dataset_from_directory(directory=test_dir_1_percent,
label_mode='categorical',
image_size=(224,224),
batch_size=32)
Checking the parameters passed to image_dataset_from_directory:
- directory - The directory the images are picked up from.
- label_mode - This is similar to class_mode in ImageDataGenerator: binary for binary classification and categorical for multi-class classification.
- image_size - The size the images are resized to after being loaded from the directory.
- batch_size - The images are loaded in batches of 32. Loading all of the train or test images at once might not fit into the machine's memory, so loading the data in batches keeps memory usage manageable. 32 is a good default batch size.
We can check out the data using the snippet below; we get an image batch of shape (batch_size, image_height, image_width, channels).
train_1s_idfd = train_data_10_percent_idfd.take(1)
for i, l in train_1s_idfd:
    print(f"Shape of the image batch: {i.shape}")
    print(f"Shape of the label batch: {l.shape}")
    print(f"Image: {i[0]}")
    print(f"Label: {l[0]}")
Training time: This method of loading data gives the second lowest training time of the methods discussed here. For 29 classes with 300 images per class, training on a GPU (Tesla T4) took 1min 55s with a step duration of 83-85ms.
Return Type: image_dataset_from_directory returns a tf.data.Dataset, which is an advantage over ImageDataGenerator.
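Because it is a tf.data.Dataset, the standard tf.data transformations can be chained onto it; for example, prefetching (discussed in the next section) can be added in one line:
# tf.data transformations can be chained directly onto the returned dataset
train_data_10_percent_idfd = train_data_10_percent_idfd.prefetch(buffer_size=tf.data.AUTOTUNE)
test_data_1_percent_idfd = test_data_1_percent_idfd.prefetch(buffer_size=tf.data.AUTOTUNE)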
3. tf.data API
The first two methods are fairly naive data loading methods, or input pipelines. One big consideration for any ML practitioner is to reduce experimentation time. Without a proper input pipeline, a huge amount of data (say 1000 images per class across 101 classes) will increase the training time massively.
The tf.data API offers methods with which we can set up a better performing pipeline.
The methods and code used are based on this documentation.
To load data using the tf.data API, we need functions to preprocess the images. Why these functions are needed will become clear as we go.
3.1 Create POSIX path using pathlib
from pathlib import Path
train_dir_path_10_percent = Path("/content/asl_10_percent")
test_dir_path_1_percent = Path("/content/asl_10_percent_test")
3.2 Get the image count
image_count_train = len(list(train_dir_path_10_percent.glob('*/*.jpg')))
image_count_test = len(list(test_dir_path_1_percent.glob('*/*.jpg')))
3.3 Getting the list of files and shuffling it
# Getting the list of files for train
train_ds = tf.data.Dataset.list_files(file_pattern=str(train_dir_path_10_percent/'*/*'), shuffle=False)
train_ds = train_ds.shuffle(buffer_size=1000, reshuffle_each_iteration=False)
# Getting the list of files for test
test_ds = tf.data.Dataset.list_files(file_pattern=str(test_dir_path_1_percent/'*/*'), shuffle=False)
test_ds = test_ds.shuffle(buffer_size=1000, reshuffle_each_iteration=False)
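The get_label() function in 3.4 below compares folder names against a class_names array that isn't defined in the snippets shown here; a minimal sketch of how it can be derived from the training directory (following the tf.data documentation this section is based on):
import numpy as np

# Class names are the sub-directory names under the training directory
class_names = np.array(sorted([item.name for item in train_dir_path_10_percent.glob('*') if item.is_dir()]))
print(class_names)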
3.4 Function to return one_hot encoded labels
import os

def get_label(file_path):
    # Split the path into its components; the class name is the parent folder
    parts = tf.strings.split(file_path, os.path.sep)
    # One-hot encode the label by comparing the folder name against class_names
    one_hot = parts[-2] == class_names
    return tf.one_hot(tf.argmax(one_hot), 29)  # 29 classes in this dataset
3.5 Function to decode image
def decode_image(img):
# Convert the compressed string to a uint8 tensor
img = tf.image.decode_jpeg(img)
# Resize the image to desired shape
return tf.image.resize(img, [img_size, img_size])
3.6 Function returns image and one_hot encoded label using 3.4 and 3.5
def process_path(file_path):
label = get_label(file_path)
img = tf.io.read_file(file_path)
img = decode_image(img)
return img, label
Now we’re ready to load the data; let’s write the code first and explain it afterwards.
# Load in the data
img_size = 224
# Train
train_data_10_percent_da = train_ds.map(map_func=process_path,
num_parallel_calls=tf.data.AUTOTUNE)
train_data_10_percent_da = train_data_10_percent_da.shuffle(buffer_size=1000).batch(batch_size=32).prefetch(buffer_size=tf.data.AUTOTUNE)
# Test
test_data_1_percent_da = test_ds.map(map_func=process_path,
num_parallel_calls=tf.data.AUTOTUNE)
test_data_1_percent_da = test_data_1_percent_da.shuffle(buffer_size=1000).batch(batch_size=32).prefetch(buffer_size=tf.data.AUTOTUNE)
- map() - Maps the preprocessing function over the list of file paths, returning (img, label) pairs.
  a. map_func - The preprocessing function to apply to each element.
  b. num_parallel_calls - Controls how many elements are processed in parallel; we use tf.data.AUTOTUNE to let TensorFlow pick a good value.
- Once map() is completed, shuffle() and batch() are applied on top of it.
  a. buffer_size - Ideally, the buffer size would be the length of our training dataset. But if that is huge, like 100,000 or 1,000,000 samples, it won't fit into memory, so a buffer_size of 1000 to 1500 is a better choice.
- prefetch() - This is the most important step for improving the training time. While the current batch is being consumed, it prefetches the data for the next batch, reducing the loading time and in turn the training time compared to the other methods.
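As a quick sanity check, we can pull a single batch from the finished pipeline, similar to what we did for the other loaders:
# Inspect one batch from the tf.data pipeline
for images, labels in train_data_10_percent_da.take(1):
    print(f"Image batch shape: {images.shape}")  # (32, 224, 224, 3)
    print(f"Label batch shape: {labels.shape}")  # (32, 29)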
Training time: This method of loading data gives the lowest training time of the methods discussed here. For 29 classes with 300 images per class, training on a GPU (Tesla T4) took 1min 13s with a step duration of 50ms.
Return Type: The tf.data API pipeline returns a tf.data.Dataset.
In the end, it’s better to use the tf.data API for larger experiments and the other methods for smaller experiments.
Summary:
The data loading method affects the training metrics too, as can be seen in the table below.
GPU: Tesla T4
| Loading Method | Training Time | Step Time | loss | accuracy | val_loss | val_accuracy |
|---|---|---|---|---|---|---|
| ImageDataGenerator | 2min 9s | 71-74ms | 0.6653 | 0.8026 | 6.4802 | 0.2034 |
| ImageDataGenerator with Data Augmentation | 7min 53s | 345-351ms | 3.2899 | 0.1245 | 13.2588 | 0.0345 |
| image_dataset_from_directory | 1min 55s | 83-85ms | 23.9092 | 0.8178 | 437.2661 | 0.1862 |
| tf.data API | 1min 13s | 50ms | 0.0000 | 1.0000 | 13804.7002 | 0.0000 |
Source Notebook - The source notebook explores more than just loading data using TensorFlow; have fun reading… 😁