An autoencoder is a neural network trained to learn efficient, compact representations of input data. It compresses (encodes) the input, then reconstructs (decodes) the original input from that compressed representation. The autoencoder is trained to minimize reconstruction error, using the original input itself as the ground truth.
Autoencoder architectures typically introduce some form of bottleneck between the encoder and the decoder:
As data traverses the encoder network, each layer's capacity is progressively reduced. This forces the network to retain only the most important patterns hidden in the input data, known as the latent variables, the latent space, or the bottleneck.
The bottleneck, located right after the encoder, serves as an extra layer that compresses the extracted features into a smaller vector representation. The purpose is to force the decoder to learn more complex mappings so that it can accurately reconstruct the original input despite now having less information.
A convolutional autoencoder is designed specifically for images and other data with spatial structure, and is trained through unsupervised learning. It reduces the size of images for storage or transmission without losing important details.
The architecture of a convolutional autoencoder has a symmetric encoder-decoder structure. The encoder and decoder are built using convolutional neural networks (CNNs), which are well-suited for processing spatial data. The convolution layers replace the fully connected layers in a typical neural network in order to capture spatial hierarchies in data more effectively and scale better with larger input dimensions, particularly useful for images.
The early convolutional layers capture simple patterns such as edges and colors in an image. As data flows through deeper layers, the model identifies more complex features, such as shapes, textures, and even entire objects. Each convolutional layer builds on the patterns detected by the previous one, creating a rich yet compressed feature representation of the image. This dimensionality reduction is non-linear, unlike traditional PCA (Principal Component Analysis), which is constrained to linear transformations.
By the time the data reaches the last layer of an encoder, it is transformed from a 2D image into a compact 1D vector that captures the most important information. The smaller the vector representation passed to the decoder, the fewer image features the decoder has access to and the less detailed its reconstructions will be.
For an input image of size WxW, a filter kernel of size KxK, padding P, and stride S, the output edge length is
\begin{equation} Out=\left\lfloor\frac{W-K+2 \times P}{S}\right\rfloor+1 \end{equation}
Autoencoders can learn efficient data representations that minimize reconstruction error with fewer bits, enabling lossy image compression. Lossy image compression is a technique that reduces an image's file size by permanently removing some of its less important data, resulting in a smaller file that can be stored, transferred, and loaded faster.
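The standard convolution output-size rule can be checked with a small helper function; this is a sketch (the function name conv_out_size is chosen here for illustration, not taken from the project code):

```python
def conv_out_size(w, k, p, s):
    """Output edge length of a square convolution: floor((W - K + 2P) / S) + 1."""
    return (w - k + 2 * p) // s + 1

# 96x96 input, 3x3 kernel, padding 1, stride 2 -> 48
print(conv_out_size(96, 3, 1, 2))  # 48
print(conv_out_size(48, 3, 1, 2))  # 24
print(conv_out_size(24, 3, 1, 2))  # 12
```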
The purpose of compressing images is to reduce storage requirements and allow images to be transferred and loaded faster.
Autoencoders with a significant dimensionality reduction in the latent space can lose fine details and high-frequency information during encoding. This information may not be recoverable during decoding, leading to a lossy compression that impacts image quality in applications requiring high fidelity.
Autoencoders can be sensitive to noise or variations in the input data that differ from the training set, potentially affecting their compression and reconstruction performance. Adjusting the latent space size can minimize the irrelevant noise in the input data.
The effectiveness of an autoencoder for image compression is heavily reliant on the quality and quantity of the training data. With the possibility of overfitting, the autoencoder model could fail to reconstruct unseen images.
Training deep autoencoders, especially with large datasets and complex architectures, can be computationally expensive and require significant resources. Adjusting the size of latent space requires a balance between computational cost and loss of fine details.
This project applies an autoencoder consisting of an encoder with three halvings of the image size followed by a linear layer that outputs a latent space of 200 elements, and a decoder that symmetrically reverses the layers back to the original image size. See the figures below. For example, the input image is 96x96 pixels with 3 color channels. Applying a filter kernel of size 3x3, padding of 1, and stride of 2, the output image edge is as follows.
\begin{equation} Out=\left\lfloor\frac{W-K+2 \times P}{S}\right\rfloor+1=\left\lfloor\frac{96-3+2 \times 1}{2}\right\rfloor+1=48 \end{equation} The reduction factor from 96 to 48 is 2. Applying the reduction 3 times, the image edge before the flatten layer becomes 96/2/2/2 = 12.
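This chain of reductions can be verified directly in PyTorch by stacking three stride-2 convolutions with the stated kernel and padding; the channel counts (48, 96, 192) follow the encoder described below:

```python
import torch
import torch.nn as nn

# Three stride-2, 3x3, padding-1 convolutions: 96 -> 48 -> 24 -> 12
net = nn.Sequential(
    nn.Conv2d(3, 48, 3, padding=1, stride=2),    # (96, 96) -> (48, 48)
    nn.Conv2d(48, 96, 3, padding=1, stride=2),   # (48, 48) -> (24, 24)
    nn.Conv2d(96, 192, 3, padding=1, stride=2),  # (24, 24) -> (12, 12)
)
x = torch.randn(1, 3, 96, 96)
print(net(x).shape)  # torch.Size([1, 192, 12, 12])
```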
The dataset is STL10, an image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. There are 10 classes of objects, with 500 training and 800 test images per class.
References:
STL10 source: https://cs.stanford.edu/~acoates/stl10/
PyTorch convolution layers: https://docs.pytorch.org/docs/stable/nn.html#convolution-layers
imgEdge=96       # input image edge length (96x96 pixels)
QtyColor=3       # number of color channels
outWtSize=48     # base channel count of the first convolution layer
QtyFold=3        # number of stride-2 reduction stages
StepSize=2       # stride of each reduction
TotalReduction=int(StepSize**QtyFold)    # 2^3 = 8
HiddenSize=int(imgEdge/TotalReduction)   # 96/8 = 12, edge length before flattening
# Defining the Encoder
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, in_channels=3, out_channels=outWtSize, latent_dim=200, act_fn=nn.ReLU()):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),  # (96, 96)
            act_fn,
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            act_fn,
            nn.Conv2d(out_channels, 2 * out_channels, 3, padding=1, stride=2),  # (48, 48)
            act_fn,
            nn.Conv2d(2 * out_channels, 2 * out_channels, 3, padding=1),
            act_fn,
            nn.Conv2d(2 * out_channels, 4 * out_channels, 3, padding=1, stride=2),  # (24, 24)
            act_fn,
            nn.Conv2d(4 * out_channels, 4 * out_channels, 3, padding=1),
            act_fn,
            nn.Conv2d(4 * out_channels, 8 * out_channels, 3, padding=1, stride=2),  # (12, 12)
            act_fn,
            nn.Conv2d(8 * out_channels, 8 * out_channels, 3, padding=1),
            act_fn,
            nn.Flatten(),
            nn.Linear(TotalReduction * out_channels * HiddenSize * HiddenSize, latent_dim),
            act_fn
        )

    def forward(self, x):
        x = x.view(-1, QtyColor, imgEdge, imgEdge)
        output = self.net(x)
        return output
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 48, 96, 96] 1,344
ReLU-2 [-1, 48, 96, 96] 0
Conv2d-3 [-1, 48, 96, 96] 20,784
ReLU-4 [-1, 48, 96, 96] 0
Conv2d-5 [-1, 96, 48, 48] 41,568
ReLU-6 [-1, 96, 48, 48] 0
Conv2d-7 [-1, 96, 48, 48] 83,040
ReLU-8 [-1, 96, 48, 48] 0
Conv2d-9 [-1, 192, 24, 24] 166,080
ReLU-10 [-1, 192, 24, 24] 0
Conv2d-11 [-1, 192, 24, 24] 331,968
ReLU-12 [-1, 192, 24, 24] 0
Conv2d-13 [-1, 384, 12, 12] 663,936
ReLU-14 [-1, 384, 12, 12] 0
Conv2d-15 [-1, 384, 12, 12] 1,327,488
ReLU-16 [-1, 384, 12, 12] 0
Flatten-17 [-1, 55296] 0
Linear-18 [-1, 200] 11,059,400
ReLU-19 [-1, 200] 0
================================================================
Total params: 13,695,608
Trainable params: 13,695,608
Non-trainable params: 0
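The Flatten-17 entry in the summary can be reproduced with a few lines of arithmetic: three stride-2 stages halve the edge three times, and the channel count grows by the same factor of 8 over the base of 48 channels.

```python
# Bookkeeping behind Flatten-17 in the encoder summary
imgEdge, outWtSize, QtyFold = 96, 48, 3
edge = imgEdge // (2 ** QtyFold)       # 96 / 8 = 12
channels = (2 ** QtyFold) * outWtSize  # 8 * 48 = 384
print(edge, channels, channels * edge * edge)  # 12 384 55296
```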
# Defining the Decoder
class Decoder(nn.Module):
    def __init__(self, in_channels=3, out_channels=outWtSize, latent_dim=200, act_fn=nn.ReLU()):
        super().__init__()
        self.out_channels = out_channels
        self.linear = nn.Sequential(
            nn.Linear(latent_dim, TotalReduction * out_channels * HiddenSize * HiddenSize),
            act_fn
        )
        self.conv = nn.Sequential(
            nn.ConvTranspose2d(8 * out_channels, 8 * out_channels, 3, padding=1),  # (12, 12)
            act_fn,
            nn.ConvTranspose2d(8 * out_channels, 4 * out_channels, 3, padding=1, stride=2, output_padding=1),  # (24, 24)
            act_fn,
            nn.ConvTranspose2d(4 * out_channels, 4 * out_channels, 3, padding=1),
            act_fn,
            nn.ConvTranspose2d(4 * out_channels, 2 * out_channels, 3, padding=1, stride=2, output_padding=1),  # (48, 48)
            act_fn,
            nn.ConvTranspose2d(2 * out_channels, 2 * out_channels, 3, padding=1),
            act_fn,
            nn.ConvTranspose2d(2 * out_channels, out_channels, 3, padding=1, stride=2, output_padding=1),  # (96, 96)
            act_fn,
            nn.ConvTranspose2d(out_channels, out_channels, 3, padding=1),
            act_fn,
            nn.ConvTranspose2d(out_channels, in_channels, 3, padding=1)
        )

    def forward(self, x):
        output = self.linear(x)
        output = output.view(-1, TotalReduction * self.out_channels, HiddenSize, HiddenSize)
        output = self.conv(output)
        return output
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Linear-1 [-1, 1, 55296] 11,114,496
ReLU-2 [-1, 1, 55296] 0
ReLU-3 [-1, 1, 55296] 0
ConvTranspose2d-4 [-1, 384, 12, 12] 1,327,488
ReLU-5 [-1, 384, 12, 12] 0
ReLU-6 [-1, 384, 12, 12] 0
ConvTranspose2d-7 [-1, 192, 24, 24] 663,744
ReLU-8 [-1, 192, 24, 24] 0
ReLU-9 [-1, 192, 24, 24] 0
ConvTranspose2d-10 [-1, 192, 24, 24] 331,968
ReLU-11 [-1, 192, 24, 24] 0
ReLU-12 [-1, 192, 24, 24] 0
ConvTranspose2d-13 [-1, 96, 48, 48] 165,984
ReLU-14 [-1, 96, 48, 48] 0
ReLU-15 [-1, 96, 48, 48] 0
ConvTranspose2d-16 [-1, 96, 48, 48] 83,040
ReLU-17 [-1, 96, 48, 48] 0
ReLU-18 [-1, 96, 48, 48] 0
ConvTranspose2d-19 [-1, 48, 96, 96] 41,520
ReLU-20 [-1, 48, 96, 96] 0
ReLU-21 [-1, 48, 96, 96] 0
ConvTranspose2d-22 [-1, 48, 96, 96] 20,784
ReLU-23 [-1, 48, 96, 96] 0
ReLU-24 [-1, 48, 96, 96] 0
ConvTranspose2d-25 [-1, 3, 96, 96] 1,299
================================================================
Total params: 13,750,323
Trainable params: 13,750,323
Non-trainable params: 0
----------------------------------------------------------------
# Defining the Autoencoder
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class Autoencoder(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder.to(device)
        self.decoder = decoder.to(device)

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded
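The "Param #" column in the summary tables above can be cross-checked with a small helper that sums trainable parameters. This is a sketch (the helper name count_params is chosen here for illustration); applied to a Linear(55296, 200) layer it reproduces the Linear-18 entry of the encoder summary:

```python
import torch.nn as nn

def count_params(m: nn.Module) -> int:
    """Sum of trainable parameter elements, matching torchsummary's Param # column."""
    return sum(p.numel() for p in m.parameters() if p.requires_grad)

# Weights plus biases of Linear(55296, 200): 55296 * 200 + 200
layer = nn.Linear(55296, 200)
print(count_params(layer))  # 11059400
```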
In this project, we will use torchvision utilities such as datasets and transforms to download the data and perform the data transformations.
import torchvision.transforms as transforms
import torchvision.datasets as Datasets
This project will also demonstrate the effect of normalization on the input data. Therefore, two types of transformation pipelines are used.
The following command is used to download the original training set of the STL10 datasets.
training_set = Datasets.STL10(root='./', split='train', download=True, transform=transforms.ToTensor())
In some cases, normalizing the input data can help training by rescaling each channel to a common, zero-centered range. With per-channel mean \(\mu\) and standard deviation \(\sigma\), each pixel value x is transformed as
\begin{equation} x'=\frac{x-\mu}{\sigma} \end{equation}
With mean 0.5 and standard deviation 0.5 for every channel, pixel values in [0, 1] are mapped to [-1, 1].
Normalization can also be detrimental for image processing: it shifts the image intensity range and can degrade perceived image quality. In this project, we will demonstrate the impact of normalization by observing the learning loss.
transform_pipeline = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
training_set = Datasets.STL10(root='./', split='train', download=True, transform=transform_pipeline)
References:
Torchvision Transforms: https://docs.pytorch.org/vision/0.9/transforms.html
Torchvision Datasets: https://docs.pytorch.org/vision/stable/datasets.html
There are 10 classes of images in STL10.
class_names=dict(zip(range(10), training_set.classes))
The order of color channels and pixel maps in an image tensor (3x96x96, channels first) differs from the layout expected for image display in Python (96x96x3, channels last). Therefore, the matrix must be reshaped in order to visualize an image. This step is only required for image visualization; it is not required during the machine learning process.
img = img.numpy().transpose((1, 2, 0))
The normalized image tends to appear darker because the original pixel map does not contain many bright cells (values toward 255); after normalization, the pixel values are centered toward darker values.
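To display a normalized image with its original brightness, the normalization can be undone before the channels-last transpose. A minimal sketch, assuming the mean/std of 0.5 used in the pipeline above:

```python
import numpy as np
import torch

# Undo Normalize(mean=0.5, std=0.5) before display: x = x_norm * std + mean
img = torch.tensor([-1.0, 0.0, 1.0]).view(3, 1, 1)  # one normalized pixel per channel
restored = img * 0.5 + 0.5                          # back to the [0, 1] display range
disp = restored.numpy().transpose((1, 2, 0))        # channels-last for plotting
print(disp.shape)  # (1, 1, 3)
```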
A DataLoader wraps an iterable around the Dataset to enable convenient access to the data. The Dataset retrieves a dataset's features and labels one sample at a time. While training a model, we typically want to pass samples in "minibatches" and reshuffle the data at every epoch to reduce model overfitting. There will be three dataloaders, one each for training, validation, and test.
from torch.utils.data import Dataset, DataLoader, TensorDataset, Subset
dataloader_tr=DataLoader(training_set, batch_size=64, shuffle=True)
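The validation and test dataloaders (dataloader_val, dataloader_te, used later in the training call) can be built by splitting the held-out images with random_split. A sketch with a tiny synthetic stand-in dataset so it runs offline; in the project, Datasets.STL10(root='./', split='test', ...) would be used instead:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Tiny stand-in for the STL10 test split (shape 3x96x96 per image)
images = torch.randn(160, 3, 96, 96)
labels = torch.zeros(160, dtype=torch.long)
test_set = TensorDataset(images, labels)

# Split the held-out images evenly into validation and test subsets
val_set, te_set = random_split(test_set, [80, 80],
                               generator=torch.Generator().manual_seed(0))
dataloader_val = DataLoader(val_set, batch_size=64)
dataloader_te = DataLoader(te_set, batch_size=64)
print(len(dataloader_val), len(dataloader_te))  # 2 2
```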
Reference:
DataLoader: https://docs.pytorch.org/tutorials/beginner/basics/data_tutorial.html
The network is an autoencoder which is defined in a paragraph above.
An optimizer is selected for optimizing the network parameters at each training epoch.
network = Autoencoder(Encoder(), Decoder())
optimizer = torch.optim.Adam(network.parameters(), lr=1e-3)
It is required to initialize the network's weights. The function init_weights is defined first, then applied to every submodule:
def init_weights(module):
    if isinstance(module, nn.Conv2d):
        torch.nn.init.xavier_uniform_(module.weight)
        module.bias.data.fill_(0.01)
    elif isinstance(module, nn.Linear):
        torch.nn.init.xavier_uniform_(module.weight)
        module.bias.data.fill_(0.01)

network.apply(init_weights)
Set the module in training mode.
network.train()
Reference:
Training Mode: https://docs.pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.train
Use DataLoader to train the network in batches:
At each training epoch, the dataloader loads the data samples in batches. For example, with 5000 training samples in the dataloader and a batch size of 64, there are ceil(5000/64) = 79 iterations of training at each epoch.
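The batch count per epoch can be confirmed with a stand-in dataset of 5000 samples (the DataLoader reports its number of batches via len()):

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# 5000 samples at batch size 64: the last, partial batch still counts
ds = TensorDataset(torch.zeros(5000, 1))
loader = DataLoader(ds, batch_size=64)
print(len(loader), math.ceil(5000 / 64))  # 79 79
```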
Use tqdm to show training progress:
tqdm is a Python library that displays smart progress meters for loops and iterations, commonly used in PyTorch to monitor training and evaluation loops. It integrates easily into PyTorch training loops by wrapping tqdm() around any iterable, such as a DataLoader.
Use a loss function to calculate the training loss:
The loss function calculates the loss between the original image and the reconstructed image. In this project, MSE loss is used to represent the training loss.
References:
tqdm: https://tqdm.github.io/
Mean Square Error: https://docs.pytorch.org/docs/stable/generated/torch.nn.MSELoss.html
import numpy as np
from tqdm.notebook import tqdm

train_losses = []
for images, _ in tqdm(train_loader):
    # Zeroing gradients
    optimizer.zero_grad()
    # Reconstructing images
    output = network(images)
    # Computing loss
    loss = loss_function(output, images.view(-1, QtyColor, imgEdge, imgEdge))
    # Calculating gradients
    loss.backward()
    # Optimizing weights
    optimizer.step()
    # Collect the loss value in the list train_losses[]
    train_losses.append(loss.item())

# Take the average loss value over the iterations of training
# and collect this average value at each epoch
loss_trmean = np.mean(train_losses)
log_loss['avg_training_loss_perEpoch'].append(loss_trmean)
At each epoch, we apply the trained network to the test images in order to obtain loss values for samples that were not used to train the network. The same routine can be applied to the validation images.
# ------------
# TEST
# ------------
test_losses = []
for test_images, _ in tqdm(test_loader):
    # Disable gradient calculation for evaluation
    with torch.no_grad():
        # Obtain the decoded image from the trained network
        output = network(test_images)
        # Compare the original image with the decoded image to calculate the loss value
        test_loss = loss_function(output, test_images.view(-1, QtyColor, imgEdge, imgEdge))
        test_losses.append(test_loss.item())

loss_testmean = np.mean(test_losses)
log_loss['avg_test_loss_perEpoch'].append(loss_testmean)
epochQty=121
model = Autoencoder(Encoder(), Decoder())
log_loss = model.train(nn.MSELoss(), epochs=epochQty, batch_size=64, train_loader=dataloader_tr, val_loader=dataloader_val, test_loader=dataloader_te)
There are four training configurations:
The chart below shows:
# The dropout layers were added to the encoder as shown below:
self.net = nn.Sequential(
    nn.Conv2d(in_channels, out_channels, 3, padding=1),  # (96, 96)
    act_fn,
    nn.Conv2d(out_channels, out_channels, 3, padding=1),
    act_fn,
    # Dropout layer
    nn.Dropout2d(p=0.2),
    nn.Conv2d(out_channels, 2 * out_channels, 3, padding=1, stride=2),  # (48, 48)
    act_fn,
    nn.Conv2d(2 * out_channels, 2 * out_channels, 3, padding=1),
    act_fn,
    # Dropout layer
    nn.Dropout2d(p=0.2),
    nn.Conv2d(2 * out_channels, 4 * out_channels, 3, padding=1, stride=2),  # (24, 24)
    act_fn,
    nn.Conv2d(4 * out_channels, 4 * out_channels, 3, padding=1),
    act_fn,
    nn.Conv2d(4 * out_channels, 8 * out_channels, 3, padding=1, stride=2),  # (12, 12)
    act_fn,
    nn.Conv2d(8 * out_channels, 8 * out_channels, 3, padding=1),
    act_fn,
    nn.Flatten(),
    nn.Linear(TotalReduction * out_channels * HiddenSize * HiddenSize, latent_dim),
    act_fn
)
# The dropout layers were symmetrically added to the decoder below:
self.conv = nn.Sequential(
    nn.ConvTranspose2d(8 * out_channels, 8 * out_channels, 3, padding=1),  # (12, 12)
    act_fn,
    nn.ConvTranspose2d(8 * out_channels, 4 * out_channels, 3, padding=1, stride=2, output_padding=1),  # (24, 24)
    act_fn,
    nn.ConvTranspose2d(4 * out_channels, 4 * out_channels, 3, padding=1),
    act_fn,
    # Dropout layer
    nn.Dropout2d(p=0.2),
    nn.ConvTranspose2d(4 * out_channels, 2 * out_channels, 3, padding=1, stride=2, output_padding=1),  # (48, 48)
    act_fn,
    nn.ConvTranspose2d(2 * out_channels, 2 * out_channels, 3, padding=1),
    act_fn,
    # Dropout layer
    nn.Dropout2d(p=0.2),
    nn.ConvTranspose2d(2 * out_channels, out_channels, 3, padding=1, stride=2, output_padding=1),  # (96, 96)
    act_fn,
    nn.ConvTranspose2d(out_channels, out_channels, 3, padding=1),
    act_fn,
    nn.ConvTranspose2d(out_channels, in_channels, 3, padding=1)
)