Convolutional Autoencoder for Image Compression¶

Autoencoder¶

            An autoencoder is a neural network trained to capture efficient, compact representations of input data. It compresses (encodes) the input, then reconstructs (decodes) the original input from that compressed representation. The autoencoder is trained to minimize reconstruction error, using the original input itself as the ground truth.

            Autoencoder architectures typically introduce some form of bottleneck between the encoder and the decoder:

            As data traverses the encoder network, each layer progressively reduces the data's capacity. This forces the network to learn only the most important patterns hidden in the input data, known as the latent variables, the latent space, or the bottleneck.

            The bottleneck, located immediately after the encoder, is an extra layer that compresses the extracted features into a smaller vector representation. Because the decoder now receives less information, it is forced to learn more complex mappings in order to reconstruct the original input accurately.
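To make the encoder-bottleneck-decoder idea concrete, here is a minimal fully connected sketch (not the convolutional model used later in this project; the 784-element input and 32-element bottleneck are arbitrary choices for illustration):

```python
import torch
from torch import nn

class TinyAutoencoder(nn.Module):
    """Minimal autoencoder sketch: encoder shrinks to a bottleneck, decoder mirrors it back."""
    def __init__(self, input_dim=784, bottleneck_dim=32):
        super().__init__()
        # Encoder: progressively reduce the representation size
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, bottleneck_dim), nn.ReLU(),  # the bottleneck
        )
        # Decoder: symmetrically expand back to the input size
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        latent = self.encoder(x)      # compressed representation
        return self.decoder(latent)   # reconstruction

model = TinyAutoencoder()
x = torch.rand(4, 784)
recon = model(x)
print(recon.shape)  # torch.Size([4, 784]) -- same shape as the input
```

The reconstruction has the same shape as the input, so the reconstruction error can be computed element-wise against the original.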

Convolutional Autoencoder for image compression¶

            A convolutional autoencoder is designed specifically for images and other data with spatial structure, and is trained through unsupervised learning. It reduces the size of images for storage or transmission without losing important details.

            The architecture of a convolutional autoencoder is a symmetric encoder-decoder structure. The encoder and decoder are built from convolutional neural networks (CNNs), which are well suited to processing spatial data. The convolutional layers replace the fully connected layers of a typical neural network in order to capture spatial hierarchies in the data more effectively and to scale better with larger input dimensions, which is particularly useful for images.

            The early convolutional layers capture simple patterns such as edges and colors. As data flows through deeper layers, the model identifies more complex features, such as shapes, textures, and even entire objects. Each convolutional layer builds on the patterns detected by the previous one, creating a rich, compressed feature representation of the image. This dimensionality reduction is non-linear, unlike traditional PCA (Principal Component Analysis), which is constrained to linear transformations.

            By the time the data reaches the last layer of the encoder, it has been transformed from a 2D image into a compact 1D vector that captures the most important information. The smaller the vector representation passed to the decoder, the fewer image features the decoder has access to, and the less detailed its reconstructions will be.
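A quick back-of-the-envelope calculation illustrates the degree of compression in this project's setup (a 96x96 RGB image reduced to a 200-element latent vector; this counts elements only and ignores dtype and bit-width):

```python
# Element-count compression ratio: 96x96 RGB image vs. 200-element latent vector
input_elements = 96 * 96 * 3      # 27,648 values per image
latent_elements = 200             # bottleneck vector size
ratio = input_elements / latent_elements
print(input_elements, round(ratio, 2))  # 27648 138.24
```

So the latent vector holds roughly 138 times fewer values than the raw image, which is what makes the reconstruction lossy.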

Convolution Image Size Reduction¶

For an input image of size WxW, a filter kernel of size KxK, padding P, and stride S, the output edge length is

\begin{equation} Out=\left\lfloor \frac{W-K+2 \times P}{S} \right\rfloor + 1 \end{equation}
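The formula can be written as a small helper function. Chaining it three times reproduces the 96 → 48 → 24 → 12 reduction used later in this project:

```python
def conv_out_size(w, k, p, s):
    """Output edge length of a convolution: floor((W - K + 2P) / S) + 1."""
    return (w - k + 2 * p) // s + 1

# Three stride-2 convolutions with a 3x3 kernel and padding 1
edge = 96
for _ in range(3):
    edge = conv_out_size(edge, k=3, p=1, s=2)
    print(edge)  # 48, then 24, then 12
```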

Image Compression¶

            Autoencoders can learn efficient data representations that minimize reconstruction error with fewer bits, enabling lossy image compression. Lossy image compression is a technique that reduces an image's file size by permanently removing some of its less important data, resulting in a smaller file that can be stored, transferred, and loaded faster.

            The purpose of compressing images is to

  • Reduce bandwidth for web and mobile applications
  • Develop codecs tailored for specific image domains such as MRI image compression, leveraging their ability to learn compact, low-dimensional representations of images while preserving diagnostic information
  • Apply the downsized input features for supervised learning tasks
  • Visualize complex datasets in lower-dimensional spaces (2D or 3D embeddings)
  • Accelerate downstream classification or clustering tasks

Disadvantage of autoencoder¶

Loss of Fine Details:¶

            Autoencoders with significant dimensionality reduction in the latent space can lose fine details and high-frequency information during encoding. This information may not be recoverable during decoding, leading to lossy compression that impacts image quality in applications requiring high fidelity.

Sensitivity to Input Variations:¶

            Autoencoders can be sensitive to noise or variations in input data that differ from the training set, which can degrade their compression and reconstruction performance. Adjusting the latent space size can help suppress irrelevant noise in the input data.

Limited Generalization:¶

            The effectiveness of an autoencoder for image compression is heavily reliant on the quality and quantity of the training data. If the model overfits, it may fail to reconstruct unseen images. Two common mitigations are:

  • Apply dropout layers in the network to randomly deactivate a subset of neurons during training. This forces the autoencoder to learn robust, general features rather than relying on specific activations, reducing the risk of overfitting.
  • Apply data augmentation techniques by introducing variations of the input data, such as flipping, rotating, scaling, or adding noise. These variations increase the diversity of the training set, helping the model to generalize better by learning to reconstruct data under slightly different conditions.

Computational Cost:¶

            Training deep autoencoders, especially with large datasets and complex architectures, can be computationally expensive and require significant resources. Choosing the size of the latent space requires balancing computational cost against loss of fine detail.

Example of an Autoencoder for Image Compression¶

            This project applies an autoencoder consisting of an encoder with 3 image-size reductions and a linear layer that outputs a latent space of 200 elements, and a decoder that symmetrically reverses the layers back to the original image size. See the figures below. For example, the input image is 96x96 pixels with 3 color channels. Applying a filter kernel of 3x3, padding of 1, and stride of 2, the output edge length is as follows.

\begin{equation} out=\left\lfloor \frac{W-K+2 \times P}{S} \right\rfloor + 1=\left\lfloor \frac{96-3+2 \times 1}{2} \right\rfloor + 1=48 \end{equation}

            Each stride-2 convolution halves the edge length, from 96 to 48. Applying the reduction 3 times, the image edge before the flatten layer becomes 96/2/2/2 = 12.

            The dataset is STL10, an image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. It contains 10 object classes, with 500 training and 800 test images per class.

References:
STL10 source: https://cs.stanford.edu/~acoates/stl10/
Pytorch convolution layer: https://docs.pytorch.org/docs/stable/nn.html#convolution-layers

imgEdge=96         # input image edge length (96x96 pixels)
QtyColor=3         # number of color channels
outWtSize=48       # base channel count of the first convolution layer
QtyFold=3          # number of stride-2 size reductions
StepSize=2         # stride of each reducing convolution
TotalReduction=int(StepSize**QtyFold)      # 2**3 = 8
HiddenSize=int(imgEdge/TotalReduction)     # 96/8 = 12, image edge before the flatten layer

# Defining the Encoder
class Encoder(nn.Module):
    def __init__(self, in_channels=3, out_channels=outWtSize, latent_dim=200, act_fn=nn.ReLU()):
        super().__init__()

        self.net = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),  # (96, 96)
            act_fn,
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            act_fn,
            nn.Conv2d(out_channels, 2 * out_channels, 3, padding=1, stride=2),  # (48, 48)
            act_fn,
            nn.Conv2d(2 * out_channels, 2 * out_channels, 3, padding=1),
            act_fn,
            nn.Conv2d(2 * out_channels, 4 * out_channels, 3, padding=1, stride=2),  # (24,24)
            act_fn,
            nn.Conv2d(4 * out_channels, 4 * out_channels, 3, padding=1),
            act_fn,
            nn.Conv2d(4 * out_channels, 8 * out_channels, 3, padding=1, stride=2),  # (12,12)
            act_fn,
            nn.Conv2d(8 * out_channels, 8 * out_channels, 3, padding=1),
            act_fn,
            nn.Flatten(),
            nn.Linear(TotalReduction * out_channels * HiddenSize * HiddenSize, latent_dim),
            act_fn
        )

    def forward(self, x):
        x = x.view(-1, QtyColor, imgEdge, imgEdge)
        output = self.net(x)
        return output
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1           [-1, 48, 96, 96]           1,344
              ReLU-2           [-1, 48, 96, 96]               0
            Conv2d-3           [-1, 48, 96, 96]          20,784
              ReLU-4           [-1, 48, 96, 96]               0
            Conv2d-5           [-1, 96, 48, 48]          41,568
              ReLU-6           [-1, 96, 48, 48]               0
            Conv2d-7           [-1, 96, 48, 48]          83,040
              ReLU-8           [-1, 96, 48, 48]               0
            Conv2d-9          [-1, 192, 24, 24]         166,080
             ReLU-10          [-1, 192, 24, 24]               0
           Conv2d-11          [-1, 192, 24, 24]         331,968
             ReLU-12          [-1, 192, 24, 24]               0
           Conv2d-13          [-1, 384, 12, 12]         663,936
             ReLU-14          [-1, 384, 12, 12]               0
           Conv2d-15          [-1, 384, 12, 12]       1,327,488
             ReLU-16          [-1, 384, 12, 12]               0
          Flatten-17                [-1, 55296]               0
           Linear-18                  [-1, 200]      11,059,400
             ReLU-19                  [-1, 200]               0
================================================================
Total params: 13,695,608
Trainable params: 13,695,608
Non-trainable params: 0
# Defining the Decoder
class Decoder(nn.Module):
    def __init__(self, in_channels=3, out_channels=outWtSize, latent_dim=200, act_fn=nn.ReLU()):
        super().__init__()

        self.out_channels = out_channels

        self.linear = nn.Sequential(
            nn.Linear(latent_dim, TotalReduction * out_channels * HiddenSize * HiddenSize),
            act_fn
        )

        self.conv = nn.Sequential(
            nn.ConvTranspose2d(8 * out_channels, 8 * out_channels, 3, padding=1),  # (12, 12)
            act_fn,
            nn.ConvTranspose2d(8* out_channels, 4 * out_channels, 3, padding=1, stride=2, output_padding=1),  # (24, 24)
            act_fn,
            nn.ConvTranspose2d(4 * out_channels, 4 * out_channels, 3, padding=1),
            act_fn,
            nn.ConvTranspose2d(4 * out_channels, 2 * out_channels, 3, padding=1, stride=2, output_padding=1),  # (48, 48)
            act_fn,
            nn.ConvTranspose2d(2 * out_channels, 2 * out_channels, 3, padding=1),
            act_fn,
            nn.ConvTranspose2d(2 * out_channels, out_channels, 3, padding=1, stride=2, output_padding=1),  # (96, 96)
            act_fn,
            nn.ConvTranspose2d(out_channels, out_channels, 3, padding=1),
            act_fn,
            nn.ConvTranspose2d(out_channels, in_channels, 3, padding=1)
        )

    def forward(self, x):
        output = self.linear(x)
        output = output.view(-1, TotalReduction * self.out_channels, HiddenSize, HiddenSize)
        output = self.conv(output)
        return output
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Linear-1             [-1, 1, 55296]      11,114,496
              ReLU-2             [-1, 1, 55296]               0
              ReLU-3             [-1, 1, 55296]               0
   ConvTranspose2d-4          [-1, 384, 12, 12]       1,327,488
              ReLU-5          [-1, 384, 12, 12]               0
              ReLU-6          [-1, 384, 12, 12]               0
   ConvTranspose2d-7          [-1, 192, 24, 24]         663,744
              ReLU-8          [-1, 192, 24, 24]               0
              ReLU-9          [-1, 192, 24, 24]               0
  ConvTranspose2d-10          [-1, 192, 24, 24]         331,968
             ReLU-11          [-1, 192, 24, 24]               0
             ReLU-12          [-1, 192, 24, 24]               0
  ConvTranspose2d-13           [-1, 96, 48, 48]         165,984
             ReLU-14           [-1, 96, 48, 48]               0
             ReLU-15           [-1, 96, 48, 48]               0
  ConvTranspose2d-16           [-1, 96, 48, 48]          83,040
             ReLU-17           [-1, 96, 48, 48]               0
             ReLU-18           [-1, 96, 48, 48]               0
  ConvTranspose2d-19           [-1, 48, 96, 96]          41,520
             ReLU-20           [-1, 48, 96, 96]               0
             ReLU-21           [-1, 48, 96, 96]               0
  ConvTranspose2d-22           [-1, 48, 96, 96]          20,784
             ReLU-23           [-1, 48, 96, 96]               0
             ReLU-24           [-1, 48, 96, 96]               0
  ConvTranspose2d-25            [-1, 3, 96, 96]           1,299
================================================================
Total params: 13,750,323
Trainable params: 13,750,323
Non-trainable params: 0
----------------------------------------------------------------
# Defining the Autoencoder
class Autoencoder(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder.to(device)
        self.decoder = decoder.to(device)

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

Download the data set¶

            In this project, we will use the torchvision modules datasets and transforms to download the data and perform the data transformations.

import torchvision.transforms as transforms
import torchvision.datasets as Datasets

            This project will also demonstrate the effect of normalization on the input data. Therefore, we will use two types of transformation pipelines.

(1) Tensor transformation only¶

            The following command downloads the original training split of the STL10 dataset.

training_set = Datasets.STL10(root='./', split='train', download=True, transform=transforms.ToTensor())

(2) Tensor transformation plus normalization¶

            In some cases, normalizing the input data helps reshape it toward a standard distribution: each pixel value is shifted by the channel mean and scaled by the channel standard deviation, so that all channels contribute on a comparable scale during training.

            The normalization formula is

\begin{equation} P_{normalized}=\frac{P-P_{mean}}{P_{std}} \end{equation}

            Normalization can be detrimental for image processing: it can shift the image intensity range and degrade image quality. In this project, we will demonstrate the impact of normalization by observing the training loss.

transform_pipeline = transforms.Compose([ transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)) ])

training_set = Datasets.STL10(root='./', split='train', download=True, transform=transform_pipeline)

References:
Torchvision Transforms: https://docs.pytorch.org/vision/0.9/transforms.html
Torchvision Datasets: https://docs.pytorch.org/vision/stable/datasets.html

Retrieve the image class names¶

            There are 10 classes of images in STL10.

class_names=dict(zip(range(10), training_set.classes))

Convert the tensor to the viewable image¶

            The order of dimensions in an image tensor (3x96x96, channels first) differs from the channel-last layout (96x96x3) expected for image display in Python. Therefore, the axes must be reordered in order to visualize an image. This step is only required for image visualization; it is not required during training.

img = img.numpy().transpose((1, 2, 0))
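The axis reordering can be verified on a stand-in array (the zero-filled array here is only a placeholder for a real image tensor):

```python
import numpy as np

# PyTorch image tensors are channel-first (C, H, W); display libraries
# such as matplotlib expect channel-last (H, W, C).
chw = np.zeros((3, 96, 96), dtype=np.float32)   # stand-in for img.numpy()
hwc = chw.transpose((1, 2, 0))                  # reorder axes to (H, W, C)
print(chw.shape, hwc.shape)  # (3, 96, 96) (96, 96, 3)
```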

Normalized images¶

            The normalized images tend to look darker because the original pixel maps do not contain many bright cells (values toward 255), so the normalized pixel values are centered toward darker values.

Create DataLoaders¶

            A DataLoader wraps an iterable around the Dataset to enable easy access to the data. The Dataset retrieves a dataset's features and labels one sample at a time. While training a model, we typically want to pass samples in "minibatches" and reshuffle the data at every epoch to reduce overfitting. There will be three dataloaders, one each for training, validation, and test.

from torch.utils.data import Dataset, DataLoader, TensorDataset, Subset
dataloader_tr=DataLoader(training_set, batch_size=64, shuffle=True)  # reshuffle at every epoch

Reference: DataLoader: https://docs.pytorch.org/tutorials/beginner/basics/data_tutorial.html

Initialization for the network training¶

            The network is the autoencoder defined in the paragraph above.
An optimizer is selected to update the network parameters during training.

network = Autoencoder(Encoder(), Decoder())
optimizer = torch.optim.Adam(network.parameters(), lr=1e-3)

            It is required to initialize the network's weights. The function init_weights is defined and then applied below:

def init_weights(module):
    if isinstance(module, nn.Conv2d):
        torch.nn.init.xavier_uniform_(module.weight)
        module.bias.data.fill_(0.01)
    elif isinstance(module, nn.Linear):
        torch.nn.init.xavier_uniform_(module.weight)
        module.bias.data.fill_(0.01)

network.apply(init_weights)
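A quick sanity check of this initialization scheme on a single layer (a self-contained sketch mirroring the init_weights function above):

```python
import torch
from torch import nn

def init_weights(module):
    # Xavier-uniform weights and constant 0.01 biases for conv and linear layers
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        torch.nn.init.xavier_uniform_(module.weight)
        module.bias.data.fill_(0.01)

layer = nn.Conv2d(3, 8, 3)
layer.apply(init_weights)   # Module.apply visits the module itself and all children
print(torch.all(layer.bias == 0.01).item())  # True -- all biases set to 0.01
```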

Set Training mode¶

            Set the module in training mode.

network.train()

Reference:
Training Mode: https://docs.pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.train

At each training epoch:¶

Use DataLoader to train the network in batches:
            At each training epoch, the dataloader loads the data samples in batches. For example, if there are 5000 training samples and the batch size is 64, there will be ceil(5000/64) = 79 training iterations per epoch.
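The batch-count arithmetic, made explicit (the last, partially filled batch still counts, hence the ceiling):

```python
import math

samples, batch_size = 5000, 64
batches_per_epoch = math.ceil(samples / batch_size)
print(batches_per_epoch)  # 79
```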

Use tqdm to show training progress:
            tqdm is a Python library that displays smart progress meters for loops and iterations, commonly used in PyTorch for monitoring training and evaluation loops. It can be integrated into a PyTorch training loop by wrapping tqdm() around any iterable, such as a DataLoader.

Use a loss function to calculate the training loss:
            The loss function calculates the loss between the original image and the reconstructed image. In this project, MSE (mean squared error) loss is used as the training loss.

References:
tqdm: https://tqdm.github.io/
Mean Square Error: https://docs.pytorch.org/docs/stable/generated/torch.nn.MSELoss.html

from tqdm.notebook import tqdm

for images, _ in tqdm(train_loader):
    # Zero the gradients
    optimizer.zero_grad()

    # Reconstruct the images
    output = network(images)

    # Compute the loss
    loss = loss_function(output, images.view(-1, QtyColor, imgEdge, imgEdge))

    # Calculate gradients
    loss.backward()

    # Optimize weights
    optimizer.step()

    # Collect the loss value in the list train_losses[]
    train_losses.append(loss.item())

# Take the average loss over the training iterations
# and record this average at each epoch
loss_trmean = np.mean(train_losses)
log_loss['avg_training_loss_perEpoch'].append(loss_trmean)

At each epoch, obtain the loss associated with test samples¶

            At each epoch, we apply the trained network to the test images to obtain the loss values for samples that were not used to train the network. The same routine can be applied to the validation images.

# ------------
# TEST
# ------------

network.eval()   # switch to evaluation mode (e.g., disables dropout)

with torch.no_grad():   # disable gradient calculation
    for test_images, _ in tqdm(test_loader):

        # Obtain the decoded image from the trained network
        output = network(test_images)

        # Compare the original image with the decoded image to calculate the loss
        test_loss = loss_function(output, test_images.view(-1, QtyColor, imgEdge, imgEdge))
        test_losses.append(test_loss.item())

loss_testmean = np.mean(test_losses)
log_loss['avg_test_loss_perEpoch'].append(loss_testmean)

network.train()  # restore training mode for the next epoch

epochQty=121

model = Autoencoder(Encoder(), Decoder())

# `train` here refers to the project's training routine wrapping the loop above,
# not the nn.Module.train() mode switch
log_loss = model.train(nn.MSELoss(), epochs=epochQty, batch_size=64, train_loader=dataloader_tr, val_loader=dataloader_val, test_loader=dataloader_te)

Training Results¶

            There are four training runs:

  • Train with original data. No additional processing such as normalization. Line color green.
  • Train with original data. No additional processing. Add dropout layers to reduce overfitting. Line color Orange.
  • Train with normalized data. No dropout layers. Line color Blue.
  • Train with normalized data with additional dropout layers. Line color Red.

            The chart below shows:

  • The validation loss, computed on data not used for training, is generally higher than the training loss.
  • The normalized data was skewed toward the darker color range and did not help improve the network training.
  • Dropout helped the model improve its performance. The network was optimal at around the 108th epoch, before overfitting occurred.
# The dropout layers were added to the encoder as shown below:

  self.net = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),  # (96, 96)
            act_fn,
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            act_fn,
      
            # Dropout layer
            nn.Dropout2d(p=0.2),
      
            nn.Conv2d(out_channels, 2 * out_channels, 3, padding=1, stride=2),  # (48, 48)
            act_fn,
            nn.Conv2d(2 * out_channels, 2 * out_channels, 3, padding=1),
            act_fn,
      
             # Dropout layer
            nn.Dropout2d(p=0.2),
      
            nn.Conv2d(2 * out_channels, 4 * out_channels, 3, padding=1, stride=2),  # (24,24)
            act_fn,
            nn.Conv2d(4 * out_channels, 4 * out_channels, 3, padding=1),
            act_fn,
            nn.Conv2d(4 * out_channels, 8 * out_channels, 3, padding=1, stride=2),  # (12,12)
            act_fn,
            nn.Conv2d(8 * out_channels, 8 * out_channels, 3, padding=1),
            act_fn,
            nn.Flatten(),
            nn.Linear(TotalReduction * out_channels * HiddenSize * HiddenSize, latent_dim),
            act_fn
        )
    
# The dropout layers were symmetrically added to the decoder below:

self.conv = nn.Sequential(
            nn.ConvTranspose2d(8 * out_channels, 8 * out_channels, 3, padding=1),  # (12, 12)
            act_fn,
            nn.ConvTranspose2d(8* out_channels, 4 * out_channels, 3, padding=1, stride=2, output_padding=1),  # (24, 24)
            act_fn,
            nn.ConvTranspose2d(4 * out_channels, 4 * out_channels, 3, padding=1),
            act_fn,
    
            # Dropout layer
            nn.Dropout2d(p=0.2),
    
            nn.ConvTranspose2d(4 * out_channels, 2 * out_channels, 3, padding=1, stride=2, output_padding=1),  # (48, 48)
            act_fn,

            nn.ConvTranspose2d(2 * out_channels, 2 * out_channels, 3, padding=1),
            act_fn,
    
            # Dropout layer
            nn.Dropout2d(p=0.2),
    
            nn.ConvTranspose2d(2 * out_channels, out_channels, 3, padding=1, stride=2, output_padding=1),  # (96, 96)
            act_fn,
            nn.ConvTranspose2d(out_channels, out_channels, 3, padding=1),
            act_fn,
            nn.ConvTranspose2d(out_channels, in_channels, 3, padding=1)
        )