Implement ConvNet using PyTorch


This post shows how to build a ConvNet using PyTorch. A ConvNet is made up of Layers. Every Layer has a simple API: It transforms an input 3D volume to an output 3D volume with some differentiable function that may or may not have parameters.

Layers used to build ConvNet

A simple ConvNet is a sequence of layers. Each layer accepts an input 3D volume and transforms it to an output 3D volume through a differentiable function. There are three main types of layers to build ConvNet architectures:

  • CONV Layer: Convolutional Layer.

  • POOL Layer: Pooling Layer.

  • FC Layer: Fully-Connected Layer (exactly as seen in regular Neural Networks).

Other layers include:

  • INPUT layer: holds the raw pixel values of the image.
  • RELU layer: applies an elementwise activation function, such as the max(0,x) thresholding at zero.

We will stack these layers to form a full ConvNet architecture. In this way, ConvNets transform the original image layer by layer from the original pixel values to the final class scores.
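As a rough illustration of this stacking, the sketch below chains one layer of each type with torch.nn.Sequential and prints the volume shape after every step. The channel counts and the 32 x 32 input size are assumptions chosen for the example, not values prescribed by the text above.

```python
import torch
import torch.nn as nn

# Illustrative stack: INPUT -> CONV -> RELU -> POOL -> FC (all sizes are assumptions).
layers = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=1, padding=1),  # CONV
    nn.ReLU(),                                            # RELU
    nn.MaxPool2d(kernel_size=2, stride=2),                # POOL
    nn.Flatten(),                                         # reshape the volume to a vector
    nn.Linear(8 * 16 * 16, 10),                           # FC -> 10 class scores
)

# One fake RGB image, laid out as (batch, depth, height, width).
x = torch.randn(1, 3, 32, 32)
shapes = [tuple(x.shape)]
for layer in layers:
    x = layer(x)
    shapes.append(tuple(x.shape))
    print(type(layer).__name__, tuple(x.shape))
```

Each printed line shows how one layer changes the volume, ending in a vector of 10 class scores.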

Some layers contain parameters and others don’t.

  • The CONV/FC layers perform transformations that are a function of not only the activations in the input volume, but also of the parameters (the weights and biases of the neurons). The parameters in the CONV/FC layers will be trained with gradient descent so that the class scores that the ConvNet computes are consistent with the labels in the training set for each image.

  • The RELU/POOL layers will implement a fixed function without additional parameters.

  • Each Layer may or may not have additional hyperparameters (e.g. CONV/FC/POOL do, RELU doesn’t)
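To make the distinction concrete, the following sketch counts the trainable parameters of each layer type. The helper name count_parameters and the layer sizes are assumptions made for this example.

```python
import torch.nn as nn

def count_parameters(layer):
    """Number of trainable values (weights and biases) held by a layer."""
    return sum(p.numel() for p in layer.parameters() if p.requires_grad)

example_layers = {
    "conv": nn.Conv2d(3, 18, kernel_size=3),  # 18 filters of size 3x3x3, plus 18 biases
    "fc": nn.Linear(64, 10),                  # 64*10 weights, plus 10 biases
    "relu": nn.ReLU(),                        # fixed function, nothing to train
    "pool": nn.MaxPool2d(kernel_size=2),      # fixed function, nothing to train
}

counts = {name: count_parameters(layer) for name, layer in example_layers.items()}
for name, n in counts.items():
    print(name, n)
```

Note that kernel_size is a hyperparameter of both the CONV and POOL layers here, but only the CONV layer turns it into trainable weights.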

CONV Layer

A ConvNet treats an image as a height x width x depth volume. In contrast to a fully-connected neural network, a ConvNet connects only spatially nearby nodes along the height and width axes, while staying fully connected along the depth axis.

Figure: Conv Layer (Source: Stanford Deep Learning)

The CONV layer slides a small filter, usually called a kernel, over the image. At each position it multiplies the image patch under the filter elementwise by the filter weights and sums the result. To determine the size and location of that patch, we need to specify a few parameters describing the spatial arrangement:

  • Depth / Kernel Size: the height and width of the filter, e.g. 3 x 3. The filter always extends through the full depth of the input volume.
  • Kernel Type: the values of the actual filter, such as identity, edge detection, and sharpen. In a ConvNet these values are learned during training.
  • Stride: the step size with which we slide the filter. When the stride is 1, we move the filter one pixel at a time. The larger the stride, the spatially smaller the output volume.
  • Padding: pads the input volume with zeros around the border so that the kernel passes properly over the edges of the image.
  • Output Layers: how many different kernels are applied to the image. This is the depth (number of channels) of the output volume.
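These parameters fix the spatial size of the output: for input width W, kernel size K, padding P and stride S, the output width is floor((W - K + 2P) / S) + 1. The sketch below compares that formula with what torch.nn.Conv2d actually produces. The helper name conv_output_size and the chosen settings are assumptions for illustration.

```python
import torch
import torch.nn as nn

def conv_output_size(input_size, kernel_size, stride, padding):
    """Spatial output size of a convolution along one axis."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

x = torch.randn(1, 3, 32, 32)  # one fake 32 x 32 RGB image
results = []
for kernel_size, stride, padding in [(3, 1, 1), (3, 1, 0), (3, 2, 1), (5, 2, 0)]:
    conv = nn.Conv2d(3, 4, kernel_size=kernel_size, stride=stride, padding=padding)
    actual = conv(x).shape[-1]                                    # width reported by PyTorch
    predicted = conv_output_size(32, kernel_size, stride, padding)
    results.append((kernel_size, stride, padding, predicted, actual))
    print(f"K={kernel_size} S={stride} P={padding}: formula={predicted}, Conv2d={actual}")
```

With a 3 x 3 kernel, stride 1 and padding 1, the output keeps the input's 32 x 32 size, which is why that combination is a common default.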

The output of the convolutional layer is called the “convolved feature” or “feature map”. It is a filtered version of the original image, in which each output value is a weighted sum of nearby pixels. The feature map can be viewed as a representation of the input image that is more useful for the classification task.

This layer is realized in PyTorch by torch.nn.Conv2d.
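A minimal sketch of both views of a kernel follows: a learnable torch.nn.Conv2d layer, and a fixed identity kernel applied by hand with torch.nn.functional.conv2d. The tensor sizes and variable names are assumptions for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Learnable CONV layer: 3 input channels (RGB), 18 kernels of size 3 x 3.
conv = nn.Conv2d(in_channels=3, out_channels=18, kernel_size=3, stride=1, padding=1)
image = torch.randn(1, 3, 32, 32)
feature_map = conv(image)
print(conv.weight.shape)   # (out_channels, in_channels, kernel_height, kernel_width)
print(feature_map.shape)   # (batch, out_channels, height, width)

# Fixed kernel applied by hand: the identity kernel returns the input unchanged.
identity = torch.tensor([[0., 0., 0.],
                         [0., 1., 0.],
                         [0., 0., 0.]]).reshape(1, 1, 3, 3)
gray = torch.randn(1, 1, 8, 8)          # one fake single-channel image
same = F.conv2d(gray, identity, padding=1)
print(torch.allclose(same, gray))
```

Swapping the identity values for an edge-detection or sharpen kernel changes the filter's effect without changing any of the shapes.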


ReLU Layer

The CONV layer only performs a linear transformation of the original image. The ReLU layer adds a nonlinear operation to help approximate the nonlinearity of the real world.

The Rectified Linear Unit (ReLU) function max(0, x) is one of the most commonly used nonlinear functions in neural networks. Other nonlinear functions include tanh and sigmoid. This layer is realized in PyTorch by torch.nn.functional.relu (or the torch.nn.ReLU module).
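A small sketch of the thresholding on a hand-picked tensor follows. The input values are arbitrary and chosen only to show both signs.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

relu_out = F.relu(x)            # negatives become 0, positives pass through
module_out = torch.nn.ReLU()(x) # module form of the same function
tanh_out = torch.tanh(x)        # an alternative nonlinearity, squashing into (-1, 1)

print(relu_out)
print(tanh_out)
```

The functional and module forms compute the same thing. The module form is convenient inside torch.nn.Sequential.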


POOL Layer: Max Pooling

A common type of POOL layer is max pooling. As its name suggests, max pooling passes over sections of the image and reduces each section to its maximal value. The POOL layer thereby reduces the size of the feature set. The following figure from Stanford Deep Learning visualizes it simply.

Figure: Max Layer

The parameters of the POOL layer include the kernel size, stride, and padding.

There are other types of pooling functions such as sum pooling or average pooling.

The max pooling layer is implemented in PyTorch by torch.nn.MaxPool2d.
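The sketch below applies 2 x 2 max pooling with stride 2 to a small hand-written 4 x 4 input, and average pooling for comparison. The input values are made up for the example.

```python
import torch
import torch.nn as nn

x = torch.tensor([[1., 1., 2., 4.],
                  [5., 6., 7., 8.],
                  [3., 2., 1., 0.],
                  [1., 2., 3., 4.]]).reshape(1, 1, 4, 4)  # (batch, depth, height, width)

max_pool = nn.MaxPool2d(kernel_size=2, stride=2, padding=0)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

pooled = max_pool(x)    # keeps the largest value of each 2 x 2 section
averaged = avg_pool(x)  # keeps the mean of each 2 x 2 section

print(pooled.reshape(2, 2))
print(averaged.reshape(2, 2))
```

Both variants halve the height and width, so the 4 x 4 input becomes 2 x 2.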


FC Layer

The output layer of a ConvNet is usually a fully connected layer, as in the traditional neural network architecture. This layer can be viewed as a final linear classifier.

The driver of higher accuracy in a ConvNet is the work done to engineer better features: CONV, ReLU, and max pooling turn the original data into useful information in an efficient manner.

The FC layer is implemented in PyTorch by torch.nn.Linear.
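As a sketch of the "linear classifier" view, the example below flattens a batch of feature volumes and checks that torch.nn.Linear computes the same scores as an explicit matrix product plus bias. The batch size and the 18 x 16 x 16 feature shape are assumptions for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

features = torch.randn(4, 18, 16, 16)        # batch of 4 pooled feature volumes
flat = features.view(features.size(0), -1)   # flatten each volume to a vector
fc = nn.Linear(18 * 16 * 16, 10)             # one score per class

scores = fc(flat)
manual = flat @ fc.weight.t() + fc.bias      # the same computation written out

print(scores.shape)
print(torch.allclose(scores, manual, atol=1e-4))
```

The row of scores for each image can then be fed into a loss such as torch.nn.CrossEntropyLoss.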


Implementation I: A Simple 2-Layer ConvNet


I will use the CIFAR-10 dataset. It contains images of 10 different classes and is a standard benchmark dataset used for building CNNs.

This site lists best results for some common image classification tasks.

Python Code

The four major functions in PyTorch are:

# applies convolution
torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding) 

# applies ReLU
torch.nn.functional.relu(input)

# applies max pooling
torch.nn.MaxPool2d(kernel_size, stride, padding)

# fully connected layer (multiply inputs by learned weights)
torch.nn.Linear(in_features, out_features)
# this should be put in a separate file ''.

# a simple 2-layer ConvNet class that inherits from the base torch.nn.Module class

import torch
import torch.nn.functional as nnfun
from torch.autograd import Variable  # legacy wrapper; a no-op since PyTorch 0.4

class ConvNet2L(torch.nn.Module):
    # Later change to accept kernel_size, stride, padding as input
    def __init__(self):
        super(ConvNet2L, self).__init__()
        # Input channels = 3, output channels = 18
        self.conv1 = torch.nn.Conv2d(3, 18, kernel_size=3, stride=1, padding=1)
        self.pool = torch.nn.MaxPool2d(kernel_size=2, stride=2, padding=0)
        # 18 x 16 x 16 input features, 64 output features
        self.fc1 = torch.nn.Linear(18*16*16, 64)
        # 64 input features, 10 output for 10 defined classes
        self.fc2 = torch.nn.Linear(64, 10)

    def forward(self, x):
        # Compute the activation of the first convolution
        # Size changes from raw image (3, 32, 32) to (18, 32, 32)
        x = nnfun.relu(self.conv1(x))
        # Size changes from (18, 32, 32) to (18, 16, 16)
        x = self.pool(x)
        # Reshape data to a vector for the input layer of the neural net
        # Size changes from (18, 16, 16) to (1, 18*16*16)
        x = x.view(-1, 18*16*16)
        # Compute the activation of the first fully connected layer
        # Size changes from (1, 18*16*16) to (1, 64)
        x = nnfun.relu(self.fc1(x))
        # Compute the second fully connected layer (the softmax activation is applied later, inside CrossEntropyLoss)
        # Size changes from (1, 64) to (1, 10)
        x = self.fc2(x)
        # Return the class scores
        return x
# Defined in a separate file ''
# calculate the output size based on input_size, kernel_size, stride, and padding

def output_size(input_size, kernel_size, stride, padding):
    outputsize = int((input_size - kernel_size + 2*padding)/stride) + 1
    return outputsize
# Defined in a separate file ''
# Split the data set into training, validation and test sets
import numpy as np
from torch.utils.data.sampler import SubsetRandomSampler

## Training
n_train_samples = 20000
train_sampler = SubsetRandomSampler(np.arange(n_train_samples, dtype=np.int64))

## Validation
n_vali_samples = 500
vali_sampler = SubsetRandomSampler(np.arange(n_train_samples, n_train_samples + n_vali_samples, dtype=np.int64))

## Testing
n_test_samples = 5000
test_sampler = SubsetRandomSampler(np.arange(n_test_samples, dtype=np.int64))
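# ''
# DataLoader takes in a dataset and a sampler for loading (num_workers sets the number of worker subprocesses that load data)

def get_train_loader(batch_size):
    train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size, sampler=train_sampler, num_workers=2)
    return train_loader

# Test and validation loaders have constant batch sizes, so we can define them directly
test_loader = torch.utils.data.DataLoader(test_set, batch_size=4, sampler=test_sampler, num_workers=2)
# The validation indices (20000 to 20499) are drawn from the training set
vali_loader = torch.utils.data.DataLoader(train_set, batch_size=128, sampler=vali_sampler, num_workers=2)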
# Defined in a separate file ''
# Define loss and optimization

import torch.optim as optim

def loss_and_optimization(net, learning_rate=0.001):
    # Loss function
    loss = torch.nn.CrossEntropyLoss()
    # Optimizer
    optimizer = optim.Adam(net.parameters(), lr=learning_rate)
    return loss, optimizer


import time
import torch

def trainNN(net, batch_size, n_epochs, learning_rate):
    # Print all of the hyperparameters of the training iteration:
    print("####### Hyperparameters #######")
    print("batch_size = ", batch_size)
    print("epochs = ", n_epochs)
    print("learning_rate = ", learning_rate)
    print("#" * 31)
    # Get training data
    train_loader = get_train_loader(batch_size)
    n_batches = len(train_loader)
    # Create loss and optimizer function
    loss, optimizer = loss_and_optimization(net, learning_rate)
    # Time for printing
    training_start_time = time.time()
    # Loop for n_epochs
    for epoch in range(n_epochs):
        running_loss = 0.0
        print_every = n_batches // 10
        start_time = time.time()
        total_train_loss = 0
        for i, data in enumerate(train_loader, 0):
            # Get inputs (plain tensors; the Variable wrapper is unnecessary since PyTorch 0.4)
            inputs, labels = data
            # Set the parameter gradients to zero
            optimizer.zero_grad()
            # Forward pass, backward pass, optimize
            outputs = net(inputs)
            loss_size = loss(outputs, labels)
            loss_size.backward()
            optimizer.step()
            # Update statistics
            running_loss += loss_size.item()
            total_train_loss += loss_size.item()
            # Print roughly every 10th of an epoch
            if (i + 1) % (print_every + 1) == 0:
                # Average over the (print_every + 1) batches seen since the last print
                print("Epoch {}, {:d} % \t train_loss: {:.2f} took: {:.2f}s".format(epoch+1, int(100 * (i+1) / n_batches), running_loss / (print_every + 1), time.time() - start_time))
                # Reset running loss and time
                running_loss = 0.0
                start_time = time.time()
        # end for i, data in enumerate(train_loader, 0):
        # At the end of the epoch, do a pass on the validation set
        total_val_loss = 0
        # No gradients are needed for validation
        with torch.no_grad():
            for inputs, labels in vali_loader:
                # Forward pass
                val_outputs = net(inputs)
                val_loss_size = loss(val_outputs, labels)
                total_val_loss += val_loss_size.item()
        # end for inputs, labels in vali_loader:
        print("Validation loss = {:.2f}".format(total_val_loss / len(vali_loader)))
    print("Training finished, took {:.2f}s".format(time.time() - training_start_time))
# main file: load data, train & test

# import all the required libraries
import numpy as np
import torch
import torchvision
import torchvision.transforms as transforms

# set random seed for reproducible results
rseed = 77
np.random.seed(rseed)
torch.manual_seed(rseed)

transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

train_set = torchvision.datasets.CIFAR10(root='./cifardata', train=True, download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root='./cifardata', train=False, download=True, transform=transform)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

ConvNet = ConvNet2L()
trainNN(ConvNet, batch_size=32, n_epochs=5, learning_rate=0.001)
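The listing above defines a test loader but never uses it. One way to evaluate a trained network, sketched here under assumptions, is to measure classification accuracy over a loader. The helper name accuracy, the stand-in model and the random stand-in data are all hypothetical, and they are there only so the sketch runs without downloading CIFAR-10.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def accuracy(net, loader):
    """Fraction of samples whose top-scoring class matches the label."""
    net.eval()
    correct, total = 0, 0
    with torch.no_grad():  # no gradients needed for evaluation
        for inputs, labels in loader:
            predicted = net(inputs).argmax(dim=1)  # index of the highest class score
            correct += (predicted == labels).sum().item()
            total += labels.size(0)
    return correct / total

# Stand-in data and model (assumptions, not the post's CIFAR-10 setup).
torch.manual_seed(0)
fake_images = torch.randn(20, 3, 32, 32)
fake_labels = torch.randint(0, 10, (20,))
fake_loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=4)
stand_in_net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))

acc = accuracy(stand_in_net, fake_loader)
print("accuracy = {:.2f}".format(acc))
```

With the objects from this post, the equivalent call would be accuracy(ConvNet, test_loader) after training has finished.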