This blog post presents the fundamental principles behind object detection and its algorithms, with rigorous intuition.
Prerequisites:
Some basic knowledge of Deep Learning / Machine Learning / Mathematics.
1.) What is object detection?
2.) Explanation of some of the terminology involved in object detection.
3.) Walk-through of ANNs and CNNs.
4.) Principles behind SSD and a simple implementation in Python.
5.) Case study involving the ENA24 dataset (detecting wildlife).
What is object detection?
It is a technique in computer vision used to identify and locate objects in an image or video. The camera application on recent computers uses object detection to identify faces. Its applications range from the medical domain to more advanced domains like space research.
There are many terms (jargon) associated with object detection. We will discuss some of them in detail.
1.) Object: A material thing that can be seen and touched. In the computer vision community, it is considered a group of pixels in an image or video that represents a material thing.
In the above image, both the apple and the banana are objects. If you look at the image carefully, you can infer that a group of yellow-colored pixels with a curvy structure forms a banana. Similarly, a group of red-colored pixels with a certain structure forms an apple. So, most object detection algorithms essentially involve finding the common structure which forms an object.
2.) Bounding box: A geometric figure (square, circle, etc.) which encloses the object of interest. If you look at the above image, the red rectangle depicts a bounding box: it encloses the apple (object). Similarly, the blue rectangle which encloses the banana (object) is also a bounding box. This bounding box is also called a ground-truth bounding box, as it is not generated by the object detection algorithm; rather, it is already given.
3.) Anchor box: A geometric figure (square, circle, etc.) which is generated by the object detection algorithm in order to identify and locate an object in an image. This definition will make more sense when we discuss SSD.
In the above image, the object of interest is the truck. The bounding box (ground-truth box) is the black rectangle. The goal of the algorithm is to predict the ground-truth box and the object contained in it. For that, it proposes multiple anchor boxes and filters out some of them based on certain criteria. Don't feel intimidated if you don't understand yet; we will discuss these ideas in more detail in the later part.
4.) Category or class: This defines a name for an object. In the above image, banana is the class (category) of the object enclosed in the blue rectangle. We could even call it a fruit; usually, we keep it simple.
Artificial Neural Network(s).
Let us denote the price for one vegetable as P and the quantity (total number of vegetables) as Q. Therefore, the price for Q items is P * Q.
Gradient descent :
1.) Pose the objective or loss function.
2.) Compute the derivative of the objective function with respect to the parameters of the function.
3.) Displace the parameters in the direction opposite to the gradient vector (the generalization of the derivative).
It is a simple calculus exercise. You can try to solve it on your own; a minimal sketch follows.
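As a concrete illustration (my own toy example, not from the original post), here is a minimal sketch of the three steps above applied to the vegetable-price setting: we fit the single parameter P of the model price = P * Q to made-up (Q, price) pairs using a squared-error objective.
# A toy, self-contained sketch: fit P in price = P * Q with plain gradient descent.
Q = [1.0, 2.0, 3.0, 4.0]           # quantities (made-up data)
price = [4.0 * q for q in Q]       # total prices, generated with a true price of 4 per vegetable
P = 0.0                            # initial guess for the parameter
learning_rate = 0.01
for step in range(200):
    # 1.) objective: mean squared error between the predictions P*q and the true prices
    # 2.) its derivative with respect to P is the mean of 2*(P*q - price)*q
    gradient = sum(2 * (P * q - p) * q for q, p in zip(Q, price)) / len(Q)
    # 3.) move P in the direction opposite to the gradient
    P = P - learning_rate * gradient
print(P)   # ends up close to 4.0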
Why do we call it a network?
Convolutional neural networks :
1.) Local features.
2.) Features are location-independent (they can appear anywhere in the image).
1.) The network must try to detect local features.
2.) It must exploit translational invariance (even if the features are displaced in the image, the network must still identify them correctly).
1.) To use the first property, the network only considers a part of the input at a given time. This way, the network can identify local features in the input. Therefore, in a given layer we consider a subset of features from the superset of overall features. Conceptually (though not literally how it is done in practice), this corresponds to a weight matrix that has zeros wherever a weight would fall outside the considered part of the input.
2.) To use the second property, the network must be designed so that it slides across the input to find the possible locations where the feature(s) are present. The sliding window can be controlled by some parameter(s). One way to implement this is by defining a Toeplitz matrix (refer to Wikipedia for more on this); a small sketch of this idea follows the two property illustrations below.
Property 1:
Property 2 :
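To make these two properties concrete, here is a minimal NumPy sketch (a toy 1-D example of my own, not code from the post): the same small filter is applied at every local window of the input, and this is exactly a matrix multiplication with a Toeplitz-style weight matrix whose rows are shifted copies of the filter.
import numpy as np
signal = np.array([0., 0., 1., 1., 1., 0., 0.])   # a toy 1-D "image" with one edge on each side
kernel = np.array([1., -1.])                       # a tiny edge-detecting filter
# Property 1: each output value only looks at a local window of the input.
sliding = np.array([signal[i:i+2] @ kernel for i in range(len(signal) - 1)])
# Property 2: the same weights are reused at every position, which equals multiplying
# by a Toeplitz-style matrix built from shifted copies of the kernel.
toeplitz = np.zeros((len(signal) - 1, len(signal)))
for i in range(len(signal) - 1):
    toeplitz[i, i:i+2] = kernel
print(sliding)            # non-zero exactly where the signal changes
print(toeplitz @ signal)  # identical result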
Principles behind SSD
Building blocks of SSD :
1.) Bounding box: At the beginning of this post, we discussed this in detail. Here, we are going to code it up in Python. I am assuming the readers are familiar with the requirements mentioned above. Let's import some libraries.
import mxnet,d2l
from d2l import mxnet as d2l
from mxnet import np,npx
npx.set_np()
Let's read an image
image= mxnet.image.imread('H.jpg')
print("Number of channels ",image.shape[-1])
print("Image (height,width) ",image.shape[0],image.shape[1])
d2l.plt.imshow(image.asnumpy())
How do we draw the bounding box? There are many tools available, online and offline, to do this task. One such tool is LabelImg. I used this tool to draw the bounding box around the object. Since we are dealing with a single image, we could even do it by hand. A bounding box is specified by four coordinates.
anchor_box_coordinates=[45,32,150,170]
def bbox_to_rect(bbox, color):
"""Convert bounding box to matplotlib format."""
# Convert the bounding box (top-left x, top-left y, bottom-right x,
# bottom-right y) format to matplotlib format: ((upper-left x,
# upper-left y), width, height)
return d2l.plt.Rectangle(
xy=(bbox[0], bbox[1]), width=bbox[2]-bbox[0], height=bbox[3]-bbox[1],
fill=False, edgecolor=color, linewidth=2)
fig = d2l.plt.imshow(image.asnumpy())
fig.axes.add_patch(bbox_to_rect(anchor_box_coordinates, 'blue'))
d2l.plt.show()
2.) Convolutional neural network:
We need some mechanism to identify and locate the letter 'H' in the image. We use a convolutional neural network to perform this task. What does it do? We already mentioned at the beginning of this post that an object is nothing but a collection of pixels with some structure. If we identify the structure and the group of pixels, we can easily spot the object. How do we do it? If we look carefully at the above image, the letter H contains two vertical edges and one horizontal edge. This is somewhat unique to this object. If we design an algorithm to find vertical and horizontal edges in an image, we can essentially detect the presence of the letter H. This is exactly what a convolutional neural network does. It inherits the principles of pattern matching and uses them to match features. We use what is called a filter (kernel) to accomplish this task.
Let's code this up in python🎈
kernel=np.array([[1,1,1],[0,0,0],[-1,-1,-1]])
kernel=np.expand_dims(kernel,axis=0)
kernel=np.concatenate([kernel,kernel,kernel],axis=0)
print(kernel,"\n")
print(kernel.shape)
conv=mxnet.gluon.nn.Conv2D(3,3,padding=1,in_channels=3)
conv.initialize(init=mxnet.init.Constant(kernel))
print(conv.weight.data())
Let's perform horizontal edge detection on our image. The kernel defined above (rows [1, 1, 1], [0, 0, 0], [-1, -1, -1]) responds strongly wherever the pixel intensity changes from top to bottom, so large values in the filtered output mark horizontal edges such as the crossbar of the H, while near-uniform regions produce values close to zero.
image=image.transpose(2,0,1)
image=np.expand_dims(image,axis=0)
image=image.astype('float32')
filtered=conv(image)
print(filtered)
We will first implement the convolutional neural network
class Convolution(mxnet.gluon.nn.Block):
def __init__(self):
super().__init__()
self.conv1 = mxnet.gluon.nn.Conv2D(channels=3,kernel_size=3,padding=1)
self.bn1 = mxnet.gluon.nn.BatchNorm()
self.conv2 = mxnet.gluon.nn.Conv2D(channels=3,kernel_size=3,padding=1)
self.bn2 = mxnet.gluon.nn.BatchNorm()
def forward(self,x):
x=npx.relu(self.bn1(self.conv1(x)))
x=npx.relu(self.bn2(self.conv2(x)))
return x
conv = Convolution()
conv.initialize()
#The image tensor prepared above is already in (batch, channel, height, width) float32 format, so we can reuse it directly.
print(conv(image))
3.) Anchor box generation :
For a feature map of height h and width w, the algorithm takes each (i, j) position as a center, where 0 <= i < h and 0 <= j < w. With m sizes and n ratios, it generates (m + n - 1) anchor boxes per center, so the total number of anchor boxes generated is w * h * (m + n - 1).
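As a quick sanity check (my own arithmetic, matching the sizes and ratios used in the visualization code below): with m = 6 sizes and n = 4 ratios, each position gets 6 + 4 - 1 = 9 anchor boxes, so a 13 x 13 feature map yields 1521 anchors in total.
# Toy check of the anchor-count formula with the values used below (assumed, not from the post).
m, n = 6, 4      # number of sizes and number of ratios
h = w = 13       # feature-map height and width
print(m + n - 1)            # 9 anchor boxes per position
print(h * w * (m + n - 1))  # 1521 anchor boxes in total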
#A simple routine to draw the rectangle specified by the coordinates.
#For further details, please feel free to look into the matplotlib documentation.
def bbox_plot(bbox,color):
return d2l.plt.Rectangle(
xy=(bbox[0], bbox[1]), width=bbox[2]-bbox[0], height=bbox[3]-bbox[1],
fill=False, edgecolor=color, linewidth=2)
#This routine is to plot 'n' number of bounding boxes.
#All the code is implemented using matplotlib.
#We first draw the rectangle and define the patch to it by adding an text.
def multi_bbox(axes,bbox,labels,colors,size):
for i,b in enumerate(bbox):
if b[0]!=-size:
c=colors[i]
l=labels[i]
rectangle=bbox_plot(b,c)
axes.add_patch(rectangle)
t_color='w'
axes.text(rectangle.xy[0], rectangle.xy[1], l,
va='center', ha='center', fontsize=9, color=t_color,
bbox=dict(facecolor=c, lw=0))
sizes=[[0.27,0.44,0.52,0.62,0.71,0.85]]
ratios=[[1,1.5,2,2.2]]
#This code is taken from the matplotlib documentation.
#It is used to return the colors for the edges in the matplotlib library
prop_cycle = d2l.plt.rcParams['axes.prop_cycle']
colors = prop_cycle.by_key()['color']
from mxnet.gluon.data.vision import transforms
image=mxnet.image.imread('/content/sample.PNG')
image=transforms.Resize(13)(image)
#multibox_prior only looks at the spatial size of its input, so a dummy (1,3,13,13) tensor is enough here.
anchors=mxnet.npx.multibox_prior(np.zeros(shape=(1,3,13,13)),sizes=sizes[0],ratios=ratios[0])
anchors=anchors.reshape(13,13,9,4)
anchors=anchors[5,5,:,:]
fig=d2l.plt.imshow(image.asnumpy())
scale=np.array([[13,13,13,13]])
multi_bbox(fig.axes,anchors*scale,['' for i in anchors],colors[:9],13)
4.) Category prediction :
SSD algorithm :
def class_predictions(num_anchors , num_categories):
return mxnet.gluon.nn.Conv2D( num_anchors * (num_categories + 1) , kernel_size=3,padding=1)
def anchorbox_offsets(num_anchors):
return mxnet.gluon.nn.Conv2D(num_anchors * 4, kernel_size=3, padding=1)
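The channel counts of these convolution layers encode the predictions: for every spatial position of the feature map, the class-prediction layer outputs num_anchors * (num_categories + 1) values (one score per class, plus background, for each anchor centred there), and the offset layer outputs num_anchors * 4 values (the four box offsets per anchor). The reshape operations in the SSD block below simply unfold these channels into a per-anchor layout.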
import mxnet,d2l
from mxnet import np,npx
from mxnet.gluon import nn
npx.set_np()
class Convolution(nn.Block):
def __init__(self):
super().__init__()
self.conv1=nn.Conv2D(channels=3,kernel_size=3,padding=1,strides=2)
self.bn1=nn.BatchNorm()
self.conv2=nn.Conv2D(channels=3,kernel_size=3,padding=1,strides=2)
self.bn2=nn.BatchNorm()
def forward(self,x):
x=npx.relu(self.bn1(self.conv1(x)))
x=npx.relu(self.bn2(self.conv2(x)))
return x
def class_predictions(num_anchors,num_categories):
return nn.Conv2D(num_anchors*(num_categories+1),kernel_size=3,padding=1)
def anchorbox_offsets(num_anchors):
return nn.Conv2D(num_anchors*4,kernel_size=3,padding=1)
class SSD(nn.Block):
def __init__(self):
super().__init__()
self.sizes=[0.27,0.44,0.52,0.62,0.71,0.85]
self.ratios=[1,1.5,2,2.2]
self.num_anchors = len(self.sizes) + len(self.ratios) -1
self.basenet=Convolution()
self.class_predictions=class_predictions(self.num_anchors,1)
self.anchorbox_offset=anchorbox_offsets(self.num_anchors)
def forward(self,x):
x=self.basenet(x)
anchors=npx.multibox_prior(x,self.sizes,self.ratios)
class_predictions= self.class_predictions(x)
anchorbox_offsets= self.anchorbox_offset(x)
anchors = anchors.reshape(-1,self.num_anchors*50*50,4)
class_predictions = class_predictions.reshape(-1,self.num_anchors*50*50,2)
anchorbox_offsets = anchorbox_offsets.reshape(-1,self.num_anchors*50*50*4)
return anchors,class_predictions,anchorbox_offsets
ssd=SSD()
ssd.initialize()
x=np.ones(shape=(1,3,200,200))
a,c,b=ssd(x)
a.shape,c.shape,b.shape
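If the model is wired correctly, this should print roughly (1, 22500, 4), (1, 22500, 2) and (1, 90000): the two stride-2 convolutions shrink the 200 x 200 input to a 50 x 50 feature map, each of whose 2500 positions gets 9 anchors, giving 22500 anchors with 2 class scores and 4 offsets each (my own arithmetic, worth verifying against the actual output).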
1.) SoftMax Cross Entropy Loss :
This loss is defined for the category predictions.
2.) L1 Loss :
This loss is defined for the offset predictions. L2Loss barely penalizes predictions when the difference is very small, so L1Loss is preferred here.
category_loss = mxnet.gluon.loss.SoftmaxCrossEntropyLoss()
offset_loss = mxnet.gluon.loss.L1Loss()
model = SSD ()
model.initialize()
x = np.ones(shape=(1,3,200,200))
anchor,class_predictions,offsets = model(x)
trainer = mxnet.gluon.Trainer ( model.collect_params() , 'sgd' , {'learning_rate':0.01} )
def train(image,bounding_box,num_epochs):
    for epoch in range(num_epochs):
        with mxnet.autograd.record():
            anchors,class_predictions,offset_predictions = model(image)
            bbox_labels,bbox_masks,class_truth = npx.multibox_target(anchors,bounding_box,class_predictions.transpose(0,2,1))
            l = category_loss(class_predictions,class_truth) + offset_loss(offset_predictions*bbox_masks, bbox_labels*bbox_masks)
        l.backward()
        trainer.step(1)
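A minimal usage sketch (my own illustration, not from the post): here `image` is a dummy (1, 3, 200, 200) float tensor and `bounding_box` holds one normalized ground-truth box per image in the [class, x_min, y_min, x_max, y_max] format that multibox_target expects.
# Hypothetical toy call: one dummy image and one normalized ground-truth box of class 0.
image = np.ones(shape=(1, 3, 200, 200))
bounding_box = np.array([[[0, 0.25, 0.25, 0.75, 0.75]]])  # shape (batch, num_boxes, 5)
train(image, bounding_box, num_epochs=10)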
How do we do inference?
def inference(image):
anchors, class_preds, offset_preds = model(image)
class_probs = npx.softmax(class_preds).transpose(0, 2, 1)
output = npx.multibox_detection(class_probs, offset_preds, anchors)
idx = [i for i, row in enumerate(output[0]) if row[0] != -1]
return output[0, idx]
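As a quick usage sketch (again my own illustration): calling it on the same dummy tensor shows the output format. Each row returned by multibox_detection is (class id, confidence, x_min, y_min, x_max, y_max), with coordinates still normalized to [0, 1]; rows whose class id is -1 have already been filtered out above.
# Hypothetical call on a dummy image tensor; with an untrained model the boxes are
# meaningless, but the output format is visible.
detections = inference(np.ones(shape=(1, 3, 200, 200)))
print(detections)  # each row: (class id, confidence, x_min, y_min, x_max, y_max)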
Let's get started.
Description
Real world problem
Problem Statement :
2.) This could be useful for monitoring wildlife in order to safeguard it.
Useful Links :
https://www.nature.com/scitable/knowledge/library/ethics-of-wildlife-management-and-conservation-what-80060473
Business objectives and constraints :
Installing the requirements :
#Installing the requirements.
!pip install mxnet-cu101==1.7.0
# !pip install tensorflow #Not required if you are in colaboratory notebook.
!pip install -U d2l
!pip install --upgrade mxnet-cu101 gluoncv
!pip install mxboard
# !pip install tensorboard #Not required if you are in colaboratory notebook
# !pip install tqdm #Not required if you are in colaboratory notebook
Importing the required libraries :
#Importing the necessary modules.
import mxnet
from mxnet import np,npx,image
import os,json,tqdm
from tqdm import tqdm
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
from d2l import mxnet as d2l
import matplotlib.pyplot as plt
import time
from mxnet.gluon import nn
import gluoncv
from mxnet.gluon.data.vision import transforms
npx.set_np()
Downloading the dataset :
#wget command is used to download multiple files from the Internet.
!wget --header="Host: lilablobssc.blob.core.windows.net" --header="User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36" --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9" --header="Accept-Language: en-GB,en-US;q=0.9,en;q=0.8" --header="Referer: http://lila.science/datasets/ena24detection" "https://lilablobssc.blob.core.windows.net/ena24/ena24.zip" -c -O 'ena24.zip'
#This command unzips the downloaded file
!unzip /content/ena24.zip -d /content/Images
#wget command is used to download multiple files from the Internet.
!wget --header="Host: lilablobssc.blob.core.windows.net" --header="User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36" --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9" --header="Accept-Language: en-GB,en-US;q=0.9,en;q=0.8" --header="Referer: http://lila.science/datasets/ena24detection" "https://lilablobssc.blob.core.windows.net/ena24/ena24.json" -c -O 'ena24.json'
Exploratory Data Analysis:
Data preprocessing :
#Using the json module,we are opening the json file
with open('ena24.json') as file:
info=json.load(file)
#Displaying the keys contained in the json formatted file.
print(info.keys())
#Displaying the content in the image key.
print(info['images'][:2])
#Displaying the content in the annotations key.
print(info['annotations'][0])
#Displaying the content in the categories key.
print(info['categories'][:2])
#Displaying the content in the info key.
print(info['info'])
#Creating empty dictionaries to store the details about images, annotations and categories.
images={}
annotations={}
categories={}
#We are extracting the information from the json file and storing it in separate dictionaries.
for i in info['images']:
key=int(i['id'])
images[key]={}
images[key]['file_name']=i['file_name']
images[key]['height']=i['height']
images[key]['width']=i['width']
#Displaying the information in the images dictionary.
images[8703]
#We are just appending the bounding boxes and the category they belong to.
#We are introducing a variable called count.
#The count variable keeps track of the number of valid bounding boxes.
#More details regarding valid bounding boxes come in the later part.
for i in info['annotations']:
key=int(i['image_id'])
if key not in annotations:
annotations[key]={}
annotations[key]['bbox']=[]
category=i['category_id']
i['bbox'].insert(0,category)
annotations[key]['bbox'].append(i['bbox'])
annotations[key]['count']=1
annotations[key]['category']=category
else:
category=i['category_id']
i['bbox'].insert(0,category)
annotations[key]['bbox'].append(i['bbox'])
annotations[key]['count']+=1
2.) We prepend the category id to each bounding box list for later use.
3.) Count is defined as the number of valid bounding boxes for an image.
4.) The meaning of valid makes sense, once we move further.
#Displaying the keys in annotations.
list(annotations.keys())[:10]
#Displaying the values in the annotations dictionary.
print(list(annotations.values())[0])
#We are creating a map for storing the categories where key is the unique id and the value is the category it belongs to.
for i in info['categories']:
id=int(i['id'])
categories[id]=i['name']
#Displaying the category.
print(categories[10])
#We are using a stochastic (minibatch) version of gradient descent.
#Therefore, it requires the data to be in batches.
#But the problem here is that each image contains a different number of bounding boxes.
#So, we find the image with the maximum number of bounding boxes.
#We pad the other images with extra illegal bounding boxes until they reach that maximum.
#We pad with a special value ( -1 ).
#The MXNet framework safely ignores bounding boxes labelled with -1.
#We are also resizing the bounding box, since we are resizing the image to (200,200).
def bbox_transform(bbox,in_width,in_height,out_height,out_width,m,n):
temp=np.zeros(shape=(8,5))
temp[:m,1:]=gluoncv.data.transforms.bbox.resize(bbox[:,1:],(in_width,in_height),(out_width,out_height))
temp[:m,1:]/=200
temp[m:,:]=-1
temp[:m,0]=bbox[:m,0]
return temp
#The format of the bounding box coordinates provided in the dataset are as follows:
# (x_min,y_min,width,height)
# We need to transform it into: (x_min,y_min,x_max,y_max)
# x_max= x_min + width
# y_max= y_min + height
# The above format is what MxNet expects.
# Also,MxNet is expecting the bounding box to be normalized i.e., it needs to be divided by the image's width and height.
def bbox_normalize(arg):
for i in arg.keys():
img_width,img_height=images[i]['width'],images[i]['height']
if type(annotations[i]['bbox'][0]) is list:
m=len(annotations[i]['bbox'])
for j in range(m):
v=annotations[i]['bbox'][j]
width=v[3]
height=v[4]
annotations[i]['bbox'][j][3]=annotations[i]['bbox'][j][1]+width
annotations[i]['bbox'][j][4]=annotations[i]['bbox'][j][2]+height
annotations[i]['bbox']=np.array(annotations[i]['bbox'])
annotations[i]['bbox']=bbox_transform(annotations[i]['bbox'],img_width,img_height,200,200,m,8)
else:
v=annotations[i]['bbox']
width=v[3]
height=v[4]
annotations[i]['bbox'][3]=annotations[i]['bbox'][1]+width
annotations[i]['bbox'][4]=annotations[i]['bbox'][2]+height
annotations[i]['bbox'][1]/=img_width
annotations[i]['bbox'][3]/=img_width
annotations[i]['bbox'][2]/=img_height
annotations[i]['bbox'][4]/=img_height
1.) We are using minibatch stochastic gradient descent for the optimization part.
bbox_normalize(annotations)
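For example (my own numbers, just to illustrate the conversion): a box stored as (x_min=100, y_min=50, width=40, height=80) in a 400 x 200 image first becomes (100, 50, 140, 130) in corner format; dividing the x-coordinates by the image width and the y-coordinates by the image height then gives the normalized box (0.25, 0.25, 0.35, 0.65).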
Data Visualization:
#A simple routine to draw the rectangle specified by the coordinates.
#For further details, please feel free to look into the matplotlib documentation.
def bbox_plot(bbox,color):
return d2l.plt.Rectangle(
xy=(bbox[0], bbox[1]), width=bbox[2]-bbox[0], height=bbox[3]-bbox[1],
fill=False, edgecolor=color, linewidth=2)
#This routine is to plot 'n' number of bounding boxes.
#All the code is implemented using matplotlib.
#We first draw the rectangle and define the patch to it by adding an text.
def multi_bbox(axes,bbox,labels,colors,size):
for i,b in enumerate(bbox):
if b[0]!=-size:
c=colors[i]
l=labels[i]
rectangle=bbox_plot(b,c)
axes.add_patch(rectangle)
t_color='w'
axes.text(rectangle.xy[0], rectangle.xy[1], l,
va='center', ha='center', fontsize=9, color=t_color,
bbox=dict(facecolor=c, lw=0))
#Displaying the sample bounding box.
#The value -1 in the array indicates that it is an illegal bounding box, present only to create a batch.
bbox=annotations[3]['bbox']
bbox
#We are reading the image using OpenCV
#We are resizing the image using Resize block (MxNet block)
#We are displaying the image with bounding boxes on it.
image=mxnet.image.imread('/content/ena24/3.jpg')
image=transforms.Resize(200)(image)
fig=d2l.plt.imshow(image.asnumpy())
multi_bbox(fig.axes,bbox[:,1:]*np.array([[200,200,200,200]]),[categories[int(i[0])] if int(i[0])!=-1 else 'Null' for i in bbox],['r' for i in range(9)],200)
#We are defining a count variable and storing the number of valid bounding boxes for each image.
counts=[]
for i in annotations.values():
counts.append(i['count'])
#This is just a simple plot describing the number of bounding boxes for each image.
#This plot shows that the maximum number of bounding boxes for a single image is 8.
#Most of the images contain 4 or fewer bounding boxes.
d2l.plt.title('Bounding box id vs Number of bounding boxes')
d2l.plt.xlabel('Bounding box id')
d2l.plt.ylabel('Number of bounding box')
d2l.plt.plot(counts)
d2l.plt.show()
#It calculates the maximum number of bounding boxes for a single image.
print(max(counts))
#This simple routine calculates the area of the bounding boxes.
#It is to describe the sizes of the objects contained in the image.
#The formula is :
# (width * height)
# width= (x_max-x_min) and height= (y_max-y_min)
#The area values are bounded between [0,1], since we normalized them.
def calculate_area(bbox_dict):
areas=[]
for i in bbox_dict.keys():
b=bbox_dict[i]['bbox']
c=bbox_dict[i]['count']
for j in range(c):
temp=b[j]
area=(temp[3]-temp[1])*(temp[4]-temp[2])
areas.append(area.item())
return areas
areas=calculate_area(annotations)
d2l.plt.ylabel('Area')
d2l.plt.xlabel('Bounding box')
d2l.plt.title('Area vs Bounding box id')
d2l.plt.plot(areas)
d2l.plt.show()
Preparing the dataset.
#It uses the os module to store the list of files in the main variable.
main=os.listdir('/content/ena24')
#We strip the '.jpg' extension to get integer image ids.
main=[int(i[:-4]) for i in main]
#Displaying the length of the dataset.
print(len(main))
#Since there is no notion of a time axis, we can safely split the dataset randomly or by any other valid method.
train_indices=main[:7031]
test_indices=main[7031:]
# We are inheriting from the Dataset abstract class.
# We are storing the indices, images and their corresponding annotations.
# We are storing the location where the dataset is stored on the local disk.
# The length method returns the number of valid examples for training the model.
# The getitem method is used to select an example from the list of examples and apply transformations, if any.
class Ena24TrainDataset(mxnet.gluon.data.Dataset):
def __init__(self,train_indices,images,annotations,root):
super().__init__()
self.train_indices=train_indices
self.images=images
self.annotations=annotations
self.root=root
def __len__(self):
return len(self.train_indices)
def __getitem__(self,idx):
index=self.train_indices[idx]
image=mxnet.image.imread(os.path.join(self.root,self.images[index]['file_name']))
bbox=self.annotations[index]['bbox']
return image,bbox
init method :
We use it for book-keeping purposes. We store the images, annotations, train_indices and the root location.
len method :
We return the length of the dataset.
getitem method :
We select an image and its bounding box from the dataset and return them.
# We are inheriting from the Dataset abstract class.
# We are storing the indices, images and their corresponding annotations.
# We are storing the location where the dataset is stored on the local disk.
# The length method returns the number of valid examples for evaluating the model.
# The getitem method is used to select an example from the list of examples and apply transformations, if any.
class Ena24TestDataset(mxnet.gluon.data.Dataset):
def __init__(self,test_indices,images,annotations,root):
super().__init__()
self.test_indices=test_indices
self.images=images
self.annotations=annotations
self.root=root
def __len__(self):
return len(self.test_indices)
def __getitem__(self,idx):
index=self.test_indices[idx]
image=mxnet.image.imread(os.path.join(self.root,self.images[index]['file_name']))
bbox=self.annotations[index]['bbox']
return image,bbox
# We are setting the device variable to be the gpu context.
# This is done,so that we can load our data in the gpu.
device=mxnet.gpu(0)
# This function defines the transformation to the samples in the dataset.
# It resizes the image to be of (200,200) and normalizes them.
def custom_transformations(*sample):
mean= np.array([0.485, 0.456, 0.406]).reshape(3,1,1)
std=np.array([0.229, 0.224, 0.225]).reshape(3,1,1)
img=sample[0]
bbox=sample[1]
img=transforms.Resize(200)(img)
img=transforms.ToTensor()(img)
img[:]-=mean
img[:]/=std
return (img.as_np_ndarray().as_in_context(device),bbox.as_in_context(device))
1.) We normalize the image with ImageNet statistics.
2.) We resize the image to (200, 200).
3.) We convert it into a tensor (this moves the channel dimension to just after the batch axis).
4.) We put the data in GPU memory.
#Creating a dataset object from the class we defined earlier.
train_data=Ena24TrainDataset(train_indices,images,annotations,'/content/ena24')
#We are creating a dataloader object which encompasses the dataset object and is used to produce batches while training.
#We set the value of the shuffle parameter as True.This is done to change the order of dataset in each epoch.
#We declared the batch_size as 8.
train_dataloader=mxnet.gluon.data.DataLoader(train_data.transform(custom_transformations),batch_size=8,shuffle=True,last_batch='discard')
#Creating a dataset object from the class we defined earlier.
test_data=Ena24TestDataset(test_indices,images,annotations,'/content/ena24')
#We are creating a dataloader object which encompasses the dataset object and is used to produce batches while training.
#We set the value of the shuffle parameter as True.This is done to change the order of dataset in each epoch.
#We declared the batch_size as 8.
test_dataloader=mxnet.gluon.data.DataLoader(test_data.transform(custom_transformations),batch_size=8,shuffle=True,last_batch='discard')
#This model is inspired by the Inception network.
#We didn't use a max-pooling layer. There is no particular reason behind it; the model worked well without it. I have also heard Mr. Geoffrey Hinton say that max-pooling layers are not good.
#All other traditional things are followed here.
class CustomConv(nn.HybridBlock):
def __init__(self):
super().__init__()
self.conv31=nn.Conv2D(200,3,padding=1)
self.conv51=nn.Conv2D(100,5,padding=2)
self.conv71=nn.Conv2D(50,7,padding=3)
self.conv11=nn.Conv2D(300,1)
self.conv32=nn.Conv2D(350,3,padding=1,strides=2)
self.conv33=nn.Conv2D(400,3,padding=1,strides=2)
self.conv34=nn.Conv2D(450,3,padding=1,strides=2)
self.conv35=nn.Conv2D(500,3,padding=1,strides=2)
self.bn1=nn.BatchNorm()
self.bn2=nn.BatchNorm()
self.bn3=nn.BatchNorm()
self.bn4=nn.BatchNorm()
self.bn5=nn.BatchNorm()
self.bn6=nn.BatchNorm()
self.bn7=nn.BatchNorm()
self.bn8=nn.BatchNorm()
def hybrid_forward(self,F,x):
y=F.npx.relu(self.bn1(self.conv31(x)))
z=F.npx.relu(self.bn2(self.conv51(x)))
t=F.npx.relu(self.bn3(self.conv71(x)))
x=F.np.concatenate((y,z,t),axis=1)
x=F.npx.relu(self.bn4(self.conv11(x)))
x=F.npx.relu(self.bn5(self.conv32(x)))
x=F.npx.relu(self.bn6(self.conv33(x)))
x=F.npx.relu(self.bn7(self.conv34(x)))
x=F.npx.relu(self.bn8(self.conv35(x)))
return x
2.) The dataset contains objects of varying sizes in the images.
3.) To model this aspect of the dataset, we need to design our model accordingly.
4.) So, we used kernels of different sizes at the first layer and concatenated their activations along the feature dimension, so that the model captures the varying-size property.
#Instantiating the class to create an object.
model=CustomConv()
#Initializing the object. It is required before the optimization part. To get more details, please feel free to look at the MXNet documentation.
model.initialize()
#Creating a sample input.
x=np.ones(shape=(1,3,200,200))
#Transforming the input by the rules specified by the model.
#Bonus: It is equivalent to calling model.forward(x) (via the special __call__ method).
print(model(x).shape)
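If the shape arithmetic is right (my own calculation, worth checking against the printed value), this should show (1, 500, 13, 13): the first group of convolutions keeps the 200 x 200 spatial size, each of the four stride-2 convolutions then roughly halves it (200 -> 100 -> 50 -> 25 -> 13), and the last block outputs 500 channels.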
#Routine to calculate the number of trainable parameters.
#It uses the method collect_params(), which in turn returns the parameters stored in the model's kvstore.
ans=0
for i in list(model.collect_params().values()):
ans+=(i.data().size)
#Displaying the total parameters.
print("Total number of parameters ",ans)
Visualizing the model
#The hybridize method is used to convert the model's definition to a symbolic representation (used for C++ inference). It is an advanced concept.
model.hybridize()
model(x).shape
#We are using the Mxboard library which internally integrates with tensorboard.
#We are using the SummaryWriter and defining a context to add the model to visualize it.
from mxboard import SummaryWriter
with SummaryWriter(logdir='./ena24_tensorboard/model1') as sw:
sw.add_graph(model)
#Magic command to use the tensorboard in the jupyter notebook.
%load_ext tensorboard
#Loading the tensorboard.
%tensorboard --logdir ./ena24_tensorboard/model1/
Finally, we need to create a class which encapsulates all the information required to train our model.
#We are creating the final class to define all the methods and invariants for our final model.
#We define the sizes and ratios for the anchor boxes we are going to propose.
#We are defining a class prediction layer: a convolution block that transforms the input (batch_size, number of channels, height, width) to (batch_size, (number_of_anchors)*(num_classes+1), height, width).
#We are defining a bounding box prediction layer: a convolution block that transforms the input (batch_size, number of channels, height, width) to (batch_size, (number_of_anchors)*4, height, width).
#We used a convolution block, since it holds fewer parameters than a dense layer.
#We are not using multi-stage blocks, so there is no need to concatenate predictions at multiple stages.
#But I still implemented the routine for multi-stage blocks, for future use.
class Ena24SSD(nn.HybridBlock):
def __init__(self):
super(Ena24SSD,self).__init__()
self.sizes=[[0.27,0.44,0.52,0.62,0.71,0.85]]
self.ratios=[[1,1.5,2,2.2]]
self.num_anchors=len(self.sizes[0])+len(self.ratios[0])-1
self.num_classes=23
self.class_predict=self.class_predictor()
self.bbox_predict=self.bbox_predictor()
self.features=CustomConv()
def class_predictor(self):
s=nn.HybridSequential()
g=[nn.Conv2D(self.num_anchors*(self.num_classes+1),3,padding=1),nn.BatchNorm(),nn.Activation('softrelu')]
s.add(*g)
return s
def bbox_predictor(self):
b=nn.HybridSequential()
f=[nn.Conv2D(kernel_size=3,padding=1,channels=380),nn.BatchNorm(),nn.Activation('relu')]
s=[nn.Conv2D(kernel_size=3,padding=1,channels=220),nn.BatchNorm(),nn.Activation('relu')]
t=[nn.Conv2D(kernel_size=3,padding=1,channels=120),nn.BatchNorm(),nn.Activation('relu')]
fo=[nn.Conv2D(kernel_size=3,padding=1,channels=4*self.num_anchors),nn.BatchNorm(),nn.Activation('relu')]
b.add(*f,*s,*t,*fo)
return b
def hybrid_forward(self, F, x, *args, **kwargs):
feature=[]
feature.append(self.features(x))
cls_preds=F.npx.batch_flatten(F.np.transpose(self.class_predict(feature[0]),(0,2,3,1)))
bbox_preds=F.npx.batch_flatten(F.np.transpose(self.bbox_predict(feature[0]),(0,2,3,1)))
anchors=F.np.reshape(F.npx.multibox_prior(feature[0],self.sizes[0],self.ratios[0]),(1,-1))
cls_preds=F.npx.reshape(cls_preds,(-2,-1,self.num_classes+1))
bbox_preds=F.npx.reshape(bbox_preds,(-2,-1))
anchors=F.np.reshape(anchors,(1,-1,4))
return anchors,cls_preds,bbox_preds
#We are instantiating the class and creating an object.
model=Ena24SSD()
#We are initializing the model.
model.initialize(ctx=device)
#We are transforming the image to anchors,class_predictions,boundingbox_offsets.
image=np.ones(shape=(1,3,200,200)).as_in_context(device)
anchors,class_predictions,boundingbox_predictions=model(image)
#We are displaying the shape of the outputs returned by the model.
anchors.shape,class_predictions.shape,boundingbox_predictions.shape
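With a 200 x 200 input the backbone produces a 13 x 13 feature map, so by my arithmetic the printed shapes should be roughly (1, 1521, 4) for the anchors, (1, 1521, 24) for the class predictions (23 classes plus background) and (1, 6084) for the box offsets.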
#This code is taken from the matplotlib documentation.
#It is used to return the colors for the edges in the matplotlib library
prop_cycle = plt.rcParams['axes.prop_cycle']
colors = prop_cycle.by_key()['color']
#Displaying the available colors.
print(colors)
#We are visualizing the anchors on a dummy image.
#The anchor boxes are designed to cover all sizes of the object in an image.
#We used the multi_bbox function which was defined earlier in the notebook.
a=anchors.reshape(13,13,9,4)
a=a[5,5,:,:]
image=mxnet.image.imread('/content/sample.PNG')
image=transforms.Resize(13)(image)
fig=d2l.plt.imshow(image.asnumpy())
scale=np.array([[13,13,13,13]]).as_in_context(device)
multi_bbox(fig.axes,a*scale,['' for i in a],colors[:9],13)
#We are using the collect_params() method to display the parameters which the model holds.
print(model.collect_params())
#We are hybridizing the model.
model.hybridize()
a,c,b,=model(x.as_in_context(device))
#We are exporting the model to tensorboard format.
from mxboard import SummaryWriter
with SummaryWriter(logdir='./ena24_tensorboard/model2') as sw:
sw.add_graph(model)
#We are loading the tensorboard.
%tensorboard --logdir ./ena24_tensorboard/model2/
Metrics
#The micro F1 score is used for the multiclass detection problem. We need to implement it on our own (it is not implemented in the library itself).
class MicroF1Score(mxnet.gluon.nn.HybridBlock):
def __init__(self,num_classes):
super().__init__()
self.classes=list(range(num_classes+1))
self.true_pos=[0]*(len(self.classes)+1)
self.false_pos=[0]*(len(self.classes)+1)
self.false_neg=[0]*(len(self.classes)+1)
def hybrid_forward(self,F,x):
pred=x[0]
true=x[1]
for i in range(1,len(self.classes)):
p=mxnet.np.equal(pred,i)
t=mxnet.np.equal(true,i)
true_positive=float((p*t).sum())
false_positive=float(((true!=i)*p).sum())
false_negative=float((t*(pred!=i)).sum())
self.true_pos[i]=true_positive
self.false_pos[i]=false_positive
self.false_neg[i]=false_negative
true_pos=sum(self.true_pos)
false_pos=sum(self.false_pos)
false_neg=sum(self.false_neg)
if (true_pos+false_pos)==0:
precision=0
else:
precision=true_pos/(true_pos+false_pos)
if (true_pos+false_neg)==0:
recall=0
else:
recall=true_pos/(true_pos+false_neg)
if (precision+recall)==0:
f1score=0
else:
f1score=(2*(precision*recall))/(precision+recall)
return float(format(f1score,'.2g'))
#Instantiating the micro F1 metric for the 23 classes (used in the training loop below).
f1=MicroF1Score(23)
#Overall accuracy
def evaluateclass(class_preds,class_labels):
predictions=npx.softmax(class_preds).argmax(axis=-1)
return ((predictions.astype(class_labels)==class_labels).mean()).item()
#Mean absolute deviation (for bounding box).
def evaluatebbox(bbox_preds,bbox_labels,bbox_masks):
return ((np.abs((bbox_labels-bbox_preds)*bbox_masks)).mean()).item()
Loss Function
#Since it is a multiclass detection problem, we use the softmax cross-entropy loss.
classs_loss=mxnet.gluon.loss.SoftmaxCrossEntropyLoss()
#For the bounding box offset loss.
bbox_loss=mxnet.gluon.loss.L1Loss()
#Container for loss computation.
#batch_size must match the dataloader batch size; num_anchors is the number of anchors per image
#(13*13*9 = 1521 for a 200x200 input through this model).
batch_size=8
num_anchors=1521
class LossBox:
def __init__(self):
self.weight=(np.ones(shape=(batch_size,num_anchors))).as_in_context(device)
self.weight1=((np.ones(shape=(batch_size,num_anchors*4)))*2).as_in_context(device)
def calculate_loss(self,class_preds,class_labels,bbox_preds,bbox_labels,bbox_masks,train):
if train==1:
weights=((self.weight*(class_labels!=0))*900)+((np.ones(shape=class_labels.shape,ctx=device))*100)
loss_class=classs_loss(class_preds,class_labels,np.expand_dims(weights,axis=-1))
loss_bbox=bbox_loss(bbox_preds*bbox_masks,bbox_labels*bbox_masks,self.weight1*bbox_masks)
else:
loss_class=classs_loss(class_preds,class_labels)
loss_bbox=bbox_loss(bbox_preds*bbox_masks,bbox_labels*bbox_masks)
return loss_bbox+loss_class
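A note on the weighting above (my reading of the code, worth double-checking): during training, anchors matched to a real class receive a cross-entropy weight of 900 + 100 = 1000 while background anchors receive 100, which counteracts the heavy imbalance between background and object anchors; the offset term is doubled via weight1 but multiplied by bbox_masks, so only anchors matched to a ground-truth box contribute to it.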
Optimization
#Instantiating the loss container defined above; it is used in the training loop.
losses=LossBox()
#We are defining the trainer object.
#We are using stochastic gradient descent in batches.
#We could use Adam, NAG, etc., but they introduce extra hyperparameters of their own.
#Plain SGD works well here.
trainer=mxnet.gluon.Trainer(model.collect_params(),'sgd',{'learning_rate':0.01})
Training
#We are declaring some empty lists to store the metrics and losses.
train_loss_l=[]
test_loss_l=[]
train_f1_l=[]
test_f1_l=[]
train_accuracy_l=[]
test_accuracy_l=[]
train_mae_l=[]
test_mae_l=[]
num_epochs=50
with SummaryWriter(logdir='./models/plots') as sw:
for epoch in tqdm(range(num_epochs)):
start=time.time()
loss=0
train_loss=0
test_loss=0
train_f1=0
test_f1=0
train_accuracy=0
test_accuracy=0
train_mae=0
test_mae=0
for data in train_dataloader:
image,label=data[0],data[1]
with mxnet.autograd.record():
anchors,class_predictions,bbox_predictions=model(image)
bbox_labels,bbox_masks,class_labels=npx.multibox_target(anchors,label,class_predictions.transpose(0,2,1))
l=(losses.calculate_loss(class_predictions,class_labels,bbox_predictions,bbox_labels,bbox_masks,1)).sum()
loss=loss+(l.mean().item())
l.backward()
if len(train_loss_l)>=2:
if train_loss_l[-1]>=train_loss_l[-2]:
trainer.set_learning_rate(trainer.optimizer.lr/5)
trainer.step(batch_size)
train_loss=loss
train_mae=evaluatebbox(bbox_predictions,bbox_labels,bbox_masks)
class_p=(npx.softmax(class_predictions).argmax(axis=-1)).reshape(-1)
class_l=class_labels.reshape(-1)
train_f1=f1((class_p,class_l))
train_accuracy=evaluateclass(class_predictions,class_labels)
train_loss_l.append(train_loss)
train_f1_l.append(train_f1)
train_accuracy_l.append(train_accuracy)
train_mae_l.append(train_mae)
loss=0
for data in test_dataloader:
image,label=data[0],data[1]
anchors,class_predictions,bbox_predictions=model(image)
bbox_labels,bbox_masks,class_labels=npx.multibox_target(anchors,label,class_predictions.transpose(0,2,1))
l=(losses.calculate_loss(class_predictions,class_labels,bbox_predictions,bbox_labels,bbox_masks,0)).sum()
loss=loss+(l.mean().item())
test_loss=loss
test_mae=evaluatebbox(bbox_predictions,bbox_labels,bbox_masks)
class_p=(npx.softmax(class_predictions).argmax(axis=-1)).reshape(-1)
class_l=class_labels.reshape(-1)
test_f1=f1((class_p,class_l))
test_accuracy=evaluateclass(class_predictions,class_labels)
test_loss_l.append(test_loss)
test_f1_l.append(test_f1)
test_accuracy_l.append(test_accuracy)
test_mae_l.append(test_mae)
sw.add_scalar(tag='Log_loss',value={'train':train_loss,'test':test_loss},global_step=epoch)
sw.add_scalar(tag='Accuracy',value={'train':train_accuracy,'test':test_accuracy},global_step=epoch)
sw.add_scalar(tag='MAe',value={'train':train_mae,'test':test_mae},global_step=epoch)
sw.add_scalar(tag='Micro_F1',value={'train':train_f1,'test':test_f1},global_step=epoch)
if train_f1>=0.8 and test_f1>=0.8:
model.save_parameters('model_'+str(epoch))
dicts=dict(model.collect_params())
for i in dicts.keys():
if i[-6:]=='weight':
sw.add_histogram(tag=i,values=dicts[i].data(),global_step=epoch,bins=200)
else:
if i[-4:]=='bias':
sw.add_histogram(tag=i,values=dicts[i].data(),global_step=epoch)
end=time.time()
print("Time taken to run epoch ",epoch," ",(end-start)/60," minutes")
#We are loading the tensorboard.
%tensorboard --logdir models/plots/
Color coding :
Log loss :
Accuracy :
MAE (Mean absolute deviation ):
Micro_F1 score :
from prettytable import PrettyTable
table = PrettyTable()
table.field_names = [ 'Accuracy/train','Accuracy/test','Log_loss/train','Log_loss/test','MAe/train','MAe/test','Micro_F1/train','Micro_F1/test']
table.add_row([0.9988,0.9985,1.39,1.40,0.0025,0.0024,0.89,0.88])
print(table)
model.save_parameters('model')
Inference :
#We are transforming the image
def image_transform(image):
mean= np.array([0.485, 0.456, 0.406]).reshape(3,1,1)
std=np.array([0.229, 0.224, 0.225]).reshape(3,1,1)
image=transforms.ToTensor()(image)
image[:]-=mean
image[:]/=std
return np.expand_dims((image.as_np_ndarray()),axis=0).as_in_context(device)
#Used to display the anchor boxes on the image.
def bbox_show(bbox,image):
fig=d2l.plt.imshow(image)
multi_bbox(fig.axes,bbox[:,1:]*np.array([[200,200,200,200]]),[categories[int(i[0])] if int(i[0])!=-1 else 'Null' for i in bbox],['r']*(bbox.shape[0]),200)
#This routine takes in the image location and applies the transformations specified by the model.
#It uses non-max suppression to filter out overlapping anchor boxes.
#It outputs the anchor boxes whose class probability is higher than the threshold.
def prediction(image_location,threshold):
image=mxnet.image.imread(image_location)
image=transforms.Resize(200)(image)
image_numpy=image.asnumpy()
image=image_transform(image)
anchors,class_predictions,bbox_offsets=model(image)
class_predictions=npx.softmax(class_predictions)
output=npx.multibox_detection(class_predictions.transpose(0,2,1),bbox_offsets,anchors,nms_threshold=0.5)
output=output[0]
bbox=[]
for i in output:
if i[0]!=-1:
if i[1]>=threshold:
bbox.append(i[[0,2,3,4,5]])
bbox=np.array(bbox)
bbox_show(bbox,image_numpy)
#Calling the method.
prediction('/content/ena24/125.jpg',0.97)
Creating a web application
#Exporting the model.
model.export('Ena24SSDMODEL')
Installing the requirements
#Installing the required libraries for creating a web application.
!pip install -U ipykernel
!pip install -q streamlit
!pip install pyngrok
#This magic command writes all the code into a file named model.py
%%writefile model.py
#Importing the necessary modules
import warnings,mxnet
from mxnet import gluon
ctx=mxnet.cpu(0)
import mxnet
from mxnet import np,npx,image
import os,json,tqdm
from tqdm import tqdm
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
from d2l import mxnet as d2l
import matplotlib.pyplot as plt
import time
from mxnet.gluon import nn
import gluoncv
from mxnet.gluon.data.vision import transforms
import streamlit as st
from PIL import Image
npx.set_np()
#Setting the device and other things.
device=mxnet.cpu(0)
ctx=mxnet.cpu(0)
st.set_option('deprecation.showfileUploaderEncoding',False)
st.header('ENA24 OBJECT DETECTION USING SSD')
st.subheader("Implemented in MXNet framework")
#This routine is used to load the model and uses st.cache decorator.
@st.cache(allow_output_mutation=True)
def load_model():
with warnings.catch_warnings():
warnings.simplefilter("ignore")
deserialized_net = gluon.nn.SymbolBlock.imports("Ena24SSDMODEL-symbol.json", ['data'], "Ena24SSDMODEL-0000.params", ctx=ctx)
return deserialized_net
#A spinner widget.
with st.spinner("Loading into memory"):
model=load_model()
#Defining the categories.
c=[{'name': 'Bird', 'id': 0}, {'name': 'Eastern Gray Squirrel', 'id': 1}, {'name': 'Eastern Chipmunk', 'id': 2}, {'name': 'Woodchuck', 'id': 3}, {'name': 'Wild Turkey', 'id': 4}, {'name': 'White_Tailed_Deer', 'id': 5}, {'name': 'Virginia Opossum', 'id': 6}, {'name': 'Eastern Cottontail', 'id': 7}, {'name': 'Human', 'id': 8}, {'name': 'Vehicle', 'id': 9}, {'name': 'Striped Skunk', 'id': 10}, {'name': 'Red Fox', 'id': 11}, {'name': 'Eastern Fox Squirrel', 'id': 12}, {'name': 'Northern Raccoon', 'id': 13}, {'name': 'Grey Fox', 'id': 14}, {'name': 'Horse', 'id': 15}, {'name': 'Dog', 'id': 16}, {'name': 'American Crow', 'id': 17}, {'name': 'Chicken', 'id': 18}, {'name': 'Domestic Cat', 'id': 19}, {'name': 'Coyote', 'id': 20}, {'name': 'Bobcat', 'id': 21}, {'name': 'American Black Bear', 'id': 22}]
categories={}
for i in c:
categories[i['id']]=i['name']
#Routine for displaying the bounding boxes.
def bbox_plot(bbox,color):
return d2l.plt.Rectangle(
xy=(bbox[0], bbox[1]), width=bbox[2]-bbox[0], height=bbox[3]-bbox[1],
fill=False, edgecolor=color, linewidth=2)
#Routine to add some more details to the anchor boxes.
def multi_bbox(axes,bbox,labels,colors,size):
for i,b in enumerate(bbox):
if b[0]!=-size:
c=colors[i]
l=labels[i]
rectangle=bbox_plot(b,c)
axes.add_patch(rectangle)
t_color='w'
axes.text(rectangle.xy[0], rectangle.xy[1], l,
va='center', ha='center', fontsize=9, color=t_color,
bbox=dict(facecolor=c, lw=0))
axes.figure.savefig('/content/normal')
#This routine is used to transform the image.
def image_transform(image):
mean= np.array([0.485, 0.456, 0.406]).reshape(3,1,1)
std=np.array([0.229, 0.224, 0.225]).reshape(3,1,1)
image=transforms.ToTensor()(image)
image[:]-=mean
image[:]/=std
return np.expand_dims((image.as_np_ndarray()),axis=0).as_in_context(device)
#This routine is used to extract the saved plot and displays it.
def bbox_show(bbox,image):
fig=d2l.plt.imshow(image)
multi_bbox(fig.axes,bbox[:,1:]*np.array([[200,200,200,200]]),[categories[int(i[0])] if int(i[0])!=-1 else 'Null' for i in bbox],['r']*(bbox.shape[0]),200)
image=Image.open('/content/normal.png')
st.image(image,use_column_width=True)
#This is the actual prediction logic, as we discussed earlier.
def prediction(image_location,threshold):
image=mxnet.image.imread(image_location)
image=transforms.Resize(200)(image)
image_numpy=image.asnumpy()
image=image_transform(image)
anchors,class_predictions,bbox_offsets=model(image)
class_predictions=npx.softmax(class_predictions)
output=npx.multibox_detection(class_predictions.transpose(0,2,1),bbox_offsets,anchors,nms_threshold=0.5)
output=output[0]
bbox=[]
for i in output:
if i[0]!=-1:
if i[1]>=threshold:
bbox.append(i[[0,2,3,4,5]])
bbox=np.array(bbox)
if len(bbox)!=0:
bbox_show(bbox,image_numpy)
return 1
else:
return None
#A simple UI declaration.
path=st.text_input('Enter image location')
if path:
threshold=st.slider('Enter the threshold value ',min_value=0.1,max_value=1.0,step=0.01)
if path and threshold:
with st.spinner("Doing"):
a=prediction(path,float(threshold))
if a is None:
st.text('Sorry...unable to recognize')
#We are getting a public URL rather than a localhost URL.
from pyngrok import ngrok
url=ngrok.connect(port=8501)
#Printing the URL
print(url)
Screenshot of our web application