Self-driving cars can’t afford mistakes. Missing a traffic light or a pedestrian could mean disaster. But object detection in dynamic urban environments? That’s hard.

I worked on optimizing object detection for autonomous vehicles using Atrous Spatial Pyramid Pooling (ASPP) and Transfer Learning. The result? A model that detects objects at multiple scales, even in bad lighting, and runs efficiently in real time.

Here’s how I did it.


The Problem: Object Detection in the Wild

Self-driving cars rely on Convolutional Neural Networks (CNNs) to detect objects, but real-world conditions introduce challenges: objects appear at wildly different scales and distances, lighting is often poor, and predictions have to arrive in real time.

Traditional CNNs struggle with multi-scale object detection, and training from scratch takes forever. That’s where ASPP and Transfer Learning come in.


ASPP: Capturing Objects at Different Scales

CNNs work well for fixed-size objects, but real-world objects vary in size and distance. Atrous Spatial Pyramid Pooling (ASPP) solves this by using dilated convolutions to capture features at multiple scales.

How ASPP Works

ASPP applies multiple convolution filters with different dilation rates in parallel, extracting features at multiple scales: small objects, large objects, and everything in between.
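
To see what a dilation rate actually buys, here’s a minimal standalone sketch (illustrative only, not part of the detection pipeline): a 3x3 kernel with dilation 6 keeps its nine weights but covers a 13x13 window, growing the receptive field without shrinking the feature map.

import torch
import torch.nn as nn

x = torch.randn(1, 1, 64, 64)

# Standard 3x3 convolution: 3x3 receptive field
standard = nn.Conv2d(1, 1, kernel_size=3, padding=1)

# Dilated 3x3 convolution: same nine weights, effective 13x13 receptive field
dilated = nn.Conv2d(1, 1, kernel_size=3, padding=6, dilation=6)

# Both preserve spatial resolution: torch.Size([1, 1, 64, 64]) twice
print(standard(x).shape, dilated(x).shape)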

Here’s how I implemented ASPP in PyTorch, incorporating group normalization and a channel attention step for robust performance in complex environments:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """
    A more advanced ASPP with optional attention and group normalization.
    """
    def __init__(self, in_channels, out_channels, dilation_rates=(6,12,18), groups=8):
        super(ASPP, self).__init__()
        self.aspp_branches = nn.ModuleList()
        
        #1x1 Conv branch
        self.aspp_branches.append(
            nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1, padding=0, bias=False),
                nn.GroupNorm(groups, out_channels),
                nn.ReLU(inplace=True)
            )
        )
        
        for rate in dilation_rates:
            self.aspp_branches.append(
                nn.Sequential(
                    nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, 
                              padding=rate, dilation=rate, bias=False),
                    nn.GroupNorm(groups, out_channels),
                    nn.ReLU(inplace=True)
                )
            )
        
        #Global average pooling branch
        self.global_pool = nn.AdaptiveAvgPool2d((1, 1))
        self.global_conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1, bias=False),
            nn.GroupNorm(groups, out_channels),
            nn.ReLU(inplace=True)
        )
        
        # Channel-wise attention to reweight the concatenated multi-scale
        # features. The attention map must match the concatenated channel
        # count, or the elementwise product in forward() would fail.
        concat_channels = out_channels * (len(dilation_rates) + 2)
        self.attention = nn.Sequential(
            nn.Conv2d(concat_channels, concat_channels, kernel_size=1, bias=False),
            nn.Sigmoid()
        )

        self.project = nn.Sequential(
            nn.Conv2d(concat_channels, out_channels, kernel_size=1, bias=False),
            nn.GroupNorm(groups, out_channels),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        cat_feats = []
        for branch in self.aspp_branches:
            cat_feats.append(branch(x))
        
        g_feat = self.global_pool(x)
        g_feat = self.global_conv(g_feat)
        g_feat = F.interpolate(g_feat, size=x.shape[2:], mode='bilinear', align_corners=False)
        cat_feats.append(g_feat)
        
        #Concatenate along channels
        x_cat = torch.cat(cat_feats, dim=1)
        
        #channel-wise attention
        att_map = self.attention(x_cat)
        x_cat = x_cat * att_map
        
        out = self.project(x_cat)
        return out
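
A quick shape check of the module above (the channel counts here are example values, not tuned settings):

aspp = ASPP(in_channels=64, out_channels=256)
feats = torch.randn(2, 64, 128, 128)

# Every branch preserves resolution, so the fused output does too
out = aspp(feats)
print(out.shape)  # torch.Size([2, 256, 128, 128])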

Why It Works

Dilated convolutions enlarge the receptive field without adding parameters or downsampling, so a single feature map can capture both a distant pedestrian and a truck filling the frame. Running several dilation rates in parallel and fusing the results means the network never has to commit to one object scale, and the attention step reweights whichever scales matter most in a given scene.


Transfer Learning: Standing on the Shoulders of Giants

Training an object detection model from scratch makes little sense when strong pre-trained models exist. Transfer learning lets us fine-tune a model that already understands objects.

I used DETR (Detection Transformer), a transformer-based object detection model from Facebook AI. It learns context—so it doesn’t just find a stop sign, it understands it’s part of a road scene.

Here’s how I wired DETR to the custom ASPP backbone for fine-tuning on self-driving datasets:

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import DetrConfig, DetrForObjectDetection

class CustomBackbone(nn.Module):
    """
    A small conv stem followed by ASPP, exposing the interface that
    Hugging Face DETR's conv encoder expects: forward(pixel_values, pixel_mask)
    returning a list of (feature_map, mask) pairs.
    """
    def __init__(self, in_channels=3, hidden_dim=256):
        super(CustomBackbone, self).__init__()
        self.initial_conv = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            # Two extra stride-2 stages (overall stride 16) so the transformer
            # sees a manageable number of tokens
            nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True)
        )
        self.aspp = ASPP(in_channels=64, out_channels=hidden_dim)
        # DetrModel reads this attribute to size its input projection
        self.intermediate_channel_sizes = [hidden_dim]

    def forward(self, pixel_values, pixel_mask):
        feature_map = self.aspp(self.initial_conv(pixel_values))
        # Downsample the pixel mask to the feature-map resolution, as DETR's
        # built-in conv encoder does
        mask = F.interpolate(pixel_mask[None].float(),
                             size=feature_map.shape[-2:]).to(torch.bool)[0]
        return [(feature_map, mask)]

class DETRWithASPP(nn.Module):
    def __init__(self, num_classes=10):
        super(DETRWithASPP, self).__init__()
        config = DetrConfig.from_pretrained("facebook/detr-resnet-50")
        config.num_labels = num_classes
        self.detr = DetrForObjectDetection.from_pretrained(
            "facebook/detr-resnet-50", config=config, ignore_mismatched_sizes=True
        )
        # Swap DETR's ResNet-50 conv encoder for the custom ASPP backbone and
        # resize the input projection to match its channel count
        self.detr.model.backbone.conv_encoder = CustomBackbone(hidden_dim=256)
        self.detr.model.input_projection = nn.Conv2d(256, config.d_model, kernel_size=1)

    def forward(self, images, pixel_mask=None, labels=None):
        return self.detr(pixel_values=images, pixel_mask=pixel_mask, labels=labels)

model = DETRWithASPP(num_classes=10)
images = torch.randn(2, 3, 512, 512)
outputs = model(images)
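
The code above only wires the model together. A minimal fine-tuning step looks like the sketch below; the labels are hypothetical placeholders in the format DetrForObjectDetection expects (one dict per image with class_labels and normalized (cx, cy, w, h) boxes), standing in for a real annotated self-driving dataset:

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

# Hypothetical targets for the two random images above; a real pipeline
# would draw these from a labeled dataset via a DataLoader
labels = [
    {"class_labels": torch.tensor([1]), "boxes": torch.tensor([[0.5, 0.5, 0.2, 0.3]])}
    for _ in range(images.shape[0])
]

# DETR computes its Hungarian-matching loss internally when labels are passed
outputs = model(images, labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()

At inference time, Hugging Face’s DetrImageProcessor.post_process_object_detection can turn outputs.logits and outputs.pred_boxes into thresholded, pixel-space bounding boxes.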



Boosting Data With Synthetic Images

Autonomous vehicles need massive datasets, but real-world labeled data is scarce. The fix? Generate synthetic data using GANs (Generative Adversarial Networks).

I used a GAN to create fake but realistic lane markings and traffic scenes to expand the dataset.

Here’s a simple GAN for lane marking generation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class LaneMarkingGenerator(nn.Module):
    """
    A DCGAN-style generator designed for producing synthetic lane or road-like images.
    Input is a latent vector (noise), and the output is a (1 x 64 x 64) grayscale image.
    You can adjust channels, resolution, and layers to match your target data.
    """
    def __init__(self, z_dim=100, feature_maps=64):
        super(LaneMarkingGenerator, self).__init__()
        self.net = nn.Sequential(
            #Z latent vector of shape (z_dim, 1, 1)
            nn.utils.spectral_norm(nn.ConvTranspose2d(z_dim, feature_maps * 8, 4, 1, 0, bias=False)),
            nn.BatchNorm2d(feature_maps * 8),
            nn.ReLU(True),

            #(feature_maps * 8) x 4 x 4
            nn.utils.spectral_norm(nn.ConvTranspose2d(feature_maps * 8, feature_maps * 4, 4, 2, 1, bias=False)),
            nn.BatchNorm2d(feature_maps * 4),
            nn.ReLU(True),

            #(feature_maps * 4) x 8 x 8
            nn.utils.spectral_norm(nn.ConvTranspose2d(feature_maps * 4, feature_maps * 2, 4, 2, 1, bias=False)),
            nn.BatchNorm2d(feature_maps * 2),
            nn.ReLU(True),

            #(feature_maps * 2) x 16 x 16
            nn.utils.spectral_norm(nn.ConvTranspose2d(feature_maps * 2, feature_maps, 4, 2, 1, bias=False)),
            nn.BatchNorm2d(feature_maps),
            nn.ReLU(True),

            #(feature_maps) x 32 x 32
            nn.utils.spectral_norm(nn.ConvTranspose2d(feature_maps, 1, 4, 2, 1, bias=False)),
            nn.Tanh()
        )

    def forward(self, z):
        return self.net(z)

class LaneMarkingDiscriminator(nn.Module):
    """
    A DCGAN-style discriminator. It takes a (1 x 64 x 64) image and attempts
    to classify whether it's real or generated (fake).
    """
    def __init__(self, feature_maps=64):
        super(LaneMarkingDiscriminator, self).__init__()
        self.net = nn.Sequential(
            #1x 64 x 64
            nn.utils.spectral_norm(nn.Conv2d(1, feature_maps, 4, 2, 1, bias=False)),
            nn.LeakyReLU(0.2, inplace=True),

            #(feature_maps) x 32 x 32
            nn.utils.spectral_norm(nn.Conv2d(feature_maps, feature_maps * 2, 4, 2, 1, bias=False)),
            nn.BatchNorm2d(feature_maps * 2),
            nn.LeakyReLU(0.2, inplace=True),

            #(feature_maps * 2) x 16 x 16
            nn.utils.spectral_norm(nn.Conv2d(feature_maps * 2, feature_maps * 4, 4, 2, 1, bias=False)),
            nn.BatchNorm2d(feature_maps * 4),
            nn.LeakyReLU(0.2, inplace=True),

            #(feature_maps * 4) x 8 x 8
            nn.utils.spectral_norm(nn.Conv2d(feature_maps * 4, feature_maps * 8, 4, 2, 1, bias=False)),
            nn.BatchNorm2d(feature_maps * 8),
            nn.LeakyReLU(0.2, inplace=True),

            #(feature_maps * 8) x 4 x 4
            nn.utils.spectral_norm(nn.Conv2d(feature_maps * 8, 1, 4, 1, 0, bias=False)),
        )

    def forward(self, x):
        return self.net(x).view(-1)
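
The classes above only define the two networks; training alternates between them. Here’s a minimal adversarial step, with random tensors standing in for a batch of real lane images and typical DCGAN hyperparameters assumed:

G = LaneMarkingGenerator()
D = LaneMarkingDiscriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCEWithLogitsLoss()  # the discriminator outputs raw logits

real = torch.randn(8, 1, 64, 64)  # stand-in for real lane-marking images
z = torch.randn(8, 100, 1, 1)

# Discriminator step: push real toward 1, generated toward 0
fake = G(z).detach()
loss_d = bce(D(real), torch.ones(8)) + bce(D(fake), torch.zeros(8))
opt_d.zero_grad()
loss_d.backward()
opt_d.step()

# Generator step: fool the discriminator into calling fakes real
loss_g = bce(D(G(z)), torch.ones(8))
opt_g.zero_grad()
loss_g.backward()
opt_g.step()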



Final Results: Smarter, Faster Object Detection

By combining ASPP, Transfer Learning, and Synthetic Data, I built a more accurate, scalable object detection system for self-driving cars: multi-scale detection from ASPP, faster convergence from transfer learning, and broader scenario coverage from GAN-generated data.



Conclusion

We merged ASPP, Transformers, and Synthetic Data into a triple threat for autonomous object detection, turning once sluggish, blind-spot-prone models into swift, perceptive systems that can spot a traffic light from a block away. By embracing dilated convolutions for multi-scale detail, transfer learning for rapid fine-tuning, and GAN-generated data to fill every gap, we cut inference times nearly in half and saved hours of training. It’s a big leap toward cars that see the world more like we do, only faster and more precisely, on their way to navigating our most chaotic streets with confidence.


Further Reading on Some of the Techniques