Table of Contents
- Introduction: Labels are expensive, images are free
- Part 1 — Data Exploration: Understanding the Data You’re Working With
- Part 2 — Preprocessing: Speaking the Model’s Language
- Part 3 — Feature Extraction: Turning Images Into Meaningful Numbers
- Part 4 — Unsupervised Clustering: Discovering Structure in the Dark
- Part 5 — Semi-Supervised Training: The Core Experiment
- Part 6 — Scaling to Millions of Images: A Realistic Roadmap
- Conclusion
Labels are expensive, images are free
In a perfect world, every image in your dataset would have a label. "Defective." "Normal." "Crack type A." "Scratch type B." But in the real world, labeling is brutally expensive. A single medical radiologist can label maybe 50 brain scans per hour — at €200/hour. An industrial quality inspector can annotate maybe 100 parts per hour. At scale, labeling 100,000 images can cost more than training the model itself.
Here's the paradox: companies often have millions of images (from cameras, sensors, user uploads) but can only afford to label a tiny fraction. This is where semi-supervised learning comes in. Instead of throwing away the unlabeled images, we use them to improve the model — combining a small labeled set with a large unlabeled set.
Let me use an analogy. Imagine you're a teacher with 30 students. You give them all an exam, but you only have time to grade 5 papers. You grade those 5 carefully, and you notice patterns: students who wrote a lot tend to get high scores, and students who left blank answers tend to get low scores. Using those patterns, you can estimate the grades of the other 25 papers without reading every line. That's semi-supervised learning: you use the few graded papers (labeled data) to understand patterns, then apply those patterns to the ungraded ones (unlabeled data).
This article builds a complete semi-supervised image classification pipeline from scratch. We'll work through every step with detailed explanations as we go.
Case study: detecting manufacturing defects on metal surfaces. A factory produces steel plates, and cameras photograph each plate as it rolls off the production line. Most plates are normal, some have defects (scratches, cracks, pitting, inclusions). We have 10,000 images, but only 200 labeled ones.
Building a Semi-Supervised Learning Pipeline
Before diving into any code, let's map out the entire pipeline. Understanding the flow first will make each step feel purposeful rather than random. Take a moment to study this diagram — every box is a step we'll implement:
THE COMPLETE SEMI-SUPERVISED PIPELINE:
[10,000 raw images] ──▶ [Exploration & Cleaning]
│
▼
[Preprocessing: resize, normalize,
histogram equalization]
│
▼
[Feature Extraction: pretrained ResNet50
→ 2048-dim embedding per image]
│
┌─────────┴──────────┐
│ │
▼ ▼
[200 LABELED images] [9,800 UNLABELED images]
│ │
│ ▼
│ [Clustering: K-Means, DBSCAN
│ on embeddings → pseudo-labels]
│ │
│ ▼
│ [WEAKLY labeled dataset]
│ (cluster assignments)
│ │
▼ ▼
┌────────────────────────────────────┐
│ SEMI-SUPERVISED TRAINING: │
│ 1. Pre-train CNN on weakly labeled │
│ 2. Fine-tune CNN on strongly labeled│
│ 3. Compare vs supervised-only │
└────────────────────────────────────┘
│
▼
[Evaluation: F1, AUC-ROC,
confusion matrix, comparison]
The key insight: we'll turn unlabeled images into approximately labeled images (via clustering), then use that approximate knowledge to give the final model a head start. Think of it as giving a student a rough study guide before the real exam — it won't be perfect, but it's better than nothing.
Let's also be explicit about the vocabulary we'll use throughout this article, because mixing up terms is a very common source of confusion:
- Strongly labeled (or just "labeled"): images with labels verified by a human expert. Gold standard. We have 200 of these.
- Weakly labeled (or "pseudo-labeled"): images whose labels were guessed by clustering. Cheaper but noisier. We'll create 9,800 of these.
- Unlabeled: images with no label at all. This is their state before we cluster them.
- Embedding: a compact numerical summary of an image, produced by a pretrained neural network. Our main tool for making images comparable.
Part 1 — Data exploration: understanding the data you're working with
Why you must look at your data before doing anything else
Image datasets have unique failure modes you won't find in tabular data: corrupted files that crash your training loop at 3 AM, inconsistent resolutions that silently distort your images, wrong color channels (grayscale disguised as RGB), and extreme class imbalance where 95% of images are "normal." If you skip exploration, your model will silently learn garbage — and you won't know why it performs poorly.
The golden rule: never trust data you haven't inspected. Let's see what we're dealing with.
1.1 — Loading and counting the images
Our dataset is organized into two main folders: one with labeled images (subdivided into "normal" and "defect" subfolders), and one with unlabeled images (no subfolders — just a flat collection of .png files).
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
from pathlib import Path
data_dir = Path("data/metal_surfaces")
labeled_dir = data_dir / "labeled"
unlabeled_dir = data_dir / "unlabeled"
We use pathlib.Path rather than string concatenation because it handles path separators correctly on any operating system. Now let's count:
labeled_files = list(labeled_dir.glob("**/*.png"))
unlabeled_files = list(unlabeled_dir.glob("**/*.png"))
print(f"Labeled images: {len(labeled_files)}")
print(f"Unlabeled images: {len(unlabeled_files)}")
print(f"Total: {len(labeled_files) + len(unlabeled_files)}")
label_ratio = len(labeled_files) / (len(labeled_files) + len(unlabeled_files))
print(f"Label ratio: {label_ratio:.1%}")
That label ratio is just 2%: only one in fifty of our images has an expert-verified label. The other 98% is a gold mine we can't afford to ignore — and that's exactly what semi-supervised learning exploits.
1.2 — Scanning for problems: resolution, color, corruption
Next, we need to check every image individually. A single corrupted file can crash an entire training run. Inconsistent resolutions will distort images if you're not careful. And wrong color modes (grayscale when your model expects RGB) will produce nonsense features.
We write a small function that tries to open each image and records its properties:
def get_image_info(filepath):
"""
Try to open an image and record its properties.
If the file is corrupted, Pillow will throw an exception.
"""
try:
img = Image.open(filepath)
return {
"path": str(filepath),
"width": img.size[0],
"height": img.size[1],
"mode": img.mode, # 'RGB', 'L' (grayscale), 'RGBA'
"filesize_kb": os.path.getsize(filepath) / 1024,
"corrupted": False,
}
except Exception:
return {
"path": str(filepath),
"width": None, "height": None,
"mode": None, "filesize_kb": None,
"corrupted": True,
}
The img.mode field is particularly important. 'RGB' means 3 color channels (red, green, blue). 'L' means grayscale (1 channel). 'RGBA' means RGB with an alpha (transparency) channel. Our pretrained model expects RGB, so we'll need to convert everything later.
Now let's scan all 10,000 images:
all_files = labeled_files + unlabeled_files
image_info = [get_image_info(f) for f in all_files]
info_df = pd.DataFrame(image_info)
And examine the results:
print(f"Total images scanned: {len(info_df)}")
print(f"Corrupted images: {info_df['corrupted'].sum()}")
print(f"\nResolution distribution:")
print(info_df[~info_df["corrupted"]][["width", "height"]].describe().round(0))
print(f"\nColor modes: {info_df['mode'].value_counts().to_dict()}")
print(f"File size (KB): min={info_df['filesize_kb'].min():.0f}, "
f"max={info_df['filesize_kb'].max():.0f}, "
f"mean={info_df['filesize_kb'].mean():.0f}")
What to look for in the output:
- Corrupted images > 0: Remove them immediately. Even one bad file can crash your training.
- Different resolutions: if min ≠ max for width or height, images have different sizes. We'll resize them all to 224×224 during preprocessing.
- Multiple color modes: if you see both 'RGB' and 'L', you have a mix of color and grayscale. We'll convert everything to RGB.
- Extreme file sizes: a 1KB file is probably empty or corrupted. A 50MB file might be uncompressed — worth investigating.
Let's clean up the corrupted files:
corrupted_paths = set(info_df[info_df["corrupted"]]["path"].tolist())
if corrupted_paths:
print(f"Removing {len(corrupted_paths)} corrupted images")
labeled_files = [f for f in labeled_files if str(f) not in corrupted_paths]
unlabeled_files = [f for f in unlabeled_files if str(f) not in corrupted_paths]
1.3 — Class distribution: Is our labeled set balanced?
In industrial and medical settings, defects are rare. Your labeled set might be 90% "normal" and 10% "defect." This matters enormously: a lazy model that always predicts "normal" would get 90% accuracy while being completely useless. We need to know the balance upfront so we can compensate for it later.
class_counts = {}
for class_dir in labeled_dir.iterdir():
if class_dir.is_dir():
count = len(list(class_dir.glob("*.png")))
class_counts[class_dir.name] = count
print("Class distribution (labeled set):")
for cls, count in class_counts.items():
pct = count / sum(class_counts.values()) * 100
print(f" {cls}: {count} images ({pct:.1f}%)")
If the imbalance is severe, we'll handle it later using a technique called pos_weight in the loss function — essentially telling the model "missing a defect is 4x worse than a false alarm."
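To make that concrete, here's a minimal sketch of how pos_weight changes the loss. The 160/40 split below is a hypothetical example for illustration, not our actual class counts:

```python
import torch
import torch.nn as nn

# Hypothetical split: 160 "normal" vs 40 "defect" in the labeled set
n_normal, n_defect = 160, 40
pos_weight = torch.tensor([n_normal / n_defect])  # 4.0: each missed defect counts 4x

weighted = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
plain = nn.BCEWithLogitsLoss()

logit = torch.tensor([0.0])    # a maximally uncertain prediction
defect = torch.tensor([1.0])   # true label: defect

# Missing a defect now costs 4x more than with the unweighted loss
print(plain(logit, defect).item())     # ≈ 0.6931
print(weighted(logit, defect).item())  # ≈ 2.7726
```

The model is thus pushed toward higher recall on the rare class, at the cost of more false alarms on the common one.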
Let's also visualize it:
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(class_counts.keys(), class_counts.values(), color=["#2ecc71", "#e74c3c"])
ax.set_title("Class Distribution (Labeled Images)")
ax.set_ylabel("Number of images")
plt.tight_layout()
plt.savefig("outputs/class_distribution.png", dpi=150)
plt.show()
1.4 — Visualizing sample images: always look before you model
This might be the most important step in the whole pipeline. Look at your data. You might discover images that are obviously mislabeled, scanning artifacts (black borders, rotations), quality issues (blur, overexposure), or that defects are visually subtle and your task is harder than you thought.
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
fig.suptitle("Sample Images — Top: Normal | Bottom: Defect", fontsize=14)
for i, class_name in enumerate(["normal", "defect"]):
class_files = list((labeled_dir / class_name).glob("*.png"))[:5]
for j, filepath in enumerate(class_files):
img = Image.open(filepath)
axes[i, j].imshow(img, cmap="gray" if img.mode == "L" else None)
axes[i, j].set_title(class_name, fontsize=10)
axes[i, j].axis("off")
plt.tight_layout()
plt.savefig("outputs/sample_images.png", dpi=150)
plt.show()
Take a moment to study these images. Can you see the defects with your own eyes? If you can't, the model will struggle too. If the defects are obvious (a deep scratch, a large crack), that's encouraging — the model should be able to learn the difference. If they're subtle (a slight discoloration, a hairline crack), you'll need particularly good preprocessing and feature extraction.
Further reading:
- Pillow documentation — the Python imaging library we use for loading images
- NEU Surface Defect Database — a real-world steel surface defect dataset you can practice with
Part 2 — Preprocessing: speaking the model's language
Why we can't just feed raw images to a neural network
A pretrained CNN like ResNet50 was trained on a very specific type of input: 224×224 pixel images, in RGB color, normalized with specific mean and standard deviation values calculated from the ImageNet dataset. If you feed it images of different sizes, or with raw pixel values instead of normalized ones, the features it extracts will be meaningless.
Think of it like language. ResNet50 "speaks ImageNet." If we want it to understand our metal surface images, we need to "translate" them into ImageNet format first. This translation involves four steps:
- Convert to RGB (3 channels)
- Enhance contrast via histogram equalization
- Resize to 224×224
- Normalize pixel values to match ImageNet statistics
2.1 — What is histogram equalization, and why does it matter here?
Industrial images often have very low contrast. The difference between a normal surface and a scratched one might be just a few pixel intensity levels — invisible to the naked eye, and very hard for a model to detect.
Histogram equalization redistributes pixel intensities so that the full range (0 to 255) is used evenly. The result: subtle features "pop out" visually and numerically.
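To see the principle in action, here is a small pure-NumPy sketch of global equalization (CLAHE, discussed next, is a local refinement of this idea). A synthetic low-contrast image gets stretched to the full [0, 255] range:

```python
import numpy as np

def equalize(img):
    """Global histogram equalization for an 8-bit grayscale array."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]  # CDF at the darkest value present in the image
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255)
    return lut.clip(0, 255).astype(np.uint8)[img]

# Synthetic low-contrast image: intensities squeezed into [100, 140]
rng = np.random.default_rng(0)
low = rng.integers(100, 141, size=(64, 64)).astype(np.uint8)
eq = equalize(low)

print(low.min(), low.max())   # narrow range, roughly 100 to 140
print(eq.min(), eq.max())     # 0 255: the full range after equalization
```

The helper above is a toy implementation for intuition; in the pipeline we rely on OpenCV's CLAHE instead.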
We use a more advanced version called CLAHE (Contrast Limited Adaptive Histogram Equalization). Unlike global equalization (which applies the same transformation to the whole image), CLAHE divides the image into small tiles (8×8 by default) and equalizes each tile independently. This preserves local details much better.
Here's the analogy: imagine you're adjusting the brightness on a photo. Global equalization is like using a single brightness slider for the entire image — you might brighten the dark corners but wash out the already-bright center. CLAHE is like adjusting brightness in each region independently, so every part of the image becomes clear.
2.2 — Building a custom PyTorch Dataset
PyTorch organizes data loading around two classes: Dataset (knows how to load one item) and DataLoader (knows how to batch and shuffle items). PyTorch provides a built-in ImageFolder dataset, but it assumes every image has a label. Our dataset has both labeled AND unlabeled images, so we need a custom class.
Let's build it step by step. First, the skeleton:
import torch
import torchvision.transforms as T
from torch.utils.data import Dataset, DataLoader
import cv2
class MetalSurfaceDataset(Dataset):
"""
Custom Dataset that handles both labeled and unlabeled images.
Returns -1 as the label for unlabeled images.
"""
def __init__(self, image_paths, labels=None, transform=None):
self.image_paths = image_paths
self.labels = labels # None for unlabeled images
self.transform = transform
def __len__(self):
return len(self.image_paths)
__len__ tells PyTorch how many images we have. __init__ stores the image paths and (optionally) their labels.
Now the core method, __getitem__, which loads and preprocesses a single image. We'll break it into three stages:
Stage 1 — Load the image and force RGB:
def __getitem__(self, idx):
# Load image and convert to RGB
# .convert("RGB") handles grayscale → RGB conversion automatically
# (it duplicates the single channel into R, G, and B)
img = Image.open(self.image_paths[idx]).convert("RGB")
Why .convert("RGB")? Because ResNet50 expects 3 channels. If our image is grayscale (1 channel), this duplicates the gray values into R, G, and B. If it's already RGB, it does nothing. If it's RGBA (4 channels), it drops the alpha channel.
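A quick standalone check of what .convert("RGB") does to a grayscale image:

```python
import numpy as np
from PIL import Image

gray = Image.new("L", (4, 4), color=128)   # 1-channel grayscale image
rgb = gray.convert("RGB")                  # duplicates the channel into R, G, B

arr = np.array(rgb)
print(arr.shape)                 # (4, 4, 3): three identical channels
print(bool((arr == 128).all()))  # True
```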
Stage 2 — Apply CLAHE histogram equalization:
# Convert PIL Image → numpy array for OpenCV processing
img_np = np.array(img)
# Convert RGB → LAB color space
# L = Lightness (brightness), A and B = color channels
# We only equalize L (brightness) to avoid distorting colors
img_lab = cv2.cvtColor(img_np, cv2.COLOR_RGB2LAB)
# Create CLAHE object and apply to L channel
# clipLimit=2.0 prevents over-amplification of noise
# tileGridSize=(8,8) means 8x8 tiles for local equalization
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
img_lab[:, :, 0] = clahe.apply(img_lab[:, :, 0])
# Convert back LAB → RGB → PIL Image
img_np = cv2.cvtColor(img_lab, cv2.COLOR_LAB2RGB)
img = Image.fromarray(img_np)
Why LAB and not just apply CLAHE directly on RGB? Because if you equalize R, G, and B channels independently, you'll distort the colors — turning blue into green, for instance. By working in LAB space, we only touch the lightness (L) channel and leave the colors (A, B) untouched. This is a standard practice in image processing.
Stage 3 — Apply transforms and return:
# Apply resize + normalize transforms
if self.transform:
img = self.transform(img)
# Return the label (or -1 if this image has no label)
label = self.labels[idx] if self.labels is not None else -1
return img, label
2.3 — The transform pipeline: what each step does
Now we define the sequence of transformations. Each one has a specific purpose:
preprocessing = T.Compose([
T.Resize((224, 224)), # (1) Resize to model's expected input size
T.ToTensor(), # (2) PIL Image → PyTorch Tensor, scale [0,1]
T.Normalize(
mean=[0.485, 0.456, 0.406], # (3) Normalize with ImageNet statistics
std=[0.229, 0.224, 0.225],
),
])
Let's explain each step:
(1) Resize to 224×224. ResNet50's architecture requires exactly this size. If you feed a 300×400 image, the tensor shapes won't match and PyTorch will crash. Resizing may distort aspect ratios slightly, but for texture-based tasks (like defect detection) this is rarely a problem.
(2) ToTensor. Does two things: converts the pixel format from HWC (Height × Width × Channels) to CHW (Channels × Height × Width), which is what PyTorch expects; and scales pixel values from [0, 255] integers to [0.0, 1.0] floats.
(3) Normalize with ImageNet mean and std. These magic numbers — [0.485, 0.456, 0.406] and [0.229, 0.224, 0.225] — are the mean and standard deviation of pixel values across the entire ImageNet dataset, computed channel by channel (R, G, B). ResNet50 was trained with these exact normalization values, so its internal weights "expect" inputs centered around 0 with unit variance. Using different numbers would be like measuring in inches but calculating in centimeters — the math breaks.
2.4 — Creating datasets and DataLoaders
Now we assemble everything. A critical principle: labeled and unlabeled data must be kept strictly separate at all times. Mixing them would contaminate our evaluation.
First, collect labeled image paths and their class labels:
labeled_paths = []
labeled_labels = []
for class_idx, class_name in enumerate(["normal", "defect"]):
class_dir = labeled_dir / class_name
for fp in class_dir.glob("*.png"):
labeled_paths.append(str(fp))
labeled_labels.append(class_idx) # 0 = normal, 1 = defect
Then, collect unlabeled image paths (no labels needed):
unlabeled_paths = [str(fp) for fp in unlabeled_files]
Create the PyTorch Dataset objects:
labeled_dataset = MetalSurfaceDataset(labeled_paths, labeled_labels, preprocessing)
unlabeled_dataset = MetalSurfaceDataset(unlabeled_paths, labels=None, transform=preprocessing)
print(f"Labeled dataset: {len(labeled_dataset)} images")
print(f"Unlabeled dataset: {len(unlabeled_dataset)} images")
Finally, wrap them in DataLoaders. A DataLoader batches images together (batch_size=32 means 32 images at a time) and optionally shuffles them:
labeled_loader = DataLoader(labeled_dataset, batch_size=32, shuffle=False)
unlabeled_loader = DataLoader(unlabeled_dataset, batch_size=32, shuffle=False)
Why shuffle=False here? Because we're about to extract features, and we need the embeddings to stay in the same order as our file lists. We'll shuffle later, during training.
Further reading:
- torchvision transforms — full list
- CLAHE explained (OpenCV tutorial)
- PyTorch Dataset & DataLoader tutorial
Part 3 — Feature extraction: turning images into meaningful numbers
3.1 — Why raw pixels are a terrible representation
A 224×224 RGB image has 150,528 numbers (224 × 224 × 3 channels). Most of them are noise — minor variations in lighting, sensor artifacts, compression artifacts. Worse: two photos of the exact same scratch, taken from slightly different angles or lighting, have completely different pixel values. If we try to cluster or classify raw pixels, images that look the same to us will appear completely different to the algorithm.
What we need is a representation that captures the meaning of the image — "this looks like a scratch," "this is a smooth surface" — in a compact, stable numerical form. That's what embeddings do.
3.2 — What is a pretrained model and why we don't train from scratch
Training a deep neural network from scratch requires a LOT of data — typically hundreds of thousands of images. We have 200 labeled images. If we tried to train ResNet50's 25 million parameters on 200 images, the model would memorize every single training image perfectly but fail completely on new images. This is called overfitting.
Instead, we use a model that was already trained on ImageNet — roughly 1.3 million images across 1,000 categories (dogs, cats, cars, buildings, etc.), the classification subset of the full 14-million-image collection. This model has already learned to recognize fundamental visual features: edges, textures, shapes, color gradients, geometric patterns. These low-level features are universal — they apply to steel surfaces just as well as they apply to cats.
Think of it like hiring an experienced photographer to inspect your factory. They've never seen steel plates before, but they already know how to see: they can spot unusual textures, abrupt changes in surface quality, patterns that break the norm. They just need to learn what counts as "defective" in your specific factory.
3.3 — Loading ResNet50
Let's load the pretrained model:
import torchvision.models as models
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
This single line downloads (the first time) a ResNet50 model whose weights have been trained on ImageNet. The model knows how to classify 1,000 categories of everyday objects.
3.4 — Freezing the parameters
We don't want to modify any of the learned features. We're using ResNet as a read-only tool:
for param in resnet.parameters():
param.requires_grad = False
requires_grad = False tells PyTorch "don't compute gradients for these parameters." This has two benefits: it prevents accidental modification of the pretrained weights, and it makes inference faster (no gradient tracking = less computation and less memory).
3.5 — Removing the classification head
ResNet50's architecture looks like this:
Input image (224×224×3)
↓
[Convolutional layers] — learn visual features
↓
[Average Pooling] — compress spatial dimensions → 2048-dim vector
↓
[Fully Connected layer] — classify into 1000 ImageNet categories
↓
Output (1000 probabilities)
We want the 2048-dimensional vector from the Average Pooling layer — that's our embedding. We don't want the last Fully Connected layer, because it's specific to ImageNet's 1000 classes (dog, cat, airplane...) and useless for our task.
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])
What this line does: resnet.children() returns an iterator over ResNet's top-level layers, and list(...) materializes it. [:-1] keeps every layer except the last one (the FC layer). Sequential(*...) wraps the remaining layers back into a single model.
feature_extractor.eval()
eval() mode disables dropout and uses running statistics for batch normalization. This ensures the model gives deterministic outputs — the same image always produces the same embedding.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
feature_extractor = feature_extractor.to(device)
print(f"Feature extractor ready on {device}")
Using GPU (if available) makes feature extraction roughly 10x faster.
3.6 — The extraction function
Now let's write a function that feeds batches of images through the feature extractor and collects the embeddings. We'll build it piece by piece.
The outer structure:
def extract_embeddings(dataloader, model, device):
"""
Feed all images through the model and collect embeddings.
Returns:
embeddings: numpy array, shape (n_images, 2048)
labels: numpy array, shape (n_images,) — -1 if unlabeled
"""
all_embeddings = []
all_labels = []
We accumulate results in lists because we process images in batches (32 at a time), not all at once (wouldn't fit in GPU memory).
The main loop:
with torch.no_grad():
for batch_images, batch_labels in dataloader:
batch_images = batch_images.to(device)
features = model(batch_images)
torch.no_grad() disables gradient tracking — essential because we're only doing inference, not training. This alone cuts memory usage in half and speeds things up. batch_images.to(device) moves the images to the GPU if we have one.
The output of model(batch_images) has shape (batch_size, 2048, 1, 1) — the last two dimensions are spatial remnants from the average pooling. We need to squeeze them:
features = features.squeeze(-1).squeeze(-1)
# Now shape is (batch_size, 2048) — that's our embedding
Finally, we move the results back to CPU (numpy can't work with GPU tensors) and store them:
all_embeddings.append(features.cpu().numpy())
all_labels.append(batch_labels.numpy())
return np.concatenate(all_embeddings), np.concatenate(all_labels)
np.concatenate glues all the batches together into a single array.
3.7 — Running the extraction
print("Extracting embeddings for labeled images...")
labeled_embeddings, labeled_labels_arr = extract_embeddings(
labeled_loader, feature_extractor, device
)
print(f" Shape: {labeled_embeddings.shape}") # Expected: (200, 2048)
200 images, each represented by a 2048-dimensional vector. That's a 73x compression compared to raw pixels (150,528 → 2,048) — and the compressed representation is far more meaningful.
print("Extracting embeddings for unlabeled images...")
unlabeled_embeddings, _ = extract_embeddings(
unlabeled_loader, feature_extractor, device
)
print(f" Shape: {unlabeled_embeddings.shape}") # Expected: (9800, 2048)
We discard the labels (they're all -1 anyway) with the _ variable.
3.8 — Saving embeddings (don't re-extract every time!)
Feature extraction is the most expensive step — potentially 30+ minutes on a GPU for 10,000 images. Save the results so you never have to redo it:
np.save("data/labeled_embeddings.npy", labeled_embeddings)
np.save("data/labeled_labels.npy", labeled_labels_arr)
np.save("data/unlabeled_embeddings.npy", unlabeled_embeddings)
print("Embeddings saved to disk")
Later, you can reload instantly with np.load("data/labeled_embeddings.npy").
3.9 — Sanity check: are the embeddings reasonable?
Before moving on, a quick verification. Garbage in, garbage out — let's make sure our embeddings are sane:
print(f"Embedding statistics:")
print(f" Mean: {labeled_embeddings.mean():.4f}")
print(f" Std: {labeled_embeddings.std():.4f}")
print(f" Min: {labeled_embeddings.min():.4f}")
print(f" Max: {labeled_embeddings.max():.4f}")
print(f" NaN: {np.isnan(labeled_embeddings).any()}")
print(f" Inf: {np.isinf(labeled_embeddings).any()}")
What to expect: mean around 0.3-0.5, std around 0.5-1.0, no NaN, no Inf. If you see NaN values, a corrupted image probably slipped through the cleaning step. If the mean is exactly 0, something is wrong with the normalization.
Going further:
- Transfer learning explained (PyTorch tutorial)
- ResNet paper (He et al., 2015) — the original architecture that introduced skip connections
- Feature extraction vs fine-tuning — Stanford CS231n course notes
Part 4 — Unsupervised clustering: discovering structure in the dark
4.1 — What clustering does and why we need it
We now have 10,000 embeddings — compact numerical summaries of each image. 200 of them have labels. The other 9,800 don't. Our goal is to find natural groupings in the data that hopefully correspond to "normal" and "defect."
The underlying assumption: if our embeddings are good (and ResNet50 embeddings usually are), images of the same type will be close in embedding space. Normal surfaces will cluster together; defective surfaces will cluster together. Even without labels, the structure is there — clustering reveals it.
But first, a practical problem: 2048 dimensions are impossible to visualize and slow some algorithms down. We need to reduce the dimensionality.
4.2 — Standardize the embeddings
Clustering algorithms, especially K-Means, compute distances between points. If one dimension ranges from 0 to 1000 and another from 0 to 0.01, the first one will completely dominate the distance — as if the second dimension doesn't exist. Standardization (mean=0, std=1 per dimension) puts all features on equal footing.
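A tiny illustration of the problem and the fix: two features on wildly different scales end up with equal influence after standardization. The numbers below are made up purely for demonstration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature in the thousands, one in the thousandths
X = np.array([[1000.0, 0.001],
              [2000.0, 0.002],
              [3000.0, 0.003]])

Xs = StandardScaler().fit_transform(X)
print(np.allclose(Xs.mean(axis=0), 0))  # True: each column centered
print(np.allclose(Xs.std(axis=0), 1))   # True: each column has unit variance
```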
from sklearn.preprocessing import StandardScaler
# Combine labeled + unlabeled for joint standardization
all_embeddings = np.concatenate([labeled_embeddings, unlabeled_embeddings], axis=0)
scaler = StandardScaler()
all_embeddings_scaled = scaler.fit_transform(all_embeddings)
Why standardize labeled and unlabeled together? Because they come from the same distribution (same factory, same camera). Standardizing them jointly ensures consistent scaling.
# Split back — we'll need them separate later
labeled_scaled = all_embeddings_scaled[:len(labeled_embeddings)]
unlabeled_scaled = all_embeddings_scaled[len(labeled_embeddings):]
print(f"After standardization: mean={all_embeddings_scaled.mean():.4f}, "
f"std={all_embeddings_scaled.std():.4f}")
4.3 — Reduce dimensions with PCA
PCA (Principal Component Analysis) finds the directions of maximum variance in the data and projects onto the top directions. We go from 2048 to 50 dimensions:
from sklearn.decomposition import PCA
pca = PCA(n_components=50, random_state=42)
all_pca = pca.fit_transform(all_embeddings_scaled)
print(f"PCA: 2048 → 50 dimensions")
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.1%}")
~95% variance retained means we've discarded only 5% of the information but reduced dimensionality by 40x. This makes t-SNE and DBSCAN much faster and more stable.
Let's also visualize how much each component contributes:
plt.figure(figsize=(10, 4))
plt.plot(np.cumsum(pca.explained_variance_ratio_), marker="o", markersize=3)
plt.xlabel("Number of PCA components")
plt.ylabel("Cumulative explained variance")
plt.title("PCA: How many components do we need?")
plt.axhline(y=0.95, color="r", linestyle="--", label="95% threshold")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("outputs/pca_variance.png", dpi=150)
plt.show()
This "elbow plot" shows where adding more components stops providing significant gains. If the curve flattens before 50 components, you could use even fewer.
4.4 — Visualize with t-SNE
t-SNE is a nonlinear dimensionality reduction technique that's specifically designed for visualization. It preserves local structure: images that are close in high-dimensional space will be close in the 2D plot. This makes it perfect for checking whether normal and defect images naturally separate.
One important caveat: never cluster on t-SNE output. t-SNE distorts global distances — the space between clusters is not meaningful. Use it only for visualization, and cluster on the original (or PCA-reduced) embeddings.
from sklearn.manifold import TSNE
# Apply t-SNE on PCA output (faster and more stable than on raw 2048-dim)
tsne = TSNE(n_components=2, random_state=42, perplexity=30, max_iter=1000)  # "n_iter" in scikit-learn < 1.5
all_tsne = tsne.fit_transform(all_pca)
The perplexity parameter roughly controls the "neighborhood size" — how many nearby points t-SNE considers. 30 is a reasonable default for datasets of our size.
Now let's split the t-SNE coordinates:
labeled_tsne = all_tsne[:len(labeled_embeddings)]
unlabeled_tsne = all_tsne[len(labeled_embeddings):]
And visualize:
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Left plot: labeled images only, colored by true label
for cls_idx, cls_name, color in [(0, "Normal", "#2ecc71"), (1, "Defect", "#e74c3c")]:
mask = labeled_labels_arr == cls_idx
axes[0].scatter(labeled_tsne[mask, 0], labeled_tsne[mask, 1],
label=cls_name, alpha=0.7, s=40, c=color)
axes[0].set_title("t-SNE: Labeled Images (True Labels)")
axes[0].legend()
If you see two distinct clouds in this plot — green on one side, red on the other — that's an excellent sign. It means the ResNet50 embeddings genuinely capture the difference between normal and defective surfaces.
# Right plot: all images (unlabeled in gray, labeled overlaid)
axes[1].scatter(unlabeled_tsne[:, 0], unlabeled_tsne[:, 1],
                c="lightgray", alpha=0.2, s=10, label="Unlabeled")
for cls_idx, cls_name, color in [(0, "Normal", "#2ecc71"), (1, "Defect", "#e74c3c")]:
    mask = labeled_labels_arr == cls_idx
    axes[1].scatter(labeled_tsne[mask, 0], labeled_tsne[mask, 1],
                    label=f"Labeled: {cls_name}", alpha=0.8, s=40, c=color)
axes[1].set_title("t-SNE: All Images (Labeled in Color)")
axes[1].legend()
plt.tight_layout()
plt.savefig("outputs/tsne_visualization.png", dpi=150)
plt.show()
In the right plot, the gray cloud (unlabeled images) should overlap with the colored dots. This confirms that labeled and unlabeled images come from the same distribution — a necessary condition for semi-supervised learning to work.
4.5 — K-Means clustering
K-Means is the simplest and most widely used clustering algorithm. It divides the data into exactly k groups by iteratively assigning each point to the nearest cluster center, then updating the centers.
Since we know we have 2 classes (normal and defect), we start with k=2. But we also test k=3, 4, 5 to check whether the data might have more structure (e.g., different types of defects forming separate clusters).
To evaluate how well the clusters match the real labels, we use the ARI (Adjusted Rand Index). ARI = 1.0 means perfect agreement with the true labels. ARI = 0.0 means random clustering. ARI < 0 means worse than random.
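A quick toy check of these properties (a minimal sketch; the label values are arbitrary). Note that ARI only cares about which points are grouped together, not which numeric ID each group gets — a detail that matters later when we map cluster IDs to class names:

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 0, 1, 1, 1]

# Perfect grouping scores 1.0 even with the cluster IDs swapped --
# ARI is invariant to how the clusters are numbered
print(adjusted_rand_score(true_labels, [1, 1, 1, 0, 0, 0]))  # 1.0

# An alternating assignment carries no information about the true groups
print(adjusted_rand_score(true_labels, [0, 1, 0, 1, 0, 1]))  # close to 0 -- no better than chance
```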
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score
print("K-Means Clustering:")
print(f" {'k':<5s} {'ARI':>8s} {'Silhouette':>12s}")
print(f" {'-'*27}")
for k in [2, 3, 4, 5]:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    all_clusters = kmeans.fit_predict(all_embeddings_scaled)
    # ARI: compare clusters vs true labels (on labeled images only)
    labeled_clusters = all_clusters[:len(labeled_embeddings)]
    ari = adjusted_rand_score(labeled_labels_arr, labeled_clusters)
    # Silhouette: internal quality measure (no labels needed)
    # How well-separated are the clusters? Range: -1 to +1
    sil = silhouette_score(all_embeddings_scaled, all_clusters)
    print(f" {k:<5d} {ari:>8.4f} {sil:>12.4f}")
The n_init=10 parameter means K-Means will run 10 times with different random initializations and keep the best result. This avoids getting stuck in a bad local minimum.
If k=2 gives the highest ARI, that confirms our data has two natural groups that align with normal vs defect.
4.6 — DBSCAN clustering (an alternative approach)
DBSCAN works very differently from K-Means. Instead of specifying the number of clusters, you specify two parameters:
- eps (epsilon): the maximum distance between two points for them to be considered neighbors. Think of it as "how close is close enough?"
- min_samples: the minimum number of points needed to form a dense region (a cluster). Think of it as "how crowded does a neighborhood need to be to count as a cluster?"
DBSCAN automatically determines the number of clusters AND identifies outliers (points that don't belong to any cluster — labeled as -1). This can be useful for finding unusual images that might be mislabeled or anomalous.
from sklearn.cluster import DBSCAN
print("\nDBSCAN Clustering:")
print(f" {'eps':<6s} {'min_s':<7s} {'clusters':>9s} {'noise':>7s} {'ARI':>8s}")
print(f" {'-'*40}")
We need to test multiple parameter combinations because the "right" values depend on the data:
for eps in [3.0, 5.0, 7.0, 10.0]:
    for min_samples in [5, 10, 20]:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        db_clusters = dbscan.fit_predict(all_pca)  # Use PCA-reduced data
        n_clusters = len(set(db_clusters)) - (1 if -1 in db_clusters else 0)
        n_noise = (db_clusters == -1).sum()
        if n_clusters >= 2:
            labeled_db = db_clusters[:len(labeled_embeddings)]
            mask = labeled_db != -1  # Exclude noise points from ARI
            if mask.sum() > 10:
                ari = adjusted_rand_score(labeled_labels_arr[mask], labeled_db[mask])
                print(f" {eps:<6.1f} {min_samples:<7d} {n_clusters:>9d} "
                      f"{n_noise:>7d} {ari:>8.4f}")
Notice that we use PCA-reduced data (all_pca, 50 dims) instead of the full 2048-dim embeddings. DBSCAN struggles in very high dimensions because all distances become similar (the "curse of dimensionality").
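The effect is easy to see on synthetic data (a toy sketch with random Gaussian points; the dimensions and sample count are arbitrary). As dimensionality grows, the gap between the nearest and farthest neighbor shrinks, which leaves DBSCAN's eps without a meaningful scale to work with:

```python
import numpy as np

rng = np.random.default_rng(0)
for dim in [2, 50, 2048]:
    points = rng.standard_normal((500, dim))
    # Distances from the first point to all the others
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    # In high dimensions this ratio collapses toward 1: everything is
    # roughly equidistant, so no single eps separates "near" from "far"
    print(f"dim={dim:>4d}  farthest/nearest ratio: {dists.max() / dists.min():.2f}")
```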
Compare the best ARI from DBSCAN with the best from K-Means, and pick the winner.
4.7 — Visualizing the clusters on the t-SNE plot
Let's see how the best clustering looks on our t-SNE visualization:
best_kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
all_cluster_ids = best_kmeans.fit_predict(all_embeddings_scaled)
fig, ax = plt.subplots(figsize=(8, 6))
scatter = ax.scatter(all_tsne[:, 0], all_tsne[:, 1],
                     c=all_cluster_ids, cmap="coolwarm", alpha=0.4, s=15)
ax.set_title("K-Means Clusters (k=2) on t-SNE")
plt.colorbar(scatter, label="Cluster ID")
plt.tight_layout()
plt.savefig("outputs/kmeans_clusters_tsne.png", dpi=150)
plt.show()
If the two colors in this plot roughly match the two groups you saw in the labeled t-SNE plot, the clustering is working.
4.8 — Assigning pseudo-labels to unlabeled images
Now the critical step: we take the cluster assignments and treat them as "weak labels" for the unlabeled images. But there's a subtlety — K-Means assigns cluster IDs arbitrarily. Cluster 0 might correspond to "defect" or "normal." We need to check.
Extract the pseudo-labels:
unlabeled_pseudo_labels = all_cluster_ids[len(labeled_embeddings):]
Check alignment with real labels:
labeled_cluster_ids = all_cluster_ids[:len(labeled_embeddings)]
# What fraction of labeled images in cluster 0 are actually "normal"?
cluster_0_normal_rate = (labeled_labels_arr[labeled_cluster_ids == 0] == 0).mean()
cluster_1_normal_rate = (labeled_labels_arr[labeled_cluster_ids == 1] == 0).mean()
print(f"Cluster 0: {cluster_0_normal_rate:.1%} of labeled images are 'normal'")
print(f"Cluster 1: {cluster_1_normal_rate:.1%} of labeled images are 'normal'")
If cluster 0 is mostly defects (normal_rate < 50%), we flip the mapping:
if cluster_0_normal_rate < 0.5:
    unlabeled_pseudo_labels = 1 - unlabeled_pseudo_labels
    print("Cluster IDs flipped to match convention (0=normal, 1=defect)")
Let's see the distribution:
print(f"\nPseudo-label distribution:")
print(f" Normal (0): {(unlabeled_pseudo_labels == 0).sum()} images")
print(f" Defect (1): {(unlabeled_pseudo_labels == 1).sum()} images")
We now have two separate datasets with very different characteristics:
- Strongly labeled — 200 images with real expert labels. High quality, small quantity. This is our ground truth.
- Weakly labeled — 9,800 images with cluster-based pseudo-labels. Lower quality (some labels are wrong), but massive quantity.
The golden rule: never mix these two. They serve different purposes in the next step.
Further reading:
- K-Means explained (scikit-learn)
- DBSCAN explained (scikit-learn)
- How to read t-SNE correctly (Distill) — essential reading
Part 5 — Semi-supervised training: the actual experiment
5.1 — The logic behind our two-phase approach
Here's the intuition. Imagine you're training a new quality inspector at the factory:
Phase 1 (pre-training on pseudo-labels): You show them 9,800 photos and say "I think these are normal and these are defective, but I'm not 100% sure." The inspector starts forming a rough mental model. Some of the labels are wrong, but the overall pattern — normal surfaces are smooth and uniform, defective surfaces have irregularities — is mostly correct. After this phase, the inspector has a decent intuition.
Phase 2 (fine-tuning on real labels): You then show them 200 photos that were carefully verified by an expert: "These are DEFINITELY normal, and these are DEFINITELY defective." The inspector refines their mental model — correcting mistakes from phase 1 and sharpening their judgment on edge cases.
The result: an inspector who has seen 10,000 images (building broad intuition) and has been calibrated by 200 expert-verified examples (ensuring precision). We expect this inspector to outperform one who only ever saw the 200 verified examples.
To prove this, we run two parallel experiments:
- Experiment A — Supervised only: train on 200 labeled images only
- Experiment B — Semi-supervised: pre-train on 9,800 pseudo-labeled images, then fine-tune on 200 labeled images
Same model architecture, same test set. The only difference is whether the model sees the unlabeled data or not.
5.2 — Building the classifier: architecture
We use ResNet50 as a backbone again, but this time we replace the final layer with a binary classifier and we train it (unlike Part 3, where we just extracted features).
import torch.nn as nn
class DefectClassifier(nn.Module):
    """
    Binary classifier: Normal (0) vs Defect (1).
    Based on ResNet50 with a custom classification head.
    """
    def __init__(self, dropout_rate=0.5):
        super().__init__()
        self.backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        num_features = self.backbone.fc.in_features  # 2048
Here we replace the original ImageNet classification head (2048 → 1000 classes) with our own:
        self.backbone.fc = nn.Sequential(
            nn.Dropout(p=dropout_rate),  # Anti-overfitting
            nn.Linear(num_features, 1),  # Binary output
        )
    def forward(self, x):
        return self.backbone(x)
Why Dropout(0.5)? With only 200 labeled images and 25 million parameters, overfitting is the main threat. Dropout randomly disables 50% of neurons during each training step, forcing the network to learn redundant representations. At inference time, all neurons are active. This is the single most effective regularization technique for small datasets.
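You can observe this train/eval switch directly (a minimal sketch on a vector of ones; the sizes are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(10_000)

train_out = dropout(x)  # modules start in train mode: ~50% of values zeroed,
                        # survivors scaled by 1/(1-p) = 2 to keep the expected sum
dropout.eval()
eval_out = dropout(x)   # eval mode: dropout becomes the identity

print(f"zeroed in train mode: {(train_out == 0).float().mean():.1%}")  # ~50%
print(f"untouched in eval mode: {torch.equal(eval_out, x)}")           # True
```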
Why Linear(2048, 1) instead of Linear(2048, 2)? For binary classification, a single output neuron with a sigmoid activation is mathematically equivalent to two neurons with softmax, but simpler and slightly more numerically stable.
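That equivalence is easy to verify numerically (a quick sketch; the logit values are arbitrary): σ(z) is exactly the second component of a softmax over the pair (0, z):

```python
import torch

z = torch.tensor([-2.0, 0.0, 1.5, 4.0])  # arbitrary logits
one_neuron = torch.sigmoid(z)

# Two-neuron formulation: fix the "normal" logit at 0, softmax over the pair
pair = torch.stack([torch.zeros_like(z), z], dim=1)
two_neuron = torch.softmax(pair, dim=1)[:, 1]

print(torch.allclose(one_neuron, two_neuron))  # True
```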
5.3 — The loss function: handling class imbalance
Before writing the training loop, let's discuss the loss function. We use BCEWithLogitsLoss (Binary Cross-Entropy with Logits), which combines sigmoid activation with binary cross-entropy in a single, numerically stable operation.
The key addition is pos_weight:
pos_weight = torch.tensor([4.0]).to(device)
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
What does pos_weight=4.0 mean? It tells the loss function: "a missed defect (false negative) should be penalized 4 times more than a false alarm (false positive)." This compensates for class imbalance. Without it, the model could achieve 80% accuracy by always predicting "normal" — which is useless.
The value 4.0 is a rough estimate based on the class ratio. If you have 80% normal / 20% defect, then pos_weight = 80/20 = 4.0. You can tune this value, but 4.0 is a good starting point.
5.4 — The optimizer: AdamW with weight decay
import torch.optim as optim
optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-4)
Why AdamW? It's Adam with proper weight decay (L2 regularization). The weight_decay=1e-4 gently penalizes large weights, which is another layer of protection against overfitting. Think of it as telling the model "prefer simpler explanations."
5.5 — The learning rate scheduler
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3, factor=0.5)
This automatically reduces the learning rate when the validation loss stops improving. patience=3 means "wait 3 epochs without improvement before reducing." factor=0.5 means "multiply the learning rate by 0.5." This is crucial for convergence — as the model approaches a minimum, smaller steps prevent overshooting.
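Here's the mechanic in isolation (a toy sketch with a dummy model and a validation loss that never improves; the first call just records the best value, so the halving kicks in once the plateau outlasts the patience window):

```python
import torch
import torch.optim as optim

model = torch.nn.Linear(2, 1)  # dummy model, just to give the optimizer parameters
optimizer = optim.AdamW(model.parameters(), lr=1e-4)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3, factor=0.5)

for epoch in range(6):
    scheduler.step(1.0)  # feed a flat validation loss -- never improves
    print(f"epoch {epoch}: lr = {optimizer.param_groups[0]['lr']:.1e}")
```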
5.6 — The training loop: one epoch at a time
Now let's build the full training function. We'll go through each part of the loop separately.
The training phase (one epoch — one pass through all training data):
from sklearn.metrics import f1_score
def train_model(model, train_loader, val_loader, epochs, lr, device, phase_name=""):
    """Train the model and track validation F1 score."""
    pos_weight = torch.tensor([4.0]).to(device)
    criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
    optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-4)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3, factor=0.5)
    best_f1 = 0
    for epoch in range(epochs):
        # ---- TRAINING ----
        model.train()  # Enable dropout, update batch norm stats
        for images, labels in train_loader:
            images = images.to(device)
            labels = labels.float().unsqueeze(1).to(device)
            # .float() because BCEWithLogitsLoss expects float targets
            # .unsqueeze(1) adds a dimension: shape (batch,) → (batch, 1)
            optimizer.zero_grad()              # Reset gradients from previous batch
            outputs = model(images)            # Forward pass
            loss = criterion(outputs, labels)  # Compute loss
            loss.backward()                    # Compute gradients (backpropagation)
            optimizer.step()                   # Update weights
Each batch goes through the classic cycle: forward pass → compute loss → backpropagate gradients → update weights → reset gradients for next batch.
The validation phase (after each training epoch):
        # ---- VALIDATION ----
        model.eval()  # Disable dropout, use fixed batch norm stats
        all_preds, all_true = [], []
        val_loss_total = 0
        with torch.no_grad():  # No gradients needed for evaluation
            for images, labels in val_loader:
                images = images.to(device)
                outputs = model(images)
                # Compute validation loss
                val_loss_total += criterion(
                    outputs, labels.float().unsqueeze(1).to(device)
                ).item()
                # Convert raw logits → binary predictions
                # sigmoid maps logits to [0, 1], then threshold at 0.5
                preds = (torch.sigmoid(outputs) >= 0.5).int().cpu().numpy().flatten()
                all_preds.extend(preds)
                all_true.extend(labels.numpy())
Notice the model.eval() call. This is important: it disables dropout (all neurons active) and uses the running statistics for batch normalization instead of batch statistics. Without it, your validation metrics would be noisy and unreliable.
Track metrics and update scheduler:
        val_f1 = f1_score(all_true, all_preds, average="binary")
        scheduler.step(val_loss_total)  # Reduce LR if loss plateaued
        if val_f1 > best_f1:
            best_f1 = val_f1
        if (epoch + 1) % 5 == 0:
            print(f" [{phase_name}] Epoch {epoch+1}/{epochs}: val_f1={val_f1:.4f}")
    print(f" [{phase_name}] Best F1: {best_f1:.4f}")
    return best_f1
We track the best F1 across all epochs, not just the last one. Models often peak before the end of training (after which they may overfit slightly).
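One practical refinement not shown in train_model above (a hedged sketch with a toy stand-in model and a hard-coded F1 trajectory): snapshot the weights whenever the validation F1 improves, so you can restore the peak model instead of evaluating the final, possibly overfitted, epoch.

```python
import copy
import torch.nn as nn

model = nn.Linear(8, 1)  # stand-in for DefectClassifier
best_f1, best_state = 0.0, None

for val_f1 in [0.62, 0.71, 0.69, 0.75, 0.73]:  # toy validation-F1 trajectory
    # ... one epoch of training + validation would run here ...
    if val_f1 > best_f1:
        best_f1 = val_f1
        best_state = copy.deepcopy(model.state_dict())  # snapshot the peak epoch

model.load_state_dict(best_state)  # restore the best weights before final evaluation
print(f"best F1 seen: {best_f1:.2f}")  # 0.75
```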
5.7 — Preparing the data splits
We split the labeled data into train (70%) and test (30%). The test set is sacred: it is never used for training in either experiment. Stratification ensures the class ratio is preserved in both splits.
from sklearn.model_selection import train_test_split
labeled_train_idx, labeled_test_idx = train_test_split(
    range(len(labeled_paths)),
    test_size=0.3,
    random_state=42,
    stratify=labeled_labels,
)
Create the three datasets we need:
# 1. Labeled training data (for supervised training + fine-tuning)
train_labeled_ds = MetalSurfaceDataset(
    [labeled_paths[i] for i in labeled_train_idx],
    [labeled_labels[i] for i in labeled_train_idx],
    preprocessing,
)
# 2. Test data (for evaluation only — NEVER used for training)
test_ds = MetalSurfaceDataset(
    [labeled_paths[i] for i in labeled_test_idx],
    [labeled_labels[i] for i in labeled_test_idx],
    preprocessing,
)
# 3. Weakly labeled data (pseudo-labels from clustering)
weakly_labeled_ds = MetalSurfaceDataset(
    unlabeled_paths,
    unlabeled_pseudo_labels.tolist(),
    preprocessing,
)
And the DataLoaders:
train_labeled_loader = DataLoader(train_labeled_ds, batch_size=16, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=16, shuffle=False)
weakly_labeled_loader = DataLoader(weakly_labeled_ds, batch_size=32, shuffle=True)
print(f"Train (labeled): {len(train_labeled_ds)} images")
print(f"Test: {len(test_ds)} images")
print(f"Weakly labeled: {len(weakly_labeled_ds)} images")
Note the different batch sizes: 16 for the small labeled set (fewer images per epoch), 32 for the large weakly labeled set (faster processing). Both use shuffle=True for training to prevent the model from learning the order of samples.
5.8 — Experiment A: supervised only (baseline)
This is the simpler experiment. We train a fresh model using ONLY the 140 labeled training images (the other 60 are reserved for testing). This is the performance we'd get without semi-supervised learning.
print("=" * 60)
print("EXPERIMENT A: SUPERVISED ONLY (140 labeled images)")
print("=" * 60)
model_supervised = DefectClassifier(dropout_rate=0.5).to(device)
f1_supervised = train_model(
    model_supervised, train_labeled_loader, test_loader,
    epochs=30, lr=1e-4, device=device, phase_name="Supervised"
)
5.9 — Experiment B: semi-supervised (the two-phase approach)
Now the full pipeline. Phase 1 gives the model broad intuition from 9,800 pseudo-labeled images. Phase 2 sharpens it with 140 real labels.
Phase 1 — Pre-training on weakly labeled data:
print("\n" + "=" * 60)
print("EXPERIMENT B: SEMI-SUPERVISED")
print("=" * 60)
model_semi = DefectClassifier(dropout_rate=0.5).to(device)
print("\nPhase 1: Pre-train on pseudo-labeled data (9,800 images)...")
train_model(
    model_semi, weakly_labeled_loader, test_loader,
    epochs=10, lr=1e-4, device=device, phase_name="Pre-train"
)
We only train for 10 epochs here because the pseudo-labels are noisy. Training too long on noisy labels would reinforce the mistakes.
Phase 2 — Fine-tuning on strongly labeled data:
print("\nPhase 2: Fine-tune on real labeled data (140 images)...")
f1_semi = train_model(
    model_semi, train_labeled_loader, test_loader,
    epochs=20, lr=5e-5, device=device, phase_name="Fine-tune"
)
Notice the lower learning rate (5e-5 vs 1e-4 in phase 1). This is deliberate and critical. If we use a high learning rate during fine-tuning, the model would quickly "forget" everything it learned during pre-training — the gradients would be too large and would overwrite the pre-trained weights. A gentle learning rate lets the model make small corrections to its existing knowledge, keeping the broad patterns from phase 1 while fixing the errors with real labels.
This is analogous to the factory inspector: you don't start their training from scratch in phase 2. You gently correct their misconceptions while preserving their overall intuition.
5.10 — Final evaluation: the moment of truth
Now we evaluate both models on the same test set with multiple metrics.
First, the evaluation function:
from sklearn.metrics import roc_auc_score, classification_report
def full_evaluation(model, test_loader, device, name):
    """
    Evaluate on the test set.
    Returns F1 score and AUC-ROC.
    """
    model.eval()
    all_preds, all_probs, all_true = [], [], []
    with torch.no_grad():
        for images, labels in test_loader:
            outputs = model(images.to(device))
            probs = torch.sigmoid(outputs).cpu().numpy().flatten()
            all_probs.extend(probs)
            all_preds.extend((probs >= 0.5).astype(int))
            all_true.extend(labels.numpy())
We collect both probabilities (for AUC-ROC, which measures ranking quality) and binary predictions (for F1, which measures classification quality at the 0.5 threshold):
    f1 = f1_score(all_true, all_preds, average="binary")
    auc = roc_auc_score(all_true, all_probs)
    print(f"\n{name}:")
    print(f" F1 Score: {f1:.4f}")
    print(f" AUC-ROC: {auc:.4f}")
    print(classification_report(
        all_true, all_preds, target_names=["Normal", "Defect"]
    ))
    return f1, auc
Why F1 and not accuracy? Because with imbalanced classes, accuracy is misleading. A model that always predicts "normal" gets 80% accuracy but 0% recall on defects. F1 is the harmonic mean of precision and recall — it punishes models that ignore the minority class.
Why AUC-ROC? It measures how well the model ranks images (defective images should get higher probabilities than normal ones), regardless of the classification threshold. An AUC of 1.0 means perfect ranking; 0.5 means random.
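A toy illustration of the accuracy trap (a sketch assuming the 80/20 imbalance used throughout this article):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 80 + [1] * 20)  # 80% normal, 20% defect
y_lazy = np.zeros(100, dtype=int)       # a "model" that always predicts normal

print(f"accuracy: {accuracy_score(y_true, y_lazy):.2f}")             # 0.80 -- looks fine
print(f"F1:       {f1_score(y_true, y_lazy, zero_division=0):.2f}")  # 0.00 -- catches it
```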
Now the comparison:
f1_sup, auc_sup = full_evaluation(
    model_supervised, test_loader, device, "SUPERVISED ONLY"
)
f1_semi, auc_semi = full_evaluation(
    model_semi, test_loader, device, "SEMI-SUPERVISED"
)
And the final verdict:
print("=" * 60)
print("FINAL COMPARISON")
print("=" * 60)
print(f" {'Metric':<12s} {'Supervised':>12s} {'Semi-supervised':>16s} {'Delta':>8s}")
print(f" {'-'*50}")
print(f" {'F1':<12s} {f1_sup:>12.4f} {f1_semi:>16.4f} {f1_semi - f1_sup:>+8.4f}")
print(f" {'AUC-ROC':<12s} {auc_sup:>12.4f} {auc_semi:>16.4f} {auc_semi - auc_sup:>+8.4f}")
If the Delta column shows positive numbers, we've proven that the unlabeled data was useful. The pseudo-labels, despite being imperfect, gave the model a head start that pure supervision on 200 images couldn't match.
5.11 — Interpreting the results
Here's how to read the comparison:
- F1 improved by +0.05 or more: clear win for semi-supervised. The unlabeled data provided meaningful signal.
- F1 improved by +0.01 to +0.04: modest improvement. Semi-supervised helps but the margin is small. Consider improving the clustering quality or using more sophisticated pseudo-labeling.
- F1 unchanged or worse: the pseudo-labels were too noisy to help, or the clustering didn't capture the true structure. Try different feature extractors, different clustering algorithms, or higher confidence thresholds for pseudo-labels.
Further reading:
- Pseudo-labelling paper (Lee, 2013) — the original approach
- Semi-supervised learning survey (van Engelen & Hoos) — comprehensive overview of methods
- PyTorch training loop best practices
Part 6 — Scaling to millions of images: a realistic roadmap
The question from the business
"Your proof of concept works on 10,000 images. We have 4 million images to process. Can we scale this pipeline with a budget of €5,000?"
This is a question you'll face in any real project. Let's break it down honestly.
Computation costs
Feature extraction is the bottleneck. On our 10,000 images with a single GPU, it took about 30 minutes. Scaling linearly:
4,000,000 images ÷ 10,000 images × 30 min = 12,000 min = 200 GPU-hours
At ~€2/hour for a cloud GPU instance (Azure NC-series with a T4 or A10 GPU), that's about €400.
Clustering also needs adaptation. Standard K-Means loads all data into memory to compute distances. With 4M embeddings of 2048 dimensions (each a 4-byte float):
4,000,000 × 2,048 × 4 bytes = ~32 GB just for the embeddings
That won't fit in RAM on most machines. Solution: use MiniBatchKMeans from scikit-learn, which processes the data in chunks of (say) 10,000 samples at a time. Nearly the same result, at a fraction of the memory.
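A sketch of the swap (synthetic chunks stand in for the real embeddings; in production each chunk would be read from disk, e.g. a memory-mapped array, so only one chunk is ever resident):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(42)
mbk = MiniBatchKMeans(n_clusters=2, random_state=42, batch_size=10_000)

for _ in range(20):                            # 20 chunks of 10,000 = 200k points
    chunk = rng.standard_normal((10_000, 50))  # stand-in for a PCA-reduced chunk
    mbk.partial_fit(chunk)                     # update the centers incrementally

print(mbk.cluster_centers_.shape)  # (2, 50)
```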
CNN training: pre-training on 4M pseudo-labeled images takes about 50 GPU-hours → €100.
Storage costs
Raw images: 4M × ~50 KB average = 200 GB. Embeddings: 4M × 2048 × 4 bytes = 32 GB. On Azure Blob Storage at ~€0.02/GB/month, that's about €5/month.
Labeling strategy
If 200 labels aren't enough at scale, we could label more. At ~€1 per image (including quality control), 2,000 more labels would cost €2,000. But there's a smarter approach: active learning.
Active learning lets the model choose which images to label. Instead of randomly selecting 2,000 images, the model identifies the ones it's most uncertain about — the images that would teach it the most. This typically requires 3-4x fewer labels to achieve the same performance improvement.
With active learning, we might only need 500 additional labels instead of 2,000 → €500.
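The core selection step is simple (a sketch where random numbers stand in for the model's sigmoid outputs on the unlabeled pool): rank images by how close their predicted probability is to the 0.5 decision boundary, and send the closest ones to the annotators first.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = rng.random(10_000)                # stand-in for sigmoid outputs on 10k unlabeled images

uncertainty = np.abs(probs - 0.5)         # 0 = maximally uncertain, 0.5 = fully confident
to_label = np.argsort(uncertainty)[:500]  # the 500 images nearest the decision boundary

print(f"selected {len(to_label)} images, all with p in "
      f"[{probs[to_label].min():.3f}, {probs[to_label].max():.3f}]")
```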
Total budget estimate
Feature extraction (GPU): €400
CNN training (GPU): €100
Storage (year 1): €60
Additional labeling: €500 – €2,000
──────────────────────────────────────
TOTAL: €1,060 – €2,560
Well within the €5,000 budget, with room to spare for experimentation and re-runs.
Five conditions for success
- Use cloud GPUs, not local hardware. Rent by the hour, pay only for what you use.
- Use MiniBatchKMeans instead of regular KMeans. Comparable quality at a fraction of the memory.
- Build a proper data pipeline with batch processing. Never load 4M images into RAM at once. Use PyTorch DataLoader with num_workers > 0 for parallel loading.
- Consider active learning to maximize the value of every human-labeled image. Each label should be selected strategically, not randomly.
- Store and version embeddings, not just raw images. Re-extracting 4M embeddings costs €400; loading saved embeddings costs nothing.
Further reading:
- Active learning overview — how to label smarter, not more
Conclusion
Semi-supervised learning isn't magic — it's engineering. You take the structure hidden in unlabeled data (via embeddings and clustering), turn it into approximate labels, and use those to give your supervised model a head start. The unlabeled data doesn't replace real labels — it supplements them.
Let's recap the full pipeline we built:
- Exploration — We scanned 10,000 images for corruption, inconsistent formats, and class imbalance. We looked at the data with our own eyes.
- Preprocessing — We standardized every image into the format ResNet50 expects: 224×224, RGB, CLAHE-enhanced, ImageNet-normalized.
- Feature extraction — We used a pretrained ResNet50 to convert each image into a 2048-dimensional embedding that captures its visual essence.
- Clustering — We applied K-Means and DBSCAN to group unlabeled images into clusters, then assigned pseudo-labels based on cluster membership.
- Semi-supervised training — We pre-trained a CNN on 9,800 pseudo-labeled images, then fine-tuned on 200 real labels, and compared against a supervised-only baseline.
- Scaling analysis — We estimated compute, storage, and labeling costs for 4M images, confirming feasibility within a €5,000 budget.
The key takeaways:
A pretrained CNN can extract meaningful features from any image domain, even one it was never trained on. Clustering on embeddings reveals natural groupings that often correspond to real classes. Pseudo-labels are imperfect, but a model pre-trained on imperfect labels and then fine-tuned on real labels outperforms a model trained only on real labels. And the semi-supervised approach is most valuable precisely when labels are scarce and expensive — which is the situation you'll face in most real-world projects.
The pattern works across domains: medical imaging, industrial quality control, satellite imagery, document classification, and biodiversity monitoring. Anywhere labels are expensive and unlabeled data is abundant — which, in 2025, is nearly everywhere.