Project 4: Image Mosaics
In this project, I describe the process of shooting different images and stitching them together to create a panorama.
The project was done in two parts. The first part is about how to stitch images together given a set of corresponding keypoints between two images. The second part is about how to actually find these corresponding keypoints automatically, enabling automatic mosaics.
Part 1: Image Warping and Mosaicing
Shoot the Pictures
The first step, of course, is to shoot some photos. The most common way is to fix the center of projection (COP) and rotate the camera while capturing photos. We choose one image as the center image and warp all other images onto its perspective. Here are some of the photos I took:
Images from the Valley Life Sciences Library dinosaurs.
Images of the “Osborne” T. rex in the Valley Life Sciences Library:
Images of the UC Berkeley library:
Selecting Keypoints
After capturing your images, select corresponding keypoints between each pair of images so that you know how to align one to another (we’ll automate this process in Part 2).
Recovering Homographies
To begin warping an image onto another image’s perspective, we first have to align each image to the center image. We do this by recovering a homography matrix that maps one image’s perspective onto another’s. Given corresponding points $(x, y)$ in the first image and $(x', y')$ in the second, we seek the transformation $H$ such that

$$w \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = H \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$

where $H$ is a $3 \times 3$ homography matrix, and $w$ is a scalar accounting for homogeneous coordinates. Expanding $H$ as:

$$H = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & 1 \end{bmatrix}$$

we derive the following equations:

$$wx' = ax + by + c, \qquad wy' = dx + ey + f, \qquad w = gx + hy + 1$$

By eliminating $w$, we express the equations as:

$$x' = \frac{ax + by + c}{gx + hy + 1}, \qquad y' = \frac{dx + ey + f}{gx + hy + 1}$$

These equations can be rewritten as a linear system:

$$ax + by + c - x'(gx + hy) = x', \qquad dx + ey + f - y'(gx + hy) = y'$$

Given multiple point correspondences, we can stack these equations into a larger linear system $A\mathbf{h} = \mathbf{b}$, where $A$ is a matrix of point coordinates, $\mathbf{b}$ collects the target coordinates, and $\mathbf{h}$ is the vector of homography parameters $(a, b, c, d, e, f, g, h)$. We solve for $\mathbf{h}$ using least squares, and reshape it into the $3 \times 3$ matrix $H$, with the bottom-right value set to 1.
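This stacking-and-least-squares step can be sketched as follows (the function name `compute_homography` and the exact code are illustrative, not the project's actual implementation):

```python
import numpy as np

def compute_homography(src, dst):
    """Estimate H mapping src points onto dst points via least squares.

    src, dst: (N, 2) arrays of corresponding (x, y) points, N >= 4.
    """
    A, b = [], []
    for (x, y), (xp, yp) in zip(src, dst):
        # a*x + b*y + c - x'*(g*x + h*y) = x'
        A.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp])
        b.append(xp)
        # d*x + e*y + f - y'*(g*x + h*y) = y'
        A.append([0, 0, 0, x, y, 1, -x * yp, -y * yp])
        b.append(yp)
    h, *_ = np.linalg.lstsq(np.array(A, float), np.array(b, float), rcond=None)
    return np.append(h, 1.0).reshape(3, 3)  # bottom-right entry fixed to 1
```

With exactly four correspondences the system is determined and least squares recovers the homography exactly; with more points it finds the best fit in the squared-error sense.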
Warping Images
The goal of image warping is to transform the input image based on the computed homography, aligning it with a common reference frame or another image. In the `warp_image` function, the image is warped using the given homography matrix $H$ to a specified `output_shape`.

The process begins by determining the bounding box for the entire mosaic and specifying an `output_shape`, which defines the target size for the warped image. A grid of target pixel locations is then created to cover this shape. Using inverse warping, each point in the target image is mapped back to its corresponding location in the source image under $H^{-1}$. Finally, a validity mask is created to ensure that only pixels within the bounds of the original image contribute to the warped result. Bilinear interpolation is applied to smoothly handle non-integer mappings of source coordinates, and the image is then ready to be blended into the mosaic.
Blending Images into a Mosaic
Once each image is warped onto the common reference frame, the next step is to blend them into a single, cohesive panorama. Simply overlaying images would produce visible seams and sharp transitions in overlapping areas, so we apply a blending technique to achieve smooth transitions.
Using OpenCV’s `cv2.distanceTransform`, we compute a weight map for each image’s mask. The distance transform measures each pixel’s distance to the nearest edge of the mask, so it assigns higher weights to pixels near the image center and gradually lowers them toward the edges. By applying these weight maps to each image, overlapping areas are smoothly blended, minimizing visible seams.
The process involves creating a large canvas (mosaic) to accommodate all warped images. Each warped image is added to this canvas with its respective weight map.
Image Rectification
Image rectification was used as a test case to verify the functionality of the `warp_image` function. The objective was to transform a known planar object in an image, such as a book or poster, into a perfect rectangle using a homography. By defining corresponding points between the corners of the object and a rectangular target frame, we computed a homography matrix to perform this transformation.
This was the result:
Part 2: Auto-stitching
In Part 1, we manually selected feature correspondences. Here, we implement an automated approach for stitching images based on Brown et al.'s “Multi-Image Matching using Multi-Scale Oriented Patches” paper, with some simplifications. Here is the pipeline we used to auto-stitch images together:
Detect Corner Features:
To find potential feature points in each image, we use the Harris detector, which captures areas of significant intensity variation (corners).
Below is an example of the Harris interest points for the Dinosaur Image:
Select Robust Keypoints with ANMS:
As seen from the example above, there could be many potential feature points given by the Harris detector. To reduce this number of points, we use Adaptive Non-Maximal Suppression (ANMS) to select the most distinctive points. This selection ensures that keypoints are well-distributed across the image.
- ANMS works by calculating a suppression radius for each point: the distance to the nearest point that is significantly stronger. The radius calculation therefore considers both corner strength and spatial distribution.
- We keep the top `ANMS_POINTS` points with the largest radii, which gives a robust, well-spread set of keypoints.
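The suppression-radius computation described above might look like this (an O(N²) sketch; the robustness constant `c_robust` follows the MOPS paper's idea, and names are illustrative):

```python
import numpy as np

def anms(coords, strengths, n_points, c_robust=0.9):
    """Adaptive Non-Maximal Suppression.

    coords: (N, 2) point locations; strengths: (N,) corner responses.
    Each point's radius is its distance to the nearest point whose
    robustified strength dominates it; keep the n_points largest radii.
    """
    n = len(coords)
    radii = np.full(n, np.inf)
    for i in range(n):
        # points strong enough to suppress point i
        stronger = strengths > strengths[i] / c_robust
        if stronger.any():
            d = np.linalg.norm(coords[stronger] - coords[i], axis=1)
            radii[i] = d.min()
    keep = np.argsort(-radii)[:n_points]
    return coords[keep]
```

The globally strongest points get infinite radii and are always kept; weak points crowded next to stronger ones get small radii and are suppressed first.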
Extract Feature Descriptors:
For each keypoint selected by ANMS, we generate a feature descriptor that captures the local image structure:
- A 40x40 window is cropped around each keypoint, and Gaussian smoothing is applied to reduce high-frequency noise.
- The 40x40 window is downsampled to an 8x8 patch, with pixels sampled every `DESCRIPTOR_SAMPLE_SPACING` pixels, creating a compact representation of the local area.
- This patch is then normalized (mean 0, unit variance) to make it invariant to brightness and contrast differences between images.
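The three descriptor steps above can be sketched as follows (assuming `DESCRIPTOR_SAMPLE_SPACING = 5` so that 40/5 gives 8 samples per axis; the smoothing sigma is an illustrative choice):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

DESCRIPTOR_SAMPLE_SPACING = 5  # 40 / 5 = 8 samples per axis

def extract_descriptor(gray, row, col):
    """8x8 MOPS-style descriptor around a keypoint (row, col).

    Blur a 40x40 window, subsample every DESCRIPTOR_SAMPLE_SPACING
    pixels, then normalize to zero mean and unit variance.
    """
    window = gray[row - 20:row + 20, col - 20:col + 20]
    window = gaussian_filter(window, sigma=2)  # suppress high-frequency noise
    patch = window[::DESCRIPTOR_SAMPLE_SPACING, ::DESCRIPTOR_SAMPLE_SPACING]
    patch = patch - patch.mean()               # bias invariance
    return (patch / (patch.std() + 1e-8)).ravel()  # gain invariance, length 64
```

Keypoints closer than 20 pixels to the image border would need special handling (or can simply be discarded before this step).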
Match Feature Descriptors:
We compare descriptors between images by measuring Euclidean distance, identifying pairs of matching points between images:
- For each feature descriptor in one image, we find its closest match in the other image by comparing distances.
- A modified ratio test is applied to ensure matches are distinctive. For each descriptor, we compute the ratio of the distance to the closest match vs. the average distance to the next closest `MATCH_NEIGHBORS` descriptors. If the ratio is below `MATCH_RATIO_THRESHOLD`, the match is accepted.
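The matching procedure above might be sketched like this (the constants shown are illustrative values, not the project's tuned parameters):

```python
import numpy as np

MATCH_NEIGHBORS = 3          # runners-up averaged in the modified ratio test
MATCH_RATIO_THRESHOLD = 0.6  # illustrative acceptance threshold

def match_descriptors(desc1, desc2):
    """Return (i, j) index pairs of accepted matches between two
    descriptor arrays of shape (N1, D) and (N2, D)."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)   # Euclidean distances
        order = np.argsort(dists)
        best = dists[order[0]]
        # average distance to the next MATCH_NEIGHBORS closest descriptors
        runners_up = dists[order[1:1 + MATCH_NEIGHBORS]].mean()
        if best / (runners_up + 1e-8) < MATCH_RATIO_THRESHOLD:
            matches.append((i, order[0]))
    return matches
```

The intuition (due to Lowe's ratio test) is that a correct match should be much closer than any alternative; ambiguous descriptors fail the ratio and are dropped.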
Estimate Homography with RANSAC:
Using matched points, we compute a robust homography matrix with RANSAC:
- In each RANSAC iteration, four randomly selected matches estimate a candidate homography matrix.
- We transform each matched point pair using the homography and calculate the Euclidean distance to its corresponding match. Points with distances below a threshold (`RANSAC_THRESHOLD`) are considered inliers.
- The homography with the most inliers is chosen as the best fit. We then recompute the homography using all inliers to obtain a more stable transformation.
Blend and Stitch Images:
Using the estimated homography, we apply the same warping and blending process from the previous part to combine the images.
Results
The auto-stitching pipeline allows us to automatically create image mosaics. Below are examples of auto-stitched panoramas:
Coolest Thing I Learned From This Project
The coolest thing I learned from this project is how to compute homographies and warp images onto another image's perspective. This means we can trick the viewer into thinking the camera was somewhere other than where it actually was, as seen in the rectified images.