Caffe PoseNet (fork) · Udacity

Forked the Caffe implementation of PoseNet (regress 6-DOF camera pose directly from a single RGB image). Pre-PyTorch era of CV research.

What it was

Fork of alexgkendall/caffe-posenet — the original Caffe implementation of Kendall et al.'s PoseNet (ICCV 2015), one of the first end-to-end neural approaches to camera relocalization. Forked during a Robot Perception course to run it on a custom dataset.

What PoseNet does

Single RGB image in, 6-DOF camera pose (3D position + quaternion orientation) out. The architecture is a modified GoogLeNet with the classification head swapped for two regression heads (position + quaternion). Trained per-scene on labeled image→pose pairs.
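As a minimal sketch of that output format, assume the network emits a flat 7-vector laid out as [x, y, z, qw, qx, qy, qz]. The ordering and the helper name split_pose are illustrative assumptions, not something read from the repository. A regression head does not constrain the quaternion to unit length, so it is typically renormalized before being used as a rotation:

```python
import math

def split_pose(output):
    """Split a 7-element prediction into a position and a unit quaternion.

    Assumes the layout [x, y, z, qw, qx, qy, qz]; this ordering is an
    illustrative assumption, not taken from the caffe-posenet code.
    """
    if len(output) != 7:
        raise ValueError("expected 7 values: 3 position + 4 quaternion")
    position = tuple(output[:3])
    quat = output[3:]
    norm = math.sqrt(sum(c * c for c in quat))
    if norm == 0.0:
        raise ValueError("zero-length quaternion cannot be normalized")
    # The regressed quaternion is not guaranteed to be unit length,
    # so rescale it before treating it as a rotation.
    unit_quat = tuple(c / norm for c in quat)
    return position, unit_quat

# Example: a raw prediction whose quaternion has length 2.
position, unit_quat = split_pose([1.0, 2.0, 0.5, 2.0, 0.0, 0.0, 0.0])
```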

Loss is a weighted sum of position L2 error and quaternion L2 error (a Euclidean distance between quaternions, not a geodesic angle):

L = ||x̂ - x||₂ + β · ||q̂ - q||₂

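β balances two terms that live on different scales (position error in scene units, quaternion error unitless and bounded), so tuning it is the per-scene fiddly bit.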
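The loss can be sketched for a single example in plain Python. This is an illustration under stated assumptions rather than the repository's Caffe layers: the function names l2 and posenet_loss are hypothetical, the ground-truth quaternion is assumed to be unit length already, and the β value in the example is arbitrary.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def posenet_loss(pos_pred, quat_pred, pos_true, quat_true, beta):
    """Position L2 error plus beta-weighted quaternion L2 error.

    Assumes quat_true is already a unit quaternion. beta is the
    per-scene weight that trades position error against
    orientation error.
    """
    position_term = l2(pos_pred, pos_true)
    orientation_term = l2(quat_pred, quat_true)
    return position_term + beta * orientation_term

# Example: 5 units of position error, quaternion distance sqrt(2).
loss = posenet_loss(
    pos_pred=(3.0, 4.0, 0.0),
    quat_pred=(1.0, 0.0, 0.0, 0.0),
    pos_true=(0.0, 0.0, 0.0),
    quat_true=(0.0, 1.0, 0.0, 0.0),
    beta=2.0,  # arbitrary value for illustration
)
```

One caveat with this form: q and -q encode the same rotation, but the Euclidean term treats them as different targets, so ground-truth quaternions usually need a consistent sign convention.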

What was actually tricky

Caffe was already on its way out by late 2019. The framework's prototxt+caffemodel format was opaque; debugging anything meant reading the network definition from train_val.prototxt by hand.
GPU memory. GoogLeNet on a 1050 Ti was tight; batch size 16 was the upper bound before OOM.
The dataset bias. PoseNet learns the appearance of a scene more than its geometry. A change in lighting (day → dusk) wrecks the model. The paper acknowledged this; the fix is data augmentation + multi-time-of-day captures.
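For the augmentation half of that fix, a photometric jitter is the simplest place to start, because it changes appearance while leaving the pose label valid (a geometric change such as a horizontal flip would not). The sketch below works on a flat list of 8-bit intensities. The function names and the gain and gamma ranges are assumptions chosen for illustration, not values from the paper or the repository, and a global gain/gamma change only roughly approximates a real day-to-dusk shift.

```python
import random

def jitter_brightness(pixels, gain, gamma):
    """Apply a gain and gamma curve to intensities in [0, 255]."""
    out = []
    for v in pixels:
        normalized = min(max(v / 255.0, 0.0), 1.0)
        adjusted = gain * (normalized ** gamma)
        # Clamp back into the valid range before rescaling to 8-bit.
        out.append(round(255.0 * min(max(adjusted, 0.0), 1.0)))
    return out

def random_photometric(pixels, rng):
    """Draw a random gain and gamma, then apply them to every pixel.

    The sampling ranges are illustrative assumptions.
    """
    gain = rng.uniform(0.6, 1.2)
    gamma = rng.uniform(0.7, 1.5)
    return jitter_brightness(pixels, gain, gamma)

# Example: the camera pose paired with these pixels stays the same.
augmented = random_photometric([0, 64, 128, 255], random.Random(0))
```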

What I'd do differently with hindsight

Use a modern relocalization stack. A retrieval-then-PnP pipeline such as HLoc (Hierarchical Localization), or a map-relative regressor such as Marepo, blows PoseNet-style absolute pose regression out of the water. PoseNet was the first answer, not the right one.
PyTorch port + ONNX. Caffe is dead; the cost of carrying it on any new project isn't justified.
For learning, the value is the loss function. The decoupled position-orientation loss with a tunable weight β is a really clean teaching example for multi-task regression.

What it taught me

The framework matters less than the math. Caffe was a hassle but the PoseNet idea (regress pose end-to-end from raw pixels) was a clean formulation that turned out to be a research dead-end — and that's a useful lesson too. Many of the "first" approaches to a problem age poorly; the abstractions that age well are the loss functions and the datasets, not the architectures.