What it was
Fork of alexgkendall/caffe-posenet — the original Caffe implementation of Kendall et al.’s PoseNet (ICCV 2015), one of the first end-to-end neural approaches to camera relocalization. Forked during a Robot Perception course to run it on a custom dataset.
What PoseNet does
Single RGB image in, 6-DOF camera pose (3D position + quaternion orientation) out. The architecture is a modified GoogLeNet with the classification head swapped for two regression heads (position + quaternion). Trained per-scene on labeled image→pose pairs.
Loss is a weighted sum of two L2 terms, position error plus β times quaternion error:
L = ||x̂ - x||₂ + β · ||q̂ - q||₂
Tuning β is the per-scene fiddly bit.
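To make the role of β concrete, here is a minimal sketch of that loss for a single sample in plain Python. The function names are hypothetical, and normalizing the ground-truth quaternion to unit length is an assumption on my part (it is a no-op if the labels are already unit quaternions); this is not the Caffe layer from the repo.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def normalize(q):
    """Scale a quaternion to unit length (assumes q is not all zeros)."""
    n = math.sqrt(sum(c * c for c in q))
    return [c / n for c in q]


def posenet_loss(x_pred, x_true, q_pred, q_true, beta):
    """L = ||x_pred - x_true||_2 + beta * ||q_pred - q_true||_2.

    x_* are 3-vectors (position), q_* are 4-vectors (quaternion).
    The ground-truth quaternion is normalized first (assumption);
    the prediction is used as-is.
    """
    position_term = l2(x_pred, x_true)
    orientation_term = l2(q_pred, normalize(q_true))
    return position_term + beta * orientation_term
```

Position error is measured in scene units (metres, say) while quaternion error is bounded by 2 for unit quaternions, so β has to rescale the orientation term until neither one dominates. That scale depends on how large the scene is, which is why it needs retuning per scene.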
What was actually tricky
- Caffe was already on its way out by late 2019. The framework’s prototxt+caffemodel format was opaque; debugging anything meant reading the network definition from train_val.prototxt by hand.
- GPU memory. GoogLeNet on a 1050 Ti was tight; batch size 16 was the upper bound before OOM.
- The dataset bias. PoseNet learns the appearance of a scene more than its geometry. A change in lighting (day → dusk) wrecks the model. The paper acknowledged this; the fix is data augmentation + multi-time-of-day captures.
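The augmentation half of that fix can be sketched as a random gain and gamma change, which crudely imitates exposure and lighting shifts. This is an illustrative sketch only: the function names and the gain/gamma ranges are made up, the image is assumed to be grayscale rows of floats in [0, 1], and a real pipeline would do this with array operations inside the data loader.

```python
import random


def simulate_exposure(image, gain, gamma):
    """Apply gain * pixel**gamma to every pixel and clamp to [0, 1].

    image: list of rows, each a list of floats in [0, 1] (assumed format).
    gamma > 1 darkens mid-tones, gain < 1 dims the whole image.
    """
    return [
        [min(1.0, max(0.0, gain * (p ** gamma))) for p in row]
        for row in image
    ]


def random_exposure(image, rng):
    """Draw a random gain and gamma, then apply them.

    The ranges below are illustrative guesses, not tuned values.
    """
    gain = rng.uniform(0.5, 1.2)
    gamma = rng.uniform(0.7, 1.8)
    return simulate_exposure(image, gain, gamma)
```

Called as `random_exposure(image, random.Random(0))`, each training image gets a different simulated exposure while its pose label stays unchanged. This only covers global brightness; shadows and changed light direction still need the multi-time-of-day captures.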
What I’d do differently with hindsight
- Use a modern relocalization stack. HLoc (Hierarchical Localization), Marepo, or any retrieval-then-PnP pipeline blows end-to-end pose regression out of the water. PoseNet was the first answer, not the right one.
- PyTorch port + ONNX. Caffe is dead; the cost of carrying it on any new project isn’t justified.
- For learning, the value is the loss function. The decoupled position-orientation loss with a tunable weight β is a really clean teaching example for multi-task regression.
What it taught me
The framework matters less than the math. Caffe was a hassle but the PoseNet idea (regress pose end-to-end from raw pixels) was a clean formulation that turned out to be a research dead-end — and that’s a useful lesson too. Many of the “first” approaches to a problem age poorly; the abstractions that age well are the loss functions and the datasets, not the architectures.