What it did
Trained a CNN to classify 32×32 images of German traffic signs into 43 classes. The images in the GTSRB dataset are real photos with mixed lighting, partial occlusions, and class imbalance. The project rubric required ≥93% validation accuracy.
The architecture
LeNet-ish, deliberately small:
Input (32×32×3)
→ Conv5×5 (6 filters) + ReLU + MaxPool 2×2
→ Conv5×5 (16 filters) + ReLU + MaxPool 2×2
→ Flatten → FC(120) → ReLU
→ FC(84) → ReLU
→ FC(43)
Cross-entropy loss, Adam optimizer, batch size 128, ~20 epochs to saturate.
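To make the layer sizes concrete, here is a small shape and parameter walk-through of the stack above. It is a sketch under assumptions the write-up does not state: stride-1 convolutions with no padding (as in the original LeNet) and non-overlapping 2×2 pooling. The helper names are made up for illustration.

```python
# Shape / parameter walk-through for the LeNet-ish stack.
# ASSUMPTION: stride-1 "valid" 5x5 convolutions and stride-2 2x2 pooling;
# the write-up does not state padding or strides.

def conv_out(size, kernel=5):
    """Spatial size after a stride-1 convolution with no padding."""
    return size - kernel + 1


def pool_out(size, window=2):
    """Spatial size after non-overlapping max pooling."""
    return size // window


def lenet_summary(side=32, in_channels=3, conv_filters=(6, 16),
                  fc_units=(120, 84, 43), kernel=5):
    """Return a list of (layer name, output shape, parameter count)."""
    rows = []
    channels = in_channels
    for i, filters in enumerate(conv_filters, start=1):
        # kernel*kernel*in_channels weights per filter, plus one bias each
        params = kernel * kernel * channels * filters + filters
        side = pool_out(conv_out(side, kernel))
        channels = filters
        rows.append((f"conv{i}+pool", (side, side, channels), params))
    features = side * side * channels  # size of the flattened vector
    for i, units in enumerate(fc_units, start=1):
        rows.append((f"fc{i}", (units,), features * units + units))
        features = units
    return rows


if __name__ == "__main__":
    summary = lenet_summary()
    for name, shape, params in summary:
        print(f"{name:12s} {str(shape):14s} {params:>7,d}")
    print("total parameters:", sum(p for _, _, p in summary))
```

Under those assumptions the spatial size goes 32 → 28 → 14 → 10 → 5, the flatten step yields 5×5×16 = 400 features, and the whole network has 64,811 trainable parameters, most of them in the first fully connected layer.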
What was actually tricky
- Class imbalance. GTSRB has 43 classes but the distribution is long-tailed — some classes have 2,000+ samples, others have under 200. Without weighted loss or augmentation, the model learns the head and ignores the tail.
- TensorFlow’s session API. Before Keras became the default, you had to wire up the graph, the session, and the feed dict by hand. A single typo in a placeholder name surfaced as a shape-mismatch error whose real cause was buried deep in the stack trace.
- Validation accuracy ≠ real accuracy. The val split was a hold-out of the same recording sessions; the model overfit to the lighting more than the signs. Performance on phone-camera photos was much worse.
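For the class-imbalance point above, one common remedy is to weight the loss by inverse class frequency. The sketch below is not what the project used (it trained without weighting); the function names and the toy label counts are illustrative assumptions.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels, num_classes):
    """Weight for class c = n_samples / (num_classes * count_c).

    Rare classes get weights above 1, frequent classes below 1, so every
    class contributes the same total weight to the loss.
    """
    counts = Counter(labels)
    total = len(labels)
    weights = []
    for c in range(num_classes):
        if counts[c] == 0:
            raise ValueError(f"class {c} has no samples")
        weights.append(total / (num_classes * counts[c]))
    return weights


def weighted_cross_entropy(probs, label, weights):
    """Cross-entropy for one sample, scaled by the weight of its true class."""
    return -weights[label] * math.log(probs[label])


if __name__ == "__main__":
    # Toy two-class stand-in for a head class and a tail class.
    labels = [0] * 2000 + [1] * 200
    w = inverse_frequency_weights(labels, num_classes=2)
    print(w)
    # The same prediction error costs more on the rare class.
    print(weighted_cross_entropy([0.7, 0.3], 0, w))
    print(weighted_cross_entropy([0.7, 0.3], 1, w))
```

With 2,000 versus 200 samples the weights come out to 0.55 and 5.5, so both classes carry equal total weight (1,100 each). Oversampling the tail classes through augmentation is the alternative when a weighted loss makes training unstable.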
What I’d do differently with hindsight
- Start with a small pretrained ResNet18. On a dataset this small, fine-tuning ImageNet features for ~1 hour beats training from scratch for ~3 hours.
- Augment aggressively. Random brightness, contrast, perspective, and occlusion all help. Horizontal flip is the exception: many GTSRB signs are direction-dependent (keep left vs. keep right), so mirroring them changes the label.
- Use a proper cross-validation strategy. A k-fold split by recording session (not by frame) keeps frames of the same physical sign from landing in both train and val.
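The session-level split from the last bullet can be sketched in a few lines. This assumes each frame carries some session or track identifier; the `group_ids` list and the `grouped_kfold` helper are hypothetical names, and scikit-learn's `GroupKFold` offers comparable behavior off the shelf.

```python
from collections import defaultdict


def grouped_kfold(group_ids, k=5):
    """Yield (train_idx, val_idx) pairs in which no group spans both sides.

    group_ids[i] is a (hypothetical) recording-session or track id for
    frame i. Whole groups are assigned to folds, never individual frames.
    """
    by_group = defaultdict(list)
    for idx, g in enumerate(group_ids):
        by_group[g].append(idx)
    if len(by_group) < k:
        raise ValueError("need at least k distinct groups")

    # Greedy balancing: largest groups first, each into the smallest fold.
    folds = [[] for _ in range(k)]
    ordered = sorted(by_group, key=lambda g: (-len(by_group[g]), str(g)))
    for g in ordered:
        min(folds, key=len).extend(by_group[g])

    for i in range(k):
        val = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i
                       for idx in fold)
        yield train, val


if __name__ == "__main__":
    # Toy data: 10 sessions of 30 consecutive frames each.
    groups = [i // 30 for i in range(300)]
    for train, val in grouped_kfold(groups, k=5):
        train_groups = {groups[i] for i in train}
        val_groups = {groups[i] for i in val}
        assert not train_groups & val_groups  # no session leaks across
        print(len(train), len(val))
```

A frame-level random split would put near-duplicate frames of one sign on both sides, which is how validation accuracy ends up overstating performance on new photos.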
What it taught me
This was the first time I trained a CNN end-to-end. The shock wasn’t that it worked; it was how aggressively the model memorized the training distribution and failed on anything just outside it. That experience informed every later ML project: spend the first hour finding the weird examples in your data, because the model will find them too.