PR2 Perception Pick-and-Place · Udacity

ROS + PCL stack for the PR2 robot: cluster a tabletop, classify objects with an SVM on color/normals features, generate grasp poses.

What it did

Final project of the Robotics Nanodegree’s Perception module. The PR2 (the Willow Garage robot, here in simulation) sees a cluttered tabletop through its RGB-D camera, segments the objects, classifies each one, then publishes grasp-pose messages so a downstream motion planner can pick the right target and place it in the right bin.

The pipeline

Voxel-grid downsample the depth cloud.
Passthrough filter along the Z axis to crop the cloud to the table region.
RANSAC plane segmentation removes the table surface; remainder is “things on the table.”
Euclidean clustering → one cluster per object.
Per-cluster features: color histogram (HSV) + surface-normal histogram → ~64-dim vector.
SVM classifier (trained offline on labeled clouds from each target object) predicts the class.
Grasp pose: cluster centroid + table normal → approach vector.
Publish a PickPlaceRequest ROS message with arm + object + bin.
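
The geometric half of that list can be sketched in plain NumPy. This is an illustration only, not the project's code (which ran on PCL inside ROS): the function names, the synthetic scene, and every threshold are assumptions made up for the example. Euclidean clustering is skipped because the toy scene holds a single object, and pointing the approach vector down the table normal is one possible reading of "centroid + table normal."

```python
import numpy as np

def voxel_downsample(points, leaf):
    """Replace all points in each occupied voxel (edge `leaf`) by their mean."""
    keys = np.floor(points / leaf).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.ravel()  # the returned shape differs across NumPy versions
    counts = np.bincount(inverse)
    sums = np.stack(
        [np.bincount(inverse, weights=points[:, a]) for a in range(3)], axis=1
    )
    return sums / counts[:, None]

def passthrough_z(points, z_min, z_max):
    """Keep points whose Z coordinate lies in [z_min, z_max]."""
    keep = (points[:, 2] >= z_min) & (points[:, 2] <= z_max)
    return points[keep]

def ransac_plane(points, threshold, iterations=200, seed=0):
    """Return (unit normal, inlier mask) of the sampled plane with most inliers."""
    rng = np.random.default_rng(seed)
    best_normal, best_inliers = None, np.zeros(len(points), dtype=bool)
    for _ in range(iterations):
        p0, p1, p2 = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        length = np.linalg.norm(normal)
        if length < 1e-12:  # degenerate (collinear) sample
            continue
        normal = normal / length
        inliers = np.abs((points - p0) @ normal) < threshold
        if inliers.sum() > best_inliers.sum():
            best_normal, best_inliers = normal, inliers
    return best_normal, best_inliers

# Synthetic scene; all dimensions are invented for illustration.
rng = np.random.default_rng(42)
floor = np.column_stack(
    [rng.uniform(0, 1, (500, 2)), rng.uniform(-0.001, 0.001, 500)]
)
table = np.column_stack(
    [rng.uniform(0, 1, (2000, 2)), 0.6 + rng.uniform(-0.001, 0.001, 2000)]
)
obj = rng.uniform([0.40, 0.40, 0.65], [0.50, 0.50, 0.75], (300, 3))
cloud = np.vstack([floor, table, obj])

down = voxel_downsample(cloud, leaf=0.01)              # step 1
cropped = passthrough_z(down, z_min=0.5, z_max=1.0)    # step 2: drops the floor
normal, on_table = ransac_plane(cropped, threshold=0.01)  # step 3
objects = cropped[~on_table]                           # "things on the table"

# Grasp step: orient the table normal upward, then approach against it.
if normal[2] < 0:
    normal = -normal
centroid = objects.mean(axis=0)
approach = -normal
print("centroid:", centroid.round(3), "approach:", approach.round(3))
```

With several objects on the table, `objects` would go through a clustering step first and the centroid would be computed per cluster.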

What was actually tricky

The PR2’s camera frame, the robot frame, and the world frame are three different things. Every transform has to be applied between the right pair of frames, and the TF listener has to be running before the perception node starts processing.
Color features generalize poorly across lighting. HSV is better than RGB but the simulator’s directional lighting still made classifier accuracy brittle. Augmenting the training data with jittered lighting helped.
The whole pipeline runs in Gazebo: slow to iterate, slow to reset between tests, and the simulated sensor noise differs from what a real PR2’s sensors produce.
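
The lighting point can be made concrete with a small sketch. It assumes an 8-bin histogram per HSV channel and models a lighting change as a uniform brightness scale. Both are simplifications chosen for the example, not the feature layout or the jitter used in the project, and `hsv_histogram` and `jitter_brightness` are hypothetical helper names.

```python
import colorsys
import numpy as np

def hsv_histogram(rgb, bins=8):
    """Concatenated H, S, V histograms of an (N, 3) RGB array in [0, 1]."""
    hsv = np.array([colorsys.rgb_to_hsv(r, g, b) for r, g, b in rgb])
    hists = [
        np.histogram(hsv[:, c], bins=bins, range=(0.0, 1.0))[0] for c in range(3)
    ]
    feature = np.concatenate(hists).astype(float)
    return feature / feature.sum()

def jitter_brightness(rgb, scale):
    """Crude lighting jitter: scale every channel, clip to the valid range."""
    return np.clip(rgb * scale, 0.0, 1.0)

rng = np.random.default_rng(0)
# A reddish object: 500 points with slightly varying color (made-up data).
red_object = rng.uniform([0.6, 0.1, 0.1], [0.9, 0.3, 0.3], size=(500, 3))

bright = hsv_histogram(red_object)
dim = hsv_histogram(jitter_brightness(red_object, 0.6))

# First 16 entries are the hue and saturation bins, last 8 are value bins.
hs_shift = np.abs(bright[:16] - dim[:16]).sum()
v_shift = np.abs(bright[16:] - dim[16:]).sum()
print(f"H+S shift: {hs_shift:.3f}, V shift: {v_shift:.3f}")

# Augmentation: several jittered copies of the same labeled cloud.
augmented = [
    hsv_histogram(jitter_brightness(red_object, s))
    for s in rng.uniform(0.5, 1.0, size=5)
]
```

A uniform dimming leaves the hue and saturation bins essentially untouched and moves only the value bins, which is the sense in which HSV beats RGB. Directional lighting is not a uniform scale, so real shading also disturbs the other channels, and that is where jittered training copies earn their keep.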

What I’d do differently with hindsight

Replace the SVM with a small PointNet or PointNet++ classifier. Modern point-cloud classifiers take clustered objects directly, without hand-engineered features, and they tend to generalize better.
Use a learned segmentation model instead of RANSAC + Euclidean clustering. Mask R-CNN on the RGB stream, with the masks back-projected into the depth cloud, is more robust to clutter.
Treat grasp pose as a learned policy (à la Dex-Net) rather than a hand-engineered geometric construction. The simple “centroid + table normal” approach fails on non-convex objects, where the centroid can sit in empty space.
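
The non-convex failure is easy to reproduce on synthetic data. The sketch below compares a ring-shaped cloud (think of a mug handle or a roll of tape) with a solid block; the shapes, sizes, and the `centroid_gap` helper are all invented for the illustration.

```python
import numpy as np

def centroid_gap(points):
    """Return the centroid and its distance to the nearest point of the cloud."""
    centroid = points.mean(axis=0)
    gap = np.linalg.norm(points - centroid, axis=1).min()
    return centroid, gap

# Ring of radius 5 cm lying flat on the table (non-convex).
theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
ring = np.column_stack(
    [0.05 * np.cos(theta), 0.05 * np.sin(theta), np.zeros_like(theta)]
)

# Solid 10 cm block sampled throughout its volume (convex).
rng = np.random.default_rng(0)
block = rng.uniform(-0.05, 0.05, size=(2000, 3))

ring_centroid, ring_gap = centroid_gap(ring)
block_centroid, block_gap = centroid_gap(block)
print(f"ring: nearest point is {ring_gap * 100:.1f} cm from the centroid")
print(f"block: nearest point is {block_gap * 100:.1f} cm from the centroid")
```

For the ring, the centroid lands in the hole, a full radius away from any material, so a gripper sent there closes on air. For the block, the centroid is surrounded by points and the geometric rule works.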

What it taught me

ROS is a graph-of-nodes framework, not a programming model. Most of the debugging time on this project was getting nodes to subscribe and publish at the right rate on the right topic, not the actual perception. Lesson: a robotics stack’s hardest problem is rarely the algorithms; it’s the wiring and the timing.