Estimating Robot Pose from a Single Camera
October 2025 - December 2025
For a deep learning class project at BU, two classmates and I built a system that estimates the joint angles of a UR10 robotic arm from a single RGB image. No internal sensors, no fiducial markers on the robot. Just a camera looking at the arm, and a neural network that figures out what configuration it's in.
The motivation is straightforward. Industrial robots know their joint angles from internal encoders, but those drift over time. External cameras could provide an independent check, or let one robot track another without any communication channel. The question is whether you can get accurate enough pose estimates from pixels alone.
Why Two Stages
Our first attempt was the obvious one: take an image, run it through a CNN, output six joint angles directly. It didn't work. The errors were between 68 and 79 degrees per joint, which is useless for any practical purpose.
The problem is that joint angles aren't the kind of thing CNNs are naturally good at predicting. A CNN excels at spatial reasoning in image coordinates. It can tell you where things are in the picture. Asking it to jump directly from pixels to abstract angular quantities skips over the spatial structure it's best at exploiting.
So we split the problem. Stage 1 is a ResNet-18 that localizes the robot with a bounding box. Stage 2 is a ResNet-34 that takes the cropped image at higher resolution and predicts 2D keypoint locations for each of the six joints. Then we recover joint angles through inverse kinematics using known camera intrinsics and the UR10's kinematic chain. This two-stage approach got us mean keypoint errors of about 14 pixels, translating to roughly 6 centimeters in physical space.
Synthetic Data
We didn't have access to a physical UR10, so the entire dataset was synthetic, generated in NVIDIA Isaac Sim: 8,000 training images and 2,000 test images of the robot in random configurations.
The challenge with synthetic data is always the same: will the model generalize beyond the simulator? The standard approach is domain randomization. You make the training data look deliberately unrealistic in random ways, so the model can't latch onto any simulator-specific artifacts. We randomized robot textures, floor textures, lighting, camera pose, and added Gaussian noise and cutout augmentation in post-processing. The test set used a completely different environment with no randomization, trying to approximate realistic conditions.
Whether this actually bridges the sim-to-real gap is something we never tested. We only evaluated on synthetic test data. That's the biggest limitation of the project.
What Worked, What Didn't
The bounding box detector worked well, with 86% mean IoU. The keypoint regression was decent for the base, shoulder, and elbow joints, with median angular errors under 7 degrees. But the wrist joints were bad, with errors ranging from 11 to 46 degrees.
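One detail worth spelling out about angular-error numbers like these: the difference between two angles has to be wrapped, or a prediction of 350 degrees against a ground truth of 10 degrees scores as a 340-degree error. A sketch of a wraparound-safe metric (assumed, not the project's exact evaluation code):

```python
# Wraparound-safe absolute angular error in degrees.
import numpy as np

def angular_error_deg(pred_deg, true_deg):
    """Absolute joint-angle error, wrapped into [0, 180] degrees."""
    diff = (np.asarray(pred_deg) - np.asarray(true_deg) + 180.0) % 360.0 - 180.0
    return np.abs(diff)

# 350 degrees vs. 10 degrees is a 20-degree error, not 340:
err = angular_error_deg(350.0, 10.0)  # -> 20.0
```

Taking the median of these per-joint errors, rather than the mean, keeps a few catastrophic misdetections from dominating the reported number.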
The wrist failure has a clear explanation. We trained without an end effector attached to the robot, and with no tool mounted, rotating a wrist joint barely changes the arm's visual appearance. The model can't learn wrist orientation from pixels because the pixels don't change. This is a dataset problem, not a model problem, and it's fixable by including varied end effectors during training.
The inference pipeline runs in 56 milliseconds end-to-end, fast enough for real-time use at roughly 18 FPS. We compared against published baselines like DREAM and RoboPose, but those comparisons used different robot models and different datasets, so they don't mean much. The honest summary is that the system works for proximal joints in simulation, fails for distal joints due to a dataset design choice, and remains untested on real hardware.
What I Took Away
This was a team project, and I learned things from it that I wouldn't have from working alone. Splitting a pipeline across three people forces you to define interfaces clearly. What exactly does Stage 1 output? What coordinate frame? What normalization? These questions sound trivial, but getting them wrong means your teammate's model trains on garbage.
The bigger lesson was about the gap between getting a model to train and getting it to work on the right thing. Our networks converged quickly and produced low loss values. But loss going down doesn't mean the model learned what you wanted. The wrist joint failure is a perfect example. The network did exactly what we asked: minimize pixel error on keypoints. It's just that for wrist joints, the pixel error was low regardless of the actual angle, because the visual signal wasn't there.
The code and dataset are both public: the code at github.com/corneliusgruss/robot_pose_estimation and the dataset on Hugging Face.