PenPal
A vision-guided ROS 2 system that reads a real whiteboard and writes short responses back with a 7-DoF Franka arm.

What PenPal does
I wanted a robot that could have a *physical* conversation: you write something on a board, it reads it, decides on a response, and writes back. PenPal is that loop — perception to text to motion — running live in ROS 2 with a Franka arm.
Writing demo (HD): the arm writes an arbitrary word while the pen-tip tool frame stays constrained to the board surface.
Perception + TF calibration
I built the perception stack around a RealSense RGB-D feed, OpenCV, and AprilTag detections. The board pose is computed in a consistent frame, and I publish a calibrated camera→base TF chain so every downstream step (planning, TCP alignment, and writing) stays frame-correct instead of drifting between ad-hoc transforms.
RViz view: board pose / TF frames and motion planning visualization.
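
Below is a minimal sketch of how that calibrated transform gets onto the TF tree, assuming a standalone rclpy node; the frame names and numeric extrinsics here are placeholders, since the real values come out of the AprilTag-based calibration step:

```python
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import TransformStamped
from tf2_ros import StaticTransformBroadcaster


class CameraExtrinsicsPublisher(Node):
    """Publishes the calibrated camera->base transform as a static TF."""

    def __init__(self):
        super().__init__('camera_extrinsics_publisher')
        self._broadcaster = StaticTransformBroadcaster(self)

        t = TransformStamped()
        t.header.stamp = self.get_clock().now().to_msg()
        t.header.frame_id = 'panda_link0'                 # robot base frame
        t.child_frame_id = 'camera_color_optical_frame'   # RealSense optical frame
        # Placeholder extrinsics -- the real numbers come from calibration.
        t.transform.translation.x = 0.55
        t.transform.translation.y = 0.02
        t.transform.translation.z = 0.60
        t.transform.rotation.x = 0.0
        t.transform.rotation.y = 0.707   # ~90 deg about y, for illustration only
        t.transform.rotation.z = 0.0
        t.transform.rotation.w = 0.707
        self._broadcaster.sendTransform(t)


def main():
    rclpy.init()
    rclpy.spin(CameraExtrinsicsPublisher())


if __name__ == '__main__':
    main()
```

Publishing it once as a static transform means planning, TCP alignment, and writing all resolve poses through the same tf2 buffer.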
Closed-loop “read → answer → write” architecture
Instead of baking OCR into the main node, I exposed OCR + question answering as a ROS 2 service. That kept the system modular (easy to swap Gemini/Qwen, add a mock node, or run without the VLM) and made the main control loop simple: wait until the board is reliably visible, trigger OCR/QA, then write the returned text.
End-to-end loop: board visible → OCR/QA → writing response on the board.

High-level architecture: vision + TF → OCR/QA service → planners/controllers → MoveIt execution.
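
Here is a sketch of the control-loop side of that service call. The service name `/ocr_qa` and the use of `std_srvs/Trigger` (with the answer text returned in `message`) are assumptions for illustration, not the exact interface:

```python
from typing import Optional

import rclpy
from rclpy.node import Node
from std_srvs.srv import Trigger


class PenPalBrain(Node):
    """Main control-loop node: asks the OCR/QA service for a response to write."""

    def __init__(self):
        super().__init__('penpal_brain')
        self._ocr_qa = self.create_client(Trigger, '/ocr_qa')

    def read_and_answer(self) -> Optional[str]:
        """Call OCR/QA once the board is reliably visible; return the answer text."""
        if not self._ocr_qa.wait_for_service(timeout_sec=5.0):
            self.get_logger().warn('OCR/QA service not available')
            return None
        future = self._ocr_qa.call_async(Trigger.Request())
        rclpy.spin_until_future_complete(self, future)
        result = future.result()
        return result.message if result is not None and result.success else None
```

Because the VLM sits behind the service boundary, swapping Gemini for Qwen, or dropping in the mock OCR node, never touches the control loop itself.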
Motion planning + TCP alignment
On the motion side, I made the writing pipeline explicitly SE(3)-driven: poses are computed as transforms (not hand-tuned Euler tweaks), and I set a custom TCP so the pen tip becomes the control point. That way, MoveIt Cartesian plans keep the pen orientation stable relative to the board normal, which is the difference between “touching the board” and actually writing clean strokes.
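
As a concrete illustration of that SE(3) composition (plain numpy, with hypothetical frame names and offsets): a stroke waypoint is an offset in the board frame, composed with the board pose and with the inverse of the pen-tip TCP offset to get the flange pose that is actually planned to.

```python
import numpy as np


def se3(R: np.ndarray, p) -> np.ndarray:
    """Assemble a 4x4 homogeneous transform from rotation R and translation p."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = p
    return T


# Board pose in the robot base frame (from AprilTag detection + TF lookup).
# Rotation shown as identity for brevity; in practice it carries the board normal.
T_base_board = se3(np.eye(3), [0.45, 0.10, 0.30])

# Pen-tip TCP: the tip sits some distance along the flange z-axis.
T_flange_tip = se3(np.eye(3), [0.0, 0.0, 0.12])

# One stroke waypoint in the board plane (z = 0 means the pen is on the surface).
T_board_point = se3(np.eye(3), [0.02, -0.01, 0.0])

# Desired pen-tip pose in the base frame, then the flange pose handed to MoveIt.
T_base_tip = T_base_board @ T_board_point
T_base_flange = T_base_tip @ np.linalg.inv(T_flange_tip)
```

Every waypoint in a Cartesian plan is built the same way, so the pen's orientation relative to the board falls out of the board pose rather than per-stroke tuning.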
Reliability choices I made
To keep the loop stable, I added visibility gating (time threshold + tag-count threshold) so noisy detections don’t trigger writing. I also kept mock nodes (mock board detector + mock OCR) so I could test the full state machine and motion stack even when the camera/VLM wasn’t running.
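
The gating logic itself is small; a sketch with hypothetical thresholds:

```python
class VisibilityGate:
    """Only allow writing once enough AprilTags have been visible for long enough."""

    def __init__(self, min_tags: int = 3, min_stable_sec: float = 2.0):
        self._min_tags = min_tags              # tag-count threshold
        self._min_stable_sec = min_stable_sec  # time threshold
        self._stable_since = None

    def update(self, n_tags: int, now_sec: float) -> bool:
        """Feed in each detection frame; returns True when the board is reliably visible."""
        if n_tags < self._min_tags:
            self._stable_since = None          # any dropout resets the timer
            return False
        if self._stable_since is None:
            self._stable_since = now_sec
        return (now_sec - self._stable_since) >= self._min_stable_sec
```

The same gate runs unchanged against the mock board detector, which is what made the full state machine testable without the camera.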