Background

I have a goal of stacking Jenga blocks with a robot arm. Not in simulation — with
real hardware that can actually knock things over. The setup ended up being an Intel
RealSense depth camera bolted to the wrist of a UFactory xArm (an eye-in-hand
rig), OpenCV finding the best block in the frame, and the arm picking it up and
placing it on a growing stack.

But before any of that fancy motion, I needed to stop being scared of images.

Short Goal

So I started with this static photo of Jenga blocks from my desktop camera. The small
mission: draw accurate rotated boxes, center dots, and labels around the blocks by the
end of this post — and then, much later, turn those boxes into 3D points the robot can
actually reach.

Original RealSense camera image with five Jenga blocks on a table
The starting photo: five blocks, 640 x 480 pixels, shot with an Intel RealSense D435. The lighting is not perfect, the shadow is dramatic, and OpenCV is about to judge me.

After some trial, error, and mild emotional damage, this became the first usable output:

Detected Jenga blocks with green rotated boxes, center dots, and angle labels
Green boxes, center dots, sizes, and angles. The blocks are mostly found. The shadow on the right is still trying to become famous.

This post follows the whole road: from “what even is a pixel” to a real arm picking a
block off the table. Each section is one skill I had to learn, plus how it bolts onto
the one before it.

And the stupidity begins…

OpenCV 1

import cv2 as lv
import numpy as np

Typically, people import cv2 as cv, but why not make it un poco luxurious? lv sounds richer to me. HAHAH!

Everything below uses lv for OpenCV. In the actual repo it is the boring cv2, but here we stay rich.

Do Not Be Afraid by OpenClaw

OpenClaw once said, “yeah, I got you.”

Before the panic sets in, it helps to know what “computer vision” even means. Columbia’s
Prof. Shree Nayar gives the calmest possible answer:

"What is Computer Vision?" — the gentle, no-prerequisites intro from First Principles of Computer Vision (Prof. Shree Nayar, Columbia). OpenClaw was right: you've got this.

Basic

If you know this, simply skip :)

import numpy as np

a = np.array([1.1, 2.0, 3.6, 4.8, 5.9])

b = np.array([
    [1, 2, 3],
    [1, 2, 3],
    [1, 2, 3],
    [1, 2, 3],
])

print(a.shape)
print(b.shape)

Output:

(5,)
(4, 3)

Explanation:

  • shape returns a tuple.
  • Each number in the tuple tells you how many elements exist along that dimension.
  • a is one-dimensional, so its shape is (5,).
  • b has 4 rows and 3 columns, so its shape is (4, 3).

This matters more than it looks: an OpenCV image is a numpy array. A color frame is
(480, 640, 3) — height, width, then 3 color channels. Knowing which axis is which
saves you from a lot of “why is my image sideways” moments later.
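To make the axis order concrete, here is a minimal sketch on a blank stand-in frame (not a real camera image; the variable names are just for illustration). It only shows that indexing is [row, column], that is [y, x], and that the last axis holds B, G, R.

```python
import numpy as np

# A blank stand-in for a 640 x 480 color frame: (height, width, channels)
frame = np.zeros((480, 640, 3), dtype=np.uint8)

height, width, channels = frame.shape

# Indexing is [row, column] = [y, x], NOT [x, y]
x, y = 600, 100
frame[y, x] = (0, 0, 255)          # OpenCV order is B, G, R, so this pixel is pure red

blue_channel = frame[:, :, 0]      # one channel is a plain 2D array
red_value = int(frame[y, x, 2])

print(height, width, channels)     # 480 640 3
print(blue_channel.shape)          # (480, 640)
print(red_value)                   # 255
```

Swapping the order to frame[x, y] here would raise an IndexError, because 600 is past the last row. On a square image the same mistake would silently read the wrong pixel instead.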

What about the data types? Print them:

print(a.dtype)
print(b.dtype)

Output:

float64
int32

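It is obvious that dtype is simply short for dog Type, um, data type I mean. (The int32 is platform-dependent: it is what my machine printed, and many Linux and macOS setups print int64 instead.)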

Dictionaries

block2 = dict(
    center_uv=(100, 200),
    angle_deg=10,
    score=0.7,
    depth=0.3
)

Imagine this is an actual detected block, with its measurements stored in a dictionary. Hold onto this
shape — by the end of the post, the real detector hands back a dictionary that looks
almost exactly like this, with center_uv, angle_deg, score, and a depth.

Question:

How do I access a value by its key?

print(block2["score"])

Output:

0.7

OK, this works, but what if the key does not exist, such as shadow? Bracket access raises a KeyError, so the safer alternative is .get(), which returns a default instead:

print(block2.get("shadow", "No such key!"))

Output:

No such key!

Iterating Over a Dictionary

The default method that I first learned is:

for item in block2:
    print(item, block2[item])

Output:

center_uv (100, 200)
angle_deg 10
score 0.7
depth 0.3

But there is a more elegant way: use .items() and an f-string.

for key, value in block2.items():
    print(f"key: {key}, value: {value}")

Output:

key: center_uv, value: (100, 200)
key: angle_deg, value: 10
key: score, value: 0.7
key: depth, value: 0.3

.items() gives you both the key and value directly.

Lambda

Normally you write a function like this:

def get_score(block):
    return block['score']

With lambda, you don’t need to write a full function. I think of it as a small one-use function.

get_score_lambda = lambda b: b['score']

print(get_score_lambda(block2))

Output:

0.7

This exact trick is how the detector later picks a winner: max(blocks, key=lambda b: b["score"]).

Example: Nested Dictionaries

detection_result = {
    "frame_id": 12,
    "timestamp": 12345415465,
    "blocks": [
        {
            "center_uv": (120, 240),
            "angle_deg": 88,
            "depth": 0.67,
            "score": 0.67
        },
        {
            "center_uv": (320, 240),
            "angle_deg": -22,
            "depth": 0.4,
            "score": 0.93
        }
    ],
    "camera_info": {
        "width": 640,
        "height": 480,
        "fx": 615,
        "fy": 615
    }
}

We have two blocks in this dictionary. What should I do when I want to find the best score and the greatest depth?

best, deepest = (
    max(detection_result["blocks"], key=lambda b: b["score"]),
    max(detection_result["blocks"], key=lambda b: b["depth"])
)

print(f"best: {best}")
print(f"deepest: {deepest}")

Output:

best: {'center_uv': (320, 240), 'angle_deg': -22, 'depth': 0.4, 'score': 0.93}
deepest: {'center_uv': (120, 240), 'angle_deg': 88, 'depth': 0.67, 'score': 0.67}

Notice camera_info already carries fx and fy — the focal lengths. Those two
numbers are what turn a flat pixel into a real 3D point much later. Foreshadowing.

Enumerate

In simple terms, enumerate() provides you with both the index and the value simultaneously.

Without enumerate():

centers = [(100, 200), (300, 150), (250, 310)]

for i in range(len(centers)):
    print(i, centers[i])

With enumerate():

for i, value in enumerate(centers):
    print(i, value)

OpenCV 2: From Color to Edges

OpenCV loads images in BGR order, not RGB.

In this project, I tested multiple options such as Lab and HSV.
I realized that Saturation outperforms the others for this setup — the table is
washed-out and gray, while the wooden blocks actually have color. Saturation is exactly
“how colorful is this pixel,” so it separates wood from table almost for free.
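Here is a small sketch of what "how colorful is this pixel" means in numbers. It computes saturation by hand using the 8-bit formula OpenCV documents for BGR to HSV, S = (max - min) / max * 255. The two pixel values are made up for illustration, not sampled from the photo.

```python
import numpy as np

def saturation_8bit(bgr):
    """Saturation of one BGR pixel on OpenCV's 0-255 scale: (max - min) / max * 255."""
    channels = np.asarray(bgr, dtype=np.float64)
    v = channels.max()
    if v == 0:
        return 0                      # pure black has no defined color, call it 0
    return int(round((v - channels.min()) / v * 255))

gray_table = (180, 180, 180)   # hypothetical washed-out table pixel (B, G, R)
wood_block = (110, 160, 200)   # hypothetical tan wood pixel (B, G, R)

print(saturation_8bit(gray_table))  # 0
print(saturation_8bit(wood_block))  # 115
```

A gray pixel has equal channels, so its saturation is zero no matter how bright or dark it is. That is why a gray table drops out of the saturation channel even when the lighting is uneven.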

Set Up

import numpy as np
import cv2 as lv

img = lv.imread('cv.png')
if img is None:
    print('No such image!')
    exit(1)

print(f'Original Shape: {img.shape[1]}x{img.shape[0]}')

Explanation:

  • lv.imread('cv.png') tries to load the image.
  • If the image does not exist, img becomes None.
  • A normal program exit would be exit(0).
  • Since missing cv.png is an error, this uses exit(1).

HSV Channels

hsv = lv.cvtColor(img, lv.COLOR_BGR2HSV)

h_ch, s_ch, _ = lv.split(hsv)

Explanation:

  • cvtColor converts the image from BGR into HSV.
  • split separates the image into hue, saturation, and value channels.
  • _ means “I am intentionally ignoring this value.”

At this point I still could not tell whether hue or saturation would work better, so I split out both during preprocessing and compared them.

So I just stared at single channels and thresholds until one of them looked like a block
and not like modern art:

One color channel isolated, blocks faintly visible A second color channel isolated A thresholded channel showing the blocks as white
Channel shopping. Each one keeps a different part of the blocks (and a different part of the noise). Saturation kept winning.

Shadow Normalization

Saturation handles color, but it does nothing about uneven lighting. One side of the
table is bright, the other is in shadow, and a naive threshold either keeps the shadow
or loses the dim blocks. So before thresholding, I flatten the lighting.

The trick: take the L (lightness) channel, boost local contrast with CLAHE, then
divide the image by a heavily blurred copy of itself. The blurred copy is the slow
lighting gradient, so dividing it out leaves only the fast detail — the blocks.

def shadow_normalized_luma(img):
    lab = lv.cvtColor(img, lv.COLOR_BGR2LAB)
    l_ch = lab[:, :, 0]
    clahe = lv.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    equalized = clahe.apply(l_ch)                          # local contrast boost
    background = lv.GaussianBlur(equalized, (0, 0), 21)    # the slow lighting gradient
    return lv.divide(equalized, background, scale=255)     # divide the gradient out

  • CLAHE is contrast-limited adaptive histogram equalization: it brightens dark patches without
    blowing out the bright ones.
  • lv.divide(..., scale=255) multiplies the ratio by 255, so the result is still a normal 0–255 image.

Gaussian Blur

s_blur = lv.GaussianBlur(s_ch, (5, 5), 0)

Explanation:

  • One obstacle that hinders Canny from identifying edges is NOISE and rough pixel transitions.
  • Once this (5, 5) Gaussian blur is applied, each pixel becomes a weighted average of its 5x5 neighborhood, with nearby pixels counting more than distant ones.
  • This preprocessing step is vital because smoother input gives Canny fewer false edges from noise and wood grain.
Computerphile's Mike Pound on what a blur actually does to a pixel — the "calm down" button, explained properly with kernels and the Gaussian.
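To see what "weighted average" means for a (5, 5) kernel, here is a numpy-only sketch that builds the weights by hand. The sigma line uses the formula OpenCV documents for the case where you pass 0. The exact weights OpenCV uses internally for small kernels may differ slightly, so treat this as an illustration, not a bit-exact copy.

```python
import numpy as np

def gaussian_kernel_1d(ksize, sigma):
    """Normalized 1D Gaussian weights centered on the middle tap."""
    offsets = np.arange(ksize) - (ksize - 1) / 2.0
    weights = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    return weights / weights.sum()

ksize = 5
# Documented OpenCV formula for the sigma it derives when sigma is given as 0
sigma = 0.3 * ((ksize - 1) * 0.5 - 1) + 0.8   # 1.1 for ksize = 5

k1d = gaussian_kernel_1d(ksize, sigma)
k2d = np.outer(k1d, k1d)      # the 5x5 kernel is the outer product of the 1D one

print(np.round(k1d, 3))              # biggest weight in the middle, symmetric tails
print(round(float(k2d.sum()), 6))    # 1.0
```

The weights sum to 1, so the blur does not change overall brightness. It only moves a little of each pixel's value to its neighbors.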

Canny and Threshold Tuning

Canny finds edges by looking for sharp brightness changes. It takes two thresholds —
a low one and a high one — and honestly the only way I understood them was to open six
windows at once and stare:

edges = lv.Canny(blur, 30, 70)   # (low, high) — too low = static, too high = nothing

If you want the actual reason those two numbers exist (the “hysteresis” trick), Computerphile explains Canny better than I ever could:

Computerphile on the Canny edge detector — the two-threshold hysteresis idea I kept fighting with, drawn out step by step.
Six-panel grid of Canny edge results at different thresholds
The same frame run through six different Canny threshold pairs. Too low and the wood grain shows up as edges; too high and the block dissolves. I picked the pair where the outline was closed but the inside stayed quiet.
Edge tuning experiment Stacked edge windows during tuning
This is the part where computer vision stops feeling like magic and starts feeling like arguing with pixels.
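The two thresholds are easier to reason about on a single row of pixels. This is a simplified 1D sketch of the hysteresis idea only: real Canny works in 2D and also does gradient computation and non-maximum suppression first. The gradient strengths are made up.

```python
def hysteresis_1d(strengths, low, high):
    """Keep strong pixels, plus weak pixels connected to a strong one."""
    n = len(strengths)
    keep = [s >= high for s in strengths]          # strong edges always survive
    stack = [i for i in range(n) if keep[i]]
    while stack:                                   # grow outward from strong pixels
        i = stack.pop()
        for j in (i - 1, i + 1):
            if 0 <= j < n and not keep[j] and strengths[j] >= low:
                keep[j] = True
                stack.append(j)
    return keep

# Hypothetical gradient strengths along one row of pixels
row = [10, 40, 80, 50, 20, 45, 35, 10]
print(hysteresis_1d(row, low=30, high=70))
# [False, True, True, True, False, False, False, False]
```

The 40 and 50 survive because they touch the strong 80. The 45 and 35 are just as "weak" but are cut off from any strong pixel by the 20, so they are dropped as noise. Lowering `low` lets more of the outline connect up, and lowering `high` lets more things count as a starting point.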

Morphology

Canny gives me thin, broken outlines. To turn them into a solid shape I need
morphology — grow the edges a little (dilate), then close the gaps
(MORPH_CLOSE):

kernel = lv.getStructuringElement(lv.MORPH_RECT, (3, 3))
edges = lv.dilate(edges, kernel, iterations=1) # fatten the lines
edges = lv.morphologyEx(edges, lv.MORPH_CLOSE, kernel, iterations=2) # seal the gaps
Side-by-side dilation results showing thick white outlines around the blocks
Dilation test. Thicker outlines make contours easier to grab, but if I push it too far, the noise also gets promoted.
Final edge map with white outlines of the blocks on a black background
Final-ish edge map. The five blocks are there, mostly behaving. The right-side shadow is still applying for block citizenship.

Before I settled on the pipeline above, I threw everything at the wall. These are the
screenshots I kept while losing my mind:

Grayscale version of the Jenga block test image
Grayscale version. The blocks are still visible, but the useful wood color is gone. A little too elegant, a little too useless.
Comparison grid of edge detection and thresholding experiments
A small gallery of experiments: Canny thresholds, Otsu masks, and adaptive thresholding. Some found the blocks. Some found the meaning of chaos.

Two Masks, OR’d Together

Edges alone are fragile — one shadow and the outline breaks. Color alone is fragile too —
one bright reflection and a block goes missing. So the real pipeline builds two masks
and combines them with bitwise_or, on the theory that a pixel is probably a block if
either test agrees.

Color mask — threshold the LAB b channel (blue ↔ yellow) and the HSV saturation,
each with Otsu, which picks its own threshold automatically:

lab = lv.cvtColor(img, lv.COLOR_BGR2LAB)
hsv = lv.cvtColor(img, lv.COLOR_BGR2HSV)
b_channel = lv.GaussianBlur(lab[:, :, 2], (5, 5), 0) # yellow-ness
saturation = lv.GaussianBlur(hsv[:, :, 1], (5, 5), 0) # color-ness

_, b_mask = lv.threshold(b_channel, 0, 255, lv.THRESH_BINARY + lv.THRESH_OTSU)
_, sat_mask = lv.threshold(saturation, 0, 255, lv.THRESH_BINARY + lv.THRESH_OTSU)
color_mask = lv.bitwise_or(b_mask, sat_mask)
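For the curious, "picks its own threshold" is less magic than it sounds. Below is a numpy-only sketch of Otsu's method on a synthetic two-cluster channel: it tries every threshold and keeps the one that best separates the two groups (maximum between-class variance). The data is generated, not taken from the photo, and this is an illustration of the idea rather than OpenCV's exact implementation.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold t (0-255) that maximizes between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)

    w0 = np.cumsum(prob)                 # weight of class "<= t"
    w1 = 1.0 - w0                        # weight of class "> t"
    cum_mean = np.cumsum(prob * levels)
    total_mean = cum_mean[-1]

    valid = (w0 > 0) & (w1 > 0)          # skip splits that leave a class empty
    between = np.zeros(256)
    between[valid] = (total_mean * w0[valid] - cum_mean[valid]) ** 2 / (w0[valid] * w1[valid])
    return int(np.argmax(between))

# Synthetic channel: dark "table" around 60, bright "block" around 190
rng = np.random.default_rng(0)
table = rng.normal(60, 10, size=5000)
block = rng.normal(190, 10, size=2000)
channel = np.clip(np.concatenate([table, block]), 0, 255).astype(np.uint8)

t = otsu_threshold(channel)
mask = channel > t
print(t, int(mask.sum()))   # threshold lands between the clusters; mask keeps the block pixels
```

This works well when the histogram really has two humps. When the block and the table overlap in a channel, Otsu still returns a number, it is just not a useful one, which is one reason to combine two channels instead of trusting one.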

Edge-fill mask — take the closed edges from before, find the outer contours, and
fill them in so each outline becomes a solid blob. Then bitwise_or the two masks
together.

Left: blocks painted green by the mask. Right: the binary mask.
Left: the mask painted back over the blocks in green. Right: the clean binary mask the rest of the pipeline actually sees.

On real wooden blocks (the live rig, not the test photo) the same mask looks like this —
and notice the little marker on the top block, which I used as a sanity reference:

Green mask overlay on real wooden blocks with angle labels
The combined mask, painted green over the real blocks, already tagging each one with an angle.

Watershed: Splitting Blocks That Touch

Here is a problem the test photo hides: when two blocks touch, the mask glues them into
one fat blob, and minAreaRect then draws one giant wrong box. The fix is
watershed — treat the mask like a landscape and “flood” it from the center of each
block so the seam between them becomes a wall.

dist = lv.distanceTransform(mask, lv.DIST_L2, 5)         # how deep inside each blob
_, sure_fg = lv.threshold(dist, 0.45 * dist.max(), 255, lv.THRESH_BINARY)
sure_fg = np.uint8(sure_fg)                              # confident "this is a block"
sure_bg = lv.dilate(mask, kernel, iterations=3)          # everything outside this is confidently table
unknown = lv.subtract(sure_bg, sure_fg)                  # the contested border

_, markers = lv.connectedComponents(sure_fg)             # label each block center
markers += 1                                             # shift labels so the table is 1, not 0
markers[unknown == 255] = 0                              # 0 means "let watershed decide"
markers = lv.watershed(img.copy(), markers)              # carve the seams (boundaries become -1)
  • distanceTransform measures how far each white pixel is from the nearest black one —
    the centers of blocks are “deepest.”
  • connectedComponents gives each separate center its own number, so two touching
    blocks become two labels instead of one.
  • watershed grows those labels until they meet, drawing a boundary exactly where two
    blocks touch.

The “treat the image like a landscape and flood the valleys” picture is hard to get from
words alone, so here it is animated:

Watershed segmentation with the topographic "flood the valleys, build walls where the water meets" intuition — the same metaphor as above, but moving.

Finding the Block

Now the mask is clean and split, so I can finally measure each block. For every contour:
filter by area, find the center with moments, and find the size + angle with
minAreaRect.

contours, _ = lv.findContours(mask, lv.RETR_EXTERNAL, lv.CHAIN_APPROX_SIMPLE)
for cnt in contours:
    if lv.contourArea(cnt) < 500:
        continue                                      # ignore specks
    M = lv.moments(cnt)
    cx, cy = int(M["m10"] / M["m00"]), int(M["m01"] / M["m00"])   # center pixel
    (_, _), (w, h), angle = lv.minAreaRect(cnt)       # rotated box
    if w < h:                                         # always call the long side "w"
        w, h = h, w
        angle += 90.0

Scoring: which blob is actually a Jenga block?

Lots of things make rectangles. To pick the real block I give each candidate a
score — a weighted blend of four “does this look right” questions, each shaped so
that “close to the expected block” scores near 1:

fill_score   = min(area / (long_px * short_px), 1.0)             # how filled-in is the box?
aspect_score = np.exp(-((aspect - EXPECTED_RATIO) ** 2) / 4.0) # ~110/36 long:short?
size_score = (long_score + short_score) / 2.0 # right pixel size?
color_score = max(yellow_score, saturation_score) # actually wood-colored?

score = 0.30*fill_score + 0.35*aspect_score + 0.20*size_score + 0.15*color_score

Then I sort candidates by score and take the best — the max(..., key=lambda b: b["score"])
trick from the dictionaries section, finally paying off. Each detection comes back as a
dictionary that looks an awful lot like block2 from way back: center_uv, angle_deg,
score, and friends.
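The snippet above leaves a few names undefined, so here is a self-contained sketch of the same weighted blend. The weights and the aspect formula come from the snippet. Everything else is an assumption for illustration: the expected pixel sizes, the tolerances, the `closeness` helper, and the two candidate blobs.

```python
import math

EXPECTED_RATIO = 110 / 36        # long:short ratio taken from the comment above
EXPECTED_LONG_PX = 110.0         # assumption: expected long side in pixels
EXPECTED_SHORT_PX = 36.0         # assumption: expected short side in pixels

def closeness(value, expected, tolerance):
    """Hypothetical helper: 1.0 at the expected value, falling off smoothly."""
    return math.exp(-((value - expected) / tolerance) ** 2)

def block_score(area, long_px, short_px, yellow_score, saturation_score):
    aspect = long_px / short_px
    fill_score = min(area / (long_px * short_px), 1.0)
    aspect_score = math.exp(-((aspect - EXPECTED_RATIO) ** 2) / 4.0)
    long_score = closeness(long_px, EXPECTED_LONG_PX, 40.0)
    short_score = closeness(short_px, EXPECTED_SHORT_PX, 15.0)
    size_score = (long_score + short_score) / 2.0
    color_score = max(yellow_score, saturation_score)
    return 0.30 * fill_score + 0.35 * aspect_score + 0.20 * size_score + 0.15 * color_score

# Two made-up candidates: a clean block and a squarish, dull shadow blob
candidates = [
    {"name": "block",  "score": block_score(3800, 110, 36, 0.9, 0.8)},
    {"name": "shadow", "score": block_score(2500, 90, 80, 0.1, 0.2)},
]
best = max(candidates, key=lambda b: b["score"])
print(best["name"])   # block
```

Because the weights sum to 1 and every sub-score stays in 0 to 1, the final score also stays in 0 to 1, which makes it easy to put a "refuse to grab below this" cutoff on it.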

The angle from minAreaRect becomes the gripper yaw after normalizing it to a sane
range, so the gripper can line up with the block:

def normalize_angle(deg):
    """Squeeze any angle into [-90, 90) so the wrist never over-rotates."""
    deg = deg % 360.0
    if deg > 180.0: deg -= 360.0
    if deg >= 90.0: deg -= 180.0
    if deg < -90.0: deg += 180.0
    return deg
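A quick check of what that function does to a handful of angles. The function body is repeated so the example runs on its own, and the test angles are arbitrary.

```python
def normalize_angle(deg):
    """Squeeze any angle into [-90, 90)."""
    deg = deg % 360.0
    if deg > 180.0:
        deg -= 360.0
    if deg >= 90.0:
        deg -= 180.0
    if deg < -90.0:
        deg += 180.0
    return deg

for raw in (0, 45, 90, 135, 180, 270, -100):
    print(raw, "->", normalize_angle(raw))
# 0 -> 0.0
# 45 -> 45.0
# 90 -> -90.0
# 135 -> -45.0
# 180 -> 0.0
# 270 -> -90.0
# -100 -> 80.0
```

Folding by 180° is safe because a rectangular block looks the same after a half turn, so 135° and -45° describe the same grasp. The second is just less wrist travel.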
Multiple blocks each with a green box, center dot, and angle label
Every block boxed, centered, and labeled with its angle. This is the static goal from the very top of the post — achieved.

The pipeline on real blocks

Stepping through the actual debug images the program writes on every grab — raw frame,
binary mask, final pick:

Raw color frame of wooden blocks on a white table Clean binary mask of the same blocks Final detection with the chosen block outlined and labeled
Input → mask → decision. The chosen block gets a green outline and a red label with its score, depth, and yaw. The others are politely ignored.

Adding Depth: From Pixels to 3D

A green box is nice, but the robot lives in millimeters, not pixels. The RealSense gives
me a depth image aligned to the color image, so every pixel also has a distance.

import pyrealsense2 as rs

# frames: one frameset from the RealSense pipeline's wait_for_frames()
align = rs.align(rs.stream.color)            # make depth line up with color
aligned = align.process(frames)
color = np.asanyarray(aligned.get_color_frame().get_data())
depth = np.asanyarray(aligned.get_depth_frame().get_data())

Raw depth is noisy and full of holes, so for the block center I take a median of the
valid depths in a small patch (and fall back to the median inside the contour if the
center pixel is a dropout). A median shrugs off the zeros that wreck an average.
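A minimal sketch of the patch median, on a synthetic depth frame. The function name, the 7x7 patch size, and the depth scale of 0.001 m per raw unit are assumptions for illustration. On a real camera the scale should be read from the device.

```python
import numpy as np

def median_depth_m(depth_raw, u, v, depth_scale=0.001, half=3):
    """Median of the valid (non-zero) depths in a small patch around pixel (u, v)."""
    h, w = depth_raw.shape
    patch = depth_raw[max(v - half, 0):min(v + half + 1, h),
                      max(u - half, 0):min(u + half + 1, w)]
    valid = patch[patch > 0]              # 0 means "no depth here"
    if valid.size == 0:
        return None                       # caller can fall back to the contour median
    return float(np.median(valid)) * depth_scale

# Synthetic 16-bit depth frame: flat surface at 285 raw units, with dropouts
depth = np.full((480, 640), 285, dtype=np.uint16)
depth[240, 320] = 0                        # the center pixel itself is a hole
depth[238:240, 318:322] = 0                # plus a few neighbors

print(median_depth_m(depth, u=320, v=240))   # about 0.285, despite the holes
```

A plain mean over the same patch would count the nine zeros and come out around 0.23 m, which is the kind of error that sends a gripper to the wrong height.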

Then the actual magic — back-projection — turns a pixel (u, v) plus its depth into
a real camera-frame point in meters, using the camera’s focal lengths and optical center:

def pixel_to_3d(u, v, depth_m, fx, fy, cx, cy):
    x = (u - cx) / fx * depth_m
    y = (v - cy) / fy * depth_m
    z = depth_m
    return x, y, z                 # XYZ in the camera frame, in meters

Those fx, fy are the same focal lengths hiding in the camera_info dictionary from
the Dictionaries section. Everything connects.
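A worked example with round numbers helps here. The focal lengths are the 615 from the camera_info example. The optical center at (320, 240) is an assumption (the middle of a 640 x 480 frame), since real intrinsics are usually a little off-center.

```python
def pixel_to_3d(u, v, depth_m, fx, fy, cx, cy):
    x = (u - cx) / fx * depth_m
    y = (v - cy) / fy * depth_m
    return x, y, depth_m

fx = fy = 615.0            # from the camera_info example earlier
cx, cy = 320.0, 240.0      # assumption: optical center at the middle of the frame

# A pixel at the optical center lies on the optical axis
print(pixel_to_3d(320, 240, 0.4, fx, fy, cx, cy))   # (0.0, 0.0, 0.4)

# 123 px right of center at 0.4 m: 123 / 615 * 0.4 = 0.08 m to the right
x, y, z = pixel_to_3d(443, 240, 0.4, fx, fy, cx, cy)
print(round(x, 3), y, z)                             # 0.08 0.0 0.4
```

The same 123-pixel offset at 0.8 m would be 0.16 m. Pixels are angles, and depth is what turns an angle into a distance.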

Where do fx, fy, cx, cy even come from? They are the linear camera model. Columbia’s
First Principles of Computer Vision lays it out cleanly — this is the math my four
mystery numbers are quietly obeying:

First Principles of Computer Vision (Prof. Shree Nayar, Columbia) on the linear camera model — where fx, fy, cx, cy actually come from, and why back-projection is just running it backwards.

To actually see depth, you colorize it. Near is bright, far is dark — here is a fist
held in front of the camera, which is exactly the kind of thing that breaks naive color
masks but reads beautifully in depth:

Depth image colorized with a heat colormap, a fist glowing in front
Colorized depth. Yellow is close, red is far. The camera does not care that it is a fist — it just reports distance.

Put it together and a single detection now carries a real 3D position, a yaw, and a
confidence score:

A single detected block labeled with score, depth in meters, and yaw
score, z=0.285m, yaw — the block is no longer a picture, it is a place in space the arm can reach.

Going Live

A static photo is a comfortable lie. The real rig runs on a live stream so I can move
the camera around and watch detections update in real time. Instead of capturing one
frame, I keep the RealSense pipeline open and loop:

while True:
    color, depth = stream.read()
    best = pipeline.get_best(color, depth)            # detect + back-project, every frame
    preview = draw_detection(color, best)
    lv.imshow("Live Stack Grab", draw_live_overlay(preview, best, stack_count))

    key = lv.waitKey(1) & 0xFF
    if key == ord("g"):                               # grab the block I'm currently looking at
        stack_current_detection(...)
    elif key in (ord("q"), 27):                       # q or Esc quits
        break

The overlay shows the live score, depth, yaw, an FPS counter, and how many blocks are in
the stack so far. Press g and the arm grabs whatever is currently highlighted.

Live detection on real wooden blocks, one boxed with score and angle Live detection finding multiple wooden blocks at once
Live, from the wrist camera's point of view. Block count, score, and angle update every frame. This is the static pipeline, but now it has to keep up.

A Detour: Pointing It at My Face

Once the pipeline worked, I obviously pointed it at things that are not Jenga blocks,
because that is the law. The same HSV/skin logic happily detects a face, and depth
happily measures a hand. Useless for stacking, excellent for morale.

The detector boxing a face, with a green skin mask on the right The detector outlining a hand with a depth reading
Same masks, different victim. Skin is just another saturated, warm-colored region as far as the pipeline is concerned.

From Camera to Robot

The detector gives me a point in the camera’s frame. The robot thinks in the
base frame. Because the camera rides on the wrist (eye-in-hand), there is no single
fixed transform — the camera moves with every joint. So at the moment of capture I read
the live wrist pose and chain two transforms:

point_base = T_base_from_tcp_current @ T_tcp_from_camera @ point_camera
  • T_tcp_from_camera is the measured mount: where the camera sits relative to the
    tool, in millimeters and degrees. Measure it once, mark it verified, or the script
    refuses to move (a safety I was grateful for more than once).
  • T_base_from_tcp_current is the live wrist pose, read the instant the frame is taken.
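Here is the chain with actual 4x4 matrices. This is a simplified sketch: the rotations are yaw-only, and every number is hypothetical rather than a real calibration. A real wrist pose also carries roll and pitch, but the multiplication order is the same.

```python
import numpy as np

def make_transform(yaw_deg, translation_mm):
    """4x4 homogeneous transform: rotation about Z by yaw, plus a translation."""
    yaw = np.radians(yaw_deg)
    T = np.eye(4)
    T[:3, :3] = [[np.cos(yaw), -np.sin(yaw), 0.0],
                 [np.sin(yaw),  np.cos(yaw), 0.0],
                 [0.0,          0.0,         1.0]]
    T[:3, 3] = translation_mm
    return T

# Hypothetical numbers, not a real calibration
T_tcp_from_camera = make_transform(0.0, [0.0, 50.0, 0.0])             # camera 50 mm off the tool
T_base_from_tcp_current = make_transform(90.0, [300.0, 0.0, 200.0])   # live wrist pose

point_camera = np.array([10.0, 0.0, 285.0, 1.0])   # homogeneous: x, y, z, 1 (mm)

point_base = T_base_from_tcp_current @ T_tcp_from_camera @ point_camera
print(np.round(point_base[:3], 3))                 # approximately [250, 10, 485]
```

Reading right to left: the point first moves from camera coordinates into tool coordinates, then from tool coordinates into base coordinates. Swapping the two matrices gives a different, wrong answer, because matrix multiplication does not commute.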

If the @ (matrix multiply) and the idea of a “rotation matrix” feel like hand-waving,
this is the one video that made it click for me — a matrix is just a transformation of
space, and a rotation is one of them:

3Blue1Brown on linear transformations and matrices — the best intuition for what a rotation matrix is really doing to a point in space.

The gripper yaw gets its own little equation, combining the wrist’s current yaw, the
block’s vision yaw, and a 90° offset for grabbing the long side:

pick_yaw = capture_yaw - vision_yaw + yaw_offset + wrist_grasp_offset  # +90 for long side

Before anything moves, every target is checked against a safe-range box (a min/max
XYZ volume in the config). Outside the box → the move is refused. Then the arm runs a
plain, boring, beautiful 9-step pick-and-place: open gripper, rise to a safe height, go
above the block, descend, close, lift, go above the place spot, descend, release.
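The safe-range check is simple enough to sketch in a few lines. The limits, the dictionary layout, and the function names below are hypothetical. The real limits live in the config file.

```python
SAFE_RANGE_MM = {                     # hypothetical limits, for illustration only
    "x": (150.0, 450.0),
    "y": (-250.0, 250.0),
    "z": (5.0, 400.0),
}

def is_inside_safe_range(target_mm, safe_range=SAFE_RANGE_MM):
    """True only if every coordinate sits inside its [min, max] limit."""
    x, y, z = target_mm
    for axis, value in (("x", x), ("y", y), ("z", z)):
        low, high = safe_range[axis]
        if not low <= value <= high:
            return False
    return True

def check_target(target_mm):
    """Raise before any motion command is sent if the target is out of bounds."""
    if not is_inside_safe_range(target_mm):
        raise ValueError(f"target {target_mm} is outside the safe range, refusing to move")
    return target_mm

print(is_inside_safe_range((250.0, 10.0, 120.0)))   # True
print(is_inside_safe_range((250.0, 10.0, -3.0)))    # False: below the table
```

Raising an exception instead of clamping is a deliberate choice in this sketch. A clamped target still moves the arm somewhere the vision pipeline never asked for, while a refused move leaves it where it is.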

This is the panel I used to jog the arm and watch its live pose while tuning all of the
above:

The robot arm control panel showing joint angles and live XYZ pose
The xArm console: six joint angles on the left, live X/Y/Z and roll/pitch/yaw on the right. The vision pipeline ultimately just fills in one of those XYZ targets.

Stacking

One pick is a party trick; a tower is the goal. The stacking math is deliberately simple:
two blocks per layer, side by side, offset by the block width (25 mm). Each new layer is
raised by the block height (15 mm) and rotated 90° from the one below — the classic Jenga
weave. Between every move the arm lifts to a travel-height clearance so it never plows
through the stack it just built.
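The layer math can be sketched as a function from "which block is this" to a place pose. The 25 mm and 15 mm come from the text above. The base position, which axis counts as "side by side", and the function name are assumptions for illustration.

```python
BLOCK_WIDTH_MM = 25.0
BLOCK_HEIGHT_MM = 15.0

def place_pose(block_index, base_xyz_mm=(300.0, 0.0, 0.0), base_yaw_deg=0.0):
    """Place pose (x, y, z, yaw) for the n-th block, counting from 0."""
    layer, slot = divmod(block_index, 2)          # two blocks per layer
    side = -0.5 if slot == 0 else 0.5             # centers sit one block width apart
    offset = side * BLOCK_WIDTH_MM
    bx, by, bz = base_xyz_mm
    z = bz + layer * BLOCK_HEIGHT_MM              # each layer is one block height up
    if layer % 2 == 0:                            # even layers: neighbors spread along Y
        return (bx, by + offset, z, base_yaw_deg)
    return (bx + offset, by, z, base_yaw_deg + 90.0)   # odd layers: turned 90 degrees

for i in range(4):
    print(i, place_pose(i))
# 0 (300.0, -12.5, 0.0, 0.0)
# 1 (300.0, 12.5, 0.0, 0.0)
# 2 (287.5, 0.0, 15.0, 90.0)
# 3 (312.5, 0.0, 15.0, 90.0)
```

Keeping this a pure function of the block index means the stack can be resumed after an interruption: knowing how many blocks are already placed is enough to know where the next one goes.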

The block dimensions, the place position, the safe box, and the gripper open/close values
all live in config2.json, so re-tuning the tower never touches the code.

Reading the Debug Images

A robot that grabs the wrong block is useless if I can’t see why. So every time I press
g, the program dumps four images into grab_debug/ — plus a numbered copy each cycle
(detection_001.png, detection_002.png, …) so the whole stacking run replays frame by
frame afterward. These four are the entire debugging toolkit:

  • color.png — the raw camera frame. What the camera actually saw: lighting, glare,
    and every block on the table.
  • depth_raw.png — the 16-bit depth frame. Whether there is real depth where the
    block is. Holes here mean trouble.
  • mask.png — the binary segmentation. Did the pipeline find clean block shapes, or
    glue two touching blocks into one fat blob?
  • detection.png — the final labeled pick. Which block won, and its score, depth, and
    yaw.

Debugging is just reading them in order. Robot missed entirely? The mask shows
whether two blocks fused (a watershed problem) or a block vanished (a threshold problem).
Picked a strange block? The detection label shows the winning score — usually something
non-block scored too high. Lunged to the wrong height? The depth_raw had a hole right at
the center, so the median depth fell back to something silly. (You already met this set as
the color → mask → detection row up in "The pipeline on real blocks".)

The depth frame, made visible

depth_raw.png looks pure black to the eye, because its values are tiny raw units — in
this frame, 0 to 418 out of a 16-bit range. Colorize it and the scene appears: the blocks
sit closer to the wrist camera (blue) than the table (red), which is exactly the signal
the 3D back-projection needs. The black freckles are depth dropouts the median has to
survive.
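The "pure black" effect and the fix are both just arithmetic, so here is a numpy-only sketch on a synthetic frame with the same 0 to 418 range. The frame is random data, not the real depth_raw.png, and the stretch shown is one reasonable choice, not necessarily the one used for the figure.

```python
import numpy as np

# Synthetic stand-in for depth_raw.png: 16-bit values between 0 and 418
rng = np.random.default_rng(0)
depth_raw = rng.integers(250, 419, size=(480, 640)).astype(np.uint16)
depth_raw[rng.random(depth_raw.shape) < 0.02] = 0       # sprinkle in some dropouts

# Viewed as 16-bit, even the largest value is well under 1% of full brightness
print(round(float(depth_raw.max()) / 65535 * 100, 2), "% of full scale")   # about 0.64

# Stretch the valid range to 0-255 so the structure becomes visible
valid = depth_raw > 0
near, far = depth_raw[valid].min(), depth_raw[valid].max()
scaled = np.zeros(depth_raw.shape, dtype=np.uint8)
scaled[valid] = np.round(
    (depth_raw[valid].astype(np.float64) - near) / (far - near) * 255
).astype(np.uint8)

print(scaled[valid].min(), scaled[valid].max())   # 0 255
```

From here an 8-bit image like `scaled` can be passed to a colormap such as lv.applyColorMap(scaled, lv.COLORMAP_JET), which maps low values to blue and high values to red. Dropouts stay at 0 in this sketch, so painting them black needs one extra masking step after the colormap.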

Colorized depth frame: blocks appear blue against a red table, with black dropout speckles
The same depth frame as a heatmap. Blue = near, red = far — the blocks clearly stand proud of the table. Black specks are missing depth the median quietly steps over.

One run, frame by frame

The numbered detection_* images are the receipts for a full stacking session. Early on
the table is crowded; the detector keeps locking onto the highest-scoring block, the arm
clears it, and a couple dozen grabs later the table is nearly empty:

Debug frame early in the run, many blocks on the table Debug frame mid-run, table half cleared Debug frame late in the run, table nearly empty
Frames 1, 22, and 44 of one run. One clean block gets chosen at a time while the stack (off-frame) grows. When `detection.png` stops finding anything, the table is clear and the tower is done.

Live Preview and Short Demo

Canva preview. It stays 16:9 because this one is a slide-like canvas, not a phone video pretending to be a rectangle.

YouTube Short. Strict 9:16. Vertical video deserves to remain vertical.

Skills I Picked Up

A quick honest inventory of what this little pig actually learned, start to finish:

  • OpenCV — color spaces (BGR/HSV/LAB), CLAHE shadow normalization, Otsu thresholds,
    Canny + morphology, watershed to split touching blocks, contours, moments,
    minAreaRect, and a weighted scoring function to choose the real block.
  • Depth camera — aligning depth to color, reading intrinsics and depth scale, robust
    median depth, and back-projecting a pixel into a real 3D point.
  • Calibration — the eye-in-hand transform chain, and why a wrist-mounted camera can’t
    use one fixed matrix.
  • Robot motion — talking to an xArm, gripper control, safe-range checks, a 9-step
    pick-and-place, and the stacking math that turns single grabs into a tower.

The pig is no longer scared of images. The pig now has a robot arm. This was probably a
mistake.