I Am Bloated

After finishing the Jenga stacking project from Cv1, I became bloated.

Not physically. Mentally. The dangerous kind.

Cv1 semi-automated Jenga stacking with OpenCV: it returned the best block candidate and waited for human authorization before grabbing.

The robot arm stacked a tower. The Intel RealSense D435 saw the blocks. OpenCV found rotated boxes, center points, angles, and a “best” block. The xArm moved. Nothing exploded.

And just like that, my brain decided:

Ah yes. Time to do AI robotics.

That is the bloating.

Cv1 was real progress, but it was also a narrow success. I had a working pipeline, not a complete robotics system:

A camera image became contours.
Contours became block candidates.
One candidate became a grasp target.
A human still judged whether the move was safe.
The arm executed a mostly hand-designed behavior.

That is good engineering practice for a first project. It is not the same thing as “the robot understands assembly.”
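
The last two steps of that pipeline, picking one candidate and waiting for a human to approve it, can be sketched in a few lines. This is a minimal sketch, not the Cv1 code: the BlockCandidate fields, the score value, and the authorize callback are all hypothetical names, and "best" here just means "highest score" under whatever scoring rule the detector uses.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class BlockCandidate:
    """One detected block, as a rotated box in image coordinates."""
    center_px: Tuple[float, float]  # (u, v) center of the rotated box
    angle_deg: float                # in-plane rotation of the box
    score: float                    # hypothetical "how graspable" score


def choose_grasp(
    candidates: Sequence[BlockCandidate],
    authorize: Callable[[BlockCandidate], bool],
) -> Optional[BlockCandidate]:
    """Return the highest-scoring candidate, but only if the human approves it."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c.score)
    # The human gate: no approval means no grasp target at all.
    return best if authorize(best) else None
```

In a real run the authorize callback would be a keypress or button prompt. Keeping it as a plain function makes the gate easy to test without a robot attached.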

Why I Got Bloated

The bloating came from confusing a working demo with a research direction.

In Cv1, I learned enough OpenCV, numpy, RealSense depth, and robot control to make something physical happen. That creates a dangerous feeling: if pixels can become a grasp, maybe language can become construction, and maybe a robot can become an intelligent builder.

Then I opened the ML papers I had collected.

The PDFs made the idea feel bigger:

ASAP talks about physical feasibility, assembly order, stability, collision, grasping, and executable robot plans.
LegoBot shows that even LEGO assembly needs planning rules, exclusion zones, and coordination constraints.
“No, to the Right” makes language correction feel practical instead of decorative.
KitchenVLA shows iterative vision-language correction during real task execution.
Speech to Reality connects natural language, 3D generation, discrete parts, fabrication limits, and robotic assembly.
SayCan reminds me that language only matters when it is grounded in actions the robot can actually perform.
AprilTag 2 is less glamorous, but it points toward controlled pose estimation and repeatable experiments.

So my brain did the bad compression:

OpenCV worked + robot moved + these papers exist = I should build an AI assembly system.

That equation is emotionally exciting and technically irresponsible.

The more honest equation is:

OpenCV worked + robot moved = I now have a small physical testbed where I can ask one narrow robotics question carefully.

That is still useful. It is just less inflated.

Questions I Need to Answer First

Before I can call this a research plan, I need to sit with some uncomfortable questions:

What part of the system am I actually testing: perception, planning, control, language, or human feedback?
How does a traditional robotics approach differ from a neural-network or LLM-based approach in this exact setup?
Where should the human stay in the loop, and where should the robot act on its own?
How do I quantify success instead of just saying “it worked”?
Why should I discretize and simplify the world before trying to solve a bigger version?
Who would benefit from this project, and why does it matter?
What would make the AI layer necessary rather than decorative?

The last question is the dangerous one. If a lookup table can solve the task, then calling an LLM is probably theater.

The Panic Question

The question that really bothered me was:

Am I building a research project, or am I just copying what smarter people already did?

If I say, “I will make a robot understand language and build structures,” that is not a project. That is me badly summarizing papers I barely understand.

But if I say:

I want to test whether short verbal corrections help a small robot arm assemble unstable rectangular blocks more reliably than one-shot instructions.

Then maybe I have a real project.

The phrasing matters because the scope matters. A vague ambition hides missing knowledge. A narrow question exposes exactly what must be built, measured, and learned.

What the Papers Actually Gave Me

I went back through the online versions of the papers.

Here is what each one should do to my project: not make it bigger, but make it more disciplined.

ASAP

ASAP is about planning physically feasible assembly sequences. It does not just ask, “What order should the parts go in?” It also asks whether the assembly is stable under gravity, whether the path is collision-free, whether a gripper can actually hold the part, and whether a robot can execute the plan.

That is the part I skipped in my first Jenga project.

I could detect blocks. I could move the arm. But I never truly reasoned about physical feasibility. I mostly trusted that if a block looked placeable, gravity and physics would be kind.

The universe does not orient itself around my contour detector.

LegoBot

LegoBot reminds me that simple parts do not automatically mean a simple planning problem.

LEGO introduces exclusion zones, collision constraints, statics problems, and overhang issues. Jenga has its own version of this. The blocks are rectangular prisms, not points. Orientation matters. Friction matters. Slight pose errors matter. A structure can look visually correct and still be physically doomed.

So if I use Jenga blocks, I should not pretend I am doing general robotic assembly. I am studying a small but annoying domain where instability is easy to observe.

That is a feature, not a weakness.
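
To make "visually correct and still physically doomed" concrete, here is a rough one-dimensional balance check. It is a sketch under strong assumptions that are mine, not the papers': one block per level, identical blocks of equal mass, rigid contact, no friction effects, and offsets along a single axis. The function name and the block length used in the usage note are illustrative.

```python
from typing import Sequence


def stack_is_stable(centers: Sequence[float], length: float) -> bool:
    """Check a single-column stack, listed bottom to top, for static balance.

    For every block k above the bottom one, the combined center of mass of
    block k and everything above it must lie over the contact patch between
    block k and block k - 1.
    """
    half = length / 2.0
    for k in range(1, len(centers)):
        above = centers[k:]
        com = sum(above) / len(above)  # equal masses assumed
        # Contact patch: overlap of the footprints of block k and block k - 1.
        left = max(centers[k], centers[k - 1]) - half
        right = min(centers[k], centers[k - 1]) + half
        if left > right or not (left <= com <= right):
            return False
    return True
```

With an illustrative block length of 7.5, stack_is_stable([0.0, 3.0], 7.5) passes, but stack_is_stable([0.0, 3.0, 6.0], 7.5) fails. Each block overhangs the one below by less than half its length, yet the combined center of mass of the top two blocks sits past the edge of the bottom one. That is exactly the kind of failure a contour detector cannot see.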

No, to the Right

This paper hit closest to my idea.

The key insight is not simply “robot follows language.” It is that one instruction is often not enough. A human can watch the robot and correct it during execution:

No, move right.

Tilt down a little.

Closer to that block.

That looks a lot like what already happened in my project. I watched the arm do something questionable, then stepped in before it did something worse.

So maybe the interesting part is not full autonomy. Maybe it is the feedback loop.

KitchenVLA

KitchenVLA also focuses on iterative correction, but in a different setting. It tries to translate human instructional videos into robot-executable actions, then uses visual feedback and language-model replanning to fix mismatches along the way.

The useful takeaway for me: a plan generated from language or video can sound perfectly reasonable while still being wrong for the robot’s actual environment.

That is exactly my problem. A sentence like “stack the blocks into a tower” hides a mountain of detail:

Which block goes first?
What orientation should it have?
Is the target pose reachable?
Will the tower survive the next placement?
What should the robot do when the camera estimate is slightly wrong?

The physical world is the tax collector.

Speech to Reality

Speech to Reality is probably the most dangerous paper for my ego, because it looks magical: speech becomes a physical object through generative AI and robotic assembly.

But the important part is not the magic. It is the constraint pipeline. The system has to modify and discretize generated geometry so it fits within robot workspace limits, inventory availability, overhang tolerances, connectivity rules, and assembly constraints.

If I use language or AI to generate a Jenga structure, I cannot accept the output at face value. I need a constraint checker that grades the plan before the robot moves.
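
A first version of that constraint checker could be very small. The sketch below assumes a hypothetical discretized world where each block occupies one grid cell (column, level). It only checks inventory, workspace bounds, double occupancy, and whether something is already underneath. It is not the Speech to Reality pipeline, just the same idea shrunk to Jenga scale.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Placement:
    """Target cell for one block on a hypothetical discrete grid."""
    col: int
    level: int  # 0 means resting on the table


def check_plan(
    plan: Sequence[Placement],
    blocks_available: int,
    max_col: int,
    max_level: int,
) -> List[str]:
    """Return a list of violations; an empty list means the plan may run."""
    violations: List[str] = []
    if len(plan) > blocks_available:
        violations.append(
            f"plan needs {len(plan)} blocks, only {blocks_available} available"
        )
    placed = set()
    for i, p in enumerate(plan):
        if not (0 <= p.col <= max_col and 0 <= p.level <= max_level):
            violations.append(f"step {i}: outside workspace")
        if (p.col, p.level) in placed:
            violations.append(f"step {i}: cell already occupied")
        # Order matters: support must exist before this block is placed.
        if p.level > 0 and (p.col, p.level - 1) not in placed:
            violations.append(f"step {i}: nothing underneath to support it")
        placed.add((p.col, p.level))
    return violations
```

Because the support check walks the plan in order, the same function also catches a valid structure with an invalid assembly sequence, such as placing the second level before the first.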

Eventually, I could try a small search or Monte Carlo-style sampler over block states. But that only makes sense after I define the state, action, constraints, and success metric. Otherwise I am asking a robot to faithfully execute a hallucination.

SayCan and AprilTags

SayCan is a reminder that language models need grounding. A model can describe an action, but the robot should only choose actions it can actually perform.
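
The core of that idea fits in one function: score each candidate action by how relevant the language model thinks it is, multiply by how likely the robot is to succeed at it right now, and take the maximum. The sketch below is only the scoring rule, with made-up numbers. The action names and both score dictionaries are hypothetical, and a real system would get them from a language model and a learned or hand-built affordance estimate.

```python
from typing import Dict


def pick_grounded_action(
    language_score: Dict[str, float],
    affordance: Dict[str, float],
) -> str:
    """Pick the action that maximizes language score times affordance."""
    combined = {
        # An action with no affordance estimate is treated as impossible.
        action: score * affordance.get(action, 0.0)
        for action, score in language_score.items()
    }
    return max(combined, key=combined.get)


# Made-up example: the instruction favors "place", but the gripper is empty.
language_score = {"place block on top": 0.6, "pick up block": 0.3, "re-detect": 0.1}
affordance = {"place block on top": 0.05, "pick up block": 0.9, "re-detect": 0.8}
chosen = pick_grounded_action(language_score, affordance)
```

Here the language side prefers placing, but the low affordance for placing with an empty gripper pushes the choice to "pick up block".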

AprilTags are far less glamorous, but probably more useful to me right now. I achieved the first Jenga stack with vision alone, but a more controlled experiment needs repeatable pose references. AprilTags could help me create known target poses, validate camera calibration, and spend less time arguing with shadows.

What Could Be Mine?

My project can be smaller:

In a small Jenga-block assembly setup using a UFactory arm and RealSense camera, does allowing a human to give short verbal corrections during execution improve assembly success compared with a single initial instruction?

That is still inspired by the papers. But it is not the same project.

The domain is different:

I am using Jenga-like rectangular blocks, not LEGO DUPLO, kitchen objects, or arbitrary CAD assemblies.
My hardware is a small educational/research setup, not a large lab system.
My focus is not fully autonomous intelligence. It is comparing interaction styles.
My main measurement is whether corrections help the robot recover from realistic mistakes.

That feels more honest.

Revised Research Question

My current research question should probably be:

How does an iterative verbal correction loop affect the success rate, correction time, placement error, and user confidence of a robotic Jenga-block assembly task compared with a one-shot instruction interface?

This is a lot less bloated than “AI robot plays Jenga like AlphaGo.”

Thank God.

Possible Experiment

I can compare two modes.

Mode A: One-Shot Instruction

The user gives one command at the start:

Build a three-level tower.

The system then follows a planned sequence. If the pose estimate is slightly off, if a block is tilted, or if the next placement becomes risky, the robot still has to push forward with the original plan unless the safety stop is triggered.

This mode tests the fantasy that one instruction is enough.

Mode B: Iterative Correction

The user can interrupt or correct the robot during execution:

Move right.

Rotate clockwise.

Place it slower.

Stop. Re-detect the block.

The system does not need to be genius-level at first. It can begin with a small command vocabulary mapped to simple robot adjustments. Later, the language layer can become more flexible.

This mode tests whether human feedback makes the system more robust.
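
The first command vocabulary can literally be a lookup table. The sketch below maps a handful of phrases to small adjustments. Every number and sign convention in it is an assumption of mine (5 mm steps, 5 degree turns, positive x meaning "right", negative yaw meaning "clockwise"), and the Adjustment fields are hypothetical rather than part of any robot SDK.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Adjustment:
    """A small change applied to the current motion (hypothetical fields)."""
    dx_mm: float = 0.0        # lateral nudge; positive = right (assumed)
    dyaw_deg: float = 0.0     # wrist rotation; negative = clockwise (assumed)
    speed_scale: float = 1.0  # multiplier on the planned speed
    stop: bool = False
    redetect: bool = False


VOCABULARY: Dict[str, Adjustment] = {
    "move right": Adjustment(dx_mm=5.0),
    "move left": Adjustment(dx_mm=-5.0),
    "rotate clockwise": Adjustment(dyaw_deg=-5.0),
    "place it slower": Adjustment(speed_scale=0.5),
    "stop": Adjustment(stop=True),
    "stop. re-detect the block": Adjustment(stop=True, redetect=True),
}


def parse_correction(utterance: str) -> Optional[Adjustment]:
    """Map a transcribed phrase to an adjustment, or None if unknown."""
    key = utterance.strip().lower().rstrip(".")
    return VOCABULARY.get(key)
```

Returning None for anything outside the table is deliberate: an unrecognized phrase should make the robot do nothing, not guess. If this table turns out to be enough for the experiment, that is useful evidence about whether a language model is needed at all.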

Measurements

To keep this from sounding like pure fluff, I need measurable outcomes:

Assembly success rate: did the target structure stay standing for at least a few seconds?
Task completion time: how long did it take from first command to final placement?
Correction count: how many human corrections were needed?
Placement error: how far was the final block pose from the target pose?
Recovery rate: when the robot made a visible mistake, did correction mode actually recover it?
User confidence: did the user feel the robot was understandable and controllable?

Now the project starts sounding like an experiment instead of a dream.
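
Most of these measurements reduce to simple arithmetic over a trial log. Here is a minimal sketch of that bookkeeping. The Trial fields are hypothetical names for what I would log, user confidence is left out because it comes from a questionnaire rather than the robot, and the angle error treats a rectangular block as unchanged by a 180-degree turn, which is my assumption about what counts as the "same" pose.

```python
import math
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Sequence


@dataclass(frozen=True)
class Trial:
    """One logged assembly attempt (all field names are hypothetical)."""
    success: bool        # structure stayed standing for the agreed time
    duration_s: float    # first command to final placement
    corrections: int     # human corrections issued
    pos_error_mm: float  # final block position versus target
    mistakes: int        # visible robot mistakes during the trial
    recovered: int       # mistakes that were fixed after a correction


def angle_error_deg(measured: float, target: float) -> float:
    """Smallest yaw difference, with 180-degree block symmetry assumed."""
    diff = (measured - target) % 180.0
    return min(diff, 180.0 - diff)


def summarize(trials: Sequence[Trial]) -> Dict[str, float]:
    """Aggregate one mode's trials; expects at least one trial."""
    mistakes = sum(t.mistakes for t in trials)
    recovered = sum(t.recovered for t in trials)
    return {
        "success_rate": mean(1.0 if t.success else 0.0 for t in trials),
        "mean_duration_s": mean(t.duration_s for t in trials),
        "mean_corrections": mean(t.corrections for t in trials),
        "mean_pos_error_mm": mean(t.pos_error_mm for t in trials),
        # Undefined when no mistakes were observed at all.
        "recovery_rate": recovered / mistakes if mistakes else math.nan,
    }
```

Running summarize once per mode gives two dictionaries that can be compared side by side, which is the whole point of the experiment.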

What I Need to Learn Before I Start Pretending

This is the part my bloated brain does not want to hear.

Before making any real claims, I need to learn:

Linear algebra for transformations, rotations, and camera coordinates.
Basic machine learning, not just “AI” as a decorative word.
Robot kinematics and inverse kinematics.
Camera calibration and hand-eye calibration.
Pose estimation, probably with AprilTags before trying anything fancy.
Basic physics simulation, maybe PyBullet, to test whether simple structures are stable.
Experimental design, because a project without measurements is just a demo.

The painful truth is that the boring pieces are the project.
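
One of those boring pieces, written out: turning a point seen by the camera into a point in the robot's frame is a single homogeneous transform. The sketch below uses a simplified transform with only a yaw rotation and a translation so the arithmetic stays readable. A real hand-eye calibration produces a full 3D rotation, and the numbers in the usage note are made up.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def make_transform(yaw_deg: float, translation: Sequence[float]) -> Matrix:
    """Build a 4x4 homogeneous transform: rotate about z, then translate."""
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    tx, ty, tz = translation
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def apply_transform(
    T: Matrix, point: Sequence[float]
) -> Tuple[float, float, float]:
    """Map a 3D point through T using homogeneous coordinates."""
    x, y, z = point
    p = (x, y, z, 1.0)
    out = [sum(T[r][k] * p[k] for k in range(4)) for r in range(3)]
    return (out[0], out[1], out[2])
```

For example, with T = make_transform(90.0, (100.0, 0.0, 50.0)), the camera-frame point (10, 0, 0) lands at roughly (100, 10, 50) in the robot frame: the rotation swings the x offset onto the y axis, then the translation shifts the origin.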

How I Should Develop

The development path should be a ladder. Each step has to produce evidence before I climb to the next one.

1. Perception baseline: detect block pose from the RealSense view and log the result. Exit condition: repeat the same scene several times and measure pose variance.
2. Calibration baseline: convert image/depth measurements into robot-frame coordinates. Exit condition: the arm can point to known test locations with acceptable error.
3. Single-block manipulation: pick one block and place it at one fixed target pose. Exit condition: repeated trials work without changing code between attempts.
4. Three-block fixed plan: build one tiny structure from a predefined sequence. Exit condition: success rate, timing, and failure modes are logged.
5. Manual correction vocabulary: add commands like left, right, rotate, lower, stop, and re-detect. Exit condition: each command maps to a predictable robot adjustment.
6. One-shot versus correction experiment: run the same structures in both modes. Exit condition: compare success rate, completion time, correction count, placement error, and user confidence.
7. Only then add flexible language or learned planning: use an LLM or learning method only if the fixed vocabulary becomes the bottleneck.

This keeps the project honest. If step 2 is weak, step 7 will not save it. It will only hide the weakness behind nicer words.
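
The exit condition for the perception baseline is the easiest one to turn into code. A rough sketch: detect the same static scene several times, then look at the spread of the reported pose. The thresholds and names below are placeholders I made up, and the yaw spread assumes the detections do not straddle an angle wrap-around boundary.

```python
from statistics import stdev
from typing import Dict, Sequence, Tuple

Pose = Tuple[float, float, float]  # (x_mm, y_mm, yaw_deg), assumed layout


def pose_spread(samples: Sequence[Pose]) -> Dict[str, float]:
    """Sample standard deviation of repeated detections (needs 2+ samples)."""
    xs = [s[0] for s in samples]
    ys = [s[1] for s in samples]
    yaws = [s[2] for s in samples]  # assumes no wrap-around between samples
    return {
        "x_std_mm": stdev(xs),
        "y_std_mm": stdev(ys),
        "yaw_std_deg": stdev(yaws),
    }


def passes_exit_condition(
    spread: Dict[str, float],
    max_pos_std_mm: float,
    max_yaw_std_deg: float,
) -> bool:
    """True if the detector is repeatable enough to move to the next step."""
    return (
        spread["x_std_mm"] <= max_pos_std_mm
        and spread["y_std_mm"] <= max_pos_std_mm
        and spread["yaw_std_deg"] <= max_yaw_std_deg
    )
```

Writing the threshold down before running the trials matters more than the code itself. Otherwise "acceptable" quietly becomes whatever the detector happened to produce.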

First Non-Bloated Plan

I should not start with “build an intelligent assembly agent.”

I should start with this:

Detect block poses reliably.
Move one block to a target pose reliably.
Build a simple three-block structure from a fixed plan.
Add manual correction commands.
Compare one-shot mode against correction mode on the same structures.
Only then think about language models or learned planning.

It is not as glamorous as “speech to reality.”

But it is real.

Current Conclusion

The new mission is:

Build a small robotic assembly system where human verbal corrections are part of the control loop, then measure whether that actually helps.

That is still ambitious.

But at least now the pig is only moderately bloated.