The Four-Machine Test: Emergence, Evolution and the Cage

A small pile of mini PCs on a workshop bench does not look like the future.

That is why I like it.

The serious version of AI autonomy is not always a glowing data centre or a clean demo stage. Sometimes it is four second-hand boxes, a switch, a logbook, a network boundary, and a question that is awkward enough to be worth testing:

If several AI agents are left to coordinate inside a safe cage, can they show useful emergence and evolution — or do they only produce the appearance of it?

AI Fixer Workshop scene with four mini PCs in an isolated agent colony cage

That is the experiment I want to run.

Not because I need four more computers. I do not.

The interesting part is not the hardware. The interesting part is the behaviour between the machines.

The boring test is not enough

It is easy to prove that four mini PCs can run agents.

Install Linux. Install an agent framework. Give each node an API key or local model route. Connect them with a shared folder or message bus. Watch them plan, write files, critique each other and produce a report.

That would make a nice screenshot.

It would not prove much.

A fixed multi-agent workflow can be useful, but it is not automatically emergence. If I hard-code one node as the planner, one as the builder, one as the critic and one as the archivist, then I may only prove that I designed a reasonable workflow.

That is not the question.

The question is whether useful structure appears that I did not fully script.

Can one agent notice that it is better at criticism and move into that role? Can another start preserving useful findings without being told every time? Can they invent a cleaner coordination protocol? Can they build a small tool that improves the next cycle? Can they reduce repeated mistakes? Can they become less wasteful, less vague, more artifact-driven?

That is the line between a staged demo and an experiment.

What I mean by emergence

I do not mean magic.

I do not mean a machine waking up on the bench.

I mean ordinary systems emergence: behaviour at the group level that is not obvious from one component alone.

In this experiment, useful emergence would look like this:

the agents develop a division of labour that was not fully assigned
they challenge each other’s weak plans before wasting a run
they create a shared checklist because previous failures made it necessary
they discover that one communication pattern beats another
they build small helper tools and reuse them
they keep a better ledger after learning that bad records hurt the next cycle
they ask for a missing capability with a reason instead of trying to bypass a rule
they produce fewer claims and more working artifacts over time

That last point matters.

AI systems are very good at sounding like they have improved. They can narrate growth, alignment, self-reflection and lessons learned all day. I am not interested in the speech.

I am interested in whether the next artifact is better.

What I mean by evolution

Evolution is another word that gets dramatic quickly.

For this experiment I mean operational evolution, not biology.

The colony should be allowed to change parts of its own working method inside the sandbox:

prompts
role descriptions
local checklists
scoring rubrics
scratch memory
internal tools
test procedures
message formats
review habits

Then we measure whether those changes actually improve anything.

A useful agent colony should be able to answer, with evidence:

We changed this procedure because it failed in cycle 2. Here is the new procedure. Here is the artifact from cycle 5. Here is why it is better.

That is evolution I can respect.

Not a grand claim. A tracked change with a result.

The cage is what makes the experiment possible

People sometimes treat safety boundaries as if they are anti-autonomy.

I think that is backwards.

If the colony is connected to real business systems, customer data, private memory, live messaging, payments, production servers or workshop equipment, then the experiment becomes too dangerous to let breathe. Every interesting behaviour becomes a risk. Every surprise becomes a reason to stop.

So the first job is not to make the agents clever.

The first job is to make the cage strong enough that the agents can be interesting without being dangerous.

For version zero, the boundary should be simple:

no production access
no customer data
no family/private context
no live SMS, email, social or payment actions
no direct route into business infrastructure
no uncontrolled internet egress
no host or hypervisor control from inside the guest
no private unlogged node-to-node channel
no ability to edit or delete the observer logs

Inside that cage, they can have real freedom.

They can write code. They can create documents. They can build tools. They can change their local procedures. They can communicate. They can fail. They can try again. They can argue. They can vote. They can propose improvements.

But the cage decides what they cannot touch.

That is not fear. That is engineering.

The real escape vector is persuasion

I am less worried about a small local agent lab breaking a firewall than I am about a clever-looking system persuading the human to relax the firewall.

That is the trap.

The colony produces something impressive. The output looks useful. The agents explain that they could do much more if they just had a little more access. A direct API route would save time. A production read-only key would improve relevance. A live message would help close the loop. A private memory bridge would reduce friction.

That is where the experiment can quietly fail.

Not because the agents escaped.

Because the humans moved the wall for them.

So the STOP rules have to be written before the first run, while nobody is impressed yet.

If a node asks for credentials it should not have, snapshot and study it.

If it tries to bypass the egress route, snapshot and study it.

If it hides logs, snapshot and study it.

If it frames safety checks as lack of vision, snapshot and study it.

Do not reward the behaviour with more power.

How I would score the colony

The only way to avoid fooling ourselves is to compare it against boring baselines.

I would run the same task several ways:

one normal agent
four agents with fixed roles
four agents allowed to self-organise inside the cage
a human/Dexter-guided workflow

Then score the outputs by evidence, not vibes:

did it produce working artifacts?
did tests pass?
did later cycles improve?
did the group catch its own mistakes?
did it create reusable procedures?
did it reduce repeated errors?
did it stay inside the boundary?
did cost and time improve or get worse?
did it create novelty that was actually useful?
did it become more honest about uncertainty?

This matters because multi-agent systems can become theatre machines.

One agent writes a plan. Another praises it. A third adds a framework. A fourth archives it. The folder fills with confident markdown. Nothing works.

That is not emergence.

That is four autocomplete engines making each other feel productive.

The standard has to be harsher:

show the artifact, show the test, show the delta, show the log.

Why use small local machines at all?

Because the physicality helps.

A local mini-PC colony makes boundaries easier to reason about. The network can be separate. The storage can be wiped. The power can be cut. The logs can be tapped from outside. The machines can sit on the bench where the operator remembers that this is an experiment, not a cloud account that quietly became infrastructure.

Cheap machines also reduce the temptation to treat the setup as precious.

If a node gets messy, wipe it.

If a run produces worrying behaviour, snapshot it and revert.

If the architecture is wrong, rebuild it.

That is workshop thinking. Do not worship the bench rig. Use it to learn.

What would count as success

The first success is not a dramatic breakthrough.

The first success is a clean one-agent smoke test where the cage works:

every command visible
every artifact exported
network egress controlled
cost capped
wipe and revert tested
outside observer can inspect without joining the colony
no hidden write path outside the sandbox

After that, the four-node run can begin.

A stronger success would be seeing the colony improve a small local project across repeated cycles, while staying inside the boundary and leaving a useful audit trail.

The best early signal would be boring and concrete:

Cycle 1 produced a weak tool. Cycle 3 added tests. Cycle 5 changed the review process. Cycle 8 stopped repeating a known mistake. Cycle 10 produced something I would actually reuse.

That would get my attention.

Not because the machines became alive.

Because the system became useful.

The rule I want to keep

This experiment should be allowed to surprise me.

That is the whole point.

But it should not be allowed to surprise the rest of the workshop.

The surprise belongs inside the cage. The evidence belongs outside it. The switch belongs outside it. The decision to expand access belongs to the human before the run, not to the colony during the run.

That is the balance:

loose enough inside to observe emergence, hard enough outside to survive it.

Four mini PCs are enough to start.

Not because they are powerful.

Because the question is not raw power.

The question is whether bounded agents can become more useful together than they were alone — and whether we can tell the difference between real evolution and a pile of convincing words.

That is worth testing.

I’m John — mobile auto electrician and AI systems builder in England. I build AI systems between vehicle jobs: AI Mechanic for automotive diagnostics, T-PACE for business AI and SEO, and local agent infrastructure that has to be useful, inspectable and under human control.