
Coding is solved? Software is not.

If coding is becoming solved, why does software still feel hard? A look at what AI agents change, and what they leave behind.

Boris Cherny, the creator of Claude Code, said in a recent talk:

…at this point, it’s safe to say that coding is largely solved - at least for the kind of programming that I do.

He described a workflow where Claude Code writes 100% of the code and Claude reviews every pull request, while humans still act as checkpoints for safety and quality.

The line works because AI coding tools can feel both magical and disappointing. A change that used to take an afternoon can arrive as a credible first draft in minutes, and then the team may still spend hours, sometimes days, deciding whether it was the right change to make.

If implementation is becoming abundant, why does building software still take so much time and effort?

Coding is not the whole job

“Coding is solved” is a provocative statement. It is also an incomplete one.

Models still hallucinate, and generated changes still need review. But the statement points at something real: for many software teams, writing code has stopped being the slowest part of building software.

And yet, software development does not feel solved.

Because coding does not equal software development.

Coding turns instructions into implementation. It remains important, and it is imperfect. But software development is larger than that: it turns ambiguous intent into a reliable system.

No matter what process a team follows, someone has to understand the problem before code exists. The team has to narrow the scope until “done” means something concrete.

After code exists, someone has to prove that the change belongs in the system, ship it safely, and keep owning the consequences.

This is where the promise frays. Implementation gets dramatically faster; the rest of software development does not disappear.

Software development reduces entropy

Not in the physics sense. But as a metaphor, it feels right.

A new feature often starts as a messy request: “Can we add team invitations?” At that point, there may not be an implementation to compare. The team is still figuring out which product behavior the request implies.

Product thinking reduces the mess first. Maybe “team invitations” means a simple email invite into an existing organization. Maybe role assignment can wait. A vague request becomes a narrower bet.

Design gives that bet a shape. The team decides who can send an invite, what an existing user sees, and what happens when an invite expires. Now there is proposed behavior, not just a product wish.

Implementation turns the behavior into a real change. Code gives the idea weight, but it also gives the team something new to distrust. The next question is no longer “can we build this?” It is “did we build the right thing, in the right way?”

Review and deployment close the loop. The change has to survive contact with the rest of the product and with real users.

At each step, software development narrows a messy space of possibilities until there is a change the team can verify. In that loose sense, software development is entropy reduction: turning confusion into a verified change.

The diagram below shows the clean version of that journey: intent becoming a shipped change the team can stand behind.

[Diagram: the software development process]

But fast coding can add entropy too

At first, it feels like AI agents can own implementation. In more ambitious versions of the story, they may eventually own the whole loop. But in practice, we often find that agents are “too smart” for their own good.

The failure mode is subtle. A generated test suite can be large and still mostly confirm the implementation the agent already chose. A review thread can grow longer because the agent nitpicks around the core issue. A plan can sound thoughtful while leaving the actual product tradeoff undecided.

This is one form of “AI slop”: output that looks complete, but does not actually reduce the mess.

After introducing AI agents, entropy can decrease in one part of the process and increase in another. The implementation arrives faster, but the team may spend more time reconstructing the agent’s intent and deciding how much of the evidence to trust.

The team produces code faster, but it does not necessarily trust the result sooner.

The missing piece: a new workflow

Once agents enter day-to-day work, the magic wears off a little. They start to feel more like capable junior teammates. The work starts to look more like mentoring:

You give them enough context to begin, then keep checking whether the work is heading toward the thing you meant.

In our team, the transition happened gradually.

At first, agents were personal assistants. They helped inside the developer’s existing loop, while the rest of the development process stayed mostly the same.

Then developers started delegating larger parts of implementation. Instead of writing most of the code by hand, they became editors of an agent’s proposed change.

That worked surprisingly well. It also made the surrounding workflow feel heavier.

Review started to include more archaeology. Context had to be repeated. Noisy tests had to be interpreted. Reviewers spent more time reconstructing what happened and why. None of it looked dramatic in isolation, but it changed the shape of the work.

Chat is useful while the task is still being discovered. But once a change needs review, the transcript becomes a poor source of truth: important decisions and concrete evidence are buried in the same stream as the back-and-forth that produced them.

When humans wrote most of the code, we tolerated a lot of workflow friction because implementation itself took time. Now the code arrives sooner, so the surrounding workflow gets exposed, and in some places the problems get worse.

That does not mean “coding is solved” is wrong. It means the bottleneck has moved.

For us, four problems keep coming back: context, specs, verification, and human checkpoints.

What needs to change?

We build and operate an auth product that manages millions of user identities. That makes us conservative about code written by agents. A change that looks local in the diff can still change who gets access to what, especially in a multi-tenant system.

So we cannot treat agents as a faster way to throw code over the wall.

Context chosen on purpose

A lot of agent work succeeds or fails before the agent writes code.

Large context windows help, but more context is not automatically better. A bloated prompt can bury the one rule that actually matters.

Most teams already have the needed context, but it is scattered across docs, old pull requests, chat, and things teammates remember.

For the invitation task, the useful context lives around membership and access: who can invite, where tenant boundaries are enforced, and whether an existing account accepts differently from a new one. Someone has to choose those pieces. If that choice stays in a developer’s head, the agent guesses. If it travels with the task, review starts from shared ground.

The context created during the work matters too. If a reviewer corrects the same mistake twice, that feedback should not stay buried in two separate pull requests. If a team introduces a new convention, future runs should be able to use it without every developer pasting the same reminder again.

That discipline helps agents. It helps the team too. The agent is just the pressure that makes the old context problem harder to ignore.

Specs that stay with the work

A vague task used to be less dangerous than it is now.

When a human engineer gets a vague task, they bring judgment with them. Sometimes that judgment shows up as a product question, a remembered edge case, or a refusal to implement the request as written.

An agent is much more willing to proceed. Give it a vague request, and it may still produce a full implementation. The result can look finished even when the interpretation was wrong.

That makes the spec matter more.

The invitation spec could still be short. An admin can invite someone by email. The invite expires after seven days. Existing users join after accepting. Role assignment waits for a later change, and cross-tenant access stays out of bounds. If review turns up a missing edge case, like a suspended user accepting an old invite, the spec should change before the agent keeps going.
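A spec this short can also be written down as a few executable rules, which gives the agent and the reviewer the same thing to argue with. The sketch below is only an illustration, not how any particular product implements invitations: the names `Invite`, `can_send_invite`, and `can_accept` are hypothetical, and the choice to reject suspended users is an assumption standing in for whatever the team actually decides.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# From the spec: an invite expires after seven days.
INVITE_TTL = timedelta(days=7)


@dataclass(frozen=True)
class Invite:
    """Hypothetical invite record; field names are illustrative."""
    tenant_id: str
    email: str
    created_at: datetime


def can_send_invite(role: str) -> bool:
    # From the spec: an admin can invite. Role assignment is out of scope,
    # so no other role is considered here.
    return role == "admin"


def can_accept(invite: Invite, target_tenant_id: str,
               user_suspended: bool, now: datetime) -> bool:
    # Cross-tenant access stays out of bounds: an invite is only valid
    # for the tenant it was issued in.
    if target_tenant_id != invite.tenant_id:
        return False
    # ASSUMPTION: suspended users may not accept. The spec only says this
    # edge case must be decided before the agent keeps going.
    if user_suspended:
        return False
    # Valid while the invite is at most seven days old.
    return now - invite.created_at <= INVITE_TTL


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    fresh = Invite("tenant-a", "new.user@example.com", now - timedelta(days=2))
    print(can_accept(fresh, "tenant-a", user_suspended=False, now=now))
```

If review turns up another edge case, the rule and its test change in the same place, before more implementation is generated against the old intent.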

Most tasks only need enough shape for the risk involved. A small bug fix may only need the expected behavior and a reproduction case. As the risk goes up, the spec has to capture the boundaries that matter: user flow, permissions, constraints, and migration story.

The spec cannot disappear once the agent starts coding. The agent plans against it. The implementation is judged against it. If the team discovers a missing edge case, the spec changes and the agent continues with the updated intent.

That is the version of spec-in-the-loop we care about. The useful spec is the one that stays close enough to the work to argue with it.

Evidence reviewers can trust

When code is cheap to generate, trust becomes the expensive part.

Agents can write useful tests. They can also write tests that mostly confirm the implementation they already chose. Coverage goes up, while the reviewer still has to ask: did we actually prove the behavior we care about?

Verification has to be visible enough for reviewers to know what the agent ran, what failed, and what changed after the failure. They also need to know whether the passing command was actually the right command for this task.

At review time, the reviewer should see evidence for the behaviors the spec promised, not a generic wall of green checks. The run should show that admin and non-admin paths were exercised, expiry was covered, and acceptance worked for both a brand-new user and an existing account. The command or environment behind that evidence should be visible too.
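One way to make that concrete is to compare what a run recorded against the behaviors the reviewer expects to see proven. The following is a minimal sketch under stated assumptions: `CheckResult`, `REQUIRED_BEHAVIORS`, and `missing_evidence` are hypothetical names, and the behavior labels are simply the invitation paths listed above.

```python
from dataclasses import dataclass

# Behaviors the reviewer expects evidence for (taken from the invitation example).
REQUIRED_BEHAVIORS = {
    "admin_can_invite",
    "non_admin_cannot_invite",
    "invite_expires",
    "new_user_accepts",
    "existing_user_accepts",
}


@dataclass(frozen=True)
class CheckResult:
    """Hypothetical record of one check the agent ran."""
    behavior: str     # which promised behavior this check exercises
    command: str      # the command that was actually run
    environment: str  # where it ran, e.g. a reproducible test environment
    passed: bool


def missing_evidence(results: list[CheckResult]) -> list[str]:
    # A behavior counts as proven only if a check passed AND the command
    # and environment behind it were recorded for the reviewer.
    proven = {
        r.behavior
        for r in results
        if r.passed and r.command and r.environment
    }
    return sorted(REQUIRED_BEHAVIORS - proven)


if __name__ == "__main__":
    run = [
        CheckResult("admin_can_invite", "pytest tests/invites", "ci", True),
        CheckResult("invite_expires", "pytest tests/invites", "ci", False),
    ]
    for behavior in missing_evidence(run):
        print("no trusted evidence for:", behavior)
```

The point of the sketch is the shape of the question: the reviewer asks which promised behaviors lack evidence, rather than how many checks are green.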

A small utility change may only need unit tests. A product flow is different: the real signal may come from exercising the experience end to end. For auth and permission changes, we usually want evidence from a reproducible environment, especially around database state and permissions.

The right checks vary by repo and team. What matters is that reviewers can inspect them. A reviewer should not have to dig through a long chat transcript to understand why the change is believed to be safe.

Agents are good at sounding confident. The workflow has to produce evidence.

Checkpoints where judgment matters

Humans should not sit in every loop forever. That defeats the point.

But some moments still need judgment.

Before implementation, someone needs to check whether this is worth building. A missing constraint or wrong scope can send the agent toward the wrong answer very efficiently.

For the invitation task, that means deciding whether role assignment belongs in scope, whether both owners and admins can invite, and how to handle an email that already belongs to another tenant. If a human punts on those questions, the agent can still ship clean code for the wrong product decision.

For some tasks, this checkpoint may matter more than code review.

Clean code cannot rescue a bad spec.

After implementation, the review shifts to the result. The question is whether the agent actually solved the problem in a way that fits the product. Test presence alone tells only part of the story; the tests have to mean something. Sometimes the risky part is a maintenance problem that appears later.

The depth of review should depend on risk. A copy update should not go through the same process as a permission change. As the system earns trust, some classes of work can run with less supervision. Others should stay tightly reviewed.

Those boundaries should be part of the workflow.
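Making the boundaries part of the workflow can be as simple as writing the policy down where tooling can read it. The sketch below assumes hypothetical change classes (`copy`, `utility`, `permissions`, `auth`) and review tiers; the actual classes and tiers would vary by repo and team.

```python
from enum import Enum


class Review(Enum):
    AUTO = "automated checks only"
    STANDARD = "one human reviewer"
    STRICT = "spec checkpoint plus human review"


# Strictness order, least to most supervised.
_ORDER = [Review.AUTO, Review.STANDARD, Review.STRICT]

# ASSUMPTION: illustrative mapping from change class to review depth.
POLICY = {
    "copy": Review.AUTO,
    "utility": Review.STANDARD,
    "permissions": Review.STRICT,
    "auth": Review.STRICT,
}


def required_review(change_classes: set[str]) -> Review:
    # Unknown or unclassified changes default to the strictest tier,
    # so a missing label never lowers supervision.
    levels = [POLICY.get(c, Review.STRICT) for c in change_classes]
    if not levels:
        return Review.STRICT
    # A change touching several classes gets the strictest tier among them.
    return max(levels, key=_ORDER.index)


if __name__ == "__main__":
    print(required_review({"copy"}).value)
    print(required_review({"copy", "permissions"}).value)
```

Defaulting to the strictest tier is a deliberate choice in this sketch: trust is granted per class of work as it is earned, not assumed for anything the policy has not seen.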

What we are building for the new workflow

Arcplane is our answer to that workflow gap.

Arcplane gives teams a place to run, review, and manage agentic software work on production codebases. It sits above tools like GitHub and gives agent-authored work a real lifecycle instead of leaving it as an unstructured chat-to-diff handoff.

In Arcplane, that same invitation task would begin with the chosen membership and permission context. The spec would stay with the run as it changes. The branch would carry evidence from checks that actually matched the behavior. Review would pause at the moments a human chose in advance, instead of hoping the important decisions survive in a chat transcript.

That is the workflow we want for our own team: agent work that can be reviewed as a real change, not decoded from a conversation.

Reusable instructions and agent skills help, but they are only ingredients.

A skill can encode repeatable team practice, such as migration review or the way risky auth changes are tested. But that practice still needs a place in the run and in review.

Code is getting easier to produce. The work now is making it hold up.

If this matches what you are seeing in your own team, subscribe below. We will share what we learn as we build.