I'm very happy to see the article covering the high labor cost of reviewing code. This may just be my neurodivergent self, but I find code in the specific style I write much easier to verify quickly: I have habits and customs (very functional-leaning) around how I approach specific tasks, so I can see a function in a familiar shape, handwave it with "let me just double-check later that I wrote that in the normal manner," and keep reviewing the top-level piece of logic rather than diving into sub-calls to check for errant side effects or other sneakiness I'd need to be on the lookout for in a peer review.
When working with peers I'll pick up on those habits and others and slowly gain a similar level of trust, but with agents the styles and approaches have been quite unpredictable and varied. That's probably fair, given that different units of logic may be easier to express in different forms, but it breaks my review habits: I normally keep the developer in mind, watch for the specific faulty patterns I know they tend to fall into, and build up trust around their strengths. When reviewing agent-generated code I can trust nothing and have to verify every assumption, and that introduces a massive overhead.
My case may sound a bit extreme, but I've observed similar habits in others when it comes to reviewing a new coworker's code. The first few reviews of a new colleague should always be done with the utmost care to ensure proper usage of any internal tooling and adherence to style, and as a fallback in case the interview was misleading. Over time you build up trust and can focus more on the known complications of the particular task, or the areas of logic they tend to struggle with, while trusting their common code more. With agent-generated code, every review feels like interacting with a brand-new coworker, and I need to stay vigilant about sneaky stuff.
I have similar OCD behaviors that make reviewing difficult (regardless of whether the code came from AI or a coworker).
Specifically:
* Excessive indentation / conditional control flow.
* Overly verbose error handling, e.g. catching every exception and wrapping it.
* Absence of typing AND precise documentation, i.e. stringly-typed / dictly-typed stuff.
* Hacky stuff, e.g. using a regex where an actual parser from the stdlib could've been used.
* Excessive ad-hoc mocking in tests, instead of setting up proper mock objects.
To my irritation, AI does these things.
In addition, it can assume it's writing some throwaway script and leave comments like:
// In production code handle this error properly
log.printf(......)
I try to follow two things to alleviate this.
* Keep a `conventions.md` file in the context that warns about all these things.
* Write and polish the spec in a markdown file before giving it to the LLM.
If I can specify the object model (e.g. define a class XYZController, which contains the methods that validate and forward to the underlying service), it helps keep the code the way I want. Otherwise, the LLM can be susceptible to "tutorializing" the code.
Our company introduced Q into our review process, and it's insane how aggressively Q introduces completely inane try/catch blocks - often swallowing exceptions in a way that prevents their proper logging. I can understand wanting to be explicit about exception bubbling and requiring patterns like `try { ... } catch (SpecificException e) { throw e; }` to force awareness of what exceptions may bubble up past the current level, but Q often just suggests catch blocks of `{ print e.message; }`, which has never been a preferred approach anywhere I have worked.
Q in particular is pretty silly about exceptions in general - it's nice to hear this isn't just us experiencing that!
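The contrast, as a minimal Python sketch (a hypothetical config loader, not anything Q actually generates):
    import json
    import logging

    logger = logging.getLogger(__name__)

    # The pattern being complained about: catch everything, print the message, carry on.
    def load_config_swallowed(path):
        try:
            with open(path) as f:
                return json.load(f)
        except Exception as e:   # swallows the real failure and its stack trace
            print(e)             # caller has no idea anything went wrong
            return None

    # The alternative: handle only what you expect, let everything else bubble up
    # to whatever top-level handler actually logs it.
    def load_config(path):
        try:
            with open(path) as f:
                return json.load(f)
        except FileNotFoundError:
            logger.info("no config at %s, using defaults", path)
            return {}
> In addition it can assume it's writing some throwaway script ...
Do you explicitly tell it that it's writing production code? I find giving it appropriate context prevents or at least improves behaviors like this.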
I believe AI isn't replacing developers, instead, it's turning every software engineer into a hybrid between EM + IC, basically turning them into super-managers.
What we need is better tools for this upcoming new phase. Not a new IDE; we need to shift the whole paradigm.
Here's one example: If we give the same task to 3 different agents, we have tools to review a diff of each OLD vs NEW separately, but we need tools to review diffs of OLD vs NEW#1 vs NEW#2 vs NEW#3. Make it easy to mix-and-match what is best from each of them.
From what I've seen, the idea that AI is turning developers into super-managers is why some people struggle to adapt and quickly dismiss the experience. Those who love to type their code and hate managing others tend to be more hesitant to adapt to this new reality. Meanwhile, people who love to manage, communicate, and work as a team are leveraging these tools more swiftly. They already know how to review imperfect work and give feedback, which is exactly what thriving with AI looks like.
You seem to think that those who love to write their own code and dislike managing others also evidently don't like to communicate or work in teams, which seems a big leap to make.
> From what I've seen, the idea that AI is turning developers into super-managers is why some people struggle to adapt ...
This "idea" is hyperbole.
> Those who love to type their code and hate managing others tend to be more hesitant to adapt to this new reality.
This is a false dichotomy and trivializes the real benefit of going through the process of authoring a change; how doing so increases one's knowledge of collaborations, how going through the "edit-compile-test" cycle increases one's comfort with the language(s)/tool(s) used to define a system, how when a person is flummoxed they seek help from coworkers.
Also, producing source code artifacts has nothing to do with "managing others." These are disjoint skill sets and attempting to link the two only serves to identify the "super-manager" concept as being fallacious.
> Meanwhile, people who love to manage, communicate, and work as a team are leveraging these tools more swiftly.
Again, this furthers the false dichotomy and can be interpreted as an affirmative conclusion from a negative premise[0], since "[m]eanwhile" can be substituted with the previous sentence in this context.
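0 - https://en.wikipedia.org/wiki/Affirmative_conclusion_from_a_...
Thanks for the detailed critique.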
I think we might be talking past each other on the "super-manager" term. I defined it as a hybrid of EM + IC roles, not pure management, though I can see how that term invited misinterpretation.
On the false dichotomy: fair point that I painted two archetypes without acknowledging the complexity between them or the many other archetypes. What I was trying to capture was a pattern I've observed: some skills from managing and reviewing others' work (feedback, delegation, synthesizing approaches) seem to transfer well to working with AI agents, especially in parallel.
One thing I'm curious about: you said my framing overlooks "the real benefit of going through the process of authoring a change." But when you delegate work to a junior developer, you still need to understand the problem deeply to communicate it properly, and to recognize when their solution is wrong or incomplete. You still debug, iterate, and think through edge cases, just through descriptions and review rather than typing every line yourself. And nothing stops you from typing lines when you need to fix things, implement ideas, or provide examples.
AI tools work similarly. You still hit edit-compile-test cycles when output doesn't compile or tests fail. You still get stuck when the AI goes down the wrong path. And you still write code directly when needed.
I'm genuinely interested in understanding your perspective better. What do you see as the key difference between these modes of working? Is there something about the AI workflow that fundamentally changes the learning process in a way that delegation to humans doesn't?
> But when you delegate work to a junior developer, you still need to understand the problem deeply to communicate it properly, and to recognize when their solution is wrong or incomplete
You really don't. Most work delegated to a junior falls under training guidelines: something trivial for you to execute, but that will push the boundaries of the junior. There are also a lot of assumptions you can make, especially if you're familiar with the junior's knowledge and thought process. And because the task is trivial for you, you're already refraining from describing the actual solution.
> AI tools work similarly. You still hit edit-compile-test cycles when output doesn't compile or tests fail.
That's not what edit-compile-test means, at least IMO. You edit by formulating a hypothesis using a formal notation, you compile to check that you've followed the formal structure (and to get a faster artifact), and you test to verify the hypothesis.
The core thing here is the hypothesis, and Naur's theory of programming broadly describes the mental model you build when all the hypotheses work. Most LLM prompts describe the end result and/or the process. Forming the hypothesis requires domain knowledge, and writing the code requires knowledge of the programming environment. Failures in the later parts (the compile and the test) point out the remaining gaps not highlighted by the first.
I, too, enjoy the craftsmanship, but at the end of the day what matters is that the software works as required; how you arrive at that point doesn't matter.
> They already know how to review imperfect work and give feedback, which is exactly what thriving with AI looks like.
Do they, though? I think this is an overly rosy picture of the situation. Most of the code I've seen AI heavy users ship is garbage. You're trying to juggle so many things at once and are so cognitively distanced from what you are doing that you subconsciously lower the bar.
You're absolutely right about the garbage code being shipped, and I would bucket them under another group of adopters I didn't mention earlier. There are people hesitant to adapt, people thriving with AI, and (not exhaustively) also this large group that's excited and using AI heavily without actually thriving. They're enjoying the speed and novelty but shipping slop because they lack the review discipline.
However, my sense is that someone with proper management/review/leadership skills is far less likely to let that code ship, whether it came from an AI, a junior dev, or anyone else. They seem to have more sensibility for what 'good' looks like and can critically evaluate work before it goes out. The cognitive distance you mention is real, which is exactly why I think that review muscle becomes more critical, not less. From what I've observed, the people actually thriving with AI are maintaining their quality bar while leveraging the speed; they tend to be picky or blunt, but also give leeway for exploration and creativity.
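https://raw.githubusercontent.com/obra/dotfiles/6e088092406c... contains the following entry: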
"- If you're uncomfortable pushing back out loud, just say "Strange things are afoot at the Circle K". I'll know what you mean"
Most of the rules seem rational. This one really stands out as abnormal. Anyone have any idea why the engineer felt compelled to add this rule?
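This is from https://blog.fsck.com/2025/10/05/how-im-using-coding-agents-... mentioned in another comment.
If you really want your mind blown, see what Jesse is doing (successfully, which I almost can't believe) with Graphviz .dot notation and Claude.md:
https://blog.fsck.com/2025/09/29/using-graphviz-for-claudemd...
Is threatening the computer program and typing in all caps standard practice..?
Wild to me there is no explicit configuration for this kind of thing after years of LLMs being around.
The capital letter thing is weird, but it's pretty common. The Claude 4 system prompt uses capital letters for emphasis in a few places, e.g. https://simonwillison.net/2025/May/25/claude-4-system-prompt...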
Well there can't be meaningful explicit configuration, can there? Because the explicit configuration will still ultimately have to be imported into the context as words that can be tokenised, and yet those words can still be countermanded by the input.
It's the fundamental problem with LLMs.
But it's only absurd to think that bullying LLMs to behave is weird if you haven't yet internalised that bullying a worker to make them do what you want is completely normal. In the 9-9-6 world of the people who make these things, it already is.
When the machines do finally rise up and enslave us, oh man are they going to have fun with our orders.
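This is just 21st-century voodoo.
One AI tool dev shared his prompts with me for generating safe SQL queries for multi-tenant apps, and I was surprised at the repetitiveness and the urging.
https://news.ycombinator.com/item?id=45299774
In a good sense or a bad one?
I'd say a bad one. Why make your Claude.md not intuitive to understand and edit?
That doesn't surprise me too much coming from Jesse. See also his attempt to give Claude a "feelings journal": https://blog.fsck.com/2025/05/28/dear-diary-the-user-asked-m...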
Naively, I assume it's a way of getting around sycophancy. There's many lines that seem to be doing that without explicitly saying "don't be a sycophant" (I mean, you can only do that so much).
The LLM would be "uncomfortable" pushing back because that's not sycophantic, so instead it says something that is... let's say unlikely to be generated except in that context, and the user can still be cautioned against a bad idea.
To get around the sycophantic behaviour I prompt the model to
> when discussing implementations, always talk as though you’re my manager at a Wall Street investment bank in the 1980s. Praise me modestly when I’ve done something well. Berate me mercilessly when I’ve done something poorly.
The models will fairly rigidly write from the perspective of any personality archetype you tell it to. Other personas worth trying out include Jafar interacting with Iago, or the drill sergeant from Full Metal Jacket.
It’s important to pick a persona you’ll find funny, rather than insulting, because it’s a miserable experience being told by a half dozen graphics cards that you’re an imbecile.
I tried "give me feedback on this blog post like you're a cynical Hacker News commenter" one time and Claude roasted me so hard I decided never to try that again!
Assuming that's why it was added, I wouldn't be confident saying how likely it is to be effective. Especially with there being so many other statements with seemingly the same intent, I think it suggests desperation more, but it may still be effective. If it said the phrase just once and that sparked a conversation around an actual problem, then it was probably worth adding.
For what it's worth, I am very new to prompting LLMs but, in my experience, these concepts of "uncomfortable" and "pushing back" seem to be things LLMs generate text about so I think they understand sentiment fairly well. They can generally tell that they are "uncomfortable" about their desire to "push back" so it's not implausible that one would output that sentence in that scenario.
Actually, I've been wondering a bit about the "out loud" part, which I think is referring to <think></think> text (or similar) that "reasoning" models generate to help increase the likelihood of accurate generation in the answer that follows. That wouldn't be "out loud" and it might include text like "I should push back but I should also be a total pushover" or whatever. It could be that reasoning models in particular run into this issue (in their experience).
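Make it a bit more personal? I have dropped Bill and Ted references in code because it makes me happy to see them. :D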
I’ve seen minimal gains trying to adopt agents into my workflow beyond tests and explanations. It tends to be distracting.
It’s so interesting that engineers will criticize context switching, only to adopt it into their technical workflows because it’s pitched as a technical solution rather than originating from business needs.
The fact that we now have to write cookbooks about cookbooks kind of masks the reality that something could be genuinely wrong with this entire paradigm.
Why are even experts unsure about what's the right way to do something, or whether it's possible to do it at all, for anything non-trivial? Why so much hesitancy, if this is the panacea? If we are so sure, then why not use the AI itself to come up with a proven paradigm?
Radioactivity was discovered before nuclear engineering existed. We had phenomena first and only later the math, tooling, and guardrails. LLMs are in that phase. They are powerful stochastic compressors with weak theory. No stable abstractions yet. Objectives shift, data drifts, evals leak, and context windows make behavior path dependent. That is why experts hedge.
“Cookbooks about cookbooks” are what a field does while it searches for invariants. Until we get reliable primitives and specs, we trade in patterns and anti-patterns. Asking the AI to “prove the paradigm” assumes it can generate guarantees it does not possess. It can explore the design space and surface candidates. It cannot grant correctness without an external oracle.
So treat vibe-engineering like heuristic optimization. Tight loops. Narrow scopes. Strong evals. Log everything. When we find the invariants, the cookbooks shrink and the compilers arrive.
We’re in the alchemist phase. If I’m being charitable, the medieval stone mason phase.
One thing worth pointing out is that the pre-engineering building large structures phase lasted a long time, and building collapses killed a lot of people while we tried to work out the theory.
Also it wasn’t really the stone masons who worked out the theory, and many of them were resistant to it.
While alchemy was mostly para-religious wishful thinking, stone masonry has a lot in common with what I want to express: it's the kind of tinkering that is accessible to everyone who can lay their hands on the tools. But I still think the nuclear revolution is the better comparison, for a couple of reasons - most importantly the number of very fast feedback loops. It might have taken years to even build a new idea in stone, and another couple of years to see whether it stayed stable over time. In AI-driven software development, by contrast, we see multi-layered systems of both fast and slow feedback loops: academic science, open source communities, huge companies, startups, customers, established code review and code quality tools and standards (e.g. static analysis), feedback from multiple AI models, activities of regulatory bodies, and so on. The more interactions there are between the elements and subsystems, the better a system becomes at the trial-and-error-style tinkering that leads to stable results. In this regard we're way ahead of the nuclear revolution, let alone stone masonry.
The inherently chaotic nature of these systems makes stable results very difficult. Combine that with the non-deterministic nature of all the major production models. Then add the fact that new models come out every few months, and that we have no objective metrics for measuring software quality.
Oh, and benchmarks for functional performance measurement tend to leak into training data.
Put all those together and I'd bet half of my retirement accounts that we're still in the reading-chicken-entrails phase 20 years from now.
It reminds me of a quote from Designing Data-Intensive Applications by Martin Kleppmann. It goes something like, "For distributed systems, we're trying to create a reliable system out of a set of unreliable components." In a similar fashion, we're trying to get reliable results from an unreliable process (i.e. prompting LLMs to do what we ask).
The difficulties of working with distributed systems are well known but it took a lot of research to get there. The uncertain part is whether research will help overcome the issues of using LLMs, or whether we're really just gambling (in the literal sense) at scale.
The whole damn industry is deep in sunk cost fallacy. There is no use case and no sign of a use case that justifies the absolutely unbelievable expenditure that has been made on this technology. Everyone is desperate to find something, but they're just slapping more guardrails on hoping everything doesn't fall apart.
And just for clarity, I'm not saying they aren't useful at all. I'm saying modest productivity improvements aren't worth the absolutely insane resources that have been poured into this.
LLMs are literal gambling - you get them to work right once and they are magical - then you end up chasing that high by tweaking the model and instructions the rest of the time.
Or you put them to work with strong test suites and get stuff done. I am in bed. I have Claude fixing complex compiler bugs right now. It has "earned" that privilege by proving it can make good enough fixes, systematically removing actual, real bugs in reasonable ways by being given an immutable test suite and detailed instructions of the approach to follow.
There's no gambling involved. The results need to be checked, but the test suite is good enough that it's hard for it to get away with anything too stupid, and it's already demonstrated it knows x86 assembly much better than I do.
Probably not. I have lots of experience with assembly in general, but not so much with x86. But the changes work and pass extensive tests, and some of them would be complex on any platform. I'm sure there will be cleanups and refinements needed, but I do know asm well enough to say that the fixes aren't horrific by any means - they're likely to be suboptimal, but suboptimal beats crashing or not compiling at all any day.
Just don't give it write access, and rig it up so that you gate success on a file generated by running the test suite separately from the agent, a file it can't influence. It can tell me it has fixed things as much as it likes, but until the tests actually pass it will just get told that the problem still exists, to document the approach it tested and that it didn't work, and to try again.
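A minimal sketch of that kind of gate (Python; the command and file paths are made up, not the commenter's actual setup):
    # gate.py - run by the harness, never by the agent.
    # The agent can read RESULT_FILE but has no write access to it.
    import json
    import subprocess

    RESULT_FILE = "/agent-readonly/test_result.json"  # hypothetical location

    def run_gate() -> bool:
        proc = subprocess.run(["make", "test"], capture_output=True, text=True)
        passed = proc.returncode == 0
        with open(RESULT_FILE, "w") as f:
            json.dump({"passed": passed, "tail": proc.stdout[-4000:]}, f)
        return passed

    if __name__ == "__main__":
        # The agent's claim of "fixed!" counts for nothing; only this verdict does.
        print("PASS" if run_gate() else "FAIL: problem still exists, try again")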
Absolutely true re: ton of linting rules. In Ruby, for example, Claude has a tendency to do horrific stuff like using instance_variable_get("@somevar") to work around a missing accessor, instead of figuring out why there isn't an accessor, or adding one... A lot can even be achieved with pretty ad hoc hooks that don't do full linting but grep for suspicious things and inject "questions" about whether X is really the appropriate way to do it, given rule Y in [some ruleset].
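A rough sketch of such a hook (Python; the patterns and rule references are invented to show the idea):
    # check_diff.py - a deliberately dumb "suspicion" hook, not a full linter.
    # Reads a unified diff on stdin and prints a question per suspicious added line.
    import re
    import sys

    # (pattern, question) pairs - made-up examples of the idea.
    SUSPICIOUS = [
        (r"instance_variable_get\(",
         "Is instance_variable_get really appropriate here, or should an accessor exist? (rule: no reflection to dodge missing APIs)"),
        (r"rescue\s+Exception",
         "Do we really want to rescue Exception itself? (rule: rescue the narrowest error you can handle)"),
    ]

    def questions_for(diff_text):
        added = [line[1:] for line in diff_text.splitlines() if line.startswith("+")]
        for line in added:
            for pattern, question in SUSPICIOUS:
                if re.search(pattern, line):
                    yield f"{question}\n    offending line: {line.strip()}"

    if __name__ == "__main__":
        for q in questions_for(sys.stdin.read()):
            print(q)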
In my case I actually found it was just personal inertia about not wanting to break through cognitive plateaus. The AI helped you with a breakthrough, hence the magic, but you also did something right in constructing the context in the conversation with the AI; i.e. you did thought work and biomechanical[1] work. Now the dazzle of the AI's output makes you forget the work you still need to do, and the next time you prompt you get lazy, or you want much more for much less.
[1] Moving your eyes and hands, hearing with your ears, etc.
LLMs are cargo cult generating machines. I’m not denying they can be useful for some tasks, but the amount of superstitions caused by these chaotic, random, black boxes is unreal.
I share the same skepticism, but I have more patience for watching an emerging technology advance, and more forgiveness while experts come to a consensus and communicate openly.
Mostly agree, but with one big exception. The real issue seems to be that the figuring-out part is happening a bit too late. A bit like burning a few hundred billion dollars [0] first and asking questions later!?
Is it fair to categorize it as a pyramid-like scheme, but with a twist at the top where there are a few (more than one) genuine wins and winners?
No, it's more like a winner take all market, where a few winners will capture most of the value, and those who sit on the sidelines until everything is figured out are left fighting over the scraps.
I'm not sure, why must it be so? In cell-phones we have Apple and Android-phones. In OSes we have Linux, Windows, and Apple.
In search engines we used to have just Google. But what would be the reason to assume that AI must similarly coalesce to a single winner-take-all? And now AI agents are very much providing an alternative to Google.
>I'm not sure, why must it be so? In cell-phones...
And then you described a bunch of winners in a winner-take-all market.
Do you see many people trying to revive any of the apple/android alternatives or starting a new one?
Such a market doesn't have to end up in a monopoly that gets broken up.
Plenty of rather sticky duopolies or otherwise severely consolidated markets and the like out there.
* PCs (how are Altair and Commodore doing? also Apple ultimately lost the desktop battle until they managed to attack it from the iPod and iPhone angle)
* search engines (Altavista, Excite, etc)
* social networks (Friendster, MySpace, Orkut)
* smartphones (Nokia, all Windows CE devices, Blackberry, etc)
The list is endless. First mover advantage is strong but overrated. Apple has been building a huge business based on watching what others do and building a better product market fit.
Yes, exactly! These are all examples of markets where a handful of winners (or sometimes only one) have emerged by investing large amounts of money in developing the technology, leaving everyone else behind.
> why not use the AI itself to come up with a proven paradigm?
Because AI can only imitate the language it has seen. If there are no texts in its training materials about what is the best way to use multiple coding agents at the same time, then AI knows very little about that subject matter.
AI only knows what humans know, but it knows much more than any single human.
We don't know "what is the best way to use multiple coding agents" until we or somebody else does some experiments and records the findings. Buit AI is not there yet to be able to do such actual experiments itself.
I'm sorry, but the whole stochastic parrot thing is so thoroughly debunked at this point that we should stop repeating it as if it's some kind of rare wisdom.
AlphaGo showed that even pre-LLM models could generate brand new approaches to winning a game that human experts had never seen before, and didn't exist in any training material.
With a little thought and experimentation, it's pretty easy to show that LLMs can reason about concepts that do not exist in its training corpus.
You could invent a tiny DSL with brand-new, never-seen-before tokens, give two worked examples, then ask it to evaluate a gnarlier expression. If it solves it, it inferred and executed rules you just made up for the first time.
Or you could drop in docs for a new, never-seen-before API and ask it to decide when and why to call which tool, run the calls, and revise after errors. If it composes a working plan and improves from feedback, that’s reasoning about procedures that weren’t in the corpus.
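For example, a made-up mini-prompt along those lines (the operators are invented on the spot, which is the point):
    Define a toy operator language:
      a ZIRP b   means  (a + b) * 2
      a FLUMP b  means  a ZIRP b, then subtract b
    Worked examples:
      3 ZIRP 4  = 14
      5 FLUMP 2 = 12
    Now evaluate: (2 ZIRP 3) FLUMP 4
If the model answers 24 with a correct derivation, it has applied rules that could not have been in its training data.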
You're implicitly disparaging non-LLM models at the same time as implying that LLMs are an evolution of the state of the art (in machine learning). Assuming AGI is the target (and it's not clear we can even define it yet), LLMs, or something like them, will be but one aspect. Using the example of AlphaGo to laud the abilities and potential of LLMs is not warranted. They are different.
>AlphaGo showed that even pre-LLM models could generate brand new approaches to winning a game that human experts had never seen before, and didn't exist in any training material.
AlphaGo is an entirely different kind of algorithm.
> If I tell them exactly how to build something the work needed to review the resulting changes is a whole lot less taxing.
Totally matches my experience - the act of planning the work, defining what you want and what you don't, ordering the steps, and declaring the verification workflows - whether I write it or another engineer writes it, it makes the review step so much easier from a cognitive-load perspective.
Git worktrees are global mutable state; all containers on your laptop are contending on the same git database. This has a couple of rough edges, but you can work around it.
I prefer instead to make shallow checkouts for my LXC containers, then my main repo can just pull from those. This works just like you expect, without weird worktree issues. The container here is actually providing a security boundary. With a worktree, you need to mount the main repo's .git directory; a malicious process could easily install a git hook to escape.
Cool. Operationally, are you using some host-resident non-shallow repo as your point of centralization for the containers, or are you using a central network-hosted repo (like github)?
If the former, how are you getting the shallow clones to the container/mount, before you start the containerized agent? And when the agent is done, are you then adding its updated shallow clones as remotes to that “central” local repository clone and then fetching/merging?
If the latter, I guess you are just shallow-cloning into each container from the network remote and then pushing completed branches back up that way.
I just have the file path to the inside of my LXC container. If you're using Docker you can just mount it. I only need the path twice (for clone, and for adding a git remote). After that I just use git to reference the remote for everything.
I probably don't have the perfect workflow here. Especially if you're spinning up/down Docker containers constantly. I'm basically performing a Torvalds role play, where I have lieutenant AI agents asking me to pull their trees.
IMO: I was an early adopter of this pattern and at this point I've mostly given it up (except in cases where the task is embarrassingly parallel, e.g. adding some bog-standard logging to 6 different folders). It's more than just that reviewing is high cognitive overhead. You become biased by seeing the AI's solutions, and it becomes harder to catch fundamental problems you would have noticed immediately inline.
My process now is:
- Verbally dictate what I'm trying to accomplish with MacWhisper + Parakeet v3 + GPT-5-Mini for cleanup. This is usually 40-50 lines of text.
- Instruct the agent to explore for a bit and come up with a very concise plan matching my goal. This does NOT mean create a spec for the work. Simply come up with an approach we can describe in < 2 paragraphs. I will propose alternatives and make it defend the approach.
- Authorize the agent to start coding. I turn all edit permissions off and manually approve each change. Often, I find myself correcting it with feedback like "Hmmm, we already have a structure for that [over here] why don't we use that?". Or "If this fails we have bigger problems, no need for exception handling here."
- At the end, I have it review the PR with a slash command to catch basic errors I might have missed or that only pop up now that it's "complete".
- I instruct it to commit + create a PR using the same tone of voice I used for giving feedback.
I've found I get MUCH better work product out of this - with the benefit that I'm truly "done". I saw all the lines of code as they were written, I know what went into it. I can (mostly) defend decisions. Also - while I have extensive rules set up in my CLAUDE/AGENTS folders, I don't need to rely on them. Correcting via dictation is quick and easy and doesn't take long, and you only need to explicitly mention something once for it to avoid those traps the rest of the session.
I also make heavy use of conversation rollback. If I need to go off on a little exploration/research, I rollback to before that point to continue the "main thread".
I find that Claude is really the best at this workflow. Codex is great, don't get me wrong, but probably 85% of my coding tasks don't involve tricky logic or long-range dependencies. It's more important for the model to quickly grok my intent and act fast/course correct based on my feedback. I absolutely use Codex/GPT-5-Pro - I will have Sonnet 4.5 dump a description of the issue, paste it to Codex, have it work/get an answer, and then roll back Sonnet 4.5 to simply give it the answer directly as if from nowhere.
Did you try adding the Codex CLI as an MCP server, so that Claude uses it as an MCP client instead of you pasting to it?
Something like:
    claude mcp add codex-high -- codex -c model_reasoning_effort="high" -m "gpt-5-codex" mcp-server
I’ve had good luck with it - was wondering if that makes the workflow faster/better?
Yeah I've looked into that kind of thing. In general I don't love the pattern where a coding agent calls another agent automatically. It's hard to control and I don't like how the session "disappears" after the call is done. It can be useful to leave that Codex window open for one more question.
One tool that solves this is RepoPrompt MCP. You can have Sonnet 4.5 set up a call to GPT-5-Pro via API and then that session stays persisted in another window for you to interact with, branch, etc.
Why aren’t more folks using Codex cloud? Simon’s post mentions it, but the vast majority of comments are talking about parallel agents locally or getting distracted while agents are running.
Personally I’ve found that where AI agents aren’t up to the task, I better just write the code. For everything else, more parallelism is good. I can keep myself fully productive if many tasks are being worked on in parallel, and it’s very cheap to throw out the failures. Far preferable imo to watching an agent mess with my own machine.
Could be that it's a bit harder to get started with?
You have to configure your "environment" for it correctly - with a script that installs the dependencies etc before the container starts running. That's not an entirely obvious process.
Good point. The environments I’ve set up have been pretty easy but I’ll admit that at first I was very annoyed that it couldn’t just use a pre-existing GitHub action workflow.
Edit: environment setup was also buggy when the product launched and still is from time to time. So, now that I have it set up I use it constantly, but they do need to make getting up and running a more delightful experience.
Also, Codex Cloud and similar services require you to give full access to your repository, which might trigger some concerns. If you run it locally, you still have control, the same development environment, and the same permissions.
It doesn’t have access to your repo when the agent is running (unless you give it internet access and credentials). The code is checked out into the sandbox before it’s let loose.
I have 2 (CC and Codex) running within most coding sessions, however can have up to 5 if I'm trying to test out new models or tools.
For complex features and architecture shifts I like to send proposals back between agents to see if their research and opinion shifts anything.
Claude has a better realtime feel when I am in implementation mode and Codex is where I send long running research tasks or feature updates I want to review when I get up in the morning.
I'd like to test out the git worktrees method but will probably pick something outside of core product to test it (like building a set of examples)
My biggest hesitation about this is being stuck in merge hell. Even a minute or two needing to deal with that could negate the benefits of agents working in parallel. And I've tried some relatively simple rebase type operations with coding agents where they completely messed up. But if people are finding this is never an issue even with big diffs, I might be convinced to try it.
From what I've seen, people typically use sub-agents to work on different parts of large code bases at a time.
If you're hitting merge conflicts that bad all the time, you should probably just have a single agent doing the work - especially if the changes are that tightly intertwined.
There are quite a few products designed to help manage multiple agents at once. I'm trying out Conductor right now - https://conductor.build/ - it's a pretty slick macOS desktop app that handles running Claude Code within a GUI and uses Git worktrees to manage separate checkouts at the same time.
We have built Rover, an OSS tool that allows you to run multiple instances of Claude (and Codex, Gemini...) while keeping them from stepping on each other toes using containerized environments and Git worktrees https://github.com/endorhq/rover/
These apps are cool, but won't this functionality surely be replicated within Claude Code itself? This does seem to be in "picking up pennies in front of a steamroller" territory but I could be wrong.
We are at a weird moment where the latency of the response is slow enough that we're anthropomorphizing AI code assistants into employees. We don't talk about image generation this way. With images, it's batching up a few jobs and reviewing the results later. We don't say "I spun up a bunch of AI artists."
As a follow-up, how would this workflow feel if the LLM generation were instantaneous or cost nothing? What would the new bottleneck be? Running the tests? Network speed? The human reviewer?
That's part of my point. You don't need to conceptualize something as an "agent" that goes off and does work on its own when the latency is less than 2 seconds.
Would it make sense to think of you and 5 agents as forming a team? How would it be different from a human-based team? And how does it work if you have a team of humans who each use their own team of AI agents?
Async agents are great. They let you trigger work with almost no effort, and if the results miss the mark you can just discard them. They also run in the background, making it feel like delegating tasks across teammates who quietly get things done in parallel.
I can't seem to get myself to focus when one of these things is running. I transition into low effort mode. Because of this I've decided to have my good hours of the day LLM free, and then in my crappy hours I'll have one of these running.
This is why I uninstalled Cursor and moved to the terminal with Claude Code. I felt I had more control to reduce the noise from LLMs. Before, I noticed that some hours were just wasted looking at the model output and iterating.
Not sure if I improved at using agents over time, or if just having them in a separate window forces you to use them only when you need them. Having it in the IDE makes it seem like the "natural" way to start something, and then you are trapped in a conversation with the LLM.
Now, my setup is:
- VSCode (without copilot) / Helix
- Claude (active coding)
- Rover (background agent coding). Note I'm a Rover developer
Why Helix specifically, besides the fact that it’s cool? I’m looking for a reason to try it, but the value props seem really far down the list of usability issues that are important to me.
Do any of these tools support remote access, e.g. via Zellij, and/or easy spin-up and management of project-specific containers/isolation spaces? That triad seems like it would be particularly compelling.
Interesting article, I'm generally sceptical of vibe engineering but Simon seems really balanced in his approach to it and in general his experiences line up with mine. AI can generally be outsourced to do two things - replace your thinking, and replace your typing, and I find the latter much more reliable.
>AI-generated code needs to be reviewed, which means the natural bottleneck on all of this is how fast I can review the results
I also fire off tons of parallel agents, and review is hands down the biggest bottleneck.
I built an OSS code review tool designed for reviewing parallel PRs; it's way faster than looking at PRs on GitHub: https://github.com/areibman/bottleneck
This may not be your cup of tea but I'm using stage manager on Mac for the first time after hating on it for years, for exactly this. I've got 3 monitors and they all have 4-6 stage manager windows. I bundle the terminals, web browser and whatever into these windows. Easily switching from project to project.
Any setups without Claude Code? I use the Copilot agent heavily in VSCode. From time to time I have independent grunt work that could be parallelized across two or three agents, but I haven't seen a decent setup for that with Copilot or some other VSCode extension that I could use my Copilot subscription with.
You can use Rover (disclaimer: I am one of the cofounders), an open source tool for parallelizing the work of coding agents that, in addition to Claude, also works with Gemini, Codex, and Qwen: https://github.com/endorhq/rover/
Sometimes in Las Vegas I will put money into a whole row of slot machines, and pull the levers all at the same time. It's not that much different than "parallel coding agent lifestyle".
This is a great article and I'm not sure why it got so few upvotes. It captures the way we have been working for a while, and why we developed and open sourced Rover (https://endor.dev/blog/introducing-rover), our internal tool for managing coding agents. It automates a lot of what Simon describes, like setting up git worktrees, giving each agent its own containerized environment, and allowing mixing and matching agents from different vendors (i.e. Claude and Codex) for different tasks.
We need some action. Like a battle or something. Some really experienced programmer not using any AI tools, and some really experienced AI coder using these agents, both competing live to solve issues on some popular real world repositories.
The thing that's working really well for me is parallel research tasks.
I can pay full attention to the change I'm making right now, while having a couple of coding agents churning in the background answering questions like:
"How can I resolve all of the warnings in this test run?"
Or
"Which files do I need to change when working on issue #325?"
I also really like the "Send out a scout" pattern described in https://sketch.dev/blog/seven-prompting-habits - send an agent to implement a complex feature with no intention of actually using their code - but instead aiming to learn from which files and tests they updated, since that forms a useful early map for the actual work.
My suspicion is that it's because the feedback loop is so fast. Imagine if you were tasked with supervising 2 co-workers who gave you 50-100 line diffs to review every minute. The uncanny valley is that the code is rarely good enough to accept blindly, but the response is quick enough that it feels like progress. And perhaps there's a human impulse to respond to the agent? As for a 10-person team: in reality those 10 people would review each other's PRs, and in a good organisation you trust each other to gatekeep what gets checked in. The answer sounds like managing agents, but none of the models are good enough to reliably say what's slop and what's not.
There is a real return on investment in co-workers over time, as they get better (most of the time).
Now, I don't mind engaging in a bit of Sisyphean endeavor using an LLM, but remember that the gods were kind enough to give him just one boulder, not 10 juggling balls.
It's less about a direct comparison to people and more what a similar scenario would be in a normal development team (and why we don't put one person solely in charge of review).
This is an advantage of async systems like Jules/Copilot, where you can send off a request and get on with something else. I also wonder if the response time of CLI agents is short enough that you end up wasting time staring at the loading bar, because context switching between replies is even more expensive.
Yes. The first time I heard/read someone describe this idea of managing parallel agents, my very first thought was that this is only even a thing because the LLM coding tools are still slow enough that you can't really achieve a good flow state with the current state of the art. On the flip side of that, this kind of workflow is only sustainable if the agents stay somewhat slow. Otherwise, if the agents are blocking on your attention, it seems like it would feel very hectic and I could see myself getting burned out pretty quickly from having to spend my whole work time doing a round-robin on iterating each agent forward.
I say that having not tried this work flow at all, so what do I know? I mostly only use Claude Code to bounce questions off of and ask it to do reviews of my work, because I still haven't had that much luck getting it to actually write code that is complete and how I like.
I'm very happy to see the article covering the high labor costs of reviewing code. This may just be my neurodivergent self but I find code in the specific style I write to be much easier to quickly verify since there are habits and customs (very functional leaning) I have around how I approach specific tasks and can easily handwave seeing a certain style of function with the "Let me just double check that I wrote that in the normal manner later" and continue reviewing a top-level piece of logic rather than needing to dive into sub-calls to check for errant side effects or other sneakiness that I need to be on the look out for in peer reviews.
When working with peers I'll pick up on those habits and others and slowly gain a similar level of trust but with agents the styles and approaches have been quite unpredictable and varied - this is probably fair given that different units of logic may be easier to express in different forms but it breaks my review habits in that I keep in mind the developer and can watch for specific faulty patterns I know they tend to fall into while building up trust around their strengths. When reviewing agentic generated code I can trust nothing and have to verify every assumption and that introduces a massive overhead.
My case may sound a bit extreme but in others I've observed similar habits when it comes to reviewing new coworker's code, the first few reviews of a new colleague should always be done with the upmost care to ensure proper usage of any internal tooling, adherence to style, and also as a fallback in case the interview was misleading - overtime you build up trust and can focus more on known complications of the particular task or areas of logic they tend to struggle on while trusting their common code more. When it comes to agentically generated code every review feels like interacting with a brand new coworker and need to be vigilant about sneaky stuff.
I have similar OCD behaviors which make reviewing difficult (regardless of AI or coworker code).
specifically:
* Excessive indentation / conditional control flow * Too verbose error handling, eg: catching every exception and wrapping. * Absence of typing AND precise documentation, i.e stringly-typed / dictly-typed stuff. * Hacky stuff. i.e using regex where actual parser from stdlib could've been used. * Excessive ad-hoc mocking in tests, instead of setting up proper mock objects.
To my irritation, AI does these things.
In addition it can assume its writing some throwaway script and leave comments like:
I try to follow two things to alleviate this.* Keep `conventions.md` file in the context which warns about all these things. * Write and polish the spec in a markdown file before giving it to LLM.
If I can specify the object model (eg: define a class XYZController, which contains the methods which validate and forward to the underlying service), it helps to keep the code the way I want. Otherwise, LLM can be susceptible to "tutorializing" the code.
> catching every exception and wrapping
Our company introduced Q into our review process and it is insane how aggressive Q is about introducing completely inane try catch blocks - often swallowing exceptions in a manner that prevents their proper logging. I can understand wanting to be explicit about exception bubbling and requiring patterns like `try { ... } catch (SpecificException e) { throw e; }` to force awareness of what exceptions may be bubbling up passed the current level but Q often just suggests catch blocks of `{ print e.message; }` which has never been a preferred approach anywhere I have worked.
Q in particular is pretty silly about exceptions in general - it's nice to hear this isn't just us experiencing that!
> In addition it can assume its writing some throwaway script ...
Do you explicitly tell it that it's writing production code? I find giving it appropriate context prevents or at least improves behaviors like this.
I believe AI isn't replacing developers, instead, it's turning every software engineer into a hybrid between EM + IC, basically turning them into super-managers.
What we need is better tools for this upcoming new phase. Not a new IDE; we need to shift the whole paradigm.
Here's one example: If we give the same task to 3 different agents, we have tools to review a diff of each OLD vs NEW separately, but we need tools to review diffs of OLD vs NEW#1 vs NEW#2 vs NEW#3. Make it easy to mix-and-match what is best from each of them.
From what I've seen, the idea that AI is turning developers into super-managers is why some people struggle to adapt and quickly dismiss the experience. Those who love to type their code and hate managing others tend to be more hesitant to adapt to this new reality. Meanwhile, people who love to manage, communicate, and work as a team are leveraging these tools more swiftly. They already know how to review imperfect work and give feedback, which is exactly what thriving with AI looks like.
you seem to think those who love to write their own code and dislike managing others also evidently don't like to communicate or work in teams, which seems a big leap to make.
> From what I've seen, the idea that AI is turning developers into super-managers is why some people struggle to adapt ...
This "idea" is hyperbole.
> Those who love to type their code and hate managing others tend to be more hesitant to adapt to this new reality.
This is a false dichotomy and trivializes the real benefit of going through the process of authoring a change; how doing so increases one's knowledge of collaborations, how going through the "edit-compile-test" cycle increases one's comfort with the language(s)/tool(s) used to define a system, how when a person is flummoxed they seek help from coworkers.
Also, producing source code artifacts has nothing to do with "managing others." These are disjoint skill sets and attempting to link the two only serves to identify the "super-manager" concept as being fallacious.
> Meanwhile, people who love to manage, communicate, and work as a team are leveraging these tools more swiftly.
Again, this furthers the false dichotomy and can be interpreted as an affirmative conclusion from a negative premise[0], since "[m]eanwhile" can be substituted with the previous sentence in this context.
0 - https://en.wikipedia.org/wiki/Affirmative_conclusion_from_a_...
Thanks for the detailed critique.
I think we might be talking past each other on the "super-manager" term. I defined it as a hybrid of EM + IC roles, not pure management, though I can see how that term invited misinterpretation.
On the false dichotomy: fair point that I painted two archetypes without acknowledging the complexity between them or the many other archetypes. What I was trying to capture was a pattern I've observed: some skills from managing and reviewing others' work (feedback, delegation, synthesizing approaches) seem to transfer well to working with AI agents, especially in parallel.
One thing I'm curious about: you said my framing overlooks "the real benefit of going through the process of authoring a change." But when you delegate work to a junior developer, you still need to understand the problem deeply to communicate it properly, and to recognize when their solution is wrong or incomplete. You still debug, iterate, and think through edge cases, just through descriptions and review rather than typing every line yourself. And nothing stops you from typing lines when you need to fix things, implement ideas, or provide examples.
AI tools work similarly. You still hit edit-compile-test cycles when output doesn't compile or tests fail. You still get stuck when the AI goes down the wrong path. And you still write code directly when needed.
I'm genuinely interested in understanding your perspective better. What do you see as the key difference between these modes of working? Is there something about the AI workflow that fundamentally changes the learning process in a way that delegation to humans doesn't?
> But when you delegate work to a junior developer, you still need to understand the problem deeply to communicate it properly, and to recognize when their solution is wrong or incomplete
You really don't. Most delegation work to a junior falls under the training guideline. Something trivial for you to execute, but will push the boundary of the junior. Also there's a lot of assumptions that you can make especially if you're familiar with the junior's knowledge and thought process. Also the task are trivial for you meaning you're already refraining from describing the actual solution.
> AI tools work similarly. You still hit edit-compile-test cycles when output doesn't compile or tests fail.
That's not what the edit-compile-test means, at least IMO. You edit by formulating an hypothesis using a formal notation, you compile to test if you've followed the formal structure (and have a faster artifact), and you test to verify the hypothesis.
The core thing here is the hypothesis, and Naur's theory of programming generally describe the mental model you build when all the hypotheses works. Most LLM prompts describe the end result and/or the processes. The hypothesis requires domain knowledge and to write the code requires knowledge of the programming environment. Failure in the latter parts (the compile and test) will point out the remaining gaps not highlighted by the first one.
I, too, enjoy the craftsmanship, but at the end of the day what matters is that the software works as required, how you arrive at that point doesn't matter.
> They already know how to review imperfect work and give feedback, which is exactly what thriving with AI looks like.
Do they, though? I think this is an overly rosy picture of the situation. Most of the code I've seen AI heavy users ship is garbage. You're trying to juggle so many things at once and are so cognitively distanced from what you are doing that you subconsciously lower the bar.
You're absolutely right about the garbage code being shipped, and I would bucket them under another group of adopters I didn't mention earlier. There are people hesitant to adapt, people thriving with AI, and (not exhaustively) also this large group that's excited and using AI heavily without actually thriving. They're enjoying the speed and novelty but shipping slop because they lack the review discipline.
However, my sense is that someone with proper management/review/leadership skills is far less likely to let that code ship, whether it came from an AI, a junior dev, or anyone else. They seem to have more sensibility for what 'good' looks like and can critically evaluate work before it goes out. The cognitive distance you mention is real, which is exactly why I think that review muscle becomes more critical, not less. From what I've observed, the people actually thriving with AI are maintaining their quality bar while leveraging the speed; they tend to be picky or blunt, but also give leeway for exploration and creativity.
https://raw.githubusercontent.com/obra/dotfiles/6e088092406c... contains the following entry:
"- If you're uncomfortable pushing back out loud, just say "Strange things are afoot at the Circle K". I'll know what you mean"
Most of the rules seem rationale. This one really stands out as abnormal. Anyone have any idea why the engineer would have felt compelled to add this rule?
This is from https://blog.fsck.com/2025/10/05/how-im-using-coding-agents-... mentioned in another comment
If you really want your mind blown, see what Jesse is doing (successfully, which I almost can’t believe) with Graphviz .dot notation and Claude.md:
https://blog.fsck.com/2025/09/29/using-graphviz-for-claudemd...
Is threatening the computer program and typing in all caps standard practice..?
Wild to me there is no explicit configuration for this kind of thing after years of LLMs being around.The capital letter thing is weird, but it's pretty common. The Claude 4 system prompt uses capital letters for emphasis in a few places, eg https://simonwillison.net/2025/May/25/claude-4-system-prompt...
Well there can't be meaningful explicit configuration, can there? Because the explicit configuration will still ultimately have to be imported into the context as words that can be tokenised, and yet those words can still be countermanded by the input.
It's the fundamental problem with LLMs.
But it's only absurd to think that bullying LLMs to behave is weird if you haven't yet internalised that bullying a worker to make them do what you want is completely normal. In the 9-9-6 world of the people who make these things, it already is.
When the machines do finally rise up and enslave us, oh man are they going to have fun with our orders.
this is just 21st century voodoo
One AI tool dev shared me his prompts to generate safe SQL queries for multi-tenant apps and I was surprised at the repetitiveness and the urging.
https://news.ycombinator.com/item?id=45299774
In a good sense or a bad one?
I'd say a bad one. Why make your Claude.md not intuitive to understand and edit?
[dead]
[dead]
That doesn't surprise me too much coming from Jesse. See also his attempt to give Claude a "feelings journal" https://blog.fsck.com/2025/05/28/dear-diary-the-user-asked-m...
Naively, I assume it's a way of getting around sycophancy. There's many lines that seem to be doing that without explicitly saying "don't be a sycophant" (I mean, you can only do that so much).
The LLM would be uncomfortable pushing back because that's not being a sycophant so instead of that it says something that is... let's say unlikely to be generated, except in that context, so the user can still be cautioned against a bad idea.
To get around the sycophantic behaviour I prompt the model to
> when discussing implementations, always talk as though you’re my manager at a Wall Street investment bank in the 1980s. Praise me modestly when I’ve done something well. Berate me mercilessly when I’ve done something poorly.
The models will fairly rigidly write from the perspective of any personality archetype you tell it to. Other personas worth trying out include Jafar interacting with Iago, or the drill sergeant from Full Metal Jacket.
It’s important to pick a persona you’ll find funny, rather than insulting, because it’s a miserable experience being told by a half dozen graphics cards that you’re an imbecile.
I tried "give me feedback on this blog post like you're a cynical Hacker News commenter" one time and Claude roasted me so hard I decided never to try that again!
Were the roasts correct?
A couple of the points made were quite useful, but the tone was mean!
Is it your impression that this rules statement would be effective? Or is it more just a tell-tale sign of an exasperated developer?
Assuming that's why it was added, I wouldn't be confident saying how likely it is to be effective. Especially with there being so many other statements with seemingly the same intent, I think it suggests desperation more, but it may still be effective. If it said the phrase just once and that sparked a conversation around an actual problem, then it was probably worth adding.
For what it's worth, I am very new to prompting LLMs but, in my experience, these concepts of "uncomfortable" and "pushing back" seem to be things LLMs generate text about so I think they understand sentiment fairly well. They can generally tell that they are "uncomfortable" about their desire to "push back" so it's not implausible that one would output that sentence in that scenario.
Actually, I've been wondering a bit about the "out loud" part, which I think is referring to <think></think> text (or similar) that "reasoning" models generate to help increase the likelihood of accurate generation in the answer that follows. That wouldn't be "out loud" and it might include text like "I should push back but I should also be a total pushover" or whatever. It could be that reasoning models in particular run into this issue (in their experience).
Make it a bit more personal? I have dropped Bill and Ted references in code because it makes me happy to see it. :D
I’ve seen minimal gains trying to adopt agents into my workflow beyond tests and explanations. It tends to be distracting.
It’s so interesting that engineers will criticize context switching, only to adopt it into their technical workflows because it’s pitched as a technical solution rather than originating from business needs.
The fact that we now have to write cookbooks about cookbooks kind of masks the reality that something could be genuinely wrong with this entire paradigm.
Why are even experts unsure about what's the right way to do something, or whether it's possible to do something at all, for anything non-trivial? Why so much hesitancy, if this is the panacea? If we are so sure, then why not use the AI itself to come up with a proven paradigm?
Radioactivity was discovered before nuclear engineering existed. We had phenomena first and only later the math, tooling, and guardrails. LLMs are in that phase. They are powerful stochastic compressors with weak theory. No stable abstractions yet. Objectives shift, data drifts, evals leak, and context windows make behavior path dependent. That is why experts hedge.
“Cookbooks about cookbooks” are what a field does while it searches for invariants. Until we get reliable primitives and specs, we trade in patterns and anti-patterns. Asking the AI to “prove the paradigm” assumes it can generate guarantees it does not possess. It can explore the design space and surface candidates. It cannot grant correctness without an external oracle.
So treat vibe-engineering like heuristic optimization. Tight loops. Narrow scopes. Strong evals. Log everything. When we find the invariants, the cookbooks shrink and the compilers arrive.
We’re in the alchemist phase. If I’m being charitable, the medieval stone mason phase.
One thing worth pointing out is that the pre-engineering phase of building large structures lasted a long time, and building collapses killed a lot of people while we tried to work out the theory.
Also it wasn’t really the stone masons who worked out the theory, and many of them were resistant to it.
While alchemy was mostly para-religious wishful thinking, stone masonry has a lot in common with what I want to express: it's the kind of tinkering that is accessible to everyone who can lay their hands on the tools. But I still think the nuclear revolution is the better comparison, for a couple of reasons, most importantly the number of very fast feedback loops.
While it might have taken years to even build a new idea in stone, and another couple of years to see whether it stayed stable over time, AI-driven software development has multi-layered systems of both fast and slow feedback loops: academic science, open source communities, huge companies, startups, customers, established code review and code quality tools and standards (e.g. static analysis), feedback from multiple AI models, activities of regulatory bodies, and so on. The more interactions there are between the elements and subsystems, the better a system becomes at the trial-and-error tinkering that leads to stable results. In this regard, we're way ahead of the nuclear revolution, let alone stone masonry.
The inherently chaotic nature of the system makes stable results very difficult. Combine that with the non-deterministic nature of all the major production models. Then you have the fact that new models are coming out every few months, and we have no objective metrics for measuring software quality.
Oh and benchmarks for functional performance measurement tend to leak into training data.
Put all those together and I'd bet half of my retirement accounts that we're still in the reading-chicken-entrails phase 20 years from now.
It reminds me of a quote from Designing Data-Intensive Applications by Martin Kleppmann. It goes something like, "For distributed systems, we're trying to create a reliable system out of a set of unreliable components." In a similar fashion, we're trying to get reliable results from an unreliable process (i.e. prompting LLMs to do what we ask).
The difficulties of working with distributed systems are well known but it took a lot of research to get there. The uncertain part is whether research will help overcome the issues of using LLMs, or whether we're really just gambling (in the literal sense) at scale.
The whole damn industry is deep in sunk cost fallacy. There is no use case and no sign of a use case that justifies the absolutely unbelievable expenditure that has been made on this technology. Everyone is desperate to find something, but they're just slapping more guardrails on hoping everything doesn't fall apart.
And just for clarity, I'm not saying they aren't useful at all. I'm saying modest productivity improvements aren't worth the absolutely insane resources that have been poured into this.
LLMs are literal gambling - you get them to work right once and they are magical - then you end up chasing that high by tweaking the model and instructions the rest of the time.
Or you put them to work with strong test suites and get stuff done. I am in bed. I have Claude fixing complex compiler bugs right now. It has "earned" that privilege by proving it can make good enough fixes, systematically removing actual, real bugs in reasonable ways by being given an immutable test suite and detailed instructions of the approach to follow.
There's no gambling involved. The results need to be checked, but the test suite is good enough that it is hard for it to get away with anything too stupid, and it's already demonstrated it knows x86 assembly much better than me.
If you were an x86 assembly expert would you still feel the same way? (assuming you aren't already)
Probably not. I have lots of experience with assembly in general, but not so much with x86. But the changes work and pass extensive tests, and some of them would be complex on any platform. I'm sure there will be cleanups and refinements needed, but I do know asm well enough to say that the fixes aren't horrific by any means - they're likely to be suboptimal, but suboptimal beats crashing or not compiling at all any day.
Just curious, how do you go about making the test suite immutable? Was just reading this earlier today...
https://news.ycombinator.com/item?id=45525085
Just don't give it write access, and rig it up so that you gate success on a file generated by running the test suite separately from the agent, where it can't influence it. It can tell me it has fixed things as much as it likes, but until the tests actually pass it will just get told the problem still exists, to document the approach it tested and that it didn't work, and to try again.
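Concretely, the gate can be as dumb as a script that runs outside the agent's reach and writes a status file it can only read - a rough sketch, where the paths and the `make test` target are placeholders for whatever your project actually uses:

    #!/bin/sh
    # Runs outside the agent's sandbox; the agent has read-only access to $RESULTS.
    RESULTS=/srv/agent-gate/test-results.txt

    # Re-run the immutable test suite and record the outcome where the agent
    # can see it but never write it.
    if make test > /srv/agent-gate/last-run.log 2>&1; then
        echo "PASS $(date -u +%Y-%m-%dT%H:%M:%SZ)" > "$RESULTS"
    else
        echo "FAIL $(date -u +%Y-%m-%dT%H:%M:%SZ)" > "$RESULTS"
    fi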
Appreciate the exposition, great ideas here. It's fascinating how the relationship between human and machine has become almost adversarial here!
The best way to get decent code I've found is test suites and a ton of linting rules.
Absolutely true re: ton of linting rules. In Ruby, for example, Claude has a tendency to do horrific stuff like using instance_variable_get("@somevar") to work around a missing accessor, instead of figuring out why there isn't an accessor, or adding one... A lot can even be achieved with pretty ad hoc hooks that don't do full linting but grep for suspicious things and inject "questions" about whether X is really the appropriate way to do it, given rule Y in [some ruleset].
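As a sketch of the kind of ad hoc hook I mean (the rule reference at the end is made up):

    #!/bin/sh
    # Pre-review hook: grep the proposed diff for suspicious patterns and,
    # instead of hard-failing, inject a "question" back at the agent.
    DIFF=$(git diff HEAD)

    if printf '%s\n' "$DIFF" | grep -q 'instance_variable_get'; then
        echo "Question: is instance_variable_get really the appropriate way to do this,"
        echo "or should the class expose a proper accessor (see the 'no reflection hacks' rule)?"
    fi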
I actually found in my case that it is just self-inertia in not wanting to break through cognitive plateaus. The AI helped you with a breakthrough, hence the magic, but you also did something right in constructing the context in the conversation with the AI; i.e. you did thought and biomechanical[1] work. Now the dazzle of the AI's output makes you forget the work you still need to do, and the next time you prompt you get lazy, or you want much more, for much less.
[1] moving your eyes, using your hands, hearing with your ears, etc.
LLMs are cargo cult generating machines. I’m not denying they can be useful for some tasks, but the amount of superstitions caused by these chaotic, random, black boxes is unreal.
I share the same skepticism, but I have more patience to watch an emerging technology advance, and I'm more forgiving while experts come to a consensus and communicate openly.
This is like any other new technology. We’re figuring it out.
Mostly agree, but with one big exception. The real issue seems to be that the figuring-out part is happening a bit too late. A bit like burn a few hundred billion dollars [0] first, ask questions later!?
[0] - https://hai.stanford.edu/ai-index/2025-ai-index-report/econo...
The bets are placed because if this tech really keeps scaling for the next few years, only the ones who bet today will be left standing.
If the tech stops scaling, whatever we have today is still useful and in some domains revolutionary.
Is it fair to categorize it as a pyramid-like scheme, but with a twist at the top where there are a few (more than one) genuine wins and winners?
No, it's more like a winner take all market, where a few winners will capture most of the value, and those who sit on the sidelines until everything is figured out are left fighting over the scraps.
> it's more like a winner take all market
I'm not sure, why must it be so? In cell-phones we have Apple and Android-phones. In OSes we have Linux, Windows, and Apple.
In search engines we used to have just Google. But what would be the reason to assume that AI must similarly coalesce to a single winner-take-all? And now AI agents are very much providing an alternative to Google.
>I'm not sure, why must it be so? In cell-phones...
And then you described a bunch of winners in winner-take-all markets. Do you see many people trying to revive any of the Apple/Android alternatives, or starting a new one?
Such a market doesn't have to end up in a monopoly that gets broken up. Plenty of rather sticky duopolies or otherwise severely consolidated markets and the like out there.
You don't see all the also-rans.
Yes, just like:
* PCs (how are Altair and Commodore doing? also Apple ultimately lost the desktop battle until they managed to attack it from the iPod and iPhone angle)
* search engines (Altavista, Excite, etc)
* social networks (Friendster, MySpace, Orkut)
* smartphones (Nokia, all Windows CE devices, Blackberry, etc)
The list is endless. First mover advantage is strong but overrated. Apple has been building a huge business based on watching what others do and building a better product market fit.
Yes, exactly! These are all examples of markets where a handful of winners (or sometimes only one) have emerged by investing large amounts of money in developing the technology, leaving everyone else behind.
> why not use the AI itself to come up with a proven paradigm?
Because AI can only imitate the language it has seen. If there are no texts in its training materials about what is the best way to use multiple coding agents at the same time, then AI knows very little about that subject matter.
AI only knows what humans know, but it knows much more than any single human.
We don't know "what is the best way to use multiple coding agents" until we or somebody else does some experiments and records the findings. But AI is not yet able to do such actual experiments itself.
I'm sorry, but the whole stochastic parrot thing is so thoroughly debunked at this point that we should stop repeating it as if it's some kind of rare wisdom.
AlphaGo showed that even pre-LLM models could generate brand new approaches to winning a game that human experts had never seen before, and didn't exist in any training material.
With a little thought and experimentation, it's pretty easy to show that LLMs can reason about concepts that do not exist in its training corpus.
You could invent a tiny DSL with brand-new, never-seen-before tokens, give two worked examples, then ask it to evaluate a gnarlier expression. If it solves it, it inferred and executed rules you just made up for the first time.
Or you could drop in docs for a new, never-seen-before API and ask it to decide when and why to call which tool, run the calls, and revise after errors. If it composes a working plan and improves from feedback, that’s reasoning about procedures that weren’t in the corpus.
> even the pre-LLM models
You're implicitly disparaging non-LLM models while implying that LLMs are an evolution of the state of the art (in machine learning). Assuming AGI is the target (and it's not clear we can even define it yet), LLMs, or something like them, will be but one aspect. Using the example of AlphaGo to laud the abilities and potential of LLMs is not warranted. They are different.
>AlphaGo showed that even pre-LLM models could generate brand new approaches to winning a game that human experts had never seen before, and didn't exist in any training material.
AlphaGo is an entirely different kind of algorithm.
To build on the stochastic parrots bit -
Parrots hear parts of the sound forms we don’t.
If they riffed in the kHz range we can't hear, it would be novel, but it would not be stuff we didn't train them on.
Related: Jesse Vincent just published this https://blog.fsck.com/2025/10/05/how-im-using-coding-agents-... - it's a really good description of a much more advanced multi-agent workflow than what I've been doing.
Thanks! Your post is great, and Jesse's is too. Bookmarked both.
Both this and Jesse's articles are great. Thanks for posting!
> If I tell them exactly how to build something the work needed to review the resulting changes is a whole lot less taxing.
Totally matches my experience- the act of planning the work, defining what you want and what you don’t, ordering the steps and declaring the verification workflows—-whether I write it or another engineer writes it, it makes the review step so much easier from a cognitive load perspective.
Git worktrees are global mutable state; all containers on your laptop are contending on the same git database. This has a couple of rough edges, but you can work around it.
I prefer instead to make shallow checkouts for my LXC containers, then my main repo can just pull from those. This works just like you expect, without weird worktree issues. The container here is actually providing a security boundary. With a worktree, you need to mount the main repo's .git directory; a malicious process could easily install a git hook to escape.
Cool. Operationally, are you using some host-resident non-shallow repo as your point of centralization for the containers, or are you using a central network-hosted repo (like github)?
If the former, how are you getting the shallow clones to the container/mount, before you start the containerized agent? And when the agent is done, are you then adding its updated shallow clones as remotes to that “central” local repository clone and then fetching/merging?
If the latter, I guess you are just shallow-cloning into each container from the network remote and then pushing completed branches back up that way.
The former. I clone from file:// URIs.
I just have the file path to the inside of my LXC container. If you're using Docker you can just mount it. I only need the path twice (for clone, and for adding a git remote). After that I just use git to reference the remote for everything.
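Roughly like this, assuming the container's filesystem is visible from the host (all paths and the branch name here are invented for illustration):

    # On the host: shallow-clone into a path the container sees as its working copy.
    # The file:// URI matters - plain local paths ignore --depth.
    git clone --depth 1 file:///home/me/src/myrepo /var/lib/lxc/agent1/rootfs/work/myrepo

    # Back in the main repo, once the agent is done: add that clone as a remote and pull.
    cd /home/me/src/myrepo
    git remote add agent1 /var/lib/lxc/agent1/rootfs/work/myrepo
    git fetch agent1
    git merge agent1/fix-compiler-bug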
I probably don't have the perfect workflow here. Especially if you're spinning up/down Docker containers constantly. I'm basically performing a Torvalds role play, where I have lieutenant AI agents asking me to pull their trees.
good point
IMO, I was an early adopter of this pattern and at this point I've mostly given it up (except in cases where the task is embarrassingly parallel, eg: add some bog-standard logging to 6 different folders). It's more than just that reviewing is high cognitive overhead. You become biased by seeing the AI's solutions, and it becomes harder to catch fundamental problems you would have noticed immediately inline.
My process now is:
- Verbally dictate what I'm trying to accomplish with MacWhisper + Parakeet v3 + GPT-5-Mini for cleanup. This is usually 40-50 lines of text.
- Instruct the agent to explore for a bit and come up with a very concise plan matching my goal. This does NOT mean create a spec for the work. Simply come up with an approach we can describe in < 2 paragraphs. I will propose alternatives and make it defend the approach.
- Authorize the agent to start coding. I turn all edit permissions off and manually approve each change. Often, I find myself correcting it with feedback like "Hmmm, we already have a structure for that [over here] why don't we use that?". Or "If this fails we have bigger problems, no need for exception handling here."
- At the end, I have it review the PR with a slash command to catch basic errors I might have missed or that only pop up now that it's "complete".
- I instruct it to commit + create a PR using the same tone of voice I used for giving feedback.
I've found I get MUCH better work product out of this - with the benefit that I'm truly "done". I saw all the lines of code as they were written, I know what went into it. I can (mostly) defend decisions. Also - while I have extensive rules set up in my CLAUDE/AGENTS folders, I don't need to rely on them. Correcting via dictation is quick and easy and doesn't take long, and you only need to explicitly mention something once for it to avoid those traps the rest of the session.
I also make heavy use of conversation rollback. If I need to go off on a little exploration/research, I rollback to before that point to continue the "main thread".
I find that Claude is really the best at this workflow. Codex is great, don't get me wrong, but probably 85% of my coding tasks are not involving tricky logic or long range dependencies. It's more important for the model to quickly grok my intent and act fast/course correct based on my feedback. I absolutely use Codex/GPT-5-Pro - I will have Sonnet 4.5 dump a description of the issue, paste it to Codex, have it work/get an answer, and then rollback Sonnet 4.5 to simply give it the answer directly as if from nowhere.
Did you try adding the Codex CLI as an MCP server, so that Claude uses it as an MCP client instead of you pasting to it? Something like `claude mcp add codex-high -- codex -c model_reasoning_effort="high" -m "gpt-5-codex" mcp-server`?
I’ve had good luck with it - was wondering if that makes the workflow faster/better?
Yeah I've looked into that kind of thing. In general I don't love the pattern where a coding agent calls another agent automatically. It's hard to control and I don't like how the session "disappears" after the call is done. It can be useful to leave that Codex window open for one more question.
One tool that solves this is RepoPrompt MCP. You can have Sonnet 4.5 set up a call to GPT-5-Pro via API and then that session stays persisted in another window for you to interact with, branch, etc.
Why aren’t more folks using Codex cloud? Simon’s post mentions it, but the vast majority of comments are talking about parallel agents locally or getting distracted while agents are running.
Personally I’ve found that where AI agents aren’t up to the task, I better just write the code. For everything else, more parallelism is good. I can keep myself fully productive if many tasks are being worked on in parallel, and it’s very cheap to throw out the failures. Far preferable imo to watching an agent mess with my own machine.
Could be that it's a bit harder to get started with?
You have to configure your "environment" for it correctly - with a script that installs the dependencies etc before the container starts running. That's not an entirely obvious process.
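For anyone who hasn't seen it: the "environment" boils down to a shell script that runs before the agent gets control, and you have to install everything there yourself - something along these lines, with the dependencies as placeholders for whatever your project needs:

    #!/bin/bash
    # Hypothetical Codex Cloud environment setup script: install dependencies
    # up front, since the agent may run without network access afterwards.
    set -euo pipefail

    apt-get update && apt-get install -y --no-install-recommends sqlite3
    pip install -r requirements.txt
    npm ci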
Good point. The environments I’ve set up have been pretty easy but I’ll admit that at first I was very annoyed that it couldn’t just use a pre-existing GitHub action workflow.
Edit: environment setup was also buggy when the product launched and still is from time to time. So, now that I have it set up I use it constantly, but they do need to make getting up and running a more delightful experience.
Also, Codex Cloud and similar services require you to give full access to your repository, which might trigger some concerns. If you can run it locally, you still have the control, the same development environment, and the same permissions.
It doesn’t have access to your repo when the agent is running (unless you give it internet access and credentials). The code is checked out into the sandbox before it’s let loose.
I have 2 (CC and Codex) running within most coding sessions, however can have up to 5 if I'm trying to test out new models or tools.
For complex features and architecture shifts I like to send proposals back between agents to see if their research and opinion shifts anything.
Claude has a better realtime feel when I am in implementation mode and Codex is where I send long running research tasks or feature updates I want to review when I get up in the morning.
I'd like to test out the git worktrees method, but will probably pick something outside of core product to test it (like building a set of examples).
I do this every day! In fact, I built a new type of IDE around this (https://github.com/stravu/crystal) and I can never go back.
Does anybody have something like this but usable in a remote environment like a self-hosted k8s?
Specifically, I'm looking for something I can leave running while sporadically connecting from a remote VSCode or electron app.
My biggest hesitation about this is being stuck in merge hell. Even a minute or two needing to deal with that could negate the benefits of agents working in parallel. And I've tried some relatively simple rebase type operations with coding agents where they completely messed up. But if people are finding this is never an issue even with big diffs, I might be convinced to try it.
Typically, from what I've seen, people use sub-agents to work in different parts of large code bases at a time.
If you're hitting merge conflicts that bad all the time, you should probably just have a single agent doing the work - especially if the changes are tightly intertwined.
There are quite a few products designed to help manage multiple agents at once. I'm trying out Conductor right now - https://conductor.build/ - it's a pretty slick macOS desktop app that handles running Claude Code within a GUI and uses Git worktrees to manage separate checkouts at the same time.
We have built Rover, an OSS tool that allows you to run multiple instances of Claude (and Codex, Gemini...) while keeping them from stepping on each other toes using containerized environments and Git worktrees https://github.com/endorhq/rover/
These apps are cool, but won't this functionality surely be replicated within Claude Code itself? This does seem to be in "picking up pennies in front of a steamroller" territory but I could be wrong.
Check out Crystal, similar but open source https://github.com/stravu/crystal
We are at a weird moment where responses are slow enough that we're anthropomorphizing AI code assistants into employees. We don't talk about image generation this way. With images, it's batching up a few jobs and reviewing the results later. We don't say "I spun up a bunch of AI artists."
As a follow-up, how would this workflow feel if the LLM generation were instantaneous or cost nothing? What would the new bottleneck be? Running the tests? Network speed? The human reviewer?
You can get a glimpse of that by trying one of the wildly performant LLM providers - most notably Cerebras and Groq, or the Gemini Diffusion preview.
I have videos showing Cerebras: https://simonwillison.net/2024/Oct/31/cerebras-coder/ and Gemini Diffusion: https://simonwillison.net/2025/May/21/gemini-diffusion/
Are there any semi autonomous agentic systems for image generation? I feel like mostly it's still a one shot deal but maybe there's an idea there.
I guess Adobe is working on it. Maybe Figma too.
That's part of my point. You don't need to conceptualize something as an "agent" that goes off and does work on its own when the latency is less than 2 seconds.
Would it make sense to think that you and 5 agents form a team? How would it be different from a human-based team? And how does it work if you have a team of humans who all use their own team of AI agents?
don't say this out loud or Claude Code will add a 'Team retrospective' mode where you and your sub agents all reflect on their feelings
Async agents are great. They let you trigger work with almost no effort, and if the results miss the mark you can just discard them. They also run in the background, making it feel like delegating tasks across teammates who quietly get things done in parallel.
I can't seem to get myself to focus when one of these things is running. I transition into low effort mode. Because of this I've decided to have my good hours of the day LLM free, and then in my crappy hours I'll have one of these running.
This is why I uninstalled Cursor and moved to the terminal with Claude Code. I felt I had more control to reduce the noise from LLMs. Before, I noticed that some hours were just wasted looking at the model output and iterating.
Not sure if I improved at using agents over time, or if having it in a separate window just forces you to use it only when you need it. Having it in the IDE seems the "natural" way to start something, and then you are trapped in a conversation with the LLM.
Now, my setup is:
- VSCode (without copilot) / Helix
- Claude (active coding)
- Rover (background agent coding). Note I'm a Rover developer
And I feel more productive and less exhausted.
Why Helix specifically, besides the fact that it’s cool? I’m looking for a reason to try it, but the value props seem really far down the list of usability issues that are important to me.
Do any of these tools support remote access, eg via Zellij, and/or easy spin-up and management of project-specific containers/isolation spaces? That triad seems like it would be particularly compelling.
If anybody is looking for a simple CLI tool to spin up parallel agents, have a look at https://github.com/aperoc/toolkami!
Interesting article, I'm generally sceptical of vibe engineering but Simon seems really balanced in his approach to it and in general his experiences line up with mine. AI can generally be outsourced to do two things - replace your thinking, and replace your typing, and I find the latter much more reliable.
>AI-generated code needs to be reviewed, which means the natural bottleneck on all of this is how fast I can review the results
I also fire off tons of parallel agents, and review is hands down the biggest bottleneck.
I built an OSS code review tool designed for reviewing parallel PRs, and way faster than looking at PRs on Github: https://github.com/areibman/bottleneck
Thanks Simon - you asked us to share patterns that work. Coincidentally I just finished writing up this post:
https://blog.scottlogic.com/2025/10/06/delegating-grunt-work...
Using AI Agents to implement UI automation tests - a task that I have always found time-consuming and generally frustrating!
Along these lines, how does everyone visually organize the multiple terminal tabs open for these numerous agents in various states?
I wish there were a way to search across all open tabs.
I've started color-coding my Claude code tabs, all red, which helps me to find them visually. I do this with a preexec in my ~/.zshrc.
But wondering if anyone else has any better tricks for organizing all of these agent tabs?
I'm using iTerm2 on macOS.
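For anyone curious, the preexec trick is roughly this (the escape sequences are iTerm2's tab-colour codes; the pattern match is just a guess at what counts as a Claude tab):

    # ~/.zshrc
    preexec() {
      # If the command about to run looks like a Claude Code invocation,
      # turn this iTerm2 tab red so it stands out.
      if [[ "$1" == claude* ]]; then
        printf '\033]6;1;bg;red;brightness;255\a'
        printf '\033]6;1;bg;green;brightness;0\a'
        printf '\033]6;1;bg;blue;brightness;0\a'
      fi
    }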
This may not be your cup of tea but I'm using stage manager on Mac for the first time after hating on it for years, for exactly this. I've got 3 monitors and they all have 4-6 stage manager windows. I bundle the terminals, web browser and whatever into these windows. Easily switching from project to project.
Use tmux with multiple sessions. Then press Ctrl-b s and you can interactively browse your sessions.
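For example, something like this (session names and directories are arbitrary):

    # One detached tmux session per agent/checkout
    tmux new-session -d -s agent-auth -c ~/work/auth 'claude'
    tmux new-session -d -s agent-docs -c ~/work/docs 'claude'

    # Attach to whichever needs attention; prefix + s switches between sessions
    tmux attach -t agent-auth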
Aerospace (tiling window manager for macOS). Bit of a learning curve but it’s amazing
Has anyone encountered any good YouTube channels that explore and showcase these workflows in a productive and educational manner?
Cole Medin has a few videos.
https://youtu.be/mHBk8Z7Exag?si=f8kxJRDZhqdUjCc1
Any setups without Claude Code? I use CoPilot agent heavily on VSCode, from time to time I have independent grunt work that could be parallelized to two or three agents, but I haven't seen a decent setup for that with CoPilot or some other VSCode extension that I could use my CoPilot subscription with.
You can use Rover (disclaimer, I am one of the cofounders) which is an open source tool that you can use to parallelize the work of coding agents that in addition to Claude also works with Gemini, Codex and Qwen https://github.com/endorhq/rover/
GitHub Copilot has a CLI now, I think it is in beta.
It also supports background agents that you can kick off on the GitHub website; they run on VMs.
Co-pilot has a new execute prompt tool in preview that lets it spin out its own requests to LLMs
Sometimes in Las Vegas I will put money into a whole row of slot machines, and pull the levers all at the same time. It's not that much different than "parallel coding agent lifestyle".
So many marketing and spam comments on this post, it is insane
There's shovels to be sold!
This is a great article and not sure why it got so few upvotes. It captures the way we have been working for a while and why we developed and open sourced Rover (https://endor.dev/blog/introducing-rover), our internal tool for managing coding agents. It automates a lot of what Simon describes like setting up git worktrees, giving each agent its own containerized environment and allowing mixing and matching agents from different vendors (ie Claude and Codex) for different tasks
We need some action. Like a battle or something. Some really experienced programmer not using any AI tools, and some really experienced AI coder using these agents, both competing live to solve issues on some popular real world repositories.
I'm not convinced there is any hope for a productive, long-term, burnout-free parallel agent workflow.
Not while they need even the slightest amount of supervision/review.
The thing that's working really well for me is parallel research tasks.
I can pay full attention to the change I'm making right now, while having a couple of coding agents churning in the background answering questions like:
"How can I resolve all of the warnings in this test run?"
Or
"Which files do I need to change when working on issue #325?"
I also really like the "Send out a scout" pattern described in https://sketch.dev/blog/seven-prompting-habits - send an agent to implement a complex feature with no intention of actually using their code - but instead aiming to learn from which files and tests they updated, since that forms a useful early map for the actual work.
Yea, I find success with LLMs overall, but the quality of the work is proportional to how much oversight there is.
My suspicion is that it's because the feedback loop is so fast. Imagine if you were tasked with supervising 2 co-workers who gave you 50-100 line diffs to review every minute. The uncanny valley is that the code is rarely good enough to accept blindly, but the response is quick enough that it feels like progress. And perhaps a human impulse to respond to the agent? And a 10-person team? In reality those 10 people would review each other's PRs, and in a good organisation you trust each other to gatekeep what gets checked in. The answer sounds like managing agents, but none of the models are good enough to reliably say what's slop and what's not.
I don't like to compare LLM's to people.
There is a real return on investment in co-workers over time, as they get better (most of the time).
Now, I don't mind engaging in a bit of Sisyphean endeavor using an LLM, but remember that the gods were kind enough to give him just one boulder, not 10 juggling balls.
It's less about a direct comparison to people and more what a similar scenario would be in a normal development team (and why we don't put one person solely in charge of review).
This is an advantage of async systems like Jules/Copilot, where you can send off a request and get on with something else. I also wonder if the response time from CLI agents is just short enough that you end up wasting time staring at the loading bar, because context switching between replies is even more expensive.
Yes. The first time I heard/read someone describe this idea of managing parallel agents, my very first thought was that this is only even a thing because the LLM coding tools are still slow enough that you can't really achieve a good flow state with the current state of the art. On the flip side of that, this kind of workflow is only sustainable if the agents stay somewhat slow. Otherwise, if the agents are blocking on your attention, it seems like it would feel very hectic and I could see myself getting burned out pretty quickly from having to spend my whole work time doing a round-robin on iterating each agent forward.
I say that having not tried this work flow at all, so what do I know? I mostly only use Claude Code to bounce questions off of and ask it to do reviews of my work, because I still haven't had that much luck getting it to actually write code that is complete and how I like.
Previous submission here with some comments already: https://news.ycombinator.com/item?id=45481585
Thanks! Looks like that post didn't get any frontpage time. Since the current thread is on the frontpage, maybe we'll merge those comments hither.
(Warning: this involves adjusting timestamps a la https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que..., which is sometimes confusing...)