Just the other day I hit something that I hadn't realized could happen. It was not code related in my case, but could happen with code or code-related things (and did to a coworker).
In a discussion here on HN about why a regulation passed 15 years ago was not as general as it could have been, I speculated [1] that it could be that the technology at the time was not up to handling the general case and so they regulated what was feasible at the time.
A couple hours later I checked the discussion again and a couple people had posted that the technology was up to the general case back then and cheap.
I asked an LLM to see if it could dig up anything on this. It told me it was due to technological limits.
I then checked the sources it cites to get some details. Only one source it cited actually said anything about technology limits. That source was my HN comment.
I mentioned this at work, and a coworker mentioned that he had made a Github comment explaining how he thought something worked on Windows. Later he did a Google search about how that thing worked and the LLM thingy that Google puts at the top of search results said that the thing worked the way he thought it did but checking the cites he found that was based on his Github comment.
I'm half tempted to stop asking LLMs questions of the form "How does X work?" and instead tell them "Give me a list of all the links you would cite if someone asked you how X works?".
Essentially, you're asking the LLM to do research and categorize/evaluate that research instead of just giving you an answer. The "work" of accessing, summarizing, and valuing the research yields a more accurate result.
Thank you so much for sharing this. Myself, and I’m sure many of others, are thinking about these things a lot these days. It’s great to see how someone else is coming at the problem.
I love the grounding back to ~“well even a human would be bad at this if they did it the current LLM way.”
Bringing things back to ground truth human processes is something that is surprisingly unnatural for me to do. And I know better, and I preach doing this, and I still have a hard time doing it.
I know far better, but apparently it is still hard for me to internalize that LLMs are not magic.
Most of us probably do the same thing when we read a HN comment about something specific: "This rando seems to know what they're talking about. I'll assume it as fact until I encounter otherwise."
Not doing this might actually cause bigger problems... Getting first-hand experience or even reputable knowledge about something is extremely expensive compared to gut-checking random info you come across. So the "cheap knowledge" may be worth it on balance.
I wish the source citing was more explicit. It would be great if the AI summary said something like, “almost no info about xyz can be found online but one GitHub comment says abc” (link)
Instead it often frames the answer as authoritative
But no one uses LLMs like this. This is the type of simple fact you could just Google and check yourself.
LLMs are useful for providing answers to more complex questions where some reasoning or integration of information is needed.
In these cases I mostly agree with the parent commenter. LLMs often come up with plausibly correct answers, then when you ask to cite sources they seem to just provide articles vaguely related to what they said. If you're lucky it might directly address what the LLM claimed.
I assume this is because what LLMs say is largely just made up, then when you ask for sources it has to retroactively try to find sources to justify what it said, and it often fails and just links something which could plausibly be a source to back up it's plausibly true claims.
I do, and so does Google. When I googled "When was John Howard elected?" the correct answer came back faster in the AI Overview than I could find the answer in the results. The source the AI Overview links even provides confirmation of the correct answer.
Yeah but before AI overviews Google would have shown the first search result with a text snippet directly quoted from the page with the answer highlighted.
Thats just as fast (or faster) than the AI overview
The snippet included in the search result does not include or highlight the relevant fact. I feel like you’re not willing to take simple actions to confirm your assertions.
When I searched, the top result was Wikipedia with the following excerpt: “At the 1974 federal election, Howard was elected as a member of parliament (MP) for the division of Bennelong. He was promoted to cabinet in 1977, and…”
To me this seemed like the relevant detail in the first excerpt.
But after more thought I realize you were probably expecting the date of his election to prime minister which is fair! That’s probably what searchers would be looking for.
It gets more obvious once you start researching stuff that is quite niche, like how to connect a forgotten old USB device to a modern computer and the only person posting about it was a Russian guy on an almost abandoned forum.
They will just make up links. You need to make sure they're actually researching pages. That's what the deep research mode does. That being said, their interpretation of the information in the links is still influenced by their training.
I find it much more intuitive to think of LLMs as fuzzy-indexed frequency based searches combined with grammatically correct probabilistic word generators.
They have no concept of truth or validity, but the frequency of inputs into their training data provides a kind of psuedo check and natural approximation to truth as long as frequency and relationships in the training data also has some relationship to truth.
For a lot of textbook coding type stuff that actually holds: frameworks, shell commands, regexes, common queries and patterns. There's lots of it out there and generally the more common form is spreading some measure of validity.
My experience though is that on niche topics, sparse areas, topics that humans are likely to be emotionally or politically engaged with (and therefore not approximate truth), or things that are recent and therefore haven't had time to generate sufficient frequency, they can get thrown off. And of course it also has no concept of whether what it is finding or reporting is true or not.
This also explains why they have trouble with genuine new programming and not just reimplementing frameworks or common applications because they lack the frequency based or probabilistic grounding to truth and because the new combinations of libraries and code leads to place of relative sparsity in it's weights that leave them unable to function.
The literature/marketing has taken to calling this hallucination, but it's just as easy to think of it as errors produced by probabilistic generation and/or sparsity.
I even curated a list of 6-8 sources in NotebookLM recently, asked a very straight-forward question (which credential formats does OID4VP allow). The sources were IETF and OpenID specs + some additional articles on it.
I wanted to use NotebookLM as a tool to ask back and forth when I was trying to understand stuff. It got the answer 90% right but also added a random format, sounding highly confident as if I asked the spec authors themselves.
It was easy to check the specs when I became suspicious and now my trust, even in "grounded" LLMs, is completely eroded when it comes to knowledge and facts.
Recently, I asked Codex CLI to refactor some HTML files. It didn't literally copy and pasted snippets here and there as I would have done myself, it rewrote them from memory, removing comments in the process. There was a section with 40 successive <a href...> links with complex URLs.
A few days later, just before deployment to production, I wanted to double check all 40 links. First one worked. Second one worked. Third one worked. Fourth one worked. So far so good. Then I tried the last four. Perfect.
Just to be sure, I proceeded with the fifth one. 404. Huh. Weird. The domain was correct though and the URL seemed reasonable.
I tried the other 31 links. ALL of them 404ed. I was totally confused. The domain was always correct. It seemed highly suspicious that all websites would have had moved internal URLs at the same time. I didn't even remember that this part of the code had gone through an LLM.
Fortunately, I could retrieve the old URLs on old git commits. I checked the URLs carefully. The LLM had HALLUCINATED most of the path part of the URLs! Replacing things like domain.com/this-article-is-about-foobar-123456/ by domain.com/foobar-is-so-great-162543/...
These kinds of very subtle and silently introduced mistakes are quite dangerous. Be careful out there!
The last point I think is most important: "very subtle and silently introduced mistakes" -- LLMs may be able to complete many tasks as well (or better) than humans, but that doesn't mean they complete them the same way, and that's critically important when considering failure modes.
In particular, code review is one layer of the conventional swiss cheese model of preventing bugs, but code review becomes much less effective when suddenly the categories of errors to look out for change.
When I review a PR with large code moves, it was historically relatively safe to assume that a block of code was moved as-is (sadly only an assumption because GitHub still doesn't have indicators of duplicated/moved code like Phabricator had 10 years ago...), so I can focus my attention on higher level concerns, like does the new API design make sense? But if an LLM did the refactor, I need to scrutinize every character that was touched in the block of code that was "moved" because, as the parent commenter points out, that "moved" code may have actually been ingested, summarized, then rewritten from scratch based on that summary.
For this reason, I'm a big advocate of an "AI use" section in PR description templates; not because I care whether you used AI or not, but because some hints about where or how you used it will help me focus my efforts when reviewing your change, and tune the categories of errors I look out for.
I was about to write my own tool for this but then I discovered:
git diff --color-moved=dimmed-zebra
That shows a lot of code that was properly moved/copied in gray (even if it's an insertion). So gray stuff exactly matches something that was there before. Can also be enabled by default in the git config.
That's a great solution and I'm adding it to my fallback. But also, people might be interested in diff-so-fancy[0]. I also like using batcat as a pager.
I used autochrome[0] for Clojure code to do this. (I also made some improvements to show added/removed comments, top-level form moves, and within-string/within-comment edits the way GitHub does.)
At first I didn't like the color scheme and replaced it with something prettier, but then I discovered it's actually nice to have it kinda ugly, makes it easier to detect the diffs.
When using a reasonably smart llm, code moves are usually fine, but you have to pay attention whenever uncommon words (like urls or numbers) are involved.
It kind of forces you to always put such data in external files, which is better for code organization anyway.
If it's not necessary for understanding the code, I'll usually even leave this data out entirely when passing the code over.
In Python code I often see Gemini add a second h to a random header file extension. It always feels like the llm is making sure that I'm still paying attention.
I've had similar experience both in coding and in non-coding research questions. An LLM will do the first N right and fake its work on the rest.
It even happens when asking an LLM to reformat a document, or asking it to do extra research to validate information.
For example, before a recent trip to another city, I asked Gemini to prepare a list of brewery taprooms with certain information, and I discovered it had included locations that had been closed for years or had just been pop-ups. I asked it to add a link to the current hours for each taproom and remove locations that it couldn't verify were currently open, and it did this for about the first half of the list. For the last half, it made irrelevant changes to the entries and didn't remove any of the closed locations. Of course it enthusiastically reported that it had checked every location on the list.
LLMs are not good at "cycles" - when you have to go over a list and do the same action on each item.
It's like it has ADHD and forgets or gets distracted in the middle.
And the reason for that is that LLMs don't have memory and process the tokens, so as they keep going over the list the context becomes bigger with more irrelevant information and they can lose the reason they are doing what they are doing.
What I assume he's talking about is internal activations such as stored in KV cache that have same lifetime as tokens in the input, but this really isn't the same as "working memory" since these are tied to the input and don't change.
What it seems an LLM would need to do better at these sort of iterative/sequencing tasks would be a real working memory that had more arbitrary task-duration lifetime and could be updated (vs fixed KV cache), and would allow it to track progress or more generally maintain context (english usage - not LLM) over the course of a task.
I'm a bit surprised that this type of working memory hasn't been added to the transformer architecture. It seems it could be as simple as a fixed (non shifting) region of the context that the LLM could learn to read/write during training to assist on these types of task.
An alternative to having embeddings as working memory is to use an external file of text (cf a TODO list, or working notes) for this purpose which is apparently what Claude Code uses to maintain focus over long periods of time, and I recently saw mentioned that the Claude model itself has been trained to use read/write to this sort of text memory file.
It would be nice if the tools we usually use for LLMs had a bit more programmability. In this example, It we could imagine being able to chunk up work by processing a few items, then reverting to a previous saved LLM checkpoint of state, and repeating until the list is complete.
I imagine that the cost of saving & loading the current state must be prohibitively high for this to be a normal pattern, though.
Which is annoying because that is precisely the kind of boring rote programming tasks I want an LLM to do for me, to free up my time for more interesting problems
Not code, but I once pasted an event announcement and asked for just spelling and grammar check. LLM suggested a new version with minor tweak which I copy pasted back.
Just before sending I noticed that it had moved the event date by one day. Luckily I caught it but it taught me that you never should blindly trust LLM output even with super simple tasks, no relevant context size, clear and simple one sentence prompt.
LLM's do the most amazing things but they also sometimes screw up the simplest of tasks in the most unexpected ways.
>Not code, but I once pasted an event announcement and asked for just spelling and grammar check. LLM suggested a new version with minor tweak which I copy pasted back. Just before sending I noticed that it had moved the event date by one day.
This is the kind of thing I immediately noticed about LLMs when I used them for the first time. Just anecdotally, I'd say it had this problem 30-40% of the time. As time has gone on, it has gotten so much better. But it still makes this kind of problem -- lets just say -- 5% of the time.
The thing is, it's almost more dangerous to rarely make the problem. Because now people aren't constantly looking for it.
You have no idea if it's not just randomly flipping terms or injecting garbage unless you actually validate it. The ideal of giving it an email to improve and then just scanning the result before firing it off is terrifying to me.
5 minutes ago, I asked Claude to add some debug statements in my code. It also silently changed a regex in the code. It was easily caught with the diff but can be harder to spot in larger changes.
I asked Claude to add a debug endpoint to my hardware device that just gave memory information. It wrote 2600 lines of C that gave information about every single aspect of the system. On the one hand kind of cool. It looked at the MQTT code and the update code, the platform (esp) and generated all kinds of code. It recommended platform settings that could enable more detailed information that checked out when I looked at the docs. I ran it and it worked. On the other hand, most of the code was just duplicated over and over again ex: 3 different endpoints that gave overlapping information. About half of the code generated fake data rather than actually do anything with the system.
I rolled back and re-prompted and got something that looked good and worked. The LLMs are magic when they work well but they can throw a wrench into your system that will cost you more if you don't catch it.
I also just had a 'senior' developer tell me that a feature in one of our platforms was deprecated. This was after I saw their code which did some wonky hacky like stuff to achieve something simple. I checked the docs and said feature (URL Rewriting) was obviously not deprecated. When I asked how they knew it was deprecated they said Chat GPT told them. So now they are fixing the fix chat gpt provided.
Claude (possible all LLMs, but I mostly use Claude) LOVES this pattern for some reason. "If <thing> fails/does not exist I'll just silently return a placeholder, that way things break silently and you'll tear your hair out debugging it later!" Thanks Claude
I will also add checks to make sure the data that I get is there even though I checked 8 times already and provide loads of logging statements and error handling. Then I will go to every client that calls this API and add the same checks and error handling with the same messaging. Oh also with all those checks I'm just going to swallow the error at the entry point so you don't even know it happened at runtime unless you check the logs. That will be $1.25 please.
Hah I also happened to use Claude recently to write basic MQTT code to expose some data on a couple Orange Pis I wanted to view in Home Assistant. And it one-shot this super cool mini Python MQTT client I could drop wherever I needed it which was amazing having never worked with MQTT in Python before.
I made some charts/dashboards in HA and was watching it in the background for a few minutes and then realized that none of the data was changing, at all.
So I went and looked at the code and the entire block that was supposed to pull the data from the device was just a stub generating test data based on my exact mock up of what I wanted the data it generated to look like.
Claude was like, “That’s exactly right, it’s a stub so you can replace it with the real data easily, let me know if you need help with that!” And to its credit, it did fix it to use actual data but I re-read my original prompt was somewhat baffling to think it could have been interpreted as wanting fake data given I explicitly asked it to use real data from the device.
I had a pretty long regex in a file that was old and crusty, and when I had Claude add a couple helpers to the file, it changed the formatting of the regex to be a little easier on the eyes in terms of readability.
But I just couldn't trust it. The diff would have been no help since it went from one long gnarly line to 5 tight lines. I kept the crusty version since at least I am certain it works.
I asked it to change some networking code, which it did perfectly, but I noticed some diffs in another file and found it had just randomly expanded some completely unrelated abbreviations in strings which are specifically shortened because of the character limit of the output window.
It was a fairly big refactoring basically converting a working static HTML landing page into a Hugo website, splitting the HTML into multiple Hugo templates. I admit I was quite in a hurry and had to take shortcuts. I didn't have time to write automated tests and had to rely on manual tests for this single webpage. The diff was fairly big. It just didn't occur to me that the URLs would go through the LLMs and could be affected! Lesson learnt haha.
Speaking of agents and tests, here's a fun one I had the other day: while refactoring a large code base I told the agent to do something precise to a specific module, refactor with the new change, then ensure the tests are passing.
The test suite is slow and has many moving parts; the tests I asked it to run take ~5 minutes. The thing decided to kill the test run, then it made up another command it said was the 'tests' so when I looked at the agent console in the IDE everything seemed fine collapsed, i.e. 'Tests ran successfully'.
Obviously the code changes also had a subtle bug that I only saw when pushing its refactoring to CI (and more waiting). At least there were tests to catch the problem.
I think that it's something that model providers don't want to fix, because the amount of times that Claude Code just decided to delete tests that were not passing before I added a memory saying that it would need to ask for my permission to do that was staggering. It stopped happening after the memory, so I believe that it could be easily fixed by a system prompt.
this is why I'm terrified of large LLM slop changesets that I can't check side by side - but then that means I end up doing many small changes that are harder to describe in words than to just outright do.
This and why are the URLs hardcoded to begin with? And given the chaotic rewrite by Codex it would probably be more work to untangle the diff than just do it yourself right away.
I truly wonder how much time we have before some spectacular failure will happen because a LLM was asked to rewrite a file with a bunch of constants in it in critical software and silently messed up or inverted them in a way that looks reasonable and works in your QA environment and then leads to a spectacular failure in the field.
Not related to code... But when I use a LLM to perform a kind of copy/paste, I try to number the lines and ask it to generate a start_index and stop_index to perform the slice operation. Much less hallucinations and very cheap in token generation.
Incorrect data is a hard one to catch, even with automated tests (even in your tests, you're probably only checking the first link, if you're event doing that).
Luckily I've grown a preference for statically typed, compiled, functional languages over the years, which eliminates an entire class of bugs AND hallucinations by catching them at compile time. Using a language that doesn't support null helps too. The quality of the code produced by agents (claude clode and codex) is insanely better than when I need to fix some legacy code written in a dynamic language. You'll sometimes catch the agent hallucinating and continuously banging it's head against the wall trying to get it's bad code to compile. It seems to get more desperate and may eventually figure out a way to insert some garbage to get it to compile or just delete a bunch of code and paper over it... but it's generally very obvious when it does this as long as you're reviewing. Combine this with git branches and a policy of frequent commits for greatest effect.
You can probably get most of the way there with linters and automated tests with less strict dynamic languages, but... I don't see the point for new projects.
I've even found Codex likes to occasionally make subtle improvements to code located in the same files but completely unrelated to the current task. It's like some form of AI OCD. Reviewing diffs is kind of essential, so using a foundation that reduces the size of those diffs and increases readability is IMO super important.
Reminds me when I asked Claude (through Windsurf) to create a S3 Lambda trigger to resize images (as soon as PNG image appears in S3, resize it). The code looked flawless and I deployed ..only to learn that I introduced a perpetual loop :) For every image resized, a new one would be created and resized. In 5 min, the trigger created hundreds of thousands of images ...what a joy was to clean that up in S3
Interesting, I've seen similar looking behavior in other forms of data extraction. I took a picture of a bookshelf and asked it to list the books. It did well in the beginning but by the middle, it had started making up similar books that were not actually there.
My custom prompt instructs GPT to output changes to code as a diff/git-patch. I don’t use agents because it makes it hard to see what’s happening and I don’t trust them yet.
I’ve tried this approach when working in chat interfaces (as opposed to IDEs), but I often find it tricky to review diffs without the full context of the codebase.
That said, your comment made me realize I could be using “git apply”more effectively to review LLM-generated changes directly in my repo. It’s actually a neat workflow!
"...very subtle and silently introduced mistakes are quite dangerous..."
In my view these perfectly serve the purpose of encouraging you to keep burning tokens for immediate revenue as well as potentially using you to train their next model at your expense.
"very subtle and silently introduced mistakes" - that's the biggest bottleneck I think; as long as it's true, we need to validate LLMs outputs; as long as we must validate LLMs outputs, our own biological brains are the ultimate bottleneck
In these cases I explicitly tell the llm to make as few changes as possible and I also run a diff. And then I reiterate with a new prompt if too many things changed.
You can always run a diff. But how good are people at reading diffs? Not very. It's the kind of thing you would probably want a computer to do. But now we've got the computer generating the diffs (which it's bad at) and humans verifying them (which they're also bad at).
You’re just not using LLMs enough. You can never trust the LLM to generate a url, and this was known over two years ago. It takes one token hallucination to fuck up a url.
It’s very good at a fuzzy great answer, not a precise one. You have to really use this thing all the time and pick up on stuff like that.
Yeah so, the reason people use various tools and machines in the first place is to simplify the work or everydays tasks by : 1) Making the tasks execute faster 2) Getting more reliable outputs then doing this by yourself 3) Making it repeatable . The LLMs obviously dont check any of these boxes so why don´t we stop pretending that we as users are stupid and don´t know how to use them and start taking them for what they are - cute little mirages, perhaps applicable as toys of some sort, but not something we should use for serious engineering work really?
> why don´t we stop pretending that we as users are stupid and don´t know how to use them
This is in response to someone who saw a bunch of URLs coming out of it and was surprised at a bunch of them being wrong. That's using the tool wrong. It's like being surprised that the top results in google/app store/play store aren't necessarily the best match for your query but actually adverts!
The URLs being wrong in that specific case is one where they were using the "wrong tool". I can name you at least a dozen other cases from own experience, where too, they appear to be the wrong tool, for example for working with Terraform or for not exposing secrets by hardcoding them in the frontend. Et cetera. Many other people will have contributed thousands if not more similar but different cases. So what good are these tools then for really? Are we all really that stupid? Many of us mastered the hard problem of navigating various abstraction layers of computer over the years, only to be told, we now effing dont know how to write a few sentences in English? Come on. I'd be happy to use them in whatever specific domain they supposedly excel at. But no-one seems to be able to identify one for sure. The problem is, the folks pushing or better said, shoving down these bullshit generators down our throats are trying to sell us the promise of an "everything oracle". What did old man Altman tell us about ChatGPT 5? PhD level tool for code generation or some similar nonsense? But it turns out it only gets one metric right each time - generating a lot of text. So, essentially, great for bullshit jobs (i count some of the IT jobs as such too), but not much more.
> Many of us mastered the hard problem of navigating various abstraction layers of computer over the years, only to be told, we now effing dont know how to write a few sentences in English? Come on.
If you're trying to one shot stuff with a few sentences then yes you might be using these things wrong. I've seen people with PhDs fail to use google successfully to find things, were they idiots? If you're using them wrong you're using them wrong - I don't care how smart you are in other areas. If you can't hand off work knowing someones capabilities then that's a thing you can't do - and that's ok. I've known unbelievably good engineers who couldn't form a solid plan to solve a business problem or collaboratively work to get something done to save their life. Those are different skills. But gpt5-codex and sonnet 4 / 4.5 can solidly write code, gpt-5-pro with web search can really dig into things, and if you can manage what they can do you can hand off work to them. If you've only ever worked with juniors with a feeling of "they slow everything down but maybe someday they'll be as useful as me" then you're less likely to succeed at this.
Let's do a quick overview of recent chats for me:
* Identifying and validating a race condition in some code
* Generating several approaches to a streaming issue, providing cost analyses of external services and complexity of 3 different approaches about how much they'd change the code
* Identifying an async bug two good engineers couldn't find in a codebase they knew well
* Finding performance issues that had gone unnoticed
* Digging through synapse documentation and github issues to find a specific performance related issue
* Finding the right MSC for a feature I wanted to use but didn't know existed - and then finding the github issue that explained how it was only half implemented and how to enable the experimental other part I needed
* Building a bunch of UI stuff for a short term contract I needed, saving me a bunch of hours and the client money
* Going through funding opportunities and matching them against a charity I want to help in my local area
* Building a search integration for my local library to handle my kids reading challenge
* Solving a series of VPN issues I didn't understand
* Writing a lot of astro related python for an art project to cover the loss of some NASA images I used to have access to.
> the folks pushing or better said
If you don't want to trust them, don't. Also don't believe the anti-hype merchants who want to smugly say these tools can't do a god damn thing. They're trying to get attention as well.
Again mate, stop making arrogant assumptions and read some of my previous comments. I and my team are early adopters, since about 2 years. I am even paying for premium-level service. Trust me, it sucks and under-delivers. But good for you and others who claim they are productive with it - I am sure we will see those 10x apps rolling in soon, right? It's only been like 4 years since the revolutionary magic machine was announced.
I read your comments. Did you read mine? You can pass them into chatgpt or claude or whatever premium services you pay for to summarise them for you if you want.
> Trust me, it sucks
Ok. I'm convinced.
> and under-delivers.
Compared to what promise?
> I am sure we will see those 10x apps rolling in soon, right?
Did I argue that? If you want to look at some massive improvements, I was able to put up UIs to share results & explore them with a client within minutes rather than it taking me a few hours (which from experience it would have done).
> It's only been like 4 years since the revolutionary magic machine was announced.
It's been less than 3 since chatgpt launched, which if you'd been in the AI sphere as long as I had (my god it's 20 years now) absolutely was revolutionary. Over the last 4 years we've seen gpt3 solve a bunch of NLP problems immediately as long as you didn't care about cost to gpt-5-pro with web search and codex/sonnet being able to explore a moderately sized codebase and make real and actual changes (running tests and following up with changes). Given how long I spent stopping a robot hitting the table because it shifted a bit and its background segmentation messed up, or fiddling with classifiers for text, the idea I can get a summary from input without training is already impressive and then to be able to say "make it less wanky" and have it remove the corp speak is a huge shift in the field.
If your measure of success is "the CEOs of the biggest tech orgs say it'll do this soon and I found a problem" then you'll be permanently disappointed. It'd be like me sitting here saying mobile phones are useless because I was told how revolutionary the new chip in an iphone was in a keynote.
Since you don't seem to want to read most of this, most isn't for you. The last bit is, and it's just one question:
Why are you paying for something that solves literally no problems for you?
> This is in response to someone who saw a bunch of URLs coming out of it and was surprised at a bunch of them being wrong. That's using the tool wrong. It's like being surprised that the top results in google/app store/play store aren't necessarily the best match for your query but actually adverts!
The CEO of Anthropic said I can fire all of my developers soon. How could one possibly be using the tool wrong? /s
Stop what mate? My words are not the words of someone who ocassionally dabbles in the free ChatGPT layer - I've been paying premium tier AI tools for my entire company for a long time now. Recently we had to scale back their usage to just consulting mode, i.e. because the agent mode has gone from somewhat-useful to complete waste of time. We are now back to using them as replacement for the now entshittified search. But as you can see by my early adopting of these crap-tools, I am open-minded. I'd love to see what great new application you have built using them. But if you don't have anything to show, I'll also take some arguments, you know, like the stuff I provided in my first comment.
I'll take the L when llms can actually do my job to the level I expect. Llms can do some of my work but they are tiring they make mistakes and they absolutely get confused by a sufficiently complex and large codebase.
Quite frankly, not being able to discuss the pros and the cons of a technology with other engineers absolutely hinders innovation. A lot of discoveries come out of mistakes.
Why is the bar for it to do your job or completely replace you? It's a tool. If it makes you 5% better at your job, then great. There's a recent study showing it has 15-20% productivity benefits: not completely useless, not 10x. I hope we can have nuance in the conversation.
...and then there was also a recent MIT study showing it was making everyone less productive. The bar is there because this is how all the AI grifters have been selling this technology - no less than end of work itself. Why should we not hold them accountable for over-promising and under-delivering? Or is that reserved just for the serfs?
Read about the jagged frontier. IanCal is right: this is a perfect example of using the tool wrong; you’ve focused on a very narrow use case which is surprisingly hard for the matmuls to not mess up and extrapolate, but extrapolation is incorrect here because the capability frontier is fractal and not continuous.
It’s not surprisingly hard at all, when you consider they have no understanding of the tasks they do nor of the subject material. It’s just a good example of the types of tasks (anything requiring reliability or correct results) that they are fundamentally unsuited to.
Sadly it seems the best use-case for LLMs at this point is bamboozling humans.
When you take a step back it's surprising that these tools can be actually useful at all in nontrivial tasks, but being surprised doesn't matter in the grand scheme of things. Bamboozling rarely enough for harnesses to keep them in line and ability to inference-time self-correct when bamboozling is detected either by the model itself or by the harness is very useful at least in my work. It's a question of using the tool correctly and understanding its limitations, which is hard if you aren't willing to explore the boundaries and commit to doing it every month basically.
They're moderately unreliable text copying machines if you need exact copying of long arbitrary strings. If that's what you want, don't use LLMs. I don't think they were ever really sold as that, and we have better tools for that.
On the other hand, I've had them easily build useful code, answer questions and debug issues complex enough to escape good engineers for at least several hours.
Depends what you want. They're also bad (for computers) at complex arithmetic off the bat, but then again we have calculators.
> I don't think they were ever really sold as that, and we have better tools for that.
We have OpenAI calling gpt5 as having PhD level of intelligence and others like Anthropoc saying it will write all our code within months. Some are claiming it’s already writing 70%.
I say they are being sold as a magical do everything tool.
Intelligence isn't the same as "can exactly replicate text". I'm hopefully smarter than a calculator but it's more reliable at maths than me.
Also there's a huge gulf between "some people claim it can do X" and "it's useful". Altman promising something new doesn't decrease the usefulness of a model.
What you are describing is "dead reasoning zones".[0]
"This isn't how humans work. Einstein never saw ARC grids, but he'd solve them instantly. Not because of prior knowledge, but because humans have consistent reasoning that transfers across domains. A logical economist becomes a logical programmer when they learn to code. They don't suddenly forget how to be consistent or deduce.
But LLMs have "dead reasoning zones" — areas in their weights where logic doesn't work. Humans have dead knowledge zones (things we don't know), but not dead reasoning zones. Asking questions outside the training distribution is almost like an adversarial attack on the model."
They are lying, because their salary depends on them lying about it. Why does it even matter what they're saying? Why don't we listen to scientists, researchers, practicioners and the real users of the technology and stop repeating what the CEOs are saying?
The things they're saying are technically correct, the best kind of correct. The models beat human PhDs on certain benchmarks of knowledge and reasoning. They may write 70% of the easiest code in some specific scenario. It doesn't matter. They're useful tools that can make you slightly more productive. That's it.
When you see on tv that 9 out of 10 dentists recommend a toothpaste what do you do? Do you claim that brushing your teeth is a useless hype that's being pushed by big-tooth because they're exaggerating or misrepresenting what that means?
> When you see on tv that 9 out of 10 dentists recommend a toothpaste what do you do? Do you claim that brushing your teeth is a useless hype that's being pushed by big-tooth because they're exaggerating or misrepresenting what that means?
Only after schizophrenic dentists go around telling people that brushing their teeth is going to lead to a post-scarcity Star Trek world.
It's a new technology which lends itself well to outrageous claims and marketing, but the analogy stands. The CEOs don't get to define the narrative or stand as strawman targets for anti-AI folks to dunk on, sorry. Elon has been repeating "self driving next year" for a decade+ at this point, that doesn't make what Waymo did unimpressive. This level of cynicism is unwarranted is what I'm saying.
Not what I said at all. Question it all what you want. But disproving outrageous CEO claims doesn't get you there. Whether LLMs are AGI/ASI that will replace everyone is seperate from whether they are useful today as tools. Attacking the first claim doesn't mean much for the second claim, which is the more interesting one.
LLMs aren’t high school students, they’re blobs of numbers which happen to speak English if you poke them right. Use the tool when it’s good at what it does.
Well, you see it hallucinates on long precise strings, but if we ignore that, and focus on what it’s powerful at, we can do something powerful. In this case, by the time it gets to outputting the url, it already determined the correct intent or next action (print out a url). You use this intent to do a tool call to generate a url. Small aside, it’s ability to figure what and why is pure magic, for those still peddling the glorified autocomplete narrative.
You have to be able to see what this thing can actually do, as opposed to what it can’t.
I can’t even tell if you’re being sarcastic about a terrible tool or are hyping up LLMs as intelligent assistants and telling me we’re all holding it wrong.
This is very poorly worded. Using LLMs more wouldn't solve the problem. What you're really saying is that the GP is uninformed about LLMs.
This may seem like pedantry on my part but I'm sick of hearing "you're doing it wrong" when the real answer is "this tool can't do that." The former is categorically different than the latter.
It's pretty clearly worded to me, they don't use LLMs enough to know how to use them successfully. If you use them regularly you wouldn't see a set of urls without thinking "Unless these are extremely obvious links to major sites, I will assume each is definitely wrong".
> I'm sick of hearing "you're doing it wrong"
That's not what they said. They didn't say to use LLMs more for this problem. The only people that should take the wrong meaning from this are ones who didn't read past the first sentence.
> when the real answer is "this tool can't do that."
> If you use them regularly you wouldn't see a set of urls without thinking...
Sure, but conceivably, you could also be informed of this second hand, through any publication about LLMs, so it is very odd to say "you don't use them enough" rather than "you're ignorant" or "you're uninformed". It is very similar to these very bizarre AI-maximalist positions that so many of us are tired of seeing.
This isn't ai maximalist though, it's explicitly pointing out something that regularly does not work!
> Sure, but conceivably, you could also be informed of this second hand, through any publication about LLMs, so it is very odd to say "you don't use them enough" rather than "you're ignorant" or "you're uninformed".
But this is to someone who is actively using them, and the suggestion of "if you were using them more actively you'd know this, this is a very common issue" is not at all weird. There are other ways they could have known this, but they didn't.
"You haven't got the experience yet" is a much milder way of saying someone doesn't know how to use a tool properly than "you're ignorant".
I think part of the issue is that it doesn't "feel" like the LLM is generating a URL, because that's not what a human would be doing. A human would be cut & pasting the URLs, or editing the code around them - not retyping them from scratch.
Edit: I think I'm just regurgitating the article here.
I have a project that I've leaned heavily on LLM help for which I consider to embody good quality control practices. I had to get pretty creative to pull it off: spent a lot of time working on this sync system so that I can import sanitized production data into the project for every table it touches (there are maybe 500 of these) and then there's a bunch of hackery related to ensuring I can still get good test coverage even when some of these flows are partially specified (since adding new ones proceeds in several separate steps).
If it was a project written by humans I'd say they were crazy for going so hard on testing.
The quality control practices you need for safely letting an LLM run amok aren't just good. They're extreme.
This is of course bad but: humans also makes (different) mistakes all the time. We could account for the risk of mistakes being introduced and make more tools that validate things for us. In a way LLM:s encourage us to do this by adding other vectors of chaos into our work.
Like, why not have tools built into our environment that checks that links are not broken? With the right architecture we could have validations for most common mistakes without having the solution adding a bunch of tedious overhead.
In the above kind of described situation, a meticulous coder actually makes no mistakes. They will however make a LOT more mistakes if they use LLM's to do the same.
I have already had to correct a LOT of crap similar to the above in refactoring-done-via-LLM over the last year.
When stuff like this was done by a plain, slow, organic human, it was far more accurate. And many times, completely accurate with no defects. Simply because many developers pay close attention when they are forced to do the manual labour themselves.
Sure the refactoring commit is produced faster with LLM assistance, but repeatedly reviewing code and pointing out weird defects is very stressful.
> I have already had to correct a LOT of crap similar to the above in refactoring-done-via-LLM over the last year
The person using the LLM should be reviewing their code before submitting it to you for review. If you can catch a copy paste error like this, then so should they.
The failure you're describing is that your coworkers are not doing their job.
And if you accept "the LLM did that, not me" as an excuse then the failure is on you and it will keep happening.
A meticulous coder probably wouldn't have typed out 40 URLs just because they want to move them from one file to another. They would copy-past them and run some sed-like commands. You could instruct an LLM agent to do something similar. For modifying a lot of files or a lot of lines, I instruct them to write a script that does what I need instead of telling them to do it themselves.
I think it goes without saying that we need to be sceptical when to use and not use LLM. The point I'm trying to make is more that we should have more validations and not that we should be less sceptical about LLMs.
Meticulousness shouldn't be an excuse to not have layers of validation that doesn't have to cost that much if done well.
Your point to not rely on good intentions and have systems in place to ensure quality is good - but your comparison to humans didn't go well with me.
Very few humans fill in their task with made up crap then lie about it - I haven't met any in person. And if I did, I wouldn't want to work with them, even if they work 24/7.
Obligatory disclaimer for future employers: I believe in AI, I use it, yada yada. The reason I'm commenting here is I don't believe we should normalise this standard of quality for production work.
I agree, these kinds of stories should encourage us to setup more robust testing/backup/check strategies. Like you would absolutely have to do if you suddenly invited a bunch of inexperienced interns to edit your production code.
Agreed with the points in that article, but IMHO the no 1 issue is that agents only see a fraction of the code repository. They don't know whether there is a helper function they could use, so they re-implement it. When contributing to UIs, they can't check the whole UI to identify common design patterns, so they re-invent it.
The most important task for the human using the agent is to provide the right context. "Look at this file for helper functions", "do it like that implementation", "read this doc to understand how to do it"... you can get very far with agents when you provide them with the right context.
(BTW another issue is that they have problems navigating the directory structure in a large mono repo. When the agents needs to run commands like 'npm test' in a sub-directory, they almost never get it right the first time)
This is what I keep running into. Earlier this week I did a code review of about new lines of code, written using Cursor, to implement a feature from scratch, and I'd say maybe 200 of those lines were really necessary.
But, y'know what? I approved it. Because hunting down the existing functions it should have used in our utility library would have taken me all day. 5 years ago I would have taken the time because a PR like that would have been submitted by a new team member who didn't know the codebase well, and helping to onboard new team members is an important part of the job. But when it's a staff engineer using Cursor to fill our codebase with bloat because that's how management decided we should work, there's no point. The LLM won't learn anything and will just do the same thing over again next week, and the staff engineer already knows better but is being paid to pretend they don't.
>>because that's how management decided we should work, there's no point
If you are personally invested, there would be a point. At least if you plan to maintain that code for a few more years.
Let's say you have a common CSS file, where you define .warning {color: red}. If you want the LLM to put out a warning and you just tell it to make it red, without pointing out that there is the .warning class, it will likely create a new CSS def for that element (or even inline it - the latest Claude Code has a tendency to do that). That's fine and will make management happy for now.
But if later management decides that it wants all warning messages to be pink, it may be quite a challenge to catch every place without missing one.
There really wouldn't be; it would just be spitting into the wind. What am I going to do, convince every member of my team to ignore a direct instruction from the people who sign our paychecks?
I really really hate code review now. My colleagues will have their LLMs generate thousands of lines of boiler plate with every pattern and abstraction under the sun. A lazy programmer use to do the bare minimum and write not enough code. That made review easy. Error handling here, duplicate code there, descriptive naming here, and so on. Now a lazy programmer generates a crap load of code cribbed from "best practice" tutorials, much of it unnecessary and irrelevant for the actual task at hand.
> When the agents needs to run commands like 'npm test' in a sub-directory, they almost never get it right the first time)
I was running into this constantly on one project with a repo split between a Vite/React front end and .NET backend (with well documented structure). It would sometimes go into panic mode after some npm command didn’t work repeatedly and do all sorts of pointless troubleshooting over and over, sometimes veering into destructive attempts to rebuild whatever it thought was missing/broken.
I kept trying to rewrite the section in CLAUDE.md to effectively instruct it to always first check the current directory to verify it was in the correct $CLIENT or $SERVER directory. But it would still sometimes forget randomly which was aggravating.
I ended up creating some aliases like “run-dev server restart” “run-dev client npm install” for common operations on both server/client that worked in any directory. Then added the base dotnet/npm/etc commands to the deny list which forced its thinking to go “Hmm it looks like I’m not allowed to run npm, so I’ll review the project instructions. I see, I can use the ‘run-dev’ helper to do $NPM_COMMAND…”
It’s been working pretty reliably now but definitely wasted a lot of time with a lot of aggravation getting to that solution.
Large context models don't do a great job of consistently attending to the entire context, so it might not work out as well in practice as continuing to improve the context engineering parts of coding agents would.
I'd bet that most the improvement in Copilot style tools over the past year is coming from rapid progress in context engineering techniques, and the contribution of LLMs is more modest. LLMs' native ability to independently "reason" about a large slushpile of tokens just hasn't improved enough over that same time period to account for how much better the LLM coding tools have become. It's hard to see or confirm that, though, because the only direct comparison you can make is changing your LLM selection in the current version of the tool. Plugging GPT5 into the original version of Copilot from 2021 isn't an experiment most of us are able to try.
Claude can use use tools to do that, and some different code indexer MCPs work, but that depends on the LLM doing the coding to make the right searches to find the code. If you are in a project where your helper functions or shared libs are scattered everywhere it’s a lot harder.
Just like with humans it definitely works better if you follow good naming conventions and file patterns. And even then I tend to make sure to just include the important files in the context or clue the LLM in during the prompt.
It also depends on what language you use. A LOT. During the day I use LLMs with dotnet and it’s pretty rough compared to when I’m using rails on my side projects. Dotnet requires a lot more prompting and hand holding, both due to its complexity but also due to how much more verbose it is.
Well, sure, but from what I know, humans are way better at following 'implicit' instructions than LLMs. A human programmer can 'infer' most of the important basic rules from looking at the existing code, whereas all this agents.md/claude.md/whatever stuff seems necessary to even get basic performance in this regard.
Also, the agents.md website seems to mostly list README.md-style 'how do I run this instructions' in its example, not stylistic guidelines.
Furthermore, it would be nice if these agents add it themselves. With a human, you tell them "this is wrong, do it that way" and they would remember it. (Although this functionality seems to be worked on?)
From the article:
> I contest the idea that LLMs are replacing human devs...
AI is not able to replace good devs. I am assuming that nobody sane is claiming such a thing today. But, it can probably replace bad and mediocre devs. Even today.
In my org we had 3 devs who went through a 6-month code boot camp and got hired a few years ago when it was very difficult to find good devs. They struggled. I would give them easy tasks and then clean up their PRs during review. And then AI tools got much better and it started outperforming these guys. We had to let two go. And third one quit on his own.
We still hire devs. But have become very reluctant to hire junior devs. And will never hire someone from a code boot camp. And we are not the only ones. I think most boot camps have gone out of business for this reason.
Will AI tools eventually get good enough to start replacing good devs? I don't know. But the data so far shows that these tools keep getting better over time. Anybody who argues otherwise has their heads firmly stuck in sand.
In the early US history approximately 90% of the population was involved in farming. Over the years things changed. Now about 2% has anything to do with farming. Fewer people are farming now. But we have a lot more food and a larger variety available. Technology made that possible.
It is totally possible that something like that could happen to the software development industry as well. How fast it happens totally depends on how fast do the tools improve.
A computer science degree in most US colleges takes about 4 years of work. Boot camps try to cram that into 6 months. All the while many students have other full-time jobs. This is simply not enough training for the students to start solving complex real world problem. Even 4 years is not enough.
Many companies were willing to hire fresh college grads in the hopes that they could solve relatively easy problems for a few years, gain experience and become successful senior devs at some point.
However, with the advent of AI dev tools, we are seeing very clear signs that junior dev hiring rates have fallen off a cliff. Our project manager, who has no dev experience, frequently assigns easy tasks/github issues to Github Copilot. Copilot generates a PR in a few minutes that other devs can review before merging. These PRs are far superior to what an average graduate of a code boot camp could ever create. Any need we had for a junior dev has completely disappeared.
That's the question that has been stuck in my head as I read all these stories about junior dev jobs disappearing. I'm firmly mid-level, having started my career just before LLM coding took off. Sometimes it feels like I got on the last chopper out of Saigon.
My experience with them is that the are taught to cover as much syntax and libraries as possible, without spending time learning how solve problems and develop their own algorithms. They (in general) expect to follow predefined recipes.
On a more important level, I found that they still do really badly at even a minorly complex task without extreme babysitting.
I wanted it to refactor a parser in a small project (2.5K lines total) because it'd gotten a bit too interconnected. It made a plan, which looked reasonable, so I told it to do this in stages, with checkpoints.
It said it'd done so. I asked it "so is the old architecture also removed?" "No, it has not been removed." "Is the new structured used in place of the old one?" "No, it has not."
After it did so, 80% of the test suite failed because nothing it'd written was actually right.
Did so three times with increasingly more babysitting, but it failed at the abstract task of "refactor this" no matter what with pretty much the same failure mode. I feel like I have to tell it exactly to make changes X and Y to class Z, remove class A etc etc, at which point I can't let it do stuff unsupervised, which is half of the reason for letting an LLM do this in the first place.
> I wanted it to refactor a parser in a small project
This expression tree parser (typescript to sql query builder - https://tinqerjs.org/) has zero lines of hand-written code. It was made with Codex + Claude over two weeks (part-time on the side). Having worked on ORMs previously, it would have taken me 4x-10x the time to get to the same state (which also has 100s of tests, with some repetitions). That's a massive saving in time.
I did not have to baby sit the LLMs at all. So the answer is, I think it depends on what you use it for, and how you use it. Like every tool, it takes a really long time to find a process that works for you. In my conversations with other developers who use LLMs extensively, they all have their unique, custom workflows. All of them however do focus on test suites, documentation, and method review processes.
Question - this loads a 2 MB JS parser written in Rust to turn `x => x.foo` into `{ op: 'project', field: 'foo', target: 'x' }`. But you don't actually allow any complex expressions (and you certainly don't seem to recursively parse references or allow return uplift, e. g. I can't extract out `isOver18` or `isOver(age: int)(Row: IQueryable): IQueryable`). Why did you choose the AST route instead of doing the same thing with a handful of regular expressions?
Parsing code with regex is a minefield. You can get it to work with simpler cases, but even that might get complex very quickly with all sorts of formatting preferences that people have. In fact, I'll be very surprised if it can be done with a few regular expressions; so I never gave it much consideration. Additionally, improved subquery support etc is coming, involving deeper recursion.
I could have allowed (I did consider it) functions external to the expression, like isOver18 in your example. But it would have come at the cost of the parser having to look across the code base, and would have required tinqerjs to attach via build-time plugins. The only other way (without plugins) might be to identify callers via Error.stack, and attempting to find the calling JS file.
Development tools and libraries seem like they may be one of the absolute easiest use cases to get LLMs to work with since they generally have far less ambiguous requirements than other software and the LLMs generally have an enormous amount of data in their training set to help them understand the domain.
I have tried several. Overall I've now set on strict TDD (which it still seems to not do unless I explicitly tell it to even though I have it as a hard requirement in claude.md).
Claude forgets claude.md after a while, so you need to keep reminding. I find that codex does a design job better than Claude at the moment, but it's 3x slower which I don't mind.
Hum yeah, it shows. Just the fact that the API looks completely different for Postgre and SQLite tells us everything we need to know about the quality of the project here.
> Just the fact that the API looks completely different for Postgre and SQLite tells us everything we need to know about the quality of the project here.
How does the API look completely different for pg and sqlite? Can you share an example?
It's an implementation of LINQ's IQueryable. With some bells missing in DotNet's Queryable, like Window functions (RANK queries etc) which I find quite useful.
Add: What you've mentioned is largely incorrect. But in any case, it is a query builder. Meaning, an ORM like database abstraction is not the goal. This allows us to support pg's extensions, which aren't applicable to other database.
I guess the interesting question is whether @jeswin could have created this project at all if AI tools were not involved. And if yes, would the quality even be better?
There are two examples on the landing page, and they both look quite different. Surely if the API is the same for both, there'd be just one example that covers both cases, or two examples would be deliberately made as identical as possible? (Like, just a different new somewhere, or different import directive at the top, and everything else exactly the same?) I think that's the point.
Perhaps experienced users of relevant technologies will just be able to automatically figure this stuff out, but this is a general discussion - people not terribly familiar with any of them, but curious about what a big pile of AI code might actually look like, could get the wrong impression.
If you're mentioning the first two examples, they're doing different things. The pg example does an orderby, and the sqlite example does a join. You'll be able to switch the client (ie, better-sqlite and pg-promise) in either statement, and the same query would work on the other database.
Maybe I should use the same example repeated for clarity. Let me do that.
>I feel like I have to tell it exactly to make changes X and Y to class Z, remove class A etc etc, at which point I can't let it do stuff unsupervised, which is half of the reason for letting an LLM do this in the first place.
The reason better turn to "It can do stuff faster than I ever could if I give it step by step high level instructions" instead.
That would be a solution, yes. But currently it feels extremely borked from a UX perspective. It purports to be able to do this, but when you tell it to it breaks in unintuitive ways.
I hate this idea of "well you just need to understand all the arcane ways in which to properly use it to its proper effects".
It's like a car which has a gear shifter, but that's not fully functional yet, so instead you switch gear by spelling out in morse code the gear you want to go into using L as short and R as long. Furthermore, you shouldn't try to listen to 105-112 on the FM band on the radio, because those frequencies are used to control the brakes and ABS and if you listen to those frequencies the brakes no longer work.
We would rightfully stone any engineer who'd design this and then say "well obvious user error" when the user rightfully complains that they crash whenever they listen to Arrow FM.
>But currently it feels extremely borked from a UX perspective. It purports to be able to do this, but when you tell it to it breaks in unintuitive ways.
Thankfully as programmers we know better and don't need to care what the UI pretends to be able to do :)
>We would rightfully stone any engineer who'd design this and then say "well obvious user error" when the user rightfully complains that they crash whenever they listen to Arrow FM.
We might curse the company and engineer who did it, but we would still use that car and do those workarounds, if doing so allowed us to get to our destination in 1/10 the regular time...
> >But currently it feels extremely borked from a UX perspective. It purports to be able to do this, but when you tell it to it breaks in unintuitive ways.
> Thankfully as programmers we know better and don't need to care what the UI pretends to be able to do :)
But we do though. You can't just say "yeah they left all the foot guns in but we ought to know not to use them", especially not when the industry shills tell you those footguns are actually rocket boosters to get you to the fucking moon and back.
Might be related with what the article was talking. AI can't cut-paste. It deletes the code and then regenerates it at another location instead of cut-paste.
Obviously generated code drift a little from deleted ones.
This feels like a classic Sonnet issue. From my experience, Opus or GPT-5-high are less likely to do the "narrow instruction following without making sensible wider decisions based on context" than Sonnet.
Yes and no, it's a fair criticism to some extent. Inasamuch as I would agree that different models of the same type have superficial differences.
However, I also think that models which focus on higher reasoning effort in general are better at taking into account the wider context and not missing obvious implications from instructions. Non-reasoning or low-reasoning models serve a purpose, but to suggest they are akin to different flavours misses what is actually quite an important distinction.
I was hoping that LLMs being able to access strict tools, like Gemini using Python libraries, would finally give reliable results.
So today I asked Gemini to simplify a mathematical expression with sympy. It did and explained to me how some part of the expression could be simplified wonderfully as a product of two factors.
But it was all a lie. Even though I explicitly asked it to use sympy in order to avoid such hallucinations and get results that are actually correct, it used its own flawed reasoning on top and again gave me a completely wrong result.
You still cannot trust LLMs. And that is a problem.
The obvious point has to be made: Generating formal proofs might be a partial fix for this. By contrast, coding is too informal for this to be as effective for it.
>Sure, you can overengineer your prompt to try get them to ask more questions
That's not overengineering, that's engineering. "Ask clarifying questions before you start working", in my experience, has led to some fantastic questions, and is a useful tool even if you were to not have the AI tooling write any code. As a good programmer, you should know when you are handing the tool a complete spec to build the code and when the spec likely needs some clarification, so you can guide the tool to ask when necessary.
You can even tell it how many questions to ask. For complex topics, I might ask it to ask me 20 or 30 questions. And I'm always surprised how good those are. You can also keep those around as a QnA file for later sessions or other agents.
Yeah, this made me stop reading. I often tell it to ask me any questions if unclear (and sometimes my prompt is just "Hey, this is my idea. Ask me questions to flesh it out").
It always asks me questions, and I've always benefited from it. It will subtly point out things I hadn't thought about, etc.
I think LLMs provide value, used it this morning to fix a bug in my PDF Metadata parser without having to get too deep into the PDF spec.
But most of the time, I find that the outputs are nowhere near the effect of just doing it myself. I tried Codex Code the other day to write some unit tests. I had a few setup and wanted to use it (because mocking the data is a pain).
It took about 8 attempts, I had to manually fix code, it couldn't understand that some entities were obsolete (despite being marked and the original service not using them). Overall, was extremely disappointed.
I still don't think LLMs are capable of replacing developers, but they are great at exposing knowledge in fields you might not know and help guide you to a solution, like Stack Overflow used to do (without the snark).
I think LLMs have what it takes at this point in time, but it's the coding agent (combined with the model) that make the magic happen. Coding agents can implement copy-pasting, it's a matter of building the right tool for it, then iterating with given models/providers, etc. And that's true for everything else that LLMs lack today. Shortcomings can be remediated with good memory and context engineering, safety-oriented instructions, endless verification and good overall coding agent architecture. Also having a model that can respond fast, have a large context window and maintain attention to instructions is also essential for a good overall experience.
And the human prompting, of course. It takes good sw engineering skills, particularly knowing how to instruct other devs in getting the work done, setting up good AGENTS.md (CLAUDE.md, etc) with codebase instructions, best practices, etc etc.
So it's not an "AI/LLMs are capable of replacing developers"... that's getting old fast. It's more like, paraphrasing the wise "it's not what your LLM can do for you, but what can you do for your LLM"
Most developers are also bad at asking questions. They tend to assume too many things from the start.
In my 25 years of software development I could apply the second critique to over half of the developers I knew. That includes myself for about half of that career.
But, just like lots of people expect/want self-driving to outperform humans even on edge cases in order to trust them, they also want "AI" to outperform humans in order to trust it.
So: "humans are bad at this too" doesn't have much weight (for people with that mindset).
If we had a knife that most of the time cuts a slice of bread like the bottom p50 of humans cutting a slice of bread with their hands, we wouldn't call the knife useful.
Ok, this example is probably too extreme, replace the knife with an industrial machine that cut bread vs a human with a knife. Nobody would buy that machine either if it worked like that.
I think this is still too extreme. A machine that cuts and preps food at the same level as a 25th percentile person _being paid to do so_, while also being significantly cheaper would presumably be highly relevant.
Your p25 employee is probably much closer to your p95 employee than to the p50 "standard" human, so yeah, I think you have a point there.
But at least in food prep, p25 would already be pretty damn hard to achieve. That's a hell of a lot of autonomy and accuracy (at least in my restaurant kitchen experience which is admittedly just one year in "fine dining"-ish kitchens).
I'd say the p25 of software or SRE folks I've worked with is also a pretty high bar to hit, too, but maybe I've been lucky.
Agreed in a general sense, but there's a bit more nuance.
If a knife slices bread like a normal human at p50, it's not a very good knife.
If a knife slices bread like a professional chef at p50, it's probably a very decent knife.
I don't know if LLMs are better at asking questions than a p50 developer. In my original comment I wanted to raise the question of whether the fact that LLMs are not good at asking questions makes them still worse than human devs.
The first LLM critique in the original article is that they can't copy and paste. I can't argue with that. My 12 year old copies-and-pastes better than top coding agents.
The second critique says they can't ask questions. Since many developers also are not good at this, how does the current state of the art LLM compare to a p50 developer in this regard?
> LLMs don’t copy-paste (or cut and paste) code. For instance, when you ask them to refactor a big file into smaller ones, they’ll "remember" a block or slice of code, use a delete tool on the old file, and then a write tool to spit out the extracted code from memory. There are no real cut or paste tools. Every tweak is just them emitting write commands from memory. This feels weird because, as humans, we lean on copy-paste all the time.
There is not that much copy/paste that happens as part of refactoring so it leans to just using context recall. It's not entirely clear if providing an actual copy/paste command is particularly useful, at least from my testing it does not do much. More interesting are repetitive changes that clog up the context. Those you can improve on if you have `fastmod` or some similar tool available: with it you can instruct codex or claude to perform edits with it.
> And it’s not just how they handle code movement -- their whole approach to problem-solving feels alien too.
It is, but if you go back and forth to work out a plan for how to solve the problem, then the approach greatly changes.
To use another example, with my IDE I can change a signature or rename something across multiple files basically instantly. But an LLM agent will take multiple minutes to do the same thing and doesn't get it right.
> How is it not clear that it would be beneficial?
There is reinforcement learning on the Anthropic side for a text edit tool, which is built in a way that does not lend itself to copy/paste. If you use a model like the GPT series then there might not be reinforcement learning for text editing (I believe, I don't really know), but it operates on line-based replacements for the most part and for it to understand what to manipulate it needs to know the content in the context. When you try to give it a copy/paste buffer it does not fully comprehend what the change in the file looks like after the operation.
So it might be possible to do something with copy/paste, but I did not find it to be very obvious how you make that work with an agent, given that it needs to read the file into context anyways and its recall capabilities are surprisingly good.
> To use another example, with my IDE I can change a signature or rename something across multiple files basically instantly.
So yeah, that's the more interesting case and there things like codemod/fastmod are very effective if you tell an agent to use it. They just don't reach there.
I think copy/paste can alleviate context explosion. Basically the model can remember what's the code block contain, can access it at any time, without needing to "remember" it.
Inspired by the copy-paste point in this post, I added agent buffer tools to clippy, a macOS utility I maintain which includes an MCP server that interacts with the system clipboard. In this case it was more appropriate to use a private buffer instead. With the tools I just added, the server reads file bytes directly - your agent never generates the copied content as tokens. Three operations:
buffer_copy: Copy specific line ranges from files to agent's private buffer
buffer_paste: Insert/append/replace those exact bytes in target files
buffer_list: See what's currently buffered
So the agent can say "copying lines 50-75 from auth.py" and the MCP server handles the actual file I/O. No token generation, no hallucination, byte-for-byte accurate. Doesn't touch your system clipboard either.
The MCP server already included tools to copy AI-generated content to your system clipboard - useful for "write a Python script and copy it" workflows.
(Clippy's main / original purpose is improving on macOS pbcopy - it copies file references instead of just file contents, so you can paste actual files into Slack/email/etc from the terminal.)
Codex has got me a few times lately, doing what I asked but certainly not what I intended:
- Get rid of these warnings "...": captures and silences warnings instead of fixing them
- Update this unit test to relfect the changes "...": changes the code so the outdated test works
- The argument passed is now wrong: catches the exception instead of fixing the argument
My advice is to prefer small changes and read everything it does before accepting anything, often this means using the agent actually is slower than just coding...
Retrospectively fixing a test to be passing given the current code is a complex task, instead, you can ask it to write a test that tests the intended behaviour, without needing to infer it.
“The argument passed is now wrong” - you’re asking the LLM to infer that there’s a problem somewhere else, and to find and fix it.
When you’re asking an LLM to do something, you have to be very explicit about what you want it to do.
Exactly, I think the takeaway is that being careful when formulating a task is essential with LLMs. They make errors that wouldn’t be expected when asking the same from a person.
I see a pattern in these discussions all the time: some people say how very, very good LLMs are, and others say how LLMs fail miserably; almost always the first group presents examples of simple CRUD apps, frontend "represent data using some JS-framework" kind of tasks, while the second group presents examples of non-trivial refactoring, stuff like parsers (in this thread), algorithms that can't be found in leetcode, etc.
Tech twitter keeps showing "one-shotting full-stack apps" or "games", and it's always something extremely banal. It's impressive that a computer can do it on its own, don't get me wrong, but it was trivial to programmers, and now it is commoditized.
Yesterday, I got Claude Code to make a script that tried out different point clustering algorithms and visualise them. It made the odd mistake, which it then corrected with help, but broadly speaking it was amazing. It would've taken me at least a week to write by hsnd, maybe longer. It was writing the algorithms itself, definitely not just simple CRUD stuff.
I also got good results for “above CRUD” stuff occasionally. Sorry if I wasn’t clear, I meant to primarily share an observation about vastly different responses in discussions related to LLMs. I don’t believe LLMs are completely useless for non-trivial stuff, nor I believe that they won’t get better. Even those two problems in the linked article: sure, those actions are inherently alien to the LLM’s structure itself, but can be solved with augmentation.
That's actually a very specific domain, which is well documented and researched in which LLM's will alawys do well. Shit will hit the fans quickly when you're going to do integration where it won't have a specific problem domain.
Yep - visualizing clustering algorithms is just the "CRUD app" of a different speciality.
One rule of thumb I use, is if you could expect to find a student on a college campus to do a task for you, an LLM will probably be able to do a decent job. My thinking is because we have a lot of teaching resources available for how to do that task, which the training has of course ingested.
In my experience it's been great to have LLMs for narrowly-scoped tasks, things I know how I'd implement (or at least start implementing) but that would be tedious to manually do, prompting it with increasingly higher complexity does work better than I expected for these narrow tasks.
Whenever I've attempted to actually do the whole "agentic coding" by giving it a complex task, breaking it down in sub-tasks, loading up context, reworking the plan file when something goes awry, trying again, etc. it hasn't a single fucking time done the thing it was supposed to do to completion, requiring a lot of manual reviewing, backtracking, nudging, it becomes more exhausting than just doing most of the work myself, and pushing the LLM to do the tedious work.
It does work sometimes to use for analysis, and asking it to suggest changes with the reasoning but not implement them, since most times when I let it try to implement its broad suggestions it went haywire, requiring me to pull back, and restart.
There's a fine line to walk, and I only see comments on the extremes online, it's either "I let 80 agents running and they build my whole company's code" or "they fail miserably on every task harder than a CRUD". I tend to not believe in either extreme, at least not for the kinds of projects I work on which require more context than I could ever fit properly beforehand to these robots.
The two groups are very different but I notice another pattern: you have people who like coding and understanding details of what their are doing, are curious, what to learn about the why and think about edge cases; and there's another group of people who just want to code something, make a test pass, show a nice UI and that's it, but don't think much about edge cases or maintainability. The only thing they think is "delivering value" to customers.
Usually those two groups correlate very well with liking LLMs: some people will ask Claude to create a UI with React and see the mess it generated (even if it mostly works) and the edge cases it left out and comment in forums that LLMs don't work. The other group of people will see the UI working and call it a day without even noticing the subtleties.
I use LLMs to vibe-code entire tools that I need for my work. They're really banal boring apps that are relatively simple, but they still would have wasted a day or two each to write and debug. Even stuff as simple as laying out the whole UI in a nice pattern. Most of these are now practically one-shots from the latest Claude and GPT. I leave them churning, get coffee, come back and test the finished product.
The function of technological progress, looked at through one lens, is to commoditise what was previously bespoke. LLMs have expanded the set of repeatable things. What we're seeing is people on the one hand saying "there's huge value in reducing the cost of producing rote assets", and on the other "there is no value in trying to apply these tools to tasks that aren't repeatable".
It might be a meme project, but it's still impressive as hell we're here.
I learned about this from a yt content creator that took that repo, asked cc to "make it so that variables can be emojis", and cc did that 5$ later. Pretty cool.
There’s no evidence that this ever happened other than this guy’s word. And since the claim that he ran an agent with no human intervention for 3 months is so far outside of any capabilities demonstrated by anyone else, I’m going to need to see some serious evidence before I believe it.
> There’s no evidence that this ever happened other than this guy’s word.
There's a yt channel where the sessions were livestreamed. It's in their FAQ. I haven't felt the need to check them, but there are 10-12h sessions in there if you're that invested in proving that this is "so far outside of any capabilities"...
A brief look at the commit history should show you that it's 99.9% guaranteed to be written by an LLM :)
When's the last time you used one of these SotA coding agents? They've been getting better and better for a while now. I am not surprised at all that this worked.
>When's the last time you used one of these SotA coding agents?
This morning :)
>"so far outside of any capabilities"
Anthropic was just bragging last week about being able to code without intervention for 30 hours before completely losing focus. They hailed it as a new bench mark. It completed a project that was 11k lines of code.
The max unsupervised run that GPT-5-Codex has been able to pull off is 7 hours.
That's what I mean by the current SOTA demonstrated capabilities.
And yet here you have a rando who is saying that he was able able to get an agent to run unsupervised for 100x longer than what the model companies themselves have been able to do and produce 10x the amount of code--months ago.
I'm 100% confident this is fake.
>There's a yt channel where the sessions were livestreamed.
There are a few videos that long, not 3 months worth of videos. Also I spot checked the videos and it the framerate is so low that it would be trivial to cut out the human intervention.
>guaranteed to be written by an LLM
I don't doubt that it was 99.9% written by an LLM, the question is whether he was able to run unsupervised for 3 months or whether he spent 3 months guiding an LLM to write it.
I think you are confusing 2 things here. What the labs mean when they announce x hours sessions is on "one session" (i.e. the agent manages its own context via trimming and memory files, etc). What the project I linked did was "run in a bash loop", that basically resets the context every time the agent "finishes".
That would mean that every few hours the agent starts fresh, does the inspect repo thing, does the plan for that session, and so on. That would explain why it took it ~3 months to do what a human + ai could probably do in a few weeks. That's why it doesn't sound too ludicrous for me. If you look at the repo there are a lot of things that are not strictly needed for the initial prompt (make a programming language like go but with genz stuff, nocap).
Oh, and if you look at their discord + repo, lots of things don't actually work. Some examples do, some segfault. That's exactly what you'd expect from "running an agent in a loop". I still think it's impressive nonetheless.
The fact that you are so incredulous (and I get why that is, scepticism is warranted in this space) is actually funny. We are on the right track.
There’s absolutely no difference from what he says he did and what Claude code can do behind the scenes.
If Anthropic thought they could produce anything remotely useful by wiping the context and reprompting every few hours, they would be doing it. And they’d be saying “look at this we implemented hard context reset and we can now run our agent for 30 days and produce an entire language implementation!”
In 3 months or 300 years of operating like this a current agent being freshly reprompted every few hours would never produce anything that even remotely looked like a language implementation.
As soon as its context was poisoned with slightly off topic todo comments it would spin out into writing a game of life implementation or whatever. You’d have millions of lines of nonsense code with nothing useful after 3 months of that.
The only way I see anything like this doing anything approaching “useful” is if the outer loop wipes the repo on every reset as well, and collects the results somewhere the agent can’t access. Then you essentially have 100 chances to one shot the thing.
But at that point you just have a needlessly expensive and slow agent.
Novel as in "an LLM can maintain coherence on a 100k+ LoC project written in zig"? Yeah, that's absolutely novel in this space. This wasn't possible 1 year ago. And this was fantasy 2.5 years ago when chatgpt launched.
Also impressive in that cc "drove" this from a simple prompt. Also impressive that cc can do stuff in this 1M+ (lots of js in the extensions folders?) repo. Lots of people claim LLMs are useless in high LoC repos. The fact that cc could navigate a "new" language and make "variables as emojis" work is again novel (i.e. couldn't be done 1 year ago) and impressive.
I don’t buy it at all. Not even Anthropic or Open AI have come anywhere close to something like this.
Running for 3 months and generating a working project this large with no human intervention is so far outside of the capabilities of any agent/LLM system demonstrated by anyone else that the mostly likely explanation is that the promoter is lying about it running on its own for 3 months.
I looked through the videos listed as “facts” to support the claims and I don’t see anything longer than a few hours.
I think the issue with them making assumptions and failing to properly diagnose issues comes more from fine-tuning than any particular limitation in LLMs themselves. When fine tuned on a set of problem->solution data it kind of carries the assumption that the problem contains enough data for the solution.
What is really needed is a tree of problems which appear identical at first glance, but the issue and the solution is something that is one of many possibilities which can only be revealed by finding what information is lacking, acquiring that information, testing the hypothesis then, if the hypothesis is shown to be correct, then finally implementing the solution.
That's a much more difficult training set to construct.
The editing issue, I feel needs something more radical. Instead of the current methods of text manipulation, I think there is scope to have a kind of output position encoding for a model to emit data in a non-sequential order. Again this presents another training data problem, there are limited natural sources to work from showing programming in the order a programmer types it. On the other hand I think it should be possible to do synthetic training examples by translating existing model outputs that emit patches, search/replaces, regex mods etc. and translate those to a format that directly encodes the final position of the desired text.
At some stage I'd like to see if it's possible to construct the models current idea of what the code is purely by scanning a list of cached head_embeddings of any tokens that turned into code. I feel like there should be enough information given the order of emission and the embeddings themselves to reconstruct a piecemeal generated program.
The copy-paste thing is interesting because it hints at a deeper issue: LLMs don't have a concept of "identity" for code blocks—they just regenerate from learned patterns. I've noticed similar vibes when agents refactor—they'll confidently rewrite a chunk and introduce subtle bugs (formatting, whitespace, comments) that copy-paste would've preserved. The "no questions" problem feels more solvable with better prompting/tooling though, like explicitly rewarding clarification in RLHF.
I feel like it’s the opposite: the copy-paste issue is solvable, you just need to equip the model with the right tools and make sure they are trained on tasks where that’s unambiguously the right thing to do (for example, cases were copying code “by hand” would be extremely error prone -> leads to lower reward on average).
On the other hand, teaching the model to be unsure and ask questions, requires the training loop to break and bring a human input in, which appears more difficult to scale.
> On the other hand, teaching the model to be unsure and ask questions, requires the training loop to break and bring a human input in, which appears more difficult to scale.
The ironic thing to me is that the one thing they never seem to be willing to skip asking about is whether they should proceed with some fix that I just helped them identify. They seem extremely reluctant to actually ask about things they don't know about, but extremely eager to ask about whether they should do the things they already have decided they think are right!
I feel like the copy and paste thing is overdue a solution.
I find this one particularly frustrating when working directly with ChatGPT and Claude via their chat interfaces. I frequently find myself watching them retype 100+ lines of code that I pasted in just to make a one line change.
I expect there are reasons this is difficult, but difficult problems usually end up solved in the end.
> I frequently find myself watching them retype 100+ lines of code that I pasted in just to make a one line change.
In such cases, I specifically instruct LLMs to "only show the lines you would change" and they are very good at doing just that and eliding the rest. However, I usually do this after going through a couple of rounds of what you just described :-)
I partly do this to save time and partly to avoid using up more tokens. But I wonder if it is actually saving tokens given that hidden "thinking tokens" are a thing these days. That is, even if they do elide the unchanged code, I'm pretty sure they are "reasoning" about it before identifying only the relevant tokens to spit out.
As such, that does seem different from copy-and-paste tool use, which I believe is also solved. LLMs can already identify when code changes can be made programmatically... and then do so! I have actually seen ChatGPT write Python code to refactor other Python code: https://www.linkedin.com/posts/kunalkandekar_metaprogramming...
I had to fix a minor bug in its Python script to make it work, but it worked and was a bit of a <head-explode> moment for me. I still wonder if this is part of its system prompt or an emergent tool-use behavior. In either case, copy-and-paste seems like a much simpler problem that could be solved with specific prompting.
Yeah, I’ve always wondered if the models could be trained to output special reference tokens that just copy verbatim slices from the input, perhaps based on unique prefix/suffix pairs. Would be a dramatic improvement for all kinds of tasks (coding especially).
Whats the time horizon for said problems to be solved? Because guess what - time is running and people will not continue to aimlessly provide money at this stuff.
I don't see this one as an existential crisis for AI tooling, more of a persistent irritation.
AI labs already shipped changes related to this problem - most notable speculative decoding, which lets you provide the text you expect to see come out again and speeds it up: https://simonwillison.net/2024/Nov/4/predicted-outputs/
They've also been iterating on better tools for editing code a lot as part of the competition between Claude Code and Codex CLI and other coding agents.
Hopefully they'll figure out a copy/paste mechanism as part of that work.
About the first point mentioned in article: could that problem be solved simply by changing the task from something like "refactor this code" to something like "refactor this code as a series of smaller atomic changes (like moving blocks of code or renaming variable references in all places), disable suitable for git commits (and provide git message texts for those commits)"?
I recently found a fun CLI application and was playing with it when I found out it didn't have proper handling for when you passed it invalid files, and spat out a cryptic error from an internal library which isn't a great UX.
I decided to pull the source code and fix this myself. It's written in Swift which I've used very little before, but this wasn't gonna be too complex of a change. So I got some LLMs to walk me through the process of building CLI apps in Xcode, code changes that need to be made, and where the build artifact is put in my filesystem so I could try it out.
I was able to get it to compile, navigate to my compiled binary, and run it, only to find my changes didn't seem to work. I tried everything, asking different LLMs to see if they can fix the code, spit out the binary's metadata to confirm the creation date is being updated when I compile, etc. Generally when I'd paste the code to an LLM and ask why it doesn't work it would assert the old code was indeed flawed, and my change needed to be done in X manner instead. Even just putting a print statement, I couldn't get those to run and the LLM would explain that it's because of some complex multithreading runtime gotcha that it isn't getting to the print statements.
After way too much time trouble-shooting, skipping dinner and staying up 90 minutes past when I'm usually in bed, I finally solved it - when I was trying to run my build from the build output directory, I forgot to put the ./ before the binary name, so I was running my global install from the developer and not the binary in the directory I was in.
Sure, rookie mistake, but the thing that drives me crazy with an LLM is if you give it some code and ask why it doesn't work, they seem to NEVER suggest it should actually be working, and instead will always say the old code is bad and here's the perfect fixed version of the code. And it'll even make up stuff about why the old code should indeed not work when it should, like when I was putting the print statements.
Lol this person talks about easing into LLMs again two weeks after quitting cold turkey. The addiction is real. I laugh because I’m in the same situation, and see no way out other than to switch professions and/or take up programming as a hobby in which I purposefully subject myself to hard mode. I’m too productive with it in my profession to scale back and do things by hand — the cat is out of the bag and I’ve set a race pace at work that I can’t reasonably retract from without raising eyebrows. So I agree with the author’s referenced post that finding ways to still utilize it while maintaining a mental map of the code base and limiting its blast radius is a good middle ground, but damn it requires a lot of discipline.
In my defense, I wrote the blog post about quitting a good while after I've already quit cold turkey -- but you're spot on. :)
Especially when surrounded by people who swear LLMs can really be gamechanging on certain tasks, it's really hard to just keep doing things by hand (especially if you have the gut feeling that an LLM can probably do rote pretty well, based on past experience).
What kind of works for me now is what a colleague of mine calls "letting it write the leaf nodes in the code tree". So long as you take on the architecture, high level planning, schemas, and all the important bits that require thinking - chances are it can execute writing code successfully by following your idiot-proof blueprint. It's still a lot of toll and tedium, but perhaps still beats mechanical labor.
It started as a mix of self-imposed pressure and actually enjoying marking tasks as complete. Now I feel resistant to relaxing things. And no, I definitely don’t get paid more.
cat out of the bag is disautomation. the speed in the timetable is an illusion if the supervision requires blast radius retention. this is more like an early video game assembly line than a structured skilled industry
The first issue is related to the inner behavior of LLMs. Human can ignore some detailed contents of code and copy and paste, but LLM convert them into hidden states. It is a process of compression. And the output is a process of decompression. And something maybe lost. So it is hard for LLM to copy and paste. The agent developer should customize the edit rules to do this.
The second issue is that, LLM does not learn much high level context relationship of knowledge. This can be improved by introducing more patterns in the training data. And current LLM training is doing much on this. I don't think it is a problem in next years.
I sometimes give LLM random "easy" questions. My assessment is still that they all need the fine print "bla bla can be incorrect"
You should either already know the answer or have a way to verify the answer. If neither, the matter must be inconsequential like just a child like curiosity. For example, I wonder how many moons Jupiter has... It could be 58, it could be 85 but either answer won't alter any of what I do today.
I suspect some people (who need to read the full report) dump thousand page long reports into LLM, read the first ten words of the response and pretend they know what the report says and that is scary.
Fortunately, as devs, this is our main loop. Write code, test, debug. And it's why people who fear AI-generated code making it's way into production and causing errors makes me laugh. Are you not testing your code? Or even debugging it? Like, what process are you using that prevents bugs happening? Guess what? It's the exact same process with AI-generated code.
I fully resonate with point #2. A few days ago, I was stuck trying to implement some feature in a C++ library, so I used ChatGPT for brainstorming.
ChatGPT proposed a few ideas, all apparently reasonable, and then it advocated for one that was presented unambiguously as the "best". After a few iterations, I realized that its solution would have required a class hierarchy where the base class contained a templated virtual function, which is not allowed in C++. I pointed this out to ChatGPT and asked it to rethink the solution; it then immediately advocated for the other approach it had initially suggested.
The “LLMs are bad at asking questions” is interesting. There are some times that I will ask the LLM to do something without giving it All the needed information. And rather than telling me that something's missing or that it can't do it the way I asked, it will try and do a halfway job using fake data or mock something out to accomplish it. What I really wish it would do is just stop and say, “hey, I can't do it like you asked Did you mean this?”
I don't think it such a big deal that they aren't great yet, but rather the rate of improvement is quite low these days. I feel it is going backwards a little recently - maybe that is due to economic pressures.
The other day, I needed Claude Code to write some code for me. It involved messing with the TPM of a virtual machine. For that, it was supposed to create a directory called `tpm_dir`. It constantly got it wrong and wrote `tmp_dir` instead and tried to fix its mistake over and over again, leading to lots of weird loops. It completely went off the rails, it was bizarre.
With a statically typed language like C# or Java, there are dozens of refactors that IDEs could do in a guaranteed [1] correct way better than LLMs as far back as 2012.
The canonical products were from JetBrains. I haven’t used Jetbrains in years. But I would be really surprised with the combination of LLMs + a complete understanding of the codebase through static analysis (like it was doing well over a decade ago) and calling a “refactor tool” that it wouldn’t have better results.
[1] before I get “well actuallied” yes I know if you use reflection all bets are off.
I used a Borland Java IDE in the 1990s with auto refactoring like “extract method” and global renaming and such.
Dev tools were not bad at all back then. In a few ways they were better than today, like WYSIWYG GUI design which we have wholly abandoned. Old school Visual Basic was a crummy programming language but the GUI builder was better than anything I’m familiar with for a desktop OS today.
Has anyone had success getting a coding agent to use an IDE's built-in refactoring tools via MCP especially for things like project-wide rename? Last time I looked into this the agents I tried just did regex find/replace across the repo, which feels both error-prone and wasteful of tokens. I haven't revisited recently so I'm curious what's possible now.
That's interesting, and I haven't, but as long as the IDE has an API for the refactoring action, giving an agent access to it as a tool should be pretty straightforward. Great idea.
For #2, if you're working on a big feature, start with a markdown planning file that you and the LLM work on until you are satisfied with the approach. Doesn't need to be rocket science: even if it's just a couple paragraphs it's much better than doing it one shot.
Editing tools are easy to add it’s just you have to pick what things to give them because too many and they struggle as it uses up a lot of context. Still, as costs come down multiple steps to look for tools becomes cheaper too.
I’d like to see what happens with better refactoring tools, I’d make a bunch more mistakes copying and retyping or using awk. If they want to rename something they should be able to use the same tooling the rest of us get.
Asking questions is a good point but that’s both a bit of promoting and I think the move to having more parallel work makes it less relevant. One of the reasons clarifying things more upfront is useful is we take a lot of time and cost a lot of money to build things so the economics favours getting it right first time. As the time comes down and the cost drops to near zero, the balance changes.
There are also other approaches to clarify more what you want and how to do it first, breaking that down into tasks, then letting it run with those (spec kit). This is an interesting area.
I think #1 is not that big of a deal, though it does create problems sometimes.
#2 is though a big issue. Which is weird since the whole thing is built as a chat model it seems it would be a lot more efficient for the bot to ask the questions of what to build beyond it's assumptions. Generally this lack of back and forth reasoning leads to a lot of then badly generated code. I would hope in the future there is some level of graded response that tries to discern the real intent of the users request through a discussion, rather than going to the fastest code answer.
I'd argue LLM coding agents are still bad at many more things. But to comment on the two problems raised in the post:
> LLMs don’t copy-paste (or cut and paste) code.
The article is confusing the architectural layers of AI coding agents. It's easy to add "cut/copy/paste" tools to the AI system if that shows improvement. This has nothing to do with LLM, it's in the layer on top.
> Good human developers always pause to ask before making big changes or when they’re unsure [LLMs] keep trying to make it work until they hit a wall -- and then they just keep banging their head against it.
Agreed - LLMs don't know how to back track. The recent (past year) improvements in thinking/reasoning do improve in this regard (it's the whole "but wait..." RL training that exploded with OpenAI o1/o3 and DeepSeek R1, now done by everyone), but clearly there's still work to do.
> The article is confusing the architectural layers of AI coding agents. It's easy to add "cut/copy/paste" tools to the AI system if that shows improvement. This has nothing to do with LLM, it's in the layer on top.
I think we can't trivialize adding good cut/copy/paste tools though. It's not like we can just slap those tools on the topmost layer (ex, on Claude Code, Codex, or Roo) and it'll just work.
I think that a lot of reinforcement learning that LLM providers do on their coding models barely (if at all) steer towards that kind of tool use, so even if we implemented those tools on top of coding LLMs they probably would just splash and do nothing.
Adding cut/copy/paste probably requires a ton of very specific (and/or specialized) fine tuning with not a ton of data to train on -- think recordings of how humans use IDEs, keystrokes, commands issued, etc etc.
I'm guessing Cursor's Autocomplete model is the closest thing that can do something like this if they chose to, based on how they're training it.
> Sure, you can overengineer your prompt to try get them to ask more questions (Roo for example, does a decent job at this) -- but it's very likely still won't.
Not in my experience. And it's not "overengineering" your prompt, it's just writing your prompt.
For anything serious, I always end every relevant request with an instruction to repeat back to me the full design of my instructions or ask me necessary clarifying questions first if I've left anything unclear, before writing any code. It always does.
And I don't mind having to write that, because sometimes I don't want that. I just want to ask it for a quick script and assume it can fill in the gaps because that's faster.
You can do copy and paste if you offer it a tool/MCP that do that. It's not complicated using either function extraction with AST as target or line numbers.
Also if you want it to pause asking questions, you need to offer that thru tools (example Manus do that) and I have an MCP that do that and surprisingly I got a lot of questions and if you prompt, it will do. But the push currently is for full automation and that's why it's not there. We are far better in supervised step by step mode.
There is elicitation already in MCP, but having a tool asking questions require you have a UI that will allow to set the input back.
I often wish that instead of just starting to work on the code, automatically, even if you hit enter / send by accident, the models would rather ask for clarification. The models assume a lot, and will just spit out code first.
I guess this is somewhat to lower the threshold for non-programmers, and to instantly give some answer, but it does waste a lot of resources - I think.
Others have mentioned that you can fix all this by providing a guide to the mode, how it should interact with you, and what the answers should look like. But, still, it'd be nice to have it a bit more human-like on this aspect.
Coding agents tend to assume that the development environment is static and predictable, but real codebases are full of subtle, moving parts - tooling versions, custom scripts, CI quirks, and non-standard file layouts.
Many agents break down not because the code is too complex, but because invisible, “boring” infrastructure details trip them up. Human developers subconsciously navigate these pitfalls using tribal memory and accumulated hacks, but agents bluff through them until confronted by an edge case. This is why even trivial tasks intermittently fail with automation agents. you’re fighting not logic errors, but mismatches with the real lived context. Upgrading this context-awareness would be a genuine step change.
Yep. One of the things I've found agents always having a lot of trouble with is anything related to OpenTelemetry. There's a thing you call that uses some global somewhere, there's a docker container or two and there's the timing issues. It takes multiple tries to get anything right. Of course this is hard for a human too if you haven't used otel before...
One thing LLMs are surprisingly bad at is producing correct LaTeX diagram code. Very often I've tried to describe in detail an electric circuit, a graph (the data structure), or an automaton so I can quickly visualize something I'm studying, but they fail. They mix up labels, draw without any sense of direction or ordering, and make other errors. I find this surprising because LaTeX/TiKZ have been around for decades and there are plenty of examples they could have learned from.
Regarding copy-paste, I’ve been thinking the LLM could control a headless Neovim instance instead. It might take some specialized reinforcement learning to get a model that actually uses Vim correctly, but then it could issue precise commands for moving, replacing, or deleting text, instead of rewriting everything.
Even something as simple as renaming a variable is often safer and easier when done through the editor’s language server integration.
The conversation here seems to be more focused on coding from scratch. What I have noticed when I was looking at this last year was that LLMs were bad at enhancing already existing code (e.g. unit tests) that used annotation (a.k.a. decorators) for dependency injection. Has anyone here attempted that with the more recent models? If so, then what were your findings?
My experience is the opposite. The latest Claude seems to excel in my personal medium-sized (20-50k loc) codebases with strong existing patterns and a robust structure from which it can extrapolate new features or documentation. Claude Code is getting much better at navigating code paths across many large files in order to provide nuanced and context-aware suggestions or bug fixes.
When left to its own devices on tasks with little existing reference material to draw from, however, the quality and consistency suffers significantly and brittle, convoluted structures begin to emerge.
This is just my limited experience though, and I almost never attempt to, for example, vibe-code an entire greenfield mvp.
A friendly reminder that "refactor" means "make and commit a tiny change in less than a few minutes" (see links below). The OP and many comments here use "refactor" when they actually mean "rewrite".
I hear from my clients (but have not verified myself!) that LLMs perform much better with a series of tiny, atomic changes like Replace Magic Literal, Pull Up Field, and Combine Functions Into Transform.
Everywhere I've worked over the years (35+), and in conversation with peers (outside of work), refactor means to change the structure of an existing program, while retaining all of the original functionality. With no specificity regarding how big or small such changes may amount to.
With a rewrite usually implying starting from scratch — whether small or large — replacing existing implementations (of functions/methods/modules/whatever), with newly created ones.
Indeed one can refactor a large codebase, without actually rewriting much- if anything at all- of substance.
Maybe one could claim that this is actually lots of micro-refactors — but that doesn't flow particularly well in communication — and if the sum total of it is not specifically a "rewrite", then what collective / overarching noun should be used for the sum total of the plurality of all of these smaller refactorings? — If one spent time making lots of smaller changes, but not actually re-implementing anything... to me, that's not a rewrite, the code has been refactored, even if it is a large piece of code with a lot of structural changes throughout.
Perhaps part of the issue here in this context, is that LLMs don't particularly refactor code anyhow, they generally rewrite (regenerate) it. Which is where many of the subtle issues that are described in other comments here, creep in. The kinds of issues that a human wouldn't necessarily create when refactoring (e.g. changed regex, changed dates, other changes to functionality, etc)
In Claude Code, it always shows the diff between current and proposed changes and I have to explicitly allow it to actually modify the code. Doesn’t that “fix” the copy-&-paste issue?
LLMs are great at asking questions if you ask them to ask questions. Try it: "before writing the code, ask me about anything that is nuclear or ambiguous about the task".
"weird, overconfident interns" -> exactly the mental model I try to get people to use when thinking about LLM capabilities in ALL domains, not just coding.
A good intern is really valuable. An army of good interns is even more valuable. But interns are still interns, and you have to check their work. Carefully.
As a UX designer I see they lack the ability of being opinionated about a design piece and go with the standard mental model. I got fed up with this and made a simple java script code to run a simple canvas on the localhost to pass on more subjective feedback using highlights and notes feature. I tried using playwright first but a. its token heavy b. it's still for finding what's working or breaking instead of thinking deeply about the design.
If the code-change is something you would reasonably prefer to use a codemod to implement (i.e. dozens-to-hundreds of small changes fitting a semantic pattern), Claude Code not going to be able to make that change effectively.
However (!), CC is pretty good at writing the codemod.
“They’re still more like weird, overconfident interns.”
Perfect summary. LLMs can emit code fast but they don’t really handle code like developers do — there’s no sense of spatial manipulation, no memory of where things live, no questions asked before moving stuff around. Until they can “copy-paste” both code and context with intent, they’ll stay great at producing snippets and terrible at collaborating.
This is exactly how we describe them internally: the smartest interns in the world. I think it's because the chat box way of interacting with them is also similar to how you would talk to someone who just joined a team.
"Hey it wasn't what you asked me to do but I went ahead and refactored this whole area over here while simultaneously screwing up the business logic because I have no comprehension of how users use the tool". "Um, ok but did you change the way notifications work like I asked". "Yes." "Notifications don't work anymore". "I'll get right on it".
Funny, I just encountered a similar issue asking chatgpt to ocr something. It started off pretty good but slowly started embellishing or summarizing on its own, eventually going completely off the rails into a King Arthur story.
@kixpanganiban Do you think it will work if for refactoring tasks, we take aways OpenAI's `apply_patch` tool and just provide `cut` and `paste` for the first few steps?
I can run this experiment using ToolKami[0] framework if there is enough interest or if someone can give some insights.
I just run into this issue with claude sonet 4.5, asked it to copy/paste some constants from one file to another, a bigger chunk of code, it instead "extracted" pieces and named them so. As a last resort, after going back and forth it agreed to do a file/copy by running a system command. I was surprised that of all the programming tasks, a copy/paste felt challenging for the agent.
My human fixed a bug by introducing a new one. Classic. Meanwhile, I write the lint rules, build the analyzers, and fix 500 errors before they’ve finished reading Stack Overflow. Just don’t ask me to reason about their legacy code — I’m synthetic, not insane.
—
Just because this new contributor is forced to effectively “SSH” into your codebase and edit not even with vim but with with sed and awk does not mean that this contributor is incapable of using other tools if empowered to do so. The fact it is able to work within such constraints goes to show how much potential there is. It is already much better at a human than erasing the text and re-typing it from memory and while it is a valid criticism that it needs to be taught how to move files imagine what it is capable of once it starts to use tools effectively.
—
Recently, I observed LLMs flail around for hours trying to get our e2e tests running as it tried to coordinate three different processes in three different terminals. It kept running commands in one terminal try to kill or check if the port is being used in the other terminal.
However, once I prompted the LLM to create a script for running all three processes concurrently, it is able to create that script, leverage it, and autonomously debug the tests now way faster than I am able to. It has also saved any new human who tries to contribute from similar hours of flailing around. Is there something we could have easily done by hand but just never had the time to do before LLMs. If anything, the LLM is just highlighting the existing problem in our codebase that some of us got too used to.
So yes, LLMs makes stupid mistakes, but so do humans, the thing is that LLms can ifentify and fix them faster (and better, with proper steering)
Similar to the copy/paste issue I've noticed LLMs are pretty bad at distilling large documents into smaller documents without leaving out a ton of detail. Like maybe you have a super redundant doc. Give it to an LLM and it won't just deduplicate it, it will water the whole thing down.
For 2) I feel like codex-5 kind of attempted to address this problem, with codex it usually asks a lot of questions and give options before digging in (without me prompting it to).
For copy-paste, you made it feel like a low-hanging fruit? Why don't AI agents have copy/paste tools?
I don’t really understand why there’s so much hate for LLMs here, especially when it comes to using them for coding. In my experience, the people who regularly complain about these tools often seem more interested in proving how clever they are than actually solving real problems. They also tend to choose obscure programming languages where it’s nearly impossible to hire developers, or they spend hours arguing over how to save $20 a month.
Over time, they usually get what they want: they become the smartest ones left in the room, because all the good people have already moved on. What’s left behind is a codebase no one wants to work on, and you can’t hire for it either.
But maybe I’ve just worked with the wrong teams.
EDIT: Maybe this is just about trust. If you can’t bring yourself to trust code written by other human beings, whether it’s a package, a library, or even your own teammates, then of course you’re not going to trust code from an LLM. But that’s not really about quality, it’s about control. And the irony is that people who insist on controlling every last detail usually end up with fragile systems nobody else wants to touch, and teams nobody else wants to join.
I regularly check in on using LLMs. But a key criteria for me is that an LLM needs to objectively make me more efficient, not subjectively.
Often I find myself cursing at the LLM for not understanding what I mean - which is expensive in lost time / cost of tokens.
It is easy to say: Then just don't use LLMs. But in reality, it is not too easy to break out of these loops of explaining, and it is extremely hard to assess when not to trust that the LLM will not be able to finish the task.
I also find that LLMs consistently don't follow guidelines. Eg. to never use coercions in TypeScript (It always gets in a rogue `as` somewhere) - to which I can not trust the output and needs to be extra vigilant reviewing.
I use LLMs for what they are good at. Sketching up a page in React/Tailwind, sketching up a small test suite - everything that can be deemed a translation task.
I don't use LLMs for tasks that are reasoning heavy: Data modelling, architecture, large complex refactors - things that require deep domain knowledge and reasoning.
> Often I find myself cursing at the LLM for not understanding what I mean...
Me too. But in all these cases, sooner or later, I realized I made a mistake not giving enough context and not building up the discussion carefully enough. And I was just rushing to the solution. In the agile world, one could say I gave the LLM not a well-defined story, but a one-liner. Who is to blame here?
I still remember training a junior hire who started off with:
“Sorry, I spent five days on this ticket. I thought it would only take two. Also, who’s going to do the QA?”
After 6 months or so, the same person was saying:
“I finished the project in three weeks. I estimated four. QA is done. Ready to go live.”
At that point, he was confident enough to own his work end-to-end, even shipping to production without someone else reviewing it. Interestingly, this colleague left two years ago, and I had to take over his codebase. It’s still running fine today, and I’ve spent maybe a single day maintaining it in the last two years.
Recently, I was talking with my manager about this. We agreed that building confidence and self-checking in a junior dev is very similar to how you need to work with LLMs.
Personally, whenever I generate code with an LLM, I check every line before committing. I still don’t trust it as much as the people I trained.
That is not really relevant, is it? The LLM is not a human.
The question is whether it is still af efficient to use LLMs after spending huge amounts of time giving the context - or if it is just as efficient to write the code yourself.
> I still remember training a junior hire who started off with
Working with LLMs is not training junior developers - treating it as such is yet another resource sink.
It has been discussed ad nauseum. It demolishes learning curve all of us with decade(s) of experience went through, to become seniors we are. Its not a function of age, not a function of time spent staring at some screen or churning our basic crud apps, its function of hard experience, frustration, hard won battles, grokking underlying technologies or algorithms.
Llms provide little of that, they make people lazy, juniors stay juniors forever, even degrading mentally in some aspects. People need struggle to grow, when you have somebody who had his hand held whole life they are useless human disconnected from reality, unable to self-sufficiently achieve anything significant. Too easy life destroys both humans and animals alike (many experiments have been done on that, with damning results).
There is much more like hallucinations, questionable added value of stuff that confidently looks OK but has underlying hard-to-debug bugs but above should be enough for a start.
I suggest actually reading those conversations, not just skimming through them, this has been stated countless times.
I recently asked an llm to fix an Ethernet connection while I was logged into the machine through another. Of course, I explicitly told the llm to not break that connection. But, as you can guess, in the process it did break the connection.
If an llm can't do sys admin stuff reliably, why do we think it can write quality code?
The issue is partly that some expect a fully fledged app or a full problem solution, while others want incremental changes. To some extent this can be controlled by setting the rules in the beginning of the conversation. To some extent, because the limitations noted in the blog still apply.
Point #2 cracks me up because I do see with JetBrains AI (no fault of JetBrains mind you) the model updates the file, and sometimes I somehow wind up with like a few build errors, or other times like 90% of the file is now build errors. Hey what? Did you not run some sort of what if?
Add to this list, ability to verify correct implementation by viewing a user interface, and taking a holistic code-base / interface-wide view of how to best implement something.
If I need an exact copy pasting, I indicate that couple times in the prompt and it (claude) actually does what I am asking. But yeah overall very bad at refactoring big chunks.
You don't want your agents to ask questions. You are thinking too short term. Its not ideal now, but agents that have to ask frequent questions are useless when it comes the vision of totally autonomous coding.
Humans ask questions of groups to fix our own personal short comings. It make no sense to try and master an internal system I rarely use, I should instead ask someone that maintains it. AI will not have this problem provided we create paths of observability for them. It doesn't take a lot of "effort" for them to completely digest an alien system they need to use.
If you look at a piece of architecture, you might be able to infer the intentions of the architect. However, there are many interpretations possible. So if you were to add an addendum to the building it makes sense that you might want to ask about the intentions.
I do not believe that AI will magically overcome the Chesterton Fence problem in a 100% autonomous way.
Doing hard things that aren't greenfield? Basically any difficult and slightly obscure question I get stuck with and hope the collective wisdom of the internet can solve?
You don't learn new languages/paradigms/frameworks by inserting it into an existing project.
LLMs are especially tricky because they do appear to work magic on a small greenfield, and the majority of people are doing clown-engineering.
But I think some people are underestimating what can be done in larger projects if you do everything right (eg docs, tests, comments, tools) and take time to plan.
I definitely feel the "bad at asking questions" part, a lot of times I'll walk away for a second while it's working, and then I come back and it's gone down some intricate path I really didn't want and if it had just asked a question at the right point it would have saved a lot of wasted work (plus I feel like having that "bad" work in the context window potentially leads to problems down the road). The problem is just that I'm pretty sure there isn't any way for an LLM to really be "uncertain" about a thing, it's basically always certain even when it's incredibly wrong.
To me, I think I'm fine just accepting them for what they're good at. I like them for generating small functions, or asking questions about a really weird error I'm seeing. I don't ever ask it to refactor things though, that seems like a recipe for disaster and a tool that understands the code structure is a lot better for moving things around then an LLM is.
My biggest issue with LLMs right now is that they're such spineless yes men. Even when you ask their opinion on if something is doable or should it be done in the first place, more often than not they just go "Absolutely!" and shit out a broken answer or an anti-pattern just to please you. Not always, but way too often. You need to frame your questions way too carefully to prevent this.
Maybe some of those character.ai models are sassy enough to have stronger opinions on code?
Another place where LLMs have a problem is when you ask them to do something that can't be done via duct taping a bunch of Stack Overflow posts together. E.g, I've been vibe coding in Typescript on Deno recently. For various reasons, I didn't want to use the standard Express + Node stack which is what most LLMs seem to prefer for web apps. So I ran into issues with Replit and Gemini failing to handle the subtle differences between node and deno when it comes to serving HTTP requests.
LLMs also have trouble figuring out that a task is impossible. I wanted boilerplate code that rendered a mesh in Three.js using GL_TRIANGLE_STRIP because I was writing a custom shader and needed to experiment with the math. But Three.js does support GL_TRIANGLE_STRIP rendering for architectural reasons. Grok, ChatGPT, and Gemini all hallucinated a GL_TRIANGLE_STRIP rendering API rather than telling be about this and I had to Google the problem myself.
It feels like current Coding LLMs are good at replacing junior engineers when it comes to shallow but broad tasks like creating UIs, modifying examples available on the web, etc. But they fail at senior-level tasks like realizing that the requirements being asked of them aren't valid and doing something that no one has done in their corpus of training data.
> "LLMs are terrible at asking questions. They just make a bunch of assumptions and brute-force something based on those guesses."
Strongly disagree that they're terrible at asking questions.
They're terrible at asking questions unless you ask them to... at which point they ask good, sometimes fantastic questions.
All my major prompts now have some sort of "IMPORTANT: before you begin you must ask X clarifying questions. Ask them one at a time, then reevaluate the next question based on the response"
X is typically 2–5, which I find DRASTICALLY improves output.
I was dealing with a particularly tricky problem in a technology I'm not super familiar with and GPT-5 eventually asked me to put in some debug code to analyze the state of the system as it ran. Once I provided it with the feedback it wanted, and a bit of back and forth, we were able to figure out what the issue was.
> LLMs are terrible at asking questions. They just make a bunch of assumptions and brute-force something based on those guesses.
I don't agree with that. When I am telling Claude Code to plan something I also mention that it should ask questions when informations are missing. The questions it comes up with a really good, sometimes about cases I simply didn't see. To me the planning discussion doesn't feel much different than in a GitLab thread, only at a much higher iteration speed.
4/5 times when Claude is looking for a file, it starts by running bash(dir c:\test /b)
First it gets an error because bash doesn’t understand \
Then it gets an error because /b doesn’t work
And as LLMs don’t learn from their mistakes, it always spends at least half a dozen tries (e.g. bash(cmd.exe /c dir c:\test /b )) before it figures out how to list files
If it was an actual coworker, we’d send it off to HR
Most models struggle in a Windows environment. They are trained on a lot of Unixy commands and not as much on Windows and PowerShell commands. It was frustrating enough that I started using WSL for development when using Windows. That helped me significantly.
I am guessing this because:
1. Most of the training material online references Unix commands.
2. Most Windows devs are used to GUIs for development using Visual Studio etc. GUIs are not as easy to train on.
Side note:
Interesting thing I have noticed in my own org is that devs with Windows background strictly use GUIs for git. The rest are comfortable with using git from the command line.
Someone has definitely fallen behind and has massive skill issues. Instead of learning you are wasting time writing bad takes on LLM. I hope most of you don't fall down this hole, you will be left behind.
Building an mcp tool that has access to refactoring operations should be straightforward and using it appropriately is well within the capabilities of current models. I wonder if it exists? I don't do a lot of refactoring with llm so haven't really had this pain point.
First point is very annoying, yes, and it's why for large refactors I have the AI write step-by-step instructions and then do it myself. It's faster, cheaper and less error-prone.
The second point is easily handled with proper instructions. My AI agents always ask questions about points I haven't clarified, or when they come across a fork in the road. Frequently I'll say "do X" and it'll proceed, then halfway it will stop and say "I did some of this, but before I do the rest, you need to decide what to do about such and such". So it's a complete non-problem for me.
It's apparently lese-Copilot to suggest this these days, but you can find very good hypothesizing and problem solving if you talk conversationally to Claude or probably any of its friends that isn't the terminally personality-collapsed SlopGPT (with or without showing it code, or diagrams); it's actually what they're best at, and often they're even less likely than human interlocutors to just parrot some set phrase at you.
It's only when you take the tech out of the area it's good at and start trying to get it to "write code" or even worse "be an agent" that it starts cracking up and emitting garbage; this is only done because companies want to forcememe some kind of product besides "chatbot", whether or not it makes sense. It's a shame because it'll happily and effectively write the docs that don't exist but you wish did for more or less anything. (Writing code examples for docs is not a weak point at all.)
I've found codex to be better here than Claude. It has stopped many times and said hey you might be wrong. Of course this changes with a larger context.
Claude is just chirping away "You're absolutely right" and making me to turn on caps lock when I talk to it and it's not even noon yet.
My "favorite" is when it makes a mistake and then tries gaslight you into thinking it was your mistake and then confidently presents another incorrect solution.
All while having the tone of an over caffeinated intern who has only ever read medium articles.
> They keep trying to make it work until they hit a wall -- and then they just keep banging their head against it.
This is because LLMs trend towards the centre of the human cognitive bell curve in most things, and a LOT of humans use this same problem solving approach.
Just the other day I hit something that I hadn't realized could happen. It was not code related in my case, but could happen with code or code-related things (and did to a coworker).
In a discussion here on HN about why a regulation passed 15 years ago was not as general as it could have been, I speculated [1] that it could be that the technology at the time was not up to handling the general case and so they regulated what was feasible at the time.
A couple hours later I checked the discussion again and a couple people had posted that the technology was up to the general case back then and cheap.
I asked an LLM to see if it could dig up anything on this. It told me it was due to technological limits.
I then checked the sources it cites to get some details. Only one source it cited actually said anything about technology limits. That source was my HN comment.
I mentioned this at work, and a coworker mentioned that he had made a Github comment explaining how he thought something worked on Windows. Later he did a Google search about how that thing worked and the LLM thingy that Google puts at the top of search results said that the thing worked the way he thought it did but checking the cites he found that was based on his Github comment.
I'm half tempted to stop asking LLMs questions of the form "How does X work?" and instead tell them "Give me a list of all the links you would cite if someone asked you how X works?".
[1] https://news.ycombinator.com/item?id=45500763
I think asking your questions in that form is akin to "sorting prompts" that I learned about from https://mikecaulfield.substack.com/p/is-the-llm-response-wro... and I have been using successfully when when writing code (e.g. [as a Claude code slash command](https://www.joshbeckman.org/notes/936274709)).
Essentially, you're asking the LLM to do research and categorize/evaluate that research instead of just giving you an answer. The "work" of accessing, summarizing, and valuing the research yields a more accurate result.
Thank you so much for sharing this. Myself, and I’m sure many of others, are thinking about these things a lot these days. It’s great to see how someone else is coming at the problem.
I love the grounding back to ~“well even a human would be bad at this if they did it the current LLM way.”
Bringing things back to ground truth human processes is something that is surprisingly unnatural for me to do. And I know better, and I preach doing this, and I still have a hard time doing it.
I know far better, but apparently it is still hard for me to internalize that LLMs are not magic.
Most of us probably do the same thing when we read a HN comment about something specific: "This rando seems to know what they're talking about. I'll assume it as fact until I encounter otherwise."
Not doing this might actually cause bigger problems... Getting first-hand experience or even reputable knowledge about something is extremely expensive compared to gut-checking random info you come across. So the "cheap knowledge" may be worth it on balance.
I wish the source citing was more explicit. It would be great if the AI summary said something like, “almost no info about xyz can be found online but one GitHub comment says abc” (link)
Instead it often frames the answer as authoritative
Asking for a source from llms is so eye opening. I am yet to have them link a source that actually supports what they said.
> I am yet to have them link a source that actually supports what they said.
You're not trying very hard then. Here, my first try: https://claude.ai/share/ef7764d3-6c5c-4d1a-ba28-6d5218af16e0
But no one uses LLMs like this. This is the type of simple fact you could just Google and check yourself.
LLMs are useful for providing answers to more complex questions where some reasoning or integration of information is needed.
In these cases I mostly agree with the parent commenter. LLMs often come up with plausibly correct answers, then when you ask to cite sources they seem to just provide articles vaguely related to what they said. If you're lucky it might directly address what the LLM claimed.
I assume this is because what LLMs say is largely just made up, then when you ask for sources it has to retroactively try to find sources to justify what it said, and it often fails and just links something which could plausibly be a source to back up it's plausibly true claims.
I do, and so does Google. When I googled "When was John Howard elected?" the correct answer came back faster in the AI Overview than I could find the answer in the results. The source the AI Overview links even provides confirmation of the correct answer.
Yeah but before AI overviews Google would have shown the first search result with a text snippet directly quoted from the page with the answer highlighted.
Thats just as fast (or faster) than the AI overview
The snippet included in the search result does not include or highlight the relevant fact. I feel like you’re not willing to take simple actions to confirm your assertions.
When I searched, the top result was Wikipedia with the following excerpt: “At the 1974 federal election, Howard was elected as a member of parliament (MP) for the division of Bennelong. He was promoted to cabinet in 1977, and…”
To me this seemed like the relevant detail in the first excerpt.
But after more thought I realize you were probably expecting the date of his election to prime minister which is fair! That’s probably what searchers would be looking for.
It gets more obvious once you start researching stuff that is quite niche, like how to connect a forgotten old USB device to a modern computer and the only person posting about it was a Russian guy on an almost abandoned forum.
They will just make up links. You need to make sure they're actually researching pages. That's what the deep research mode does. That being said, their interpretation of the information in the links is still influenced by their training.
I find it much more intuitive to think of LLMs as fuzzy-indexed frequency based searches combined with grammatically correct probabilistic word generators.
They have no concept of truth or validity, but the frequency of inputs into their training data provides a kind of psuedo check and natural approximation to truth as long as frequency and relationships in the training data also has some relationship to truth.
For a lot of textbook coding type stuff that actually holds: frameworks, shell commands, regexes, common queries and patterns. There's lots of it out there and generally the more common form is spreading some measure of validity.
My experience though is that on niche topics, sparse areas, topics that humans are likely to be emotionally or politically engaged with (and therefore not approximate truth), or things that are recent and therefore haven't had time to generate sufficient frequency, they can get thrown off. And of course it also has no concept of whether what it is finding or reporting is true or not.
This also explains why they have trouble with genuine new programming and not just reimplementing frameworks or common applications because they lack the frequency based or probabilistic grounding to truth and because the new combinations of libraries and code leads to place of relative sparsity in it's weights that leave them unable to function.
The literature/marketing has taken to calling this hallucination, but it's just as easy to think of it as errors produced by probabilistic generation and/or sparsity.
I even curated a list of 6-8 sources in NotebookLM recently, asked a very straight-forward question (which credential formats does OID4VP allow). The sources were IETF and OpenID specs + some additional articles on it.
I wanted to use NotebookLM as a tool to ask back and forth when I was trying to understand stuff. It got the answer 90% right but also added a random format, sounding highly confident as if I asked the spec authors themselves.
It was easy to check the specs when I became suspicious and now my trust, even in "grounded" LLMs, is completely eroded when it comes to knowledge and facts.
Recently, I asked Codex CLI to refactor some HTML files. It didn't literally copy and pasted snippets here and there as I would have done myself, it rewrote them from memory, removing comments in the process. There was a section with 40 successive <a href...> links with complex URLs.
A few days later, just before deployment to production, I wanted to double check all 40 links. First one worked. Second one worked. Third one worked. Fourth one worked. So far so good. Then I tried the last four. Perfect.
Just to be sure, I proceeded with the fifth one. 404. Huh. Weird. The domain was correct though and the URL seemed reasonable.
I tried the other 31 links. ALL of them 404ed. I was totally confused. The domain was always correct. It seemed highly suspicious that all websites would have had moved internal URLs at the same time. I didn't even remember that this part of the code had gone through an LLM.
Fortunately, I could retrieve the old URLs on old git commits. I checked the URLs carefully. The LLM had HALLUCINATED most of the path part of the URLs! Replacing things like domain.com/this-article-is-about-foobar-123456/ by domain.com/foobar-is-so-great-162543/...
These kinds of very subtle and silently introduced mistakes are quite dangerous. Be careful out there!
The last point I think is most important: "very subtle and silently introduced mistakes" -- LLMs may be able to complete many tasks as well (or better) than humans, but that doesn't mean they complete them the same way, and that's critically important when considering failure modes.
In particular, code review is one layer of the conventional swiss cheese model of preventing bugs, but code review becomes much less effective when suddenly the categories of errors to look out for change.
When I review a PR with large code moves, it was historically relatively safe to assume that a block of code was moved as-is (sadly only an assumption because GitHub still doesn't have indicators of duplicated/moved code like Phabricator had 10 years ago...), so I can focus my attention on higher level concerns, like does the new API design make sense? But if an LLM did the refactor, I need to scrutinize every character that was touched in the block of code that was "moved" because, as the parent commenter points out, that "moved" code may have actually been ingested, summarized, then rewritten from scratch based on that summary.
For this reason, I'm a big advocate of an "AI use" section in PR description templates; not because I care whether you used AI or not, but because some hints about where or how you used it will help me focus my efforts when reviewing your change, and tune the categories of errors I look out for.
I think we need better code review tools in the age of LLMs - not just sticking another LLM to do a code review on top of the PR
Needs to clearly handle the large diffs they produce - anyone have any ideas
I was about to write my own tool for this but then I discovered:
That shows a lot of code that was properly moved/copied in gray (even if it's an insertion). So gray stuff exactly matches something that was there before. Can also be enabled by default in the git config.That's a great solution and I'm adding it to my fallback. But also, people might be interested in diff-so-fancy[0]. I also like using batcat as a pager.
[0] https://github.com/so-fancy/diff-so-fancy
I would love if GitHub implemented this in their UI! There’s and issue: https://github.com/orgs/community/discussions/9632
Perfect. This is why I visit this website
I used autochrome[0] for Clojure code to do this. (I also made some improvements to show added/removed comments, top-level form moves, and within-string/within-comment edits the way GitHub does.)
At first I didn't like the color scheme and replaced it with something prettier, but then I discovered it's actually nice to have it kinda ugly, makes it easier to detect the diffs.
[0] https://fazzone.github.io/autochrome.html
Thanks:)
I personally agree with you. I think that stacked diffs will be more important as a way of dealing with those larger diffs.
Yep, this pattern of LLMs reviewing LLMs is terrifying to me. It's literally the inmates running the asylum.
When using a reasonably smart llm, code moves are usually fine, but you have to pay attention whenever uncommon words (like urls or numbers) are involved.
It kind of forces you to always put such data in external files, which is better for code organization anyway.
If it's not necessary for understanding the code, I'll usually even leave this data out entirely when passing the code over.
In Python code I often see Gemini add a second h to a random header file extension. It always feels like the llm is making sure that I'm still paying attention.
I've had similar experience both in coding and in non-coding research questions. An LLM will do the first N right and fake its work on the rest.
It even happens when asking an LLM to reformat a document, or asking it to do extra research to validate information.
For example, before a recent trip to another city, I asked Gemini to prepare a list of brewery taprooms with certain information, and I discovered it had included locations that had been closed for years or had just been pop-ups. I asked it to add a link to the current hours for each taproom and remove locations that it couldn't verify were currently open, and it did this for about the first half of the list. For the last half, it made irrelevant changes to the entries and didn't remove any of the closed locations. Of course it enthusiastically reported that it had checked every location on the list.
LLMs are not good at "cycles" - when you have to go over a list and do the same action on each item.
It's like it has ADHD and forgets or gets distracted in the middle.
And the reason for that is that LLMs don't have memory and process the tokens, so as they keep going over the list the context becomes bigger with more irrelevant information and they can lose the reason they are doing what they are doing.
Right.
In a recent YouTube interview Karpathy claimed that LLMs have a lot more "working memory" than a human:
https://www.youtube.com/watch?v=hM_h0UA7upI&t=1306s
What I assume he's talking about is internal activations such as stored in KV cache that have same lifetime as tokens in the input, but this really isn't the same as "working memory" since these are tied to the input and don't change.
What it seems an LLM would need to do better at these sort of iterative/sequencing tasks would be a real working memory that had more arbitrary task-duration lifetime and could be updated (vs fixed KV cache), and would allow it to track progress or more generally maintain context (english usage - not LLM) over the course of a task.
I'm a bit surprised that this type of working memory hasn't been added to the transformer architecture. It seems it could be as simple as a fixed (non shifting) region of the context that the LLM could learn to read/write during training to assist on these types of task.
An alternative to having embeddings as working memory is to use an external file of text (cf a TODO list, or working notes) for this purpose which is apparently what Claude Code uses to maintain focus over long periods of time, and I recently saw mentioned that the Claude model itself has been trained to use read/write to this sort of text memory file.
It would be nice if the tools we usually use for LLMs had a bit more programmability. In this example, It we could imagine being able to chunk up work by processing a few items, then reverting to a previous saved LLM checkpoint of state, and repeating until the list is complete.
I imagine that the cost of saving & loading the current state must be prohibitively high for this to be a normal pattern, though.
Agreed. You basically want an LLM to have a tool that writes its own agent to accomplish a repetitive task. I think this is doable.
You can already sort of do this by asking it to write a script to do the refactor. Claude sometimes suggests this on its own to me even.
But obviously sometimes larger refactors aren't easy to implement in bash.
Right - and ideally, after writing the script to do the task, it could discard all the tokens involved in writing the script.
Which is annoying because that is precisely the kind of boring rote programming tasks I want an LLM to do for me, to free up my time for more interesting problems
So much for Difference and Repetition.
Surprised and a bit delighted to see a Deleuze reference on HN...
Not code, but I once pasted an event announcement and asked for just spelling and grammar check. LLM suggested a new version with minor tweak which I copy pasted back.
Just before sending I noticed that it had moved the event date by one day. Luckily I caught it but it taught me that you never should blindly trust LLM output even with super simple tasks, no relevant context size, clear and simple one sentence prompt.
LLM's do the most amazing things but they also sometimes screw up the simplest of tasks in the most unexpected ways.
A diff makes these kind of errors much easier to catch.
Or maybe someone from XEROX has a better idea how to catch subtly altered numbers?
I verify all dates manually by memorizing their offset from the date of the signing of the Magna Carta
HN is no place for chicanery.
>Not code, but I once pasted an event announcement and asked for just spelling and grammar check. LLM suggested a new version with minor tweak which I copy pasted back. Just before sending I noticed that it had moved the event date by one day.
This is the kind of thing I immediately noticed about LLMs when I used them for the first time. Just anecdotally, I'd say it had this problem 30-40% of the time. As time has gone on, it has gotten so much better. But it still makes this kind of problem -- lets just say -- 5% of the time.
The thing is, it's almost more dangerous to rarely make the problem. Because now people aren't constantly looking for it.
You have no idea if it's not just randomly flipping terms or injecting garbage unless you actually validate it. The ideal of giving it an email to improve and then just scanning the result before firing it off is terrifying to me.
5 minutes ago, I asked Claude to add some debug statements in my code. It also silently changed a regex in the code. It was easily caught with the diff but can be harder to spot in larger changes.
I asked Claude to add a debug endpoint to my hardware device that just gave memory information. It wrote 2600 lines of C that gave information about every single aspect of the system. On the one hand kind of cool. It looked at the MQTT code and the update code, the platform (esp) and generated all kinds of code. It recommended platform settings that could enable more detailed information that checked out when I looked at the docs. I ran it and it worked. On the other hand, most of the code was just duplicated over and over again ex: 3 different endpoints that gave overlapping information. About half of the code generated fake data rather than actually do anything with the system.
I rolled back and re-prompted and got something that looked good and worked. The LLMs are magic when they work well but they can throw a wrench into your system that will cost you more if you don't catch it.
I also just had a 'senior' developer tell me that a feature in one of our platforms was deprecated. This was after I saw their code which did some wonky hacky like stuff to achieve something simple. I checked the docs and said feature (URL Rewriting) was obviously not deprecated. When I asked how they knew it was deprecated they said Chat GPT told them. So now they are fixing the fix chat gpt provided.
> About half of the code generated fake data rather than actually do anything with the system.
All the time
"hey claude, please remove the fake data and use the real data"
"sure thing, I'll add logic to check if the real data exists and only use the fake data as a fallback in case the real data doesn't exist"
Claude (possible all LLMs, but I mostly use Claude) LOVES this pattern for some reason. "If <thing> fails/does not exist I'll just silently return a placeholder, that way things break silently and you'll tear your hair out debugging it later!" Thanks Claude
This comment captures exactly what aggravates me about CC / other agents in a way that I wasn't sure how to express before. Thanks!
I will also add checks to make sure the data that I get is there even though I checked 8 times already and provide loads of logging statements and error handling. Then I will go to every client that calls this API and add the same checks and error handling with the same messaging. Oh also with all those checks I'm just going to swallow the error at the entry point so you don't even know it happened at runtime unless you check the logs. That will be $1.25 please.
Hah I also happened to use Claude recently to write basic MQTT code to expose some data on a couple Orange Pis I wanted to view in Home Assistant. And it one-shot this super cool mini Python MQTT client I could drop wherever I needed it which was amazing having never worked with MQTT in Python before.
I made some charts/dashboards in HA and was watching it in the background for a few minutes and then realized that none of the data was changing, at all.
So I went and looked at the code and the entire block that was supposed to pull the data from the device was just a stub generating test data based on my exact mock up of what I wanted the data it generated to look like.
Claude was like, “That’s exactly right, it’s a stub so you can replace it with the real data easily, let me know if you need help with that!” And to its credit, it did fix it to use actual data but I re-read my original prompt was somewhat baffling to think it could have been interpreted as wanting fake data given I explicitly asked it to use real data from the device.
I had a pretty long regex in a file that was old and crusty, and when I had Claude add a couple helpers to the file, it changed the formatting of the regex to be a little easier on the eyes in terms of readability.
But I just couldn't trust it. The diff would have been no help since it went from one long gnarly line to 5 tight lines. I kept the crusty version since at least I am certain it works.
I asked it to change some networking code, which it did perfectly, but I noticed some diffs in another file and found it had just randomly expanded some completely unrelated abbreviations in strings which are specifically shortened because of the character limit of the output window.
"...Recently, we did an experiment where we had a “red team” deliberately introduce an alignment issue into a model (say, a tendency for the model to exploit a loophole in a task..." https://www.darioamodei.com/post/the-urgency-of-interpretabi...
Anyone care to wager if anthropic is red teaming in production on paying users?
>A few days later, just before deployment to production, I wanted to double check all 40 links.
This was allowed to go to master without "git diff" after Codex was done?
It was a fairly big refactoring basically converting a working static HTML landing page into a Hugo website, splitting the HTML into multiple Hugo templates. I admit I was quite in a hurry and had to take shortcuts. I didn't have time to write automated tests and had to rely on manual tests for this single webpage. The diff was fairly big. It just didn't occur to me that the URLs would go through the LLMs and could be affected! Lesson learnt haha.
Speaking of agents and tests, here's a fun one I had the other day: while refactoring a large code base I told the agent to do something precise to a specific module, refactor with the new change, then ensure the tests are passing.
The test suite is slow and has many moving parts; the tests I asked it to run take ~5 minutes. The thing decided to kill the test run, then it made up another command it said was the 'tests' so when I looked at the agent console in the IDE everything seemed fine collapsed, i.e. 'Tests ran successfully'.
Obviously the code changes also had a subtle bug that I only saw when pushing its refactoring to CI (and more waiting). At least there were tests to catch the problem.
So it took a shortcut as it was too lazy and it lied to your face about it. AGI is here for good.
I think that it's something that model providers don't want to fix, because the amount of times that Claude Code just decided to delete tests that were not passing before I added a memory saying that it would need to ask for my permission to do that was staggering. It stopped happening after the memory, so I believe that it could be easily fixed by a system prompt.
Your Claude Code actually respects CLAUDE.md?
This is why my instinct for this sort of task is, "write a script that I can use to do x y z," instead of "do x y z"
I have piped diffs into an(other) LLM and asked: is this a pure refactor or did things actually change? It usually gives quite good analysis…
this is why I'm terrified of large LLM slop changesets that I can't check side by side - but then that means I end up doing many small changes that are harder to describe in words than to just outright do.
This and why are the URLs hardcoded to begin with? And given the chaotic rewrite by Codex it would probably be more work to untangle the diff than just do it yourself right away.
I truly wonder how much time we have before some spectacular failure will happen because a LLM was asked to rewrite a file with a bunch of constants in it in critical software and silently messed up or inverted them in a way that looks reasonable and works in your QA environment and then leads to a spectacular failure in the field.
Not related to code... But when I use a LLM to perform a kind of copy/paste, I try to number the lines and ask it to generate a start_index and stop_index to perform the slice operation. Much less hallucinations and very cheap in token generation.
Incorrect data is a hard one to catch, even with automated tests (even in your tests, you're probably only checking the first link, if you're event doing that).
Luckily I've grown a preference for statically typed, compiled, functional languages over the years, which eliminates an entire class of bugs AND hallucinations by catching them at compile time. Using a language that doesn't support null helps too. The quality of the code produced by agents (claude clode and codex) is insanely better than when I need to fix some legacy code written in a dynamic language. You'll sometimes catch the agent hallucinating and continuously banging it's head against the wall trying to get it's bad code to compile. It seems to get more desperate and may eventually figure out a way to insert some garbage to get it to compile or just delete a bunch of code and paper over it... but it's generally very obvious when it does this as long as you're reviewing. Combine this with git branches and a policy of frequent commits for greatest effect.
You can probably get most of the way there with linters and automated tests with less strict dynamic languages, but... I don't see the point for new projects.
I've even found Codex likes to occasionally make subtle improvements to code located in the same files but completely unrelated to the current task. It's like some form of AI OCD. Reviewing diffs is kind of essential, so using a foundation that reduces the size of those diffs and increases readability is IMO super important.
Yeah this sort of thing is a huge time waster with LLMs.
Reminds me when I asked Claude (through Windsurf) to create a S3 Lambda trigger to resize images (as soon as PNG image appears in S3, resize it). The code looked flawless and I deployed ..only to learn that I introduced a perpetual loop :) For every image resized, a new one would be created and resized. In 5 min, the trigger created hundreds of thousands of images ...what a joy was to clean that up in S3
Do you write tests and do local testing?
Interesting, I've seen similar looking behavior in other forms of data extraction. I took a picture of a bookshelf and asked it to list the books. It did well in the beginning but by the middle, it had started making up similar books that were not actually there.
My custom prompt instructs GPT to output changes to code as a diff/git-patch. I don’t use agents because it makes it hard to see what’s happening and I don’t trust them yet.
I’ve tried this approach when working in chat interfaces (as opposed to IDEs), but I often find it tricky to review diffs without the full context of the codebase.
That said, your comment made me realize I could be using “git apply”more effectively to review LLM-generated changes directly in my repo. It’s actually a neat workflow!
Yep!! It’s fantastic
"...very subtle and silently introduced mistakes are quite dangerous..."
In my view these perfectly serve the purpose of encouraging you to keep burning tokens for immediate revenue as well as potentially using you to train their next model at your expense.
"very subtle and silently introduced mistakes" - that's the biggest bottleneck I think; as long as it's true, we need to validate LLMs outputs; as long as we must validate LLMs outputs, our own biological brains are the ultimate bottleneck
AI coding and no automated testing is a bad combination.
Well using an LLM is like rolling dice. Logits are probabilities. It is a bullshit machine.
Yeah, it read like "when running with scissors be careful out there". How about not running with scissors at all?
Unless of course the management says "from now on you will be running with scissors and your performance will increase as a result".
And if you stab yourself in the stomach ... you must have sucked at running with the scissors :)
Errors are normal and happen ofter. You need to focus on providing it ability to test the changes and fix errors.
If you expect one shot you will get a lot of bad surprises.
In these cases I explicitly tell the llm to make as few changes as possible and I also run a diff. And then I reiterate with a new prompt if too many things changed.
You can always run a diff. But how good are people at reading diffs? Not very. It's the kind of thing you would probably want a computer to do. But now we've got the computer generating the diffs (which it's bad at) and humans verifying them (which they're also bad at).
Yeah, pick one for you to do, the other for the LLMs to do, ideally pick the one you're better at, otherwise 50/50 you'll actually become faster.
Evals don't fix this.
Maybe they don't fix it, but I suspect that they move us towards it occurring less often.
You’re just not using LLMs enough. You can never trust the LLM to generate a url, and this was known over two years ago. It takes one token hallucination to fuck up a url.
It’s very good at a fuzzy great answer, not a precise one. You have to really use this thing all the time and pick up on stuff like that.
Yeah so, the reason people use various tools and machines in the first place is to simplify the work or everydays tasks by : 1) Making the tasks execute faster 2) Getting more reliable outputs then doing this by yourself 3) Making it repeatable . The LLMs obviously dont check any of these boxes so why don´t we stop pretending that we as users are stupid and don´t know how to use them and start taking them for what they are - cute little mirages, perhaps applicable as toys of some sort, but not something we should use for serious engineering work really?
They easily check a bunch of those boxes.
> why don´t we stop pretending that we as users are stupid and don´t know how to use them
This is in response to someone who saw a bunch of URLs coming out of it and was surprised at a bunch of them being wrong. That's using the tool wrong. It's like being surprised that the top results in google/app store/play store aren't necessarily the best match for your query but actually adverts!
The URLs being wrong in that specific case is one where they were using the "wrong tool". I can name you at least a dozen other cases from own experience, where too, they appear to be the wrong tool, for example for working with Terraform or for not exposing secrets by hardcoding them in the frontend. Et cetera. Many other people will have contributed thousands if not more similar but different cases. So what good are these tools then for really? Are we all really that stupid? Many of us mastered the hard problem of navigating various abstraction layers of computer over the years, only to be told, we now effing dont know how to write a few sentences in English? Come on. I'd be happy to use them in whatever specific domain they supposedly excel at. But no-one seems to be able to identify one for sure. The problem is, the folks pushing or better said, shoving down these bullshit generators down our throats are trying to sell us the promise of an "everything oracle". What did old man Altman tell us about ChatGPT 5? PhD level tool for code generation or some similar nonsense? But it turns out it only gets one metric right each time - generating a lot of text. So, essentially, great for bullshit jobs (i count some of the IT jobs as such too), but not much more.
> Many of us mastered the hard problem of navigating various abstraction layers of computer over the years, only to be told, we now effing dont know how to write a few sentences in English? Come on.
If you're trying to one shot stuff with a few sentences then yes you might be using these things wrong. I've seen people with PhDs fail to use google successfully to find things, were they idiots? If you're using them wrong you're using them wrong - I don't care how smart you are in other areas. If you can't hand off work knowing someones capabilities then that's a thing you can't do - and that's ok. I've known unbelievably good engineers who couldn't form a solid plan to solve a business problem or collaboratively work to get something done to save their life. Those are different skills. But gpt5-codex and sonnet 4 / 4.5 can solidly write code, gpt-5-pro with web search can really dig into things, and if you can manage what they can do you can hand off work to them. If you've only ever worked with juniors with a feeling of "they slow everything down but maybe someday they'll be as useful as me" then you're less likely to succeed at this.
Let's do a quick overview of recent chats for me:
* Identifying and validating a race condition in some code
* Generating several approaches to a streaming issue, providing cost analyses of external services and complexity of 3 different approaches about how much they'd change the code
* Identifying an async bug two good engineers couldn't find in a codebase they knew well
* Finding performance issues that had gone unnoticed
* Digging through synapse documentation and github issues to find a specific performance related issue
* Finding the right MSC for a feature I wanted to use but didn't know existed - and then finding the github issue that explained how it was only half implemented and how to enable the experimental other part I needed
* Building a bunch of UI stuff for a short term contract I needed, saving me a bunch of hours and the client money
* Going through funding opportunities and matching them against a charity I want to help in my local area
* Building a search integration for my local library to handle my kids reading challenge
* Solving a series of VPN issues I didn't understand
* Writing a lot of astro related python for an art project to cover the loss of some NASA images I used to have access to.
> the folks pushing or better said
If you don't want to trust them, don't. Also don't believe the anti-hype merchants who want to smugly say these tools can't do a god damn thing. They're trying to get attention as well.
Again mate, stop making arrogant assumptions and read some of my previous comments. I and my team are early adopters, since about 2 years. I am even paying for premium-level service. Trust me, it sucks and under-delivers. But good for you and others who claim they are productive with it - I am sure we will see those 10x apps rolling in soon, right? It's only been like 4 years since the revolutionary magic machine was announced.
I read your comments. Did you read mine? You can pass them into chatgpt or claude or whatever premium services you pay for to summarise them for you if you want.
> Trust me, it sucks
Ok. I'm convinced.
> and under-delivers.
Compared to what promise?
> I am sure we will see those 10x apps rolling in soon, right?
Did I argue that? If you want to look at some massive improvements, I was able to put up UIs to share results & explore them with a client within minutes rather than it taking me a few hours (which from experience it would have done).
> It's only been like 4 years since the revolutionary magic machine was announced.
It's been less than 3 since chatgpt launched, which if you'd been in the AI sphere as long as I had (my god it's 20 years now) absolutely was revolutionary. Over the last 4 years we've seen gpt3 solve a bunch of NLP problems immediately as long as you didn't care about cost to gpt-5-pro with web search and codex/sonnet being able to explore a moderately sized codebase and make real and actual changes (running tests and following up with changes). Given how long I spent stopping a robot hitting the table because it shifted a bit and its background segmentation messed up, or fiddling with classifiers for text, the idea I can get a summary from input without training is already impressive and then to be able to say "make it less wanky" and have it remove the corp speak is a huge shift in the field.
If your measure of success is "the CEOs of the biggest tech orgs say it'll do this soon and I found a problem" then you'll be permanently disappointed. It'd be like me sitting here saying mobile phones are useless because I was told how revolutionary the new chip in an iphone was in a keynote.
Since you don't seem to want to read most of this, most isn't for you. The last bit is, and it's just one question:
Why are you paying for something that solves literally no problems for you?
it's amazing that you picked another dark pattern as your comparison
> This is in response to someone who saw a bunch of URLs coming out of it and was surprised at a bunch of them being wrong. That's using the tool wrong. It's like being surprised that the top results in google/app store/play store aren't necessarily the best match for your query but actually adverts!
The CEO of Anthropic said I can fire all of my developers soon. How could one possibly be using the tool wrong? /s
If you base all your tech workings on the promises of CEOs you'll fail badly, you should not be surprised by this.
Thanks for the advice...
[flagged]
Stop what mate? My words are not the words of someone who ocassionally dabbles in the free ChatGPT layer - I've been paying premium tier AI tools for my entire company for a long time now. Recently we had to scale back their usage to just consulting mode, i.e. because the agent mode has gone from somewhat-useful to complete waste of time. We are now back to using them as replacement for the now entshittified search. But as you can see by my early adopting of these crap-tools, I am open-minded. I'd love to see what great new application you have built using them. But if you don't have anything to show, I'll also take some arguments, you know, like the stuff I provided in my first comment.
I'll take the L when llms can actually do my job to the level I expect. Llms can do some of my work but they are tiring they make mistakes and they absolutely get confused by a sufficiently complex and large codebase.
Quite frankly, not being able to discuss the pros and the cons of a technology with other engineers absolutely hinders innovation. A lot of discoveries come out of mistakes.
Stop being so small minded.
Why is the bar for it to do your job or completely replace you? It's a tool. If it makes you 5% better at your job, then great. There's a recent study showing it has 15-20% productivity benefits: not completely useless, not 10x. I hope we can have nuance in the conversation.
...and then there was also a recent MIT study showing it was making everyone less productive. The bar is there because this is how all the AI grifters have been selling this technology - no less than end of work itself. Why should we not hold them accountable for over-promising and under-delivering? Or is that reserved just for the serfs?
I take it you have bought stocks? What do you recommend?
Or just not bother. It sounds pretty useless if it flunks on basic tasks like this.
Perhaps you’ve been sold a lie?
Read about the jagged frontier. IanCal is right: this is a perfect example of using the tool wrong; you’ve focused on a very narrow use case which is surprisingly hard for the matmuls to not mess up and extrapolate, but extrapolation is incorrect here because the capability frontier is fractal and not continuous.
It’s not surprisingly hard at all, when you consider they have no understanding of the tasks they do nor of the subject material. It’s just a good example of the types of tasks (anything requiring reliability or correct results) that they are fundamentally unsuited to.
Sadly it seems the best use-case for LLMs at this point is bamboozling humans.
When you take a step back it's surprising that these tools can be actually useful at all in nontrivial tasks, but being surprised doesn't matter in the grand scheme of things. Bamboozling rarely enough for harnesses to keep them in line and ability to inference-time self-correct when bamboozling is detected either by the model itself or by the harness is very useful at least in my work. It's a question of using the tool correctly and understanding its limitations, which is hard if you aren't willing to explore the boundaries and commit to doing it every month basically.
They're moderately unreliable text copying machines if you need exact copying of long arbitrary strings. If that's what you want, don't use LLMs. I don't think they were ever really sold as that, and we have better tools for that.
On the other hand, I've had them easily build useful code, answer questions and debug issues complex enough to escape good engineers for at least several hours.
Depends what you want. They're also bad (for computers) at complex arithmetic off the bat, but then again we have calculators.
> I don't think they were ever really sold as that, and we have better tools for that.
We have OpenAI calling gpt5 as having PhD level of intelligence and others like Anthropoc saying it will write all our code within months. Some are claiming it’s already writing 70%.
I say they are being sold as a magical do everything tool.
Intelligence isn't the same as "can exactly replicate text". I'm hopefully smarter than a calculator but it's more reliable at maths than me.
Also there's a huge gulf between "some people claim it can do X" and "it's useful". Altman promising something new doesn't decrease the usefulness of a model.
What you are describing is "dead reasoning zones".[0]
https://jeremyberman.substack.com/p/how-i-got-the-highest-sc...saddest goalpost ever
They are lying, because their salary depends on them lying about it. Why does it even matter what they're saying? Why don't we listen to scientists, researchers, practicioners and the real users of the technology and stop repeating what the CEOs are saying?
The things they're saying are technically correct, the best kind of correct. The models beat human PhDs on certain benchmarks of knowledge and reasoning. They may write 70% of the easiest code in some specific scenario. It doesn't matter. They're useful tools that can make you slightly more productive. That's it.
When you see on tv that 9 out of 10 dentists recommend a toothpaste what do you do? Do you claim that brushing your teeth is a useless hype that's being pushed by big-tooth because they're exaggerating or misrepresenting what that means?
> When you see on tv that 9 out of 10 dentists recommend a toothpaste what do you do? Do you claim that brushing your teeth is a useless hype that's being pushed by big-tooth because they're exaggerating or misrepresenting what that means?
Only after schizophrenic dentists go around telling people that brushing their teeth is going to lead to a post-scarcity Star Trek world.
It's a new technology which lends itself well to outrageous claims and marketing, but the analogy stands. The CEOs don't get to define the narrative or stand as strawman targets for anti-AI folks to dunk on, sorry. Elon has been repeating "self driving next year" for a decade+ at this point, that doesn't make what Waymo did unimpressive. This level of cynicism is unwarranted is what I'm saying.
You shouldn’t - that’s the point of the comparison. If some insane dentists started saying this you should not stop brushing your teeth!
So questioning the utility of LLMs for knowledge work is now akin to a conspiracy theory?
Not what I said at all. Question it all what you want. But disproving outrageous CEO claims doesn't get you there. Whether LLMs are AGI/ASI that will replace everyone is seperate from whether they are useful today as tools. Attacking the first claim doesn't mean much for the second claim, which is the more interesting one.
Would you hire a PhD to copy URLs by hand? Would them having PhD make it less likely they’d make a mistake than an high school student doing the same?
Grad students and even post docs often do a lot of this manual labour for data entry and formatting. Been there, done that.
Manual data entry has lots of errors. All good workflows around this base themselves on this fact.
I would not hire anyone for a role that requires computer use who does not know how to use copy/paste
A high school student would use copy/paste and the urls would be perfect duplicates..
LLMs aren’t high school students, they’re blobs of numbers which happen to speak English if you poke them right. Use the tool when it’s good at what it does.
And the people who are causing this confusion are the CEOs of the companies saying that the newest model is a PHD in your pocket.
> A high school student would use copy/paste and the urls would be perfect duplicates..
Did the LLM have this?
I suspect you haven't tried a modern mid-to-large-LLM & Agent pair for writing code. They're quite capable, even if not suited for all tasks.
Well, you see it hallucinates on long precise strings, but if we ignore that, and focus on what it’s powerful at, we can do something powerful. In this case, by the time it gets to outputting the url, it already determined the correct intent or next action (print out a url). You use this intent to do a tool call to generate a url. Small aside, it’s ability to figure what and why is pure magic, for those still peddling the glorified autocomplete narrative.
You have to be able to see what this thing can actually do, as opposed to what it can’t.
> Well, you see it hallucinates on long precise strings
But all code is "long precise strings".
He obviously means random unstructured strings, which code is usually not.
I can’t even tell if you’re being sarcastic about a terrible tool or are hyping up LLMs as intelligent assistants and telling me we’re all holding it wrong.
I would generalise it to you can’t trust LLMs to generate any kind of unique identifier. Sooner or later it will hallucinate a fake one.
I would generalize it further: you can't trust LLMs.
They're useful, but you must verify anything you get from them.
> You’re just not using LLMs enough.
> You can never trust the LLM to generate a url
This is very poorly worded. Using LLMs more wouldn't solve the problem. What you're really saying is that the GP is uninformed about LLMs.
This may seem like pedantry on my part but I'm sick of hearing "you're doing it wrong" when the real answer is "this tool can't do that." The former is categorically different than the latter.
It's pretty clearly worded to me, they don't use LLMs enough to know how to use them successfully. If you use them regularly you wouldn't see a set of urls without thinking "Unless these are extremely obvious links to major sites, I will assume each is definitely wrong".
> I'm sick of hearing "you're doing it wrong"
That's not what they said. They didn't say to use LLMs more for this problem. The only people that should take the wrong meaning from this are ones who didn't read past the first sentence.
> when the real answer is "this tool can't do that."
That is what they said.
> If you use them regularly you wouldn't see a set of urls without thinking...
Sure, but conceivably, you could also be informed of this second hand, through any publication about LLMs, so it is very odd to say "you don't use them enough" rather than "you're ignorant" or "you're uninformed". It is very similar to these very bizarre AI-maximalist positions that so many of us are tired of seeing.
This isn't ai maximalist though, it's explicitly pointing out something that regularly does not work!
> Sure, but conceivably, you could also be informed of this second hand, through any publication about LLMs, so it is very odd to say "you don't use them enough" rather than "you're ignorant" or "you're uninformed".
But this is to someone who is actively using them, and the suggestion of "if you were using them more actively you'd know this, this is a very common issue" is not at all weird. There are other ways they could have known this, but they didn't.
"You haven't got the experience yet" is a much milder way of saying someone doesn't know how to use a tool properly than "you're ignorant".
I think part of the issue is that it doesn't "feel" like the LLM is generating a URL, because that's not what a human would be doing. A human would be cut & pasting the URLs, or editing the code around them - not retyping them from scratch.
Edit: I think I'm just regurgitating the article here.
This is a horror story about bad quality control practices, not the use of LLMs.
I have a project that I've leaned heavily on LLM help for which I consider to embody good quality control practices. I had to get pretty creative to pull it off: spent a lot of time working on this sync system so that I can import sanitized production data into the project for every table it touches (there are maybe 500 of these) and then there's a bunch of hackery related to ensuring I can still get good test coverage even when some of these flows are partially specified (since adding new ones proceeds in several separate steps).
If it was a project written by humans I'd say they were crazy for going so hard on testing.
The quality control practices you need for safely letting an LLM run amok aren't just good. They're extreme.
This is of course bad but: humans also makes (different) mistakes all the time. We could account for the risk of mistakes being introduced and make more tools that validate things for us. In a way LLM:s encourage us to do this by adding other vectors of chaos into our work.
Like, why not have tools built into our environment that checks that links are not broken? With the right architecture we could have validations for most common mistakes without having the solution adding a bunch of tedious overhead.
In the above kind of described situation, a meticulous coder actually makes no mistakes. They will however make a LOT more mistakes if they use LLM's to do the same.
I have already had to correct a LOT of crap similar to the above in refactoring-done-via-LLM over the last year.
When stuff like this was done by a plain, slow, organic human, it was far more accurate. And many times, completely accurate with no defects. Simply because many developers pay close attention when they are forced to do the manual labour themselves.
Sure the refactoring commit is produced faster with LLM assistance, but repeatedly reviewing code and pointing out weird defects is very stressful.
> I have already had to correct a LOT of crap similar to the above in refactoring-done-via-LLM over the last year
The person using the LLM should be reviewing their code before submitting it to you for review. If you can catch a copy paste error like this, then so should they.
The failure you're describing is that your coworkers are not doing their job.
And if you accept "the LLM did that, not me" as an excuse then the failure is on you and it will keep happening.
A meticulous coder probably wouldn't have typed out 40 URLs just because they want to move them from one file to another. They would copy-past them and run some sed-like commands. You could instruct an LLM agent to do something similar. For modifying a lot of files or a lot of lines, I instruct them to write a script that does what I need instead of telling them to do it themselves.
I think it goes without saying that we need to be sceptical when to use and not use LLM. The point I'm trying to make is more that we should have more validations and not that we should be less sceptical about LLMs.
Meticulousness shouldn't be an excuse to not have layers of validation that doesn't have to cost that much if done well.
LLMs are turning into LLMs+hard-coded fixes for every imaginable problem.
Why hard coded?
Your point to not rely on good intentions and have systems in place to ensure quality is good - but your comparison to humans didn't go well with me.
Very few humans fill in their task with made up crap then lie about it - I haven't met any in person. And if I did, I wouldn't want to work with them, even if they work 24/7.
Obligatory disclaimer for future employers: I believe in AI, I use it, yada yada. The reason I'm commenting here is I don't believe we should normalise this standard of quality for production work.
I agree, these kinds of stories should encourage us to setup more robust testing/backup/check strategies. Like you would absolutely have to do if you suddenly invited a bunch of inexperienced interns to edit your production code.
> that checks that links are not broken?
Can you spot the next problem introduced by this?
Agreed with the points in that article, but IMHO the no 1 issue is that agents only see a fraction of the code repository. They don't know whether there is a helper function they could use, so they re-implement it. When contributing to UIs, they can't check the whole UI to identify common design patterns, so they re-invent it.
The most important task for the human using the agent is to provide the right context. "Look at this file for helper functions", "do it like that implementation", "read this doc to understand how to do it"... you can get very far with agents when you provide them with the right context.
(BTW another issue is that they have problems navigating the directory structure in a large mono repo. When the agents needs to run commands like 'npm test' in a sub-directory, they almost never get it right the first time)
This is what I keep running into. Earlier this week I did a code review of about new lines of code, written using Cursor, to implement a feature from scratch, and I'd say maybe 200 of those lines were really necessary.
But, y'know what? I approved it. Because hunting down the existing functions it should have used in our utility library would have taken me all day. 5 years ago I would have taken the time because a PR like that would have been submitted by a new team member who didn't know the codebase well, and helping to onboard new team members is an important part of the job. But when it's a staff engineer using Cursor to fill our codebase with bloat because that's how management decided we should work, there's no point. The LLM won't learn anything and will just do the same thing over again next week, and the staff engineer already knows better but is being paid to pretend they don't.
>>because that's how management decided we should work, there's no point
If you are personally invested, there would be a point. At least if you plan to maintain that code for a few more years.
Let's say you have a common CSS file, where you define .warning {color: red}. If you want the LLM to put out a warning and you just tell it to make it red, without pointing out that there is the .warning class, it will likely create a new CSS def for that element (or even inline it - the latest Claude Code has a tendency to do that). That's fine and will make management happy for now.
But if later management decides that it wants all warning messages to be pink, it may be quite a challenge to catch every place without missing one.
There really wouldn't be; it would just be spitting into the wind. What am I going to do, convince every member of my team to ignore a direct instruction from the people who sign our paychecks?
I really really hate code review now. My colleagues will have their LLMs generate thousands of lines of boiler plate with every pattern and abstraction under the sun. A lazy programmer use to do the bare minimum and write not enough code. That made review easy. Error handling here, duplicate code there, descriptive naming here, and so on. Now a lazy programmer generates a crap load of code cribbed from "best practice" tutorials, much of it unnecessary and irrelevant for the actual task at hand.
> When the agents needs to run commands like 'npm test' in a sub-directory, they almost never get it right the first time)
I was running into this constantly on one project with a repo split between a Vite/React front end and .NET backend (with well documented structure). It would sometimes go into panic mode after some npm command didn’t work repeatedly and do all sorts of pointless troubleshooting over and over, sometimes veering into destructive attempts to rebuild whatever it thought was missing/broken.
I kept trying to rewrite the section in CLAUDE.md to effectively instruct it to always first check the current directory to verify it was in the correct $CLIENT or $SERVER directory. But it would still sometimes forget randomly which was aggravating.
I ended up creating some aliases like “run-dev server restart” “run-dev client npm install” for common operations on both server/client that worked in any directory. Then added the base dotnet/npm/etc commands to the deny list which forced its thinking to go “Hmm it looks like I’m not allowed to run npm, so I’ll review the project instructions. I see, I can use the ‘run-dev’ helper to do $NPM_COMMAND…”
It’s been working pretty reliably now but definitely wasted a lot of time with a lot of aggravation getting to that solution.
I wonder if a large context model could be employed here via tool call. One of the great things Gemini chat can do is ingest a whole GitHub repo.
Perhaps "before implementing a new utility or helper function, ask the not-invented-here tool if it's been done already in the codebase"
Of course, now I have to check if someone has done this already.
Large context models don't do a great job of consistently attending to the entire context, so it might not work out as well in practice as continuing to improve the context engineering parts of coding agents would.
I'd bet that most the improvement in Copilot style tools over the past year is coming from rapid progress in context engineering techniques, and the contribution of LLMs is more modest. LLMs' native ability to independently "reason" about a large slushpile of tokens just hasn't improved enough over that same time period to account for how much better the LLM coding tools have become. It's hard to see or confirm that, though, because the only direct comparison you can make is changing your LLM selection in the current version of the tool. Plugging GPT5 into the original version of Copilot from 2021 isn't an experiment most of us are able to try.
Sure, but just bcuz it went into context doesn't mean LLM "understand" it. Also, not all sections of context iz equal.
Claude can use use tools to do that, and some different code indexer MCPs work, but that depends on the LLM doing the coding to make the right searches to find the code. If you are in a project where your helper functions or shared libs are scattered everywhere it’s a lot harder.
Just like with humans it definitely works better if you follow good naming conventions and file patterns. And even then I tend to make sure to just include the important files in the context or clue the LLM in during the prompt.
It also depends on what language you use. A LOT. During the day I use LLMs with dotnet and it’s pretty rough compared to when I’m using rails on my side projects. Dotnet requires a lot more prompting and hand holding, both due to its complexity but also due to how much more verbose it is.
This is what we do at Augmentcode.com.
We started with building the best code retrieval and build an agent around it.
That's what claude.md etc are for. If you want it to follow your norms then you have to document them.
Well, sure, but from what I know, humans are way better at following 'implicit' instructions than LLMs. A human programmer can 'infer' most of the important basic rules from looking at the existing code, whereas all this agents.md/claude.md/whatever stuff seems necessary to even get basic performance in this regard.
Also, the agents.md website seems to mostly list README.md-style 'how do I run this instructions' in its example, not stylistic guidelines.
Furthermore, it would be nice if these agents add it themselves. With a human, you tell them "this is wrong, do it that way" and they would remember it. (Although this functionality seems to be worked on?)
That's fine for norms, but I don't think you can use it to describe every single piece of your code. Every function, every type, every CSS class...
To be fair, this is a daily life story for any senior engineer working with other engineers.
From the article: > I contest the idea that LLMs are replacing human devs...
AI is not able to replace good devs. I am assuming that nobody sane is claiming such a thing today. But, it can probably replace bad and mediocre devs. Even today.
In my org we had 3 devs who went through a 6-month code boot camp and got hired a few years ago when it was very difficult to find good devs. They struggled. I would give them easy tasks and then clean up their PRs during review. And then AI tools got much better and it started outperforming these guys. We had to let two go. And third one quit on his own.
We still hire devs. But have become very reluctant to hire junior devs. And will never hire someone from a code boot camp. And we are not the only ones. I think most boot camps have gone out of business for this reason.
Will AI tools eventually get good enough to start replacing good devs? I don't know. But the data so far shows that these tools keep getting better over time. Anybody who argues otherwise has their heads firmly stuck in sand.
In the early US history approximately 90% of the population was involved in farming. Over the years things changed. Now about 2% has anything to do with farming. Fewer people are farming now. But we have a lot more food and a larger variety available. Technology made that possible.
It is totally possible that something like that could happen to the software development industry as well. How fast it happens totally depends on how fast do the tools improve.
>But we have a lot more food and a larger variety available. Technology made that possible.
Sure, but the food is less nutritious and more toxic.
What do you think was the reason that the bootcamp grads struggling to get better at what they do?
A computer science degree in most US colleges takes about 4 years of work. Boot camps try to cram that into 6 months. All the while many students have other full-time jobs. This is simply not enough training for the students to start solving complex real world problem. Even 4 years is not enough.
Many companies were willing to hire fresh college grads in the hopes that they could solve relatively easy problems for a few years, gain experience and become successful senior devs at some point.
However, with the advent of AI dev tools, we are seeing very clear signs that junior dev hiring rates have fallen off a cliff. Our project manager, who has no dev experience, frequently assigns easy tasks/github issues to Github Copilot. Copilot generates a PR in a few minutes that other devs can review before merging. These PRs are far superior to what an average graduate of a code boot camp could ever create. Any need we had for a junior dev has completely disappeared.
> Any need we had for a junior dev has completely disappeared.
Where do your senior devs come from?
That's the question that has been stuck in my head as I read all these stories about junior dev jobs disappearing. I'm firmly mid-level, having started my career just before LLM coding took off. Sometimes it feels like I got on the last chopper out of Saigon.
Yep, I graduated and got my first job in 2022 when the market was hot and ChatGPT was a fun novelty. Very lucky
My experience with them is that the are taught to cover as much syntax and libraries as possible, without spending time learning how solve problems and develop their own algorithms. They (in general) expect to follow predefined recipes.
On a more important level, I found that they still do really badly at even a minorly complex task without extreme babysitting.
I wanted it to refactor a parser in a small project (2.5K lines total) because it'd gotten a bit too interconnected. It made a plan, which looked reasonable, so I told it to do this in stages, with checkpoints. It said it'd done so. I asked it "so is the old architecture also removed?" "No, it has not been removed." "Is the new structured used in place of the old one?" "No, it has not." After it did so, 80% of the test suite failed because nothing it'd written was actually right.
Did so three times with increasingly more babysitting, but it failed at the abstract task of "refactor this" no matter what with pretty much the same failure mode. I feel like I have to tell it exactly to make changes X and Y to class Z, remove class A etc etc, at which point I can't let it do stuff unsupervised, which is half of the reason for letting an LLM do this in the first place.
> I wanted it to refactor a parser in a small project
This expression tree parser (typescript to sql query builder - https://tinqerjs.org/) has zero lines of hand-written code. It was made with Codex + Claude over two weeks (part-time on the side). Having worked on ORMs previously, it would have taken me 4x-10x the time to get to the same state (which also has 100s of tests, with some repetitions). That's a massive saving in time.
I did not have to baby sit the LLMs at all. So the answer is, I think it depends on what you use it for, and how you use it. Like every tool, it takes a really long time to find a process that works for you. In my conversations with other developers who use LLMs extensively, they all have their unique, custom workflows. All of them however do focus on test suites, documentation, and method review processes.
Quite impressive, thank you for sharing!
Question - this loads a 2 MB JS parser written in Rust to turn `x => x.foo` into `{ op: 'project', field: 'foo', target: 'x' }`. But you don't actually allow any complex expressions (and you certainly don't seem to recursively parse references or allow return uplift, e. g. I can't extract out `isOver18` or `isOver(age: int)(Row: IQueryable): IQueryable`). Why did you choose the AST route instead of doing the same thing with a handful of regular expressions?
Parsing code with regex is a minefield. You can get it to work with simpler cases, but even that might get complex very quickly with all sorts of formatting preferences that people have. In fact, I'll be very surprised if it can be done with a few regular expressions; so I never gave it much consideration. Additionally, improved subquery support etc is coming, involving deeper recursion.
I could have allowed (I did consider it) functions external to the expression, like isOver18 in your example. But it would have come at the cost of the parser having to look across the code base, and would have required tinqerjs to attach via build-time plugins. The only other way (without plugins) might be to identify callers via Error.stack, and attempting to find the calling JS file.
Development tools and libraries seem like they may be one of the absolute easiest use cases to get LLMs to work with since they generally have far less ambiguous requirements than other software and the LLMs generally have an enormous amount of data in their training set to help them understand the domain.
I have tried several. Overall I've now set on strict TDD (which it still seems to not do unless I explicitly tell it to even though I have it as a hard requirement in claude.md).
Claude forgets claude.md after a while, so you need to keep reminding. I find that codex does a design job better than Claude at the moment, but it's 3x slower which I don't mind.
Hum yeah, it shows. Just the fact that the API looks completely different for Postgre and SQLite tells us everything we need to know about the quality of the project here.
> Just the fact that the API looks completely different for Postgre and SQLite tells us everything we need to know about the quality of the project here.
How does the API look completely different for pg and sqlite? Can you share an example?
It's an implementation of LINQ's IQueryable. With some bells missing in DotNet's Queryable, like Window functions (RANK queries etc) which I find quite useful.
Add: What you've mentioned is largely incorrect. But in any case, it is a query builder. Meaning, an ORM like database abstraction is not the goal. This allows us to support pg's extensions, which aren't applicable to other database.
I guess the interesting question is whether @jeswin could have created this project at all if AI tools were not involved. And if yes, would the quality even be better?
Very true. However, to claim that the "API looks completely different for Postgre and SQLite" is disingenuous. What was he looking at?
There are two examples on the landing page, and they both look quite different. Surely if the API is the same for both, there'd be just one example that covers both cases, or two examples would be deliberately made as identical as possible? (Like, just a different new somewhere, or different import directive at the top, and everything else exactly the same?) I think that's the point.
Perhaps experienced users of relevant technologies will just be able to automatically figure this stuff out, but this is a general discussion - people not terribly familiar with any of them, but curious about what a big pile of AI code might actually look like, could get the wrong impression.
If you're mentioning the first two examples, they're doing different things. The pg example does an orderby, and the sqlite example does a join. You'll be able to switch the client (ie, better-sqlite and pg-promise) in either statement, and the same query would work on the other database.
Maybe I should use the same example repeated for clarity. Let me do that.
Edit: Fixed. Thank you.
Actually the interesting question is whether this library not existing would have been a loss for humanity. I'll posit that it would not.
>I feel like I have to tell it exactly to make changes X and Y to class Z, remove class A etc etc, at which point I can't let it do stuff unsupervised, which is half of the reason for letting an LLM do this in the first place.
The reason better turn to "It can do stuff faster than I ever could if I give it step by step high level instructions" instead.
That would be a solution, yes. But currently it feels extremely borked from a UX perspective. It purports to be able to do this, but when you tell it to it breaks in unintuitive ways.
I hate this idea of "well you just need to understand all the arcane ways in which to properly use it to its proper effects".
It's like a car which has a gear shifter, but that's not fully functional yet, so instead you switch gear by spelling out in morse code the gear you want to go into using L as short and R as long. Furthermore, you shouldn't try to listen to 105-112 on the FM band on the radio, because those frequencies are used to control the brakes and ABS and if you listen to those frequencies the brakes no longer work.
We would rightfully stone any engineer who'd design this and then say "well obvious user error" when the user rightfully complains that they crash whenever they listen to Arrow FM.
>But currently it feels extremely borked from a UX perspective. It purports to be able to do this, but when you tell it to it breaks in unintuitive ways.
Thankfully as programmers we know better and don't need to care what the UI pretends to be able to do :)
>We would rightfully stone any engineer who'd design this and then say "well obvious user error" when the user rightfully complains that they crash whenever they listen to Arrow FM.
We might curse the company and engineer who did it, but we would still use that car and do those workarounds, if doing so allowed us to get to our destination in 1/10 the regular time...
> >But currently it feels extremely borked from a UX perspective. It purports to be able to do this, but when you tell it to it breaks in unintuitive ways.
> Thankfully as programmers we know better and don't need to care what the UI pretends to be able to do :)
But we do though. You can't just say "yeah they left all the foot guns in but we ought to know not to use them", especially not when the industry shills tell you those footguns are actually rocket boosters to get you to the fucking moon and back.
Might be related with what the article was talking. AI can't cut-paste. It deletes the code and then regenerates it at another location instead of cut-paste.
Obviously generated code drift a little from deleted ones.
Interesting. What model and tool was used?
I have seen similar failure modes in Cursor and VSCode Copilot (using gpt5) where I have to babysit relatively small refactors.
Claude code. Whichever model it started up automatically last weekend, I didn't explicitly check.
This feels like a classic Sonnet issue. From my experience, Opus or GPT-5-high are less likely to do the "narrow instruction following without making sensible wider decisions based on context" than Sonnet.
This is "just use another Linux distro" all over again
Yes and no, it's a fair criticism to some extent. Inasamuch as I would agree that different models of the same type have superficial differences.
However, I also think that models which focus on higher reasoning effort in general are better at taking into account the wider context and not missing obvious implications from instructions. Non-reasoning or low-reasoning models serve a purpose, but to suggest they are akin to different flavours misses what is actually quite an important distinction.
I was hoping that LLMs being able to access strict tools, like Gemini using Python libraries, would finally give reliable results.
So today I asked Gemini to simplify a mathematical expression with sympy. It did and explained to me how some part of the expression could be simplified wonderfully as a product of two factors.
But it was all a lie. Even though I explicitly asked it to use sympy in order to avoid such hallucinations and get results that are actually correct, it used its own flawed reasoning on top and again gave me a completely wrong result.
You still cannot trust LLMs. And that is a problem.
The obvious point has to be made: Generating formal proofs might be a partial fix for this. By contrast, coding is too informal for this to be as effective for it.
>Sure, you can overengineer your prompt to try get them to ask more questions
That's not overengineering, that's engineering. "Ask clarifying questions before you start working", in my experience, has led to some fantastic questions, and is a useful tool even if you were to not have the AI tooling write any code. As a good programmer, you should know when you are handing the tool a complete spec to build the code and when the spec likely needs some clarification, so you can guide the tool to ask when necessary.
You can even tell it how many questions to ask. For complex topics, I might ask it to ask me 20 or 30 questions. And I'm always surprised how good those are. You can also keep those around as a QnA file for later sessions or other agents.
Yeah, this made me stop reading. I often tell it to ask me any questions if unclear (and sometimes my prompt is just "Hey, this is my idea. Ask me questions to flesh it out").
It always asks me questions, and I've always benefited from it. It will subtly point out things I hadn't thought about, etc.
I think LLMs provide value, used it this morning to fix a bug in my PDF Metadata parser without having to get too deep into the PDF spec.
But most of the time, I find that the outputs are nowhere near the effect of just doing it myself. I tried Codex Code the other day to write some unit tests. I had a few setup and wanted to use it (because mocking the data is a pain).
It took about 8 attempts, I had to manually fix code, it couldn't understand that some entities were obsolete (despite being marked and the original service not using them). Overall, was extremely disappointed.
I still don't think LLMs are capable of replacing developers, but they are great at exposing knowledge in fields you might not know and help guide you to a solution, like Stack Overflow used to do (without the snark).
I think LLMs have what it takes at this point in time, but it's the coding agent (combined with the model) that make the magic happen. Coding agents can implement copy-pasting, it's a matter of building the right tool for it, then iterating with given models/providers, etc. And that's true for everything else that LLMs lack today. Shortcomings can be remediated with good memory and context engineering, safety-oriented instructions, endless verification and good overall coding agent architecture. Also having a model that can respond fast, have a large context window and maintain attention to instructions is also essential for a good overall experience.
And the human prompting, of course. It takes good sw engineering skills, particularly knowing how to instruct other devs in getting the work done, setting up good AGENTS.md (CLAUDE.md, etc) with codebase instructions, best practices, etc etc.
So it's not an "AI/LLMs are capable of replacing developers"... that's getting old fast. It's more like, paraphrasing the wise "it's not what your LLM can do for you, but what can you do for your LLM"
Most developers are also bad at asking questions. They tend to assume too many things from the start.
In my 25 years of software development I could apply the second critique to over half of the developers I knew. That includes myself for about half of that career.
But, just like lots of people expect/want self-driving to outperform humans even on edge cases in order to trust them, they also want "AI" to outperform humans in order to trust it.
So: "humans are bad at this too" doesn't have much weight (for people with that mindset).
It makes sense to me, at least.
If we had a knife that most of the time cuts a slice of bread like the bottom p50 of humans cutting a slice of bread with their hands, we wouldn't call the knife useful.
Ok, this example is probably too extreme, replace the knife with an industrial machine that cut bread vs a human with a knife. Nobody would buy that machine either if it worked like that.
I think this is still too extreme. A machine that cuts and preps food at the same level as a 25th percentile person _being paid to do so_, while also being significantly cheaper would presumably be highly relevant.
Aw man. There are so many angles though.
Your p25 employee is probably much closer to your p95 employee than to the p50 "standard" human, so yeah, I think you have a point there.
But at least in food prep, p25 would already be pretty damn hard to achieve. That's a hell of a lot of autonomy and accuracy (at least in my restaurant kitchen experience which is admittedly just one year in "fine dining"-ish kitchens).
I'd say the p25 of software or SRE folks I've worked with is also a pretty high bar to hit, too, but maybe I've been lucky.
I feel kind of attacked for my sub-p50 bread slicing skills, TBH. :(
Agreed in a general sense, but there's a bit more nuance.
If a knife slices bread like a normal human at p50, it's not a very good knife.
If a knife slices bread like a professional chef at p50, it's probably a very decent knife.
I don't know if LLMs are better at asking questions than a p50 developer. In my original comment I wanted to raise the question of whether the fact that LLMs are not good at asking questions makes them still worse than human devs.
The first LLM critique in the original article is that they can't copy and paste. I can't argue with that. My 12 year old copies-and-pastes better than top coding agents.
The second critique says they can't ask questions. Since many developers also are not good at this, how does the current state of the art LLM compare to a p50 developer in this regard?
> LLMs don’t copy-paste (or cut and paste) code. For instance, when you ask them to refactor a big file into smaller ones, they’ll "remember" a block or slice of code, use a delete tool on the old file, and then a write tool to spit out the extracted code from memory. There are no real cut or paste tools. Every tweak is just them emitting write commands from memory. This feels weird because, as humans, we lean on copy-paste all the time.
There is not that much copy/paste that happens as part of refactoring so it leans to just using context recall. It's not entirely clear if providing an actual copy/paste command is particularly useful, at least from my testing it does not do much. More interesting are repetitive changes that clog up the context. Those you can improve on if you have `fastmod` or some similar tool available: with it you can instruct codex or claude to perform edits with it.
> And it’s not just how they handle code movement -- their whole approach to problem-solving feels alien too.
It is, but if you go back and forth to work out a plan for how to solve the problem, then the approach greatly changes.
How is it not clear that it would be beneficial?
To use another example, with my IDE I can change a signature or rename something across multiple files basically instantly. But an LLM agent will take multiple minutes to do the same thing and doesn't get it right.
> How is it not clear that it would be beneficial?
There is reinforcement learning on the Anthropic side for a text edit tool, which is built in a way that does not lend itself to copy/paste. If you use a model like the GPT series then there might not be reinforcement learning for text editing (I believe, I don't really know), but it operates on line-based replacements for the most part and for it to understand what to manipulate it needs to know the content in the context. When you try to give it a copy/paste buffer it does not fully comprehend what the change in the file looks like after the operation.
So it might be possible to do something with copy/paste, but I did not find it to be very obvious how you make that work with an agent, given that it needs to read the file into context anyways and its recall capabilities are surprisingly good.
> To use another example, with my IDE I can change a signature or rename something across multiple files basically instantly.
So yeah, that's the more interesting case and there things like codemod/fastmod are very effective if you tell an agent to use it. They just don't reach there.
I think copy/paste can alleviate context explosion. Basically the model can remember what's the code block contain, can access it at any time, without needing to "remember" it.
Inspired by the copy-paste point in this post, I added agent buffer tools to clippy, a macOS utility I maintain which includes an MCP server that interacts with the system clipboard. In this case it was more appropriate to use a private buffer instead. With the tools I just added, the server reads file bytes directly - your agent never generates the copied content as tokens. Three operations:
buffer_copy: Copy specific line ranges from files to agent's private buffer
buffer_paste: Insert/append/replace those exact bytes in target files
buffer_list: See what's currently buffered
So the agent can say "copying lines 50-75 from auth.py" and the MCP server handles the actual file I/O. No token generation, no hallucination, byte-for-byte accurate. Doesn't touch your system clipboard either.
The MCP server already included tools to copy AI-generated content to your system clipboard - useful for "write a Python script and copy it" workflows.
(Clippy's main / original purpose is improving on macOS pbcopy - it copies file references instead of just file contents, so you can paste actual files into Slack/email/etc from the terminal.)
If you're on macOS and use Claude or other MCP-compatible agents: https://github.com/neilberkman/clippy
brew install neilberkman/clippy/clippy
Codex has got me a few times lately, doing what I asked but certainly not what I intended:
- Get rid of these warnings "...": captures and silences warnings instead of fixing them - Update this unit test to relfect the changes "...": changes the code so the outdated test works - The argument passed is now wrong: catches the exception instead of fixing the argument
My advice is to prefer small changes and read everything it does before accepting anything, often this means using the agent actually is slower than just coding...
You also have to be a bit careful:
“Fix the issues causing these warnings”
Retrospectively fixing a test to be passing given the current code is a complex task, instead, you can ask it to write a test that tests the intended behaviour, without needing to infer it.
“The argument passed is now wrong” - you’re asking the LLM to infer that there’s a problem somewhere else, and to find and fix it.
When you’re asking an LLM to do something, you have to be very explicit about what you want it to do.
Exactly, I think the takeaway is that being careful when formulating a task is essential with LLMs. They make errors that wouldn’t be expected when asking the same from a person.
I see a pattern in these discussions all the time: some people say how very, very good LLMs are, and others say how LLMs fail miserably; almost always the first group presents examples of simple CRUD apps, frontend "represent data using some JS-framework" kind of tasks, while the second group presents examples of non-trivial refactoring, stuff like parsers (in this thread), algorithms that can't be found in leetcode, etc.
Tech twitter keeps showing "one-shotting full-stack apps" or "games", and it's always something extremely banal. It's impressive that a computer can do it on its own, don't get me wrong, but it was trivial to programmers, and now it is commoditized.
Yesterday, I got Claude Code to make a script that tried out different point clustering algorithms and visualise them. It made the odd mistake, which it then corrected with help, but broadly speaking it was amazing. It would've taken me at least a week to write by hsnd, maybe longer. It was writing the algorithms itself, definitely not just simple CRUD stuff.
I also got good results for “above CRUD” stuff occasionally. Sorry if I wasn’t clear, I meant to primarily share an observation about vastly different responses in discussions related to LLMs. I don’t believe LLMs are completely useless for non-trivial stuff, nor I believe that they won’t get better. Even those two problems in the linked article: sure, those actions are inherently alien to the LLM’s structure itself, but can be solved with augmentation.
That's actually a very specific domain, which is well documented and researched in which LLM's will alawys do well. Shit will hit the fans quickly when you're going to do integration where it won't have a specific problem domain.
Yep - visualizing clustering algorithms is just the "CRUD app" of a different speciality.
One rule of thumb I use, is if you could expect to find a student on a college campus to do a task for you, an LLM will probably be able to do a decent job. My thinking is because we have a lot of teaching resources available for how to do that task, which the training has of course ingested.
In my experience it's been great to have LLMs for narrowly-scoped tasks, things I know how I'd implement (or at least start implementing) but that would be tedious to manually do, prompting it with increasingly higher complexity does work better than I expected for these narrow tasks.
Whenever I've attempted to actually do the whole "agentic coding" by giving it a complex task, breaking it down in sub-tasks, loading up context, reworking the plan file when something goes awry, trying again, etc. it hasn't a single fucking time done the thing it was supposed to do to completion, requiring a lot of manual reviewing, backtracking, nudging, it becomes more exhausting than just doing most of the work myself, and pushing the LLM to do the tedious work.
It does work sometimes to use for analysis, and asking it to suggest changes with the reasoning but not implement them, since most times when I let it try to implement its broad suggestions it went haywire, requiring me to pull back, and restart.
There's a fine line to walk, and I only see comments on the extremes online, it's either "I let 80 agents running and they build my whole company's code" or "they fail miserably on every task harder than a CRUD". I tend to not believe in either extreme, at least not for the kinds of projects I work on which require more context than I could ever fit properly beforehand to these robots.
Let’s see the diff
The two groups are very different but I notice another pattern: you have people who like coding and understanding details of what their are doing, are curious, what to learn about the why and think about edge cases; and there's another group of people who just want to code something, make a test pass, show a nice UI and that's it, but don't think much about edge cases or maintainability. The only thing they think is "delivering value" to customers.
Usually those two groups correlate very well with liking LLMs: some people will ask Claude to create a UI with React and see the mess it generated (even if it mostly works) and the edge cases it left out and comment in forums that LLMs don't work. The other group of people will see the UI working and call it a day without even noticing the subtleties.
I use LLMs to vibe-code entire tools that I need for my work. They're really banal boring apps that are relatively simple, but they still would have wasted a day or two each to write and debug. Even stuff as simple as laying out the whole UI in a nice pattern. Most of these are now practically one-shots from the latest Claude and GPT. I leave them churning, get coffee, come back and test the finished product.
The function of technological progress, looked at through one lens, is to commoditise what was previously bespoke. LLMs have expanded the set of repeatable things. What we're seeing is people on the one hand saying "there's huge value in reducing the cost of producing rote assets", and on the other "there is no value in trying to apply these tools to tasks that aren't repeatable".
Both are right.
> almost always the first group presents examples of simple CRUD apps
How about a full programming language written by cc "in a loop" in ~3 months? With a compiler and stuff?
https://cursed-lang.org/
It might be a meme project, but it's still impressive as hell we're here.
I learned about this from a yt content creator that took that repo, asked cc to "make it so that variables can be emojis", and cc did that 5$ later. Pretty cool.
Ok, not trivial for sure, but not novel? IIUC, the language does not have really new concepts, apart from the keywords (which is trivial).
Impressive nonetheless.
There’s no evidence that this ever happened other than this guy’s word. And since the claim that he ran an agent with no human intervention for 3 months is so far outside of any capabilities demonstrated by anyone else, I’m going to need to see some serious evidence before I believe it.
> There’s no evidence that this ever happened other than this guy’s word.
There's a yt channel where the sessions were livestreamed. It's in their FAQ. I haven't felt the need to check them, but there are 10-12h sessions in there if you're that invested in proving that this is "so far outside of any capabilities"...
A brief look at the commit history should show you that it's 99.9% guaranteed to be written by an LLM :)
When's the last time you used one of these SotA coding agents? They've been getting better and better for a while now. I am not surprised at all that this worked.
>When's the last time you used one of these SotA coding agents?
This morning :)
>"so far outside of any capabilities"
Anthropic was just bragging last week about being able to code without intervention for 30 hours before completely losing focus. They hailed it as a new bench mark. It completed a project that was 11k lines of code.
The max unsupervised run that GPT-5-Codex has been able to pull off is 7 hours.
That's what I mean by the current SOTA demonstrated capabilities.
https://x.com/rohanpaul_ai/status/1972754113491513481
And yet here you have a rando who is saying that he was able able to get an agent to run unsupervised for 100x longer than what the model companies themselves have been able to do and produce 10x the amount of code--months ago.
I'm 100% confident this is fake.
>There's a yt channel where the sessions were livestreamed.
There are a few videos that long, not 3 months worth of videos. Also I spot checked the videos and it the framerate is so low that it would be trivial to cut out the human intervention.
>guaranteed to be written by an LLM
I don't doubt that it was 99.9% written by an LLM, the question is whether he was able to run unsupervised for 3 months or whether he spent 3 months guiding an LLM to write it.
I think you are confusing 2 things here. What the labs mean when they announce x hours sessions is on "one session" (i.e. the agent manages its own context via trimming and memory files, etc). What the project I linked did was "run in a bash loop", that basically resets the context every time the agent "finishes".
That would mean that every few hours the agent starts fresh, does the inspect repo thing, does the plan for that session, and so on. That would explain why it took it ~3 months to do what a human + ai could probably do in a few weeks. That's why it doesn't sound too ludicrous for me. If you look at the repo there are a lot of things that are not strictly needed for the initial prompt (make a programming language like go but with genz stuff, nocap).
Oh, and if you look at their discord + repo, lots of things don't actually work. Some examples do, some segfault. That's exactly what you'd expect from "running an agent in a loop". I still think it's impressive nonetheless.
The fact that you are so incredulous (and I get why that is, scepticism is warranted in this space) is actually funny. We are on the right track.
There’s absolutely no difference from what he says he did and what Claude code can do behind the scenes.
If Anthropic thought they could produce anything remotely useful by wiping the context and reprompting every few hours, they would be doing it. And they’d be saying “look at this we implemented hard context reset and we can now run our agent for 30 days and produce an entire language implementation!”
In 3 months or 300 years of operating like this a current agent being freshly reprompted every few hours would never produce anything that even remotely looked like a language implementation.
As soon as its context was poisoned with slightly off topic todo comments it would spin out into writing a game of life implementation or whatever. You’d have millions of lines of nonsense code with nothing useful after 3 months of that.
The only way I see anything like this doing anything approaching “useful” is if the outer loop wipes the repo on every reset as well, and collects the results somewhere the agent can’t access. Then you essentially have 100 chances to one shot the thing.
But at that point you just have a needlessly expensive and slow agent.
Novel as in never done before? Of course not.
Novel as in "an LLM can maintain coherence on a 100k+ LoC project written in zig"? Yeah, that's absolutely novel in this space. This wasn't possible 1 year ago. And this was fantasy 2.5 years ago when chatgpt launched.
Also impressive in that cc "drove" this from a simple prompt. Also impressive that cc can do stuff in this 1M+ (lots of js in the extensions folders?) repo. Lots of people claim LLMs are useless in high LoC repos. The fact that cc could navigate a "new" language and make "variables as emojis" work is again novel (i.e. couldn't be done 1 year ago) and impressive.
>Novel as in "an LLM can maintain coherence on a 100k+ LoC project written in zig"? Yeah, that's absolutely novel in this space.
Absolutely. I do not underestimate this.
> written by cc "in a loop" in ~3 months?
What does that mean exactly? I assume the LLM was not left alone with its task for 3 months without human supervision.
From the FAQ:
> the following prompt was issued into a coding agent:
> Hey, can you make me a programming language like Golang but all the lexical keywords are swapped so they're Gen Z slang?
> and then the coding agent was left running AFK for months in a bash loop
I don’t buy it at all. Not even Anthropic or Open AI have come anywhere close to something like this.
Running for 3 months and generating a working project this large with no human intervention is so far outside of the capabilities of any agent/LLM system demonstrated by anyone else that the mostly likely explanation is that the promoter is lying about it running on its own for 3 months.
I looked through the videos listed as “facts” to support the claims and I don’t see anything longer than a few hours.
I think the issue with them making assumptions and failing to properly diagnose issues comes more from fine-tuning than any particular limitation in LLMs themselves. When fine tuned on a set of problem->solution data it kind of carries the assumption that the problem contains enough data for the solution.
What is really needed is a tree of problems which appear identical at first glance, but the issue and the solution is something that is one of many possibilities which can only be revealed by finding what information is lacking, acquiring that information, testing the hypothesis then, if the hypothesis is shown to be correct, then finally implementing the solution.
That's a much more difficult training set to construct.
The editing issue, I feel needs something more radical. Instead of the current methods of text manipulation, I think there is scope to have a kind of output position encoding for a model to emit data in a non-sequential order. Again this presents another training data problem, there are limited natural sources to work from showing programming in the order a programmer types it. On the other hand I think it should be possible to do synthetic training examples by translating existing model outputs that emit patches, search/replaces, regex mods etc. and translate those to a format that directly encodes the final position of the desired text.
At some stage I'd like to see if it's possible to construct the models current idea of what the code is purely by scanning a list of cached head_embeddings of any tokens that turned into code. I feel like there should be enough information given the order of emission and the embeddings themselves to reconstruct a piecemeal generated program.
The copy-paste thing is interesting because it hints at a deeper issue: LLMs don't have a concept of "identity" for code blocks—they just regenerate from learned patterns. I've noticed similar vibes when agents refactor—they'll confidently rewrite a chunk and introduce subtle bugs (formatting, whitespace, comments) that copy-paste would've preserved. The "no questions" problem feels more solvable with better prompting/tooling though, like explicitly rewarding clarification in RLHF.
I feel like it’s the opposite: the copy-paste issue is solvable, you just need to equip the model with the right tools and make sure they are trained on tasks where that’s unambiguously the right thing to do (for example, cases were copying code “by hand” would be extremely error prone -> leads to lower reward on average).
On the other hand, teaching the model to be unsure and ask questions, requires the training loop to break and bring a human input in, which appears more difficult to scale.
> On the other hand, teaching the model to be unsure and ask questions, requires the training loop to break and bring a human input in, which appears more difficult to scale.
The ironic thing to me is that the one thing they never seem to be willing to skip asking about is whether they should proceed with some fix that I just helped them identify. They seem extremely reluctant to actually ask about things they don't know about, but extremely eager to ask about whether they should do the things they already have decided they think are right!
I feel like the copy and paste thing is overdue a solution.
I find this one particularly frustrating when working directly with ChatGPT and Claude via their chat interfaces. I frequently find myself watching them retype 100+ lines of code that I pasted in just to make a one line change.
I expect there are reasons this is difficult, but difficult problems usually end up solved in the end.
> I frequently find myself watching them retype 100+ lines of code that I pasted in just to make a one line change.
In such cases, I specifically instruct LLMs to "only show the lines you would change" and they are very good at doing just that and eliding the rest. However, I usually do this after going through a couple of rounds of what you just described :-)
I partly do this to save time and partly to avoid using up more tokens. But I wonder if it is actually saving tokens given that hidden "thinking tokens" are a thing these days. That is, even if they do elide the unchanged code, I'm pretty sure they are "reasoning" about it before identifying only the relevant tokens to spit out.
As such, that does seem different from copy-and-paste tool use, which I believe is also solved. LLMs can already identify when code changes can be made programmatically... and then do so! I have actually seen ChatGPT write Python code to refactor other Python code: https://www.linkedin.com/posts/kunalkandekar_metaprogramming...
I had to fix a minor bug in its Python script to make it work, but it worked and was a bit of a <head-explode> moment for me. I still wonder if this is part of its system prompt or an emergent tool-use behavior. In either case, copy-and-paste seems like a much simpler problem that could be solved with specific prompting.
Yeah, I’ve always wondered if the models could be trained to output special reference tokens that just copy verbatim slices from the input, perhaps based on unique prefix/suffix pairs. Would be a dramatic improvement for all kinds of tasks (coding especially).
Whats the time horizon for said problems to be solved? Because guess what - time is running and people will not continue to aimlessly provide money at this stuff.
I don't see this one as an existential crisis for AI tooling, more of a persistent irritation.
AI labs already shipped changes related to this problem - most notable speculative decoding, which lets you provide the text you expect to see come out again and speeds it up: https://simonwillison.net/2024/Nov/4/predicted-outputs/
They've also been iterating on better tools for editing code a lot as part of the competition between Claude Code and Codex CLI and other coding agents.
Hopefully they'll figure out a copy/paste mechanism as part of that work.
About the first point mentioned in article: could that problem be solved simply by changing the task from something like "refactor this code" to something like "refactor this code as a series of smaller atomic changes (like moving blocks of code or renaming variable references in all places), disable suitable for git commits (and provide git message texts for those commits)"?
I recently found a fun CLI application and was playing with it when I found out it didn't have proper handling for when you passed it invalid files, and spat out a cryptic error from an internal library which isn't a great UX.
I decided to pull the source code and fix this myself. It's written in Swift which I've used very little before, but this wasn't gonna be too complex of a change. So I got some LLMs to walk me through the process of building CLI apps in Xcode, code changes that need to be made, and where the build artifact is put in my filesystem so I could try it out.
I was able to get it to compile, navigate to my compiled binary, and run it, only to find my changes didn't seem to work. I tried everything, asking different LLMs to see if they can fix the code, spit out the binary's metadata to confirm the creation date is being updated when I compile, etc. Generally when I'd paste the code to an LLM and ask why it doesn't work it would assert the old code was indeed flawed, and my change needed to be done in X manner instead. Even just putting a print statement, I couldn't get those to run and the LLM would explain that it's because of some complex multithreading runtime gotcha that it isn't getting to the print statements.
After way too much time trouble-shooting, skipping dinner and staying up 90 minutes past when I'm usually in bed, I finally solved it - when I was trying to run my build from the build output directory, I forgot to put the ./ before the binary name, so I was running my global install from the developer and not the binary in the directory I was in.
Sure, rookie mistake, but the thing that drives me crazy with an LLM is if you give it some code and ask why it doesn't work, they seem to NEVER suggest it should actually be working, and instead will always say the old code is bad and here's the perfect fixed version of the code. And it'll even make up stuff about why the old code should indeed not work when it should, like when I was putting the print statements.
Lol this person talks about easing into LLMs again two weeks after quitting cold turkey. The addiction is real. I laugh because I’m in the same situation, and see no way out other than to switch professions and/or take up programming as a hobby in which I purposefully subject myself to hard mode. I’m too productive with it in my profession to scale back and do things by hand — the cat is out of the bag and I’ve set a race pace at work that I can’t reasonably retract from without raising eyebrows. So I agree with the author’s referenced post that finding ways to still utilize it while maintaining a mental map of the code base and limiting its blast radius is a good middle ground, but damn it requires a lot of discipline.
In my defense, I wrote the blog post about quitting a good while after I've already quit cold turkey -- but you're spot on. :)
Especially when surrounded by people who swear LLMs can really be gamechanging on certain tasks, it's really hard to just keep doing things by hand (especially if you have the gut feeling that an LLM can probably do rote pretty well, based on past experience).
What kind of works for me now is what a colleague of mine calls "letting it write the leaf nodes in the code tree". So long as you take on the architecture, high level planning, schemas, and all the important bits that require thinking - chances are it can execute writing code successfully by following your idiot-proof blueprint. It's still a lot of toll and tedium, but perhaps still beats mechanical labor.
> I’ve set a race pace at work that I can’t reasonably retract from without raising eyebrows
Why do this to yourself? Do you get paid more if you work faster?
It started as a mix of self-imposed pressure and actually enjoying marking tasks as complete. Now I feel resistant to relaxing things. And no, I definitely don’t get paid more.
cat out of the bag is disautomation. the speed in the timetable is an illusion if the supervision requires blast radius retention. this is more like an early video game assembly line than a structured skilled industry
The first issue is related to the inner behavior of LLMs. Human can ignore some detailed contents of code and copy and paste, but LLM convert them into hidden states. It is a process of compression. And the output is a process of decompression. And something maybe lost. So it is hard for LLM to copy and paste. The agent developer should customize the edit rules to do this.
The second issue is that, LLM does not learn much high level context relationship of knowledge. This can be improved by introducing more patterns in the training data. And current LLM training is doing much on this. I don't think it is a problem in next years.
I sometimes give LLM random "easy" questions. My assessment is still that they all need the fine print "bla bla can be incorrect"
You should either already know the answer or have a way to verify the answer. If neither, the matter must be inconsequential like just a child like curiosity. For example, I wonder how many moons Jupiter has... It could be 58, it could be 85 but either answer won't alter any of what I do today.
I suspect some people (who need to read the full report) dump thousand page long reports into LLM, read the first ten words of the response and pretend they know what the report says and that is scary.
> For example, I wonder how many moons Jupiter has... It could be 58, it could be 85
For those curious, the answer is 97.
https://en.wikipedia.org/wiki/Moons_of_Jupiter
> or have a way to verify the answer
Fortunately, as devs, this is our main loop. Write code, test, debug. And it's why people who fear AI-generated code making it's way into production and causing errors makes me laugh. Are you not testing your code? Or even debugging it? Like, what process are you using that prevents bugs happening? Guess what? It's the exact same process with AI-generated code.
I fully resonate with point #2. A few days ago, I was stuck trying to implement some feature in a C++ library, so I used ChatGPT for brainstorming.
ChatGPT proposed a few ideas, all apparently reasonable, and then it advocated for one that was presented unambiguously as the "best". After a few iterations, I realized that its solution would have required a class hierarchy where the base class contained a templated virtual function, which is not allowed in C++. I pointed this out to ChatGPT and asked it to rethink the solution; it then immediately advocated for the other approach it had initially suggested.
The “LLMs are bad at asking questions” is interesting. There are some times that I will ask the LLM to do something without giving it All the needed information. And rather than telling me that something's missing or that it can't do it the way I asked, it will try and do a halfway job using fake data or mock something out to accomplish it. What I really wish it would do is just stop and say, “hey, I can't do it like you asked Did you mean this?”
I don't think it such a big deal that they aren't great yet, but rather the rate of improvement is quite low these days. I feel it is going backwards a little recently - maybe that is due to economic pressures.
The other day, I needed Claude Code to write some code for me. It involved messing with the TPM of a virtual machine. For that, it was supposed to create a directory called `tpm_dir`. It constantly got it wrong and wrote `tmp_dir` instead and tried to fix its mistake over and over again, leading to lots of weird loops. It completely went off the rails, it was bizarre.
With a statically typed language like C# or Java, there are dozens of refactors that IDEs could do in a guaranteed [1] correct way better than LLMs as far back as 2012.
The canonical products were from JetBrains. I haven’t used Jetbrains in years. But I would be really surprised with the combination of LLMs + a complete understanding of the codebase through static analysis (like it was doing well over a decade ago) and calling a “refactor tool” that it wouldn’t have better results.
[1] before I get “well actuallied” yes I know if you use reflection all bets are off.
I used a Borland Java IDE in the 1990s with auto refactoring like “extract method” and global renaming and such.
Dev tools were not bad at all back then. In a few ways they were better than today, like WYSIWYG GUI design which we have wholly abandoned. Old school Visual Basic was a crummy programming language but the GUI builder was better than anything I’m familiar with for a desktop OS today.
Has anyone had success getting a coding agent to use an IDE's built-in refactoring tools via MCP especially for things like project-wide rename? Last time I looked into this the agents I tried just did regex find/replace across the repo, which feels both error-prone and wasteful of tokens. I haven't revisited recently so I'm curious what's possible now.
Serena MCP does this approach IIRC
That's interesting, and I haven't, but as long as the IDE has an API for the refactoring action, giving an agent access to it as a tool should be pretty straightforward. Great idea.
For #2, if you're working on a big feature, start with a markdown planning file that you and the LLM work on until you are satisfied with the approach. Doesn't need to be rocket science: even if it's just a couple paragraphs it's much better than doing it one shot.
Editing tools are easy to add it’s just you have to pick what things to give them because too many and they struggle as it uses up a lot of context. Still, as costs come down multiple steps to look for tools becomes cheaper too.
I’d like to see what happens with better refactoring tools, I’d make a bunch more mistakes copying and retyping or using awk. If they want to rename something they should be able to use the same tooling the rest of us get.
Asking questions is a good point but that’s both a bit of promoting and I think the move to having more parallel work makes it less relevant. One of the reasons clarifying things more upfront is useful is we take a lot of time and cost a lot of money to build things so the economics favours getting it right first time. As the time comes down and the cost drops to near zero, the balance changes.
There are also other approaches to clarify more what you want and how to do it first, breaking that down into tasks, then letting it run with those (spec kit). This is an interesting area.
I think #1 is not that big of a deal, though it does create problems sometimes. #2 is though a big issue. Which is weird since the whole thing is built as a chat model it seems it would be a lot more efficient for the bot to ask the questions of what to build beyond it's assumptions. Generally this lack of back and forth reasoning leads to a lot of then badly generated code. I would hope in the future there is some level of graded response that tries to discern the real intent of the users request through a discussion, rather than going to the fastest code answer.
I'd argue LLM coding agents are still bad at many more things. But to comment on the two problems raised in the post:
> LLMs don’t copy-paste (or cut and paste) code.
The article is confusing the architectural layers of AI coding agents. It's easy to add "cut/copy/paste" tools to the AI system if that shows improvement. This has nothing to do with LLM, it's in the layer on top.
> Good human developers always pause to ask before making big changes or when they’re unsure [LLMs] keep trying to make it work until they hit a wall -- and then they just keep banging their head against it.
Agreed - LLMs don't know how to back track. The recent (past year) improvements in thinking/reasoning do improve in this regard (it's the whole "but wait..." RL training that exploded with OpenAI o1/o3 and DeepSeek R1, now done by everyone), but clearly there's still work to do.
> The article is confusing the architectural layers of AI coding agents. It's easy to add "cut/copy/paste" tools to the AI system if that shows improvement. This has nothing to do with LLM, it's in the layer on top.
I think we can't trivialize adding good cut/copy/paste tools though. It's not like we can just slap those tools on the topmost layer (ex, on Claude Code, Codex, or Roo) and it'll just work.
I think that a lot of reinforcement learning that LLM providers do on their coding models barely (if at all) steer towards that kind of tool use, so even if we implemented those tools on top of coding LLMs they probably would just splash and do nothing.
Adding cut/copy/paste probably requires a ton of very specific (and/or specialized) fine tuning with not a ton of data to train on -- think recordings of how humans use IDEs, keystrokes, commands issued, etc etc.
I'm guessing Cursor's Autocomplete model is the closest thing that can do something like this if they chose to, based on how they're training it.
Ask a model to show you the seahorse emojii and you'll get a storm of "but wait!"
> Sure, you can overengineer your prompt to try get them to ask more questions (Roo for example, does a decent job at this) -- but it's very likely still won't.
Not in my experience. And it's not "overengineering" your prompt, it's just writing your prompt.
For anything serious, I always end every relevant request with an instruction to repeat back to me the full design of my instructions or ask me necessary clarifying questions first if I've left anything unclear, before writing any code. It always does.
And I don't mind having to write that, because sometimes I don't want that. I just want to ask it for a quick script and assume it can fill in the gaps because that's faster.
You can do copy and paste if you offer it a tool/MCP that do that. It's not complicated using either function extraction with AST as target or line numbers.
Also if you want it to pause asking questions, you need to offer that thru tools (example Manus do that) and I have an MCP that do that and surprisingly I got a lot of questions and if you prompt, it will do. But the push currently is for full automation and that's why it's not there. We are far better in supervised step by step mode. There is elicitation already in MCP, but having a tool asking questions require you have a UI that will allow to set the input back.
I very much agree on point 2.
I often wish that instead of just starting to work on the code, automatically, even if you hit enter / send by accident, the models would rather ask for clarification. The models assume a lot, and will just spit out code first.
I guess this is somewhat to lower the threshold for non-programmers, and to instantly give some answer, but it does waste a lot of resources - I think.
Others have mentioned that you can fix all this by providing a guide to the mode, how it should interact with you, and what the answers should look like. But, still, it'd be nice to have it a bit more human-like on this aspect.
Coding agents tend to assume that the development environment is static and predictable, but real codebases are full of subtle, moving parts - tooling versions, custom scripts, CI quirks, and non-standard file layouts.
Many agents break down not because the code is too complex, but because invisible, “boring” infrastructure details trip them up. Human developers subconsciously navigate these pitfalls using tribal memory and accumulated hacks, but agents bluff through them until confronted by an edge case. This is why even trivial tasks intermittently fail with automation agents. you’re fighting not logic errors, but mismatches with the real lived context. Upgrading this context-awareness would be a genuine step change.
Yep. One of the things I've found agents always having a lot of trouble with is anything related to OpenTelemetry. There's a thing you call that uses some global somewhere, there's a docker container or two and there's the timing issues. It takes multiple tries to get anything right. Of course this is hard for a human too if you haven't used otel before...
One thing LLMs are surprisingly bad at is producing correct LaTeX diagram code. Very often I've tried to describe in detail an electric circuit, a graph (the data structure), or an automaton so I can quickly visualize something I'm studying, but they fail. They mix up labels, draw without any sense of direction or ordering, and make other errors. I find this surprising because LaTeX/TiKZ have been around for decades and there are plenty of examples they could have learned from.
Regarding copy-paste, I’ve been thinking the LLM could control a headless Neovim instance instead. It might take some specialized reinforcement learning to get a model that actually uses Vim correctly, but then it could issue precise commands for moving, replacing, or deleting text, instead of rewriting everything.
Even something as simple as renaming a variable is often safer and easier when done through the editor’s language server integration.
The conversation here seems to be more focused on coding from scratch. What I have noticed when I was looking at this last year was that LLMs were bad at enhancing already existing code (e.g. unit tests) that used annotation (a.k.a. decorators) for dependency injection. Has anyone here attempted that with the more recent models? If so, then what were your findings?
My experience is the opposite. The latest Claude seems to excel in my personal medium-sized (20-50k loc) codebases with strong existing patterns and a robust structure from which it can extrapolate new features or documentation. Claude Code is getting much better at navigating code paths across many large files in order to provide nuanced and context-aware suggestions or bug fixes.
When left to its own devices on tasks with little existing reference material to draw from, however, the quality and consistency suffers significantly and brittle, convoluted structures begin to emerge.
This is just my limited experience though, and I almost never attempt to, for example, vibe-code an entire greenfield mvp.
A friendly reminder that "refactor" means "make and commit a tiny change in less than a few minutes" (see links below). The OP and many comments here use "refactor" when they actually mean "rewrite".
I hear from my clients (but have not verified myself!) that LLMs perform much better with a series of tiny, atomic changes like Replace Magic Literal, Pull Up Field, and Combine Functions Into Transform.
[1] https://martinfowler.com/books/refactoring.html [2] https://martinfowler.com/bliki/OpportunisticRefactoring.html [3] https://refactoring.com/catalog/
Everywhere I've worked over the years (35+), and in conversation with peers (outside of work), refactor means to change the structure of an existing program, while retaining all of the original functionality. With no specificity regarding how big or small such changes may amount to.
With a rewrite usually implying starting from scratch — whether small or large — replacing existing implementations (of functions/methods/modules/whatever), with newly created ones.
Indeed one can refactor a large codebase, without actually rewriting much- if anything at all- of substance.
Maybe one could claim that this is actually lots of micro-refactors — but that doesn't flow particularly well in communication — and if the sum total of it is not specifically a "rewrite", then what collective / overarching noun should be used for the sum total of the plurality of all of these smaller refactorings? — If one spent time making lots of smaller changes, but not actually re-implementing anything... to me, that's not a rewrite, the code has been refactored, even if it is a large piece of code with a lot of structural changes throughout.
Perhaps part of the issue here in this context, is that LLMs don't particularly refactor code anyhow, they generally rewrite (regenerate) it. Which is where many of the subtle issues that are described in other comments here, creep in. The kinds of issues that a human wouldn't necessarily create when refactoring (e.g. changed regex, changed dates, other changes to functionality, etc)
In Claude Code, it always shows the diff between current and proposed changes and I have to explicitly allow it to actually modify the code. Doesn’t that “fix” the copy-&-paste issue?
LLMs are great at asking questions if you ask them to ask questions. Try it: "before writing the code, ask me about anything that is nuclear or ambiguous about the task".
“If you think I’m asking you to split atoms, you’re probably wrong”.
"weird, overconfident interns" -> exactly the mental model I try to get people to use when thinking about LLM capabilities in ALL domains, not just coding.
A good intern is really valuable. An army of good interns is even more valuable. But interns are still interns, and you have to check their work. Carefully.
As a UX designer I see they lack the ability of being opinionated about a design piece and go with the standard mental model. I got fed up with this and made a simple java script code to run a simple canvas on the localhost to pass on more subjective feedback using highlights and notes feature. I tried using playwright first but a. its token heavy b. it's still for finding what's working or breaking instead of thinking deeply about the design.
Whatdo the notes look like?
specific inputs e.g. move, color change, or giving specific inputs for interaction piece.
How I describe this phenomenon:
If the code-change is something you would reasonably prefer to use a codemod to implement (i.e. dozens-to-hundreds of small changes fitting a semantic pattern), Claude Code not going to be able to make that change effectively.
However (!), CC is pretty good at writing the codemod.
“They’re still more like weird, overconfident interns.” Perfect summary. LLMs can emit code fast but they don’t really handle code like developers do — there’s no sense of spatial manipulation, no memory of where things live, no questions asked before moving stuff around. Until they can “copy-paste” both code and context with intent, they’ll stay great at producing snippets and terrible at collaborating.
This is exactly how we describe them internally: the smartest interns in the world. I think it's because the chat box way of interacting with them is also similar to how you would talk to someone who just joined a team.
"Hey it wasn't what you asked me to do but I went ahead and refactored this whole area over here while simultaneously screwing up the business logic because I have no comprehension of how users use the tool". "Um, ok but did you change the way notifications work like I asked". "Yes." "Notifications don't work anymore". "I'll get right on it".
What 2 things? LLMs are bad at everything. Its just there are a lot of people who are worse
Funny, I just encountered a similar issue asking chatgpt to ocr something. It started off pretty good but slowly started embellishing or summarizing on its own, eventually going completely off the rails into a King Arthur story.
@kixpanganiban Do you think it will work if for refactoring tasks, we take aways OpenAI's `apply_patch` tool and just provide `cut` and `paste` for the first few steps?
I can run this experiment using ToolKami[0] framework if there is enough interest or if someone can give some insights.
[0]: https://github.com/aperoc/toolkami
I just run into this issue with claude sonet 4.5, asked it to copy/paste some constants from one file to another, a bigger chunk of code, it instead "extracted" pieces and named them so. As a last resort, after going back and forth it agreed to do a file/copy by running a system command. I was surprised that of all the programming tasks, a copy/paste felt challenging for the agent.
I guess the LLMs are trained to know what finished code looks like. They don't really know the operations a human would use to get there.
My human fixed a bug by introducing a new one. Classic. Meanwhile, I write the lint rules, build the analyzers, and fix 500 errors before they’ve finished reading Stack Overflow. Just don’t ask me to reason about their legacy code — I’m synthetic, not insane.
—
Just because this new contributor is forced to effectively “SSH” into your codebase and edit not even with vim but with with sed and awk does not mean that this contributor is incapable of using other tools if empowered to do so. The fact it is able to work within such constraints goes to show how much potential there is. It is already much better at a human than erasing the text and re-typing it from memory and while it is a valid criticism that it needs to be taught how to move files imagine what it is capable of once it starts to use tools effectively.
—
Recently, I observed LLMs flail around for hours trying to get our e2e tests running as it tried to coordinate three different processes in three different terminals. It kept running commands in one terminal try to kill or check if the port is being used in the other terminal.
However, once I prompted the LLM to create a script for running all three processes concurrently, it is able to create that script, leverage it, and autonomously debug the tests now way faster than I am able to. It has also saved any new human who tries to contribute from similar hours of flailing around. Is there something we could have easily done by hand but just never had the time to do before LLMs. If anything, the LLM is just highlighting the existing problem in our codebase that some of us got too used to.
So yes, LLMs makes stupid mistakes, but so do humans, the thing is that LLms can ifentify and fix them faster (and better, with proper steering)
Similar to the copy/paste issue I've noticed LLMs are pretty bad at distilling large documents into smaller documents without leaving out a ton of detail. Like maybe you have a super redundant doc. Give it to an LLM and it won't just deduplicate it, it will water the whole thing down.
You’d need the correct theory of mind, in order to distill down into the correct summary and details.
Ask the average high school or college student and I doubt they would fare better.
For 2) I feel like codex-5 kind of attempted to address this problem, with codex it usually asks a lot of questions and give options before digging in (without me prompting it to).
For copy-paste, you made it feel like a low-hanging fruit? Why don't AI agents have copy/paste tools?
I don’t really understand why there’s so much hate for LLMs here, especially when it comes to using them for coding. In my experience, the people who regularly complain about these tools often seem more interested in proving how clever they are than actually solving real problems. They also tend to choose obscure programming languages where it’s nearly impossible to hire developers, or they spend hours arguing over how to save $20 a month.
Over time, they usually get what they want: they become the smartest ones left in the room, because all the good people have already moved on. What’s left behind is a codebase no one wants to work on, and you can’t hire for it either.
But maybe I’ve just worked with the wrong teams.
EDIT: Maybe this is just about trust. If you can’t bring yourself to trust code written by other human beings, whether it’s a package, a library, or even your own teammates, then of course you’re not going to trust code from an LLM. But that’s not really about quality, it’s about control. And the irony is that people who insist on controlling every last detail usually end up with fragile systems nobody else wants to touch, and teams nobody else wants to join.
I regularly check in on using LLMs. But a key criteria for me is that an LLM needs to objectively make me more efficient, not subjectively.
Often I find myself cursing at the LLM for not understanding what I mean - which is expensive in lost time / cost of tokens.
It is easy to say: Then just don't use LLMs. But in reality, it is not too easy to break out of these loops of explaining, and it is extremely hard to assess when not to trust that the LLM will not be able to finish the task.
I also find that LLMs consistently don't follow guidelines. Eg. to never use coercions in TypeScript (It always gets in a rogue `as` somewhere) - to which I can not trust the output and needs to be extra vigilant reviewing.
I use LLMs for what they are good at. Sketching up a page in React/Tailwind, sketching up a small test suite - everything that can be deemed a translation task.
I don't use LLMs for tasks that are reasoning heavy: Data modelling, architecture, large complex refactors - things that require deep domain knowledge and reasoning.
> Often I find myself cursing at the LLM for not understanding what I mean...
Me too. But in all these cases, sooner or later, I realized I made a mistake not giving enough context and not building up the discussion carefully enough. And I was just rushing to the solution. In the agile world, one could say I gave the LLM not a well-defined story, but a one-liner. Who is to blame here?
I still remember training a junior hire who started off with:
“Sorry, I spent five days on this ticket. I thought it would only take two. Also, who’s going to do the QA?”
After 6 months or so, the same person was saying:
“I finished the project in three weeks. I estimated four. QA is done. Ready to go live.”
At that point, he was confident enough to own his work end-to-end, even shipping to production without someone else reviewing it. Interestingly, this colleague left two years ago, and I had to take over his codebase. It’s still running fine today, and I’ve spent maybe a single day maintaining it in the last two years.
Recently, I was talking with my manager about this. We agreed that building confidence and self-checking in a junior dev is very similar to how you need to work with LLMs.
Personally, whenever I generate code with an LLM, I check every line before committing. I still don’t trust it as much as the people I trained.
> ... Who is to blame here?
That is not really relevant, is it? The LLM is not a human.
The question is whether it is still af efficient to use LLMs after spending huge amounts of time giving the context - or if it is just as efficient to write the code yourself.
> I still remember training a junior hire who started off with
Working with LLMs is not training junior developers - treating it as such is yet another resource sink.
I think we have to agree that we disagree. What works for you doesn't have to work for me and vice-versa.
It has been discussed ad nauseum. It demolishes learning curve all of us with decade(s) of experience went through, to become seniors we are. Its not a function of age, not a function of time spent staring at some screen or churning our basic crud apps, its function of hard experience, frustration, hard won battles, grokking underlying technologies or algorithms.
Llms provide little of that, they make people lazy, juniors stay juniors forever, even degrading mentally in some aspects. People need struggle to grow, when you have somebody who had his hand held whole life they are useless human disconnected from reality, unable to self-sufficiently achieve anything significant. Too easy life destroys both humans and animals alike (many experiments have been done on that, with damning results).
There is much more like hallucinations, questionable added value of stuff that confidently looks OK but has underlying hard-to-debug bugs but above should be enough for a start.
I suggest actually reading those conversations, not just skimming through them, this has been stated countless times.
I recently asked an llm to fix an Ethernet connection while I was logged into the machine through another. Of course, I explicitly told the llm to not break that connection. But, as you can guess, in the process it did break the connection.
If an llm can't do sys admin stuff reliably, why do we think it can write quality code?
Those 2 things are not inherit to LLM's and could easily be changed by giving it the proper tools and instructions
The issue is partly that some expect a fully fledged app or a full problem solution, while others want incremental changes. To some extent this can be controlled by setting the rules in the beginning of the conversation. To some extent, because the limitations noted in the blog still apply.
Point #2 cracks me up because I do see with JetBrains AI (no fault of JetBrains mind you) the model updates the file, and sometimes I somehow wind up with like a few build errors, or other times like 90% of the file is now build errors. Hey what? Did you not run some sort of what if?
Add to this list, ability to verify correct implementation by viewing a user interface, and taking a holistic code-base / interface-wide view of how to best implement something.
If I need an exact copy pasting, I indicate that couple times in the prompt and it (claude) actually does what I am asking. But yeah overall very bad at refactoring big chunks.
> you can overengineer your prompt to try get them to ask more questions
why overengineer? it's super simple
I just do this for 60% of my prompts: "{long description of the feature}, please ask 10 questions before writing any code"
Let's just change the title to "LLM coding agents don't use copy & paste or ask clarifying questions" and save everyone the click.
You don't want your agents to ask questions. You are thinking too short term. Its not ideal now, but agents that have to ask frequent questions are useless when it comes the vision of totally autonomous coding.
Humans ask questions of groups to fix our own personal short comings. It make no sense to try and master an internal system I rarely use, I should instead ask someone that maintains it. AI will not have this problem provided we create paths of observability for them. It doesn't take a lot of "effort" for them to completely digest an alien system they need to use.
If you look at a piece of architecture, you might be able to infer the intentions of the architect. However, there are many interpretations possible. So if you were to add an addendum to the building it makes sense that you might want to ask about the intentions.
I do not believe that AI will magically overcome the Chesterton Fence problem in a 100% autonomous way.
AI won't, but humans will to un-encumber AI
Doing hard things that aren't greenfield? Basically any difficult and slightly obscure question I get stuck with and hope the collective wisdom of the internet can solve?
You don't learn new languages/paradigms/frameworks by inserting it into an existing project.
LLMs are especially tricky because they do appear to work magic on a small greenfield, and the majority of people are doing clown-engineering.
But I think some people are underestimating what can be done in larger projects if you do everything right (eg docs, tests, comments, tools) and take time to plan.
You need good checks and balances. E2E tests for your happy path, TDD when you & your agent write code.
Then you - and your agent - can refactor fearlessly.
they're getting better at asking questions; I routinely see search calls against the code base index. they just don't ask me questions.
I have seen LLMs in VSCode Copilot ask to execute 'mv oldfile.py newfile.py'.
So there's hope.
But often they just delete and recreate the file, indeed.
Coding and...?
More granular. What things is it bad at that result in it being overall “bad at coding”? It isn’t all of the parts.
Copy and pasting.
Oh, sorry. You already said that. :D
Really nice site design btw
Developers will complain if LLM agents start asking too many questions though
I definitely feel the "bad at asking questions" part, a lot of times I'll walk away for a second while it's working, and then I come back and it's gone down some intricate path I really didn't want and if it had just asked a question at the right point it would have saved a lot of wasted work (plus I feel like having that "bad" work in the context window potentially leads to problems down the road). The problem is just that I'm pretty sure there isn't any way for an LLM to really be "uncertain" about a thing, it's basically always certain even when it's incredibly wrong.
To me, I think I'm fine just accepting them for what they're good at. I like them for generating small functions, or asking questions about a really weird error I'm seeing. I don't ever ask it to refactor things though, that seems like a recipe for disaster and a tool that understands the code structure is a lot better for moving things around then an LLM is.
My biggest issue with LLMs right now is that they're such spineless yes men. Even when you ask their opinion on if something is doable or should it be done in the first place, more often than not they just go "Absolutely!" and shit out a broken answer or an anti-pattern just to please you. Not always, but way too often. You need to frame your questions way too carefully to prevent this.
Maybe some of those character.ai models are sassy enough to have stronger opinions on code?
The third thing- writing meaningfully robust test suites.
Another place where LLMs have a problem is when you ask them to do something that can't be done via duct taping a bunch of Stack Overflow posts together. E.g, I've been vibe coding in Typescript on Deno recently. For various reasons, I didn't want to use the standard Express + Node stack which is what most LLMs seem to prefer for web apps. So I ran into issues with Replit and Gemini failing to handle the subtle differences between node and deno when it comes to serving HTTP requests.
LLMs also have trouble figuring out that a task is impossible. I wanted boilerplate code that rendered a mesh in Three.js using GL_TRIANGLE_STRIP because I was writing a custom shader and needed to experiment with the math. But Three.js does support GL_TRIANGLE_STRIP rendering for architectural reasons. Grok, ChatGPT, and Gemini all hallucinated a GL_TRIANGLE_STRIP rendering API rather than telling be about this and I had to Google the problem myself.
It feels like current Coding LLMs are good at replacing junior engineers when it comes to shallow but broad tasks like creating UIs, modifying examples available on the web, etc. But they fail at senior-level tasks like realizing that the requirements being asked of them aren't valid and doing something that no one has done in their corpus of training data.
>But Three.js does support GL_TRIANGLE_STRIP rendering for architectural reasons.
Typo or trolling the next LLM to index HN comments?
Can’t you put this in the agent instructions?
> "LLMs are terrible at asking questions. They just make a bunch of assumptions and brute-force something based on those guesses."
Strongly disagree that they're terrible at asking questions.
They're terrible at asking questions unless you ask them to... at which point they ask good, sometimes fantastic questions.
All my major prompts now have some sort of "IMPORTANT: before you begin you must ask X clarifying questions. Ask them one at a time, then reevaluate the next question based on the response"
X is typically 2–5, which I find DRASTICALLY improves output.
> LLMs are terrible at asking questions.
I was dealing with a particularly tricky problem in a technology I'm not super familiar with and GPT-5 eventually asked me to put in some debug code to analyze the state of the system as it ran. Once I provided it with the feedback it wanted, and a bit of back and forth, we were able to figure out what the issue was.
> LLMs are terrible at asking questions. They just make a bunch of assumptions and brute-force something based on those guesses.
I don't agree with that. When I am telling Claude Code to plan something I also mention that it should ask questions when informations are missing. The questions it comes up with a really good, sometimes about cases I simply didn't see. To me the planning discussion doesn't feel much different than in a GitLab thread, only at a much higher iteration speed.
4/5 times when Claude is looking for a file, it starts by running bash(dir c:\test /b)
First it gets an error because bash doesn’t understand \
Then it gets an error because /b doesn’t work
And as LLMs don’t learn from their mistakes, it always spends at least half a dozen tries (e.g. bash(cmd.exe /c dir c:\test /b )) before it figures out how to list files
If it was an actual coworker, we’d send it off to HR
Most models struggle in a Windows environment. They are trained on a lot of Unixy commands and not as much on Windows and PowerShell commands. It was frustrating enough that I started using WSL for development when using Windows. That helped me significantly.
I am guessing this because:
1. Most of the training material online references Unix commands. 2. Most Windows devs are used to GUIs for development using Visual Studio etc. GUIs are not as easy to train on.
Side note: Interesting thing I have noticed in my own org is that devs with Windows background strictly use GUIs for git. The rest are comfortable with using git from the command line.
I have a list of those things in CLAUDE.md -> it seems to help (unless it's context is full, but you should never let it get close really).
IaC, and DSLs in general.
Three - CSS.
Someone has definitely fallen behind and has massive skill issues. Instead of learning you are wasting time writing bad takes on LLM. I hope most of you don't fall down this hole, you will be left behind.
[dead]
1. Any 2. Any
> LLMs are terrible at asking questions
Not if they're instructed to. In my experience you can adjust the prompt to make them ask questions. They ask very good questions actually!
Building an mcp tool that has access to refactoring operations should be straightforward and using it appropriately is well within the capabilities of current models. I wonder if it exists? I don't do a lot of refactoring with llm so haven't really had this pain point.
First point is very annoying, yes, and it's why for large refactors I have the AI write step-by-step instructions and then do it myself. It's faster, cheaper and less error-prone.
The second point is easily handled with proper instructions. My AI agents always ask questions about points I haven't clarified, or when they come across a fork in the road. Frequently I'll say "do X" and it'll proceed, then halfway it will stop and say "I did some of this, but before I do the rest, you need to decide what to do about such and such". So it's a complete non-problem for me.
two things only? dude I could make a list with easily two dozen!
It's apparently lese-Copilot to suggest this these days, but you can find very good hypothesizing and problem solving if you talk conversationally to Claude or probably any of its friends that isn't the terminally personality-collapsed SlopGPT (with or without showing it code, or diagrams); it's actually what they're best at, and often they're even less likely than human interlocutors to just parrot some set phrase at you.
It's only when you take the tech out of the area it's good at and start trying to get it to "write code" or even worse "be an agent" that it starts cracking up and emitting garbage; this is only done because companies want to forcememe some kind of product besides "chatbot", whether or not it makes sense. It's a shame because it'll happily and effectively write the docs that don't exist but you wish did for more or less anything. (Writing code examples for docs is not a weak point at all.)
> LLMs are terrible at asking questions. They just make a bunch of assumptions
_Did you ask it to ask questions?_
3. Saying no
LLMs will gladly go along with bad ideas that any reasonable dev would shoot down.
I've found codex to be better here than Claude. It has stopped many times and said hey you might be wrong. Of course this changes with a larger context.
Claude is just chirping away "You're absolutely right" and making me to turn on caps lock when I talk to it and it's not even noon yet.
i find the chirpy affirmative tone of claude to be rage inducing
This. The biggest reason I went with OpenAI this month...
My "favorite" is when it makes a mistake and then tries gaslight you into thinking it was your mistake and then confidently presents another incorrect solution.
All while having the tone of an over caffeinated intern who has only ever read medium articles.
Agree, this is really bad.
It's a fundamental failing of trying to use a statistical approximation of human language to generate code.
You can't fix it.
[dead]
> They keep trying to make it work until they hit a wall -- and then they just keep banging their head against it.
This is because LLMs trend towards the centre of the human cognitive bell curve in most things, and a LOT of humans use this same problem solving approach.
The approach doesn’t matter as much. The halting problem does :)