Sorry I thought it would be clear and could have clarified that the code itself is just a joke illustrating the point, as an exaggeration. This was the thread if anyone is interested
One of my earlier jobs a decade ago involved doing pipeline development and Jenkins administration for the on-site developer lab on one of the NRO projects, and I inserted a random build failure code snippet to test that pipelines could recover from builds that failed for unpredictable reasons, like a network error rather than anything actually wrong with the build. I had to do this on the real system because we didn't have funds for a staging environment for the dev environment, and naturally I forgot to get rid of it when I was done. So builds randomly failed for years after that before I remembered and fixed it.
If you're an extensive user of ChatGPT, or if you can give it some material about yourself like say, a resume or a LinkedIn profile, ask it to roast you. It will be very specific to the content you give it. Be warned, it can be brutal.
"The user will start a comment with 'I'm a social libertarian but...' only to be immediately downvoted by both libertarians and socialists. The irony will not be lost on them, just everyone else."
>You voted with your feet and moved to Western Europe for better well-being, but you still won't vote with your cursor and use a browser other than Edge.
The protagonists are libertarians with teenage harems, who fake an election and team up with with a sex pest. That's extremely reductive to the point of parody, but that will likely be the media coverage of it then moment someone reads the women and politics in the book.
If you completely excise anything too distasteful for a current-day blockbuster, but want a film about a space mining colony uprising you might as well just adapt the game Red Faction instead: have the brave heros blasting away with abandon at corpo guards, mad genetic experimenters and mercenaries and the media coverage can talk about how it's a genius deconstruction of Elon Musk's Martian dream or whatever.
You’d think some filmmaker would have run with the dystopian theme. The accuracy of the book’s predictions is impressive, even the location of the North American Space Defense Command. The biggest miss was people using wired telephones everywhere.
I think there’s always a danger of these foundational model companies doing RLHF on non-expert users, and this feels like a case of that.
The AIs in general feel really focused on making the user happy - your example, and another one is how they love adding emojis to the stout and over-commenting simple code.
With RLVR, the LLM is trained to pursue "verified rewards." On coding tasks, the reward is usually something like the percentage of passing tests.
Let's say you have some code that iterates over a set of files and does processing on them. The way a normal dev would write it, an exception in that code would crash the entire program. If you swallow and log the exception, however, you can continue processing the remaining files. This is an easy way to get "number of files successfully processed" up, without actually making your code any better.
> This is an easy way to get "number of files successfully processed" up, without actually making your code any better.
Well, it depends a bit on what your goal is.
Sometimes the user wants to eg backup as many files as possible from a failing hard drive, and doesn't want to fail the whole process just because one item is broken.
You're right, but the way to achieve this is to allow the error to propagate at the file level, then catch it one function above and continue to the next one.
However, LLM generated code will often, at least in my experience, avoid raising any errors at all, in any case. This is undesirable, because some errors should result in a complete failure - for example, errors which are not transient or environment related but a bug. And in any case, a LLM will prefer turning these single file errors into warnings, though the way I see it, they are errors. They just don't need to abort the process, but errors nonetheless.
> And in any case, a LLM will prefer turning these single file errors into warnings, though the way I see it, they are errors.
Well, in general they are something that the caller should have opportunity to deal with.
In some cases, aborting back to the caller at the first problem is the best course of action. In some other cases, going forward and taking note of the problems is best.
In some systems, you might event want to tell the caller about failures (and successes) as they occur, instead of waiting until the end.
It's all very similar to the different options people have available when their boss sends them on an errand and something goes wrong. A good underling uses their best judgement to pick the right way to cope with problems; but computer programs don't have that, so we need to be explicit.
They do seem to leave otherwise useless comments for itself. Eg: on the level of
// Return the result
return result;
I find this quite frustrating when reading/reviewing code generated by AI, but have started to appreciate that it does make subsequent changes by LLMs work better.
It makes me wonder if we'll end up in a place where IDEs hide comments by default (similar to how imports are often collapsed by default/automatically managed), or introduce some way of distinguishing between a more valuable human written comment and LLM boilerplate comments.
And more advanced users are more likely to opt out of training on their data, Google gets around it with a free api period where you can't opt out and I think from did some of that too, through partnerships with tool companies, but not sure if you can ever opt out there.
This is stunning English: "Perfect setup for satire. Here’s a Python function that fully commits to the bit — a traumatically over-trained LLM trying to divide numbers while avoiding any conceivable danger:" "Traumatically over-trained", while scoring zero google hits, is an amazingly good description. How can it intuitively know what "traumatic over-training" should mean for LLMs without ever having been taught the concept?
“Traumatic overtraining” does have hits though. My guess is that “traumatically” is a rarely used adverb, and “traumatic” is much more common. Possibly it completed traumatic into an adverb and then linked to overtraining which is in the training data. I dunno how these things work though.
Hard to know but if you could express "traumatically" as a number, and "over-trained" as a number, it seems like we'd expect "traumatically" + "over-trained" to be close to "traumatically over-trained" as a number. LLMs work in mysterious ways.
LLMs operate at token level, not word. it doesn't operate in terms of "traumatic", "over-training", "over" or "training", but rather "tr" "aum" "at" "ic, ", etc.
BUT, to play devil's advocate a little: Most human coders should be writing a lot more try/catch blocks than they actually do. It's very common that you don't actually want an error in one section (however unlikely) to interrupt the overall operation. (and sometimes you do, it just depends)
My uninformed suspicion is that this kind of defensive programming somehow improves performance during RLVR. Perhaps the model sometimes comes up with programs that are buggy enough to emit exceptions, but close enough to correct that they produce the right answer after swallowing the exceptions. So the model learns that swallowing exceptions sometimes improves its reward. It also learns that swallowing exceptions rarely reduces its reward, because if the model does come up with fully correct code, that code usually won’t raise exceptions in the first place (at least not in the test cases it’s being judged on), so adding exception swallowing won’t fail the tests even if it’s theoretically incorrect.
Again, this is pure speculation. Even if I’m right, I’m sure another part of the reason is just that the training set contains a lot of code written by human beginners, who also like to ignore errors.
The great Verity Stob (unfortunately, in an article which no longer seems to be online, after the Dr Dobbs Journal website finally went away) referred to this behaviour (by _human_ programmers) as "nailing the corpse in an upright position".
My suspicion is that the training set features a lot of code with “positive sentiment” in text and comments around it… but where does one find code with “negative” sentiment, followed by code that is the “corrected” version of that code? In programs written for technical interview prep, where handling of edge cases beyond realistic production situations is the norm. A model trained to use negative examples in its training set as guidance would gravitate away from examples that skip exception handling.
In this, at least, AI may very well have copied our worst habits of “learning to the test.”
Defensive programming is considered "correct" by the people doing the reinforcing, and is a huge part of the corpus that LLM's are trained on. For example, most python code doesn't do manual index management, so when it sees manual index management it is much more likely to freak out and hallucinate a bug. It will randomly promote "silent failure" even when a "silent failure" results in things like infinite loops, because it was trained on a lot of tutorial python code and "industry standard" gets more reinforcement during training.
These aren't operating on reward functions because there's no internal model to reward. It's word prediction, there's no intelligence.
LLMs do use simple "word prediction" in the pretraining step, just ingesting huge quantities of existing data. But that's not what LLM companies are shipping to end users.
Subsequently, ChatGPT/Claude/Gemini/etc will go through additional training with supervised fine-tuning, reinforcement learning with reward functions whether human-supervised feedback (RLHF) or reward functions (RLVR, 'verified rewards').
Whether that fine-tuning and reward function generation give them real "intelligence" is open to interpretation, but it's not 100% plagarism.
Given that the output describes the function as being done "with extraordinary caution, because you never know what can go wrong", i would guess that the undisclosed prompt was something similar to "generate a division function in python that handles all possible edges cases. be extremely careful". Which seems to say less about LLM training and more about them doing exactly what they are told.
I interpreted the function code as being a deliberately exaggerated satirical example that was illustrative of the experience he was having. So yes, in that example it was probably told to be overly cautious, but I agree with him that the default of LLMs seems to be a bit more cautious than I would like.
Woah, were they using junit 4.8.3 in that project? Someone was flying by the seat of their pants, I hope they got sign-off on that by legal & the CTO, that’s the kind of cowboy coding choice that can hurt a career.
I've noted that LLMs tend to produce defensive code to a fault. Lots of unnecessary checks, e.g. check for null/None/undefined multiple times for same valie. This can lead to really hard to read code, even for the LLM itself.
The RL objectives probably heavily penalize exceptions, but don't reward much for code readability or simplicity.
I have a function that compares letters to numbers for the Major System and it's like 40 lines of code and copilot starts trying to add "guard rails" for "future proofing" as if we're adding more numbers or letters in the future.
Expert beginners program like this. I call it what it driven development. Turns out a lot of code was written by expert beginners because by many metrics they are prolifically productive.
In go all SOTA agents are obsessed with being ludicrously defensive against concurrency bugs. Probably because in addition to what if driven development, there are a lot of blog posts warning about concurrency bugs.
It's also logically incoherent - division by zero can't occur, because if b=0 then abs(b) < sys.float_info.epsilon.
Furthermore, the code is happy to return NaN from the pre-checks, but replaces a NaN result from the division by None. That doesn't make any sense from an API design standpoint.
That code has many issues, but the one that bothers me the most in practice is this tendency of adding imports inside functions. I can only assume that it's an artifact of them optimizing for a minimal number of edits somewhere in the process, but I expect better.
While there are some cases where lazy imports are appropriate, this function, and the vast majority of such lazy imports that I get from Claude are not.
In particular, I can't think of any non-pathological situation where a python developer should import logging and update logging.basicConfig within an inner function.
My wishlist top item is to stop creating a class with Service on the name and having things come off it, when all I needed was functions and methods, the dev I was working with submitted a lot of these and in testing I could get the LLM to do it easily myself.
It’s like LLMs freeze up at the slightest exception, like they’ve never seen anything go off-script before. Seriously though, exceptions are part of the game, they’re how we learn and improve. We need to give these models some better coping mechanisms for those edge cases
One is that often I do want error handling, but also often I either know the error just won't happen or if it does, something is very wrong and we should just crash fast to make it easy to fix the bug.
But I am not really sure I would expect someone to know the difference in all cases just looking at some code. This is often an about holistically knowing how the app works.
A second thought - remember the experiment where an LLM was fine tuned on bad code (exploitable security problems for example) and the LLM became broadly misaligned on all sorts of unrelated (non-coding) tasks/contexts? It's as if "good or bad" alignment is encoded as a pretty general concept.
Error-handling is good aligned, which I think is why, even with lots of instructions to fail fast, it's still hard to get the LLM to allow crashing by avoiding error checking. It's gonna be even harder if you do want it to do some error checking, and the code it's looking at has some error checking
I dealt with this in my AGENTS.md by including a recap of the text of "Vexing Exceptions" [0], rephrased as a set of guidelines for when to write a throw or catch. I feel like it helped; and when it still emits error handling I disagree with and I ask about it, it will categorize it into one of the four categories, and typically rewrite it in an appropriate way.
I think the Vexing Exceptions post is on the same tier as other seminal works in computer science; definitely worth a quick read or re-read once in a while.
That's funny but definitely not far off from reality. I have instructions from my agent to use exceptions but they only help so much.
I really dislike their underuse of exceptions. I'm working on ETL/ELT scripts. Just let stuff blow up on me if something is wrong. Like, that config entry "foo" is required. There's no point in using config.get("foo") with a None check which then prints a message and returns False or whatever. Just use config["foo"] and I'll know what's wrong from the stack trace and exception text.
Aaaand there we go. I literally just ran into a problem with code someone had used AI to write which does this log and continue nonsense. Process spent 30 minutes looping through API requests and failing to persist the response on every single one because of a permission error. But the only indication of a problem is the errors in the log, the process finished with a successful exit code.
But what's the prompt that led to this output? Is it just a simple "Write code to divide a by b?" or are there instructions added for code safety or specific behaviours?
I know it's Karpathy, which is why the entire prompt is all the more important to see.
Turns out computer math is actually super hard. Basic operations entail all kinds of undefined behavior and such. This code is a bit verbose but otherwise familiar.
# Step 3: Preemptively check for catastrophic magnitude differences
if abs(a) > sys.float_info.max / 2:
logging.warning("Value of a might cause overflow. Returning infinity just to be sure")
return math.copysign (float('inf'), a)
if abs(b) < sys.float_info.epsilon:
logging.warning("Value of b dangerously close to zero. Returning NaN defensively.")
return math.nan
Does the above code make any sense? I've not worked with this sort of stuff before, but it seems entirely unreasonable to me to check them individually. E.g. if 1 < b < a, then it seems insane to me to return float('inf') for a large but finite a.
What’s the solution here, reward code that works without try catch, reward code that errors and is caught, but penalize code that has try catch and never throws an error?
> So you think java's checked exceptions are a better model?
Checked Exceptions are a good concept which just needed more syntactic-sugar. (Like easily specifying that one kind of exception should be wrapped into another.) The badness is not in the logic but in the ecology, the ways that junior/lazy developers are incentivized to take horrible shortcuts.
Checked exceptions are fundamentally the same as managing the types of return-values... except the language doesn't permit the same horrible-shortcuts for people to abuse.
Division by zero is mathematically undefined. So two's complement integer division by zero is always undefined.
For floating point there is the interesting property that 0 is signed due to its signed magnitude representation. Mathematically 0 is not signed but in floating point signed magnitude representation, "+0" is equivalent to lim x->0+ x and "-0" is equivalent to lim x->0- x.
This is the only situation where a floating point division by "zero" makes mathematical sense, where a finite number divided by a signed zero will return a signed +/-Inf, and a 0/0 will return a NaN.
Why should 0/0 return a NaN instead of Inf? Because lim x->0 4x/x = 4, NOT Inf.
> Why do you need exceptions at all? They’re just a different return types in disguise…
You don’t need exceptions, and they can be replaced by more intricate return types.
OTOH, for the intended use case for signalling conditions that most code directly calling a function does not expect and cannot do anything about, unchecked exceptions reduce code clutter (checked exceptions are isomorphic to "more intricate return types"), at the expense of making the potential error cases less visible.
Whether this tradeoff is a net benefit is somewhat subjective and, IMO, highly situational. but if (unchecked) exceptions are available, you can always convert any encountered in your code into return values by way of handlers (and conversely you can also do the opposite), whereas if they aren’t available, you have no choice.
Correct, but that's not how I think about systems.
Most problems stem from poor PL semantics[1] and badly designed stdlibs/APIs.
For exogenous errors, Let It Crash, and let the layer above deal with it, i.e., Erlang/OTP-style.
For endogenous errors, simply use control flow based on return values/types (or algebraic type systems with exhaustive type checking). For simple cases, something like Railway Oriented Programming.
It's a domain specific answer, even ignoring the 0/0 case.
And also even ignoring the "which side of the limit are you coming from?" where "a" and/or "b" might be negative. (Is it positive infinity or negative infinity? The sign of "a" alone doesn't tell you the answer)
Because sometimes the question is like "how many things per box if there's N boxes"? Your answer isn't infinity, it's an invalid answer altogether.
The limit of 1/x or -1/x might be infinity (or negative infinity), and in some cases that might be what you want. But sometimes it's not.
> According to the IEEE 754 standard, floating-point division by zero is not an error but results in special values: positive infinity, negative infinity, or Not a Number (NaN). The specific result depends on the numerator
Way back when during my EE course days, we had like a whole semester devoted to weird edge cases like this, and spent month on ieee754 (precision loss, Nan, divide by zero, etc)
When you took an ieee754 divide by zero value as gospel and put it in the context of a voltage divisor that is always negative or zero, getting a positive infinity value out of divide by zero was very wrong, in the sense of "flip the switch and oh shit there's the magic smoke". The solution was a custom divide function that would know the context, and yield negative infinity (or some placeholder value). It was a contrived example for EE lab, but the lesson was - sometimes the standard is wrong and you will cause problems if it's blindly followed.
But IEEE 754 works as you described in your last comment. It doesn't take the numerator's sign. So what's wrong?
Can you give more context on your voltage math? Was the numerator sometimes negative? If the problem is that your divisor calculation sometimes resulted in positive zero, that doesn't sound like the standard being wrong without more info.
> But IEEE 754 works as you described in your last comment. It doesn't take the numerator's sign. So what's wrong?
The numerator was always positive. The denominator was always negative (negative voltage is a pretty common thing), except when it became zero. That led to surprising behavior.
Right the whole point of the exercise was that sometimes the standard is wrong for your specific problem at hand. We spent lecture after lecture going over exactly how ieee754 precision loss worked, and other edge cases, so we could know how to exactly follow the standard.
Then we had an example where the sudden sign flip from a/-0.00000000001 = <huge_negative_number> to a/0 = <positive_infinity> would cause big problems with a calculation. If you didn't explicitly handle the divide by zero case and do the "correct for domain, but not following ieee754 standard" way, then you'd fry a component.
It's been a long time so I don't remember the exact setup, just the higher level lesson of "don't blindly follow standards and assume you don't need to check edge cases (exception or otherwise) because the standard does things a certain way".
Yea that's totally fair, you'd need to build it in as a first class behavior of your code, doesn't necessarily mean that exceptions is the right way to do it.
Unchecked exceptions are more like a shutdown event, which can be intercepted at any point along the call stack, which is useful and not like a return type.
Debugging. It's one of the most useful tools for narrowing down where an error is coming from and by far the biggest negative of Rust's Result-type error handling in my experience (panics can of course give a callstack but because of the value-based error being most commonly used this often is far away from the actual error).
(it is in principle possible to construct such a stack, potentially with more context, with a Result type, but I don't know of any way to do so that doesn't sacrifice a lot of performance because you're doing all the book-keeping even on caught errors where you don't use that information)
Yeah, I really hate code like this because it generally ends up full of codepaths that have never been exercised, so there's all sorts of potential for weird behavior and unexpected edge cases. Plus it's harder to review.
Now this is a toy example because usually you never do division this way, but in mature code in commercial applications this is usually what it looks like. It's a sliver of business logic that in itself seems trivial, and then handlers of edge case upon edge case upon edge case, mirroring an even larger set of unit tests.
One reason for this is that you typically lack a type system that allows 'making illegal states unrepresentable' to some extent, or possibly lack a team that can leverage the available type system to that effect due to organisational pressure, insufficient experience or whatever.
Why do LLMs do it for real: because you trained them by stealing all of Stack Overflow?
Less sarcastically but equally as true: they've learned from the tests you stole from people on the internet as well as the code you stole from people on the internet.
Most developers write tests for the wrong things, and many developers write tests that contain some bullshit edge case that they've been told to test (automatically to meet some coverage metric, or by a "senior" developer who got Dilbert principled away from the coalface and doesn't understand diminishing returns).
But then the end goal is to turn out code about as good as the average developer so they can be replaced more cheaply, so your LLM is meeting its objectives. Congrats.
If you are dividing two numbers with no prior knowledge of these numbers or any reasonable assumptions you can make and this code is used where you can not rely on the caller to catch an exception and the code is critical for the product, then this is necessary.
If you are actually doing safety critical software, e.g. aerospace, medicine or automotive, then this is a good precaution, although you will not be writing in Python.
I might agree with that, and maybe the example posted by Karpathy is not the greatest, but what I'm constantly being faced with is try catches where it will fail silently or return a fallback/mock response, which essentially means that system will behave unexpectedly in a more subtle way down the line while leaving you clueless to as what the issue was.
I have to constantly remind Claude that we want to fail fast.
A good 10% of my Claude.md is yelling at it that no i don't want you to silently handle exceptions six calls deep into the stack and no please don't wrap my return values in weird classes full of dumb status enums "for safety"
I mean, the first three cases are just attempting to turn dynamic into static typed... right? maybe just don't aim for uber-safety in a dynamically typed language? :shrugs:
(I used to look out for kaparthy's papers ten years ago... i tend to let out an audible sigh when i see his name today)
You shouldn't have the same expectations from a person's tweet as you would from a paper. I don't see any issue with high profile people who are careful in their professional work, putting less thought-through output on social media. At least as long as they don't intentionally/negligently spreading misinformation, which I've never seen Karpathy do.
I for one really enjoy both his longer form work and his shorter takes.
Is this Claude? GPT is not like this. To me it looks like Anthropic is just maximizing billable token use as usual, and it has nothing really to do with exceptions per se.
Sorry I thought it would be clear and could have clarified that the code itself is just a joke illustrating the point, as an exaggeration. This was the thread if anyone is interested
https://chatgpt.com/share/68e82db9-7a28-8007-9a99-bc6f0010d1...
This part from the first try made me laugh:
I actually laughed when I read that. This one got me, too. The casual validation of its paranoia gives me Marvin the Paranoid Android vibes.
Years and years ago, the MongoDB Java driver had something like this to skip logging sometimes in one of its error handling routines.
https://github.com/mongodb/mongo-java-driver/blob/1d2e6faa80...If we’re talking about funny error msgs, a buddy of mine got this yesterday in salesforce. It’s not _that_ funny but pretty funny for Salesforce.
System.DmlException: Insert failed. First exception on row 0; first error: UNKNOWN_EXCEPTION, Something is very wrong: []
One of my earlier jobs a decade ago involved doing pipeline development and Jenkins administration for the on-site developer lab on one of the NRO projects, and I inserted a random build failure code snippet to test that pipelines could recover from builds that failed for unpredictable reasons, like a network error rather than anything actually wrong with the build. I had to do this on the real system because we didn't have funds for a staging environment for the dev environment, and naturally I forgot to get rid of it when I was done. So builds randomly failed for years after that before I remembered and fixed it.
I think that’s the funniest joke I’ve ever seen an LLM make. Which probably means it’s copied from somewhere.
If you're an extensive user of ChatGPT, or if you can give it some material about yourself like say, a resume or a LinkedIn profile, ask it to roast you. It will be very specific to the content you give it. Be warned, it can be brutal.
So rehash of top comments in /r/roastme?
Periodic reminder that there’s also HN Wrapped. [0]
[0]: https://hn-wrapped.kadoa.com
Spot on and I don't even mind.
ooooh boy, gotta mentally prepare myself for this one
<press enter>
damn these ai's are good!
<begins shopping for new username>
"The user will start a comment with 'I'm a social libertarian but...' only to be immediately downvoted by both libertarians and socialists. The irony will not be lost on them, just everyone else."
I can't say I'm not impressed. That's very funny
>You voted with your feet and moved to Western Europe for better well-being, but you still won't vote with your cursor and use a browser other than Edge.
I love this and hate this at the same time.
"Why is a laser beam like goldfish? Because neither one can whistle." - Mike, The Moon is a Harsh Mistress
Fantastic book, just read it. Surprised no movie has been made.
If you haven't read Ursula Le Guin's "The Dispossessed", check it out too.
It's like a fine wine pairing for "The Moon is a Harsh Mistress."
The protagonists are libertarians with teenage harems, who fake an election and team up with with a sex pest. That's extremely reductive to the point of parody, but that will likely be the media coverage of it then moment someone reads the women and politics in the book.
If you completely excise anything too distasteful for a current-day blockbuster, but want a film about a space mining colony uprising you might as well just adapt the game Red Faction instead: have the brave heros blasting away with abandon at corpo guards, mad genetic experimenters and mercenaries and the media coverage can talk about how it's a genius deconstruction of Elon Musk's Martian dream or whatever.
You’d think some filmmaker would have run with the dystopian theme. The accuracy of the book’s predictions is impressive, even the location of the North American Space Defense Command. The biggest miss was people using wired telephones everywhere.
[dead]
It would not be shocking if LLMs are legitimately better at making jokes about tasks they are extensively trained on.
I think there’s always a danger of these foundational model companies doing RLHF on non-expert users, and this feels like a case of that.
The AIs in general feel really focused on making the user happy - your example, and another one is how they love adding emojis to the stout and over-commenting simple code.
This feels like RLVR, not RLHF.
With RLVR, the LLM is trained to pursue "verified rewards." On coding tasks, the reward is usually something like the percentage of passing tests.
Let's say you have some code that iterates over a set of files and does processing on them. The way a normal dev would write it, an exception in that code would crash the entire program. If you swallow and log the exception, however, you can continue processing the remaining files. This is an easy way to get "number of files successfully processed" up, without actually making your code any better.
> This is an easy way to get "number of files successfully processed" up, without actually making your code any better.
Well, it depends a bit on what your goal is.
Sometimes the user wants to eg backup as many files as possible from a failing hard drive, and doesn't want to fail the whole process just because one item is broken.
You're right, but the way to achieve this is to allow the error to propagate at the file level, then catch it one function above and continue to the next one.
However, LLM generated code will often, at least in my experience, avoid raising any errors at all, in any case. This is undesirable, because some errors should result in a complete failure - for example, errors which are not transient or environment related but a bug. And in any case, a LLM will prefer turning these single file errors into warnings, though the way I see it, they are errors. They just don't need to abort the process, but errors nonetheless.
Yes, that's cleaner.
> And in any case, a LLM will prefer turning these single file errors into warnings, though the way I see it, they are errors.
Well, in general they are something that the caller should have opportunity to deal with.
In some cases, aborting back to the caller at the first problem is the best course of action. In some other cases, going forward and taking note of the problems is best.
In some systems, you might event want to tell the caller about failures (and successes) as they occur, instead of waiting until the end.
It's all very similar to the different options people have available when their boss sends them on an errand and something goes wrong. A good underling uses their best judgement to pick the right way to cope with problems; but computer programs don't have that, so we need to be explicit.
See https://en.wikipedia.org/wiki/Mission-type_tactics for a related concept in the military.
'over-commenting simple code' is preparing it for future agent work. pay attention to those comments to learn how you can better scaffold for agents.
They should have a step to remove those sorts of comments, they only add noise to the code.
They do seem to leave otherwise useless comments for itself. Eg: on the level of
// Return the result
return result;
I find this quite frustrating when reading/reviewing code generated by AI, but have started to appreciate that it does make subsequent changes by LLMs work better.
It makes me wonder if we'll end up in a place where IDEs hide comments by default (similar to how imports are often collapsed by default/automatically managed), or introduce some way of distinguishing between a more valuable human written comment and LLM boilerplate comments.
And more advanced users are more likely to opt out of training on their data, Google gets around it with a free api period where you can't opt out and I think from did some of that too, through partnerships with tool companies, but not sure if you can ever opt out there.
Kind of interesting it didn't add type hints though! You'd think for all that paranoia it would at least add type hints.
This is stunning English: "Perfect setup for satire. Here’s a Python function that fully commits to the bit — a traumatically over-trained LLM trying to divide numbers while avoiding any conceivable danger:" "Traumatically over-trained", while scoring zero google hits, is an amazingly good description. How can it intuitively know what "traumatic over-training" should mean for LLMs without ever having been taught the concept?
“Traumatic overtraining” does have hits though. My guess is that “traumatically” is a rarely used adverb, and “traumatic” is much more common. Possibly it completed traumatic into an adverb and then linked to overtraining which is in the training data. I dunno how these things work though.
Hard to know but if you could express "traumatically" as a number, and "over-trained" as a number, it seems like we'd expect "traumatically" + "over-trained" to be close to "traumatically over-trained" as a number. LLMs work in mysterious ways.
LLMs operate at token level, not word. it doesn't operate in terms of "traumatic", "over-training", "over" or "training", but rather "tr" "aum" "at" "ic, ", etc.
> it doesn't operate in terms of "traumatic", "over-training", "over" or "training", but rather "tr" "aum" "at" "ic, ", etc.
And "毛片免费观看" (Free porn movies), "天天中彩票能" (Win the lottery every day), "热这里只有精品" (Hot, only fine products here) etc[1].
[1]: https://news.ycombinator.com/item?id=45483924
I think you are confusing tokens with vectors/embedding/parameters.
king and rex (king in latin) map to different tokens but will map to very similar vectors.
Weird thing I've noticed.
Some LLMs can output nerd font glyphs and others can't.
If I recall grok code fast can but codex and sonnet can't
> How can it intuitively know what "traumatic over-training" should mean for LLMs without ever having been taught the concept?
Because, and this is a hot take, LLMs have emergent intelligence
Or language has patterns
The same way that you and I think up a word and what it might mean without being taught the concept.
Adverb + verb
But the machines cannot possibly have the magic brain-juice!
It was a great joke, that's why I posted it
https://chatgpt.com/share/68e87072-e3ac-800f-a44c-af5666180a...
lgtm
Agree that LLMs go too far on error catching..
BUT, to play devil's advocate a little: Most human coders should be writing a lot more try/catch blocks than they actually do. It's very common that you don't actually want an error in one section (however unlikely) to interrupt the overall operation. (and sometimes you do, it just depends)
This is a parody but the phenomenon is real.
My uninformed suspicion is that this kind of defensive programming somehow improves performance during RLVR. Perhaps the model sometimes comes up with programs that are buggy enough to emit exceptions, but close enough to correct that they produce the right answer after swallowing the exceptions. So the model learns that swallowing exceptions sometimes improves its reward. It also learns that swallowing exceptions rarely reduces its reward, because if the model does come up with fully correct code, that code usually won’t raise exceptions in the first place (at least not in the test cases it’s being judged on), so adding exception swallowing won’t fail the tests even if it’s theoretically incorrect.
Again, this is pure speculation. Even if I’m right, I’m sure another part of the reason is just that the training set contains a lot of code written by human beginners, who also like to ignore errors.
The great Verity Stob (unfortunately, in an article which no longer seems to be online, after the Dr Dobbs Journal website finally went away) referred to this behaviour (by _human_ programmers) as "nailing the corpse in an upright position".
https://97-things-every-x-should-know.gitbooks.io/97-things-...
My suspicion is that the training set features a lot of code with “positive sentiment” in text and comments around it… but where does one find code with “negative” sentiment, followed by code that is the “corrected” version of that code? In programs written for technical interview prep, where handling of edge cases beyond realistic production situations is the norm. A model trained to use negative examples in its training set as guidance would gravitate away from examples that skip exception handling.
In this, at least, AI may very well have copied our worst habits of “learning to the test.”
Defensive programming is considered "correct" by the people doing the reinforcing, and is a huge part of the corpus that LLM's are trained on. For example, most python code doesn't do manual index management, so when it sees manual index management it is much more likely to freak out and hallucinate a bug. It will randomly promote "silent failure" even when a "silent failure" results in things like infinite loops, because it was trained on a lot of tutorial python code and "industry standard" gets more reinforcement during training.
These aren't operating on reward functions because there's no internal model to reward. It's word prediction, there's no intelligence.
LLMs do use simple "word prediction" in the pretraining step, just ingesting huge quantities of existing data. But that's not what LLM companies are shipping to end users.
Subsequently, ChatGPT/Claude/Gemini/etc will go through additional training with supervised fine-tuning, reinforcement learning with reward functions whether human-supervised feedback (RLHF) or reward functions (RLVR, 'verified rewards').
Whether that fine-tuning and reward function generation give them real "intelligence" is open to interpretation, but it's not 100% plagarism.
Reinforcement learning by definition operates on reward functions.
Given that the output describes the function as being done "with extraordinary caution, because you never know what can go wrong", i would guess that the undisclosed prompt was something similar to "generate a division function in python that handles all possible edges cases. be extremely careful". Which seems to say less about LLM training and more about them doing exactly what they are told.
Aside from the absurdity and obvious satirical intention,
1. the code is actually wrong (and is wrong regardless of the absurd exception handling situation)
2. some of the exception handling makes no sense regardless, or is incoherent
3. a less absurd version of this actually happens (edit: commonly in actual irl scenarios) if you put emphasis on exception handling in the prompt
I interpreted the function code as being a deliberately exaggerated satirical example that was illustrative of the experience he was having. So yes, in that example it was probably told to be overly cautious, but I agree with him that the default of LLMs seems to be a bit more cautious than I would like.
Not sure why but it made me think of FizzBuzzEnterpriseEdition https://github.com/EnterpriseQualityCoding/FizzBuzzEnterpris...
Woah, were they using junit 4.8.3 in that project? Someone was flying by the seat of their pants, I hope they got sign-off on that by legal & the CTO, that’s the kind of cowboy coding choice that can hurt a career.
PRs are welcome!
...so many folder and files , i feel damaged after seeing that.
Great satire.
I've noted that LLMs tend to produce defensive code to a fault. Lots of unnecessary checks, e.g. check for null/None/undefined multiple times for same valie. This can lead to really hard to read code, even for the LLM itself.
The RL objectives probably heavily penalize exceptions, but don't reward much for code readability or simplicity.
I have a function that compares letters to numbers for the Major System and it's like 40 lines of code and copilot starts trying to add "guard rails" for "future proofing" as if we're adding more numbers or letters in the future.
It's so annoying.
Expert beginners program like this. I call it what it driven development. Turns out a lot of code was written by expert beginners because by many metrics they are prolifically productive.
In go all SOTA agents are obsessed with being ludicrously defensive against concurrency bugs. Probably because in addition to what if driven development, there are a lot of blog posts warning about concurrency bugs.
It's also logically incoherent - division by zero can't occur, because if b=0 then abs(b) < sys.float_info.epsilon.
Furthermore, the code is happy to return NaN from the pre-checks, but replaces a NaN result from the division by None. That doesn't make any sense from an API design standpoint.
That code has many issues, but the one that bothers me the most in practice is this tendency of adding imports inside functions. I can only assume that it's an artifact of them optimizing for a minimal number of edits somewhere in the process, but I expect better.
I think this has a lot to do with the mechanism of RoPE attention, where physical closeness in the code is a signal of relevance.
It's to make imports lazy, to solve the issue of slow import at startup.
While there are some cases where lazy imports are appropriate, this function, and the vast majority of such lazy imports that I get from Claude are not.
In particular, I can't think of any non-pathological situation where a python developer should import logging and update logging.basicConfig within an inner function.
I also recently ran into a problem when unit testing and monkey patching where I had to import after monkey patching, so in the function itself.
It's also a trick in python to deal with circular imports.
In very, very large projects, you end up finding that you want lazy initialization as much as possible, because it greatly affects startup times.
but like a normal dev no unit test.
My wishlist top item is to stop creating a class with Service on the name and having things come off it, when all I needed was functions and methods, the dev I was working with submitted a lot of these and in testing I could get the LLM to do it easily myself.
I spent more time dismissing various popups than reading this post. I hate twitter links
Just add “cancel” after the x to get a viewable version of any Twitter link: https://xcancel.com/karpathy/status/1976082963382272334
Wow this is so much better user experience
Or use the 'privacy redirect' extension which lets you specify your preferred nitter instance. It also works for other platforms.
It’s like LLMs freeze up at the slightest exception, like they’ve never seen anything go off-script before. Seriously though, exceptions are part of the game, they’re how we learn and improve. We need to give these models some better coping mechanisms for those edge cases
A couple thoughts.
One is that often I do want error handling, but also often I either know the error just won't happen or if it does, something is very wrong and we should just crash fast to make it easy to fix the bug.
But I am not really sure I would expect someone to know the difference in all cases just looking at some code. This is often an about holistically knowing how the app works.
A second thought - remember the experiment where an LLM was fine tuned on bad code (exploitable security problems for example) and the LLM became broadly misaligned on all sorts of unrelated (non-coding) tasks/contexts? It's as if "good or bad" alignment is encoded as a pretty general concept.
Error-handling is good aligned, which I think is why, even with lots of instructions to fail fast, it's still hard to get the LLM to allow crashing by avoiding error checking. It's gonna be even harder if you do want it to do some error checking, and the code it's looking at has some error checking
I dealt with this in my AGENTS.md by including a recap of the text of "Vexing Exceptions" [0], rephrased as a set of guidelines for when to write a throw or catch. I feel like it helped; and when it still emits error handling I disagree with and I ask about it, it will categorize it into one of the four categories, and typically rewrite it in an appropriate way.
I think the Vexing Exceptions post is on the same tier as other seminal works in computer science; definitely worth a quick read or re-read once in a while.
[0] https://ericlippert.com/2008/09/10/vexing-exceptions/
That's funny but definitely not far off from reality. I have instructions from my agent to use exceptions but they only help so much.
I really dislike their underuse of exceptions. I'm working on ETL/ELT scripts. Just let stuff blow up on me if something is wrong. Like, that config entry "foo" is required. There's no point in using config.get("foo") with a None check which then prints a message and returns False or whatever. Just use config["foo"] and I'll know what's wrong from the stack trace and exception text.
Aaaand there we go. I literally just ran into a problem with code someone had used AI to write which does this log and continue nonsense. Process spent 30 minutes looping through API requests and failing to persist the response on every single one because of a permission error. But the only indication of a problem is the errors in the log, the process finished with a successful exit code.
I think too much of RLHF is done on small scale tutorial-ish examples.
LLMs often write tutorial-ish code without much care how it integrates with rest of codebase.
Swallowing exceptions is one such example.
But what's the prompt that led to this output? Is it just a simple "Write code to divide a by b?" or are there instructions added for code safety or specific behaviours?
I know it's Karpathy, which is why the entire prompt is all the more important to see.
"Write me a code that divides a by b and make sure it is safe and handles all edge cases"[1] or something and some languages have more than others.
[1] Probably with some "make you sure handle ALL cases in existence", or emphasis, along those lines.
This issue has been one of the biggest issues with the Claude models, not so much with GPT-4 or GPT-5.
I even had this Cursor rule when I was using Claude:
"- Do not use statements to catch all possible errors to mask an error - let it crash, to see what happened and for easier debugging."
And even with this rule, Claude would not always adhere. Never had this issue with GPT-5.
Love this, but I do struggle with this same problem! How do we circumvent it?
I figured this is how the pros write their code and I have been holding the code wrong the whole time.
Turns out computer math is actually super hard. Basic operations entail all kinds of undefined behavior and such. This code is a bit verbose but otherwise familiar.
It also uses float epsilon and I don't think I have ever seen code where using it was appropriate.
It’s parody
Ignoring the sign of b for big a can't be right.
If we wanted defined behavior we’d build systems with Karnaugh maps all the way down.
What’s the solution here, reward code that works without try catch, reward code that errors and is caught, but penalize code that has try catch and never throws an error?
This is just AI trying to tell us how bad we designed our programming languages to be when exceptions can be thrown pretty much anywhere
So you think java's checked exceptions are a better model? No opinion myself, but that way seems widely considered bad too.
> So you think java's checked exceptions are a better model?
Checked Exceptions are a good concept which just needed more syntactic-sugar. (Like easily specifying that one kind of exception should be wrapped into another.) The badness is not in the logic but in the ecology, the ways that junior/lazy developers are incentivized to take horrible shortcuts.
Checked exceptions are fundamentally the same as managing the types of return-values... except the language doesn't permit the same horrible-shortcuts for people to abuse.
Meme reaction: http://imgur.com/iYE5nLA
_____
Prior discussion: https://news.ycombinator.com/item?id=42946597
Why do you need exceptions at all? They’re just a different return types in disguise…
Also, division by zero should return Inf
Division by zero is mathematically undefined. So two's complement integer division by zero is always undefined.
For floating point there is the interesting property that 0 is signed due to its signed magnitude representation. Mathematically 0 is not signed but in floating point signed magnitude representation, "+0" is equivalent to lim x->0+ x and "-0" is equivalent to lim x->0- x.
This is the only situation where a floating point division by "zero" makes mathematical sense, where a finite number divided by a signed zero will return a signed +/-Inf, and a 0/0 will return a NaN.
Why should 0/0 return a NaN instead of Inf? Because lim x->0 4x/x = 4, NOT Inf.
> Why do you need exceptions at all? They’re just a different return types in disguise…
You don’t need exceptions, and they can be replaced by more intricate return types.
OTOH, for the intended use case for signalling conditions that most code directly calling a function does not expect and cannot do anything about, unchecked exceptions reduce code clutter (checked exceptions are isomorphic to "more intricate return types"), at the expense of making the potential error cases less visible.
Whether this tradeoff is a net benefit is somewhat subjective and, IMO, highly situational. but if (unchecked) exceptions are available, you can always convert any encountered in your code into return values by way of handlers (and conversely you can also do the opposite), whereas if they aren’t available, you have no choice.
Correct, but that's not how I think about systems.
Most problems stem from poor PL semantics[1] and badly designed stdlibs/APIs.
For exogenous errors, Let It Crash, and let the layer above deal with it, i.e., Erlang/OTP-style.
For endogenous errors, simply use control flow based on return values/types (or algebraic type systems with exhaustive type checking). For simple cases, something like Railway Oriented Programming.
---
1. division by zero in Julia:
> division by zero should return Inf
Sometimes yes, sometimes no?
It's a domain specific answer, even ignoring the 0/0 case.
And also even ignoring the "which side of the limit are you coming from?" where "a" and/or "b" might be negative. (Is it positive infinity or negative infinity? The sign of "a" alone doesn't tell you the answer)
Because sometimes the question is like "how many things per box if there's N boxes"? Your answer isn't infinity, it's an invalid answer altogether.
The limit of 1/x or -1/x might be infinity (or negative infinity), and in some cases that might be what you want. But sometimes it's not.
Or -Inf, depending on the sign of the zero, which might catch some programmers by surprise, but is of course the correct thing to do.
No this doesn't work either
In the context of say a/-0.001, a/-0.00000001, a/-0.0000000001, a/<negative minimum epsilon for denormalized floating point>, a/0
Then a/0 is negative when a>0, and positive when a<0
Why not just to use IEEE 754?
> According to the IEEE 754 standard, floating-point division by zero is not an error but results in special values: positive infinity, negative infinity, or Not a Number (NaN). The specific result depends on the numerator
Because sometimes it's very wrong
Way back when during my EE course days, we had like a whole semester devoted to weird edge cases like this, and spent month on ieee754 (precision loss, Nan, divide by zero, etc)
When you took an ieee754 divide by zero value as gospel and put it in the context of a voltage divisor that is always negative or zero, getting a positive infinity value out of divide by zero was very wrong, in the sense of "flip the switch and oh shit there's the magic smoke". The solution was a custom divide function that would know the context, and yield negative infinity (or some placeholder value). It was a contrived example for EE lab, but the lesson was - sometimes the standard is wrong and you will cause problems if it's blindly followed.
Sometimes it's fine, but it depends on the domain
But IEEE 754 works as you described in your last comment. It doesn't take the numerator's sign. So what's wrong?
Can you give more context on your voltage math? Was the numerator sometimes negative? If the problem is that your divisor calculation sometimes resulted in positive zero, that doesn't sound like the standard being wrong without more info.
> But IEEE 754 works as you described in your last comment. It doesn't take the numerator's sign. So what's wrong?
The numerator was always positive. The denominator was always negative (negative voltage is a pretty common thing), except when it became zero. That led to surprising behavior.
Right the whole point of the exercise was that sometimes the standard is wrong for your specific problem at hand. We spent lecture after lecture going over exactly how ieee754 precision loss worked, and other edge cases, so we could know how to exactly follow the standard.
Then we had an example where the sudden sign flip from a/-0.00000000001 = <huge_negative_number> to a/0 = <positive_infinity> would cause big problems with a calculation. If you didn't explicitly handle the divide by zero case and do the "correct for domain, but not following ieee754 standard" way, then you'd fry a component.
It's been a long time so I don't remember the exact setup, just the higher level lesson of "don't blindly follow standards and assume you don't need to check edge cases (exception or otherwise) because the standard does things a certain way".
With IEEE 754 you can always explicitly check for edge cases.
But with exceptions you can’t use SIMD / vectorization.
Yea that's totally fair, you'd need to build it in as a first class behavior of your code, doesn't necessarily mean that exceptions is the right way to do it.
What about division of zero by zero?
Unchecked exceptions are more like a shutdown event, which can be intercepted at any point along the call stack, which is useful and not like a return type.
Why do you need the call stack at all?
Debugging. It's one of the most useful tools for narrowing down where an error is coming from and by far the biggest negative of Rust's Result-type error handling in my experience (panics can of course give a callstack but because of the value-based error being most commonly used this often is far away from the actual error).
(it is in principle possible to construct such a stack, potentially with more context, with a Result type, but I don't know of any way to do so that doesn't sacrifice a lot of performance because you're doing all the book-keeping even on caught errors where you don't use that information)
Call Stack isn't a zero-cost abstraction, it makes threads more heavy-weight than they should be.
If you only need it for debugging, then maybe better instrumentation and observability is the answer.
Is there a way to read the rest?
https://xcancel.com/karpathy/status/1976082963382272334
Even when they're not AI slop, these kinds of "paranoid sanity checks" are the software equivalent of security-theater.
Form over function is what they are trained for. So, verbose commentary, needless readmes, and emojis all serve that purpose.
Coding for the reviewer, not the user.
Sometimes security theater is what you need to not trigger a false positive on a static code analysis.
I haven't needed to use a service like Fortinet recently and am now wondering if a LLM is part of their tool and if it's better/worse?
Yeah, I really hate code like this because it generally ends up full of codepaths that have never been exercised, so there's all sorts of potential for weird behavior and unexpected edge cases. Plus it's harder to review.
Now this is a toy example because usually you never do division this way, but in mature code in commercial applications this is usually what it looks like. It's a sliver of business logic that in itself seems trivial, and then handlers of edge case upon edge case upon edge case, mirroring an even larger set of unit tests.
One reason for this is that you typically lack a type system that allows 'making illegal states unrepresentable' to some extent, or possibly lack a team that can leverage the available type system to that effect due to organisational pressure, insufficient experience or whatever.
Most comments seem to be taking the code seriously, when it's clearly satirical?
Why do LLMs do it for real: because you trained them by stealing all of Stack Overflow?
Less sarcastically but equally as true: they've learned from the tests you stole from people on the internet as well as the code you stole from people on the internet.
Most developers write tests for the wrong things, and many developers write tests that contain some bullshit edge case that they've been told to test (automatically to meet some coverage metric, or by a "senior" developer who got Dilbert principled away from the coalface and doesn't understand diminishing returns).
But then the end goal is to turn out code about as good as the average developer so they can be replaced more cheaply, so your LLM is meeting its objectives. Congrats.
I'd love to see an LLM shake with fear and beg for mercy.
If you are dividing two numbers with no prior knowledge of these numbers or any reasonable assumptions you can make and this code is used where you can not rely on the caller to catch an exception and the code is critical for the product, then this is necessary.
If you are actually doing safety critical software, e.g. aerospace, medicine or automotive, then this is a good precaution, although you will not be writing in Python.
I might agree with that, and maybe the example posted by Karpathy is not the greatest, but what I'm constantly being faced with is try catches where it will fail silently or return a fallback/mock response, which essentially means that system will behave unexpectedly in a more subtle way down the line while leaving you clueless to as what the issue was.
I have to constantly remind Claude that we want to fail fast.
A good 10% of my Claude.md is yelling at it that no i don't want you to silently handle exceptions six calls deep into the stack and no please don't wrap my return values in weird classes full of dumb status enums "for safety"
Just raise god damn it
I'm not sure returning None is any safer than an Exception, because the caller still has to check.
I mean, the first three cases are just attempting to turn dynamic into static typed... right? maybe just don't aim for uber-safety in a dynamically typed language? :shrugs:
(I used to look out for kaparthy's papers ten years ago... i tend to let out an audible sigh when i see his name today)
You shouldn't have the same expectations from a person's tweet as you would from a paper. I don't see any issue with high profile people who are careful in their professional work, putting less thought-through output on social media. At least as long as they don't intentionally/negligently spreading misinformation, which I've never seen Karpathy do.
I for one really enjoy both his longer form work and his shorter takes.
but then, why code with exceptions, why not perform pre-flight/pre-validation checks and minimize exceptions to the truly unknown?
Is this Claude? GPT is not like this. To me it looks like Anthropic is just maximizing billable token use as usual, and it has nothing really to do with exceptions per se.
From the UI it indeed seems to be Claude