People ditch symbolic reasoning for statistical models, then are surprised when the model does, in fact, use statistical features and not symbolic reasoning.
The hard part is that all the things the author says disprove LLM intelligence are failings of humans too.
* Humans tell you how they think, but it seemingly is not how they really think.
* Humans tell you repeatedly they used a tool, but they did it another way.
* Humans tell you facts they believe to be true but are false.
* Humans often need to be verified by another human and should not be trusted.
* Humans are extraordinarily hard to align.
While I am sympathetic to the argument, and I agree that machines aligning on their own goals over a longer timeframe is still science fiction, I think this particular argument fails.
GPT o3 is a better writer than most high school students at the time of graduation. GPT o3 is a better researcher than most high school students at the time of graduation. GPT o3 is better at lots of things than any high school student at the time of graduation. It is a better coder than the vast majority of first-semester computer science students.
The original Turing test has been shattered. We keep building progressively harder standards for what counts as human intelligence, and as soon as we set a new one, we quickly achieve it.
The gap is elsewhere: look at Devin to see the limitation. Its ability to follow its own goal plans is the next frontier, and maybe we don't want to solve that problem yet. What if we just decide not to solve that particular problem and lean further into the cyborg model?
We don't need them to replace humans - we need them to integrate with humans.
> GPT o3 is a better writer than most high school students at the time of graduation.
All of these claims, based on benchmarks, don't hold up in the real world on real-world tasks. That is strongly supportive of the statistical model: it is capable of answering patterns it was extensively trained on, but it quickly breaks down when you step outside that distribution.
o3 is also a significant hallucinator. I spent quite a bit of time with it last weekend and found it to be probably far worse than any of the other top models. The catch is that its hallucinations are quite sophisticated. Unless you are using it on material about which you are extremely knowledgeable, you won't know.
LLMs are probability machines, which means they will mostly produce content that aligns with the common distribution of their data. They don't analyze what is correct, only what the probable completions of your text are, given common word distributions. But when scaled to an incomprehensible number of combinatorial patterns, this does create a convincing mimic of intelligence, and it does have its uses.
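To make "probable completions by common word distributions" concrete, here is a minimal sketch with a made-up four-word vocabulary and made-up scores (not any real model's numbers): the scores become a probability distribution via softmax and the next token is sampled from it, with nothing in the loop checking whether the continuation is true.

```python
import numpy as np

# Toy illustration: the model emits a score (logit) per vocabulary item,
# softmax turns scores into probabilities, and the next token is sampled.
# Nothing here checks whether the continuation is *correct*.
rng = np.random.default_rng(0)

vocab = ["Paris", "London", "Rome", "banana"]   # hypothetical vocabulary
logits = np.array([4.0, 2.5, 2.0, -3.0])        # hypothetical model scores

def sample_next_token(logits, temperature=1.0):
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                         # softmax
    return rng.choice(len(probs), p=probs), probs

idx, probs = sample_next_token(logits)
print(dict(zip(vocab, probs.round(3))))          # most mass on "Paris"
print("sampled:", vocab[idx])
```

Real models do this over tens of thousands of tokens with far sharper distributions, but the mechanism is the same, which is why scale can look like understanding.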
But importantly, it diverges from the behavior we would see in true intelligence in ways that make it inadequate for many of the tasks we are hoping to apply it to, namely its significant, unpredictable failures. There is just no way to know which query/prompt will require operating over concepts outside the training set.
> o3 is also a significant hallucinator. I spent quite a bit of time with it last weekend and found it to be probably far worse than any of the other top models. The catch is that its hallucinations are quite sophisticated. Unless you are using it on material about which you are extremely knowledgeable, you won't know.
At least 3/4 of humans identify with a religion which at best can be considered a confabulation or hallucination in the rigorous terms you're using to judge LLMs. Dogma is almost identical to the doubling-down on hallucinations that LLMs produce.
I think what this shows about intelligence in general is that, without grounding in physical reality, it tends to hallucinate from some statistical model of reality and confabulate further ungrounded statements unless there is a strong, active effort to ground each statement in reality. LLMs have the disadvantage of having no real-time grounding in most instantiations (Gato and related robotics projects excepted). This is not so much a problem with transformers as with the lack of feedback tokens in most LLMs. Pretraining on ground-truth texts can give an excellent prior probability over next tokens, and I think feedback, either in the weights (continuous fine-tuning) or as real-world feedback tokens in response to outputs, can get transformers to hallucinate less in the long run (e.g. after responding to feedback when OOD).
I don't dispute that these are problems, but to me the fact that its hallucinations are quite sophisticated means they are errors humans could also make.
I am not saying that LLMs are better than you analyze, but rather that average humans are worse. (Well-trained humans will continue to be better on their own than LLMs on their own for some time. But compare an LLM to an 18-year-old.)
Essentially, pattern matching can outperform humans at many tasks. Just as computers and calculators can outperform humans at tasks.
So it is not that LLMs can't be better at tasks; it is that they have specific limits that are hard to discern. Pattern matching over the entire world's data is an opaque tool: we cannot easily perceive where the walls are, or where it falls completely off the rails.
Since it is not true intelligence, but at times a good mimic of it, we will continue to struggle with unexpected failures, as it simply has no understanding of the task it is given.
> LLMs are probability machines.
So too are humans, it turns out.
We are capable of much more, which is why we can perform tasks when no prior pattern or example has been provided.
We can understand concepts from the rules, whereas LLMs must train on millions of examples. A human can play a game of chess after reading the instruction manual, without ever witnessing a single game. This is distinctly different from pattern-matching AI.
Were you using it with search enabled?
Humans can be deceptive, but it is usually deliberate. We can also honestly make things up and present them as fact, but that is not common; we usually say that we don't know. And generally, lying is harder for us than telling the truth, in the sense that constructing a consistent but false narrative requires effort.
For LLMs, making stuff up is the default; one can argue that it is all they do, and it just happens to be the truth most of the time.
And AFAIK, what I would call the "real" Turing test hasn't been shattered, not by far. The idea is that the interrogator and the human subject are both experts and collaborate against the computer. They can't cheat by exchanging secrets, but anything else is fair game.
I think it is important because the Turing test has already been "won" by primitive algorithms acting clueless to interrogators who were not aware of the trick. For me, this is not really a measure of computer intelligence, more like a measure of how clever the chatbot designers were at tricking unsuspecting people.
> we usually say that we don't know
I think this is one of the distinguishing attributes of human failures. Human failures have some degree of predictability: we know when we aren't good at something, and we devise processes to close that gap, whether consultations, training, process reviews, use of tools, etc.
The failures we see in LLMs are of a distinctly different nature. They often appear far more nonsensical and carry a greater degree of randomness.
LLMs as a tool would be far more useful if they could indicate what they are good at, but since they cannot reflect on their own knowledge, that is not possible. So they are equally confident in everything, regardless of its correctness.
I think the last few years are a good example of how this isn't really true. Covid came around and everyone became an epidemiologist and public health expert. The people in charge of the US government right now are also a perfect example. RFK Jr. is going to get to the bottom of autism. Trump is ruining the world economy seemingly by himself. Hegseth is in charge of the most powerful military in the world. Humans pretending they know what they're doing is a giant problem.
These are different contexts of errors. Take any of the humans in your example and give them an objective task, such as taking a piece of literal text and reliably interpreting its meaning, and they can do so.
LLMs cannot do this. There are many types of human failure, but we roughly know the parameters and contexts of those failures. Political, emotional, and fear-driven domains have their own issues, but we are aware of them.
However, LLMs cannot perform purely objective tasks like simple math reliably.
I think TA's argument fundamentally rests on two premises (quoting):
(a) If we were on the path toward intelligence, the amount of training data and power requirements would both be reducing, not increasing.
(b) [LLMs are] data bound and will always be unreliable as edge cases outside common data are infinite.
The most important observed consequences of (b) are model collapse when repeatedly fed LLM output in further training iterations; and increasing hallucination when the LLM is asked for something truly novel (i.e. arising from understanding of first principles but not already enumerated or directly implicated in its training data).
Yes, humans are capable of failing (and very often do) in the same ways: we can be extraordinarily inefficient with our thoughts and behaviors, we can fail to think critically, and we can get stuck in our own heads. But we are capable of rising above those failings through a commitment to truths (or principles, if you like) outside of ourselves, community (including thoughtful, even vulnerable conversations with other humans), self-discipline, intentionality, doing hard things, etc...
There's a reason that considering the principles at play, sitting alone with your own thoughts, mulling over a problem for a long time, talking with others and listening carefully, testing ideas, and taking thoughtful action can create incredibly valuable results. LLMs alone won't ever achieve that.
How many books or pieces of software written by recently graduated students have you read or used?
And by LLMs?
I mildly disagree with the author, but would be happy arguing his side also on some of his points:
Last September I used ChatGPT, Gemini, and Claude in combination to write a complex piece of code from scratch. It took four hours and I had to be very actively involved. A week ago o3 solved it on its own; at least, the Python version ran as-is, but the Common Lisp version needed some tweaking (maybe 5 minutes of my time).
This is exponential improvement, and it is not so much the base LLMs getting better; rather, it is familiarity with me (chat history) and much better tool use.
I may be incorrect, but I think improvements in very long user-event and interaction context, increasingly intelligent tool use, perhaps some form of RL to develop per-user policies for correcting tool-use mistakes, and increasingly good base LLMs will get us to a place, in the domain of digital knowledge work, where we will have personal agents that are AGI for a huge range of use cases.
> where we will have personal agents that are AGI for a huge range of use cases
We are already there for internet social media bots. I think the issue here is being able to discern the correct use cases. What is your error tolerance? For social media bots, it really doesn't matter so much.
However, mission critical business automation is another story. We need to better understand the nature of these tools. The most difficult problem is that there is no clear line for the point of failure. You don't know when you have drifted outside of the training set competency. The tool can't tell you what it is not good at. It can't tell you what it does not know.
This limits its applicability for hands-off automation tasks. If you have a task that must always succeed, there must be human review for whatever is assigned to the LLM.
Imma be honest with you, this is exactly how I would do that math, and that is exactly the lie I would tell if you asked me to explain it. This is me-level AGI.
The author says we made no progress toward AGI, but gives no definition of what the "I" in AGI is, or how we would measure meaningful progress in this direction.
In a somewhat ironic twist, it seems like the author's internal definition of "intelligence" fits much closer with 1950s good old-fashioned AI, doing proper logic and algebra. Literally all the progress we have made in AI in the last 20 years came precisely because we abandoned this narrow-minded definition of intelligence.
Maybe I'm a grumpy old fart, but none of these are new arguments. Philosophy of mind has an amazingly deep and colorful wealth of insight on this matter, and I don't know why it is not required reading for anyone writing a blog on AI.
> or how we would measure meaningful progress in this direction.
"First, we should measure is the ratio of capability against the quantity of data and training effort. Capability rising while data and training effort are falling would be the interesting signal that we are making progress without simply brute-forcing the result.
The second signal for intelligence would be no modal collapse in a closed system. It is known that LLMs will suffer from model collapse in a closed system where they train on their own data."
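The "closed system" signal can be illustrated without an actual LLM. Below is a toy sketch under loose assumptions (a categorical distribution standing in for a language model, with made-up vocabulary and sample sizes): each generation is fit only on samples drawn from the previous generation's fit, and once a rare token fails to appear its estimated probability is zero forever, so diversity can only shrink.

```python
import numpy as np

# Toy model-collapse demo: a "model" that estimates a categorical
# distribution from data, then generates the next training set by
# sampling from its own estimate. Rare tokens eventually draw zero
# samples and can never come back, so the support keeps shrinking.
rng = np.random.default_rng(0)

vocab_size, sample_size = 50, 200
true_probs = rng.dirichlet(np.ones(vocab_size))            # the "real" distribution
data = rng.choice(vocab_size, size=sample_size, p=true_probs)

for generation in range(30):
    counts = np.bincount(data, minlength=vocab_size)
    est = counts / counts.sum()                             # the fitted "model"
    if generation % 5 == 0:
        print(f"gen {generation:2d}: distinct tokens remaining = {(est > 0).sum()}")
    data = rng.choice(vocab_size, size=sample_size, p=est)  # train only on own output
```

Actual LLM training is far more complicated, but the tail-loss reported in the model-collapse literature has the same flavor: the distribution narrows toward whatever the previous generation happened to emit.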
Fascinating look at how AI actually reasons. I think it's pretty close to how the average human reasons.
But he's right that the efficiency of AI is much worse, and that matters, too.
Great read.
My understanding was that chain-of-thought is used precisely BECAUSE it doesn't reproduce the same logic that simply asking the question directly does. In "fabricating" an explanation for what it might have done if asked the question directly, it has actually produced correct reasoning. Therefore you can ask the chain-of-thought question to get a better result than asking the question directly.
I'd love to see the multiplication accuracy chart from https://www.mindprison.cc/p/why-llms-dont-ask-for-calculator... with the output from a chain-of-thought prompt.
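For what it's worth, the harness for that kind of chart is small. Here is a rough sketch under the assumption that you wire up your own client; `ask_model` is a hypothetical placeholder, not a real API, and the prompt wording is just one possible chain-of-thought phrasing.

```python
import random

def ask_model(prompt: str) -> str:
    # Hypothetical placeholder: connect this to whichever LLM client you use.
    raise NotImplementedError

def multiplication_accuracy(digits: int, trials: int = 50) -> float:
    """Fraction of random digits-by-digits multiplications answered exactly right."""
    correct = 0
    for _ in range(trials):
        a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        prompt = (f"What is {a} * {b}? Think step by step, "
                  f"then give only the final number on the last line.")
        answer = ask_model(prompt).strip().splitlines()[-1]
        answer = answer.replace(",", "").replace(" ", "")
        correct += answer == str(a * b)
    return correct / trials

# Sweep digit counts to reproduce the accuracy-vs-digits curve:
# for d in range(2, 10):
#     print(d, multiplication_accuracy(d))
```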
> Which means these LLM architectures will not be producing groundbreaking novel theories in science and technology.
Is it not possible that new theories and breakthroughs could result from this so-called statistical pattern matching? The information necessary could be present in the training data and the relationship simply never before considered by a human.
We may not be on a path to AGI, but it seems premature to claim LLMs are fundamentally incapable of such contributions to knowledge.
In fact, it seems that these AI labs are leaning in such a direction. Keep producing better LLMs until the LLM can make contributions that drive the field forward.
Certainly random chance plays a role in discovery. But most revolutionary discoveries come from a deep understanding of the context.
The contribution of LLMs to knowledge is more like that of search engines. It is still the human who possesses understanding, and who ultimately will be the principal source of innovation. The LLM can assist with navigating and exploring existing information.
However, LLMs have significant downsides in this regard too. The hallucination problem is no joke. It can often mislead you and cause a loss of time on some tasks.
Overall, they will be useful in some ways, but substantially less so than the present hype machine suggests.
They seem precisely that: search engines. Instead of giving you a list of webpages with possible answers, they synthesise the results. A more direct analogy is the case where ChatGPT provides you with two possible answers; of course it could provide more, just as search engines provide more links.
So the “reasoning” text from OpenAI is no more than the old broken Windows “loading” animation.
One point that I think separates AI from human intelligence is an LLM's inability to tell me how it feels or give its individual opinion on things.
I think to be considered alive you have to have an opinion on things.
> All of the current architectures are simply brute-force pattern matching
This explains hallucinations, and I agree with the 'brain dead' argument. To move toward AGI, I believe there should be some kind of social-awareness component added, which is an important part of human intelligence.
A red flag nowadays is when a blog post tries to judge whether AI is AGI, because those goal posts are constantly moving and there is no agreed-upon benchmark to meet. More often than not, the author reasons from their own perspective about why exactly something is not AGI yet, while another user happily uses AI as a full-fledged employee, depending on the use case. I’m personally using AI as a coding companion, and it seems to be doing extremely well for being brain dead, at least.
It’s AGI when it can improve itself with no more than a little human interaction
Who is using AI as full-fledged employees?
I really dislike what I now call the American We.
"We made it!" "We failed!" written by somebody who doesn't have the slightest connection to the projects they're talking about. e.g. this piece doesn't even have an author but I highly doubt he has done anything more than using chatgpt.com a couple times.
Maybe this could be the Neumann's law of headlines: if it starts with We, it's bullshit.
Isn’t the „we“ supposed to mean „humanity“?
I’ve been saying this for ages. People use “we” way too freely.