On the other hand, having no verifiable sources should leave everyone unsatisfied.
> Have you ever done the experiments to prove the Earth is round?
I have, actually! Thanks, astronomy class!
I've even estimated the earth's diameter, and I was only like 30% off (iirc). Pretty good for the simplistic method and rough measurements we used.
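For anyone curious, the Eratosthenes-style estimate comes down to a single proportion; a minimal sketch in Python, with illustrative numbers rather than real measurements:

    import math

    # Eratosthenes-style estimate: compare the sun's noon elevation at two sites
    # on roughly the same meridian. The numbers here are illustrative only.
    angle_difference_deg = 7.2   # difference in the sun's angle between the sites
    site_distance_km = 800       # north-south distance between the sites

    # The angle difference is the same fraction of 360 degrees as the distance
    # is of the full circumference, so scale up accordingly.
    circumference_km = site_distance_km * 360 / angle_difference_deg
    diameter_km = circumference_km / math.pi

    print(f"Estimated diameter: ~{diameter_km:,.0f} km (actual: ~12,742 km)")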
Sometimes authorities are actually authoritative, though, particularly for technical, factual material. If I'm reading a published release date for a video game, directly from the publisher -- what is there to contest? Meanwhile, ask an LLM and you may have... mixed results, even if the date is within its knowledge cutoff.
Oh, I think it's great they did that. It's super helpful for visualizing ChatGPT's limitations. Ask it for an absolutely full, overflowing glass of wine or a wrist watch whose time is 6:30 and it's obvious what it actually does. It's educational.
I asked claude to give me a script in python to create a map highlighting all spanish speaking countries. it took 3 tries and then gave me a perfect svg and png.
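For comparison, a minimal sketch of the kind of script that does work here, assuming an older geopandas (pre-1.0) that still bundles the naturalearth_lowres dataset; country names must match that dataset's naming, and the list below is abbreviated:

    import geopandas as gpd
    import matplotlib.pyplot as plt

    # Assumes geopandas < 1.0, which bundles the Natural Earth low-res dataset;
    # newer versions require downloading the shapefile separately.
    world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))

    # Abbreviated for illustration; names must match the dataset (e.g. "Dominican Rep.").
    spanish_speaking = {
        "Spain", "Mexico", "Colombia", "Argentina", "Peru", "Venezuela",
        "Chile", "Ecuador", "Guatemala", "Cuba", "Bolivia", "Dominican Rep.",
        "Honduras", "Paraguay", "El Salvador", "Nicaragua", "Costa Rica",
        "Panama", "Uruguay",
    }
    world["highlight"] = world["name"].isin(spanish_speaking).astype(int)

    ax = world.plot(column="highlight", cmap="OrRd", figsize=(12, 6), edgecolor="gray")
    ax.set_axis_off()
    plt.savefig("spanish_speaking.png", dpi=150)
    plt.savefig("spanish_speaking.svg")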
> Just gotta do the grunt work, add a tool with a map api. Integrate with google maps for transit stuff.
This is kind of the crux though. The only way to make LLMs more useful is to basically make them traditional AI. So it's not really a leap forward, never mind a path to AGI.
OpenAI is going to add it to Plus subscriptions. I.e. available for many at no additional cost. Likely with restrictions like N prompts/hour.
As for API price, when it matters businesses and people are willing to pay much more for just a bit better results. OpenAI doesn't take the other options away. So we don't lose anything.
IMO the 4o output is a lot more Enterprise-compatible; the 4.5 being straight to the point and more natural is quite the opposite. Pricing-wise your point stands.
Disclaimer: have not tried 4.5 yet, just skimmed through the announcement, using 4o regularly.
Apparently, OpenAI API “credits” expire after a year. I stupidly put in another $20 and am trying to blow through the credits; 4.5 is the easiest way, considering the recent 4o has fallen out of favor versus other models and I don’t want to just let them expire again. An expiry after only one year is asinine.
Not sure how it's with OpenAI, but Anthropic is so money-hungry, they won't even let you remove your debit card data from your account without a week-long support encounter.
This is how pricing on human labour works. Nobody expects an employee that costs twice as much to produce twice the output for any given task. All that is expected is that they can do a narrow set of things, that another person can't.
4.5 can extremely quickly distill and work with what I, at least, consider complex, nuanced thought. 4.5 is night and day better than every other AI for my work; it's quite clever and I like it.
The 4.5 has better 'vibes' but isn't 'better', as a concrete example:
> Mission is the operationalized version of vision; it translates aspiration into clear, achievable action.
The "Mission is the operationalized version of vision" is not in the corpus that I am find and is obviously a confabulated mixture of classic Taylorist like "strategic planning"
SOPs and metrics, which will be tied to compensation and the unfortunate ubiquitous nature of Taylorism would not result in shared purpose, but a bunch of Gantt charts past the planning horizon.
IMHO I would consider "complex nuanced thought" as understanding the historical issues and at least respect the divide between classical and neo-classical org theory. Or at least avoid pollution of more modern theories with classical baggage that is a significant barrier to delivering value.
Mission statements need to share strategic intent in an actionable way, strategy is not operationalization.
I have been experimenting with 4.5 for a journaling app I am developing for my own personal needs, for example, turning bullet/unstructured thoughts into a consistent diary format/voice.
The quality of writing can be much better than Claude 3.5/3.7 at times, but it struggles with a similar confabulation of information that is not in the original text but "sounds good/flows well". Which isn't ideal for a personal journal... I am still playing around with the system prompt, but given the astronomical cost (even with me as the only user) with marginal benefits, I am probably going to end up sticking with Claude for now.
Unless others have a recommendation for a less robot-y sounding model (that will, however, follow instructions precisely) with API access other than the mainstream Claude/OpenAI/Gemini models?
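The setup described above is essentially a system prompt plus one chat call per entry; a minimal sketch using the OpenAI Python SDK, where the model name and prompt wording are placeholder assumptions rather than recommendations:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    SYSTEM_PROMPT = (
        "Rewrite the user's bullet points as a diary entry in a consistent, "
        "personal voice. Use only facts present in the input; do not invent "
        "details, names, or events that are not explicitly mentioned."
    )

    def rewrite_entry(bullets: str, model: str = "gpt-4.5-preview") -> str:
        """Turn unstructured bullet notes into a formatted diary entry."""
        response = client.chat.completions.create(
            model=model,  # placeholder; swap for whichever model wins the cost/quality tradeoff
            temperature=0.4,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": bullets},
            ],
        )
        return response.choices[0].message.content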
(also: the person you are responding to is doing exactly what you're saying you don't want done - taking something unrelated to the original text (Taylorism) that could sound good, and jamming it in)
The statement "Mission is the operationalized version of vision; it translates aspiration into clear, achievable action" isn't a Taylorist reduction of mission to mechanical processes - it's actually a nuanced understanding of how these organizational elements relate. You're misinterpreting what "operationalized" means in this context. From what i can tell, the 4.5 response isn't suggesting Taylorist implementation with Gantt charts etc it's describing how missions translate vision into actionable direction while remaining strategic. Instead of jargon, it's recognizing that founders need something between abstract vision and tactical execution. Missions serve this critical bridging function. CEO has vision, orgs capture the vision into their missions, people find their purpose when aligned via the 2. Without it, founders either get stuck in aspirational thinking or jump straight to implementation details without strategic guidance. The distinction matters exactly because it helps avoid the dysfunction that prevents startups from scaling effectively. I think you're assuming "operationalized" means tactical implementation (Gantt charts, SOPs) when in this context it means "made operational/actionable at a strategic level". Missions != mission statements. Also, you're creating a false dichotomy between "strategic intent" and "operationalization" when they very much, exist on a spectrum. (If anything, connecting employees to mission and purpose is the opposite of Tayloristic thinking, which viewed workers more as interchangeable parts than as stakeholders in a shared mission towards responding to a shared vision of global change) - You are doing what o1 pro did, and as I said: As a tool for teaching business to founders, personally, I find the 4.5 response to be better.
An example of a typical naive definition of a mission statement is:
Concise, clear, and memorable statement that outlines a company's core purpose, values, and target audience.
> "made operational/actionable at a strategic level".
Taking the common definition from the first part of this plan, what do you think the average manager would do, given that in the social sciences operationalization is explicitly about measuring abstract qualities? [1]
"Operationalization" is a compromise, trying to quantify qualitative properties; it is not typically subject to methods like the MECE principle, because there are too many unknown unknowns.
You are correct that "operationalization" and "strategic intent" are not mutually exclusive in all aspects, but they are for mission statements that need to be durable across changes that no CEO can envision.
The "made operational/actionable at a strategic level" is the exact claim of pseudo scientific management theory (Greater Taylorism) that Japan directly targeted to destroy the US manufacturing sector. You can look at the former CEO of Komatsu if you want direct evidence.
GM's failure to learn from Toyota at NUMII (sp?) is another.
The planning process needs to be informed by strategy, but planning is not strategic; it has a limited horizon.
But you are correct that it is more nuanced and neither Taylor nor Tolstoy allowed for that.
Neo-classical org theory is when bounded rationality was first acknowledged, although the Prussian military figured that out long before Taylor grabbed his stopwatch to time people loading pig iron into train cars.
Your responses are interesting because they drive me to feel reinforced in my opinion. This conversation is precisely why I rate 4.5 over o1 pro. I prompted in a very, very, very specific way. I'm afraid to say your comments are highly disengaged from the realities of business and business building. I appreciate the historical context and recommended reading (although I assure you, I am extremely well versed).

The term 'operationalized' here refers to strategic alignment, not Taylorist quantification - think guiding principles over rigid metrics. You are badly conflating operationalization in the social sciences (which is about measurement) with strategic operationalization in management, which is not the same. Again: operationalized in this context means making the mission actionable at a strategic level, not quantification. Modern mission frameworks prioritize adaptability within durable purpose, avoiding the pitfalls you've rightly flagged.

Successful founders don't get caught up in these theoretical distinctions. Founders taught by me, and I guess by GPT-4.5, correctly understand mission as the bridge between aspirational vision and practical action. This isn't "Greater Taylorism" but pragmatic leadership. While your historical references (NUMMI, not NUMII) demonstrate academic knowledge, they miss how effective missions actually guide organizations while remaining adaptable. The 4.5 response captured this practical reality well - it pointed to it but did not create artificial boundaries between interconnected concepts.

If we had some founders trained by you (o1 Pro) and me (GPT-4.5), I would be willing to bet my founders would outperform yours any day of the week.
Tuckman as a 'real' framework is a belief so that is fair.
He clearly communicated in 1977 that his ideas were never formally validated and that he cautioned about their use in other contexts.
I think that the concepts can be useful, as long as you don't take them as anything more than a guiding framework that may or may not be appropriate for a particular need.
I personally find value in team and org mission statements, especially for building a shared purpose, but to be honest, any of the studies on that are more about manager satisfaction than anything else.
There is far more data on the failure of strategy execution, and linking strategy with purpose as well as providing runways and goals is one place I find vision and mission statements useful.
As up to 90% of companies fail on strategy execution, and because employee engagement is in free fall, the fact that companies are still in business means little.
Context is king, and this is horses for courses, but I would caution against ignoring more recent, Nobel winning theories like Holmström's theorem.
Most teams don't experience the literal steps Tuckman suggested, rarely all at once, and never as one time singular events. As the above link demonstrated, some portions like the storming can be problematic.
Make them operationalize their mission statement, and they will - and it will be set in concrete.
Remember von Moltke: "No plan of operations extends with certainty beyond the first encounter with the enemy's main strength."
There is a balance between C2 and mission-command styles; the risk is forcing, or worse, intentionally causing people to resort to C2 when almost always you need a shifting balance between command- and intent-based solutions.
The Feudal Mode of Production was sufficient for centuries, but far from optimal.
The NUMMI reference was exactly related to the same reason Amazon's profits historically rose higher than its headcount increases should have allowed.
Small cross functional teams, with clearly communicated tasks, and enough freedom to accomplish those tasks efficiently.
You can look at Trist's study about the challenges of incentives leading teams to game the system. The same problem happened under Ballmer at MS, and DEC failed the opposite way, trying to do everything at once and please everyone.
The reality is that the popularity of frameworks rarely relates to their effectiveness, building teams is hard, making teams work as teams across teams is even harder.
Tuckman may be useful in that...but this claim is wrong:
> "Modern mission frameworks prioritize adaptability within durable purpose, avoiding the pitfalls you’ve rightly flagged"
Modern _ frameworks prioritize adoption, and depending on the framework to solve your company's needs will always fail. You need to choose a framework that fits your strategy and objectives, and adapt it to fit your needs.
Learn from others, but don't ignore the reality on the ground.
Regarding Tuckman's model, there are actually numerous studies validating its relevance and practical application: Gren et al. (2017) validated it specifically for agile teams across eight large companies. Natvig & Stark (2016) confirmed its accuracy in team development contexts. Bonebright's (2010) historical review demonstrated its ongoing relevance across four decades of application.
I feel we're talking past each other here. My original point was about which AI model is better for MY WORK. (I run a startup accelerator for first-time founders.) 4.5, in 30 seconds rather than minutes, provided more practical value to founders building actual businesses, and saved me time. While I appreciate your historical references and academic perspectives, they don't address my central argument about GPT-4.5's response being more pragmatically useful. The distinction between academic precision and practical utility is exactly what I'm highlighting. Founders don't need perfect theoretical models - they need frameworks that help them bridge vision and execution in the real world. When you bring up feudal production modes and von Moltke, we're moving further from the practical question of which AI response would better guide someone trying to align teams around a meaningful mission that drives business results. It's exactly why I formed the two prompts in the manner I did: I wanted to see if it was an academic or an expert.
My assessment stands that GPT-4.5's 30 seconds of thinking captured how mission operationalizes vision in a way that reflects how successful businesses actually work, not how academics might describe them in theoretical papers. I've read the papers, I've studied the theory deeply, but I also have NYSE and NASDAQ ticker symbols under my belt, from seed. That is the whole point here.
OK maybe we are using different meanings of the word "operationalize"
If I were, say, in middle management and you asked me to "operationalize" the impact of mission statements, I would try to associate the existence of a mission statement on a team with some metric like financial performance.
If I were on a small development team and you asked me to "operationalize" our mission statement, I would probably make the same mistake the software industry always does, like tying it to tickets closed, lines of code, or even the DORA metrics.
Under my understanding of "operationalize", and the only way I can find it referenced in relation to mission statements themselves, I would actually de-emphasize deliverables, quality, stakeholders' changing demands, etc.
Even if I try to "operationalize" in a more abstract way, like defining an impact score, that score may not directly map to business objectives or even team building.
Almost every LLM offers a definition similar to the one I offered above, e.g.:
> "operationalization" refers to the process of defining an abstract concept in a way that allows it to be measured and observed through specific, concrete indicators
Impact scores, which are subjective, can lead to Google's shelfware problems, and even scrum rituals often lead to hard-but-high-value tasks being ignored because the incentives don't allow for it.
In both of your citations, the situations were ones where existing cultures were enhanced, not fully replaced.
Both were also short term, and wouldn't capture the long-tail problems I am referencing.
Heck even Taylorism worked well for the auto industry until outside competition killed it. Well at least for the companies, consumers suffered.
The point is that "operationalization" specifically is counterproductive under a model where infighting during that phase is bad.
If you care about delivering on execution, it would seem to be important to you. But I realize that you may not be targeting repeat work...I just don't know.
But I am sure some McKinsey BA has probably put that concern in a PDF someplace by now, because the GAO Agile Assessment Guide is being incorporated and even ITIL and TOGAF reference that coal-face paper I cited.
The BCGs and McKinseys of the world are absolutely going to shift to detection of common confabulations to show value.
While I do take any tools possible to make me more productive, correctness of content concerns me more than exact verbiage.
But yes, different needs; I am in the niche of rescuing failed initiatives, which admittedly is far from the typical engagement style.
To be honest the lack of scratch space on 4.5 compared to CoT models is the main blocker for me.
I believe 4.5 is a very large and rich model. The price is high because it's costly to inference; however, the bigger reason is to ensure that others don't distill from it. Big models have a rich latent space, but it takes time to squeeze the juice out.
The small number of use cases that do pay are providing gross margins as well as feedback that helps OpenAI in various ways. I don’t think it’s a stupid move at all.
My assumption: There will be use cases where cost of using this will be smaller than the gain from it. Data from this will make the next version better and cheaper.
My take from using it a bit is that they seem to have genuinely innovated on:
- Not writing things that go off in weird directions / staying grounded in "reality"
- Responding very well to tone preferences and catching nuance in what I say
It seems like it's less that it has a great "personality" like Claude, but that it's capable of adapting towards being the "personality" I want and "understanding" what I'm saying in ways that other models haven't been able to do for me.
So this kind of mirrors my feelings after using GPT-4.5 on general conversation and song writing.
GPT picked up on unspecified requirements almost instantly. It is subtle (and may be undesirable in some contexts). For example, in my songs I have to bracket the section headings, and it picked up on that from my original input. All the other frontier models generally have to be reminded. Additionally, I separately asked for an edit to a music style description. When I asked GPT-4.5 to write a song all by itself, it included a music style description. No other model I have worked with has done this.
These are subtle differences, but in aggregate the model just generally needs less nudging to create what is required.
I haven't used 4.5 but have some experience using Claude for creative writing, and in my experience it sometimes has the uncanny ability to get to the core of my ideas, rephrasing my paragraph long descriptions into just a sentence or two, or both improving and concretizing my vague ideas into something that's insightful and tasteful.
Other times it locks itself into a dull style and ignores what I ask of it and just produces boring generic garbage, and I have to wrangle it hard to get some of the spark back.
I have no idea what's going on inside, but just like with Stable Diffusion, it's fairly easy to make something that has the spark of genius, and is very close to being perfect, but getting the last 10% there, and maintaining the quality seems almost impossible.
It's a very weird feeling; it's hard to put into words what exactly is going on, and probably even harder to make it into a benchmark, but it makes me constantly flip-flop between being scared of how good the AI is and questioning why I ever bothered using it in the first place, as I would've progressed much faster without it.
Long term it might be hard to monetise that infrastructure considering their competition:
1) For coding (API), most will probably stick to Claude 3.5/3.7 - a big market but still small compared to all worldwide problems
2) For non-coding API use, IMHO Gemini 2.0 Flash is the winner - dirt cheap (cheaper than 4o-mini), good enough and even better than GPT-4o, with cheap audio and image input.
3) For the subscription app, ChatGPT is probably still the best, but only slightly - they have the best advanced voice conversation, but Grok will probably be eating their lunch here
We were using gpt-4o for our chat agent, and after some experiments I think we'll move to flash 2.0. Faster, cheaper and a bit more reliable even.
I also experimented with the experimental thinking version, and there a single node architecture seemed to work well enough (instead of multiple specialised sub agents nodes). It did better than deepseek actually. Now I'm waiting for the official release before spending more time on it.
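A rough illustration of the single-node vs. specialised-sub-agents distinction being described; the tool names and routing here are entirely hypothetical:

    # Hypothetical sketch: one "node" (a single model call with all tools attached)
    # versus routing to specialised sub-agents first. Tool names are made up.
    TOOLS = {
        "lookup_order": lambda order_id: {"status": "shipped"},
        "refund": lambda order_id: {"ok": True},
    }

    def single_node_agent(user_message: str, llm) -> str:
        # One model sees the whole conversation and every tool; simpler to build
        # and, per the comment above, often reliable enough with a strong model.
        return llm(user_message, tools=TOOLS)

    def multi_agent(user_message: str, llm) -> str:
        # A router model first picks a specialised sub-agent, each with a narrow
        # prompt and tool subset; more moving parts, more places to misroute.
        route = llm(f"Classify this request as 'orders' or 'refunds': {user_message}")
        subset = {"orders": ["lookup_order"], "refunds": ["refund"]}[route.strip()]
        return llm(user_message, tools={k: TOOLS[k] for k in subset})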
For the rest of us using free tiers, ChatGPT is hands down the winner, allowing limited image generation, unlimited usage of some models, and limited usage of 4o.
Claude is still stuck at 10 messages per day and gemini is less accurate/useful.
It's marketed as being slightly better at "creative writing". This isn't the problem most businesses have with current-generation LLMs. On the other side, Anthropic released, at nearly the same time, a new model which solves more practical problems for businesses, to the point that for coding many insiders don't use OpenAI models for such tasks anymore.
I think it should be illegal to trick humans into reading "creative" machine output.
It strikes me as a form of fraud that steals my most precious resources: time and attention. I read creative writing to feel a human connection to the author. If the author is a machine and this is not disclosed, that's a huge lie.
It should be required that publishers label AI generated content.
> I think it should be illegal to trick humans into reading "creative" machine output.
Creativity has lost its meaning. Should it be illegal? The courts will take a long time to settle the matter. Reselling people's work against their will as creative machine output seems unethical, to say the least.
> It should be required that publishers label AI-generated content.
I'm pretty sure you read for pleasure, and feeling a human connection is one way that you derive pleasure. If it's the only way that you derive pleasure from reading, my condolences.
Pretty much where my thoughts on this are. I rarely feel any particular sense of connection to the actual author when I read their books. And I have taken great pleasure from some AI stories (to the degree I put them up on my personal website as a way to keep them around).
Under the dingnuts regime, Dwarf Fortress will be illegal. Actually, any game with a procedural story? You better believe that's a crime: we can't have a machine generate text a human might enjoy.
Dingnuts' point was that it should be disclosed. Everyone knows Dwarf Fortress stories are procedural/AI generated; the authors aren't trying to hide that fact.
Consumer limits. When something is good enough that it can make stable money, there is no real incentive to innovate beyond the bare minimum—just enough to keep consumers engaged, shareholders satisfied, and regulators at bay.
This is how modern capitalism and all corporations work, we will keep receiving new numbers in versions without any sensible change – consumers will keep updating their subscriptions out of habit – xyzAI PR managers, HR managers, corporate lawyers and myriads of other bureaucrats will keep receiving their paychecks secretly dreaming of retirement, xyzAI top management will burn money on countless acquisitions just to fight boredom, turning into xyz(ai)MEGAcorp doing everything from crude oil processing and styrofoam cups to web services and AI models.
No modern mega corporation is capable of making something else or different from what already worked for them just once. We could have achieved universal welfare and prosperity 60 years ago, but that would’ve disrupted the cycle. Instead, we got planned obsolescence, endless subscription models, and a world where everything “new” is just a slightly repackaged version of last year’s product.
Yes, I believe the sprint is over; now it's going to be slow cycles, maybe 18 months to see a 5% increase in ability, and even that 5% increase will be highly subjective. Claude's new release is about the same: 3.7 is arguably worse at some things than 3.5 and better at others. Based on the previous pace of release, in about 6 months or so - if the next release from any of the leaders is about the same "kinda better, kinda worse" - then we'll know. Imagine how much money is going to evaporate from the stock market if this is the limit!!!
No, it means that it got better on things orthogonal to what we have mostly been measuring. On the last few rounds, we have been mostly focusing on reasoning, not as much on knowledge, "creativity", or emotional resonance.
"It's better. We can't measure it, but we're pretty sure it's better. We also desperately need it to be better because we just spent a boat-load of money on it."
Meanwhile all GPT4o models on Azure are set to be deprecated in May and there are no alternative models yet. Should we start moving to Anthropic? DeepSeek is too slow, melting under its own success. Does anyone on GPT4o/Azure have any idea when they'll release the next "o" model?
Only an older version of GPT-4o has been deprecated and will be removed in May. The newest version will be supported through at least 20 November 2025.
The Nov 2024 release, which is due to be deprecated in Nov 2025, I was told has degraded performance compared to the Aug 2024 release. In fact, OpenAI Models page says their current GPT4o API is serving the Aug release. https://platform.openai.com/docs/models#gpt-4o
So I'm still on the Aug 24 release, which, with your reminding me, is not to be deprecated till Aug 2025, but that's less than 5 months from now, and we're skipping the Nov 2024 release just as OpenAI themselves have chosen to do.
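For anyone in the same boat, the practical mitigation is to pin the dated snapshot rather than the floating alias (on Azure, the equivalent is pinning the model version on the deployment); a minimal sketch, treating the snapshot id as an assumption to verify against the models page:

    from openai import OpenAI

    client = OpenAI()

    # "gpt-4o" is a floating alias that can be repointed to a different snapshot;
    # pinning the dated identifier keeps behaviour stable until that snapshot retires.
    PINNED_MODEL = "gpt-4o-2024-08-06"  # assumed snapshot id; check the models page

    response = client.chat.completions.create(
        model=PINNED_MODEL,
        messages=[{"role": "user", "content": "ping"}],
    )
    print(response.model)  # echoes the snapshot actually used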
I've found 4.5 to be quite good at "business decisions", much better than other models. It does have some magic to it, similar to Grok 3, but maybe a bit smarter?
It seems like there’s a misunderstanding as to why this happened. They’ve been baking this model for months, long before DeepSeek came out with fundamentally new ways of distilling models. And even given that it’s not great in its large form, they’re going to distill from this going forward... so it likely makes sense for them to periodically train these very large models as a basis.
I think this framing isn't quite right either. DeepSeek's R1 isn't very different from what OpenAI has already been doing with o1 (and that other groups have been doing as well). As for distilling - the R1 "distilled" models they released aren't even proper (logit) distillations, but just SFTs, not fundamentally new at all. But it's great that they published their full recipes and it's also great to see that it's effective. In fact we've seen now with LIMO, s1/s1.1, that even as few as 1K reasoning traces can get most LLMs to near SOTA math benchmarks. This mirrors the "Alpaca" moment in a lot of ways (and you could even directly mirror say LIMO w/ LIMA).
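To make that distinction concrete: proper (logit) distillation matches the teacher's full next-token distribution, while the R1-style releases are plain SFT on teacher-generated text, which is all you can do without teacher logits. A minimal PyTorch sketch of the two losses, with illustrative temperature and shapes:

    import torch
    import torch.nn.functional as F

    temperature = 2.0  # illustrative

    def logit_distillation_loss(student_logits, teacher_logits):
        # Match the teacher's full next-token distribution (requires teacher logits).
        s = F.log_softmax(student_logits / temperature, dim=-1)
        t = F.softmax(teacher_logits / temperature, dim=-1)
        return F.kl_div(s, t, reduction="batchmean") * temperature**2

    def sft_loss(student_logits, teacher_token_ids):
        # "Distillation" via SFT: ordinary cross-entropy on text the teacher generated,
        # which is all that is possible when only final outputs (no logits) are available.
        return F.cross_entropy(
            student_logits.view(-1, student_logits.size(-1)),
            teacher_token_ids.view(-1),
        )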
I think the main takeaway of GPT4.5 (Orion) is that it basically gives a perspective to all the "hit a wall" talk from the end of last year. Here we have a model that has been trained on by many accounts 10-100X the compute of GPT4, is likely several times larger in parameter count, but is only... subtly better, certainly not super-intelligent. I've been playing around w/ it a lot the past few days, both with several million tokens worth of non-standard benchmarks and talking to it and it is better than previous GPTs (in particular, it makes a big jump in humor), but I think it's clear that the "easy" gains in the near future are going to be figuring out how as many domains as possible can be approximately verified/RL'd.
As for the release? I suppose they could just have kept it internally for distillation/knowledge transfer, so I'm actually happy that they released it, even if it ends up not being a really "useful" model.
I've been using 4.5 instead of 4o for quick questions. I don't mind the slowness for short answers. I feel like it is less likely to hallucinate than other models.
I have access to it. It is better, but not where most techies would care. It knows more, it writes better, it's more pleasant to talk to. I think they might have studied the traffic their hundreds of millions of users generate and realized where they need to improve, then did exactly that for their _non thinking_ model. They understand that a non-thinking model is not going to blow the doors off on coding no matter what they do, but it can do writing and "associative memory" tasks quite well, and having a lot more weights helps there. I also predict that they will fine tune their future distilled, thinking models for coding, based on the same logic, distilling from 4.5 this time. Those models have to be fast, and therefore they have to be smaller.
Sam Altman views Steve Jobs as one of his inspirations (he called the iPhone the greatest product of all time). So if you look at OpenAI through the lens of Apple, where you think about making the product enjoyable to use at all costs, then it makes perfect sense why you’d spend so much money to go from 4o to 4.5, which brings such subtle differences to power users.
The vast majority of users, which are over 300 million weekly, will mainly use 4o and whatever is the default. In the future they’ll use 4.5 and think it’s most human like and less robotic.
Yes but Steve Jobs also understood the paradox of choice, and the importance of having incredibly clear delineation between every different product in your line.
Do models matter to the regular user over brand? People talk about using ChatGPT over Google's AI or DeepSeek, not 4o-mini vs Gemini 2.
OpenAI has done a good job of making the model less important and the domain chatgpt.com more important.
Most of the time the model rarely matters. When you find something incorrect you may switch models but that rarely fixes the problem. Rewording a prompt has more value than changing a model.
OpenAI has been irrelevant for a while now. All of the new and exciting developments on AI are coming from other places. ClosedAI is no longer the driver of change and innovation.
I think OpenAI is currently in this position where they are still the industry standard, but also not leading. DeepSeek R1 beat o1 on perf/cost, with similar performance at a fraction of the cost. o3-mini is judged as “weird” and quite hit-and-miss on coding (basically the sole reason for its existence), with a sky-high SimpleQA hallucination rate due to its limited scope, and is probably beaten by Sonnet 3.7 by a fairly large margin.
Still, being early with a product that is often “good enough” takes them a long way. I think GPT-5, and where their competition will be then, will be quite important for OpenAI though. I think the signs on the horizon are that everyone will close in on each other as we hit diminishing returns, so the underlying business model, integrations, enterprise reach, marketing and market share will probably be king rather than the underlying LLM in 2026.
Since GPT-5 is meant to select the best model behind the scenes, one issue might be that users won’t have the same confidence in the model, feeling like it’s deciding for them or OpenAI tuning it to err on the side of being cheap.
I'm not even sure what is being alleged there—o1's reasoning tokens are kept secret precisely to avoid the kind of distillation that's being alleged. How can you distill a reasoning process given only the final output?
Do they? Why doesn't this happen to Claude then? I've been hearing this for a while, but never saw any evidence beyond the contamination of the dataset with GPT slop that is all over the web. Just by sending anything to the competitors you're giving up a large part of your know how before you even finish your product, that's a big incentive against doing that.
And OpenAI based their tech on a Google paper, itself building on years of public academic research, so what's the point exactly here?
OpenAI was just first out of the gates; there'll always be some company that's first. The essence is how they handle their leadership, and they've sadly been absolutely terrible and scummy.
Actually I think Google was a pretty good example of the exact opposite: decades of "actually not being evil", while OpenAI switched up one second after launch.
Google wasn't the first search engine, but they were the best at marketing google = search. That's where we are with OpenAI. Google Search was the better product at the time, and ChatGPT 3.5 was a breakthrough the public used. Fast forward, and some will say Google isn't the best search engine anymore (Kagi, DuckDuckGo, Yandex offer different experiences), but people still think google = search. Same with ChatGPT. Claude may be better for coding, or Gemini better at searching, or DeepSeek cheaper but equal, but ChatGPT is a verb and will live on, like "Intel Inside", long after its actual value has declined.
That happened long after they completely dominated search. They succeeded because of quality, and because of how low quality all the other engines were.
There was a time when Google was thought of as a respectable, high-quality, smart and nimble company. That has faded as the marketing grew.
What is your point? OpenAI wasn’t the first out of that gate, as your own argument cites Google prior. All these companies are predatory; who is arguing against that? OP said OpenAI was irrelevant. That’s just dumb. They are not. Feel free to advance an argument in favor of that narrative if you wish; I was just trying to provide a single example showing that some of these lightweight models are building directly off the backs of giants spending the big money. I find nothing wrong with distillation and am excited about companies like DeepSeek.
It's neither obvious nor true; generalist models outperform specialized ones all the time (so frequently that it even has its own name - the bitter lesson).
GPT 4.5 also has a knowledge cutoff date of 10-2023.
https://www.reddit.com/r/singularity/comments/1izpb8t/gpt45_...
I'm guessing that this model was finished pre-training at least a year ago (it's been 2 years since GPT 4.0 was released) and they just didn't see the hoped-for performance gains to think it warranted releasing at the time, and so put all their effort into the Q-star/strawberry = eventual O1 reasoning effort instead.
It seems that OpenAI's reasoning model lead isn't perhaps what they thought it was, and the recent slew of strong non-reasoning models (Gemini 2.0 Flash, Grok 3, Sonnet 3.7) made them feel the need to release something themselves for appearances sake, so they dusted off this model, perhaps did a bit of post-training on it for EQ, and here we are.
The price is a bit of a mystery - perhaps just a reflection of an older model without all the latest efficiency tricks to make it cheaper. Maybe it's dense rather than MoE - who knows.
Rumors said that GPT4.5 is an order of magnitude larger. Around 12 trillion parameters total (compared to GPT4's 1.2 trillion). It's almost certainly MoE as well, just a scaled up version. That would explain the cost. OpenAI also said that this is what they originally developed as "Omni" - the model supposed to succeed GPT4 but which fell behind expectations. So they renamed it 4.5 and shoehorned it in to remain in the news among all those competitor releases.
This is all excellent detail. Wondering if there are any good suggestions for further reading on the inside baseball of what happened with GPT 4.5?
Well, it's not...it gets most details wrong.
Can you elaborate?
GPT-4 was rumored to be 1.8T params...not 1.2
And the successor model was called "Orion", not "Omni".
Appreciate the corrections, but I'm still a bit puzzled. Are they wrong about 4.5 having 12 trillion parameters, about it originally being intended as Orion (not Omni), or about it being the expected successor to GPT-4? And do you have any related reading that speaks to any of this?
GPT-4 was 1.3T. 221B active. 2 experts active. 16 experts total.
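A back-of-envelope check on how those rumored numbers could hang together; the shared-parameter figure is simply whatever makes the arithmetic close, not a known value:

    # All inputs are rumors quoted above; nothing here is confirmed.
    total_params   = 1.3e12   # rumored total
    n_experts      = 16
    active_experts = 2
    active_params  = 221e9    # rumored active per token

    per_expert = total_params / n_experts     # ~81B, if experts dominate the total
    routed     = active_experts * per_expert  # ~163B from the two active experts
    shared     = active_params - routed       # ~58B implied shared (attention, embeddings)
    print(f"~{per_expert/1e9:.0f}B/expert, ~{shared/1e9:.0f}B shared")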
the gpt-4o ("omni") is probably a distilled 4.5; hence why not much quality difference
4o has been out since May last year, while omni (now rechristened as 4.5) only finished training in October/November.
4.5 was called Orion, not Omni.
You're thinking of "Orion" not "Omni" (GPT 4o stands for "Omni" since it's natively multimodal with image and audio input/output tokens)
How does this compare with Grok 3's parameter count? I know Grok 3 was trained on a larger cluster (100k-200k) but GPT 4.5 used distributed training.
>The price is a bit of a mystery
I think it at least is somewhat analogous to what happened with pricing on previous models. GPT 4, despite being less capable than 4o, is an order of magnitude more expensive, and comparably expensive to o1. It seems like once the model is out, the price is the price, and the performance gains emerge but they emerge attached to new minified variations of previous models.
I don't think the October 2023 training cut-off means the model finished pre-training a year ago. All of OpenAI's models share that same cut-off date.
One theory is that they're worried about the increasing tide of LLM-generated slop that's been posted online since that date. I don't know if I buy that or not - other model providers (such as Anthropic, Gemini) don't seem worried about that.
Releasing it was probably a mistake. In context what the model is could have been understood, but they haven’t really presented that context. Also it would be lost on a general audience.
The general public will naturally expect it to be the next big thing. Wasn’t that the point of releasing it? To seem like progress is being made? To try to make that point with a model that doesn’t deliver is a misstep.
If I were Sam Altman, I’d be pulling this back before it goes on general release, saying something like it was experimental and after user feedback the costs weren’t worth it and they’re working on something else as a replacement. Then o3 or whatever they actually are working on instead can be the “replacement” even if it’s much later.
or just say it was too good and thus too dangerous to release...
I sort of believed this, but also, 4.5 coming out last year would absolutely have been a big deal compared to what was out there at the time? I just don't understand why they would not launch it then.
> slew of strong non-reasoning models (Gemini 2.0 Flash, Grok 3, Sonnet 3.7)
Sonnet 3.7 is actually a reasoning model.
It's my understanding that reasoning in Sonnet 3.7 is optional and configurable.
I might be wrong but I couldn't find a source that indicates that the "base" model also implements reasoning.
From limited experimentation: Sonnet 3.7 has “extended thinking” as an option, although the UI, at least in the app, leaves something to be desired. It also has a beta feature called “Analysis” that seems to work by having the model output JavaScript code as part of its response that is then run and feeds back into the answer. Both of these abilities are visible — users can see the chain of thought and the analysis code.
It seems, based again on limited experimentation doing sort-of-real work, that analysis works quite well and extended thinking is so-so. Whereas DeepSeek R1 seems to be willing and perhaps even encouraged to second-guess itself (maybe this is a superpower of the “wait” token), Sonnet 3.7 doesn’t seem to second-guess itself as much as it should. It will happily extended-think, generate a wrong answer, and then give a better answer after being asked a question that it really should have thought of itself.
(I’m not complaining. I’ve been a happy user of 3.7 for a whole day! But I think there’s plenty of room for improvement.)
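For reference, the extended thinking option described above is also exposed through the API; a minimal sketch with the Anthropic Python SDK, where the model id and token budget are assumptions worth double-checking:

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",  # assumed id; check the current model list
        max_tokens=4096,
        thinking={"type": "enabled", "budget_tokens": 2048},  # opt-in extended thinking
        messages=[{"role": "user", "content": "How many weekdays are there in March 2025?"}],
    )

    # The response interleaves visible "thinking" blocks with the final "text" blocks.
    for block in response.content:
        if block.type == "thinking":
            print("[thinking]", block.thinking[:200], "...")
        elif block.type == "text":
            print(block.text)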
so is grok3
GPT-4.5 feels like OpenAI's way of discovering just how much we'll pay for diminishing returns.
The leap from GPT-4o to 4.5 isn't a leap—it's an expensive tiptoe toward incremental improvements, priced like a luxury item without the luxury payoff.
With pricing at 15x GPT-4o, they're practically daring us not to use it. Given this, I wouldn't be surprised if GPT-4.5 quietly disappears from the API once OpenAI finishes squeezing insights (and cash) out of this experiment.
Even this is a bit overly complicated/optimistic to me. Why not something as simple as: OpenAI has been building larger and larger models to great success for a long time. As a result, they were excited this one was going to be so much larger = so much better that the price to run it would be well worth the huge jump they were planning to get from it. What really happened is that this method of scaling hit a wall and they were left with an expensive dud they won't get much out of, but they have to release something for now, otherwise they start falling well behind on the leaderboards over the next few months. Meanwhile they scramble to refocus on other means of scaling, like the gains that "chain of thought + runtime compute" provided.
Thank you so much for this comment. I don't really understand the need for people to go straight to semi-conspiratorial hypotheses, when the simpler explanation makes so much more sense. All the evidence is that this model is much larger than previous ones, so they must charge a lot more for inference because it costs so much more to run. OpenAI were the OGs when it came to scaling, so it's not surprising they went this route and eventually hit a wall.
I don't at all blame OpenAI for going down this path (indeed, I laud them for making expensive bets), but I do blame all the quote-unquote "thought leaders" who were writing breathless posts about how AGI was just around the corner because things would just scale linearly forever. It was classic "based on historical data, this 10 year old will be 20 feet tall by the time he's 30" thinking, and lots of people called them out on this, and they either just ignored it or responded with "oh, simple not-in-the-know peons" dismissiveness.
It is weird because this is a board for working programmers for the most part. So like, who’s seen a grand conspiracy actually be accomplished? Probably not many. A lackluster product that gets released even though it sucks because too many people are highly motivated not to notice that it sucks? Everybody has experienced that, right?
Exactly. Although I wouldn't even say they have blinders; it seems like OpenAI understands quite well what 4.5 can do and what it can't, hence the modesty in their messaging.
To your point, though, I would add: not only who has seen any grand conspiracy actually be accomplished, but who has seen one even attempted and kept under wraps? Such that the absence of corroborating sources was more consistent with an effectively executed conspiracy than with the simple absence of such a plan.
It works until it doesn't and hindsight is 20/20.
> It works until it doesn't
Of course, that's my point. Again, I think it's great that OpenAI swung for the fences. My beef is again with these "thought leaders" who would write this blather about AGI being just around the corner in the most uncritical manner possible (e.g. https://news.ycombinator.com/item?id=40576324). These folks tended to be in one of two buckets:
1. "AGI cultists" as I called them, the "we're entering a new phase of human evolution"-type people.
2. People who had a motive to try and sell something.
And it's not about one side or the other being "right" or "wrong" after the fact, it's that so much of this just sounded like magical thinking and unwarranted extrapolations from the get go. The actual experts in the area, if they were free to be honest, were much, much more cautious in their pronouncements.
Definitely, the grifters and hypesters are always spoiling things, but even with a sober look it felt like AGI _could_ be around the corner. All these novel and somewhat unexpected emerging capabilities as we pushed more data through training, you'd think maybe that's enough? It wasn't and test time compute alone isn't either, but that's also hindsight to a degree.
Either way, AGI or not, LLMs are pretty magical.
If you've been around long enough to witness a previous hype bubble (and we've literally just come out of the crypto bubble), you should really know better by now. Pets.com, literally an online shop selling pet food, IPO'd at around a $300M valuation in early 2000, just before the whole dot-com bubble burst.
And yeah, LLMs are awesome. But you can't predict scientific discovery, and all future AI capabilities are literally still a research project.
I've had this on my HN user page since 2017, and it's just as true as ever: In the real world, exponentials are actually early stage sigmoids, or even gaussians.
Well that's only because YOU don't understand exponential growth! No human can! /s
In fundamental science terms, it also proves once and for all that more model doesn't mean more better. Any forces within OpenAI pushing to move past just growing the model for gains now have a strong argument for going all-in on new processes.
Time to enter the tick cycle.
I asked ChatGPT to give me a map highlighting all Spanish-speaking countries; it gave me Stable Diffusion trash.
Just gotta do the grunt work, add a tool with a map api. Integrate with google maps for transit stuff.
It's already a good LLM; it doesn't need to be Einstein and solve aerospace equations. We just need to wait until they realize their limits and find the humility to build yet another useful product that won't conquer the world.
I've thought of LLMs as Google 2.0 for some time now. Truly a world-changing technology, similar to how Google changed the world, and likely to have an even larger impact as we create highly specialized implementations of the technology in the coming decade. But it's not energy-positive nuclear fusion or a polynomial-time NP solver; it's just Google 2.0.
Google 2.0 where you have to check every answer it gives you because it's authoritative about nothing.
Works great when the output is small enough to unit test or immediately try in situations with no possible negative outcomes.
Anything larger? Skip the LLM slop and go to the source. You have to go to the source, anyway.
All while using far more energy than a normal google search
I keep wondering what the long-game (if any) of LLMs is... to make the world dependent on various models then jack the rates up to cover the costs? The gravy-train of SV funding has to end eventually... right?
> You have to go to the source, anyway.
Yeah, and then check that. I don't get this argument at all.
People who uncritically swallow the first answer or two they get from Google have a name... but that would just derail the thread into politics.
There is a truth in the grandparent's comment that doesn't necessarily conflict with this view. The Google 2.0 effect is not necessarily that it gives you a better correct answer faster than google. I think it never dawned on people how bad they were at searching about topics they didn't know much about or how bad google was at pointing them in the right direction prior to chatgpt. Or putting it another way, they never realized how much utility they would get out of something that pointed them in the correct direction even though they couldn't trust the details.
It turns out that going from not knowing what you don't know to knowing what you don't know adds an order of magnitude improvement to people's experience.
And the LLM by design does not save or provide sources, unlike Google or Wikipedia, which are transparent about sources.
It most certainly does, if you are using the latest models, which people making comments like this never are as a rule.
There is something to be said for trusting people’s (or systems of people’s) authority.
For example, have you ever personally verified that humans went to the moon? Have you ever done the experiments to prove the Earth is round?
This is not a helpful phrasing I think. Sources allow the reader to go as far down the rabbit hole as they are willing to or knowledgable enough to go.
For example, if I'm looking for some medical finding and I get to a source that's a clinical study from a reputable publication, I may be satisfied and stop there since this is not my area of expertise. However, a person with knowledge of the field may be able to parse the study and pick it apart better than I could. Hence, their search would not end there since they would be unsatisfied with just the source I was satisfied with.
On the other hand, having no verifiable sources should leave everyone unsatisfied.
Of course, that verifiability is a big part of that trust. I’m not sure why you think my phrasing is not helpful; we seem to agree.
> Have you ever done the experiments to prove the Earth is round?
I have, actually! Thanks, astronomy class!
I've even estimated the earth's diameter, and I was only like 30% off (iirc). Pretty good for the simplistic method and rough measurements we used.
Sometimes authorities are actually authoritative, though, particularly for technical, factual material. If I'm reading a published release date for a video game, directly from the publisher -- what is there to contest? Meanwhile, ask an LLM and you may have... mixed results, even if the date is within its knowledge cutoff.
Have you provided documentation that you are human? Perhaps you are a lizard person sowing misinformation to firm up dominance of humankind.
LLMs could make some nice little tools.
However they’ll need to replace vast swathes of the economy to justify these AI companies’ market caps.
Giving ChatGPT stupid AI image generation was a huge nerf. I get frustrated with this all the time.
Oh, I think it's great they did that. It's super helpful for visualizing ChatGPT's limitations. Ask it for an absolutely full, overflowing glass of wine or a wrist watch whose time is 6:30 and it's obvious what it actually does. It's educational.
I asked Claude to give me a Python script to create a map highlighting all Spanish-speaking countries. It took 3 tries and then gave me a perfect SVG and PNG.
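Something along these lines does it (a minimal sketch, assuming geopandas/matplotlib and the Natural Earth 110m countries file; the URL, the NAME column, and the country list are my assumptions, not what Claude actually produced):

```python
import geopandas as gpd
import matplotlib.pyplot as plt

# Natural Earth 110m admin-0 countries; URL and column name are assumptions
# about that public dataset.
NE_URL = "https://naciscdn.org/naturalearth/110m/cultural/ne_110m_admin_0_countries.zip"

SPANISH_SPEAKING = {
    "Spain", "Mexico", "Colombia", "Argentina", "Peru", "Venezuela", "Chile",
    "Ecuador", "Guatemala", "Cuba", "Bolivia", "Honduras", "Paraguay",
    "El Salvador", "Nicaragua", "Costa Rica", "Panama", "Uruguay",
    # Natural Earth abbreviates some names (e.g. "Dominican Rep.",
    # "Eq. Guinea"), so check the NAME column and adjust as needed.
    "Dominican Rep.", "Eq. Guinea", "Puerto Rico",
}

world = gpd.read_file(NE_URL)                      # load country polygons
ax = world.plot(color="lightgrey", figsize=(12, 6))
world[world["NAME"].isin(SPANISH_SPEAKING)].plot(ax=ax, color="tab:orange")
ax.set_axis_off()
plt.savefig("spanish_speaking.png", dpi=150)       # raster output
plt.savefig("spanish_speaking.svg")                # vector output
```

Twenty lines of grunt work versus an image model hallucinating a map, which is the point being made upthread about tools.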
> Just gotta do the grunt work, add a tool with a map api. Integrate with google maps for transit stuff.
This is kind of the crux, though. The only way to make LLMs more useful is to basically make them traditional AI. So it's not really a leap forward, never mind a path to AGI.
They should have called it "ChatGPT Enterprise".
Exactly! designed specifically for people who love burning corporate budgets.
OpenAI is going to add it to Plus subscriptions, i.e. available to many at no additional cost, likely with restrictions like N prompts/hour.
As for the API price: when it matters, businesses and people are willing to pay much more for just slightly better results. OpenAI doesn't take the other options away, so we don't lose anything.
IMO the 4o output is a lot more Enterprise-compatible; 4.5 being straight to the point and more natural is quite the opposite. Pricing-wise your point stands.
Disclaimer: have not tried 4.5 yet, just skimmed through the announcement, using 4o regularly.
Apparently, OpenAI API "credits" expire after a year. I stupidly put in another $20 and am now trying to blow through it; 4.5 is the easiest way, considering recent 4o has fallen out of favor compared to other models and I don't want to just let the credits expire again. An expiry after only one year is asinine.
Yes. I also discovered this, and was also forced to blow through my credits in a rush. Terrible policy.
I'm learning this for the first time now. I don't appreciate having to anticipate how many credits I'll use, like it's an FSA account.
>Terrible policy.
And unfortunately one not exclusive to OpenAI. Anthropic credits also expire after 1 year.
Not sure how it's with OpenAI, but Anthropic is so money-hungry, they won't even let you remove your debit card data from your account without a week-long support encounter.
This is how pricing on human labour works. Nobody expects an employee that costs twice as much to produce twice the output for any given task. All that is expected is that they can do a narrow set of things, that another person can't.
4.5 can extremely quickly distill and work with what I, at least, consider complex, nuanced thought. 4.5 is night and day better than every other AI for my work; it's quite clever and I like it.
Very quick mvp comparison for the show me what you mean crew: https://chatgpt.com/share/67c48fcc-db24-800f-865b-c0485efd7f... & https://chatgpt.com/share/67c48fe2-0830-800f-a370-7a18586e8b... (~30 seconds vs ~3 minutes)
The 4.5 has better 'vibes' but isn't 'better', as a concrete example:
> Mission is the operationalized version of vision; it translates aspiration into clear, achievable action.
The "Mission is the operationalized version of vision" is not in the corpus that I am find and is obviously a confabulated mixture of classic Taylorist like "strategic planning"
SOPs and metrics, which will be tied to compensation and the unfortunate ubiquitous nature of Taylorism would not result in shared purpose, but a bunch of Gantt charts past the planning horizon.
IMHO I would expect "complex nuanced thought" to show an understanding of the historical issues and at least respect the divide between classical and neo-classical org theory, or at least avoid polluting more modern theories with classical baggage that is a significant barrier to delivering value.
Mission statements need to share strategic intent in an actionable way, strategy is not operationalization.
I have been experimenting with 4.5 for a journaling app I am developing for my own personal needs, for example, turning bullet/unstructured thoughts into a consistent diary format/voice.
The quality of writing can be much better than Claude 3.5/3.7 at times, but it struggles with similar confabulation of information that is not in the original text but "sounds good/flows well", which isn't ideal for a personal journal. I am still playing around with the system prompt, but given the astronomical cost (even with me as the only user) and the marginal benefits, I will probably end up sticking with Claude for now.
Unless others have a recommendation for a less robot-y sounding model (that will, however, follow instructions precisely) with API access other than the mainstream Claude/OpenAI/Gemini models?
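For concreteness, here's the shape of the call I'm experimenting with (a minimal sketch using the OpenAI Python SDK; the model name, prompt wording, and temperature are placeholders, and the "don't invent facts" line is the part I keep iterating on):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "Rewrite the user's bullet notes as a first-person diary entry in a warm, "
    "plain voice. Use only facts present in the notes; do not invent events, "
    "names, or feelings that are not stated."
)

def bullets_to_diary(notes: str, model: str = "gpt-4.5-preview") -> str:
    """Turn unstructured bullet notes into a consistent diary entry."""
    response = client.chat.completions.create(
        model=model,
        temperature=0.4,  # lower temperature seems to reduce "sounds good" additions
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": notes},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(bullets_to_diary("- slept badly\n- long 1:1 with J.\n- finally fixed the export bug"))
```

Even with the explicit instruction, 4.5 still occasionally smooths in details that weren't in the notes, which is what makes me hesitant at this price.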
I've found this on par with 4.5 in tone, but not as nuanced in connecting super wide ideas in systems, 4.5 still does that best: https://ai.google.dev/gemini-api/docs/thinking
(also: the person you are responding to is doing exactly what you're saying you don't want done, take something unrelated to the original text (Taylorism) but could sound good, and jam it in)
The statement "Mission is the operationalized version of vision; it translates aspiration into clear, achievable action" isn't a Taylorist reduction of mission to mechanical processes - it's actually a nuanced understanding of how these organizational elements relate. You're misinterpreting what "operationalized" means in this context. From what i can tell, the 4.5 response isn't suggesting Taylorist implementation with Gantt charts etc it's describing how missions translate vision into actionable direction while remaining strategic. Instead of jargon, it's recognizing that founders need something between abstract vision and tactical execution. Missions serve this critical bridging function. CEO has vision, orgs capture the vision into their missions, people find their purpose when aligned via the 2. Without it, founders either get stuck in aspirational thinking or jump straight to implementation details without strategic guidance. The distinction matters exactly because it helps avoid the dysfunction that prevents startups from scaling effectively. I think you're assuming "operationalized" means tactical implementation (Gantt charts, SOPs) when in this context it means "made operational/actionable at a strategic level". Missions != mission statements. Also, you're creating a false dichotomy between "strategic intent" and "operationalization" when they very much, exist on a spectrum. (If anything, connecting employees to mission and purpose is the opposite of Tayloristic thinking, which viewed workers more as interchangeable parts than as stakeholders in a shared mission towards responding to a shared vision of global change) - You are doing what o1 pro did, and as I said: As a tool for teaching business to founders, personally, I find the 4.5 response to be better.
An example of a typical naive definition of a mission statement is:
Concise, clear, and memorable statement that outlines a company's core purpose, values, and target audience.
> "made operational/actionable at a strategic level".
Taking the common definition from the first part of this plan, what do you think the average manager would do, given that in the social sciences operationalization is explicitly about measuring abstract qualities? [1]
"operationalization" is a compromise, trying to quantify qualitative properties, it is not typically subject to methods like MECE principal, because there are too many unknown unknowns.
You are correct that "operationalization" and "strategic intent" are not mutually exclusive in all aspects, but they are for mission statements that need to be durable across changes that no CEO can envision.
The "made operational/actionable at a strategic level" is the exact claim of pseudo scientific management theory (Greater Taylorism) that Japan directly targeted to destroy the US manufacturing sector. You can look at the former CEO of Komatsu if you want direct evidence.
GM's failure to learn from Toyota at NUMII (sp?) is another.
The planning process needs to be informed by strategy, but planning is not strategic; it has a limited horizon.
But you are correct that it is more nuanced and neither Taylor nor Tolstoy allowed for that.
Neo-classical org theory is when bounded rationality was first acknowledged, although the Prussian military figured that out long before Taylor grabbed his stopwatch to time people loading pig iron into train cars.
I encourage you to read:
Strategy: A History, by Sir Lawrence Freedman
For a more in depth discussion.
[1] https://socialsci.libretexts.org/Bookshelves/Sociology/Intro...
Your responses are interesting because they drive me to feel reinforced in my opinion. This conversation is precisely why I rate 4.5 over o1 pro: I prompted in a very, very, very specific way. I'm afraid to say your comments are highly disengaged from the realities of business and business building. Appreciate the historical context and recommended reading (although I assure you, I am extremely well versed).

The term 'operationalized' here refers to strategic alignment, not Taylorist quantification - think guiding principles over rigid metrics. You are badly conflating operationalization in the social sciences (which is about measurement) with strategic operationalization in management, which is not the same. Again: operationalized in this context means making the mission actionable at a strategic level, not quantification. Modern mission frameworks prioritize adaptability within durable purpose, avoiding the pitfalls you've rightly flagged.

Successful founders don't get caught up in these theoretical distinctions. Founders taught by me, and I guess by GPT-4.5, correctly understand mission as the bridge between aspirational vision and practical action. This isn't "Greater Taylorism" but pragmatic leadership. While your historical references (NUMMI, not NUMII) demonstrate academic knowledge, they miss how effective missions actually guide organizations while remaining adaptable. The 4.5 response captured this practical reality well - it pointed to, but did not create, artificial boundaries between interconnected concepts.

If we had some founders trained by you (o1 Pro) and me (GPT-4.5), I would be willing to bet my founders would outperform yours any day of the week.
Tuckman as a 'real' framework is a belief so that is fair.
He clearly communicated in 1977 that his ideas were never formally validated and that he cautioned about their use in other contexts.
I think the concepts can be useful, provided you don't take them as anything more than a guiding framework that may or may not be appropriate for a particular need.
https://core.ac.uk/download/pdf/36725856.pdf
I personally find value in team and org mission statements, especially for building a shared purpose, but to be honest, any of the studies on that are more about manager satisfaction than anything else.
There is far more data on the failure of strategy execution, and linking strategy with purpose as well as providing runways and goals is one place I find vision and mission statements useful.
As up to 90% of companies fail on strategy execution, and because employee engagement is in free fall, the fact that companies are still in business means little.
Context is king, and this is horses for courses, but I would caution against ignoring more recent, Nobel winning theories like Holmström's theorem.
Most teams don't experience the literal steps Tuckman suggested, rarely all at once, and never as one time singular events. As the above link demonstrated, some portions like the storming can be problematic.
Make them operationalize their mission statement and they will, and it will be set in concrete.
Remember von Moltke: "No plan of operations extends with certainty beyond the first encounter with the enemy's main strength."
There is a balance between C2 and mission-command styles; the risk is forcing, or worse, intentionally causing people to resort to C2 when you almost always need a shifting balance between command- and intent-based solutions.
The Feudal Mode of Production was sufficient for centuries, but far from optimal.
The NUMMI reference is exactly about the same reason Amazon's profits historically rose higher than its head-count increases should have allowed:
Small cross functional teams, with clearly communicated tasks, and enough freedom to accomplish those tasks efficiently.
You can look at Trist's study about how incentivizing teams can lead them to game the system. The same problem happened under Ballmer at MS, and DEC failed in the opposite way, trying to do everything at once and please everyone.
https://www.uv.es/=gonzalev/PSI%20ORG%2006-07/ARTICULOS%20RR...
The reality is that the popularity of frameworks rarely relates to their effectiveness, building teams is hard, making teams work as teams across teams is even harder.
Tuckman may be useful in that... but this claim is wrong:
> "Modern mission frameworks prioritize adaptability within durable purpose, avoiding the pitfalls you’ve rightly flagged"
Modern _ frameworks prioritize adoption, and depending on the framework to solve your company's needs will always fail. You need to choose a framework that fits your strategy and objectives, and adapt it to fit your needs.
Learn from others, but don't ignore the reality on the ground.
Regarding Tuckman's model, there are actually numerous studies validating its relevance and practical application: Gren et al. (2017) validated it specifically for agile teams across eight large companies. Natvig & Stark (2016) confirmed its accuracy in team development contexts. Bonebright's (2010) historical review demonstrated its ongoing relevance across four decades of application.
I feel we're talking past each other here. My original point was about which AI model is better for MY WORK (I run a startup accelerator for first-time founders). 4.5, in 30 seconds versus minutes, provided more practical value to founders building actual businesses, and saved me time.

While I appreciate your historical references and academic perspectives, they don't address my central argument about GPT-4.5's response being more pragmatically useful. The distinction between academic precision and practical utility is exactly what I'm highlighting. Founders don't need perfect theoretical models - they need frameworks that help them bridge vision and execution in the real world. When you bring up feudal production modes and von Moltke, we're moving further from the practical question of which AI response would better guide someone trying to align teams around a meaningful mission that drives business results. It's exactly why I formed the two prompts the way I did: I wanted to see if it was an academic or an expert.
My assessment stands: GPT-4.5's 30 seconds of thinking about how mission operationalizes vision reflects how successful businesses actually work, not how academics might describe them in theoretical papers. I've read the papers, I've studied the theory deeply, but I also have NYSE and NASDAQ ticker symbols under my belt, from seed. That is the whole point here.
OK maybe we are using different meanings of the word "operationalize"
If I were, say, in middle management and you asked me to "operationalize" the impact of mission statements, I would try to associate the existence of a mission statement on a team with some metric like financial performance.
If I were on a small development team and you asked me to "operationalize" our mission statement, I would probably make the same mistake the software industry always does, like tying it to tickets closed, lines of code, or even the DORA metrics.
Under my understanding of "operationalize", and the only way I can find it referenced in relation to mission statements themselves, I would actually end up de-emphasizing deliverables, quality, stakeholders' changing demands, etc.
Even if I try to "operationalize" in a more abstract way, like defining an impact score, it may not directly map to business objectives or even team building.
Almost every LLM offers a definition similar to the one I offered above, e.g.:
> "operationalization" refers to the process of defining an abstract concept in a way that allows it to be measured and observed through specific, concrete indicators
Impact scores, which are subjective, can lead to Google's shelfware problems, and even scrum rituals often lead to hard but high-value tasks being ignored because the incentives don't allow for them.
In both of your citations, they were situations where existing cultures were enhanced, not fully replaced.
Both were also short term, and wouldn't capture the long tail problems I am referencing.
Heck, even Taylorism worked well for the auto industry until outside competition killed it. Well, at least for the companies; consumers suffered.
The point is that "operationalization" specifically is counterproductive under a model, where infighting during that phase is bad.
If you care about delivering on execution, it would seem to be important to you. But I realize that you may not be targeting repeat work... I just don't know.
But I am sure some McKinsey BA has probably put that concern in a PDF someplace by now, because the GoA Agile assessment guide is being incorporated and even ITIL and TOGAF reference that coal-face paper I cited.
The BCGs and McKinseys of the world are absolutely going to shift to detection of common confabulations to show value.
While I do take any tools possible to make me more productive, correctness of content concerns me more than exact verbiage.
But yes, different needs; I am in the niche of rescuing failed initiatives, which admittedly is far from the typical engagement style.
To be honest the lack of scratch space on 4.5 compared to CoT models is the main blocker for me.
I believe 4.5 is a very large and rich model. The price is high because it's costly to run inference on; however, the bigger reason is to ensure that others don't distill from it. Big models have a rich latent space, but it takes time to squeeze the juice out.
That also means people won't use it. Way to shoot yourself in the foot.
The irony of a company that has distilled the world's information complaining about another company distilling their model...
The small number of use cases that do pay are providing gross margins as well as feedback that helps OpenAI in various ways. I don’t think it’s a stupid move at all.
My assumption: There will be use cases where cost of using this will be smaller than the gain from it. Data from this will make the next version better and cheaper.
My take from using it a bit is that they seem to have genuinely innovated on:
- Not writing things that go off in weird directions / staying grounded in "reality"
- Responding very well to tone preferences and catching nuance in what I say
It seems like it's less that it has a great "personality" like Claude, but that it's capable of adapting towards being the "personality" I want and "understanding" what I'm saying in ways that other models haven't been able to do for me.
So this kind of mirrors my feelings after using GPT-4.5 on general conversation and song writing.
GPT picked up on unspecified requirements almost instantly. It is subtle (and may be undesirable in some contexts). For example in my songs, I have to bracket the section headings, it picked up on that from my original input. All the other frontier models generally have to be reminded. Additionally, I separately asked for an edit to a music style description. When I asked GPT-4.5 to write a song all by itself, it included a music style description. No other model I have worked with has done this.
These are subtle differences, but in aggregate the model just generally needs less nudging to create what is required.
I haven't used 4.5 but have some experience using Claude for creative writing, and in my experience it sometimes has the uncanny ability to get to the core of my ideas, rephrasing my paragraph long descriptions into just a sentence or two, or both improving and concretizing my vague ideas into something that's insightful and tasteful.
Other times it locks itself into a dull style and ignores what I ask of it and just produces boring generic garbage, and I have to wrangle it hard to get some of the spark back.
I have no idea what's going on inside, but just like with Stable Diffusion, it's fairly easy to make something that has the spark of genius, and is very close to being perfect, but getting the last 10% there, and maintaining the quality seems almost impossible.
It's a very weird feeling; it's hard to put into words what exactly is going on, and probably even harder to make it into a benchmark, but it makes me constantly flip-flop between being scared of how good the AI is and questioning why I ever bothered using it in the first place, as I would've progressed much faster without it.
Long term it might be hard to monetise that infrastructure, considering their competition:
1) For coding (API), most will probably stick to Claude 3.5 / 3.7 - a big market, but still small compared to all the world's problems
2) For non-coding API use, IMHO Gemini 2.0 Flash is the winner - dirt cheap (cheaper than 4o-mini), good enough and even better than GPT-4o, with cheap audio and image input
3) For the subscription app, ChatGPT is probably still the best, but only slightly - they have the best advanced voice conversation, but Grok will probably be eating their lunch here
The Sesame model for voice conversation is IMO better than ChatGPT's voice conversation. They are going to open source it as well.
Sure but is there an app I can talk to / work with? It seems they're a voice synthesis model company, not a chatbot app / tool company.
> They are going to open source it as well.
Means nothing until they do
We were using GPT-4o for our chat agent, and after some experiments I think we'll move to Flash 2.0: faster, cheaper, and even a bit more reliable. I also experimented with the experimental thinking version, and there a single-node architecture seemed to work well enough (instead of multiple specialised sub-agent nodes). It did better than DeepSeek, actually. Now I'm waiting for the official release before spending more time on it.
For the rest of us on free tiers, ChatGPT is hands down the winner, allowing limited image generation, unlimited usage of some models, and limited usage of 4o.
Claude is still stuck at 10 messages per day, and Gemini is less accurate/useful.
10 messages a day? How are people "vibe coding" with that?
They're paying for Pro
Ah thank you; I had heard the paid ones had daily limits too so I was confused
They do, I subscribe to pro. All of my vibe coding however is done via the API.
It's marketed as being slightly better at "creative writing". That isn't the problem most businesses have with current-generation LLMs. On the other side, Anthropic released, nearly at the same time, a new model that solves more practical problems for businesses, to the point that for coding many insiders no longer use OpenAI models for such tasks.
I think it should be illegal to trick humans into reading "creative" machine output.
It strikes me as a form of fraud that steals my most precious resources: time and attention. I read creative writing to feel a human connection to the author. If the author is a machine and this is not disclosed, that's a huge lie.
It should be required that publishers label AI generated content.
> I think it should be illegal to trick humans into reading "creative" machine output.
Creativity has lost its meaning. Should it be illegal? The courts will take a long time to settle the matter. Reselling people's work against their will as creative machine output seems unethical, to say the least.
> It should be required that publishers label AI-generated content.
Strongly agree.
I'm pretty sure you read for pleasure, and feeling a human connection is one way that you derive pleasure. If it's the only way that you derive pleasure from reading, my condolences.
Pretty much where my thoughts on this are. I rarely feel any particular sense of connection to the actual author when I read their books. And I have taken great pleasure from some AI stories (to the degree I put them up on my personal website as a way to keep them around).
Under the dingnuts regime, Dwarf Fortress will be illegal. Actually, any game with a procedural story? You better believe that's a crime: we can't have a machine generate text a human might enjoy.
Dingnuts' point was that it should be disclosed. Everyone knows Dwarf Fortress stories are procedurally/AI generated; the authors aren't trying to hide that fact.
Actually, fair enough. I still disagree with their argument, but this was the wrong tack for me to use.
Seems like we're hitting the limits of the technology...
Consumer limits. When something is good enough that it can make stable money, there is no real incentive to innovate beyond the bare minimum - just enough to keep consumers engaged, shareholders satisfied, and regulators at bay.
This is how modern capitalism and all corporations work: we will keep receiving new version numbers without any sensible change; consumers will keep renewing their subscriptions out of habit; xyzAI PR managers, HR managers, corporate lawyers and myriads of other bureaucrats will keep receiving their paychecks while secretly dreaming of retirement; and xyzAI top management will burn money on countless acquisitions just to fight boredom, turning into xyz(ai)MEGAcorp doing everything from crude oil processing and styrofoam cups to web services and AI models.
No modern megacorporation is capable of making something other than what has already worked for them once. We could have achieved universal welfare and prosperity 60 years ago; that would have disrupted the cycle. Instead, we got planned obsolescence, endless subscription models, and a world where everything "new" is just a slightly repackaged version of last year's product.
Yes, I believe the sprint is over; now it's going to be slow cycles, maybe 18 months to see a 5% increase in ability, and even that 5% will be highly subjective. Claude's new release is about the same: 3.7 is arguably worse at some things than 3.5 and better at others. Based on the previous pace of releases, in about 6 months or so, if the next release from any of the leaders is about the same "kinda better, kinda worse", then we'll know. Imagine how much money is going to evaporate from the stock market if this is the limit!!!
You can keep getting rich off shovels long after the gold has run dry.
To say 3.7 is worse is completely insane.
I also hate waiting on reasoning.
I would much prefer a super lightning-fast model that is cheaper but the same quality as these frontier models.
Let me query these things to death.
try groq (hyperfast chips) https://groq.com/
Does it mean we get a reprieve from "this is just the beginning" comments?
Maybe if it takes many years before the next major architectural advancement.
I wouldn't count on it.
I don't get it. Aren't these two sentences in the same paragraph contradictory?
>"Scaling to this size of model did NOT make a clear jump in capabilities we are measuring."
> "The jump from GPT-4o (where we are now) to GPT-4.5 made the models go from great to really great."
No, it means that it got better on things orthogonal to what we have mostly been measuring. On the last few rounds, we have been mostly focusing on reasoning, not as much on knowledge, "creativity", or emotional resonance.
"It's better. We can't measure it, but we're pretty sure it's better. We also desperately need it to be better because we just spent a boat-load of money on it."
Is somebody actually looking at those last few percentage points on benchmarks?
Aren't we making the mistake of assuming benchmarks are 100% correct?
Meanwhile, all GPT-4o models on Azure are set to be deprecated in May and there are no alternative models yet. Should we start moving to Anthropic? DeepSeek is too slow, melting under its own success. Does anyone on GPT-4o/Azure have any idea when they'll release the next "o" model?
Only an older version of GPT-4o has been deprecated and will be removed in May. The newest version will be supported through at least 20 November 2025.
https://learn.microsoft.com/en-us/azure/ai-services/openai/c...
The Nov 2024 release, which is due to be deprecated in Nov 2025, I was told has degraded performance compared to the Aug 2024 release. In fact, OpenAI Models page says their current GPT4o API is serving the Aug release. https://platform.openai.com/docs/models#gpt-4o
So I'm still on the Aug 2024 release, which, now that you remind me, is not to be deprecated till Aug 2025 - but that's less than 5 months from now, and we're skipping the Nov 2024 release just as OpenAI themselves have chosen to do.
I've found 4.5 to be quite good at "business decisions", much better than other models. It does have some magic to it, similar to Grok 3, but maybe a bit smarter?
It seems like there's a misunderstanding as to why this happened. They've been baking this model for months, long before DeepSeek came out with fundamentally new ways of distilling models. And even given that it's not great in its large form, they're going to distill from this going forward, so it likely makes sense for them to periodically train these very large models as a basis.
I think this framing isn't quite right either. DeepSeek's R1 isn't very different from what OpenAI has already been doing with o1 (and that other groups have been doing as well). As for distilling - the R1 "distilled" models they released aren't even proper (logit) distillations, but just SFTs, not fundamentally new at all. But it's great that they published their full recipes and it's also great to see that it's effective. In fact we've seen now with LIMO, s1/s1.1, that even as few as 1K reasoning traces can get most LLMs to near SOTA math benchmarks. This mirrors the "Alpaca" moment in a lot of ways (and you could even directly mirror say LIMO w/ LIMA).
I think the main takeaway of GPT-4.5 (Orion) is that it basically gives a perspective on all the "hit a wall" talk from the end of last year. Here we have a model that has been trained with, by many accounts, 10-100X the compute of GPT-4 and is likely several times larger in parameter count, but is only... subtly better, certainly not super-intelligent. I've been playing around with it a lot the past few days, both with several million tokens' worth of non-standard benchmarks and just talking to it, and it is better than previous GPTs (in particular, it makes a big jump in humor), but I think it's clear that the "easy" gains in the near future are going to come from figuring out how as many domains as possible can be approximately verified/RL'd.
As for the release? I suppose they could just have kept it internally for distillation/knowledge transfer, so I'm actually happy that they released it, even if it ends up not being a really "useful" model.
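On the SFT-vs-logit-distillation point above, a rough sketch of the difference (standard formulations; nothing here is OpenAI's or DeepSeek's actual training code):

```python
import torch
import torch.nn.functional as F

def sft_loss(student_logits: torch.Tensor, target_token_ids: torch.Tensor) -> torch.Tensor:
    # "Distillation" in the released-R1 sense: ordinary supervised fine-tuning
    # on text sampled from the teacher; only the hard token IDs are needed.
    return F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        target_token_ids.view(-1),
    )

def logit_distillation_loss(
    student_logits: torch.Tensor, teacher_logits: torch.Tensor, T: float = 2.0
) -> torch.Tensor:
    # Proper (Hinton-style) distillation: match the teacher's full output
    # distribution, which requires access to the teacher's logits.
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (T * T)
```

The first only needs the teacher's sampled text; the second needs the teacher's output distribution, which closed APIs mostly don't expose - hence calling the released R1 checkpoints "distillations" is a bit loose.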
I think this release is for the researchers who worked on it and would quit if it never saw daylight
Too much money not enough new ideas.
I've been using 4.5 instead of 4o for quick questions. I don't mind the slowness for short answers. I feel like it is less likely to hallucinate than other models.
I have access to it. It is better, but not where most techies would care. It knows more, it writes better, it's more pleasant to talk to. I think they might have studied the traffic their hundreds of millions of users generate and realized where they need to improve, then did exactly that for their _non thinking_ model. They understand that a non-thinking model is not going to blow the doors off on coding no matter what they do, but it can do writing and "associative memory" tasks quite well, and having a lot more weights helps there. I also predict that they will fine tune their future distilled, thinking models for coding, based on the same logic, distilling from 4.5 this time. Those models have to be fast, and therefore they have to be smaller.
Sam Altman views Steve Jobs as one of his inspirations (he called the iPhone the greatest product of all time). So if you look at OpenAI through the lens of Apple, where you think about making the product enjoyable to use at all costs, then it makes perfect sense why you'd spend so much money to go from 4o to 4.5, which brings such subtle differences for power users.
The vast majority of users, which are over 300 million weekly, will mainly use 4o and whatever is the default. In the future they’ll use 4.5 and think it’s most human like and less robotic.
Yes but Steve Jobs also understood the paradox of choice, and the importance of having incredibly clear delineation between every different product in your line.
Do models matter to the regular user over brand? People talk about using chatGPT over Google's AI or Deepseek not 4o-mini vs gemini 2.
OpenAI has done a good job of making the model less important and the domain chatgpt.com more important.
Most of the time the model doesn't matter. When you find something incorrect you may switch models, but that rarely fixes the problem. Rewording a prompt has more value than changing the model.
If the model did not matter they would be spending their money on marketing or sales instead of improving the model.
Spending, or saying they are spending, is marketing; but when people use their product, the model doesn't matter.
OpenAI has been irrelevant for a while now. All of the new and exciting developments on AI are coming from other places. ClosedAI is no longer the driver of change and innovation.
I think OpenAI is currently in a position where they are still the industry standard, but no longer leading. DeepSeek R1 beat o1 on perf/cost, with similar performance at a fraction of the cost. o3-mini is judged as "weird" and quite hit-and-miss on coding (basically the sole reason for its existence), has a sky-high SimpleQA hallucination rate due to its limited scope, and is probably beaten by Sonnet 3.7 by a fairly large margin.
Still, being early with a product that is often "good enough" takes them a long way. I think GPT-5, and where their competition will be by then, will be quite important for OpenAI. The signs on the horizon are that everyone will close in on each other as we hit diminishing returns, so the underlying business model, integrations, enterprise reach, marketing and market share will probably be king in 2026 rather than the underlying LLM.
Since GPT-5 is meant to select the best model behind the scenes, one issue might be that users won’t have the same confidence in the model, feeling like it’s deciding for them or OpenAI tuning it to err on the side of being cheap.
That’s quite a world you’ve constructed!
The other models are literally distilling OpenAI’s models into theirs.
So it's been claimed, but has it been proven yet?
I'm not even sure what is being alleged there—o1's reasoning tokens are kept secret precisely to avoid the kind of distillation that's being alleged. How can you distill a reasoning process given only the final output?
DeepSeek's models outputting that they are ChatGPT is a big clue.
Do they? Why doesn't this happen to Claude then? I've been hearing this for a while, but never saw any evidence beyond the contamination of the dataset with GPT slop that is all over the web. Just by sending anything to the competitors you're giving up a large part of your know how before you even finish your product, that's a big incentive against doing that.
Who said it isn’t happening to Claude?
Companies are 100% using these big players to generate synthetic data. Distillation is extremely powerful. How is this even in question?
OpenAI conceals probabilities so how is anyone distilling from it?
And OpenAI based their tech on a Google paper, again building on years of public academic research, so what's the point exactly here?
OpenAI was just first out of the gate; there'll always be some company that's first. The essence is how they handle their leadership, and they've sadly been absolutely terrible and scummy.
Actually, I think Google was a pretty good example of the exact opposite: decades of "actually not being evil", while OpenAI switched up one second after launch.
Google wasn't the first search engine, but they were the best at marketing google = search. That's where we are with OpenAI. Google Search was a better product at the time, and ChatGPT 3.5 was a breakthrough the public used. Fast forward and some will say Google isn't the best search engine anymore (Kagi, DuckDuckGo, Yandex offer different experiences), but people still think of google = search. Same with ChatGPT: Claude may be better for coding, Gemini better at searching, or DeepSeek cheaper but equal, but ChatGPT is a verb and will live on, like Intel Inside, long after its actual value has declined.
Google was so much better than AltaVista that I just can’t buy that it was marketing that pushed them to the forefront of search.
Having a good product is marketing, and that parallels ChatGPT.
> Google wasn't the first search engine but they were the best marketing google = search
Google's overwhelming victory in search had ~ nothing to do with marketing.
Ever heard the term "to google something"? Viral marketing is still marketing.
That happened long after they completely dominated search. They succeeded because of quality, and because of how low quality all the other engines were.
There was a time when Google was thought of as a respectable, high-quality, smart and nimble company. That has faded as the marketing grew.
> so what’s the point exactly here.
What is your point? OpenAI wasn't the first out of the gate, as your own argument cites Google prior. All these companies are predatory; who is arguing against that? OP said OpenAI was irrelevant. That's just dumb. They are not. Feel free to advance an argument in favor of that narrative if you wish; I was just trying to provide a single example showing that some of these lightweight models are building directly off the backs of giants spending the big money. I find nothing wrong with distillation and am excited about companies like DeepSeek.
No general model is the frontier.
Thousands of small, specific models are infinitely more efficient than a general one.
The narrower the task, the better algorithms work.
That's obvious.
Why are general models pushed so hard by its creators?
Their enormous valuations are based on total control over user experience.
This total control is justified by computational requirements.
Users can't run general models locally.
Giant data centers for billions are the moat for Model creators and corporations behind.
It's neither obvious nor true, generalist models outperform specialized ones all the time (so frequently that it even has its own name - the bitter lesson)
Certain desirable capabilities are available only in bigger models because it takes a certain size for some behaviours to emerge.