My GPT-5.5 Review: A Big Upgrade That Doesn't Always Feel Like One

GPT-5.5 is one of the strangest model upgrades I have tested, because it is clearly better, and yet for a lot of the work I do, I do not always feel the difference in the way I expected to.

That sounds more negative than I mean it. My experience with the model has been really good. I have enjoyed using it a lot. It is smarter, more reliable, more rounded, and better across a bunch of categories that used to be weaker spots for OpenAI's models.

But the honest reaction is a little weird: this is the first time where the upgrade feels relatively large, but most of the time it does not matter that much.

Not because the model is disappointing. Because the last set of models was already so good.

The Frontier Is Getting Weird

For engineering, we are now in a strange place. The best models are already exceptional engineers. They can read a codebase, understand the architecture, make good decisions, write the code, test it, notice what went wrong, fix it, and keep going until the thing works.

That used to be science fiction. Now it is just normal.

So when GPT-5.5 is better, the improvement is real, but it does not always translate into a dramatic change in my daily workflow. If I ask it to build something normal, it crushes it. But GPT-5.3-Codex already crushed it. GPT-5.4 already crushed it. Opus often crushed it. The ceiling is getting so high that a lot of normal work does not stress the models anymore.

This is the best way I can describe it: you do not need Hemingway to write a product description. At some point, the model is already good enough for the task, and additional intelligence is only visible when you push into harder territory.

GPT-5.5 is clearly more capable. I just notice it most when I give it tasks that are annoying, ambiguous, security-sensitive, design-constrained, or likely to break in subtle ways.

The Harnesses Are Starting to Matter More

One thing I keep coming back to: now that the model is this reliable, I think we are increasingly limited by the harnesses around it.

A single model run is already extremely powerful. But the real power of GPT-5.5 probably will not come from one chat window, or one coding agent sitting in one repo, doing one task at a time. I think it will show up as people build better systems around the model: multi-agent setups where several instances can split work, critique each other, specialize, test each other's output, and coordinate toward larger goals, with a higher-level agent orchestrating the whole thing almost like a CEO managing a team.

This is one of the areas where I am most excited. I have seen enough from using GPT-5.5 to manage sub-agents that I think orchestration may end up being one of the biggest unlocks. A strong worker model is useful. A strong model that can break work apart, delegate it, monitor progress, compare outputs, and keep a larger goal moving is a different thing.

That is where things get really interesting. The model is strong enough now that the limiting factor is often not "can it do the work?" It is "have we built the right environment for it to keep doing more of the work, across more moving pieces, without losing the thread?"

I expect a lot of the surprising GPT-5.5 use cases to come from better harnesses, not just better prompts.

Engineering: It Was Already Great, Now It Is More Stable

On pure software engineering, GPT-5.5 is fantastic. It is thoughtful. It makes good decisions. A lot of the time, it makes the same decisions I would make. If you have read my other Codex reviews, this is the same basic direction, just pushed forward again.

The thing I care about most is not whether the model can write a clever function. Every frontier model can do that now. I care about whether it can keep the whole job in its head.

Can it understand what I actually want? Can it avoid breaking unrelated things? Can it make reasonable judgment calls when I did not spell out every detail? Can it test its work? Can it notice when the first attempt is wrong and keep going?

GPT-5.5 is very, very good at this.

The code is more readable, too. This is a small thing that matters a lot over time, as the cleaner the code is, the easier it'll be for models to update it as the codebase grows larger.

It can work until it is done. That is still the core unlock. You give it a clear goal, enough context, and real validation, and it will iterate against the target until the work is actually finished. It does not just produce a plausible patch and call it a day. It keeps pushing.

But again, this is where the review gets hard to write. Is it a big leap over the previous best coding models for the things they were already amazing at? Yes... it is better, definitely. But the difference is often hard to feel because the previous models were already operating at a level that would have sounded absurd a year ago.

For most things most people care about, GPT-5.5 is more than good enough. Honestly, it is far past that.

The Real Story Is Rounding Out

The bigger deal with GPT-5.5 is not that it suddenly became good at engineering. It already was. The bigger deal is that it rounds out the capabilities around engineering.

Design is the clearest example.

OpenAI models have historically been weaker at design than I wanted. They could build functional UIs, but the taste was often not there. You could get something usable, but not something that felt like it had been designed by a good designer.

GPT-5.5 is a lot better here, with an important caveat: it is much better when the design direction already exists.

If you give it a Figma design, it is pretty great. If you ask it to extend an existing site where the visual decisions are already in place, it is really solid. It can pick up the style, continue the patterns, and avoid the usual "AI frontend" smell much better than previous GPT models.

I would say it is relatively close to Opus in that kind of work, though not on par. From a blank canvas, I now usually try the GPT-Image-2 into GPT-5.5 workflow first. If that does not go well, Gemini is still a strong backup for initial designs from scratch.

One thing has changed recently, though: GPT-Image-2 is so damn good at generating screenshots of UI mockups that pairing it with GPT-5.5 shores up a lot of this model's design weaknesses.

My new workflow for generating new UI is basically: make the image first, then have GPT-5.5 implement it. Generate a screenshot or mockup with GPT-Image-2, give it to the model, and let it build from there. It is really good at getting close to the design.

This also explains the remaining split for me. Once I already have a UI in place, I still tend to go with Opus for a lot of frontend work because Opus can build new UI without image references really well. GPT-5.5 is only okay at that. If I just want to add a button, a small panel, or a new state, generating another image reference can be a lot of work. For that kind of fast incremental UI work, Opus still feels easier.

But for new UI where I can start from a strong mockup, GPT-Image-2 plus GPT-5.5 is a very strong combination.

For most frontend work, I still default to Opus. But when the task is more engineering-heavy or logic-heavy than pure design, GPT-5.5 becomes much more appealing. If the UI needs to work correctly, respect a lot of constraints, handle state cleanly, and fit into an existing system, GPT-5.5 is excellent.

Mobile and Native Apps Feel Better

I also found GPT-5.5 better and more well-rounded for mobile app development and native Mac app development.

Swift work in particular feels stronger. This matters a lot because mobile and native app work can punish models in ways web apps do not. There are lots of platform-specific details, little issues and annoying build problems, and more places where a model can produce broken code.

GPT-5.5 feels generally more comfortable there. More rounded. Less brittle.

I do not have one dramatic "it built the impossible app overnight" story here. It is more that, across repeated use, I found myself trusting it more with the whole task instead of carving off tiny pieces.

Security Is the Biggest New Reason to Care

The area where GPT-5.5 surprised me most was security.

In the wake of Anthropic's Project Glasswing / Claude Mythos Preview announcement, and OpenAI's GPT-5.4-Cyber and Trusted Access for Cyber release, I spent time testing GPT-5.5 on security use cases.

I am going to keep examples general for obvious reasons, but the short version is: it is really quite good.

It found vulnerabilities in codebases that previous GPT models and Opus did not find in my testing. Not theoretical nonsense. Real issues. The kind of thing where you look at the result and go, okay, that actually matters.

I have not personally tested every class of security problem here, but another early reviewer I trust found GPT-5.5 Pro shockingly strong on crypto and security problems. That lines up with what I have seen in codebase security review: when the task is crisp and the model has room to think, there is something very real happening here.

It was not as strong as GPT-5.4-Cyber when that model was properly harnessed for security work. That model is tuned and deployed for this kind of thing, and that matters. But GPT-5.5 is still a huge step up from existing general-purpose models in my experience.

Everyone should be running GPT-5.5 against their codebases to look for vulnerabilities. Give it a clear task, tell it exactly what kind of issues you care about, and let it go deep.

The clarity of the task matters a lot. Do not just say "is this secure?" That is too vague. Ask it to look for specific classes of problems, trace sensitive flows, inspect auth boundaries, check permission assumptions, look for injection risks, review dependency usage, or audit a subsystem with a clear threat model.

When you give it a crisp target, it goes against that target extremely well.

This is one of those capabilities that I think people will underuse at first because they are still thinking of security review as a specialized thing you do occasionally. That is going to change. If a model can inspect code cheaply, repeatedly, and with increasing depth, security review becomes something you do all the time.

GPT-5.5 makes that feel much more real.

It Feels More Opusified

After using it more, this is the thing I keep coming back to: GPT-5.5 feels more Opusified.

What I mean by that is: Claude Opus and Claude Code have always felt faster in the hand. Not necessarily better overall, but more fun. You ask for something, it gets moving, it gives you a strong answer quickly, and you stay in flow. Codex has often been better for serious engineering, but it has not always been as fun to use.

GPT-5.5 starts to close that gap.

I do not know if this is because the model is bigger, because the harness is better, or because it is just more token-efficient. But it often feels like it can use fewer tokens to get to the same result. Smarter per token, basically. That changes the day-to-day experience a lot.

You can seemingly get more out of less intense thinking modes than before. I still use extra-high reasoning when I care most about depth. If I am running a lot of tasks at once, or if something really matters, I am not racing. I am happy to let it think.

But when I just want to get things done quickly, the lower modes feel really good now. High and even medium can produce excellent results, and I have heard from early testers getting strong results from low or minimal thinking too. I personally have not built my workflow around low or minimal, but the fact that people are getting useful work out of those modes is exciting.

This makes GPT-5.5 feel more fun. Less waiting, and much more momentum.

Pro Is Fast, But I Have Mixed Feelings

The same thing shows up with Pro in a stranger way. GPT-5.5 Pro is very fast. Sometimes it is amazing. But I have not had the greatest experience with it overall.

My issue is not exactly that it returns answers in a few minutes. Speed is obviously good. The issue is that it sometimes does not feel as thorough as previous Pro models. It feels like it is more token-efficient, which is usually a good thing, but for Pro I often want the opposite.

I also really dislike the writing from Pro right now. This has been one of the biggest regressions for me. It writes these weird one-liner pieces: one sentence, new line, another sentence, new line, another sentence. It makes the final piece feel choppy and unnatural, and I have not been able to reliably prompt around it. I reported this to the team because it is bad enough that I just do not want to use Pro for writing in its current form.

I have also seen some strange judgment failures. One example: I was drafting a document for someone else, and Pro included the behind-the-scenes reasoning for why I was saying certain things directly in the document itself. That is obviously not what you want. If you are sending someone a carefully framed note, you do not want the model including the strategy behind the framing.

When I use Pro, I want it to burn the tokens. I want it to turn the problem over a few extra times. I want the answer that feels overthought in the useful way. GPT-5.5 Pro can be great, but there are times where it comes back quickly and I am left thinking: wait, that was it?

So the lower modes getting faster and stronger is a clear win. Pro being faster is more complicated. For that tier, I care less about speed and more about whether it used enough compute to justify being Pro in the first place.

When to Use What

This is how I am thinking about it right now.

GPT-5.5 is what I reach for when the task is engineering-heavy, logic-heavy, security-sensitive, or needs strong execution inside an existing system. Backend work, architecture, mobile apps, native apps, tricky integrations, code review, vulnerability hunting, and anything where I want the model to keep iterating until the result is solid. I am also using lower reasoning modes more than I expected, because they now feel good enough for a lot of real work. And if I have a GPT-Image-2 mockup, I am very comfortable using GPT-5.5 to turn it into the actual UI.

Opus is still my default for a lot of frontend work, especially when I need the model to invent or extend UI without an image reference. If I need a small UI change quickly, and I do not want to go generate a new mockup first, I still often start there.

Gemini is more of a backup for me now. If the GPT-Image-2 into GPT-5.5 workflow does not go well, or if I just cannot get the mockup I want, I will still try Gemini for initial design from scratch.

That split is not permanent. These models are moving too fast for any workflow to stay stable for long. But for now, this is the stack.

The Oddest Part: Better, But Less Dramatic

There is something psychologically strange about this release.

GPT-5 felt like a phase change because it moved the ceiling of what I thought was possible. GPT-5.3-Codex felt like a phase change because it made long-running autonomous engineering feel operationally real.

GPT-5.5 does not feel like that in the same way. It feels more like the frontier getting filled in. The rough edges getting smoother. The weak categories becoming less weak. The model becoming more useful across more of the messy work that real people actually do.

That is not as flashy. It is also extremely valuable.

If you mostly care about engineering, GPT-5.5 is already an exceptional engineer. It is thoughtful, persistent, and makes good decisions. For many tasks, it is hard to tell the difference between this and the previous best models because the previous best models were already amazing.

But when you push it into the areas around engineering -- design-constrained UI work, native apps, mobile apps, security review, lower-thinking-mode usage, and faster Opus-like workflows -- the improvement becomes much more visible.

Final Thoughts

GPT-5.5 is a real upgrade. It is better on the things GPT was already good at, and it is much better at a bunch of things where previous versions felt weaker.

It is not perfect. It is not suddenly the best design model. It is not magic. You still need to give it clear goals, real context, and validation if you want serious work done well.

But it is very, very good.

The best summary I can give is this: for most people, on most tasks, GPT-5.5 is less about a new kind of capability and more about making the existing capability feel broader, safer, and more dependable. It rounds out the model.

But when you need to push harder, and the current models just are not cracking it, GPT-5.5 is significantly better. You will not need that 99% of the time. But when you do, it matters a lot.

And for engineering, that may be exactly what matters now. The raw capability is already here. The question is how often you can trust it, how many domains it covers, and how reliably it can keep working until the job is actually done.

On that front, GPT-5.5 is a big step forward.

Follow me on X for updates on new models, workflows, and products worth using.
Follow @mattshumer_

My GPT-5.5 Review

A Big Upgrade That Doesn't Always Feel Like One

TL;DR

The Good

The Not-So-Good

The Frontier Is Getting Weird

The Harnesses Are Starting to Matter More

Engineering: It Was Already Great, Now It Is More Stable

The Real Story Is Rounding Out

Mobile and Native Apps Feel Better

Security Is the Biggest New Reason to Care

It Feels More Opusified

Pro Is Fast, But I Have Mixed Feelings

When to Use What

The Oddest Part: Better, But Less Dramatic

Final Thoughts