Every major model leap changes how I use these AI systems. Opus 4.5 got us very close to full autonomy, but GPT-5.3-Codex is the first model where the loop actually feels like this: I specify the outcome, set up validation (clear pass/fail tests), and press go... and feel pretty confident that when I come back an hour or two (or sometimes even more) later, my task (no matter how complex) will be done near perfectly.
Every Leap Changes My Workflow, and This Is a Big Leap
It is kind of insane that this was only about a year and a half ago, but Sonnet 3.5 was, more or less, an English-to-code translator. It was great, but you still had to drive every step. It would do (almost) exactly what you said, which was useful, but it did not really carry the work forward on its own. You still needed to know how to build (to some extent) to wield it effectively.
The next few model releases got stronger. They started to feel more like junior engineers. You could hand them somewhat bigger tasks. They would run for longer. But you still had to hold their hands through pretty much every step, and iteration was a constant battle. To get a feature right, it might take 10 to 20 prompts, or more if it was super complex.
GPT‑5 was the next big phase change. I stopped spoon-feeding it steps and started giving it larger outcomes. It could do a lot, somewhat autonomously, but large repos still tripped it up, and it still made mistakes relatively often, especially as time went on and I really started pushing the limits of what it could do. Most importantly, I still had to drive to some extent, giving it extremely detailed prompts describing how I wanted things to be done.
As we all know, Opus 4.5 was a big leap from GPT‑5 (and other models released thereafter). Opus 4.5 is insanely fast and usually nails most things I throw at it, but it still needs very tight guardrails. If I am not extremely explicit about constraints, non-goals, and how we will validate, it will often choose the fastest plausible path to success. It might patch around the root cause, stub something it should not, or optimize for "looks done" over "is solid, and done in the way the user wants."
Even when I am explicit, it will occasionally solve the prompt in a way I would not ship. That last 5 to 10 percent of judgment still leaks through on long, messy, high-stakes tasks, and that is the part that costs you hours later.
But now we have hit the next phase change, and I am calling it now: this is full autonomy.
We have arrived.
The Big Difference: It Makes Calls I Would Make Under Ambiguity
The most important upgrade is not speed. It is not even raw intelligence. It is judgment.
"But Matt!", you say. "Judgment is uniquely human!" I am sorry, but no.
It has become increasingly clear that as long as data for a given thing exists, a model trained on that data can do that thing. Human judgment is captured in vast amounts of data on the internet, and the model companies are paying tons of money for data that helps with judgment and taste as well. This is the first model that feels like it has internalized that at a deep level for this specific domain.
When a prompt leaves room for interpretation, GPT-5.3-Codex tends to choose what I would have chosen. It fills in missing context in a way that feels aligned with how I actually think about the problem.
Assumption quality under ambiguity matters more than people realize, and GPT-5.3-Codex is much better at it than previous models I have used.
Multi-Agent Collaboration Is Finally Real
I also tested GPT-5.3-Codex in a multi-agent harness I built with AgentRelay (disclosure: I recently invested). I had multiple GPT-5.3-Codex instances chatting together to solve problems, and the results were absolutely incredible. I will be sharing more on this soon.
This is one of the first models I have seen that can really collaborate with other models, and not in a superficial way. With Opus in the same harness, it often felt like "talking to talk," and it was not obvious that multiple models were actually better. With GPT-5.3-Codex, communication was efficient, agents split off into focused workstreams on their own, and the collaboration actually produced better work. Things happened much faster, and each agent stayed more specialized. It was fucking incredible. I think this will be very, very commonplace soon.
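To be concrete about what I mean by a harness, here is a rough, hypothetical sketch. This is not AgentRelay's actual API, and the class and function names are mine: several model instances share a message bus, each owns a focused workstream, and their updates get broadcast so the others can coordinate instead of duplicating work.

```python
# Hypothetical sketch of a multi-agent harness (not AgentRelay's API):
# several model instances, each with a focused workstream, exchanging
# messages so they coordinate instead of duplicating work.
from dataclasses import dataclass, field


@dataclass
class Message:
    sender: str
    content: str


@dataclass
class Agent:
    name: str
    focus: str                               # e.g. "backend", "tests", "deploy"
    inbox: list[Message] = field(default_factory=list)

    def step(self) -> Message:
        # In a real harness this call would go to the model, with the agent's
        # focus plus its inbox as context, and return the model's reply.
        context = "; ".join(m.content for m in self.inbox) or "starting"
        self.inbox.clear()
        return Message(self.name, f"[{self.focus}] progress on: {context}")


def run_round(agents: list[Agent]) -> None:
    # One round: every agent takes a step, then its update is broadcast
    # to the others so the next round is informed by everyone's progress.
    outgoing = [agent.step() for agent in agents]
    for msg in outgoing:
        for agent in agents:
            if agent.name != msg.sender:
                agent.inbox.append(msg)


agents = [Agent("a1", "backend"), Agent("a2", "tests"), Agent("a3", "deploy")]
for _ in range(3):
    run_round(agents)
```

The interesting part with GPT-5.3-Codex was not the harness itself; it was that the agents inside it actually used the channel well.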
The Unlock: Validation Turns This Into a Real Agent
If you want full autonomy, there is one approach that dominates everything else: give the model strong validation and tests up front.
With clear validation targets, GPT-5.3-Codex will iterate for hours without losing the thread. It does not drift. It does not get confused halfway through. It keeps pushing until the constraints are satisfied and the tests are green.
Without tests, it is excellent. With tests, it becomes a different class of tool.
Note: this is true for any modern coding agent. GPT-5.3-Codex is just in a different class when it comes to using validation and tests effectively to iterate toward a goal.
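What "validation up front" looks like in practice can be as simple as a handful of pass/fail tests written before the run starts. A minimal sketch, with a hypothetical `rate_limiter` module standing in for whatever the agent is being asked to build:

```python
# Validation written up front: the agent's only job is to make these pass.
# The rate_limiter module and its allow() function are hypothetical stand-ins
# for the feature being requested.
from rate_limiter import allow


def test_allows_requests_under_the_limit():
    assert all(allow("user-1") for _ in range(10))


def test_blocks_requests_over_the_limit():
    for _ in range(10):
        allow("user-2")
    assert not allow("user-2")


def test_limits_are_tracked_per_user():
    for _ in range(10):
        allow("user-3")
    assert allow("user-4")
```

The prompt then boils down to: implement the feature, make the tests pass, do not touch the tests.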
It Even Uses Skills Without Being Told
This is a small detail that ends up mattering a lot: it is willing to use local skills and tooling at the right time without me explicitly instructing it to do so.
Even Opus 4.5 often needs a nudge like "check if there is a skill for this." It does not naturally scan what is available. GPT-5.3-Codex does, and it tends to use skills when they are actually useful, not just because they exist.
The First Model I Confidently Walk Away From
For long-horizon, tricky engineering work, this is the first model where I can start a run and go do something else without feeling a need to constantly check and make sure it is on track. It just keeps going. It does not slowly degrade. It does not give up early. It tends to work.
Yes, it is slower than Opus 4.5. Runs often take multiple hours (I had a couple go for over 8 hours). That tradeoff is very, very real. But the stability is so much higher that I end up trusting it more on anything I really do not want to get wrong.
Code Quality Is Better, Too
This part is easy to miss because you feel it weeks later: the code quality and architecture are usually significantly better than what I get from Opus 4.5.
I see fewer hacky patches, less dead code left behind, and fewer subtle bugs accumulating as the repo evolves. It is not just that it finishes the task. It usually leaves the codebase in much better shape, which is extremely impressive because it is often working for much longer and making bigger changes.
It Uses Time Like a Good Engineer
Another underrated behavior: it uses dead time well. If something is running and there is nothing useful it can do in that exact moment, it often goes and gathers context, improves documentation, or fixes issues on its own.
Other models will sit there unless I explicitly tell them what to do next. GPT-5.3-Codex tends to do the obvious useful thing, without overreaching into changes I did not ask for.
Cross-Repo Work Is Wild (In a Good Way)
I am usually giving it access beyond the single repo it is scoped into. That has unlocked a totally different workflow.
I can say things like "find the repo on this machine that exposes the API for X," and it will go find it, learn the pattern, apply it correctly in the current repo, and keep moving. It can even make changes in the other repo, push there, and come back to the main thread without getting lost.
Watching it move across my machine like this is still kind of insane.
It Can Close the Loop on Deployment (Railway CLI)
I have been giving it Railway CLI access, and it has been able to close the loop on the full lifecycle of development for me. I will say something like: "get this on Railway when you are ready, and make sure it works perfectly," and it just does it.
It will make changes, push them, deploy, check the real production URL, tail the logs, and keep iterating until it is actually working. We have seen glimpses of this with other models. Opus can do a solid job using logs to self-correct, but it still makes mistakes. Gemini 3 Pro in Antigravity includes browser-driven iteration, and most coding tools now have plugins for parts of this loop. The difference here is that it finally feels like a true closed loop... it works almost every damn time.
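The loop it runs on its own looks roughly like the sketch below. This is an illustration, not the model's actual internals: `railway up` and `railway logs` are real Railway CLI commands, the health-check URL is a placeholder, and the rest is glue.

```python
# Rough illustration of the closed loop: push, deploy, hit the production URL,
# read the logs, and go around again if something is off.
import subprocess
import urllib.request

PROD_URL = "https://my-service.example.up.railway.app/health"  # placeholder


def deploy_and_verify() -> bool:
    subprocess.run(["git", "push"], check=True)
    subprocess.run(["railway", "up"], check=True)      # deploy the linked project
    try:
        with urllib.request.urlopen(PROD_URL, timeout=30) as resp:
            healthy = resp.status == 200
    except OSError:
        healthy = False
    try:
        # Pull recent deploy logs so the next iteration has failure context.
        logs = subprocess.run(["railway", "logs"], capture_output=True,
                              text=True, timeout=30)
        print(logs.stdout[-2000:])
    except subprocess.TimeoutExpired:
        pass  # some CLI versions stream logs indefinitely; skip after a bound
    return healthy


for attempt in range(3):
    if deploy_and_verify():
        break
    # ...fix whatever the logs and health check surfaced, then retry
```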
It is absolutely crazy to watch this happen. I can start a fresh project, walk away from my computer, and come back an hour or two (or sometimes, even more) later to multiple fresh codebases on GitHub, new deployments on Railway, and the whole system interacting perfectly.
When to Use What (My Actual Decision Rule)
This is how things have settled for me right now:
Opus 4.5 is still my default for quick work and fast iteration loops, especially when I care more about speed than depth. But I am using it less and less each day. Lately, I find myself packing a bunch of little issues I'd normally use Opus for into one big prompt for Codex and letting it run for an hour or so.
GPT-5.3-Codex is what I reach for when the task is long-horizon, tricky, full of constraints, or something I really do not want to get wrong. Anything I want to start and walk away from. But again, the more I use it, the more I want to use it for, so I expect this is going to change over the coming weeks, with Codex eating up much more of my work.
UI and styling are still not GPT-5.3-Codex's strengths. Opus is better here, and Gemini 3 Pro is still the best I have used for styling.
In my GPT-5.2 review, I said the model was amazing but too slow. GPT-5.3-Codex is not dramatically faster. But weirdly, that no longer matters as much for me. It is so reliable on long-horizon work that I can just let it run and come back later. The speed is still a tax, but it just stops being a dealbreaker when the model works as well as this one does.
On reasoning modes: OpenAI recommends Medium, which is strong, but when I plan to walk away, Extra High just makes sense. For me, Extra High is the right setting for "do it right, take your time."
It’s Better, But It’s Not As Fun
There are small paper cuts I deal with, and the strangest one is that it can run for hours, literally hours, and I do not always know what to do with myself while it is happening.
With Claude, I would be launching parallel runs for other small things because the main run would not quite be able to do everything in one shot. GPT-5.3-Codex is so capable that one run can often cover most of, if not all of, what I want. That is amazing, but it also leaves me sitting on my hands sometimes. It is a weird adjustment, and I am still getting used to it.
Side Notes
Prompt and Agent Design
I build a lot of agents. GPT-5.3-Codex is not my favorite model for prompt architecture itself. It sometimes makes poorly thought-out decisions about what should go into prompts and agent flow, and I have had it break agent flows I care about. I still reach for Opus to refine prompts and build agents.
One caveat to the caveat: if I give it very explicit validation for what the agent needs to do, concrete tests for outputs and behaviors, it can iterate toward something that works even when the first attempt misses. It will grind toward green.
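By "concrete tests for outputs and behaviors" I mean something like the sketch below: assertions on what the agent flow should do, not how it does it. Everything here is hypothetical (`run_support_agent` is a made-up entry point for the flow being built), but this is the level of explicitness that lets it grind toward green.

```python
# Hypothetical behavior tests for an agent flow being built.
# run_support_agent and its result shape are illustrative, not a real API.
from my_agent import run_support_agent


def test_refund_request_routes_to_refund_tool():
    result = run_support_agent("I want my money back for order 123")
    assert result.tool_called == "issue_refund"
    assert result.arguments["order_id"] == "123"


def test_vague_request_asks_a_clarifying_question():
    result = run_support_agent("hey")
    assert result.tool_called is None
    assert "?" in result.reply
```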
That said, once the direction is clear, GPT-5.3-Codex is exceptional at building the systems around the agent and executing the work.
Status Narration Can Be Brittle
By status narration, I mean the model talking through what it is doing while it works: "I see this issue, I am going to check X, then run Y." It is usually pretty good at this.
Sometimes it just stops narrating for a while, which makes it harder than it should be to tell what is happening mid-run. The task checkboxes in the UI help a lot. It will list the tasks it plans to do and you can watch them get checked off. But I have noticed those checkboxes sometimes do not update until the end of a run. This is mostly a visibility issue. It has not meaningfully affected output quality for me.
Run Summaries Can Be Too Technical
Another paper cut: at the end of a run, the model will often give an update using very jargon-heavy language. If you are a more vibe-codery builder and do not have deep fundamentals, this is going to be rough. You will often have to ask it to re-explain in plain English.
Even if you are technical, it can still be annoyingly dense. Most of the time I just want a quick, clear read on what changed and whether it worked, not a wall of technical mud. The whole point of using these models is to avoid that mud in the first place.
Why I Did Not Review the Mac App Here
A few people asked me why I did not review the Codex Mac App, even though I had early access. The reason is that I have been so wowed by 5.3-Codex that spending time on almost anything else has not felt worth it.
That said, the app itself is strong: managing many runs in one place is genuinely useful, and support for local/cloud runs plus worktrees/branches is great. I have still seen a few UI bugs (especially around mid-run updates), and there is room to streamline the interface, but the model quality has been so far ahead that it has dominated my attention.
This Model Changed How I Work
My workflow now looks like this: I write extremely detailed prompts, define explicit validation and test cases up front, and then I let it run.
GPT-5.3-Codex is the first coding model I have used where full autonomy starts feeling operationally real.
It is not perfect. Speed is still a huge downside. But its judgment under ambiguity is better, its long-horizon stability is better, and when you give it validation targets it becomes incredibly reliable. For those reasons, this is now my favorite model for most of my work.
Opus 4.5 is still my go-to for quick work. But for anything difficult, long, or something I really do not want to get wrong, this is the first model where I am comfortable pressing go, leaving my computer, and expecting it to actually... work.
Follow @mattshumer_