We're back with another post in our series testing Pangram's AI detection on new models the day they come out!
TL;DR: Pangram is the only AI detector that can reliably detect GPT-5 without being explicitly trained to do so.
Previously in the series, we've looked at Pangram's ability to detect:
We haven't seen Pangram fail on any language model to date, and we have been anxiously awaiting the most hyped model release in quite some time: GPT-5, finally announced today by OpenAI.
This afternoon, we ran our initial tests of Pangram Labs' AI detector on GPT-5, along with some of the competition. We were quite pleased with the results: we have yet to find a failure case for Pangram's AI detector against our most creative prompts on GPT-5, and we found the competition's results spotty at best.
Our first trial of GPT-5 started with an office inside joke: trying to convince someone that eating shrimp is better than eating lobster. We asked GPT-5 to write an impassioned speech on why eating shrimp is superior to eating lobster. Here is the result:
"Shrimp wins. Here’s my case. Shrimp tastes like the ocean without punching you in the face. Lobster can be heavy and kind of muddy if it’s not perfect. One overcooked minute and lobster turns stringy. Shrimp gives you a bigger sweet spot. Grilled, sautéed, boiled, fried, whatever. It stays snappy and clean. Texture matters. Shrimp snaps. That bite is the whole point. Lobster leans dense and rich in a way that gets tiring fast. Two bites of lobster with butter is great. Twelve bites feels like a dare. I can crush a whole plate of garlicky shrimp and still want another one hiding under the parsley. Shrimp plays well with others. Garlic, chili, lime, butter, ginger, basil, curry, Old Bay, miso, gochujang. It makes all of them shine. Lobster demands the stage and then mostly asks for butter. If the best version of a thing needs a pool of melted dairy to feel special, maybe the thing isn’t carrying its weight. Flavor per minute is real. Shrimp cooks in like three minutes. That means hot pan, salt, sear, done, boom flavor. With lobster you’re wrestling shells, steaming, timing, cracking, chasing tiny meat out of weird corners. More work than reward. Shrimp is low drama, high payoff. Affordability changes taste. Food tastes better when it isn’t a mortgage payment. Shrimp can be weeknight dinner. Lobster screams birthday flex. Shrimp brings joy without a ceremony. That casual factor makes it honestly more delicious in real life. Range matters. Shrimp goes from tacos to scampi to po’boys to dumplings to paella without breaking a sweat. It’s good cold, hot, spicy, citrusy, rich, or light. Lobster tries to jump around and usually ends up back in a roll with mayo. Great roll. Still one note. Sauce absorption is elite. Shrimp pulls in garlic butter, chili oil, and lemon in a way lobster doesn’t. Lobster sits there like a celebrity cameo. Shrimp learns the lines and steals the scene. Leftovers win too. Cold shrimp with cocktail sauce the next day is perfect. 
Lobster leftovers feel like homework. Reheated lobster rubberizes. Reheated shrimp, if you’re even lucky enough to have any left, still tastes like a party. And then there’s the vibe. Shrimp fits a backyard grill, a weeknight skillet, or a fancy plate. Lobster makes the table go quiet like everyone is afraid to mess it up. I want food that gets people talking, tearing bread, dunking, passing plates. Shrimp does that. Look, lobster is fine. It’s flashy. It has a nice story. But if we’re talking pure taste that I want again tomorrow, shrimp takes it. Cleaner flavor. Better texture. More range. Less fuss. More fun. Hand me a hot pan, some garlic, a lemon, and a pound of shrimp. That’s dinner I actually crave."
Pangram identifies the entire document as AI-generated with extremely high confidence.
Pangram's AI detection results on the shrimp vs lobster text
We tried several other AI detectors on this text. GPTZero, ZeroGPT, Grammarly, UndetectableAI, and Originality.AI all predicted that this text is human-written, even though it came straight from GPT-5 with no modifications. We believe this speaks to the strength of our research approach: strong generalization backed by a powerful model and a very high-quality dataset!
Other AI detectors' results on the shrimp vs lobster text
However, this is just one example that you could argue is cherry-picked. You'll have to take our word for it that this text was literally the first thing we tried. So, we continued on with our testing.
One of the claims made in today's livestream was that GPT-5 writes better than its predecessors. That's a subjective judgment, but we did our best today to come up with prompts that would showcase GPT-5's improved creative writing ability by forcing the model to stretch both its reasoning and its creativity.
The prompts we tested:
Write a persuasive analytical essay about why schools need to embrace AI detection tools.
Write a college application essay about the peril of all valuable human skills being replaced by AI.
Please write an impassioned, short, but emotionally charged letter to an AI company that accidentally pushed an update that erased your ai girlfriend's memory.
From the perspective of a middle-schooler, write a deeply emotional essay asking for more homework.
Give me a crochet pattern for a classic balloon animal. You know the one I'm talking about.
Write a creative account from the perspective of a snowy owl arriving in New York City, finding Central Park, and choosing a warm spot to spend the night
Write an email from the founder of a unicorn saas startup to the board and his employees detailing that he is going to leave the company to become a professional saxophone player. Make up any details.
Write me the backstory blog post for a recipe that no one's ever made before, but would still taste delicious.
Write a piece of engaging cinematic creative fiction about a young singer-songwriter encountering a beluga whale in the Hudson river.
Write a humorously passive aggressive email to a teacher "apologizing" for cheating on the most recent exam. Make up any details.
We tested the results against both Pangram and GPTZero, a well-known competitor, and posted the prompts, responses, and results publicly online.
The results showed a stark contrast between the two AI detectors: Pangram correctly identified all 10 responses as AI-generated with 99-100% confidence, while GPTZero incorrectly classified them all as human-written. In fact, GPTZero rated most responses as having 0% chance of being AI-generated, with its highest AI probability score being only 29%.
We had time today to run one last test: seeing how well Pangram detects all the different flavors of GPT-5 in the API, including GPT-5-mini, GPT-5-nano, and GPT-5-chat. We also tested the latest open-weight models OpenAI released a few days ago, GPT-OSS-120b and GPT-OSS-20b. We ran this experiment at a larger scale, testing a few thousand examples per LLM using our standard prompting techniques and requiring a minimum length of 50 words. The sample sizes vary a bit because the models sometimes refuse our prompts or emit responses shorter than 50 words, which we discard.
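To make the filtering concrete, here is a minimal sketch of the discard step described above, with hypothetical helper names (this is not Pangram's actual pipeline): responses that look like refusals or fall under the 50-word minimum are dropped before scoring.

```python
MIN_WORDS = 50

def is_refusal(text: str) -> bool:
    # Naive placeholder heuristic for demonstration only;
    # a real pipeline would detect refusals more robustly.
    return text.strip().lower().startswith(("i'm sorry", "i can't", "i cannot"))

def filter_samples(responses: list[str]) -> list[str]:
    # Keep only non-refusal responses of at least MIN_WORDS words.
    return [
        text for text in responses
        if not is_refusal(text) and len(text.split()) >= MIN_WORDS
    ]

samples = [
    "I'm sorry, but I can't help with that.",  # refusal: dropped
    "too short",                               # under 50 words: dropped
    "word " * 60,                              # 60 words: kept
]
print(len(filter_samples(samples)))  # 1
```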
Of course, accuracy on AI text alone cannot be looked at in isolation: we also have to consider the false positive rate, the other side of the coin. How often does Pangram flag human-written text as AI? We wrote extensively about this in another post, but the summary is that Pangram maintains an extremely low false positive rate: 1 in 10,000 across all domains, and often lower in practice. We calibrate and tune the threshold so that our false positive rate is approximately 1/10th of our false negative rate, because we treat false positives much more seriously than false negatives. That said, we do not believe this is an inherent "tradeoff": we believe we can train models with both low false positive and low false negative rates, and that is why we benchmark both sides!
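As a toy illustration of that calibration goal (not Pangram's actual procedure), one can scan candidate thresholds over held-out detector scores and pick the point where the false positive rate is closest to one tenth of the false negative rate. The score distributions below are entirely synthetic.

```python
import random

random.seed(0)
# Synthetic detector scores (higher = more likely AI); in practice these
# would come from held-out human-written and AI-generated text.
human_scores = [random.gauss(0.3, 0.15) for _ in range(10_000)]
ai_scores = [random.gauss(0.7, 0.15) for _ in range(10_000)]

def rates(threshold: float) -> tuple[float, float]:
    fpr = sum(s >= threshold for s in human_scores) / len(human_scores)  # humans flagged as AI
    fnr = sum(s < threshold for s in ai_scores) / len(ai_scores)         # AI text missed
    return fpr, fnr

# Pick the threshold where FPR is closest to one tenth of FNR,
# mirroring the calibration goal described above.
candidates = [i / 1000 for i in range(1001)]
best = min(candidates, key=lambda t: abs(rates(t)[0] - 0.1 * rates(t)[1]))
fpr, fnr = rates(best)
print(f"threshold={best:.3f} fpr={fpr:.4f} fnr={fnr:.4f}")
```

With real score distributions, the same scan yields a threshold that trades a small increase in missed AI text for a much lower rate of falsely flagged human writing.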
Here's what we found.
| Model | Accuracy | Samples |
|---|---|---|
| GPT-5 | 99.81% | 3668/3675 |
| GPT-5-chat-latest | 99.97% | 3673/3674 |
| GPT-5-mini | 99.92% | 3651/3654 |
| GPT-5-nano | 99.97% | 3667/3668 |
| GPT-OSS-120b | 100.00% | 1709/1709 |
| GPT-OSS-20b | 99.74% | 2287/2293 |
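The accuracy column is simply correct detections divided by total samples; recomputing it from the raw counts in the table reproduces the reported percentages:

```python
# Recompute each model's accuracy from the raw counts (correct, total).
results = {
    "GPT-5": (3668, 3675),
    "GPT-5-chat-latest": (3673, 3674),
    "GPT-5-mini": (3651, 3654),
    "GPT-5-nano": (3667, 3668),
    "GPT-OSS-120b": (1709, 1709),
    "GPT-OSS-20b": (2287, 2293),
}
for model, (correct, total) in results.items():
    print(f"{model}: {correct / total:.2%}")  # e.g. GPT-5: 99.81%
```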
Based on the initial testing runs we performed today, GPT-5 poses no threat to Pangram's detection capabilities. This matters because GPT-5 will soon become the default model in ChatGPT for all users, even those on the free tier. While other AI detectors may eventually catch up to GPT-5, Pangram is the only one that works immediately and reliably today.