AI detection is often described as an "arms race" between large language models, detectors, and "humanizers": a class of online tools designed to obfuscate AI-generated text, often by introducing intentional errors, so that the result sounds human.
At Pangram, we are always trying to stay ahead of the curve, reacting to the latest advancements in both new models and humanizers.
In January 2025, we published an update to our technical report in which we audited 19 different humanizer and paraphraser tools. The humanizer landscape is evolving rapidly, however, so we want to publish updated numbers from our latest humanizer benchmark.
| Humanizer | Pangram detection accuracy |
|---|---|
| Ahrefs | 100.0% |
| aihumanizer.com | 100.0% |
| Bypass GPT | 99.7% |
| DIPPER | 97.6% |
| Ghost AI | 100.0% |
| GPTinf | 99.2% |
| Grammarly | 100.0% |
| humanizeai.io | 93.8% |
| humanizeai.pro | 100.0% |
| Just Done | 93.5% |
| Quillbot | 100.0% |
| Scribbr | 99.0% |
| Semihuman AI | 100.0% |
| Smodin | 100.0% |
| StealthGPT | 95.6% |
| Surfer SEO | 100.0% |
| surgegraph.io | 100.0% |
| TwainGPT | 92.7% |
| Undetectable AI | 90.3% |
| Writesonic AI | 98.1% |
Pangram performs above 90% on all the notable humanizers that we tested.
In Russell et al., Pangram is benchmarked against GPTZero and several open-source methods on humanized text. Pangram's best model is 97% accurate on humanized text, compared to GPTZero at 46%, FastDetectGPT at 23%, and Binoculars at 7%.
Pangram's performance on humanized text compared to other detectors
A recent study by Jabarian and Imas found that Pangram is the only detector among the four commercial detectors tested whose performance is robust to humanizers:
> For longer passages, Pangram detects nearly 100% of AI-generated text. The FNR increases a bit as the passages get shorter, but still remains low. The other detectors are less robust to humanizers. The FNR for Originality.AI increases to around 0.05 for longer text, but can reach up to 0.21 for shorter text, depending on the genre and LLM model. GPTZero largely loses its capacity to detect AI-generated text, with FNR scores around 0.50 and above across most genres and LLM models. RoBERTa does similarly poorly with high FNR scores throughout.
There are several ways that you can tell by eye that a text has been fed through a humanizer.
One of the easiest ways to spot a humanizer is to look for "tortured phrases": out-of-place synonym replacements meant to disguise plagiarism. Word-spinner tools such as Grammarly and Quillbot were using these synonym-replacement algorithms to disguise plagiarism even before the rise of LLMs.
Examples of tortured phrases would be "counterfeit consciousness" instead of "artificial intelligence", or "bosom peril" instead of "breast cancer." We heard a funny case last year of "Martin Luther Ruler, Jr." showing up in a student essay in place of "Martin Luther King, Jr."
Be careful not to rely on tortured phrases as the only signal of humanized AI text: they also commonly show up in nonnative English writing, where writers may misjudge the meaning or typical usage of certain words.
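Even so, a quick scan against a dictionary of known tortured phrases makes a useful first pass. Here is a minimal sketch in Python; the phrase list is a hypothetical starter set for illustration, not Pangram's internal data:

```python
# A minimal first-pass scan against known tortured phrases. The phrase list
# below is a small hypothetical starter set, not Pangram's internal data.
TORTURED_PHRASES = {
    "counterfeit consciousness": "artificial intelligence",
    "bosom peril": "breast cancer",
    "martin luther ruler": "martin luther king",
}

def find_tortured_phrases(text: str) -> list[tuple[str, str]]:
    """Return (tortured phrase, likely original) pairs found in the text."""
    lowered = text.lower()
    return [(phrase, original)
            for phrase, original in TORTURED_PHRASES.items()
            if phrase in lowered]

print(find_tortured_phrases(
    "The study of counterfeit consciousness has advanced rapidly."
))  # -> [('counterfeit consciousness', 'artificial intelligence')]
```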
Humanizers often try to fool the tokenizers of AI detectors by adding or removing spaces. Removing the space between sentences is especially common.
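These spacing artifacts are easy to check for heuristically. The sketch below flags sentence boundaries with missing spaces and counts runs of doubled spaces; it is an illustration, not how Pangram's tokenizer actually handles them:

```python
import re

def missing_sentence_spaces(text: str) -> list[str]:
    """Find places where a sentence ends and the next begins with no
    space, e.g. 'clear.However', a common humanizer artifact."""
    # A lowercase letter, sentence-ending punctuation, then an uppercase
    # letter with no intervening whitespace.
    pattern = re.compile(r"[a-z][.!?][A-Z][a-z]")
    return [m.group(0) for m in pattern.finditer(text)]

def doubled_spaces(text: str) -> int:
    """Count runs of two or more consecutive spaces, a common
    space-insertion artifact."""
    return len(re.findall(r" {2,}", text))

sample = "The results were clear.However, more work is  needed."
print(missing_sentence_spaces(sample))  # -> ['r.Ho']
print(doubled_spaces(sample))           # -> 1
```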
Humanized AI text still exhibits the same repetitive phrases as non-humanized AI text. It is especially telling that text came from a humanizer if the same tortured phrase appears twice in the same document, as it is evidence that the humanizer is systematically applying the same synonym replacements.
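A related check is to count verbatim-repeated word n-grams; an unusual phrase occurring multiple times in one document hints at systematic substitution. A minimal sketch:

```python
from collections import Counter

def repeated_ngrams(text: str, n: int = 3, min_count: int = 2) -> dict[str, int]:
    """Count word n-grams appearing at least min_count times; a humanizer
    applying the same synonym substitution tends to repeat odd phrases
    verbatim."""
    words = text.lower().split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return {g: c for g, c in Counter(grams).items() if c >= min_count}

doc = ("The counterfeit consciousness model improved. "
       "Later, the counterfeit consciousness model failed.")
print(repeated_ngrams(doc))
# -> {'the counterfeit consciousness': 2, 'counterfeit consciousness model': 2}
```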
Humanizers also typically use non-standard Unicode characters to fool the tokenizers of AI detectors. For example, one popular humanizer substitutes U+2009 (THIN SPACE) for the normal space character. We recommend https://www.soscisurvey.de/tools/view-chars.php, which reveals the non-printable characters that may be hidden in copied and pasted strings.
Example of non-printable characters in humanized text
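You can run the same kind of check locally. This Python sketch flags any character outside plain printable ASCII, which catches substitutions like U+2009; a real pipeline would also whitelist legitimate non-ASCII such as accented letters, so treat this as a rough illustration:

```python
import unicodedata

def suspicious_characters(text: str) -> list[tuple[int, str, str]]:
    """List (position, repr, Unicode name) for every character outside
    plain printable ASCII, e.g. a THIN SPACE swapped in for a space."""
    flagged = []
    for i, ch in enumerate(text):
        if ord(ch) > 126 or (ord(ch) < 32 and ch not in "\n\t"):
            # name() has no entry for some characters, so supply a default.
            flagged.append((i, repr(ch), unicodedata.name(ch, "UNKNOWN")))
    return flagged

sample = "This looks normal,\u2009but it is not."
print(suspicious_characters(sample))
# -> [(18, "'\\u2009'", 'THIN SPACE')]
```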
Using Pangram's new Writing Playback feature in Google Docs, you can also check whether a significant portion of the text in a Google Doc was copied and pasted rather than typed manually.
Example of writing playback showing copy and paste
There are several reasons why Pangram is not a perfect detector on humanized AI text.
Pangram is not willing to compromise on its false positive rate. Several of our internal models can detect humanizers with near-perfect accuracy but exhibit higher false positive rates. We do not ship these models because it is more important to us that genuine human writing never gets flagged as AI than that we catch every humanizer output.
Extremely low-quality "junk" text is easily detectable by eye. In most cases where Pangram misses humanized output, the text is so badly garbled and obfuscated that it barely resembles English. These cases are easy to spot by eye but hard to catch algorithmically, because there are infinitely many ways to produce gibberish. We would rather descope gibberish than try to detect it; distinguishing human gibberish from humanizer gibberish is not even a well-posed problem.
Humanizer detection is an active area of research for Pangram, and we will continue to characterize the properties of these humanizers and publish our research into detecting their outputs. If Pangram is to be seen as a reliable tool for academic integrity, we must be able to detect text produced by these cheating tools as well as text copied and pasted directly from large language models.