Are authors using LLMs to write AI research papers? Are peer reviewers outsourcing the writing of their reviews of these papers to generative AI tools? To find out, we analyzed all 19,000 papers and 70,000 reviews from the International Conference on Learning Representations (ICLR), one of the most important and prestigious AI research publication venues. Because ICLR runs a public review process on OpenReview, all of the papers and their reviews are available online, which is what made this analysis possible.
We made all the results publicly available on iclr.pangram.com.
Well, for one, we were offered a bounty!
Graham Neubig's tweet offering a bounty for analyzing ICLR submissions
In all seriousness, many ICLR authors and reviewers have been noticing blatant cases of AI-related scientific misconduct, such as an LLM-generated paper with completely hallucinated references, and many authors have reported receiving completely AI-generated reviews.
One author even reported that a reviewer asked 40 AI-generated questions in their peer review!
We wanted to measure the scale of this problem: are these examples of bad behavior one-off incidents, or are they indicative of a larger pattern at work? That's why we took Graham up on his offer!
ICLR has a very clear and descriptive policy on what is allowed and disallowed in terms of LLM usage in both papers and reviews.
Policy 1. Any use of an LLM must be disclosed, following the Code of Ethics policies that “all contributions to the research must be acknowledged” and that contributors “should expect to … receive credit for their work”.
Policy 2. ICLR authors and reviewers are ultimately responsible for their contributions, following the Code of Ethics policy that “researchers must not deliberately make false or misleading claims, fabricate or falsify data, or misrepresent results.”
ICLR also provides guidelines that authors and reviewers should follow when using LLMs in their papers and reviews.
So, we do not perform this study as a means of calling out individual offenders, since LLM use is actually allowed in both the paper submission and the peer review process. Instead, we wish to draw attention to the amount of AI usage in papers and peer reviews, and to highlight that fully AI-generated reviews (which indeed are likely to be Code of Ethics violations) are a much more widespread problem than many realize.
We first downloaded all of the PDFs of the ICLR submissions using the OpenReview API. We also downloaded all of the notes, which allowed us to extract the reviews.
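For readers who want to reproduce the download step, here is a minimal sketch using the openreview-py client; the venue ID, invitation names, and method details are assumptions and may need adjusting for a given year's conference.

```python
# Minimal sketch of the download step (openreview-py, API v2).
# The venue ID, invitation filter, and attachment method are assumptions.
import os
import openreview

client = openreview.api.OpenReviewClient(baseurl="https://api2.openreview.net")
os.makedirs("pdfs", exist_ok=True)

# Fetch all submission notes for the venue (venue ID is illustrative).
submissions = client.get_all_notes(content={"venueid": "ICLR.cc/2026/Conference/Submission"})

for note in submissions:
    # Save the PDF attached to the submission.
    pdf_bytes = client.get_attachment(note.id, "pdf")
    with open(os.path.join("pdfs", f"{note.id}.pdf"), "wb") as f:
        f.write(pdf_bytes)

    # Reviews are replies on the submission's forum; filter by invitation name.
    replies = client.get_all_notes(forum=note.id)
    reviews = [r for r in replies
               if any("Official_Review" in inv for inv in (r.invitations or []))]
```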
We found that a regular PDF parser such as PyMuPDF was insufficient for the ICLR papers: line numbers, images, and tables were often not handled correctly. To extract the main text of each paper, we instead used Mistral OCR to parse the PDF into Markdown. Because AI also tends to prefer Markdown output, we then reformatted the Markdown as plain text to mitigate false positives arising from the formatting alone.
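As a rough illustration, the Markdown-to-plain-text step can be as simple as stripping formatting tokens with a few regular expressions; the cleaning rules we actually used may be more involved than this sketch.

```python
# Sketch: strip common Markdown formatting so only plain prose remains.
import re

def markdown_to_plain_text(md: str) -> str:
    text = re.sub(r"`{3}.*?`{3}", "", md, flags=re.DOTALL)          # fenced code blocks
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", text)                # images
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)            # links -> anchor text
    text = re.sub(r"^#{1,6}\s*", "", text, flags=re.MULTILINE)      # heading markers
    text = re.sub(r"(\*{1,3}|_{1,3})(.+?)\1", r"\2", text)          # bold / italics
    text = re.sub(r"^\s*[-*+]\s+", "", text, flags=re.MULTILINE)    # bullet markers
    text = re.sub(r"\n{3,}", "\n\n", text)                          # collapse blank lines
    return text.strip()
```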
We then ran Pangram's extended text classifier on the parsed plain text from these PDFs. The extended version of the classifier first splits the text into segments and runs the AI detection model on each segment individually. The result is the percentage of segments that came back positive for AI-generated text, so a paper can be identified as fully human-written, fully AI-generated, or mixed, with some segments positive and some negative.
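Conceptually, the aggregation works like the sketch below; the `segment` and `is_ai` functions are placeholders for Pangram's internal segmentation and per-segment classifier, not the real API.

```python
from typing import Callable, List

def ai_fraction(text: str,
                segment: Callable[[str], List[str]],
                is_ai: Callable[[str], bool]) -> float:
    """Fraction of segments flagged as AI-generated (placeholder functions)."""
    segments = segment(text)
    if not segments:
        return 0.0
    flagged = sum(1 for s in segments if is_ai(s))
    return flagged / len(segments)

# A paper reads as fully human near 0.0, fully AI near 1.0, and mixed in between.
```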
We also checked the peer reviews for AI using our new EditLens model. EditLens is able to not only detect the presence of AI, but can also describe the degree to which AI was involved in the editing process. EditLens predicts that a text falls into one of five categories: Fully Human, Lightly Edited, Medium Edited, Heavily Edited, or Fully AI-Generated.
EditLens is currently only available to customers in our private beta, but it will become publicly available in early December. We will have more to say about this model in the coming weeks. In our research preprint, we describe its performance as state-of-the-art at detecting co-authored text; on internal benchmarks, it has similar accuracy to our current model when evaluated as a binary classifier, with an exceptionally low false positive rate of 1 in 10,000 on fully human-written text.
In our previous analysis of AI conference papers, we found that Pangram has a 0% false positive rate on all available ICLR and NeurIPS papers published prior to 2022. While some of these papers are indeed in the training set, not all of them are, so we believe Pangram's true false positive rate on held-out papers is very close to 0%.
What about peer reviews? We ran an additional negative control experiment, in which we ran the newer EditLens model on all 2022 peer reviews. We found about a 1 in 1,000 rate of confusing Fully Human with Lightly Edited, a 1 in 5,000 rate of confusing Fully Human with Medium Edited, and a 1 in 10,000 rate of confusing Fully Human with Heavily Edited. We found no confusions between Fully Human and Fully AI-Generated.
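For concreteness, these negative-control rates are just confusion tallies over the 2022 reviews, along the lines of this sketch (the label strings are illustrative):

```python
from collections import Counter

def negative_control_rates(predictions: list[str]) -> dict[str, float]:
    """Rate at which human-written 2022 reviews were confused with each category.

    `predictions` is one EditLens label per 2022 review; since these reviews
    predate ChatGPT, we treat them all as fully human-written.
    """
    counts = Counter(predictions)
    total = len(predictions)
    return {label: counts.get(label, 0) / total
            for label in ("Lightly Edited", "Medium Edited",
                          "Heavily Edited", "Fully AI-Generated")}
```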
Distribution of EditLens predictions on ICLR 2022 reviews (negative control)
For the experiment itself, we ran Pangram on all papers and peer reviews. Here are the main findings:
We found that 21% of reviews (15,899) were fully AI-generated, and that over half of the reviews had some form of AI involvement, whether AI editing, AI assistance, or full AI generation.
Distribution of EditLens predictions on ICLR 2026 reviews
Paper submissions, on the other hand, are still mostly human-written (61% were mostly human-written). However, we did find several hundred fully AI-generated papers, though they appear to be outliers, and 9% of submissions had over 50% AI content. As a caveat, some fully AI-generated papers were already desk rejected and removed from OpenReview before we had a chance to perform the analysis.
Distribution of AI content in ICLR 2026 paper submissions
We found some interesting trends in the results that shed light on how AI is being used in both paper submissions and peer reviews, and what the downstream effects of this usage are on the review process itself.
Contrary to a previous study showing that LLMs often prefer their own outputs to human writing when used as a judge, we find the opposite: the more AI-generated text present in a submission, the lower its review scores.
Average review scores by AI content in papers
This could be for multiple reasons. One is that the more AI is used in a paper, the less well thought out and executed the paper is overall. It is possible that when AI is used in scientific writing, it is more often used for offloading and shortcutting than as an additive assistant. Additionally, the fact that fully AI-generated papers receive lower scores suggests that AI-generated research is still low-quality slop rather than a real contribution to science (yet).
Average review scores by AI involvement level
We find that the more AI is present in a review, the higher the score. This is problematic: it means that rather than using AI merely to reframe their own opinion (in which case we would expect the same average score for AI reviews and human reviews), reviewers are outsourcing the judgment of the paper to AI as well. Misrepresenting the LLM's opinion as the reviewer's own is a clear violation of the Code of Ethics. We also know that AI tends to be sycophantic: it says pleasing things that people want to hear rather than giving an unbiased opinion, a completely undesirable property in peer review, and one that could explain the positive bias in scores among AI reviews.
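The analysis behind this comparison amounts to grouping reviews by their EditLens label and averaging the scores; a minimal sketch with illustrative file and column names is below.

```python
import pandas as pd

# One row per review; column names are illustrative: "editlens_label" is the
# EditLens category and "rating" is the reviewer's score for the paper.
reviews = pd.read_csv("iclr2026_reviews.csv")

avg_score_by_label = (
    reviews.groupby("editlens_label")["rating"]
           .agg(["mean", "count"])
           .sort_values("mean", ascending=False)
)
print(avg_score_by_label)
```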
Average review length by AI involvement level
Previously, a longer review suggested that the review was well thought out and of higher quality, but in the era of LLMs it can often mean the opposite. AI-generated reviews are longer and contain a lot of "filler content." According to Shaib et al., in a research paper called Measuring AI Slop in Text, one property of AI "slop" is low information density: the AI uses a lot of words to say very little in terms of actual content.
We find this to be true in the LLM reviews as well: the AI uses a lot of words without giving very information-dense feedback. We argue this is problematic because authors have to waste time parsing a long review and answering vacuous questions that don't actually contain much helpful feedback. It is also worth mentioning that most authors will probably ask a large language model for a review of their submission before they actually submit it. In these cases, the feedback from an LLM review is largely redundant and unhelpful, because the author has already seen the obvious criticisms an LLM will make.
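The length comparison is computed the same way, grouping by EditLens label and averaging a simple word count (again, the file and column names are illustrative).

```python
import pandas as pd

# "editlens_label" is the EditLens category, "review_text" is the raw review body.
reviews = pd.read_csv("iclr2026_reviews.csv")
reviews["word_count"] = reviews["review_text"].str.split().str.len()

print(reviews.groupby("editlens_label")["word_count"].mean())
```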
While Pangram's false positive rate is extremely low, it is non-zero, and therefore we have a responsibility to quantify the reliability of the tool before recommending that it be used to make discrete decisions about a paper's fate (such as a desk rejection) or to punish a peer reviewer. We directly measured the in-domain false positive rate using the negative control studies described above, but what about on other datasets, benchmarks, and general text?
We documented Pangram's false positive rate in this previous blog post.
Pangram's accuracy has also been validated by multiple third-party studies, including recent studies by UChicago Booth and the American Association for Cancer Research.
To put these numbers into context, Pangram's false positive rate is comparable to the false positive rate of a DNA test or a drug test: a true false positive, where fully human-written text is mistakenly flagged as AI-generated, is non-zero, but exceedingly rare.
If you're an author who suspects you've received an AI-generated review, there are several telltale signs you can look for. While Pangram can detect AI-generated text, you can also spot the signs of AI reviews by eye.
We have put together a general guide to detecting AI writing patterns by eye, but we do notice some additional signals and markers present specifically within AI peer reviews.
Some of the "tells" that we notice in AI peer reviews:
For example, here is the kind of excerpt we see again and again in reviews flagged as AI-generated:

Strengths:
Clear problem formulation: The paper addresses a real problem—VLM-based OCR systems hallucinate on degraded documents without signaling uncertainty, which is worse than classical OCR systems that produce obviously garbled output. The motivation is well-articulated.
Systematic methodology: The two-stage training approach (pseudo-labeled cold start + GRPO) is reasonable and well-described. The multi-objective reward design with safeguards against reward hacking (especially the length-mismatch damping factor η) demonstrates careful engineering.

Questions:
1. Generalization to real degradations: Can the authors evaluate on real-world degraded documents (e.g., historical document datasets) to demonstrate that the approach generalizes beyond the specific synthetic degradation pipeline?
2. Comparison with MinerU systems: MinerU and MinerU2.5 [2,3] represent recent advances in document parsing. How does the proposed method compare against these systems on Blur-OCR? If these systems cannot produce uncertainty estimates, can they be combined with the proposed tagging approach?
Shallow nit-picks rather than genuine analysis: AI-generated reviews tend to focus on surface-level issues rather than real concerns with the scientific integrity of the paper. Typical AI criticisms include requests for additional ablations that are very similar to the ablations already presented, requests to increase the size of the test set or the number of controls, or asks for more clarification or more examples.
Saying a lot of words that say very little: AI reviews often exhibit low information density, using verbose language to make points that could be expressed more concisely. This verbosity creates extra work for authors who must parse through lengthy reviews to extract the actual substantive critiques.
Earlier this year, researchers from UNIST in Korea published a position paper outlining some of the reasons for the decline in the quality of the peer review process. As AI continues to grow as a field, the resource strain placed on the peer review system is starting to show cracks: there are simply too few qualified reviewers to keep up with the explosive rise in the number of papers.
The biggest issue with poor-quality AI-generated papers is that they waste time and resources that are in limited supply. According to our analysis, AI-generated papers are simply not as good as human-written papers, and even more problematically, they can be generated cheaply by dishonest authors and paper mills that "spray and pray" (submit a high volume of submissions to a conference in hopes that one of them will get accepted by chance). If AI-generated papers are allowed to flood the peer review system, review quality will continue to decline, and reviewers will be less motivated as they are forced to read "slop" papers instead of real research.
Understanding why AI-generated reviews can be harmful is a bit more nuanced. We agree with ICLR that AI can be used positively in an assistive capacity to help reviewers better articulate their ideas, especially when English is not a reviewer's native language. Additionally, AI can often provide genuinely helpful feedback: it is often productive for authors to roleplay the peer review process with LLMs, having them critique and poke holes in the research and catch mistakes that the author may not have caught on their own.
However, the question remains: if AI can generate helpful feedback, why should we prohibit fully AI-generated reviews? University of Chicago economist Alex Imas articulates the core issue in a recent tweet: the answer depends on whether we want human judgment involved in scientific peer review.
Alex Imas tweet on AI-generated reviews
If we believe current AI models are sufficient to replace human judgment entirely, then conferences should simply automate the entire review process: feed papers through an LLM and assign scores automatically. But if we believe human judgment should remain part of the process, then fully AI-generated reviews must be penalized. Imas identifies two key problems: first, a pooling equilibrium, where AI-generated content (being easier to produce) will quickly crowd out human judgment within a few review cycles; and second, a verification problem, where determining whether an AI review is actually good requires the same effort as reviewing the paper yourself, so if LLMs can generate better reviews than humans, why not automate the entire process?
In my opinion, human judgment is complementary to AI reviews and provides orthogonal value. Humans can often come up with out-of-distribution feedback that may not be immediately obvious. Expert reviewers are more useful than LLMs because their opinions are shaped by experience, context, and a perspective that is curated and refined over time. LLMs are powerful, but their reviews often lack taste and judgment, and therefore feel "flat."
Perhaps conferences in the future can put the SOTA LLM review next to the human reviews to ensure that the human reviews are not just restating the "obvious" critiques that can be pointed out by an LLM.
The rise of AI-generated content in academic peer review represents a critical challenge for the scientific community. Our analysis shows that fully AI-generated peer reviews represent a significant proportion of the overall ICLR review population, and the number of AI-generated papers is also rising. Yet, these AI-generated papers are more often slop than genuine research contributions.
We argue that this trend is problematic and harmful for science, and we call on conferences and publishers to embrace AI detection as a solution to deter abuse and preserve scientific integrity.
