AI detection for ML & data teams

AI detector for ML engineers & data scientists

Optimize LLM training and data selection. Prevent model collapse by filtering synthetic text from your pre-training or fine-tuning datasets with 99.98% accuracy and high-throughput API performance.

Built by researchers from Google, Tesla, and Stanford. Validated in peer-reviewed research at ICLR and by the University of Maryland.

filter_pipeline.py
from pangram import Pangram

# Filter synthetic documents out of a training corpus
client = Pangram(api_key="your-api-key")
clean_corpus = []

for doc in training_corpus:
    result = client.predict(doc.text)
    if result['fraction_ai'] < 0.3:
        clean_corpus.append(doc)

print(f"Corpus: {len(clean_corpus)} clean docs")
Trusted by global brands

Use cases

Don't train your models
on bad data.

Synthetic text is contaminating public datasets. Filter AI-generated content from your training pipelines with the most accurate AI detection engine to maintain corpus purity.

AI Data Analysis

Prevent Model Collapse

Recursive training on AI-generated content degrades model performance and diversity. Identify and filter AI-written content from your scraping pipelines to ensure corpus purity.

RLHF Verification

Verify RLHF Inputs

Ensure your Reinforcement Learning from Human Feedback (RLHF) data is actually human. Detect whether crowd-workers are using ChatGPT to generate responses for your fine-tuning tasks.
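The verification flow described above could be sketched as follows. This is a minimal illustration, not the documented Pangram API: the response shape (`{"fraction_ai": ...}`) and the stand-in detector are assumptions, and a real pipeline would call `client.predict()` instead.

```python
# Hypothetical sketch: flag crowd-worker responses that look AI-generated.
# The detector callable and its response shape are assumptions for
# illustration; substitute the real Pangram client call in production.

def flag_suspect_annotations(annotations, detect, threshold=0.5):
    """Split annotations into (clean, flagged) using a detection callable."""
    clean, flagged = [], []
    for ann in annotations:
        score = detect(ann["response"])["fraction_ai"]
        (flagged if score >= threshold else clean).append(ann)
    return clean, flagged

def fake_detect(text):
    # Stand-in for client.predict(text); real scores come from the API.
    is_suspect = "as an ai language model" in text.lower()
    return {"fraction_ai": 0.9 if is_suspect else 0.1}

annotations = [
    {"worker": "w1", "response": "The capital of France is Paris."},
    {"worker": "w2", "response": "As an AI language model, I think..."},
]
clean, flagged = flag_suspect_annotations(annotations, fake_detect)
```

Flagged workers can then be audited or excluded before their responses enter the fine-tuning set.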

Granular Analysis

Granular Interpretability

Don't settle for a binary label. Our Premium API returns token-level probabilities, allowing you to retain human-edited segments while discarding fully synthetic "slop".
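Segment-level filtering on token probabilities could look like the sketch below. The token/probability format here is hypothetical; the Premium API's actual schema may differ, so treat this only as an illustration of the retain-vs-discard logic.

```python
# Sketch: keep tokens whose AI probability is below a threshold, so
# human-edited segments survive while fully synthetic runs are dropped.
# The per-token probability format is an assumption for illustration.

def keep_human_spans(tokens, probs, threshold=0.5):
    """Return the text rebuilt from tokens scored as likely-human."""
    kept = [tok for tok, p in zip(tokens, probs) if p < threshold]
    return " ".join(kept)

tokens = ["The", "study", "found", "delve", "into", "the", "tapestry"]
probs = [0.05, 0.10, 0.08, 0.95, 0.90, 0.92, 0.97]
print(keep_human_spans(tokens, probs))  # -> "The study found"
```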

Technical approach

A model you
can trust

Built for engineers who need confidence in their data filtering. Our model addresses false positives, adversarial robustness, and evolving AI outputs.

Hard Negative Mining

We train on 'hard negatives' — human writing that is stylistically formal or repetitive — to minimize false positives and ensure you don't discard valuable human data.

Adversarial Robustness

Pangram handles paraphrased or modified AI content. Our models are trained against "humanizers" and adversarial attacks to detect obfuscated synthetic text.

Future-Proofing

Detects text from the latest models including GPT-5, Claude 3.5, and Llama 3, ensuring your filters stay ahead of the current SOTA.

Integration

Built for your
data pipeline

01

Python SDK

Install pangram-sdk and integrate detection into your Airflow or Databricks pipelines with just a few lines of code. Optimized for connection pooling and error handling.

View Docs →

02

High-Throughput
API

Process massive datasets with low latency. Our infrastructure supports batching and guarantees throughput, handling millions of requests for enterprise scraping operations.
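One common pattern for driving a high-throughput scoring API is a bounded thread pool. The sketch below uses a stand-in scoring function in place of a real `client.predict()` call; worker counts and the response shape are illustrative assumptions, not documented defaults.

```python
# Hypothetical throughput sketch: score documents concurrently with a
# thread pool. fake_score() is a stand-in for a real API call; the
# worker count and response shape are assumptions for illustration.
from concurrent.futures import ThreadPoolExecutor

def score_corpus(docs, score_fn, max_workers=8):
    """Score documents in parallel, pairing each doc with its result."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(score_fn, docs))
    return list(zip(docs, results))

def fake_score(text):
    # Stand-in for client.predict(text).
    return {"fraction_ai": 0.0}

scored = score_corpus(["doc one", "doc two", "doc three"], fake_score)
```

For very large corpora, batching documents per request (where the API supports it) typically beats one-request-per-document.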

Get API Key →

03

Security &
Compliance

Fully SOC 2 Type 2 certified. We offer private endpoints and strict data retention policies — we never train on your proprietary inputs.

Learn more →

FAQ

Frequently asked questions about AI detection

Common questions about AI detection for ML engineers
and data scientists.

How is the detection model trained?
Our model is trained on a diverse, proprietary dataset of millions of paired human and AI documents. We use active learning to target edge cases and specifically reduce bias against ESL writers.

What does the API return?
The API returns a prediction_score (0.0 to 1.0) and a categorical label. Advanced endpoints provide window-level analysis to visualize "burstiness" and syntax patterns across the document.

Do you store or train on my data?
No. For enterprise clients, we offer zero-retention guarantees where data is processed in memory and discarded immediately after scoring to ensure privacy.

Does Pangram detect text from the latest models?
Yes. We continuously retrain our classifier on outputs from new frontier models (like Gemini Ultra and GPT-4) within days of their release.

Can Pangram detect paraphrased or "humanized" AI text?
Our models are trained specifically against adversarial attacks and "humanizers" that attempt to obfuscate synthetic text. By using hard negative mining during training, we minimize false positives on stylistically formal human writing.
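The window-level analysis mentioned in the answers above could be consumed like this. The `windows` schema shown is an assumption for illustration only; consult the API docs for the real response shape.

```python
# Hypothetical sketch of consuming a window-level detection response.
# The "windows" field and its keys are assumed for illustration.

response = {
    "prediction_score": 0.72,
    "label": "likely_ai",
    "windows": [
        {"start": 0, "end": 120, "score": 0.10},
        {"start": 120, "end": 240, "score": 0.95},
        {"start": 240, "end": 360, "score": 0.88},
    ],
}

# Collect the character spans that look synthetic.
ai_spans = [
    (w["start"], w["end"])
    for w in response["windows"]
    if w["score"] > 0.5
]
```

Spans flagged this way can be highlighted for review or excised while the surrounding human-scored text is retained.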

Does Pangram integrate with existing data pipelines?
Yes. You can install the pangram-sdk to integrate detection into Airflow or Databricks pipelines with just a few lines of code. Our API is optimized for high-throughput enterprise scraping operations, supporting millions of requests with low latency.

How is Pangram different from binary AI detectors?
Unlike binary detectors, Pangram provides token-level probabilities. This granular interpretability allows you to identify and retain human-edited segments while filtering out fully synthetic "slop" from your training datasets.
How does Pangram help prevent model collapse?
Using Pangram helps prevent model collapse. By filtering recursive AI-generated content from your scraping pipelines, you maintain corpus purity and ensure your models don't degrade in performance or diversity due to training on bad data.

Clean your training data today

Prevent model collapse, verify RLHF inputs, and filter synthetic content from your datasets with 99.98% accuracy.