AuthentiScan currently combines two complementary machine-learning approaches to AI-image detection: a Tabular model that ingests engineered forensic signals, and a Vision CNN that learns visual patterns directly from pixels. We fuse and calibrate their outputs to present a transparent, human-readable result.
Tabular model (engineered forensic signals)
The tabular model aggregates measurable signals that often differ between camera-native photos and model-generated images. Examples include (a small extraction sketch follows this list):
- Metadata strength: EXIF presence/consistency (when available), orientation flags, device fields.
- Compression cues: JPEG quantization patterns, block grid periodicity, DCT-domain stats.
- Spectral/texture stats: FFT energy distribution, noise level estimates, color channel skew.
- Heuristics: repetition/banding indicators, overly-smooth skin or microtexture deficits, edge halos.
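To make a few of these signals concrete, here is a minimal sketch of one way to compute them (EXIF field count, high-frequency FFT energy, a crude noise estimate, and per-channel skew). The helper name extract_features and the exact formulas are illustrative, not our production extractor.

```python
# Illustrative forensic-feature extraction (names and formulas are examples only).
import numpy as np
from PIL import Image

def extract_features(path: str) -> dict:
    img = Image.open(path)

    # Metadata strength: number of populated EXIF fields (0 if stripped).
    exif_count = len(img.getexif() or {})

    # Spectral stats: share of FFT energy outside the low-frequency center of the luma channel.
    gray = np.asarray(img.convert("L"), dtype=np.float32)
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(gray)))
    h, w = spectrum.shape
    cy, cx = h // 2, w // 2
    low = spectrum[cy - h // 8: cy + h // 8, cx - w // 8: cx + w // 8].sum()
    high_freq_ratio = 1.0 - low / (spectrum.sum() + 1e-8)

    # Noise estimate: spread of the residual after a simple 2x2 local mean.
    local_mean = (gray[:-1, :-1] + gray[1:, :-1] + gray[:-1, 1:] + gray[1:, 1:]) / 4.0
    noise_std = float(np.std(gray[:-1, :-1] - local_mean))

    # Color channel skew: asymmetry of each RGB channel's distribution.
    rgb = np.asarray(img.convert("RGB"), dtype=np.float32)
    skews = [float(((c - c.mean()) ** 3).mean() / (c.std() ** 3 + 1e-8))
             for c in rgb.transpose(2, 0, 1)]

    return {
        "exif_count": exif_count,
        "high_freq_ratio": float(high_freq_ratio),
        "noise_std": noise_std,
        "skew_r": skews[0], "skew_g": skews[1], "skew_b": skews[2],
    }
```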
These features are normalized and fed to a lightweight classifier (e.g., gradient-boosted trees or logistic ensemble). The tabular model is fast, robust to many image sizes, and excels when metadata and compression traces survive. It’s also highly interpretable—great for our “why did we think this?” breakdown.
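As a rough illustration of that stage, the sketch below normalizes a placeholder feature matrix and fits a gradient-boosted classifier with scikit-learn; the data is synthetic and the model choice and hyperparameters are examples, not our exact configuration.

```python
# Sketch of the tabular classifier stage (synthetic placeholder data).
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                   # placeholder engineered forensic features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # placeholder labels: 1 = AI-generated

tabular_model = make_pipeline(
    StandardScaler(),                           # normalize feature scales
    HistGradientBoostingClassifier(max_iter=200),
)
tabular_model.fit(X, y)
tab_score = tabular_model.predict_proba(X[:1])[0, 1]   # P(AI-generated) for one image
```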
Vision CNN (learned pixel patterns)
Our vision model is a convolutional neural network trained on diverse real vs. AI-image datasets (multiple generators, prompts, and styles). It ingests resized crops/patches and leverages data augmentation (resize, slight blur, JPEG re-encode, small color jitter) to reduce overfitting to any single source.
- Input: RGB patches/images with standardized preprocessing.
- Architecture: modern CNN backbone with global pooling and a calibrated sigmoid head (sketched after this list).
- Training: class-balanced sampling, strong augmentations, and hard-example mining on ambiguous cases.
- Calibration: temperature scaling + isotonic regression to align probability outputs with observed accuracy.
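A minimal sketch of that shape, assuming a PyTorch/torchvision stack; the ResNet-50 backbone, fixed temperature, and input size are placeholders rather than the deployed model.

```python
# Illustrative vision model: CNN backbone, global pooling, calibrated sigmoid head.
import torch
import torch.nn as nn
from torchvision import models

class AIImageDetector(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=None)   # stand-in for "a modern CNN backbone"
        backbone.fc = nn.Identity()                # keep the globally pooled features
        self.backbone = backbone
        self.head = nn.Linear(2048, 1)             # single logit: "AI-generated?"
        # Temperature is fitted post hoc (temperature scaling), not learned during training.
        self.temperature = nn.Parameter(torch.ones(1), requires_grad=False)

    def forward(self, x):                          # x: (N, 3, H, W), standardized RGB
        logit = self.head(self.backbone(x))
        return torch.sigmoid(logit / self.temperature)   # calibrated probability

model = AIImageDetector()
probs = model(torch.randn(4, 3, 224, 224))         # four dummy patches -> four probabilities
```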
The CNN shines when metadata is missing or images have been re-encoded: it captures subtle textural cues that are hard to hand-engineer. We still surface limitations (“uncertain” zones) where both models disagree or cues are weak.
How we combine them
We compute both scores, check agreement, then apply a small calibrator (stacking) to produce the final estimate with confidence. When models disagree, we show you the evidence (e.g., ELA overlays, FFT heatmaps) so you can weigh context and decide. Results are probabilistic—we never present a single opaque verdict.
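For illustration, here is a minimal stacking sketch over the two scores. Logistic regression as the calibrator, the synthetic calibration data, and the 0.2 agreement threshold are assumptions for the example, not our exact settings.

```python
# Sketch of score fusion: a small stacking calibrator plus an agreement check.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
tab_scores = rng.uniform(size=500)                                  # placeholder held-out scores
cnn_scores = np.clip(tab_scores + rng.normal(scale=0.15, size=500), 0, 1)
labels = ((tab_scores + cnn_scores) / 2 + rng.normal(scale=0.1, size=500) > 0.5).astype(int)

stacker = LogisticRegression()
stacker.fit(np.column_stack([tab_scores, cnn_scores]), labels)

def fuse(tab_p: float, cnn_p: float) -> dict:
    final = float(stacker.predict_proba(np.array([[tab_p, cnn_p]]))[0, 1])
    return {
        "probability_ai": final,
        "models_agree": abs(tab_p - cnn_p) < 0.2,   # disagreement triggers the evidence view
    }

print(fuse(0.92, 0.35))   # disagreement -> show ELA/FFT evidence alongside the estimate
```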
Evaluation (ongoing)
- Holdout sets covering multiple generators (Midjourney, SDXL, Flux, etc.) and camera sources.
- Robustness sweeps across JPEG quality, downscale, light blur, and cropping (a minimal sweep sketch follows this list).
- Drift checks as new AI models emerge; we periodically retrain and re-calibrate.
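One of those sweeps, sketched: re-encode and rescale an image, then rescore each variant. The scoring callable score_image is a stand-in for the full pipeline, and the quality/scale grid is illustrative.

```python
# Sketch of a robustness sweep over common degradations.
import io
from PIL import Image, ImageFilter

def perturbations(img: Image.Image):
    img = img.convert("RGB")
    for quality in (95, 75, 50, 30):                 # JPEG quality sweep
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=quality)
        buf.seek(0)
        yield f"jpeg_q{quality}", Image.open(buf)
    for scale in (0.5, 0.25):                        # downscale sweep
        w, h = img.size
        yield f"downscale_{scale}", img.resize((max(1, int(w * scale)), max(1, int(h * scale))))
    yield "light_blur", img.filter(ImageFilter.GaussianBlur(radius=1))

def sweep(img: Image.Image, score_image):
    # score_image: any callable returning P(AI-generated) for a PIL image.
    return {name: score_image(variant) for name, variant in perturbations(img)}
```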
Data handling
We process files transiently to compute signals and model outputs, and we don’t retain uploads from user scans. For internal training we use curated datasets and opt-in collections. See Privacy for details.
Known limitations
- Heavy recompression or platform relays (e.g., chat apps) can erase useful cues.
- Strong post-processing (beauty filters, denoise, upscalers) can mimic model-like texture.
- Small, low-detail images can be ambiguous; we label these as “uncertain” rather than overconfident.
What we’re training next
1) AI video detection
We’re training a video pipeline that samples frames and short clips, then aggregates evidence across time. The stack includes:
- Frame-level models: our current tabular + CNN analysis applied to representative frames.
- Temporal features: consistency of textures, flicker/warping, lip-sync anomalies, motion vectors.
- Clip models: lightweight 3D CNN/transformer blocks for short segments, fused with frame scores.
Output will show a per-frame timeline plus a clip-level estimate, highlighting where artifacts concentrate.
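A rough sketch of that frame-sampling path, assuming OpenCV for decoding; the sampling rate, the 0.8 highlight threshold, median aggregation, and score_frame are placeholders while the temporal and clip models are still in training.

```python
# Sketch of the planned video path: sample frames, score each, aggregate over time.
import cv2
import numpy as np

def score_video(path: str, score_frame, every_n: int = 30) -> dict:
    cap = cv2.VideoCapture(path)
    timeline = []                                    # (frame_index, probability) pairs
    idx = 0
    while True:
        ok, frame = cap.read()                       # frame: BGR numpy array
        if not ok:
            break
        if idx % every_n == 0:                       # roughly one frame per second at 30 fps
            timeline.append((idx, float(score_frame(frame))))
        idx += 1
    cap.release()
    probs = np.array([p for _, p in timeline])
    return {
        "per_frame": timeline,                                   # drives the timeline view
        "clip_estimate": float(np.median(probs)) if len(probs) else None,
        "suspect_frames": [i for i, p in timeline if p > 0.8],   # where artifacts concentrate
    }
```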
2) General AI content detection
Beyond images and video, we’re expanding to broader AI-generated content signals:
- Text heuristics + embeddings: burstiness, repetition, lexical variety, and semantic fingerprints (two of these are sketched after this list).
- Cross-modal corroboration: do image/video/text claims support each other?
- Provenance: C2PA/Content Credentials parsing when available; we surface provenance above any score.
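Two of those text heuristics, sketched with deliberately naive definitions: burstiness as the spread of sentence lengths, lexical variety as a type-token ratio, plus a simple repeated-bigram rate. The splitting rules and thresholds here are simplifications for illustration, not a finished detector.

```python
# Naive text-heuristic sketch: burstiness, lexical variety, repetition.
import re
import statistics

def text_signals(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    words = re.findall(r"[A-Za-z']+", text.lower())
    bigrams = list(zip(words, words[1:]))
    return {
        "burstiness": statistics.pstdev(lengths) if len(lengths) > 1 else 0.0,
        "type_token_ratio": len(set(words)) / max(len(words), 1),
        "repetition": 0.0 if not bigrams else 1.0 - len(set(bigrams)) / len(bigrams),
    }
```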
As always, we’ll present transparent rationales and uncertainty—detection is one input to your judgment, not a final arbiter.