← Field log Matching NVIDIA's flagship PII detector with a model 3× smaller

Matching NVIDIA's flagship PII detector with a model 3× smaller

AI agents are moving into healthcare, insurance, and finance — exactly where the data is most sensitive. Sending raw patient records or claims to a hosted LLM is, for a lot of teams, a non-starter.

The usual fix is to redact PII into placeholder tokens (PERSON_1, EMAIL_2). It doesn't work well. Over a long context, an LLM loses track of meaningless symbols and reasons worse.

We wanted two things: detection that never leaves the machine, and masking that doesn't break the model. So we built pii-proxy and open-sourced it (MIT).

What it does

import { PrivacyProxy } from 'pii-proxy';

const proxy = PrivacyProxy.withLocalLlm({ model: 'qwen3:1.7b' });

const masked = await proxy.mask(
  "Patient Marcus Weber, marcus.weber@gmail.com"
);
// → "Patient James Thompson, lizeth53@yahoo.com"

// send masked.text to your LLM, then:
const real = proxy.unmask(llmResponse);

Three steps:

  • Detect — deterministic regex for structured types (emails, IPs, IDs), then a local model for what regex can't catch (names, organizations, locations, domain-specific entities).
  • Replace — each value becomes a plausible fake, "Marcus Weber" → "James Thompson", so the LLM reads natural text.
  • Reverse — a bijective map restores every substitution when results flow back to your systems. Exact round-trip.

The plausible-fakes part matters. The model reasons over fluent text instead of stumbling over PERSON_1, and you still get the real values back.

The benchmark

The interesting result is the detector. NVIDIA publishes gliner-PII, a strong model based on gliner_large-v2.1 (~300M params) — but ships only the weights, no training recipe.

We took gliner_small-v2.1 (~50M params, 3× smaller) and fine-tuned it on the full Nemotron-PII dataset: 100k records, 55 entity types, 30 domains. Three epochs, 4×A100 on Modal, bf16. About 39 minutes and ~$8.

ModelBaseF1PrecisionRecall
Ours (gliner_small-v2.1)~50M90.1%90.5%89.7%
nvidia/gliner-PII~300M89.3%90.0%88.6%

Both evaluated on the same 500-record held-out set, with the same 55 native labels, at threshold 0.5.

Benchmark: our gliner_small-v2.1 (~50M) scores 90.1% F1, matching nvidia/gliner-PII (~300M) at 89.3%. Trained in 39 min for ~$8, recipe open, runs on-device

The caveats

+0.8pp is a tie, not a victory. On a 500-record test set, one misclassified record moves the number ~0.1–0.2pp. Treat ours and NVIDIA's as equal. Both models trained on the same synthetic distribution as the test set. Nemotron-PII is synthetic, and NVIDIA's model was trained on it too. This measures "can we reproduce flagship training quality at 3× smaller" — not generalization to messy real-world text. The real test is still owed. Independent research finds that popular PII models look strong on their own benchmarks but have real gaps on noisy, real-world, and novel data — and that their model cards overstate them (Unmasking the Reality of PII Masking Models). The next step is out-of-domain clinical and customer data, and we'll publish those numbers too — whatever they say.

The claim, then: a base model 3× smaller, the same data, and a reproducible recipe yield flagship-class results. The recipe is public — reproduce it with one modal run.

The number that matters is yours

90% is on synthetic data. On your actual records — your abbreviations, your form fields, your entity types — an off-the-shelf detector scores lower. The OOD research above is the rule, not the exception.

That's not a flaw in the approach; it's the reason it works. The recipe is cheap and reproducible so it can be pointed at your distribution: feed it your domain's text, compile a specialist, and measure it on your data — not someone else's benchmark. The pluggable detection layer lets you mix regex, a local model, and a compiled specialist per entity type.

If you're putting this into production somewhere the data is sensitive and the domain is specific, that last mile — your data, your numbers — is the part worth getting right.

The numbers we most want to see are the ones we don't have yet: real clinical, financial, and legal text. If you're deploying this somewhere real, we'd like to measure it with you — on your infrastructure. hello@daslab.run

Why small matters: it runs where the data lives

A 50M model runs on a laptop, a phone, or a $249 edge board. That changes the privacy story. Detection happens on your infrastructure, and the PII never leaves your network — not even to be found. No data-residency exception to file, no third-party processor to vet for the detection step.

The bigger bet

pii-proxy is our first public data point for something we think is true broadly: the minimal model to solve a specific task is one to two orders of magnitude smaller than the general model. Generalization is what you pay the frontier model for, and most real deployments don't need it. You can compile that intelligence down into a tiny, fast, verifiable specialist that runs at the edge.

PII is where we started, because the pain is real and immediate. It won't be where we stop.

Code and reproducible benchmarks: github.com/daslabhq/pii-proxy