🛡️ Shield 82M: A PII stripping/filtering model 🛡️
Posted by LH-Tech_AI@reddit | LocalLLaMA | View on Reddit | 27 comments
Hey, r/LocalLLaMA!
I am finally back with a new model: 🛡️ Shield 82M
It's a fine-tuned version of distilroberta-base, and it can filter out all types of PII (personally identifiable information) from text in any language.
Here are some examples:
1) Test with name, email, and phone:
Original: My name is John Doe. Email: john@example.com. Phone: +49 123 45678.
Protected: My name is [PERSON]. Email: [EMAIL]. Phone: [PHONE].
2) Basic test:
Original: I live in Cambridge
Protected: I live in [ADDRESS]
3) French test (multilingual):
Original: Mon e-mail est jean.dupont@example.fr et mon téléphone est +33 6 12 34 56 78.
Protected: Mon e-mail est [EMAIL] et mon téléphone est [PHONE].
So the model performs really well, with a total accuracy of ~96%.
And: it's completely open-source like all my models. :D
If you want to try it out: https://huggingface.co/LH-Tech-AI/Shield-82M
Have fun with it. :-)
See you in the comments. Would really like to get some feedback from you.
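For anyone curious what the redaction step looks like in code: a token-classification model like this emits entity spans, and the bracketed output above is just those spans spliced over the text. A minimal sketch in pure Python (the span format mirrors Hugging Face's aggregated pipeline output; the spans and label names here are hand-written assumptions, not actual model output):

```python
def redact(text, entities):
    """Replace detected PII spans with bracketed tags.

    `entities` mirrors the output of a Hugging Face token-classification
    pipeline with aggregation enabled: dicts with entity_group/start/end.
    """
    out = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        out.append(text[cursor:ent["start"]])       # text before the span
        out.append(f"[{ent['entity_group']}]")      # the placeholder tag
        cursor = ent["end"]
    out.append(text[cursor:])                       # trailing text
    return "".join(out)

text = "My name is John Doe. Email: john@example.com."
# Spans as a model like Shield 82M might return them (hand-written here):
ents = [
    {"entity_group": "PERSON", "start": 11, "end": 19},
    {"entity_group": "EMAIL", "start": 28, "end": 44},
]
print(redact(text, ents))  # My name is [PERSON]. Email: [EMAIL].
```

The actual model call would produce the `entities` list; the splice-and-join step stays the same either way.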
Bootes-sphere@reddit
Solid contribution. One thing I'd test hard: how does it handle edge cases like variation in formatting? PII detection often breaks on stuff like "john.doe@company.com" vs "john doe @ company dot com" or dates written different ways.
Also curious about false positives on legitimate text — I've seen aggressive filtering strip things that shouldn't be redacted (product names, technical identifiers, etc.). Did you benchmark against a dataset with intentional false positives?
The 82M size is smart for local inference. What's your latency on CPU vs GPU? And are you handling structured data (JSON, CSVs) or mainly unstructured text? That matters a lot for production use.
LH-Tech_AI@reddit (OP)
Thanks so much for your great feedback.
I didn't measure the inference time, but it's very quick, especially on GPU (<100 ms per text).
It can only handle unformatted text, I think (no JSON, CSVs, etc.), but feel free to just try it.
have fun :D
BitGreen1270@reddit
I have another follow up question if you don't mind - would this also be possible via a lora on a smaller open source model like Gemma4-E2B? What are the benefits/downsides of that approach?
LH-Tech_AI@reddit (OP)
I think a LoRA on a small LLM like Gemma4-E2B would be even more impressive than my tool, AND it would also "think" about the context...
But my idea was to create a model that JUST ERASES PII.
But you can try it - dataset is in the HF Repo (link above) :-)
BitGreen1270@reddit
Haha maybe in a year's time I might attempt it 😄
LH-Tech_AI@reddit (OP)
haha, okay :D
Noxusequal@reddit
That's really cool, but how does it do with secondary identifiers? For example, the person being the only doctor in a village, or other cases where you can use secondary info to identify the person.
LH-Tech_AI@reddit (OP)
Fair point :D
Honestly, Shield 82M is a fast scrubber, not a "thinking" LLM. It's only 82M params, so it won't reason through social contexts like "only doctor in town" 😂 Its job is to find and nuke explicit identifiers (names, locations, jobs) as fast as possible.
Noxusequal@reddit
Also nice :) But can't you do that with smart regex? Like phone numbers, addresses, etc.
LH-Tech_AI@reddit (OP)
Maybe... I don't know.
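For what it's worth, regex does cover the rigidly formatted stuff like emails and phone numbers; where it falls over is free-form PII like names, which is the part a trained NER model handles. A quick sketch (the patterns are deliberately simplified demo patterns, not production-grade):

```python
import re

# Simplified patterns -- fine for a demo, nowhere near exhaustive.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s()-]{6,}\d")

def regex_scrub(text):
    """Replace email addresses and phone numbers with placeholder tags."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

print(regex_scrub("Mail: jean.dupont@example.fr, tel: +33 6 12 34 56 78."))
# Mail: [EMAIL], tel: [PHONE].
# Regex has no way of knowing "John Doe" is a name, though -- that's
# where a trained model earns its keep.
```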
Perfect-Flounder7856@reddit
Is there a pipeline in place to scrub pii on the way out and add it back in on the way in at all?
LH-Tech_AI@reddit (OP)
No, but you could easily build that yourself if you want. I don't think it would be too much code... :-)
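The round trip the commenter describes (scrub before sending text out, restore the originals on the way back in) really is only a few lines once you keep a placeholder-to-value map. A hypothetical sketch, assuming the detector hands you (tag, value) pairs (the pairs below are hand-written for illustration):

```python
def scrub(text, findings):
    """Replace each PII value with a numbered placeholder; keep a map.

    `findings` is a list of (tag, value) pairs, as a detector like
    Shield 82M could produce (hand-written here for illustration).
    """
    mapping = {}
    for i, (tag, value) in enumerate(findings):
        placeholder = f"[{tag}_{i}]"
        text = text.replace(value, placeholder)
        mapping[placeholder] = value
    return text, mapping

def restore(text, mapping):
    """Put the original values back into e.g. an LLM reply."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text

msg = "Contact John Doe at john@example.com."
scrubbed, m = scrub(msg, [("PERSON", "John Doe"), ("EMAIL", "john@example.com")])
print(scrubbed)                      # Contact [PERSON_0] at [EMAIL_1].
print(restore(scrubbed, m) == msg)   # True
```

Numbering the placeholders matters: two different people both becoming `[PERSON]` would be impossible to restore unambiguously.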
Karyo_Ten@reddit
Do you have examples where this model is better than an expert system or regex or some kind of PEG grammar?
BitGreen1270@reddit
My thoughts too on whether regex would work. But the fact that this can work on different kinds of data (CSV, JSON, plain text, HTML, etc.) makes it useful, I think.
LH-Tech_AI@reddit (OP)
Yes.
LH-Tech_AI@reddit (OP)
No, not yet. But you can try it yourself if you want to :D
Link: https://huggingface.co/LH-Tech-AI/Shield-82M
Have fun :D
BitGreen1270@reddit
Tried this just now on a fake PII doc generated by Gemini. Seems to work reasonably well. A couple of callouts: the name didn't get redacted fully (it left the initials in), and if an address has multiple parts, it replaces it with multiple `[ADDRESS]` tokens.
Here is the input: https://ctxt.io/2/AAD4l4UrEg
Here is the output: https://ctxt.io/2/AAD4LxuHEQ
LH-Tech_AI@reddit (OP)
Okay. Great. Thanks for feedback :-)
ThirdWaveCat@reddit
Cool work! I'm using Presidio for something similar, and I suspect many people are. Benchmarks showing how it compares would be helpful.
LH-Tech_AI@reddit (OP)
Yes, you're right. I haven't had time yet to run benchmarks comparing it with other models, but it does have ~96% accuracy. :-)
Thank you
Porespellar@reddit
OpenAI released something similar a few days ago
https://openai.com/index/introducing-openai-privacy-filter/
What I really need for a use case I’m working on is a PHI filter for screening out Protected Health Information.
Endlesscrysis@reddit
Synthesize a dataset and fine-tune it; there are guides in their repo for fine-tuning.
LH-Tech_AI@reddit (OP)
Ah, yes, right. I've heard about it.
fgp121@reddit
I guess this could be useful for mobile runtimes. Apps might be keen on having such a model handy.
LH-Tech_AI@reddit (OP)
Yes :D
BitGreen1270@reddit
This is very cool - can you share more on how you created a focused model like this one? Will give it a try later!
LH-Tech_AI@reddit (OP)
All info is here: https://huggingface.co/LH-Tech-AI/Shield-82M
Nice that you like it :D