HumanMint - Normalizing & Cleaning Government Contact Data
Posted by AmbitiousTie@reddit | Python | 1 comment
Hey r/Python!
I just released a small library I've been building at work for cleaning messy human-centric data: HumanMint.
Think government contact records with chaotic names, weird phone formats, noisy department strings, inconsistent titles, etc.
It was coded in a single day, so expect some rough edges, but the core works surprisingly well.
Note: This is my first public library, so feedback and bug reports are very welcome.
What it does (all in one mint() call)
- Normalize and parse names
- Infer gender from first names (probabilistic, optional)
- Normalize + validate emails (generic inboxes, free providers, domains)
- Normalize phones to E.164, extract extensions, detect fax/VoIP/test numbers
- Parse US postal addresses into components
- Clean + canonicalize departments (23k -> 64 mappings, fuzzy matching)
- Clean + canonicalize job titles
- Normalize organization names (strip civic prefixes)
- Batch processing (bulk()) and record comparison (compare())
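To make the phone bullet concrete, here is a rough standard-library sketch of normalizing a US number to E.164 and pulling off an extension. It is not HumanMint's implementation; the helper name `to_e164_us` and the regex are assumptions made up for illustration, and it only handles US numbers.

```python
import re
from typing import Optional, Tuple

# Hypothetical helper for illustration only; not part of HumanMint's API.
# Matches a trailing "ext 200", "ext. 200" or "x200", but not the "x" in "Fax".
_EXT_RE = re.compile(r"(?<![A-Za-z])(?:ext\.?|x)\s*(\d+)\s*$", re.IGNORECASE)


def to_e164_us(raw: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (e164_number, extension); the number is None if it is not a 10-digit US number."""
    ext = None
    match = _EXT_RE.search(raw)
    if match:
        ext = match.group(1)
        raw = raw[:match.start()]  # drop the extension before counting digits

    digits = re.sub(r"\D", "", raw)  # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # strip a leading US country code
    if len(digits) != 10:
        return None, ext
    return "+1" + digits, ext


print(to_e164_us("(202) 555-0173"))
print(to_e164_us("850-123-1234 ext 200"))
```

Strict E.164 has no spaces or hyphens, so the sketch returns the bare `+1` plus ten digits.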
Example
from humanmint import mint

result = mint(
    name="Dr. John Smith, PhD",
    email="JOHN.SMITH@CITY.GOV",
    phone="(202) 555-0173",
    address="123 Main St, Springfield, IL 62701",
    department="000171 - Public Works 850-123-1234 ext 200",
    title="Chief of Police",
)
print(result.model_dump())
Result (simplified):
- name: John Smith
- email: john.smith@city.gov
- phone: +1 202-555-0173
- department: Public Works
- title: police chief
- address: 123 Main Street, Springfield, IL 62701, US
- organization: None
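For a feel of how a noisy department string like the one above could collapse to "Public Works", here is a minimal sketch using only `re` and `difflib`. This is an assumption about the general approach (strip noise, then fuzzy-match against a canonical list), not the library's actual code. The `CANONICAL` list, the `clean_department` name, and the 0.8 cutoff are all invented for the example.

```python
import difflib
import re

# Tiny illustrative canonical list; the real mapping table is described as far larger.
CANONICAL = ["Public Works", "Police", "Finance", "Parks and Recreation"]

_PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")  # US-style phone numbers
_EXT_RE = re.compile(r"\b(?:ext\.?|x)\s*\d+", re.IGNORECASE)    # "ext 200", "x200"
_CODE_RE = re.compile(r"^\s*\d+\s*-\s*")                        # leading "000171 - " codes


def clean_department(raw):
    """Strip phone numbers, extensions and numeric codes, then fuzzy-match a canonical name."""
    text = _PHONE_RE.sub(" ", raw)
    text = _EXT_RE.sub(" ", text)
    text = _CODE_RE.sub("", text)
    text = " ".join(text.split())  # collapse leftover whitespace

    lookup = {name.lower(): name for name in CANONICAL}
    matches = difflib.get_close_matches(text.lower(), list(lookup), n=1, cutoff=0.8)
    # Fall back to the cleaned text when nothing is close enough.
    return lookup[matches[0]] if matches else (text or None)


print(clean_department("000171 - Public Works 850-123-1234 ext 200"))
```

The cutoff trades false matches against misses; a real mapping would probably need tuning per dataset.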
Why I built it
I work with thousands of US local-government contacts, and the raw data is wildly inconsistent.
I needed a single function that takes whatever garbage comes in and returns something normalized, structured, and predictable.
Features beyond mint()
- bulk(records) for parallel cleaning of large datasets
- compare(a, b) for similarity scoring
- A full set of modules if you only want one thing (emails, phones, names, departments, titles, addresses, orgs)
- Pandas .humanmint.clean accessor
- CLI: humanmint clean input.csv output.csv
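As an illustration of what similarity scoring between two cleaned records might look like, here is a small sketch built on `difflib.SequenceMatcher`. It does not use or describe HumanMint's `compare()`; the function names, the field weights, and the dict-based record shape are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical field weights, chosen for illustration only.
WEIGHTS = {"email": 0.4, "phone": 0.3, "name": 0.3}


def field_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def compare_records(a: dict, b: dict) -> float:
    """Weighted similarity in [0, 1] over the fields present in both records."""
    total = 0.0
    weight_sum = 0.0
    for field, weight in WEIGHTS.items():
        if a.get(field) and b.get(field):
            total += weight * field_similarity(a[field], b[field])
            weight_sum += weight
    # No shared fields means there is nothing to compare.
    return total / weight_sum if weight_sum else 0.0


rec_a = {"name": "John Smith", "email": "john.smith@city.gov", "phone": "+12025550173"}
rec_b = {"name": "Jon Smith", "email": "john.smith@city.gov", "phone": "+12025550173"}
print(round(compare_records(rec_a, rec_b), 3))
```

Normalizing over only the shared fields keeps a missing phone or email from dragging the score down.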
Install
pip install humanmint
Repo
https://github.com/RicardoNunes2000/HumanMint
If anyone wants to try it, break it, suggest improvements, or point out design flaws, I'd love the feedback.
The whole goal was to make dealing with messy human data as painless as possible.
Scared_Sail5523@reddit
It's cool