LLM for name/gender classification

Posted by trosler@reddit | LocalLLaMA | 11 comments

Hey there,
I have a task where I have a huge list of names (e.g. John Smith), and I want to use an LLM to assign a gender to each name (m/f/ambiguous). I have read some research papers that recommended mistral-nemo for this task, yet in my own tests the results were mixed. When I run the model on identical data, the results vary a lot between runs, sometimes even for very clear names (e.g. John Smith). I hand the LLM the prompt and include a short list of names (say, 10 at a time).
- Can you recommend a local LLM for this task?
- Is this "batch" approach fine?
Thanks for ideas and input.
PS: for the "easy" names I used another Python library, so only the truly difficult names remain in the actual dataset.
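
One way to make the batch approach easier to check is to ask for a strictly structured reply and reject anything that does not match the batch. The sketch below is only an illustration: `build_prompt` and `parse_reply` are hypothetical helper names, the prompt wording is an assumption, and no particular model or runtime is called. If your runtime exposes a temperature setting, lowering it (often to 0) generally reduces run-to-run variation, though it may not remove it completely.

```python
import json

ALLOWED = {"m", "f", "ambiguous"}  # the three labels from the post


def build_prompt(names):
    """Build one prompt for a batch of names, asking for a JSON array reply."""
    numbered = "\n".join(f"{i + 1}. {name}" for i, name in enumerate(names))
    return (
        "Classify the likely gender of each name as m, f, or ambiguous.\n"
        f"Reply with only a JSON array of exactly {len(names)} strings, "
        "in the same order as the names.\n\n" + numbered
    )


def parse_reply(reply, names):
    """Return a list of (name, label) pairs, or None if the reply is unusable."""
    try:
        labels = json.loads(reply)
    except json.JSONDecodeError:
        return None
    # A batch reply must contain exactly one label per name.
    if not isinstance(labels, list) or len(labels) != len(names):
        return None
    # Every label must be one of the allowed strings.
    if any(not isinstance(label, str) or label not in ALLOWED for label in labels):
        return None
    return list(zip(names, labels))
```

A batch whose reply parses to `None` can be retried or split into smaller batches, so a single malformed answer does not silently shift labels onto the wrong names.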

Daemontatox@reddit

I highly suggest not using LLMs. As you have noticed, the results won't be deterministic or as accurate as you hope, and some names will introduce ambiguity.
I suggest taking a look at BERT-based classification models or their successors instead (I'm not really up to date with the topic or task); they should be faster and better.

For example, you can try this model and compare the results to the nemo results you already have.

Another repo that should prove helpful.

CooperDK@reddit

That would have been true two years ago. Modern LLMs are much better.

mtmttuan@reddit

If you only have the name then you should just use a lookup table. Probably more accurate than any models.

If you're fancy, you can aggregate all occurrences of each name and calculate how often that name belongs to a male vs. a female.
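
A minimal sketch of that aggregation, assuming you already have `(name, sex, count)` records from some source. The counts below are made up for illustration, the 0.9 threshold is an arbitrary choice, and the `"unknown"` label for names missing from the table is an addition that is not part of the m/f/ambiguous scheme in the post.

```python
from collections import defaultdict


def build_table(records):
    """Sum counts per name and sex; records are (name, sex, count) tuples."""
    table = defaultdict(lambda: {"m": 0, "f": 0})
    for name, sex, count in records:
        table[name.lower()][sex] += count
    return table


def classify(name, table, threshold=0.9):
    """Label a first name by the male share of its occurrences."""
    counts = table.get(name.lower())
    if counts is None:
        return "unknown"  # name not in the table at all
    total = counts["m"] + counts["f"]
    if total == 0:
        return "unknown"
    share_m = counts["m"] / total
    if share_m >= threshold:
        return "m"
    if share_m <= 1 - threshold:
        return "f"
    return "ambiguous"


# Made-up counts, for illustration only.
records = [
    ("John", "m", 980), ("John", "f", 20),
    ("Mary", "f", 1000),
    ("Leslie", "m", 300), ("Leslie", "f", 700),
]
table = build_table(records)
```

With these toy numbers, `classify("Leslie", table)` falls between the two thresholds and comes back as ambiguous, and it does so on every run.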

RiseStock@reddit

If you are in the USA, the SSA releases baby names annually for the most popular names with sex and ethnicity information. You don't need an LLM, and you don't necessarily want one. You could implement a lookup and take the predominant sex for each name.
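
A rough sketch of such a lookup, assuming each yearly file is plain text with one `name,sex,count` row per line and no header (check this against the files you actually download). `SAMPLE` is made-up data in that shape, and `load_counts` and `predominant_sex` are hypothetical helper names.

```python
import csv
import io
from collections import Counter, defaultdict

# Made-up rows in the assumed name,sex,count layout.
SAMPLE = """Mary,F,120
John,M,150
Leslie,F,40
Leslie,M,25
"""


def load_counts(texts):
    """Sum counts per name and sex across several yearly files (given as text)."""
    counts = defaultdict(Counter)
    for text in texts:
        for name, sex, count in csv.reader(io.StringIO(text)):
            counts[name][sex] += int(count)
    return counts


def predominant_sex(name, counts):
    """Return the sex with the highest total count, or None if the name is absent."""
    by_sex = counts.get(name)
    if not by_sex:
        return None
    return by_sex.most_common(1)[0][0]


# Pretend the same sample is two different years.
counts = load_counts([SAMPLE, SAMPLE])
```

Taking the predominant sex always yields F or M, so a name such as Leslie would need an extra ratio check if you also want an ambiguous label.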

Darth_Candy@reddit

OP, this is the way to go. Another similar option is to take that database and build your own classification model using the scikit-learn Python library (they make it easy and there are endless tutorials online, I promise). Using a large language model for this is hydrogen bomb vs coughing baby.
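
For a sense of scale, here is a minimal scikit-learn sketch: character n-grams fed into logistic regression. The twelve training names are a toy stand-in for a real name database, the model choice is just one plausible option, and `predict_with_confidence` is a hypothetical helper.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data; a real model would be trained on a large name list.
names = ["mary", "anna", "emma", "julia", "sophia", "linda",
         "john", "james", "robert", "david", "michael", "thomas"]
labels = ["f"] * 6 + ["m"] * 6

# Character 1- to 3-grams capture endings such as "-a" or "-ia".
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(names, labels)


def predict_with_confidence(name):
    """Return (label, probability) for the most likely class."""
    probs = model.predict_proba([name.lower()])[0]
    best = probs.argmax()
    return str(model.classes_[best]), float(probs[best])
```

A low probability from `predict_with_confidence` is a natural signal for the ambiguous bucket. Unlike a lookup table, this also produces a guess for names it has never seen, which may or may not be what you want.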

1-800-methdyke@reddit

The only correct use for an LLM in this situation is to write some code to train a classification model.

SadEntertainer9808@reddit

OP, how many names are you talking about? Any modern LLM should be able to perform this task, but some will cost more (in time and/or money) than others.

somerussianbear@reddit

I can see people already pissed online complaining they were “CLASSIFIED” by an algorithm that didn’t get their pronouns right.

DevilaN82@reddit

Why not extract the first name and check it against a database of known names?

It seems that everything looks like a nail when you've got a hammer in your hand...
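
As a sketch of the extraction step, assuming names are written in "Given Family" order, optionally with a leading title. `KNOWN` is a toy stand-in for a real database, `TITLES` is an illustrative and incomplete list, and the `"unknown"` label is an addition for names not found.

```python
# Toy database; a real one would hold many thousands of first names.
KNOWN = {"john": "m", "mary": "f", "leslie": "ambiguous"}

# A few titles to skip; extend as needed for your data.
TITLES = {"mr", "mrs", "ms", "dr", "prof"}


def first_name(full_name):
    """Return the lowercased first non-title token, or None if there is none."""
    tokens = [t.strip(".,").lower() for t in full_name.split()]
    tokens = [t for t in tokens if t and t not in TITLES]
    return tokens[0] if tokens else None


def lookup(full_name, known=KNOWN):
    """Look up the extracted first name; 'unknown' if it is not in the database."""
    return known.get(first_name(full_name), "unknown")
```

Inputs in "Family, Given" order would need a different split, so it is worth checking the format of the list before relying on the first token.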

my_name_isnt_clever@reddit

Have you tested this at scale to determine if it's good enough for you? Given the nature of LLMs, I wouldn't be at all surprised to see it classify a name like "Leslie" as female on one run and ambiguous on another. And that's not even considering rare names that are poorly covered in the training data.
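
One way to measure that instability is to classify each name several times and look at how often the runs agree. In this sketch, `classify` stands for whatever function calls your model (it is a placeholder, not a real API), and `runs` and `min_agreement` are arbitrary defaults.

```python
from collections import Counter


def stable_label(name, classify, runs=5, min_agreement=0.8):
    """Run the classifier several times; fall back to 'ambiguous' on disagreement.

    Returns (label, agreement), where agreement is the share of runs
    that produced the most common label.
    """
    votes = Counter(classify(name) for _ in range(runs))
    label, count = votes.most_common(1)[0]
    agreement = count / runs
    if agreement < min_agreement:
        return "ambiguous", agreement
    return label, agreement
```

Running this over a sample of a few hundred names and looking at the distribution of agreement scores gives a quick picture of whether a given model is consistent enough for the job.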

CooperDK@reddit

Any model should be able to do this. But you can also train a small LLM, say around 0.5B parameters, on lists of male and female names. You just have to write each JSON entry in the form you want the model to output.
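
A sketch of what preparing such entries might look like, written as one JSON object per line. The `prompt` and `completion` field names and the prompt wording are assumptions; the exact schema depends on the training tool you use.

```python
import json


def to_training_lines(labeled_names):
    """Turn (name, gender) pairs into JSON Lines training entries."""
    lines = []
    for name, gender in labeled_names:
        entry = {
            # Hypothetical field names; adapt to your trainer's schema.
            "prompt": f"Name: {name}\nGender:",
            # The completion is the exact JSON you want the model to emit.
            "completion": json.dumps({"name": name, "gender": gender}),
        }
        lines.append(json.dumps(entry, ensure_ascii=False))
    return lines


# Tiny illustrative sample.
lines = to_training_lines([("John", "m"), ("Mary", "f"), ("Leslie", "ambiguous")])
```

Each element of `lines` can be written to a `.jsonl` file as is.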