Llama Image Tagger: A project I made to help me sort thousands of images
Posted by Eisenstein@reddit | LocalLLaMA
This is a project I have been working on that I thought I would share with the community.
What is it?
It takes folders of images, creates keywords for them, and writes those keywords into the image metadata. There is no database (except a small file to keep track of which files were processed), so you can move the images wherever you want and the metadata stays with them. You can use any program or app that reads image metadata to sort them, search them, or categorize them. It can also provide full captions/descriptions.
It does this completely locally. It downloads the GGUF model weights from Huggingface and then runs on your machine using a single executable and some scripts. It does not take up residence in your system or install anything except a few small Python libraries for image handling and a small program to write the metadata. Run it, delete it, and it is gone.
- https://github.com/jabberjabberjabber/LLavaImageTagger
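As an illustration of the "use anything that reads metadata" point (not part of the project itself), here is a minimal sketch of searching a folder by an embedded keyword with exiftool from Python. The directory and keyword are placeholders, and it assumes exiftool is installed and on your PATH.

```python
import subprocess

# Hypothetical example: list every image under a folder whose embedded
# IPTC Keywords contain "circuit" (case-insensitive).
result = subprocess.run(
    [
        "exiftool",
        "-r",                              # recurse into subdirectories
        "-if", "$Keywords =~ /circuit/i",  # only files matching the keyword
        "-FileName",
        "-S",                              # short tag output
        "/home/you/Pictures",              # placeholder path
    ],
    capture_output=True,
    text=True,
)
print(result.stdout)
```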
Why is it?
I do a lot of electronics repair. An important part of any repair is documenting the steps so that when you put the thing back together you can reference them. This means I have several tens of gigabytes of folders full of generically named pictures of random circuit boards. This is mixed in with normal pictures that I care about, as well as hundreds (thousands?) of screenshots and various other crap that I have been dumping into folders for over a decade.
I thought to myself -- this is a solved problem, I am sure!
I looked at all the various professional image-cataloging software. The problem is that I am not a photographer and I don't care about any of the features besides 'sort and label my pictures'.
I looked at the options that non-professionals use to catalog photographs. The problem is that I don't want to put all my stuff on someone else's computer, and nary a non-cloud option can be found (yes, I know about the one everyone loves that has breaking updates every week, but that is just a cloud service on your own network, which is not what I want).
I looked at solutions provided by the SD/Image Gen crowd. The problem is that my images are not porn. And here we are.
This project is composed of four components:
- KoboldCpp runs the backend: It is one executable, it is updated frequently, the dev actively listens to community feedback, and when you are done using it, your system is the same as before: no stupid random hidden directories filled with hundreds of gigs of model weights you already had, no Docker crap, no Python dependency hell. It will even download the model weights you need if you specify their location in a config file passed to the executable.
- MiniCPM-V 2.6: It is the best vision-capable model that can be run as a GGUF right now, and it is absolutely adequate for these needs.
- Exiftool: File metadata is a horror show of conflicting standards going back to the early 90s, starting with Adobe and turning into a real nightmare in the 2000s when every camera manufacturer and photo software creator decided they would make their own tags and treat everyone else's tags however they felt like. There is a dev who has spent over a decade figuring it all out, and as a result we have a tiny but immensely useful program that you can throw metadata at and ignore a lot of the details. (A sketch of how a script might call it follows this list.)
- The script that coordinates these things. This is the only part that I am responsible for.
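For a concrete sense of the exiftool step, here is a minimal sketch of how keywords and a description might be written from Python. It is an assumption about the mechanics, not the project's actual code: it shells out to exiftool directly rather than using PyExiftool, and the tag choices (IPTC:Keywords, XMP:Subject, XMP:Description) are common conventions rather than anything the project mandates.

```python
import subprocess

def write_metadata(path, keywords, description):
    """Embed keywords and a description into an image file.

    Hypothetical sketch: assumes exiftool is on the PATH. Writes the
    keyword list to both the IPTC and XMP tags so that older and newer
    readers can find them.
    """
    args = ["exiftool", "-overwrite_original"]
    for kw in keywords:
        args.append(f"-IPTC:Keywords+={kw}")   # legacy IPTC list tag
        args.append(f"-XMP:Subject+={kw}")     # modern XMP equivalent
    args.append(f"-XMP:Description={description}")
    args.append(path)
    subprocess.run(args, check=True)

write_metadata("board_front.jpg", ["circuit board", "capacitor"],
               "Top side of the main board before rework.")
```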
The llmii script does the following:
- Crawls through a directory tree
- Finds images
- Sends each image to the KoboldCpp API
- Asks a vision model to create a caption for the image, keywords for the image, or both
- Takes whatever output the model gives and figures out whether it is usable or whether the model is just explaining why keywords are awesome or something
- If it is garbage, tries one more time; if it is good, writes the metadata to the image file, including a unique ID, and adds an entry to a text (JSON) file in the root folder that lets us know we have done this file in case we need to quit and come back (a sketch of this validate-and-retry step follows the list)
- When finished, it can take all the generated keywords and (after de-pluralizing them) expand them so that all images with synonyms of other keywords get each other's synonyms (I know this sounds confusing, but it is super useful; give it a second to sink in), OR it can 'deduplicate' them by finding the synonyms and replacing all of them with the most frequently used synonym
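Here is a minimal sketch of that validate-and-retry step. The endpoint, payload fields, and response shape follow KoboldCpp's generate API as I understand it, and the prompt is assumed to ask for a JSON object like {"keywords": [...]}; image handling (KoboldCpp accepts base64-encoded images alongside the prompt) is omitted for brevity, so treat this as an illustration rather than the project's actual code.

```python
import json
import requests

API_URL = "http://localhost:5001/api/v1/generate"  # KoboldCpp's default port

def generate_keywords(prompt, retries=1):
    """Ask the model for keywords and validate the reply.

    Sketch under assumptions: the response shape is
    {"results": [{"text": ...}]} and the prompt requests a JSON object
    like {"keywords": [...]}. Retries once if the reply is unusable.
    """
    for _ in range(retries + 1):
        resp = requests.post(API_URL, json={"prompt": prompt, "max_length": 256})
        text = resp.json()["results"][0]["text"]
        try:
            keywords = json.loads(text).get("keywords")
        except (json.JSONDecodeError, AttributeError):
            continue  # garbage out; try once more
        # Usable output is a non-empty list of strings, not an essay
        # about why keywords are awesome.
        if isinstance(keywords, list) and keywords and all(isinstance(k, str) for k in keywords):
            return keywords
    return None
```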
Example of keyword post-processing
Image01: sedan, roadway, carwash
Image02: car, street, carwash
Image03: car, washing, bigfoot
Expand
Image01: sedan, car, street, roadway, washing, carwash
Image02: sedan, car, street, roadway, washing, carwash
Image03: sedan, car, street, roadway, washing, carwash, bigfoot
DeDupe
Image01: car, roadway, carwash
Image02: car, roadway, carwash
Image03: car, roadway, carwash, bigfoot
In dedupe, since 'street' and 'roadway' are tied, whichever one got put on the list first wins.
Note: You do not have to do this! The default is to just leave the keywords alone as they were generated.
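To make the expand/dedupe logic concrete, here is a minimal sketch of the two strategies. The synonym source is faked with a tiny hardcoded map, and the grouping rules are simplified, so expect the flavor of the example above rather than an exact keyword-for-keyword match with the real script.

```python
from collections import Counter

# Hypothetical stand-in for a real synonym source (e.g. WordNet lookups).
SYNONYMS = {
    "sedan": {"car"}, "car": {"sedan"},
    "street": {"roadway"}, "roadway": {"street"},
    "washing": {"carwash"}, "carwash": {"washing"},
}

def expand(images):
    """Every keyword with a synonym in the corpus is shared by all images.

    Simplification: treats all the images as one linked group, as in the
    example above; keywords with no synonyms (e.g. 'bigfoot') stay local.
    """
    corpus = [kw for tags in images.values() for kw in tags]
    pool = [kw for kw in dict.fromkeys(corpus)
            if SYNONYMS.get(kw, set()) & set(corpus)]
    return {name: pool + [t for t in tags if t not in pool]
            for name, tags in images.items()}

def dedupe(images):
    """Replace each keyword with the most used member of its synonym group."""
    corpus = [kw for tags in images.values() for kw in tags]
    counts = Counter(corpus)
    def canonical(kw):
        group = [g for g in [kw, *SYNONYMS.get(kw, set())] if g in counts]
        # The most frequently used synonym wins; ties go to first appearance.
        return max(group, key=lambda g: (counts[g], -corpus.index(g)))
    return {name: list(dict.fromkeys(canonical(t) for t in tags))
            for name, tags in images.items()}

images = {
    "Image01": ["sedan", "roadway", "carwash"],
    "Image02": ["car", "street", "carwash"],
    "Image03": ["car", "washing", "bigfoot"],
}
print(expand(images))
print(dedupe(images))
```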
Please also note that I am not a great or talented programmer. I am amazed any of this works, so I am happy to take good-faith advice and critique, and any issues will be looked into. You should know what you are doing when you run this, but I am sure you are all technically proficient.
If your comment is only going to be 'another project does this' or 'I don't want this' or something else completely useless, write it in the comment box, then close the tab without hitting post. Thanks, but it is rude.
StarGeekSpaceNerd@reddit
Results:
Description:
The image depicts a medieval-themed illustration, likely from a historical or fantasy context. At the center, a bearded man with a long white beard is depicted, wearing a red tunic and a golden crown, signifying royalty or high status. His arms are raised triumphantly, and he is shouting "TETTEN!" which is German for "THERE! Ready!" or "Done! Ready!" This suggests a moment of victory or completion of a task. The background is simple and does not provide additional context, focusing the viewer's attention on the central figure. The style of the illustration is reminiscent of medieval manuscript illuminations, with bold outlines and flat colors, typical of historical art from the medieval period.
Keywords:
expression, exclamation, white beard, crown, cartoon, traditional, lifestyle, brick wall, background, victory, royalty, red robe, king, posture
blurredphotos@reddit
Generated by which model?
StarGeekSpaceNerd@reddit
The model that was included. See this post.
StarGeekSpaceNerd@reddit
Well Actually… ;)
It's now over two decades, as the first version came out in 2003.
Eisenstein@reddit (OP)
Howdy. Fancy seeing you here. Thanks for all your help!
StarGeekSpaceNerd@reddit
I am summoned wherever exiftool is mentioned. No need to even say it more than once, unlike that ghost.
Actually, I have an IFTTT set up to watch reddit for mentions of exiftool.
And always glad to help.
One feature request: how about the ability to append the AI description if there is already an existing Description? A lot of my images already have some sort of minimal description, and I wouldn't want to overwrite that.
Eisenstein@reddit (OP)
How should it be appended? Unlike keywords you can't just add a new description to a list. If you could give an example of the XMP:Description field with the ideal combination of one of your descriptions with an AI description appended to it, I can use that as the template (doesn't have to be real content).
You can use four backticks after an empty line and a newline to get reddit to do a code box (I have no idea how familiar you are with reddit, and I generally avoid looking at people's post histories unless I am very bored or very angry, so don't be offended if I am stating the obvious to you).
StarGeekSpaceNerd@reddit
On the command line, you would use a command like this:
exiftool "-Description=AI Description" "-Description<${Description} AI Description" /path/to/files/
But that would only put a space between the two, where I think one or two newlines would be better. And that makes the whole thing more complicated.
I unfortunately don't know Python or how to use PyExiftool, but I think something like this ChatGPT answer would work, though it has some obvious errors. You would have to adapt it to your code.
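For what it's worth, here is a rough sketch of what that read-then-append could look like in Python, shelling out to exiftool rather than using PyExiftool. The blank-line separator and the XMP:Description tag choice are assumptions, not anything the script actually does.

```python
import subprocess

def append_description(path, ai_text):
    """Append an AI-generated description to any existing one.

    Hypothetical sketch: reads XMP:Description with exiftool, joins the
    old and new text with a blank line, and writes the result back.
    """
    read = subprocess.run(
        ["exiftool", "-s3", "-XMP:Description", path],  # -s3: value only
        capture_output=True, text=True,
    )
    existing = read.stdout.strip()
    combined = f"{existing}\n\n{ai_text}" if existing else ai_text
    subprocess.run(
        ["exiftool", "-overwrite_original", f"-XMP:Description={combined}", path],
        check=True,
    )
```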
Eisenstein@reddit (OP)
What is your use case for the existing captions? I ask because having two captions in one tag doesn't seem practical, and if it were meant to contain more than one entry it would be a list, not a string.
I am of course not an expert on metadata; I just want to keep this thing as simple as possible and not get into the weeds supporting use cases which don't conform to standards and which I would need to keep track of.
Eisenstein@reddit (OP)
That should be doable.
GortKlaatu_@reddit
This is good for labelling, but how do you then search? Are you searching through the static keywords?
For me there's just a disconnect between pure labelling and tools like rclip.
Eisenstein@reddit (OP)
That's a whole different thing, but once the metadata is generated it moves with the pictures and will get absorbed by any application which reads it.
Check out diffractor if you want something that can show you how useful tagging your images is.
StarGeekSpaceNerd@reddit
The metadata is embedded in the image in standard image metadata locations and will be read by any competent Digital Asset Management (DAM) program, such as Lightroom or DigiKam.
TwiKing@reddit
Which llava model do I use with it?
Eisenstein@reddit (OP)
You can use any vision model, but the config is set to automatically download MiniCPM-V 2.6 Q4_K_M from huggingface the first time you run it.
TwiKing@reddit
Ah, thank you. I was completely confused, since other vision models I've tried have very strict image size limitations and I wasn't sure which one to try.
Eisenstein@reddit (OP)
I actually did make an early version of the script that changes filenames.
It is an absolutely terrible idea, though; I strongly advise against it. Once you have keywords inside the image metadata it no longer matters what the filename is; you can find it from the metadata tag using an image viewer that supports metadata searching. XnView and Diffractor are examples.
TwiKing@reddit
Yeah, you're right, the metadata approach does sound a lot better.
Eisenstein@reddit (OP)
As long as the files are under ~30MB each, it doesn't matter what resolution they are. Of course, a 10px by 10px image isn't going to get a very good generation, but you shouldn't have to worry about resizing anything.
cleverusernametry@reddit
This needs to be in the readme
Eisenstein@reddit (OP)
Thanks for the feedback. I just added it to the top of the installation section. Let me know if that works OK.
adriosi@reddit
Try using lm-format-enforcer to enforce an output format on your model. It supports JSON and regex, so you can tweak the format of the resulting keywords too.
Eisenstein@reddit (OP)
Just for reference, out of 14,197 images in my Pictures folder, a total of a little less than 1% failed for any reason. Now that I think about it, it might be helpful to record the specific reasons for failure, but if I had to guess, I would say about 3/4 of those failures are 'non-parseable output'. So it isn't a huge deal.
In this case, though, a single key in a JSON object followed by a list, even if mangled horribly by the LLM, is relatively easy to reconstruct, so although it isn't necessary here, your link is, I am sure, helpful for others.
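A minimal sketch of that kind of reconstruction, under the assumption the model was asked for {"keywords": [...]}: try strict JSON first, then fall back to pulling quoted strings out of whatever came back.

```python
import json
import re

def salvage_keywords(text):
    """Recover a keyword list from possibly-mangled model output.

    Sketch only: tries strict JSON first, then falls back to grabbing
    whatever quoted strings follow a 'keywords' key, no matter how
    mangled the surrounding braces and brackets are.
    """
    try:
        data = json.loads(text)
        if isinstance(data, dict) and isinstance(data.get("keywords"), list):
            return [str(k) for k in data["keywords"]]
    except json.JSONDecodeError:
        pass
    match = re.search(r'keywords"?\s*[:=]?\s*(.*)', text,
                      re.IGNORECASE | re.DOTALL)
    if match:
        return re.findall(r'"([^"]+)"', match.group(1))
    return []

print(salvage_keywords('{"keywords": ["car", "street"]}'))    # clean JSON
print(salvage_keywords('Sure! "keywords": ["car", "street"'))  # mangled
```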
Eisenstein@reddit (OP)
Using grammar slows down inference considerably.
ali0une@reddit
Looks useful, will test. Thanks for sharing.