Any advice or suggestions?
Posted by PeakTurbulent5545@reddit | LocalLLaMA | View on Reddit | 5 comments
I’m a bioinformatician tasked with building a pipeline to automatically find, catalog, and describe UMAP plots from large sets of scientific PDFs (mostly single-cell RNA-seq papers). I've never used AI for this kind of task, so right now I don't really know what I'm doing. I'm not sure why my boss wants this; I don't think it's a good idea, but maybe I'm wrong.
What I've tried so far:
- YOLO (v8/v11): Good for fast detection of "figures" in general, but it struggles to specifically distinguish UMAPs from t-SNEs or other scatter plots without heavy custom fine-tuning (which I'd like to avoid if a pre-trained solution exists).
- Qwen2.5-VL: I’ve experimented with this Vision-Language Model. While powerful, the zero-shot performance on specific "panel-level" identification is inconsistent, and I’m getting mixed results without a proper fine-tuning setup.
Are there any ready-to-use models or specific Hugging Face checkpoints that are already "expert" in scientific document layout or biological figure classification?
I’m looking for something that might have been trained on datasets like PubLayNet or PMC-Reports and can handle the visual nuances of bioinformatics plots. Is there a better alternative to the Qwen/YOLO combo for this specific niche, or is fine-tuning an absolute must here?
Haniro@reddit
I commented on the crossposted thread, but I'll put it here too:
---
The first question is should you, the second question is how can you.
Regarding the latter: break your problem down into manageable chunks. You ultimately want a pipeline that: 1) finds relevant papers, 2) retrieves their PDFs, 3) extracts the embedded figures and text, and 4) classifies and describes each figure.
For 1: use a platform like OpenAlex and NCBI with MeSH terms to identify papers that are relevant to your field, which makes 2 simple enough. For 3: use a PDF-parsing library like pdfminer.six to parse and identify images + text embedded in your document. Then, for 4, use a text+vision->text language model to i) determine if it is a UMAP, and ii) "describe" the UMAP. Qwen3.5 or Gemma4 are hot right now and have relatively low parameter counts, so you can run them locally. You'll never get a perfect parsing + identification of UMAPs without fiddling with some parameters: a little supervised fine-tuning (SFT) with good/bad examples and some careful prompting will help.
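For step 4, one practical trick is to force the VLM to answer in a fixed JSON schema and parse it defensively, since models often wrap JSON in markdown fences or extra prose. A minimal sketch, assuming a made-up schema of `{"is_umap": bool, "description": str}` (the schema and the `parse_vlm_verdict` helper are illustrative, not part of any library):

```python
import json

# Hypothetical schema we'd instruct the VLM to emit per figure panel,
# e.g. 'Answer ONLY with JSON: {"is_umap": bool, "description": str}'.
def parse_vlm_verdict(raw_answer: str):
    """Parse the model's reply defensively: find the first {...} span,
    since VLMs often add fences or chatter around the JSON."""
    start = raw_answer.find("{")
    end = raw_answer.rfind("}")
    if start == -1 or end == -1 or end < start:
        return None  # model ignored the schema; flag panel for retry
    try:
        verdict = json.loads(raw_answer[start : end + 1])
    except json.JSONDecodeError:
        return None
    # Reject replies that don't carry a usable boolean verdict
    if not isinstance(verdict.get("is_umap"), bool):
        return None
    return verdict
```

Anything that parses to `None` can be re-queried or sent to a human review queue instead of silently polluting your catalog.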
Now back to the first question: should you?
Yeah, I don't think it's a good idea either. You should probably clarify this: visually describing a UMAP is a waste of time IMO. What are you describing, the local/global structure? The cell types identified? Trying to infer deep meaning from a UMAP is a fool's errand: they depend so heavily on input parameters and they ultimately distort local and global structures (by design). This paper should be required reading for anyone trying to read a UMAP: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011288
Ask your boss why they want you to do this. What is the end goal of all of this? I guarantee there is a much better, more reproducible, and data-driven way to answer your ultimate question without going through all of this.
xylose@reddit
The only way you're likely to tell UMAP from tSNE / PCA is by parsing the text or maybe axis legends. There's nothing fundamentally different about these in terms of their data representation.
If your aim is to find examples of UMAP plots and their contents, then you're going to have a much easier time running an LLM on the PDF text. The model should find it pretty easy to say whether there is a UMAP plot in the paper and provide the figure number. You'll probably even get a reasonable description of the contents from the associated figure legend.
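Before involving an LLM at all, a stdlib-only pre-filter over the extracted PDF text can already shortlist the figure numbers whose legends mention UMAP (the regex and function name here are illustrative sketches, not a library API):

```python
import re

# Matches figure legend openers like "Figure 2.", "Fig. 3:", "Figure S1"
FIGURE_RE = re.compile(r"\b(?:Figure|Fig\.?)\s*(S?\d+)", re.IGNORECASE)

def find_umap_figures(pdf_text: str):
    """Return figure labels whose legend paragraph mentions UMAP.
    Blank lines are used as a rough proxy for paragraph boundaries."""
    hits = []
    for para in re.split(r"\n\s*\n", pdf_text):
        if "umap" not in para.lower():
            continue
        for match in FIGURE_RE.finditer(para):
            label = match.group(1)
            if label not in hits:
                hits.append(label)
    return hits
```

Anything this surfaces can then go to the LLM (or a human) with the full legend text for a proper description, which is far cheaper than classifying every panel visually.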
PeakTurbulent5545@reddit (OP)
Thanks!
Hot-Improvement9260@reddit
You're trying to solve a really specific computer vision problem with general-purpose tools, and that's why you're hitting walls. YOLO is great for detecting objects but terrible at fine-grained classification of similar-looking plots, and Qwen2.5-VL is powerful but needs context and examples to be reliable at this level of specificity. The honest answer is that there probably isn't a pre-trained checkpoint that's already expert in distinguishing UMAPs from t-SNEs at the panel level because that's such a niche domain. PubLayNet and similar datasets are good for document layout but not for the visual nuances of specific plot types. Your boss might actually be onto something though, even if it feels like a stretch right now.
What you're describing is totally solvable, but it's going to need either fine-tuning on a smaller dataset of actual examples from your papers, or a hybrid approach where you combine detection with some metadata extraction from the paper text itself. If you've got access to maybe 100-200 manually labeled examples of UMAPs from your PDF collection, you could fine-tune something like a smaller vision model pretty quickly. Alternatively, you could build a two-stage system where you extract figures first, then use a combination of visual features and OCR on the axis labels and plot characteristics to classify them.
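To sketch that second stage: once a panel's axis labels have been OCR'd, simple keyword rules already separate the common embeddings, since UMAP/t-SNE/PCA scatter plots look alike but their axis labels ("UMAP_1", "tSNE 2", "PC1 (23%)") are usually distinctive. The keyword table and thresholds below are assumptions to tune on your own data, not an established method:

```python
# Classify a scatter-plot panel from OCR'd axis-label text alone.
# Assumed keyword table; extend it as you see new labeling conventions.
AXIS_KEYWORDS = {
    "umap": ("umap",),
    "tsne": ("tsne", "t-sne", "t_sne"),
    "pca": ("pc1", "pc2", "pca", "principal component"),
}

def classify_panel(ocr_text: str) -> str:
    """Return the plot type implied by the axis labels, or 'unknown'.
    Spaces are stripped so 'UMAP 1' and 'UMAP_1' both match."""
    text = ocr_text.lower().replace(" ", "")
    for plot_type, keywords in AXIS_KEYWORDS.items():
        if any(k.replace(" ", "") in text for k in keywords):
            return plot_type
    return "unknown"
```

Panels that come back "unknown" are exactly the ones worth sending to the VLM, so the expensive model only runs on the ambiguous cases.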
That's more work upfront but more reliable. If this is something your lab is going to be doing repeatedly and it's eating up time, it might be worth getting someone in to build this properly rather than trying to DIY it with off-the-shelf models. What's the actual bottleneck right now, the detection or the classification?
PeakTurbulent5545@reddit (OP)
I'd say detection, because sometimes it completely decides that a barplot is a UMAP for some reason XD. Qwen knows when there's a UMAP in a multi-panel image, but when I ask it for the bounding-box coordinates, it's not always reliable.