Any advice or suggestions?
Posted by PeakTurbulent5545@reddit | LocalLLaMA | View on Reddit | 5 comments
I’m a bioinformatician tasked with building a pipeline to automatically find, catalog, and describe UMAP plots from large sets of scientific PDFs (mostly single-cell RNA-seq papers). I've never used AI for this kind of task, so right now I don't really know what I'm doing. I'm not sure why my boss wants this; I don't think it's a good idea, but maybe I'm wrong.
What I've tried so far:
- YOLO (v8/v11): Good for fast detection of "figures" in general, but it struggles to specifically distinguish UMAPs from t-SNEs or other scatter plots without heavy custom fine-tuning (which I'd like to avoid if a pre-trained solution exists).
- Qwen2.5-VL: I’ve experimented with this Vision-Language Model. While powerful, the zero-shot performance on specific "panel-level" identification is inconsistent, and I’m getting mixed results without a proper fine-tuning setup.
Are there any ready-to-use models or specific Hugging Face checkpoints that are already "expert" in scientific document layout or biological figure classification?
I’m looking for something that might have been trained on datasets like PubLayNet or PMC-Reports and can handle the visual nuances of bioinformatics plots. Is there a better alternative to the Qwen/YOLO combo for this specific niche, or is fine-tuning an absolute must here?
Haniro@reddit
I commented on the crossposted thread, but I'll put it here too:
---
The first question is should you, the second question is how can you.
Regarding the latter: break your problem down into manageable chunks. You ultimately want a pipeline that: 1) finds relevant papers, 2) retrieves their PDFs, 3) extracts the embedded figures and text, and 4) classifies and describes each figure.
For 1: use a platform like OpenAlex and NCBI with MeSH terms to identify papers that are relevant to your field, which makes 2 simple enough. For 3: use a PDF-parsing library like pdfminer.six to parse and identify images + text embedded in your document. Then, for 4, use a text+vision->text language model to i) determine if it is a UMAP, and ii) "describe" the UMAP. Qwen3.5 or Gemma4 are hot right now and have relatively low parameter counts, so you can run them locally. You'll never get a perfect parsing + identification of UMAPs without fiddling with some parameters: a little supervised fine-tuning (SFT) with good/bad examples and some careful prompting will help.
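For step 4, one practical trick is to force the VLM to answer in a fixed JSON schema and parse it defensively, since models often wrap JSON in markdown fences or extra prose. A minimal sketch, assuming a made-up schema of `{"is_umap": bool, "description": str}` (the schema and the `parse_vlm_verdict` helper are illustrative, not part of any library):

```python
import json

# Hypothetical schema we'd instruct the VLM to emit per figure panel,
# e.g. 'Answer ONLY with JSON: {"is_umap": bool, "description": str}'.
def parse_vlm_verdict(raw_answer: str):
    """Parse the model's reply defensively: find the first {...} span,
    since VLMs often add fences or chatter around the JSON."""
    start = raw_answer.find("{")
    end = raw_answer.rfind("}")
    if start == -1 or end == -1 or end < start:
        return None  # model ignored the schema; flag panel for retry
    try:
        verdict = json.loads(raw_answer[start : end + 1])
    except json.JSONDecodeError:
        return None
    # Reject replies that don't carry a usable boolean verdict
    if not isinstance(verdict.get("is_umap"), bool):
        return None
    return verdict
```

Anything that parses to `None` can be re-queried or sent to a human review queue instead of silently polluting your catalog.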
Now back to the first question: should you?
Yeah, I don't think it's a good idea either. You should probably clarify this: visually describing a UMAP is a waste of time IMO. What are you describing, the local/global structure? The cell types identified? Trying to infer deep meaning from a UMAP is a fool's errand: they depend so heavily on input parameters and they ultimately distort local and global structures (by design). This paper should be required reading for anyone trying to read a UMAP: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011288
Ask your boss why they want you to do this. What is the end goal of all of this? I guarantee there is a much better, more reproducible, and data-driven way to answer your ultimate question without going through all of this.
xylose@reddit
The only way you're likely to tell UMAP from tSNE / PCA is by parsing the text or maybe axis legends. There's nothing fundamentally different about these in terms of their data representation.
If your aim is to find examples of UMAP plots and their contents, then you're going to have a much easier time running an LLM on the PDF text. The model should find it pretty easy to say whether there is a UMAP plot in the paper and provide the figure number. You'll probably even get a reasonable description of the contents from the associated figure legend.
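Before involving an LLM at all, a stdlib-only pre-filter over the extracted PDF text can already shortlist the figure numbers whose legends mention UMAP (the regex and function name here are illustrative sketches, not a library API):

```python
import re

# Matches figure legend openers like "Figure 2.", "Fig. 3:", "Figure S1"
FIGURE_RE = re.compile(r"\b(?:Figure|Fig\.?)\s*(S?\d+)", re.IGNORECASE)

def find_umap_figures(pdf_text: str):
    """Return figure labels whose legend paragraph mentions UMAP.
    Blank lines are used as a rough proxy for paragraph boundaries."""
    hits = []
    for para in re.split(r"\n\s*\n", pdf_text):
        if "umap" not in para.lower():
            continue
        for match in FIGURE_RE.finditer(para):
            label = match.group(1)
            if label not in hits:
                hits.append(label)
    return hits
```

Anything this surfaces can then go to the LLM (or a human) with the full legend text for a proper description, which is far cheaper than classifying every panel visually.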
PeakTurbulent5545@reddit (OP)
Thanks!
Hot-Improvement9260@reddit
You're trying to solve a really specific computer vision problem with general-purpose tools, and that's why you're hitting walls. YOLO is great for detecting objects but terrible at fine-grained classification of similar-looking plots, and Qwen2.5-VL is powerful but needs context and examples to be reliable at this level of specificity. The honest answer is that there probably isn't a pre-trained checkpoint that's already expert in distinguishing UMAPs from t-SNEs at the panel level because that's such a niche domain. PubLayNet and similar datasets are good for document layout but not for the visual nuances of specific plot types. Your boss might actually be onto something though, even if it feels like a stretch right now.
What you're describing is totally solvable, but it's going to need either fine-tuning on a smaller dataset of actual examples from your papers, or a hybrid approach where you combine detection with some metadata extraction from the paper text itself. If you've got access to maybe 100-200 manually labeled examples of UMAPs from your PDF collection, you could fine-tune something like a smaller vision model pretty quickly. Alternatively, you could build a two-stage system where you extract figures first, then use a combination of visual features and OCR on the axis labels and plot characteristics to classify them.
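To sketch that second stage: once a panel's axis labels have been OCR'd, simple keyword rules already separate the common embeddings, since UMAP/t-SNE/PCA scatter plots look alike but their axis labels ("UMAP_1", "tSNE 2", "PC1 (23%)") are usually distinctive. The keyword table and thresholds below are assumptions to tune on your own data, not an established method:

```python
# Classify a scatter-plot panel from OCR'd axis-label text alone.
# Assumed keyword table; extend it as you see new labeling conventions.
AXIS_KEYWORDS = {
    "umap": ("umap",),
    "tsne": ("tsne", "t-sne", "t_sne"),
    "pca": ("pc1", "pc2", "pca", "principal component"),
}

def classify_panel(ocr_text: str) -> str:
    """Return the plot type implied by the axis labels, or 'unknown'.
    Spaces are stripped so 'UMAP 1' and 'UMAP_1' both match."""
    text = ocr_text.lower().replace(" ", "")
    for plot_type, keywords in AXIS_KEYWORDS.items():
        if any(k.replace(" ", "") in text for k in keywords):
            return plot_type
    return "unknown"
```

Panels that come back "unknown" are exactly the ones worth sending to the VLM, so the expensive model only runs on the ambiguous cases.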
That's more work upfront but more reliable. If this is something your lab is going to be doing repeatedly and it's eating up time, it might be worth getting someone in to build this properly rather than trying to DIY it with off-the-shelf models. What's the actual bottleneck right now, the detection or the classification?
PeakTurbulent5545@reddit (OP)
I'd say detection, because sometimes it completely decides that a barplot is a UMAP for some reason XD. Qwen knows when there's a UMAP in a multi-panel image, but when I ask it for the bounding-box coordinates, it's not always reliable.