Ok, Claude is a beast in bioinformatics; it seems to be one of the few models that invests in it. They even created a benchmark for it. Is there an open-weight model that approximates it?
Posted by Turbulent_Pin7635@reddit | LocalLLaMA | 43 comments
As I said in the title. I've tried several models, but the only one that has truly worked so far is Claude. Unfortunately. Can you give me tips? I have access to big models.
numberwitch@reddit
You don’t even explain what you’re trying to do, so no one can help you. Having “llm do bioinformatics” isn’t a solvable problem as stated
Turbulent_Pin7635@reddit (OP)
You are right!
Normally I run genomics analyses like RNA-seq, ATAC-seq, scRNA-seq, spatial transcriptomics, differential analysis, transcriptome assembly... and the combination of all of it in a single multi-omics analysis.
Adventurous_Cat_1559@reddit
All those things are fairly straightforward and just a few commands to run, are you struggling with something specific?
SadBBTumblrPizza@reddit
I mean they're straightforward until they're not. Which is always lol
Adventurous_Cat_1559@reddit
That’s usually a sign of misunderstanding the problem. Keeping things simple and then building on it is crucial. Especially in something like comp bio, what OP has described is usually done with chaining commands in a bash script.
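To give a flavor of the "chaining commands in a bash script" approach, here is a minimal sketch. The file names and the tiny FASTQ are fabricated just so the sketch runs; a real pipeline would call an aligner, filters, and a peak caller or quantifier between steps.

```shell
#!/usr/bin/env bash
# Minimal sketch of a chained pipeline: each step runs only if the
# previous one succeeded, thanks to `set -e`.
set -euo pipefail

# Fabricated two-read FASTQ, just so the sketch is runnable.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGC\n+\nIIII\n' > reads.fastq
gzip -f reads.fastq

# QC-style step: count reads (the sequence is line 2 of each
# 4-line FASTQ record).
n_reads=$(gunzip -c reads.fastq.gz | awk 'NR % 4 == 2' | wc -l)
echo "reads: ${n_reads}"

# In a real pipeline the next steps would be alignment, filtering,
# and peak calling / quantification, each consuming the previous
# step's output and aborting the script on failure.
```

The point of `set -euo pipefail` is that a failure anywhere in the chain stops the run instead of silently feeding garbage downstream.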
SadBBTumblrPizza@reddit
I think you don't understand me. I mean in production there's always some weird quirk: dirty formatting, a mismatched reference <> annotation, unexpected or missing SAM fields, a too-old or too-new BED track... etc. Those things can't really be solved on the software side, though; that's more about your data. In terms of pure software engineering you're right, but it really isn't that simple in bioinformatics.
Turbulent_Pin7635@reddit (OP)
That makes it clear now. I'm persisting; right now, following some tips from around here, the results have improved dramatically. But it took a full day of configuring tools and knowledge bases.
The only downside right now is time. Of course my home workstation can't compete with closed-provider servers. But the answers are very close, sometimes even better than the ones from Claude Opus 4.6, just 10x slower and basically free.
Turbulent_Pin7635@reddit (OP)
My main problem is getting those tools to work on a Mac; implementing them in Nextflow is a nightmare, though recently more tools have been published for Mac users. Due to my lack of a formal computational background, most errors are hard to work around, and AIs normally begin to hallucinate before they diagnose the error (not the biological interpretation, but what is actually causing the pipeline failure).
You are right, nothing unusual, but most AIs were not trained on bioinfo; small details and parameters (and even if each one is simple, every tool has several) make it difficult. I work with non-model organisms, sometimes difficult ones with >50% of the genome in repeats due to transposon activity. The nuances, the abundance of parameters, and the difficulty of working on a Mac are the problems.
Claude solves most of it; I'm trying to get the same quality, or close to it, locally. Not even the other paid models come close to the solutions Claude produces. I want to improve my local setup to get there.
Adventurous_Cat_1559@reddit
ah okay, so two things you could probably try:
Find the docs for the specific tool, e.g. https://github.com/nextflow-io/training, and clone them locally. Load up something like opencode, then prompt it with: "We want to build a tool which takes our data X and performs Y using Nextflow. Using the documentation in, formulate a plan for this which will run on my Mac."
While I don't think what you've described is specifically just Mac issues, could you try getting it to work in Docker? It's usually good to have something that works, and then prompt your LLM to translate from one system to another; that way you're solving one problem, not multiple possibly conflicting ones. Even if Docker is super slow, build the workflow with a small test dataset so you can iterate quickly.
I'm currently running qwen/qwen3.6-27b at 8bit quant on my mac studio and it works well for this sort of thing.
Turbulent_Pin7635@reddit (OP)
It works in Docker, yes, but some analyses with big datasets just get jammed. In my case it's the Nextflow ATAC-seq pipeline: the test profile runs, but it gets stuck with a big dataset. This doesn't happen on a Linux/Intel machine.
Turbulent_Pin7635@reddit (OP)
Non-model species, plus some of the data needed for multi-omics analysis and some concatenation of data. I know each thing is "easy"; the problem is when you need to take care of several easy things, each with several parameters, and combine them into pipelines.
Adventurous_Cat_1559@reddit
Again, that’s just a simple bash script from what you’re describing. I’m really not trying to be rude, I want to be helpful.
Not to dox myself, but I'm a long-time contributor to samtools and have worked on this stuff for over a decade.
If you want to give some concrete examples of what you’re struggling with I’ll genuinely help.
ItilityMSP@reddit
Separate the pipeline: get a cheap Linux box, run the bioinformatics stuff there, and feed the local LLM on the Mac.
SadBBTumblrPizza@reddit
I mean like, I'll say this about scRNA: it's as much art, voodoo, and snake oil as it is science. You'll get better results there by just reading Lior Pachter's papers, doesn't matter what LLM is helping you. My 2 cents on that is "almost all scrna clustering is complete hogwash".
But no, I'm a bioinformatician and Claude is still king here. Kimi is actually half decent at it, but honestly if you're an actually trained bioinformatician any of them will do the job because you should be driving it. I've gotten good results and even published a tool using open models as long as you're very careful about the harness: for example, I wanted to write a super fast rust implementation of a statistic that didn't really have any good packages out there, so I read a paper and pointed claude + kimi at it. Both missed a very crucial part of the equation I was trying to model, and I had to really force them to implement it properly. But both also caught a few little details here and there I missed.
Turbulent_Pin7635@reddit (OP)
Claude has just multiplied what I can do with the little time I have, because I am not a pure bioinformatician; I am involved in every part of the experiment design, from taking care of the animals to the bioinfo analysis, and as you know, code consumes time. Thanks for the insights. I'll try Kimi! In your opinion, how close does it get to Claude?
SadBBTumblrPizza@reddit
I'd say Kimi 2.6 is like 80-90% as good. Kimi 2.5 is a fair bit behind that but still usable with enough guidance. If you know your domain well and can spot the failure modes, you can use anything vaguely frontier-level and do OK.
I'd bet if you spent enough time planning you could even get a Qwen3.6 to do the trick. I should try that when I have some spare time and a project to throw it at.
Turbulent_Pin7635@reddit (OP)
I use the 3.6 35b as my daily model; it codes well, but it doesn't catch the biological details needed to code properly. I deal with non-model organisms and a lot of shitty scaffold-level assemblies; chromosome-level only when I get lucky. Transcriptomes are more common.
JosephGenomics@reddit
You shouldn't be getting downvoted; 'bioinformatics' is far more specialized and a smaller problem than coding, and everyone here talks about coding models.
That said, I'm slowly generating training data for fine-tuning, esp. the first step of the bioinfo day: "Is there a new version of this tool? If so, grab it, use it."
SadBBTumblrPizza@reddit
Yeah, I add an AGENTS.md instruction to the effect of "never, ever assume package version numbers are right; your package versions are certainly wrong," and it works... 50% of the time, if I'm lucky lol
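For reference, the kind of AGENTS.md fragment being described might look like this (the wording is illustrative, not a quote from anyone's actual file):

```markdown
## Package versions

- Never, ever assume package version numbers are right; the versions
  in your training data are certainly stale.
- Before pinning anything, check what is actually installed
  (`pip show <pkg>`, `conda list`, or the tool's `--version`).
```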
Turbulent_Pin7635@reddit (OP)
I would say: if so, DON'T TOUCH IT!!!
Version conflicts are half of my life. 😆
JosephGenomics@reddit
Haha! Mine is "Your assembly has X problem, which was fixed in your assembler two years ago, but you chose to use a version from four years ago, despite starting this project last year...."
Environmental-Metal9@reddit
Rust 2018 calling from the grave lol
JosephGenomics@reddit
Lol! True. I'm finally not seeing any "new" python 2 tools as well
Enough_Big4191@reddit
For bioinformatics I'd be pretty strict with evals, because a model can sound fluent and still mess up a pipeline step or a package assumption. I'd test open models on a few real Nextflow/RNA-seq tasks you already know, then score command correctness, missing QC steps, and whether it invents parameters. The closest "good enough" option may depend less on the model name and more on whether it stays grounded in your exact pipeline docs.
SadBBTumblrPizza@reddit
Yeah this is it. And honestly in all scientific computing this holds: you NEED a testing harness, and that harness needs to be airtight, no shortcuts.
Two things about that, though: 1) 99.9% of bioinformatics people are not software engineers and can't/won't/don't have time to write tests, so tests pretty much always get skipped. 2) Thankfully, LLMs make writing good tests much easier and less tedious than ever.
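As one concrete flavor of such a check, here is a tiny sketch that flags options in a model-generated command which aren't in a tool's documented option list. `mytool`, its option list, and the generated command are all fabricated stand-ins; in practice the known list would come from the real tool's --help output or man page.

```shell
#!/usr/bin/env bash
# Sketch of one automated eval check: detect hallucinated parameters
# in a model-generated command. All strings below are fabricated.
set -euo pipefail

known_flags="-i -o -t --threads --outdir"
generated="mytool run -i sample.bam -o out/ --threds 8"

invented=""
for tok in $generated; do
  case "$tok" in
    -*)
      # Anything starting with '-' that is not in the known option
      # list is treated as an invented flag.
      echo "$known_flags" | grep -qw -- "$tok" || invented="$invented $tok"
      ;;
  esac
done

if [ -n "$invented" ]; then
  echo "invented flags:$invented"
else
  echo "no invented flags"
fi
```

Here the typo'd `--threds` gets flagged while the documented `-i` and `-o` pass; scoring missing QC steps or wrong reference files would need similar but task-specific checks.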
Turbulent_Pin7635@reddit (OP)
I need to take care of animal colonies, do molecular biology and bioinformatics, find funding, and write reports. 😆 I'm tired, boss. If I can automate any step, trust me, I will.
In a single week I am a farmer, a pharmacist, a salesman, a writer, and a programmer. Any time gained is a blessing.
Turbulent_Pin7635@reddit (OP)
What I have done is feed it, via RAG, the documentation of the software tools (MACS, etc.) plus the Nextflow docs. But even with the biggest Qwen3.5 (I try my best to keep the quantization around q6-q8), the answers are not good =/
Sooperooser@reddit
Medgemma maybe
ttkciar@reddit
Medgemma's quite good, but GLM-4.5-Air is better, and I would hope/expect GLM-5.1 would be even better than that.
miniocz@reddit
The issue I see is mainly that smaller local models lack knowledge. I am now trying to give local models a way to get such info, but I am stuck, as I have limited time per day...
Turbulent_Pin7635@reddit (OP)
Same here. I have given it a basic dataset with RAG, and access to the full text of papers through specialized journals, but still... nothing exceptional happened.
miniocz@reddit
So you want it to actually do the analysis? I want it to design and write code that I then run on HPC or a cluster, using proper tools, methods, and advanced stuff I am curious about but have no time to dive into and learn in depth. But IMHO the dataset should be accessible through a code interpreter, not RAG. It never worked for me with tables (not surprisingly).
Turbulent_Pin7635@reddit (OP)
Nah! I want it to code. But I deal with non-model species with shitty DNA. There are some details it needs to catch, otherwise the result is worthless.
ttkciar@reddit
I can't speak to RNA sequencing, but GLM-4.5-Air (open-weights) has been quite good for assisting me with medical journal publications about autoimmune and endocrine disorders (mainly ulcerative colitis and T2D).
I do not know how it compares to commercial inference services because I do not use commercial inference services.
Because GLM-4.5-Air is good, I would assume GLM-5.1 (also open-weights) would be better, but have not been using it, so again I cannot say from direct experience. "On paper" GLM-5.1 is purported to be somewhere between Claude Sonnet and Claude Opus in competence. You might want to at least try it out and see how it compares for your specific application.
If you are looking to infer locally on your own hardware, GLM-4.5-Air at Q4_K_M and full 128K context needs about 128GB of memory (preferably VRAM), and GLM-5.1 at Q4_K_M and full 200K context would need about 512GB of memory.
Turbulent_Pin7635@reddit (OP)
For paper search you can get an API key from Semantic Scholar. =)
I have the M3 Ultra with 512GB. I am truly impressed by Claude's power in bioinfo. And not many models are putting effort into it. That's a shame!
ttkciar@reddit
Then you should be able to self-host GLM-5.1 quantized to Q4_K_M, though you will find it slow, and might need to bump the context down a bit (especially if you are running other memory-hungry applications like a web browser).
Turbulent_Pin7635@reddit (OP)
Of the big ones I normally use Qwen 3.5 387b @ q8. The speed is normally about the same across models, somewhere between 20-50 tk/s. For daily use I go with Qwen 3.6 35b @ q8 because its prompt processing (pp) is very good.
But neither of these models has very good knowledge of bioinfo, even with RAG, tools, and search. =\
And there are not a lot of benchmarks in bioinfo.
ttkciar@reddit
The best benchmark is your actual application.
My standard practice for evaluating a model's competence at a specific skill or domain is to ask it to solve the same real-world problem five times, see how many times it gets it right vs wrong (especially hallucination), and look at both what is common between its answers and what differs. That gives me an impression of what to expect from it, especially in terms of reliability.
Since you are already familiar with Qwen3.5-387B's competence, you can use it as a baseline to compare against other models. If there are particular problems where Qwen3.5 succeeded and other problems where it failed, repeating those with other models should give you an idea of how they compare.
You can also ask Claude Opus to solve those problems, of course, to get an idea of how those models compare to the best of the commercial services.
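The five-run check described above can be mechanized with a few lines of shell. The reference command and the five stored "model answers" below are fabricated for illustration; in a real run, each line would be one sampled completion from the model under test.

```shell
#!/usr/bin/env bash
# Sketch of the 5-run reliability check: tally how many sampled
# answers exactly match a known-good reference command.
set -euo pipefail

reference="samtools view -b -q 30 in.bam"
answers="samtools view -b -q 30 in.bam
samtools view -bq30 in.bam
samtools view -b -q 30 in.bam
samtools view -b --min-MQ 30 in.bam
samtools view -b -q 30 in.bam"

correct=0
total=0
while IFS= read -r ans; do
  total=$((total + 1))
  if [ "$ans" = "$reference" ]; then
    correct=$((correct + 1))
  fi
done <<< "$answers"

echo "exact matches: ${correct}/${total}"
```

Exact string matching is deliberately strict; the near-misses it rejects (like `-bq30` vs `-b -q 30`) are exactly the cases worth eyeballing by hand, since some are equivalent and some are hallucinations.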
Turbulent_Pin7635@reddit (OP)
This is what I'm trying. Claude even launched a benchmark for it, but it's not something models are being tested against yet, perhaps because it is relatively new.
ttkciar@reddit
My issue is not search; it's puzzling through the biochemistry described in those papers (I'm an engineer, so have sufficient math skills to understand them, but only one semester of organic chemistry), and putting ideas from two or more papers together so Air can sanity-check my conjectures.
My nephew is working in Computational Biochemistry (simulated protein-protein interactions), so whatever you find best suits your needs, I would love to hear about it, so I can pass it along to him.
Turbulent_Pin7635@reddit (OP)
He is probably much more skilled in protein folding and interactions than I am; what I know is the classic AlphaFold 2.
Are you using Firecrawl or another tool to read through the entire document during the search?
ttkciar@reddit
No, my workflow is to read the paper(s) in one window, run a scientific calculator (with biochem library bindings) in another window, and keep notes in a text file in another window. I cut+paste content from the paper(s) into my notes, and also relevant calculations and their results, along with my observations and conjectures.
When I want GLM-4.5-Air to provide critique or suggest relevant subjects, I invoke it via llama.cpp's llama-completion CLI utility and interpolate my notes, like this: ... where glm45air is a bash wrapper for llama-completion, tuned to run on my forty-core Xeon server: http://ciar.org/h/glm45air
Turbulent_Pin7635@reddit (OP)
Cool!