Tenstorrent Blackhole Cards
Posted by SashaUsesReddit@reddit | LocalLLaMA | View on Reddit | 139 comments

Just got in some Blackhole p150b cards! Excited to try these out... Anyone else on here running some of these? Curious to collaborate!
__JockY__@reddit
Woah. 32GB per card for $999?
These could be very, very interesting depending on the answer to: what inference software can these run? Llama.cpp? vLLM? Tenstorrent’s own stack? If the latter then can we use GGUFs or AWQ or…
So many questions.
tomz17@reddit
$1,400 for the 32gb variant, according to their website.
2roK@reddit
That's laughable, really. It was a decent deal at $1k for 32GB; add VAT and these cards now cost as much as a 3090 did at launch, yet they perform worse and have no CUDA.
Are these companies dumb or just greedy? THE ONLY REASON people would start using these and deal with all the hassle that comes with them is the price.
Hunting-Succcubus@reddit
No CUDA? I'm out, show me the door plz
tomz17@reddit
Comparing these against a 3090 is laughable.
Nobody is building a brand new product to compete with a 3090 for some homelabber. These are enterprise products with a price point designed to compete with Instinct / Tesla and go into large datacenter racks for applications that have a team of software engineers backing them.
The killer feature here is the modular stackability. You can interconnect these fairly easily (albeit at reduced performance) without the NVIDIA backplane.
2roK@reddit
I mean, technically 5090s aren't consumer cards either; we had to come up with the word "prosumer" for this situation.
tomz17@reddit
Technically no, but NVIDIA now markets these cards as the halo gaming product instead of the premier workstation product (i.e. the RTX PRO 6000 now took that spot), and so a lot of gamers shelled out for the 5090, 4090, and 3090 (vs. previous generations, where the Titans/Quadros were far less coveted by gamers).
mnt_brain@reddit
Needs to be 10% of that.
tomz17@reddit
Nah, as long as the scalability promises hold, this pricing is a bargain. Keep in mind that the majority of the "localllama" community is not the target customer for any of this stuff. You can't support a real AI business (esp. one rolling custom boards/silicon), by selling to homelabbers.
ikergarcia1996@reddit
True, you want to sell to big enterprises and get your GPUs into those huge datacenters that AWS, Meta, OpenAI... are building. But Nvidia was, for a long time, a company whose customers were homelabbers and PhD students getting free GPUs from them. And those people contributed to building the software ecosystem that made Nvidia dominate the AI market.
Give GPUs to students, research groups, homelabbers... and when these people transition from academia/their parents' basement into startups, they will want to use the same hardware they already know. A few of them will be successful and end up buying a huge number of GPUs from you.
moofunk@reddit
I don't know if their cloud solution is running yet, but they are meant to run one, so you can get access to TT hardware either for free or for the typical cost of cloud compute.
Many of the contributors to their software stack are indeed homelabbers and expert volunteers, and some have gotten a free TT card that way. But the stack also being entirely open source under the Apache license might mean that enterprise customers aren't interested in sharing the trade secrets involved in running their software on TT hardware.
__JockY__@reddit
32GB of performant VRAM for $140? I mean I’m all for optimism, but this might be going a little far.
mnt_brain@reddit
It’s not usable beyond specific utility. GPUs are relatively general purpose next to CPUs
__JockY__@reddit
Correct. They’ll suck for training and be great for inference, which is the correct order for most folks around here.
mnt_brain@reddit
They're not great for inference though. They're great for inference on very specific things.
__JockY__@reddit
Your unsubstantiated assertions don’t match what I’ve read elsewhere.
For example, Tenstorrent maintains a fork of vLLM to run inference on their cards. This should give us broad support for inference, albeit lagging behind in support for the latest models; I don’t think they’ve merged Qwen3 support yet.
This is vLLM for OP’s GPUs: https://github.com/tenstorrent/tt-inference-server/
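If the Tenstorrent fork keeps upstream vLLM's Python API (an assumption; I haven't run it on TT hardware), the offline-inference entry point would look roughly like standard vLLM usage. The model name and sampling settings below are placeholders.

```python
# Sketch of standard vLLM offline inference. Whether the Tenstorrent fork
# exposes this exact LLM/SamplingParams API on Blackhole is an assumption;
# check the tt-inference-server docs before relying on it.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model name
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain what a Tensix core is."], params)
for out in outputs:
    print(out.outputs[0].text)
```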
What are the “specific things” you neglected to specify?
mnt_brain@reddit
This is a fork of vLLM with a subset of vLLM's features.
__JockY__@reddit
Again with zero details. Can you actually inform us with real information to substantiate anything you’re saying?
What’s missing from vLLM? Can you cite a reference to the information?
Or is this all being pulled out of your butt?
frozen_tuna@reddit
That conflicts with everything I've read about the subject, ever. Is everything I understand about hardware wrong?
mnt_brain@reddit
No. GPUs implement enough to be a generic desktop tool that improves your experience. These cards do not. They are not the same.
beryugyo619@reddit
$140 for 32GB/card is Alibaba MI50 prices!!!!!!!!!!
SashaUsesReddit@reddit (OP)
10% of that? Price-wise? I don't understand... vs. other options it's incredibly well priced.
mnt_brain@reddit
It’s not usable beyond a very specific instruction set and has no general purpose use.
arthurwolf@reddit
And ??
So ??
That's not how things are priced...
People need high performance AI compute hardware. A lot of it. More and more every week.
There's a lot of demand for this, there are quite a few startups going this kind of route.
For good reason. There's money to be made. Because massive amounts of companies have AI processing needs.
This fulfils that need, this is the supply to that demand, and it's priced at whatever price they expect their customers are ready to pay for it.
And in the AI space, "whatever customers are ready to pay for it" can be quite high.
"Is it general purpose" has very little to do with any of this. This is specialized hardware, specialized hardware can cost insane amounts of money.
When I purchase (or design) a CNC controller for a laser cutter, I don't worry about whether or not it can control my living room TV. I worry about whether it's good at its CNC controlling job.
I've designed a CNC controller that has been used in hundreds of thousands of machines, and I don't think "is it general purpose" has been brought up by anyone a single time.
It's specialized hardware; it's supposed to do whatever its specialty is, very well. That's its purpose...
This notion that an AI accelerator isn't "general purpose" enough is just incredibly weird...
remind_me_later@reddit
Delusional to the point of nonsense.
The GDDR6 RAM alone costs $8 per 2GB. That's already $128 right there.
Not to mention the other components on the board. For the p150a/b, each of the 4 QSFP-DD 800G connection ports adds another $100 to the cost, at minimum.
The miscellaneous power supply & voltage regulator chips & capacitors tack on another $200.
__JockY__@reddit
If it’s fast then take my money.
SashaUsesReddit@reddit (OP)
They maintain quite a good fork of vLLM. Quite performant on Wormhole (haven't installed these yet).
BusRevolutionary9893@reddit
How's it compare to a 3090?
DepthHour1669@reddit
A bit better for inference, a lot worse for finetuning and other ML tasks (mostly because you want CUDA for that).
emsiem22@reddit
How is it better than a 3090 when it has 512 GB/s vs the 936 GB/s of a 3090?
DepthHour1669@reddit
Much faster than NVLink though, and you get limited by that for any model bigger than 12B at BF16.
emsiem22@reddit
Why use any model of that size in BF16 on a 3090? Quantized GGUFs are so much value for the VRAM (negligible quality loss down to 4-5 bits per weight, depending on the model) and much faster inference using llama.cpp with flash attention.
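For reference, a minimal llama-cpp-python sketch of that setup (a ~4-bit GGUF fully offloaded to a 3090 with flash attention on); the model path is a placeholder and the `flash_attn` keyword is assumed from recent llama-cpp-python releases.

```python
# Minimal sketch with llama-cpp-python: quantized GGUF fully offloaded to the
# GPU with flash attention enabled. Model path is a placeholder; the flash_attn
# keyword is assumed from recent llama-cpp-python versions.
from llama_cpp import Llama

llm = Llama(
    model_path="models/qwen3-32b-q4_k_m.gguf",  # hypothetical local file
    n_gpu_layers=-1,   # offload every layer to the GPU
    n_ctx=8192,
    flash_attn=True,   # the "fa" mentioned above
)

out = llm("Summarize why quantized GGUFs are good value for VRAM.", max_tokens=128)
print(out["choices"][0]["text"])
```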
DepthHour1669@reddit
Only if you're playing around at home.
If you're a director at a tech company, making decisions to justify to a VP about running inference locally, you do NOT go with Q4 (after spending your quarterly budget on training/finetuning your own model for internal use). That's just standard cover your ass business sense.
"It's good enough, home hobbyists do that" is NOT justification if the CEO comes down and asks why is it not working perfectly and notices that it's quantized. The next question invariably will be "ok, what does OpenAI/Google use?" and you do NOT want to be in that meeting when you get torn a new asshole.
emsiem22@reddit
So many things to comment on here :) But most interesting to me is my assumption that OpenAI does use quantized models, and that's the reason people sometimes report the model being stupid.
Second, it depends on the use case, even in a corporate environment. And CEOs usually don't know the difference.
Finally, we were talking about a 3090, so not a corporate setting.
moofunk@reddit
While the exact performance differences haven't been documented well between Nvidia GPUs and Tenstorrent chips, they have wildly different ways to move data, so it's hard to compare the bandwidth requirements directly.
TT chips can move data directly between Tensix cores on arbitrary other chips without ever touching main memory. On chip you never face the situation of having to move data through caches or main memory to get that data to another core.
On GPUs, you will have to do that, if you want to move data between compute units. Bandwidth becomes important, because there is simply more traffic.
Tensix cores are completely asynchronous in that they don't share any cache memory with other cores, so they don't have to wait on unrelated data destined for another core before they can start working.
All data movement is also fully software configurable.
That can mean you can have several completely independent NNs working on the same chip with them not affecting each other and with full utilization.
emsiem22@reddit
OK, I get the architectural advantage of TT, especially direct core-to-core data movement. But even with that, you still need to run all layers in sequence during inference (unless MoE). So activations are still passed around, and the 3090's higher memory bandwidth still matters (depending on model size and batching). Not even mentioning GGUF support with flash attention.
But OK, we need a real-life benchmark done by someone; hopefully OP reports back.
DepthHour1669@reddit
Because this is a professional device, not a home device. At-home inference is usually Q4, but actual AI companies usually do full BF16 (or Q8 inference at worst).
And I guarantee you a correctly set up Tenstorrent system will do inference on a 64GB model like Qwen3-32B or a 140GB model like Llama 3.3 faster than a pile of 3090s, even if you NVLink them together. The 3090 will only win out for small 24B models or even tinier sizes.
BusRevolutionary9893@reddit
Is the memory bandwidth only 512 GB/s? No way inference is faster.
Yes_but_I_think@reddit
It's not faster, but it allows running larger models that aren't possible without dual GPUs or exorbitantly priced ones. 512 GB/s is 10-15 tokens/s for a 30GB model file.
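The back-of-the-envelope behind that estimate, assuming memory-bandwidth-bound decoding (each generated token streams roughly the whole model file through memory):

```python
# Back-of-the-envelope token rate from memory bandwidth: every decoded token
# reads roughly the whole weight file, so bandwidth / model size is the ceiling.
bandwidth_gb_s = 512      # quoted Blackhole GDDR6 bandwidth
model_size_gb = 30        # e.g. a ~30GB quantized model file

ceiling = bandwidth_gb_s / model_size_gb           # ~17 tokens/s theoretical max
low, high = ceiling * 0.6, ceiling * 0.9           # typical 60-90% efficiency

print(f"ceiling: {ceiling:.1f} tok/s, realistic: {low:.1f}-{high:.1f} tok/s")
```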
__JockY__@reddit
Double woah. Holy shit. I would love to see some numbers out of these.
SashaUsesReddit@reddit (OP)
Ill be sharing!
__JockY__@reddit
Take my upvote, good person/woman/man/camera/tv.
Ulterior-Motive_@reddit
That's not *super* impressive. You could get an MI100 for the same price and capacity, and better support.
__JockY__@reddit
The MI100 doesn’t come with low-latency high-throughput interconnects though. These bad boys link into a giant fast pool of VRAM via 800GB/s Ethernet.
sdkgierjgioperjki0@reddit
Do the $1000 Blackhole cards really come with those interconnects? I thought they were sold separately and are extremely expensive.
Also I'm pretty sure it's 800Gb/s as in gigabit, not byte. So it would be 100GB/s, which is slow for VRAM.
pulsating_star@reddit
No, it is 800GB/s (or 400GB/s in one direction, if you're looking for a VRAM-like solution). There are 4 bi-directional 800Gb/s ports.
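For clarity, the bit/byte arithmetic behind those figures, assuming the 4-port spec quoted above:

```python
# Unit check for the interconnect numbers above: 4 ports x 800 Gbit/s each.
ports = 4
gbit_per_port = 800

total_gbit_per_dir = ports * gbit_per_port      # 3200 Gbit/s each direction
total_gbyte_per_dir = total_gbit_per_dir / 8    # 400 GB/s each direction
aggregate_gbyte = total_gbyte_per_dir * 2       # 800 GB/s counting both directions

print(total_gbyte_per_dir, aggregate_gbyte)     # 400.0 800.0
```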
moofunk@reddit
Data can take many paths via software configuration, so VRAM for one chip can still be bottlenecked.
__JockY__@reddit
Yes the card comes with those interconnects according to the specs. And you’re right, I typo’d bytes instead of bits, my bad.
SashaUsesReddit@reddit (OP)
The MI100 can't do native BF16 or FP8 activations... it's way behind for this workload.
The compute speed difference is enormous.
xxPoLyGLoTxx@reddit
It seems 28GB is $999 and 32GB is $1399.
__JockY__@reddit
Yeah, if the 32GB variant even remotely performs like a 3090 or A6000 for inference then… wow…. take my money.
xxPoLyGLoTxx@reddit
I mean yeah it would definitely be a better deal.
Have you seen the AMD MI50 Instinct cards? They seem to have 32GB and go for around $500-$1k. I'm not sure how great they are though.
Direspark@reddit
Very interested after seeing that Maxsun is selling dual B60s for $3k apiece. Would be nice to be able to run some bigger models without emptying my bank account.
IngwiePhoenix@reddit
They are honestly just insanely pretty. xD
Been on their Discord for a while, still waiting to get my base platform sorted (= bought). Absolutely interested in doing some stuff with those - their t/s seems pretty good!
ready_to_fuck_yeahh@reddit
RemindMe! -4 day
redditor100101011101@reddit
ooooo....yours??
Double_Cause4609@reddit
Making reviews is really hard.
Tenstorrent's software is in a big state of flux ATM, so performance is pretty asymmetric (some things are well supported, some aren't, etc), and there's also a bit of nuance that the benefit of Tenstorrent isn't really the performance of a single card but the way they scale.
They scale way more gracefully than GPUs due to their native networking, so it's kind of like... "Well, how the hell do you even review this thing?" and it's almost its own category of product.
pier4r@reddit
Couldn't one start with benchmarks? Ones as portable as possible across platforms, say llama.cpp token generation.
Test it with one card, then with two, then with 3-4 and so on. Do the same on other platforms.
It is not a full review (and which review is?), but at least it gives an indication. Otherwise it feels like "everything is unknowable".
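A minimal sketch of that kind of scaling benchmark, assuming each card-count configuration is exposed behind an OpenAI-compatible endpoint (the URLs and model name below are placeholders, not real deployments):

```python
# Minimal scaling-benchmark sketch: hit an OpenAI-compatible endpoint for each
# 1/2/4-card configuration and record generation throughput. Endpoints, model
# name, and prompt are placeholders to adapt to the actual setup.
import time
from openai import OpenAI

CONFIGS = {                      # hypothetical endpoints, one per card count
    "1x p150b": "http://localhost:8000/v1",
    "2x p150b": "http://localhost:8001/v1",
    "4x p150b": "http://localhost:8002/v1",
}
PROMPT = "Write a 300-word summary of how speculative decoding works."

for name, base_url in CONFIGS.items():
    client = OpenAI(base_url=base_url, api_key="dummy")
    start = time.time()
    resp = client.chat.completions.create(
        model="served-model",    # placeholder served model name
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=512,
    )
    elapsed = time.time() - start
    print(f"{name}: {resp.usage.completion_tokens / elapsed:.1f} tok/s")
```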
Double_Cause4609@reddit
Well, here's the issue:
There's too many independent variables.
So, let's say you're in the market for a 32GB card. Your options are basically Blackhole, RTX 5090, and MI60. None of those are directly comparable. The MI60 is cheap and has software support issues. The RTX 5090 is great, but draws a ton of power. Blackhole is reasonable in power use and has a transparent software stack (so a person could reasonably be expected to fix issues or contribute to the stack), but it has lower bandwidth than the others (off the top of my head) while having comparable or greater compute.
So, single card:
At low token count, you'd expect the RTX 5090 to win by about 3x.
At high token counts, software implemented perfectly, you'd expect the Blackhole to be a bit faster (note that I believe it can hit a higher utilization with fewer software optimizations, to my knowledge). Note that this isn't as cut and dry as it sounds, because that's only without KV-Caching. With it, token decoding eventually becomes memory bound again, but it becomes hard to evaluate where the crossover point is that no KV-Cache starts losing to KV-Caching when you have two different devices that might operate at different points on a pareto curve.
But, there's an issue: What if you do two GPUs versus two Blackholes? The Nvidia GPUs would have some communication overhead, and how much depends on waaaaay more things than you'd think, while the Blackholes scale pretty faithfully. Especially if you're doing it on a consumer system (PCIe x8) it'll start to show favorably on the Blackholes. Then, if you go to four, I'm pretty sure at that point the Blackholes might very well be faster for tensor parallel or pipeline parallel operations.
But then, you could switch the LLM being tested (keep in mind there's hundreds nowadays), and you could randomly find an LLM where the Blackhole is 1/10 of the 5090, and there's also I think one random LLM where the Tenstorrent stack outperforms the theoretical max speed in token decoding for some gods forsaken reason I can't be bothered to dive into the implementation to figure out.
So, anyway, you're looking at this really weird manifold, where the cards trade back and forth, and you always have to consider each result against a variety of situations.
In the end, you just buy them for different reasons. It's not really a simple sort of calculus, and you kind of have to consider them holistically.
pier4r@reddit
I see your point but I think the review you talk about is the perfect one, and perfection is always the enemy of good (because it is rarely achieved).
One can simply bench things and add caveats. Further, it's not as if cards exist in a vacuum: all the software and hardware stacks needed to make multiple cards work together are part of the system under test. Sure, there are plenty of possible configurations, but one tests ONE configuration and gives an idea.
Otherwise we won't be able to say anything. As you say there are a tons of LLMs. Can one say that every LLM works faster on one system rather than another? Unlikely, but (a) one has to start from something and (b) there are samples for this. If user A makes a review and then user B adds to it and user C adds to it again, slowly it builds up. If no one starts because "I cannot test all possible combinations", then there is never progress.
One can apply this to everything. Imagine a review of a car. It depends on so many things: how one uses it (for shopping? For traveling? For commuting? A mix of those? What does the mix look like? Etc.), the budget, the maintenance, the type of roads used, the cost of gas, the ability of the driver, the availability of good mechanics near the customer, the availability of parts, etc. With your approach no one should ever review a car, but one has to start somewhere. All reviews are partial, but not doing them at all because they aren't perfect is not better.
phormix@reddit
I agree with both of you in this regard. I'd much rather have a reasonable performance comparison than none, but there are issues with truly aligning the outside factors of supporting hardware, OS, software/library/driver versions, benchmark metrics, etc
For gaming (video) cards, an established company or reviewer might be getting demo cards from NVidia, AMD, Intel and others and then doing the review on the same OS+environment build, base hardware, and the testing suite/parameters/etc.
At the moment, I haven't really seen much in the way of mainstream standardized reviews for AI on even just video cards (because not many people are buying multiples of the 5090, 7900XTX, 9070XT, B5xx, etc.), and there's also not really a well-standardized test suite, given the varying levels of software support for the hardware and the many different sizes and shapes of models out there.
Actually the latter would be a good place to start. I think as things advance we might start seeing bootable images or containers which handle the stack and have a common test-suite that should run on a fairly wide variety of common hardware and models.
moofunk@reddit
Tenstorrent do provide a few benchmarks for various hardware setups and their projected targets for performance with the finished software. Most benchmarks are for the older Wormhole, not Blackhole.
At the moment, a factor is that not all low level functions are implemented, which can strongly impact performance.
segmond@reddit
Reads like an excuse to me; it's all about the price-to-performance ratio. The hardware implementation is completely irrelevant.
austhrowaway91919@reddit
You're not wrong, but the nuance is where do you sample the price to performance - against a single consumer GPU? Against a rack of them?
But ultimately I actually agree with you. It's not hard to review this card for a consumer. "This is what it costs, this is the firmware we're running, these are some standard benchmarks."
gtek_engineer66@reddit
Well you clearly do both. 1v1 2v2 4v4 to show the difference.
SashaUsesReddit@reddit (OP)
That's the plan! I have most GPUs available in quantity.
gtek_engineer66@reddit
And if you're bored of them you can send them to me, it's really no problem! I'd carry that burden for a friend.
bick_nyers@reddit
I would be curious about the pytorch experience. Could be an avenue for opening up inference to more models e.g. MoE models.
MoffKalast@reddit
Looks like this is what's available: https://github.com/tenstorrent/tt-torch?tab=readme-ov-file
The 4x p150c setup apparently gets about 25.2 t/s generation for QwQ, which seems a bit low.
bick_nyers@reddit
I saw that re: tenstorrent repo.
I'm wondering if it's as simple as pip install & change the device name in the PyTorch code and it "just works", even if perhaps unoptimized performance-wise.
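Purely as an illustration of the hoped-for "drop-in device" experience, not of what tt-torch actually exposes (the TT device name below is hypothetical, not a documented API):

```python
# Illustration only: the "just swap the device name" experience being hoped for
# above. The TT device/backend name is hypothetical; tt-torch's real entry
# point may be a compile backend or something else entirely.
import torch
import torch.nn as nn

device = "cpu"  # hypothetically replaced by a TT device name if/when supported
model = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 10)).to(device)
x = torch.randn(8, 512, device=device)
print(model(x).shape)
```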
HilLiedTroopsDied@reddit
geohot's videos made it clear that they need to refine the software and their docs a lot
redditor100101011101@reddit
out of curiosity, where did you buy them? direct?
TheRealMasonMac@reddit
https://tenstorrent.com/hardware/blackhole
hak8or@reddit
Looking at their specs page for these, looks like these are quite fixed function, meaning they don't support quants of INT4 or floating point smaller than 8 bits? The smallest quants are basically 8 bits from what I see?
https://docs.tenstorrent.com/aibs/blackhole/specifications.html#data-precision-formats
SashaUsesReddit@reddit (OP)
FP2, FP4 and FP8.
INT quants aren't super popular in production inference... and I hope consumer inference catches up to this.
FP8, and especially FP4, inference has the memory advantages to match INT while providing much faster activations.
FP4 vs INT4 speed on Nvidia, for example, isn't even close.
bick_nyers@reddit
Block FP4 could be super promising as a quant format for inference performance assuming it's similar to MXFP4/NVFP4.
pulsating_star@reddit
Block FP4 on Blackhole is essentially MXINT4, if such a format existed in the MX* spec.
Sm0oth_kriminal@reddit
Tenstorrent is interesting, but I have yet to see any real use of them even from hobbyists. Please update when you are able to actually run it!
pulsating_star@reddit
You can try it out in cloud/datacenter: https://www.koyeb.com/solutions/tenstorrent
TailorImaginary3629@reddit
"You can watch this https://youtu.be/lNGFAI7R0PE where George Hotz is tinkering with the Blackhole card. Final verdict — promising architecture with great potential, but not usable ATM, and not great value for the money. The 4090 and even the 7900XT are better choices
pinkfreude@reddit
What are you planning on running with it?
lqstuart@reddit
lmk when they support flash attention
randomqhacker@reddit
Tokens/second on Qwen3-32B and Mistral Large or it didn't happen!
xxPoLyGLoTxx@reddit
Yes post some performance measures!
vulcan4d@reddit
Never heard of these.
Comfortable-Rock-498@reddit
The most interesting thing about the company is their CEO, a legendary figure: https://en.wikipedia.org/wiki/Jim_Keller_(engineer)
InsideYork@reddit
I love how x86-64 is just Jim Keller vs. Jim Keller: Intel Core, and now the new Intel E-cores, vs. Ryzen.
ikergarcia1996@reddit
Not only x86-64; before that he worked for Apple and designed some of the first Apple silicon chips. The man competes against himself.
InsideYork@reddit
I didn’t know how to throw in the A4 or the Tesla AI vision.
beryugyo619@reddit
Basically Batman if processor development was Batman movies
MrAlienOverLord@reddit
perf is worse than a 3090 tho... so nice as a research platform... but you are super limited and bound to their ecosystem
SashaUsesReddit@reddit (OP)
Not in my findings based on Wormhole... I was able to match 3090 perf with the previous gen.
ThisWillPass@reddit
Can’t wait to see the prompt processing speeds on these thangs.
MrAlienOverLord@reddit
You will miss most of the ecosystem... and for inference, odds are a used 3090 is more efficient... power draw is the same and you actually have TP support. I like what they are doing, but TT needs 2-3 gens (aka at least 2 years) until this becomes actually somewhat usable for most.
Unless you only want inference... at a high price point... vs. used 3090s... I don't think it's a viable option right now. Also, the ecosystem is small vs. Nvidia's.
Long-Shine-3701@reddit
Do they work in a Mac Pro, and roughly which single card is in the same performance ballpark? I am currently considering 4x 3090 Turbo or 4x Radeon Pro VII (both with appropriate links).
Prudent_Sentence@reddit
I think these are supported on Linux only:
Operating System: Ubuntu 22.04 (Jammy Jellyfish)
MoffKalast@reddit
CPU: x86_64
Wouldn't work on any ARM Mac even if you could run Ubuntu.
moofunk@reddit
Software to do the same as with 4 3090s isn't going to be ready for a while, maybe a year or more. Their current weakness is a software ecosystem in flux. Too much stuff that doesn't work yet and too much stuff that isn't documented.
SashaUsesReddit@reddit (OP)
Have you run them? Not sure I agree with that based on my work with Wormhole.
moofunk@reddit
I have not, but as far as I understand, Wormhole and Blackhole have different maturity and support levels and Wormhole is still the most mature of the chips.
My_Unbiased_Opinion@reddit
If this thing gets supported on Ollama or llama.cpp, this might be something I buy. I only really care about inference.
SashaUsesReddit@reddit (OP)
Why not vllm?
MoffKalast@reddit
Impossible to get working, poor integration with frontends, and support for new models takes a lot longer. Plus llama.cpp's weakness at high batch sizes is irrelevant for personal use.
wh33t@reddit
I'm still waiting for a frontend like KCPP to exist that can use vllm. To my knowledge nothing like this exists.
SashaUsesReddit@reddit (OP)
Any front-end that uses an OpenAI-compliant API works! E.g. Open WebUI.
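A minimal sketch of why that works: anything that can POST to an OpenAI-style chat-completions endpoint can talk to a vLLM server (the port and model name below are placeholders):

```python
# Tiny sketch: any client that speaks the OpenAI chat-completions API can talk
# to a vLLM (or vLLM-fork) server. Port and model name are placeholders.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "served-model",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```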
wh33t@reddit
Hrm. https://github.com/LostRuins/koboldcpp/issues/300
Doesn't appear it's a priority at all.
SashaUsesReddit@reddit (OP)
Not really sure what that is trying to convey exactly...
My_Unbiased_Opinion@reddit
A lot of day-one support for models lands on Ollama these days. I haven't used vLLM, so I don't know how easy it is to use these days. I'm a nurse by trade, not IT.
I might try vLLM one of these days. I hear it's faster.
MoffKalast@reddit
Ironic.
AnomalyNexus@reddit
Nice! Hope you make another post when you've tested them a bit
StyMaar@reddit
The p300 still isn't out unfortunately :/
__JockY__@reddit
I see these are passive variants. How are you going to cool them?
SashaUsesReddit@reddit (OP)
Server chassis cooling, proper GPU server
DunklerErpel@reddit
I am really looking forward to your review!
We were SOOOO close to buying several TT Galaxies but had quite a bad experience with their sales; the price of one TT Galaxy jumped in the sales meeting from $40k to $120k due to some calculation error. Among other... mishaps...
But I'm still mulling over buying a Quietbox for myself :)
Dizzy_Season_9270@reddit
they kinda look beautiful
Thick-Protection-458@reddit
RemindMe! 3 days
additionalpylon1@reddit
RemindMe! -1 day
RemindMeBot@reddit
I will be messaging you in 1 day on 2025-07-03 03:16:32 UTC to remind you of this link
lompocus@reddit
Nvidia has some interesting features that eventually became the nvgpu dialect of LLVM MLIR. Where's a good place to read about the equivalent for these cards?
Active_Change9423@reddit
They did open source their compiler.
Affectionate-Cap-600@reddit
What architecture do those cards use?
moofunk@reddit
Each chip is organized as a grid of what are called Tensix cores, each with SRAM and 5 baby RISC-V cores. 3 of the cores interface with a vector engine that works out of SRAM.
The 2 remaining cores move data from each Tensix core to the others. There's also a set of 5 or 10 full-size RISC-V cores to help move data on and off chip from DRAM.
Each chip gets 28-32 GB of GDDR6 DRAM.
Data moves around like chess pieces on a board. Much attention is paid to economical data movement.
On the edge of each chip are 10 800-Gbit Ethernet interfaces to connect to other chips, other cards, and other computers, and that's how they currently scale up to 32 chips and later 256 chips. No additional hardware is needed for scaling other than Ethernet ports and cables.
The software sees a collection of chips as adjacent chess boards and coordinates them all as one big chip.
Their Tenstorrent Galaxy server contains 32 chips and 1 TB of aggregate memory.
In theory, they should be able to scale to thousands of chips, but there are some big software challenges to overcome.
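A toy mental model of that "adjacent chess boards" picture, with no relation to the real TT-Metalium/TT-NN APIs: chips as nodes in a mesh, Ethernet links as edges, all addressed as one logical grid.

```python
# Toy mental model only (not the actual Tenstorrent software stack): chips as
# nodes in a mesh connected by Ethernet links, addressed as one logical grid.
from itertools import product

ROWS, COLS = 4, 8                       # e.g. a 32-chip Galaxy-sized mesh
chips = list(product(range(ROWS), range(COLS)))

def neighbors(chip):
    """Chips reachable over a direct Ethernet link in the mesh."""
    r, c = chip
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    return [(r + dr, c + dc) for dr, dc in steps
            if 0 <= r + dr < ROWS and 0 <= c + dc < COLS]

def hops(src, dst):
    """Link hops between chips; data never has to bounce through host memory."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

print(len(chips), neighbors((0, 0)), hops((0, 0), (3, 7)))
```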
Affectionate-Cap-600@reddit
really interesting, thanks!
saucepan-ai@reddit
Would be super excited to see throughput benchmarks for these on the Tenstorrent vLLM fork!
SashaUsesReddit@reddit (OP)
Ill share!
__JockY__@reddit
lol wen eta
SashaUsesReddit@reddit (OP)
Probably a few days to make sure I have everything setup correctly to have accurate data
__JockY__@reddit
Haha I’m just messin, but I appreciate the responses
SashaUsesReddit@reddit (OP)
ofc! I'm super excited to see the data too haha
SnooRecipes3536@reddit
CONSUMPTION
SashaUsesReddit@reddit (OP)
What does it mean
SnooRecipes3536@reddit
Well, seeing that I have to elaborate: basically, access to high-speed NICs is normally pretty bad above 100Gbps. A used Mellanox ConnectX-4 (100Gbps) has a moderate secondhand price of around $75-120, while a USED 200Gbps Mellanox ConnectX-7 starts at $500 and goes all the way up to a thousand. You'd ask how? Well, 100Gbps cards are from the PCIe Gen 3 generation, 200Gbps cards are from PCIe Gen 4, and most servers currently still use PCIe Gen 5, while 800Gbps cards are reserved for the hyper-high-end PCIe Gen 6, which somehow already released a couple of years ago. The main trick these cards use is a kind of on-card DPU that lets their line rates go that high while the x16 PCIe slot stays relatively slow. So, now seeing brand-new cards capable of 800Gbps per port, with SEVERAL ports, on the $1400 variant? Do you have any idea how much of a game changer this is?
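Rough per-generation PCIe numbers behind that argument (approximate per-direction x16 throughput, ignoring encoding overhead):

```python
# Approximate per-direction PCIe x16 bandwidth by generation (GB/s, ignoring
# encoding overhead), converted to Gbit/s to compare against NIC line rates.
pcie_x16_gb_s = {"Gen3": 16, "Gen4": 32, "Gen5": 64, "Gen6": 128}

for gen, gbytes in pcie_x16_gb_s.items():
    print(f"PCIe {gen} x16 ~ {gbytes * 8} Gbit/s")

# A single 800 Gbit/s port already saturates anything below Gen6 x16, which is
# why multi-port 800G cards lean on on-card switching/DPU tricks rather than
# pushing every byte over the host's PCIe slot.
```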
SashaUsesReddit@reddit (OP)
It's really what excites me so much about these. The scalability potential is fantastic.
SnooRecipes3536@reddit
Exactly dude, we just got network cards that are actually worth it, after having to suffer through so many bad old cards being overpriced for no reason for YEARS.
LicensedTerrapin@reddit
I assume wattage under load?
SashaUsesReddit@reddit (OP)
I can't tell if it's a nothing post, an actual question about power consumption, or a political statement about buying things.
LicensedTerrapin@reddit
Let's just assume it's about power consumption
SashaUsesReddit@reddit (OP)
I'm good with that haha, I'll post after I get them running.
tommy1691@reddit
Is this the poor man's Nvidia A100?