First time using the MareNostrum V Supercomputer, writeup of what actually surprised me coming from cloud
Posted by Georgiou1226@reddit | programming | View on Reddit | 35 comments
__calcalcal__@reddit
Very beautiful, but when the Barcelona Supercomputing Center tried to create LLMs for the languages of Spain, it failed to deliver anything of value.
https://www.xataka.com/robotica-e-ia/arranque-alia-modelo-ia-espanol-ha-sido-erratico-decepcionante-ahora-sabemos-que
The BSC has also been the subject of legal proceedings over suspected misuse of funds.
https://caliber.az/en/post/eu-prosecutors-probe-spain-s-first-quantum-computer-over-suspected-fund-misuse
axonxorz@reddit
Very relevant points, in r/programming, indeed.
__calcalcal__@reddit
The same relevance as describing the building the computer is housed in.
axonxorz@reddit
So none, understood.
__calcalcal__@reddit
Context for understanding that the organization is at the center of some scandals. If that's not relevant to you, what can I say; you need to think from time to time.
axonxorz@reddit
I'm failing to see how the political funding drama at the heart of this organization is relevant to HPC scheduler architecture. Perhaps I'm just thinking about HPC scheduler architecture too much to care about everything an org has done wrong.
shellac@reddit
There's a small mistake in the air-gap diagram ('HPC Environment'): login nodes can often access the internet, but compute nodes can't. A lot of my (wall-clock) processing time seems to go to getting data on and off the compute storage. Once it's there everything flies, but scratch space isn't safe storage (despite what users think).
BTW, Slurm is amazingly capable. One trick I discovered fairly recently is its ability to spin up cloud compute nodes as required and shut them down again when they're no longer needed.
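(For anyone curious, that's Slurm's power-saving / cloud-node support. A minimal slurm.conf sketch; the node names, paths, and sizes here are made up:

    # Nodes that exist only when Slurm asks a script to create them
    NodeName=cloud[001-010] State=CLOUD CPUs=16 RealMemory=32000
    PartitionName=cloud Nodes=cloud[001-010] MaxTime=24:00:00 State=UP

    # Your own glue scripts that provision/terminate the instances
    ResumeProgram=/opt/slurm/bin/cloud_resume.sh
    SuspendProgram=/opt/slurm/bin/cloud_suspend.sh
    ResumeTimeout=600
    SuspendTime=300    # seconds idle before a node is shut down

Slurm calls the resume script when a queued job needs one of those nodes, and the suspend script once a node has sat idle past SuspendTime.)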
Georgiou1226@reddit (OP)
You're right for most clusters, and I assumed so too, but MareNostrum V is unusually strict: both compute and login nodes are fully air-gapped. Everything needs to be pushed/pulled from your local machine.
shellac@reddit
Their users must have really irritated the admins. Probably running jobs on the login nodes. I can sympathise.
victotronics@reddit
"A fat-tree topology [...] guarantees non-blocking bandwidth: any of the 8,000 nodes can talk to any other node at exactly the same minimal latency."
That is slightly optimistic. You can still have contention, and since networks typically have over-subscription, it's quite likely. InfiniBand never quite sorted out dynamic routing; at least we never got it to work convincingly. Hence static routing, hence contention. But the resulting bandwidth is pretty impressive anyway.
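(A purely illustrative example of what over-subscription costs you: a leaf switch with 48 node-facing ports but only 24 uplinks is 2:1 over-subscribed, so if every node behind it sends across the switch at the same time, each one sees at best half of its link bandwidth. And with static routing, two flows can land on the same uplink and collide even when the fabric as a whole is mostly idle.)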
IllllIIlIllIllllIIIl@reddit
As an HPC engineer, I enjoyed this. I thought it was a pretty good little intro into the world of HPC, and I appreciated hearing the perspective of someone new to it. I've been in HPC for over a decade now and it's easy to forget how unusual it can feel to new users.
Georgiou1226@reddit (OP)
Thanks so much! I'm really glad my perspective on it was interesting to read
JustOneAvailableName@reddit
module was a big letdown for me; it feels easy but actually took way more time than using Docker images via Apptainer. I kinda assumed the linked binaries would be cached locally (or in memory), and it was very surprising that a mysterious slowdown turned out to be connected to ffmpeg calls.
Overall, I liked Slurm a lot more than K8s for job-based computation.
tecedu@reddit
Modules are, bar none, the worst. They should be enforcing Apptainer on any new projects/builds, yet people go "oh no, Docker is too complex". It's infinitely less complex than modules on NFS, and Apptainer is much more performant.
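(A minimal sketch of the difference inside a batch script; the image name, file names, and resource numbers are hypothetical:

    #!/bin/bash
    #SBATCH --job-name=encode
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4
    #SBATCH --time=00:30:00

    # module approach: binaries and libraries resolved from a shared module tree,
    # typically sitting on NFS/Lustre, on every invocation
    #   module load ffmpeg/6.0
    #   ffmpeg -i input.mp4 output.webm

    # container approach: everything read from one self-contained image file
    apptainer exec ffmpeg.sif ffmpeg -i input.mp4 output.webm

One image file on the parallel filesystem also tends to behave better than thousands of small shared-library reads.)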
AstroworldMC@reddit
Reading this made me feel better about wrangling a modded Minecraft server. HPC folks deal with modules, schedulers and hundreds of cores, I'm just happy when my TPS stays above 15. I tried to run a big modpack on a tiny VM once and it felt like waiting for MareNostrum.
Zulban@reddit
I worked in HPC IT ops for 6 years. This was an unusually good technical introduction.
Georgiou1226@reddit (OP)
Thank you! That's a massive compliment coming from a professional.
fgorina@reddit
It reminds me of when, at the UAB, we had to use a Univac in Madrid. We did something similar with a Job Control Language holding the instructions of what to do: put the JCL + program + data on punched cards, leave them in a tray, and wait until tomorrow. We'd get back a listing with either errors or the result. Of course it wasn't a supercomputer, but the idea…
cinyar@reddit
I mean ... does any shared computing platform tolerate loitering? Jenkins will kill my job when it reaches the timeout; I wouldn't expect anything less from a supercomputer.
atxgossiphound@reddit
Yes? AWS makes a fortune from people "loitering" (forgetting to turn off their instances). That platform loves loitering. :)
A decent devops team will protect an application from loitering, but the platforms don't care as long as the meter is running.
I go back and forth between HPC and cloud, and every year or two I go through the exercise of moving a client's scientific code off the cloud and onto a local HPC system. It usually pays for itself within the first 6 months. And contrary to popular belief, administering an HPC system once it's up and running is not expensive. You do need a room with good AC and fire suppression, though!
In these cases, the cost savings aren't from avoiding idle nodes; they come from the huge premiums cloud providers charge for compute. If you run it all the time behind a firewall, it's cheaper to own it.
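(Back-of-the-envelope, with purely hypothetical numbers: break-even time is roughly purchase price divided by (on-demand hourly rate × ~730 hours/month). A node you could buy outright for $35k versus a comparable instance at $7/hour running 24/7, i.e. about $5k/month, breaks even in roughly seven months, before you even count storage and egress.)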
tecedu@reddit
And don't forget about the storage: extortionate pricing for such poor-performing storage.
atxgossiphound@reddit
Don't get me started on FSX for Lustre...
Here's a fun story: I worked with a client that used to brag that they were one of AWS's top 10 customers. This was a gene sequencing company, sequencing at scale, so their cost was almost entirely storage. Their original CIO had made them a cloud-first company, so they had dozens of petabytes of data in AWS.
They needed to get off AWS before they burned through all their cash.
They were cagey about the costs at first, but I did a quick calculation based on how many sequencing runs they did. I casually said, "say you're spending $5M/month (not the real number, but the general ballpark) and you have 25 PB of data currently. That's about $25M in egress to get off AWS, or 5 months of staying on AWS."
They got a little uncomfortable until someone just admitted that those numbers were about right and they didn't know what to do.
They ultimately started a migration project to a different cloud vendor to do a private cloud, but before they made it too far the market had changed enough that they pivoted and didn't really pursue that part of the business. I have no idea what happened to all their data.
Successful-Money4995@reddit
Part of the advantage of renting from Amazon is that you don't have to upgrade your hardware. My clients are always looking for new applications for their "aging" hardware that's only five years old.
atxgossiphound@reddit
I've seen 10 year old clusters humming along nicely. I've never seen a computational scientist turn down hardware.
That said, 5 years is the number I always use when building out clusters. They just usually end up running much longer.
Successful-Money4995@reddit
GPUs for AI age extra fast because of how quickly the new technology is evolving.
My clients want the new chips because their clients want the new chips, and their clients have boatloads of cash to afford them because AI money is plentiful.
tecedu@reddit
A bunch of resellers and vendors nowadays have "leasing" schemes where you pay opex instead of capex, and they handle the upgrades, deprovisioning, and all of those things automatically.
bargle0@reddit
That's the catch, though: that's not cheap.
atxgossiphound@reddit
But it really isn't a gotcha. It's just part of the cost tradeoff. Most sites with lab space already have capacity to add a server room (or they have one that was decommissioned when someone else moved things to the cloud :) ).
tecedu@reddit
Doing it all from scratch, no, but doing it in an existing data centre or colo space, it's very cheap.
slaymaker1907@reddit
The big advantage of cloud is that you have capacity as soon as you have budget. On-prem compute generally requires a huge lead time, and that lead time encourages teams to hoard and massively over-provision. By massive over-provisioning, I don't mean provisioning for huge compute days like Black Friday; I mean things like getting twice as many dev machines as you actually need, "just in case" your dev demand increases in the medium to long term.
atxgossiphound@reddit
You can do that without the cloud, too. I've never had a problem calling my rep at Dell to get more hardware when the budget becomes available. Sure, the lead time is a little longer, but the budget is spent on hardware that's going to be used 24/7 instead of burned quickly renting hardware.
This also hits on another disconnect between HPC and commodity cloud hardware. HPC systems are optimized for system-level performance, be that compute, storage, bandwidth, or some combination of those.
Cloud infrastructure isn't optimized for these use cases (with the caveat that the AI push is changing the landscape a bit), especially on the network and storage side of things. And for just straight compute, running simulations 24/7 on AWS can easily lead to bills in the $50k range for simple studies. Once people hit that number, they're better off buying hardware (well, that was true until we decided AI is the only application that matters anymore ;) ).
tecedu@reddit
Not for this type of workload, at least: unless you have quota reserved, you can't actually spin up easily, especially for high-performance compute.
And define "huge lead time": the machines in cloud quota, i.e. normal machines, can be delivered within 2 weeks; faster ones take more time. Even with over-provisioning, buying physical is cheaper.
sailing67@reddit
ngl the jump from cloud to HPC is always wild. what surprised you most?
Irregular_Person@reddit
bad bot
ng37779a@reddit
It's interesting how cloud has trained us to throw resources at problems and optimize later, whereas supercomputers force you to think in terms of fixed allocations and queue-time costs upfront. I think many cloud developers would struggle with the constraint-first thinking HPC demands, but it's good to learn!