Polars code runs slower on 128-core EC2
Posted by Popular-Sand-3185@reddit | Python | View on Reddit | 61 comments
Disclaimer: I am not sure this post is appropriate for r/LearnPython since it's not a question of "how to do something in Python"; rather, I am looking for a lower-level discussion of why my Python application performs poorly on a significantly more powerful server. Hence I'm posting it here.
The problem:
I have a relatively complex data pipeline that is written in Polars. On my local machine with 12 cores, the pipeline finishes in about 1200ms. On my 128-core EC2, it takes 13000ms to complete. I have tried setting the POLARS_MAX_THREADS parameter to 12 on the EC2, and it's still slower.
I am using a TMPFS partition on both machines to read the data into the pipeline directly from RAM. Both my machine and the EC2 have DDR5 RAM so I think they should be comparable.
Anyone have any ideas why the pipeline would run much slower on the EC2?
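For reference, a minimal sketch of the thread-cap setup, assuming the environment-variable route (POLARS_MAX_THREADS is only read when Polars is first imported, so it has to be set beforehand):

```python
import os

# Must be set before polars is imported; the thread pool size is fixed
# at import time and cannot be changed afterwards.
os.environ["POLARS_MAX_THREADS"] = "12"

import polars as pl

# Confirm the cap actually took effect (older releases call this threadpool_size()).
print(pl.thread_pool_size())
```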
john_crimson81@reddit
numa. almost definitely. pin to a single node with numactl first, see if you reproduce local performance. if yes you have your answer. if no it's something else but i'd start there
Henry_old@reddit
you are hitting the wall of Amdahl's law and memory bandwidth. throwing 128 cores at a single dataframe operation is just creating massive overhead for thread synchronization and cache coherence traffic. polars is fast because of vectorization and efficient memory layout not because it can magically scale linearly to a supercomputer. 128 cores is for distributed workloads not for a single in-memory table operation bottlenecked by ram throughput.
s4nc3s@reddit
what I have success with, in a CPU-bound process, is to split the inputs across as many CPUs as are available (os.cpu_count()) and then use concurrent.futures to spread the work across whatever is available, then combine it all at the end… and yes, this is with polars LazyFrames and pyarrow/parquet data.
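A rough sketch of that pattern, assuming Parquet inputs and one file per task (the paths and helper names are illustrative, not the commenter's actual code):

```python
import os
from concurrent.futures import ProcessPoolExecutor

import polars as pl

def process_chunk(path: str) -> pl.DataFrame:
    # Each worker lazily scans one file and collects it independently.
    return pl.scan_parquet(path).collect()

def run(paths: list[str]) -> pl.DataFrame:
    # Fan the files out across the available CPUs, then combine at the end.
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        frames = list(pool.map(process_chunk, paths))
    return pl.concat(frames)
```

On a 128-core box you would typically also cap POLARS_MAX_THREADS per worker so each process doesn't spin up its own full-size thread pool.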
GrouchyManner5949@reddit
NUMA is almost certainly the culprit. a c8i.32xlarge has multiple NUMA nodes and polars distributes threads across all of them by default. the moment dataframe memory spans nodes, every cross-node access adds latency and you pay it constantly. your laptop wins because everything lives in one memory domain. POLARS_MAX_THREADS=12 doesn't fix it because the threads still get scheduled across nodes. numactl with both --cpunodebind and --membind set to the same node id is the real fix. also the maintainer's point about the streaming engine is worth doing separately. I've been tracking benchmark configs as tasks in Zencoder when debugging environment-specific perf issues like this, makes it way easier to not re-test hypotheses you already ruled out.
Popular-Sand-3185@reddit (OP)
Alright so I did figure out why the pipeline was taking so long. Essentially, the code was reading 128 separate files and then concatenating them as part of the pipeline. This, as you would expect, took ~10x longer on the 128-core EC2 than on my 12-core workstation. I fixed it by concatenating all the files into one before loading it into polars instead of reading 128 files separately. Particularly, I used the following function:
engineerofsoftware@reddit
This is embarrassing. This is why engineers should learn how low-level systems work. How you did not identify this as the problem immediately is beyond me. You make Python engineers look bad.
Popular-Sand-3185@reddit (OP)
They must love you at your job
engineerofsoftware@reddit
Fortunate enough not to work with idiots.
Popular-Sand-3185@reddit (OP)
Ok boss meet me in the parking lot lol
artofthenunchaku@reddit
Calling a subprocess to do text processing via bash script rather than just reading and writing the files in Python is an insane solution.
max123246@reddit
In particular, the reason is that you pay a process-launch startup cost for each command in that string. head, echo, and xargs mean 3 process launches that were completely unnecessary.
A process launch may make sense in Python when the large majority of the work is computation; since Python is interpreted, it may very well be faster to call a program written in a compiled language.
I'd be very surprised if the current solution is faster than doing it in Python with its built-in standard-library file operations, though.
artofthenunchaku@reddit
in fact there are 4 process launches, since it's executing the command through a shell
Popular-Sand-3185@reddit (OP)
Ok so how would you do it faster in Python? We're talking gigabytes of data here. The overhead of launching a subprocess is minuscule when you get the benefits of using the coreutils, which are already optimized for this task.
artofthenunchaku@reddit
You're optimizing for the wrong thing. If all you're doing is reading and writing a file, the bottleneck isn't the CPU, it's the disk I/O. There's going to be very little difference in performance between Python and C/C++ in that case.
Popular-Sand-3185@reddit (OP)
As mentioned in the OP the files are being read from a ramdisk so it's not disk IO. I found the bottleneck was calling pl.concat on 128 separate dataframes, causing a CPU bottleneck. In my solution I just concatenate all the text, and then load it into a single dataframe.
I would think that if I optimized for the wrong thing, the solution wouldn't have made the script run 10 seconds faster
SuspiciousScript@reddit
os.sendfile is a better solution here since it lets you skip userspace entirely.
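A minimal sketch of that approach, assuming Linux and regular files (the helper name is made up): os.sendfile copies between file descriptors inside the kernel, so the bytes never pass through a Python-level buffer.

```python
import os

def concat_files(sources: list[str], dest: str) -> None:
    with open(dest, "wb") as out:
        for src in sources:
            with open(src, "rb") as f:
                remaining = os.fstat(f.fileno()).st_size
                offset = 0
                # Kernel-side copy: data moves fd-to-fd without entering userspace.
                while remaining > 0:
                    sent = os.sendfile(out.fileno(), f.fileno(), offset, remaining)
                    if sent == 0:
                        break
                    offset += sent
                    remaining -= sent
```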
venterce@reddit
Thank you for posting the update, it's great to see the actual issue.
commandlineluser@reddit
If you missed it, the comment asking if you can provide code is from the original author of Polars:
https://reddit.com/r/Python/comments/1td1790/polars_code_runs_slower_on_128core_ec2/ols2829
They will be able to provide specialized help, and are likely interested in finding out if this is a case where Polars can perhaps "do better".
ritchie46@reddit
Can you share the code you are running?
Popular-Sand-3185@reddit (OP)
Hey thanks for checking this out. So I was able to get some significant performance improvements, but it's still a bit slower on the large EC2 than on my workstation:
ritchie46@reddit
There is a lot happening here that's not Polars. I saw there were many subprocesses being spawned, which is certainly not cheap. What kind of files are you reading?
Best would be to let Polars handle the concurrency and parallelism.
I cannot see how you execute/collect the LazyFrame, but if you are on 128 cores you should definitely use the streaming engine here.
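As a rough illustration of that last point (the query here is made up), newer Polars versions select the streaming engine at collect time, while older ones used collect(streaming=True):

```python
import polars as pl

# Made-up lazy pipeline; the scan stays lazy until collect().
lf = pl.scan_parquet("data/*.parquet").filter(pl.col("value") > 0)

# Run it on the streaming engine instead of materialising everything at once
# (older polars releases used lf.collect(streaming=True) instead).
df = lf.collect(engine="streaming")
```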
321159@reddit
Just so you are aware: this is the maintainer of Polars
Free-Cheek-9440@reddit
Also worth checking if Polars is parallelizing too aggressively for the size of the dataset. Sometimes the coordination overhead becomes more expensive than the actual compute. I’ve had cases where limiting threads actually improved runtime because the tasks were too small to shard efficiently.
poopoutmybuttk@reddit
Are you using the streaming engine?
Popular-Sand-3185@reddit (OP)
yup
KandevDev@reddit
polars defaults assume cache-friendly working sets. 128-core boxes have NUMA, which means the moment your dataframe spans multiple memory nodes, you pay an enormous latency penalty per cross-node access. the laptop "wins" because everything fits in one memory domain. set POLARS_MAX_THREADS to 16 or pin to one socket, see if that recovers perf.
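If you want to try the pinning side from Python alone, a sketch assuming node 0 owns cores 0-15 (check with numactl --hardware or lscpu; Linux only):

```python
import os

# Restrict the process (and the polars threads it spawns later) to one
# node's cores. Unlike numactl --membind, this does not bind memory
# allocations, so it is only a partial fix for NUMA effects.
os.sched_setaffinity(0, set(range(16)))

os.environ["POLARS_MAX_THREADS"] = "16"
import polars as pl
```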
sirfz@reddit
Yep, run your script with numactl and pin it to close by cores, something like:
numactl --membind=0 --physcpubind=0-11 python script.py
This'll bind your process's memory to NUMA node 0 and its threads to CPUs 0 to 11 inclusive (12 cores). That's usually the structure, but you can run numactl --hardware to check
KandevDev@reddit
numactl is the right call. one caveat: --cpunodebind is one flag, --membind is another, and they need to be set to the same node id to avoid cross-node memory access. the "12 cores per node" assumption holds on most EC2 c-series but check numactl --hardware first since some instance types have unusual topologies (especially the bare-metal ones).
Popular-Sand-3185@reddit (OP)
Interesting, I will try this!
Cynyr36@reddit
Possibly a disk io speed issue in here as well. Locally it's likely a gen4x4 or faster nvme. Your cloud instance could be much slower.
Popular-Sand-3185@reddit (OP)
I would not think disk IO is the issue here, everything is running on a ramdisk
meatmick@reddit
Any chance there's a CPU cache effect here? You said it's a Ryzen 7; is it an X3D variant? That could impact the performance.
Lba5s@reddit
Are the cores on your machine faster? Or your code/data might not be large enough to deal with the overhead of distributing over 128 cores
Popular-Sand-3185@reddit (OP)
They are faster (5700MHz on mine vs 3900MHz on the EC2). Would that cause it to be off by a factor of 10? I did try reducing the core count on the EC2 with POLARS_MAX_THREADS=12.
axonxorz@reddit
Which EC2 SKU are you running this on?
Popular-Sand-3185@reddit (OP)
c8i.32xlarge. It's a shared instance, right now I'm trying to get my service quota increased so I can try a dedicated instance
axonxorz@reddit
Shared instance means my first-line assumptions are that you're having CPU time stolen and memory access overhead.
Run your workload and run top at the same time. The st field represents CPU time 'stolen' from your VM (against the theoretical 100%) for other VMs on the shared host.
That instance type runs a NUMA architecture, so memory access latencies will be inconsistent between cores. This alone is a source of great overhead compared to your local CPU, and unfortunately not one you can directly affect when running in a VM. You can at least see some stats using numastat.
I think your biggest boon will be getting that dedicated host. Once you have that, numactl can be used to ensure your workload is constrained to appropriate NUMA nodes.
One other point of comparison would be to run the workload on your local workstation in a VM to see what kind of overhead a simple VM layer brings.
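A small way to check steal from inside the instance, reading /proc/stat directly (the eighth value on the aggregate cpu line is steal time in clock ticks; the helper name is made up):

```python
import time

def steal_fraction(interval: float = 1.0) -> float:
    # /proc/stat "cpu" line: user nice system idle iowait irq softirq steal ...
    def sample() -> list[int]:
        with open("/proc/stat") as f:
            return [int(x) for x in f.readline().split()[1:]]

    before = sample()
    time.sleep(interval)
    after = sample()
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    return deltas[7] / total if total else 0.0

print(f"steal: {steal_fraction():.1%}")
```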
beragis@reddit
It might be the file system that was mounted. I saw similar performance differences, but not as extreme as yours, when I was testing a Python program running on my local laptop, an EC2 instance, and a Kubernetes pod running in AWS.
Using my local PC as a baseline, the same process with the same number of threads on EC2 ran at about 80% of the speed of my PC. Kubernetes was about 15 percent faster. The only thing I could think of was that Kubernetes was running on a dedicated instance. The CPU specs on the EC2 and Kubernetes were about the same as the PC. It ended up being the file system on the EC2 instance.
Popular-Sand-3185@reddit (OP)
The filesystem I'm using is mounted to a tmpfs partition both locally and on the EC2, so everything is being read directly from RAM
Not sure I'm following. Are you saying my EC2 might be backed by S3 somehow?
beragis@reddit
Not necessarily. I have seen S3 buckets mounted on Linux, which is what we use both to upload files to process and to store the results, which are then read into a Snowflake database.
The program itself pulls the files from S3 to temp. It was once the file was in temp that we ran the comparisons. The underlying EBS on Kubernetes had faster throughput than our EC2 instance, something like 4 to 8 times as fast.
Since you are using tmpfs, you are still effectively on EBS: tmpfs uses virtual memory, which will swap to disk if it needs to.
Since it's a shared instance, it is highly likely it's swapping memory to disk and back to memory.
thuiop1@reddit
Frequency is not the only factor for the speed of a CPU...
Deto@reddit
Yeah, but regardless it's unlikely that their single-core performance is going to be 10x different
Popular-Sand-3185@reddit (OP)
This response would probably be more helpful if you explained what those other factors are, how I can check them, and whether there's any course of action I can take to improve it
thuiop1@reddit
Giving the exact CPU models would help; it is already hard to diagnose remotely.
artofthenunchaku@reddit
"Throw more threads at it" is rarely the solution to performance improvements, there's many factors that come into place; cache size, lock contention, speculative execution. More I'm forgetting or not personally familiar with.
I'd suggest running multiple multiple benchmarks at different thread counts (1, 2, 4, 8, ..,) to understand where you see the biggest improvements or regressions.
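A throwaway sketch of that kind of sweep, assuming the pipeline lives in a script called pipeline.py (the name is hypothetical); each run gets a fresh interpreter because POLARS_MAX_THREADS is only read at import time:

```python
import os
import subprocess
import time

# Re-run the pipeline with different thread caps and time each run.
for threads in (1, 2, 4, 8, 16, 32, 64, 128):
    env = {**os.environ, "POLARS_MAX_THREADS": str(threads)}
    start = time.perf_counter()
    subprocess.run(["python", "pipeline.py"], env=env, check=True)
    print(f"{threads:>3} threads: {time.perf_counter() - start:.2f}s")
```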
ottawadeveloper@reddit
threading comes with a performance overhead. what if you try it with max threads = 1 on both? that way, the raw single-core CPU difference is what will show up
Popular-Sand-3185@reddit (OP)
Still ~15-second runtime on the EC2 and 220ms locally with 1 thread
ottawadeveloper@reddit
Weird. Can you do any more specific monitoring and share it? Like CPU load per core, memory usage?
is there a lot of network I/O or disk I/O? An SSD vs an HDD can be a ten-fold difference. Or differences in outgoing network speed.
finding the bottleneck that the EC2 machine is hitting would probably be my first step in solving this - it'll narrow it down to disk, memory, CPU, or network hopefully.
also might be worth checking the Python configuration itself? like PYTHONOPTIMIZE, or whether the GIL is enabled on EC2 (the GIL makes multithreading kinda useless for CPU-bound tasks written in pure Python, but I'd expect to see more similar performance with one thread then).
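If it helps, a quick way to check that last point on both machines (the runtime check only exists on 3.13+, so it's guarded; note Polars does its heavy lifting in Rust and releases the GIL anyway):

```python
import sys
import sysconfig

# Whether this build was compiled with free-threading support at all.
print("free-threaded build:", bool(sysconfig.get_config_var("Py_GIL_DISABLED")))

# Whether the GIL is actually active right now (attribute exists on 3.13+;
# older interpreters always hold the GIL).
print("GIL enabled:", getattr(sys, "_is_gil_enabled", lambda: True)())
```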
thrope@reddit
What architecture is your local machine? Apple silicon or Intel? Apple silicon is much faster for numerical work
Popular-Sand-3185@reddit (OP)
AMD Ryzen 7, I believe, running Fedora
afnanenayet1@reddit
Run the polars query profiler to see which steps are the bottleneck. Which engine are you using? The regular one, streaming? How much memory is available on these machines?
I've found that polars doesn't do as well with really high core counts, in particular because Rayon (which polars uses for its thread pools) isn't NUMA-aware and thrashes CPU caches.
You can use py-spy and perf to get more details on what the bottleneck is.
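For the profiler suggestion, a minimal sketch on a made-up lazy query; profile() executes the plan and also returns a per-node timing table showing where it spends its time:

```python
import polars as pl

# Made-up query: scan, group, aggregate. profile() returns the result
# plus a DataFrame of per-node start/end times.
lf = (
    pl.scan_parquet("data/*.parquet")
    .group_by("key")
    .agg(pl.col("value").sum())
)
result, timings = lf.profile()
print(timings)
```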
timpkmn89@reddit
What's the processor utilization at while running it?
Deto@reddit
Yeah - I'm wondering if maybe it's not actually respecting the 12 core limit they're setting. Looking at top while it's running could reveal that
RedEyed__@reddit
What hardware?
Could be because of SIMD. For instance, the local machine has AVX-512 and the server doesn't.
gdchinacat@reddit
There has been a concerted effort over the past few years to improve python performance. Have you verified you are using the same version of python on both machines? Same version of polars?
Popular-Sand-3185@reddit (OP)
Upgrading the Python version sadly didn't change the performance at all :/
Popular-Sand-3185@reddit (OP)
It looks like I'm running 3.14 on my local machine and 3.12 on the server. Let me update this and see if it fixes things
wxtrails@reddit
Good call. This is an interesting one - I'll be keen to see how it turns out!
Regular_Effect_1307@reddit
!remind me in 2 days
RemindMeBot@reddit
I will be messaging you in 2 days on 2026-05-16 16:07:26 UTC to remind you of this link
carnoworky@reddit
It could be related to this: https://youtu.be/tND-wBBZ8RY?si=PlnNvCgj2iPq-2yL
Without seeing the operations we can't really be sure, but my guess is sharing data across all of those cores. What happens if you set max threads to 1?
There's also this guide I just found that might be useful to you: https://pytutorial.com/polars-multi-threading-performance-tuning/