Polars code runs slower on 128-core EC2
Posted by Popular-Sand-3185@reddit | Python | View on Reddit | 61 comments
Disclaimer: I am not sure this post is appropriate for r/LearnPython since it's not a question of "how to do something in Python"; rather, I am looking for a lower-level discussion of why my Python application performs poorly on a significantly more powerful server. Hence I'm posting it here.
The problem:
I have a relatively complex data pipeline that is written in Polars. On my local machine with 12 cores, the pipeline finishes in about 1200ms. On my 128-core EC2, it takes 13000ms to complete. I have tried setting the POLARS_MAX_THREADS parameter to 12 on the EC2, and it's still slower.
I am using a TMPFS partition on both machines to read the data into the pipeline directly from RAM. Both my machine and the EC2 have DDR5 RAM so I think they should be comparable.
Anyone have any ideas why the pipeline would run much slower on the EC2?
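For reference, a minimal sketch of the thread-cap setup, assuming the environment-variable route (POLARS_MAX_THREADS is only read when Polars is first imported, so it has to be set beforehand):

```python
import os

# Must be set before polars is imported; the thread pool size is fixed
# at import time and cannot be changed afterwards.
os.environ["POLARS_MAX_THREADS"] = "12"

import polars as pl

# Confirm the cap actually took effect (older releases call this threadpool_size()).
print(pl.thread_pool_size())
```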
john_crimson81@reddit
numa. almost definitely. pin to a single node with numactl first, see if you reproduce local performance. if yes you have your answer. if no it's something else but i'd start there
Henry_old@reddit
you are hitting the wall of Amdahl's law and memory bandwidth. throwing 128 cores at a single dataframe operation is just creating massive overhead for thread synchronization and cache coherence traffic. polars is fast because of vectorization and efficient memory layout not because it can magically scale linearly to a supercomputer. 128 cores is for distributed workloads not for a single in-memory table operation bottlenecked by ram throughput.
s4nc3s@reddit
what I have success with, in a CPU-bound process, is to split the inputs across as many CPUs as are available (os.cpu_count()) and then use concurrent.futures to spread the work across whatever is available, then combine it all at the end… and yes, this is with polars LazyFrames and pyarrow/parquet data.
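A rough sketch of that pattern, assuming Parquet inputs and one file per task (the paths and helper names are illustrative, not the commenter's actual code):

```python
import os
from concurrent.futures import ProcessPoolExecutor

import polars as pl

def process_chunk(path: str) -> pl.DataFrame:
    # Each worker lazily scans one file and collects it independently.
    return pl.scan_parquet(path).collect()

def run(paths: list[str]) -> pl.DataFrame:
    # Fan the files out across the available CPUs, then combine at the end.
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        frames = list(pool.map(process_chunk, paths))
    return pl.concat(frames)
```

On a 128-core box you would typically also cap POLARS_MAX_THREADS per worker so each process doesn't spin up its own full-size thread pool.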
GrouchyManner5949@reddit
NUMA is almost certainly the culprit. a c8i.32xlarge has multiple NUMA nodes and polars distributes threads across all of them by default. the moment dataframe memory spans nodes, every cross-node access adds latency and you pay it constantly. your laptop wins because everything lives in one memory domain. POLARS_MAX_THREADS=12 doesn't fix it because the threads still get scheduled across nodes. numactl with both --cpunodebind and --membind set to the same node id is the real fix. also the maintainer's point about the streaming engine is worth doing separately. I've been tracking benchmark configs as tasks in Zencoder when debugging environment-specific perf issues like this, makes it way easier to not re-test hypotheses you already ruled out.
Popular-Sand-3185@reddit (OP)
Alright so I did figure out why the pipeline was taking so long. Essentially, the code was reading 128 separate files and then concatenating them as part of the pipeline. This, as you would expect, took ~10x longer on the 128-core EC2 than on my 12-core workstation. I fixed it by concatenating all the files into one before loading it into polars instead of reading 128 files separately. Particularly, I used the following function:
engineerofsoftware@reddit
This is embarrassing. This is why engineers should learn how low-level systems work. How you did not identify this as the problem immediately is beyond me. You make Python engineers look bad.
Popular-Sand-3185@reddit (OP)
They must love you at your job
engineerofsoftware@reddit
Fortunate enough not to work with idiots.
Popular-Sand-3185@reddit (OP)
Ok boss meet me in the parking lot lol
artofthenunchaku@reddit
Calling a subprocess to do text processing via bash script rather than just reading and writing the files in Python is an insane solution.
max123246@reddit
In particular, the reason is that you pay a process-launch startup cost for each command in that string. head, echo, and xargs mean 3 process launches that were completely unnecessary.
A process launch may make sense in Python when the large majority of the work is computation; since Python is interpreted, it may very well be faster to call a program written in a compiled language.
I'd be very surprised if the current solution is faster than doing it in Python with its built-in standard-library file operations, though.
artofthenunchaku@reddit
in fact there are 4 process launches, since it's executing the command through a shell
Popular-Sand-3185@reddit (OP)
Ok so how would you do it faster in Python? We're talking gigabytes of data here. The overhead of launching a subprocess is minuscule when you get the benefits of using the coreutils, which are already optimized for this task.
artofthenunchaku@reddit
You're optimizing for the wrong thing. If all you're doing is reading and writing a file, the bottleneck isn't the CPU, it's the disk I/O. There's going to be very little difference in performance between Python and C/C++ in that case.
Popular-Sand-3185@reddit (OP)
As mentioned in the OP the files are being read from a ramdisk so it's not disk IO. I found the bottleneck was calling pl.concat on 128 separate dataframes, causing a CPU bottleneck. In my solution I just concatenate all the text, and then load it into a single dataframe.
I would think that if I optimized for the wrong thing, the solution wouldn't have made the script run 10 seconds faster
SuspiciousScript@reddit
os.sendfile is a better solution here since it lets you skip userspace entirely.
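A minimal sketch of that approach, assuming Linux and regular files (the helper name is made up): os.sendfile copies between file descriptors inside the kernel, so the bytes never pass through a Python-level buffer.

```python
import os

def concat_files(sources: list[str], dest: str) -> None:
    with open(dest, "wb") as out:
        for src in sources:
            with open(src, "rb") as f:
                remaining = os.fstat(f.fileno()).st_size
                offset = 0
                # Kernel-side copy: data moves fd-to-fd without entering userspace.
                while remaining > 0:
                    sent = os.sendfile(out.fileno(), f.fileno(), offset, remaining)
                    if sent == 0:
                        break
                    offset += sent
                    remaining -= sent
```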
venterce@reddit
Thank you for posting the update, it's great to see the actual issue.
commandlineluser@reddit
If you missed it, the comment asking if you can provide code is from the original author of Polars:
https://reddit.com/r/Python/comments/1td1790/polars_code_runs_slower_on_128core_ec2/ols2829
They will be able to provide specialized help, and are likely interested in finding out if this is a case where Polars can perhaps "do better".
ritchie46@reddit
Can you share the code you are running?
Popular-Sand-3185@reddit (OP)
Hey thanks for checking this out. So I was able to get some significant performance improvements, but it's still a bit slower on the large EC2 than on my workstation:
ritchie46@reddit
There is a lot happening here that's not Polars. I saw there were many subprocesses being spawned, which is certainly not cheap. What kind of files are you reading?
Best would be to let Polars handle the concurrency and parallelism.
I cannot see how you execute/collect the LazyFrame, but if you are on 128 cores you should definitely use the streaming engine here.
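As a rough illustration of that last point (the query here is made up), newer Polars versions select the streaming engine at collect time, while older ones used collect(streaming=True):

```python
import polars as pl

# Made-up lazy pipeline; the scan stays lazy until collect().
lf = pl.scan_parquet("data/*.parquet").filter(pl.col("value") > 0)

# Run it on the streaming engine instead of materialising everything at once
# (older polars releases used lf.collect(streaming=True) instead).
df = lf.collect(engine="streaming")
```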
321159@reddit
Just so you are aware: this is the maintainer of Polars
Free-Cheek-9440@reddit
Also worth checking if Polars is parallelizing too aggressively for the size of the dataset. Sometimes the coordination overhead becomes more expensive than the actual compute. I’ve had cases where limiting threads actually improved runtime because the tasks were too small to shard efficiently.
poopoutmybuttk@reddit
Are you using the streaming engine?
Popular-Sand-3185@reddit (OP)
yup
KandevDev@reddit
polars defaults assume cache-friendly working sets. 128-core boxes have NUMA, which means the moment your dataframe spans multiple memory nodes, you pay an enormous latency penalty per cross-node access. the laptop "wins" because everything fits in one memory domain. set POLARS_MAX_THREADS to 16 or pin to one socket, see if that recovers perf.
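If you want to try the pinning side from Python alone, a sketch assuming node 0 owns cores 0-15 (check with numactl --hardware or lscpu; Linux only):

```python
import os

# Restrict the process (and the polars threads it spawns later) to one
# node's cores. Unlike numactl --membind, this does not bind memory
# allocations, so it is only a partial fix for NUMA effects.
os.sched_setaffinity(0, set(range(16)))

os.environ["POLARS_MAX_THREADS"] = "16"
import polars as pl
```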
sirfz@reddit
Yep, run your script with numactl and pin it to close by cores, something like:
numactl --membind=0 --physcpubind=0-11 python script.py
This'll bind your process's memory to NUMA node 0 and its threads to CPUs 0 to 11 inclusive (12 cores). That's usually the structure, but you can run numactl --hardware to check
KandevDev@reddit
numactl is the right call. one caveat: --cpunodebind is one flag, --membind is another, and they need to be set to the same node id to avoid cross-node memory access. the "12 cores per node" assumption holds on most EC2 c-series but check numactl --hardware first since some instance types have unusual topologies (especially the bare-metal ones).
Popular-Sand-3185@reddit (OP)
Interesting, I will try this!
Cynyr36@reddit
Possibly a disk io speed issue in here as well. Locally it's likely a gen4x4 or faster nvme. Your cloud instance could be much slower.
Popular-Sand-3185@reddit (OP)
I would not think disk IO is the issue here, everything is running on a ramdisk
meatmick@reddit
Any chance there's a CPU cache effect here? You said it's a Ryzen 7; is it an X3D variant? That could impact the performance.
Lba5s@reddit
Are the cores on your machine faster? Or your code/data might not be large enough to deal with the overhead of distributing over 128 cores
Popular-Sand-3185@reddit (OP)
They are faster (5700MHz on mine vs 3900MHz on the EC2). Would that cause it to be off by a factor of 10? I did try reducing the core count on the EC2 with POLARS_MAX_THREADS=12.
axonxorz@reddit
Which EC2 SKU are you running this on?
Popular-Sand-3185@reddit (OP)
c8i.32xlarge. It's a shared instance, right now I'm trying to get my service quota increased so I can try a dedicated instance
axonxorz@reddit
Shared instance means my first-line assumptions are that you're having CPU time stolen and memory access overhead.
Run your workload and run top at the same time. The st field represents CPU time 'stolen' from your VM (against the theoretical 100%) for other VMs on the shared host.
That instance type runs a NUMA architecture, so memory access latencies will be inconsistent between cores. This alone is a source of great overhead compared to your local CPU, and unfortunately not one you can directly affect when running in a VM. You can at least see some stats using numastat.
I think your biggest boon will be getting that dedicated host. Once you have that, numactl can be used to ensure your workload is constrained to appropriate NUMA nodes.
One other point of comparison would be to run the workload on your local workstation in a VM to see what kind of overhead a simple VM layer brings.
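A small way to check steal from inside the instance, reading /proc/stat directly (the eighth value on the aggregate cpu line is steal time in clock ticks; the helper name is made up):

```python
import time

def steal_fraction(interval: float = 1.0) -> float:
    # /proc/stat "cpu" line: user nice system idle iowait irq softirq steal ...
    def sample() -> list[int]:
        with open("/proc/stat") as f:
            return [int(x) for x in f.readline().split()[1:]]

    before = sample()
    time.sleep(interval)
    after = sample()
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    return deltas[7] / total if total else 0.0

print(f"steal: {steal_fraction():.1%}")
```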
beragis@reddit
It might be the file system that was mounted. I saw similar performance differences, but not as extreme as yours, when I was testing a Python program running on my local laptop, an EC2 instance, and a Kubernetes pod running in AWS.
Using my local PC as a baseline, the same process with the same number of threads on EC2 ran at about 80% of the speed of my PC. Kubernetes was about 15 percent faster. The only thing I could think of was that Kubernetes was running on a dedicated instance. The CPU specs on the EC2 and Kubernetes were about the same as the PC. It ended up being the file system on the EC2 instance.
Popular-Sand-3185@reddit (OP)
The filesystem I'm using is mounted to a tmpfs partition both locally and on the EC2, so everything is being read directly from RAM
Not sure I'm following. Are you saying my EC2 might be backed by S3 somehow?
beragis@reddit
Not necessarily. I have seen S3 buckets mounted on Linux, which is what we use both to upload files to process and to store the results, which are then read into a Snowflake database.
The program itself pulls the files from S3 to temp. It was once the file was in temp that we ran the comparisons. The underlying EBS on Kubernetes had faster throughput than our EC2 instance, something like 4 to 8 times as fast.
Since you are using tmpfs, you are still effectively on EBS: tmpfs uses virtual memory, which will swap to disk if it needs to.
Since it's a shared instance, it is highly likely it's swapping memory to disk and back to memory.
thuiop1@reddit
Frequency is not the only factor for the speed of a CPU...
Deto@reddit
Yeah, but regardless it's unlikely that their single-core performance is going to be 10x different
Popular-Sand-3185@reddit (OP)
This response would probably be more helpful if you explained what those other factors are, how I can check them, and whether there's any course of action I can take to improve it
thuiop1@reddit
Giving the exact CPU models would help; it is already hard to diagnose remotely.
artofthenunchaku@reddit
"Throw more threads at it" is rarely the solution to performance improvements, there's many factors that come into place; cache size, lock contention, speculative execution. More I'm forgetting or not personally familiar with.
I'd suggest running multiple multiple benchmarks at different thread counts (1, 2, 4, 8, ..,) to understand where you see the biggest improvements or regressions.
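A throwaway sketch of that kind of sweep, assuming the pipeline lives in a script called pipeline.py (the name is hypothetical); each run gets a fresh interpreter because POLARS_MAX_THREADS is only read at import time:

```python
import os
import subprocess
import time

# Re-run the pipeline with different thread caps and time each run.
for threads in (1, 2, 4, 8, 16, 32, 64, 128):
    env = {**os.environ, "POLARS_MAX_THREADS": str(threads)}
    start = time.perf_counter()
    subprocess.run(["python", "pipeline.py"], env=env, check=True)
    print(f"{threads:>3} threads: {time.perf_counter() - start:.2f}s")
```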
ottawadeveloper@reddit
threading comes with a performance overhead. what if you try it with max threads = 1 on both? that way, the raw single-core CPU difference is what will show up
Popular-Sand-3185@reddit (OP)
Still ~15-second runtime on the EC2 and 220ms locally with 1 thread
ottawadeveloper@reddit
Weird. Can you do any more specific monitoring and share it? Like CPU load per core, memory usage?
is there a lot of network I/O or disk I/O? An SSD vs an HDD can be a ten-fold difference. Or differences in outgoing network speed.
finding the bottleneck that the EC2 machine is hitting would probably be my first step in solving this - it'll narrow it down to disk, memory, CPU, or network hopefully.
also might be worth checking the Python configuration itself? like PYTHONOPTIMIZE, or whether the GIL is enabled on EC2 (the GIL makes multithreading kinda useless for CPU-bound tasks written in pure Python, but I'd expect to see more similar performance with one thread then).
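If it helps, a quick way to check that last point on both machines (the runtime check only exists on 3.13+, so it's guarded; note Polars does its heavy lifting in Rust and releases the GIL anyway):

```python
import sys
import sysconfig

# Whether this build was compiled with free-threading support at all.
print("free-threaded build:", bool(sysconfig.get_config_var("Py_GIL_DISABLED")))

# Whether the GIL is actually active right now (attribute exists on 3.13+;
# older interpreters always hold the GIL).
print("GIL enabled:", getattr(sys, "_is_gil_enabled", lambda: True)())
```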
thrope@reddit
What architecture is your local machine? Apple silicon or Intel? Apple silicon is much faster for numerical work
Popular-Sand-3185@reddit (OP)
AMD Ryzen 7, I believe, running Fedora
afnanenayet1@reddit
Run the polars query profiler to see which steps are the bottleneck. Which engine are you using? The regular one, streaming? How much memory is available on these machines?
I've found that polars doesn't do as well with really high core counts, in particular because Rayon (which polars uses for its thread pools) isn't NUMA-aware and thrashes CPU caches.
You can use py-spy and perf to get more details on what the bottleneck is.
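For the profiler suggestion, a minimal sketch on a made-up lazy query; profile() executes the plan and also returns a per-node timing table showing where it spends its time:

```python
import polars as pl

# Made-up query: scan, group, aggregate. profile() returns the result
# plus a DataFrame of per-node start/end times.
lf = (
    pl.scan_parquet("data/*.parquet")
    .group_by("key")
    .agg(pl.col("value").sum())
)
result, timings = lf.profile()
print(timings)
```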
timpkmn89@reddit
What's the processor utilization at while running it?
Deto@reddit
Yeah - I'm wondering if maybe it's not actually respecting the 12 core limit they're setting. Looking at top while it's running could reveal that
RedEyed__@reddit
What hardware?
Could be because of SIMD. For instance, the local machine has AVX-512 and the server doesn't.
gdchinacat@reddit
There has been a concerted effort over the past few years to improve python performance. Have you verified you are using the same version of python on both machines? Same version of polars?
Popular-Sand-3185@reddit (OP)
Upgrading the Python version sadly didn't change the performance at all :/
Popular-Sand-3185@reddit (OP)
It looks like I'm running 3.14 on my local machine and 3.12 on the server. Let me update this and see if it fixes things
wxtrails@reddit
Good call. This is an interesting one - I'll be keen to see how it turns out!
Regular_Effect_1307@reddit
!remind me in 2 days
RemindMeBot@reddit
I will be messaging you in 2 days on 2026-05-16 16:07:26 UTC to remind you of this link
carnoworky@reddit
It could be related to this: https://youtu.be/tND-wBBZ8RY?si=PlnNvCgj2iPq-2yL
Without seeing the operations we can't really be sure, but my guess is sharing data across all of those cores. What happens if you set max threads to 1?
There's also this guide I just found that might be useful to you: https://pytutorial.com/polars-multi-threading-performance-tuning/