.pipe() in pandas changed how I write data pipelines
Posted by Economy-Concert-641@reddit | Python | View on Reddit | 45 comments
Been using .pipe() in pandas lately and it's been a game changer — anyone else?
I was writing some data transformation code the other day and stumbled across .pipe(). Honestly didn't expect much, but it completely changed how I structure my pipelines.
Instead of this mess:
df_final = sort_by_total(calculate_total(filter_by_price(df)))
You just write it top to bottom like a recipe:
df_final = (
    df
    .pipe(filter_by_price)
    .pipe(calculate_total)
    .pipe(sort_by_total)
)
Same result, way more readable. Each function takes a DataFrame and returns a DataFrame — that's the only rule.
Full example if you want to try it:
import pandas as pd

df = pd.DataFrame({
    "product": ["Product A", "Product B", "Product C", "Product D"],
    "price": [20, 150, 230, 100],
    "quantity": [10, 5, 3, 8]
})

def filter_by_price(df):
    return df[df["price"] > 100]

def calculate_total(df):
    return df.assign(total_value=df["price"] * df["quantity"])

def sort_by_total(df):
    return df.sort_values("total_value", ascending=False)

df_final = (
    df
    .pipe(filter_by_price)
    .pipe(calculate_total)
    .pipe(sort_by_total)
)
Been using it a lot for ETL and data cleaning workflows. Makes debugging way easier too — just comment out one .pipe() step and you see exactly where things go wrong.
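One detail worth knowing for those workflows: .pipe forwards extra positional and keyword arguments to the function, so steps can be parameterized instead of hard-coding thresholds. A minimal sketch using the same data as above:

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["Product A", "Product B", "Product C", "Product D"],
    "price": [20, 150, 230, 100],
    "quantity": [10, 5, 3, 8],
})

def filter_by_price(df, min_price=100):
    # Keep only rows above the given threshold
    return df[df["price"] > min_price]

# Arguments after the function are forwarded by .pipe
df_cheap = df.pipe(filter_by_price, min_price=50)
df_expensive = df.pipe(filter_by_price, min_price=200)
```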
Anyone else using this regularly? Any patterns you've found useful with it?
FrickinLazerBeams@reddit
As a technical programmer (aerospace, optics, typically single-programmer projects) I default to non-OO most of the time. It always looks funny to me when some new OO construct basically recovers the code you'd have written without using OO in the first place. Like, I could have just written
df_final = filter_by_price(df)
df_final = calculate_total(df_final)
df_final = sort_by_total(df_final)

I mean, I'm not saying they're exactly equivalent (especially as it relates to intermediate values), and I acknowledge that in environments other than mine, OO has benefits that it doesn't (always) have for me. It's just funny how long a journey it's been to wind up essentially back where we started, but with a lot more code to say it.
ChebyshevsBeard@reddit
This is the right way. Easier to read, easier to debug.
DoubleDoube@reddit
Potentially easier to write test cases for
Economy-Concert-641@reddit (OP)
Fair enough! I guess it comes down to personal preference and how you like to visualize the flow. I enjoy the 'clean' look of a single chain, but I can't deny that having explicit intermediate variables makes debugging much easier. It's interesting to see how different backgrounds lead to different coding styles. Thanks for the input!
Ahhhhrg@reddit
This hurts my eyes, I have to say. This is fine if you’re actually using the intermediate values for something (with different names), but if you’re not, and especially just overwriting the same value over and over again, I don’t see the point. If you do this in many places it’s impossible to tell what df_final is supposed to be.
Nater5000@reddit
I prefer this approach since it allows me to inspect the intermediate values easily. Or, more naturally, this is how I actually construct the final data frame as I'm applying these operations. I'm not just throwing everything at it all at once, but rather applying an operation, checking that it worked as intended, applying the next, etc.
Granted, this is what I do in a notebook context and I'll clean things up a bit by the end. But keeping it separated helps with debugging down the line as well.
Golden_Age_Fallacy@reddit
To me, aesthetically, it just looks less “eloquent” to save the same variable from 3 function returns.
To my subjective eyes, the .pipe() chain of dot functions is far simpler and more readable, as the steps all become part of a single operation.
To me, df_final is only ever (accessibly) one thing: the sum of all 3 operations. Whereas in yours there are lines of code where it mutates between states.
FrickinLazerBeams@reddit
Sure, that's what I meant about the intermediate value.
Of course, it's not actually a single operation; it just hides all the steps.
Yeah, but again those intermediate states exist in either case; they're simply not explicit in OP's version. This is an aesthetic difference, but practically it's the same - nobody is going to somehow use those intermediate states by mistake or something.
heartofcoal@reddit
You would love R
Ex-Gen-Wintergreen@reddit
Most of these — filter by price, sort by total — are just simple method calls, so I’d chain them instead of def + pipe. I’d even consider the same for calculate total, but I appreciate that lambdas aren’t always liked. The other two should be chained though.
End0rphinJunkie@reddit
Yeah, standard chaining is definitely better for the simple stuff. I mostly just save pipe for when I need to inject logging or data validation between steps without breaking the chain.
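That injection pattern might look something like this (log_step is a hypothetical helper, not a pandas API):

```python
import pandas as pd

df = pd.DataFrame({"price": [20, 150, 230, 100], "quantity": [10, 5, 3, 8]})

def log_step(df, label):
    # Report the frame's shape, then pass it through unchanged
    print(f"{label}: {df.shape[0]} rows x {df.shape[1]} cols")
    return df

result = (
    df
    .pipe(log_step, "raw")
    .query("price > 100")
    .pipe(log_step, "after filter")
    .assign(total_value=lambda d: d["price"] * d["quantity"])
    .pipe(log_step, "after total")
)
```

Because log_step returns the DataFrame untouched, it can be dropped in or pulled out anywhere in the chain.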
Economy-Concert-641@reddit (OP)
That’s a very fair point! I agree that for simple operations like sorting or basic filtering, native method chaining or df.query() is much cleaner and avoids the overhead of defining a new function. I'm still exploring the best balance between using .pipe() for complex business logic and keeping it simple with built-in methods for the basics. I'll definitely look more into df.query() as well, it seems like a great way to keep the code readable. Thanks for the insight!
4_nsfwy@reddit
df.query is another easy way to get this done
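For reference, the pipeline from the original post can be written with built-in methods only; a sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["Product A", "Product B", "Product C", "Product D"],
    "price": [20, 150, 230, 100],
    "quantity": [10, 5, 3, 8],
})

# Same three steps, no user-defined functions needed
df_final = (
    df
    .query("price > 100")
    .assign(total_value=lambda d: d["price"] * d["quantity"])
    .sort_values("total_value", ascending=False)
)
```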
Ex-Gen-Wintergreen@reddit
Yeah, I’d do that for the filter for sure. No reason to have the sort_values in a function.
Impressive_Job8321@reddit
Could have used lambda, since you’re only using each function exactly once?
RustyTheDed@reddit
It makes 0 difference to performance and with function names you don't have to add comments to know what it's supposed to do.
hai_wim@reddit
Yea, in a topic about readability, lambdas would definitely NOT be the way to go.
Impressive_Job8321@reddit
Readability to me means following the logic exactly where it is used. A named function can be anywhere in a code base. Do you really need two editors side by side, one showing where the function is called and one showing what the function does? If that’s readability to you, then named functions are the way. But for small atomic transformations like in the original post, named functions are for educational purposes only.
RustyTheDed@reddit
Do you really need to see the actual code all the time though?
If a function does what it's name suggests it does, you don't need its contents.
If you do have to see the code, your IDE most likely has "Go to definition" -> "Go back", or even "Peek".
Lambdas are great until you forget what they're supposed to do. Then on top of figuring out the high level stuff, you need to figure out and then remember what the individual lambdas do. Even if they're small, it adds up.
Maybe for you working on your code it's not a problem, but when working in a team it just leads to tribal knowledge and bugs.
ePaint@reddit
I get your point but lambdas are a code smell. Maybe if Python had better syntax for them, like JS does, sure. As it is right now, they're really hard to read at a glance.
Impressive_Job8321@reddit
Look at the original post for the problems at hand and the code that should be lambdafied. Nobody is asking you to implement quicksort in a lambda.
JonathanMovement@reddit
so I’m not the only one having a hard time reading lambdas even if I have the syntax right in my face 😭
ePaint@reddit
They're banned in my company's codebase for a reason, same as recursion
JonathanMovement@reddit
holy shit that’s crazy, I’d love to work in your company 😃
lottspot@reddit
This has nothing to do with lambda vs not lambda. A named function can be declared just as locally as a lambda, and like others have pointed out, has the readability benefit of actually having a name.
You truly need lambdas for very few reasons, and "locality of logic" is not one of them.
Economy-Concert-641@reddit (OP)
That’s a fair point! Lambda works perfectly here since each function is used once. I went with named functions intentionally to make each step self-explanatory — especially for anyone learning the pattern for the first time. But yeah, for quick one-off transformations, lambda inline is cleaner. Good call!
37b@reddit
I upvoted you just because it’s nice to see rational friendly discussion instead of the usual acerbic back-and-forth
JambaJuiceIsAverage@reddit
Dude you're talking to a robot
37b@reddit
Oh God, I am an idiot.
marr75@reddit
Some day, when we're all conscripts in the war against AI drones, I'll shout something like, "Don't fire until you see the em-dashes of their eyes!" and think of simpler days when 37b was having a nice discussion about lambdas with one of them.
JambaJuiceIsAverage@reddit
Lol well you aren't technically wrong. Also idk maybe OP is just using it for translation and I'm the ass.
lolcrunchy@reddit
An experimental alternative that is agnostic to third party libraries:
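The snippet itself didn't survive, but one plain-Python way to get a library-agnostic pipe (a sketch; the helper name is an assumption) uses functools.reduce to thread a value through a sequence of functions:

```python
from functools import reduce

def pipe(value, *funcs):
    # Apply each function to the running result, left to right
    return reduce(lambda acc, f: f(acc), funcs, value)

# Works on any value, not just DataFrames
result = pipe(
    [4, 1, 3, 2],
    sorted,
    lambda xs: [x * 10 for x in xs],
)
```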
adam-kortis-dg-data@reddit
I have never used .pipe() before, but I had a similar experience when I finally discovered .shift(). (I guess I should read the documentation more.)
I think the one thing I am confused about is why are the functions needed here for single lines of code that are hardcoded?
Wouldn't this work the same:
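(The inline snippet appears to have been lost; presumably it was something like the following, reconstructed from the steps in the original post:)

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["Product A", "Product B", "Product C", "Product D"],
    "price": [20, 150, 230, 100],
    "quantity": [10, 5, 3, 8],
})

# The three steps written out directly, no helper functions
df_final = df[df["price"] > 100]
df_final = df_final.assign(total_value=df_final["price"] * df_final["quantity"])
df_final = df_final.sort_values("total_value", ascending=False)
```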
You can still see line by line what is happening and you don't have to trace back to functions located somewhere else. I know this is an example, but are the ETL functions more complex that are stored in a separate file/module?
I know aesthetics are subjective, so I won't argue that debate on what people prefer. I can say I prefer just writing the three lines instead of searching through functions to figure out what they do (unless a function is truly required). If you have issues you can still comment out a single line of code and debug.
To me, .pipe() might be more useful if it were passing in arguments that could change, or if you were modifying multiple parts of the dataframe. I am thinking of cases where all these steps were in one function, or where parameters needed to be passed into the function.
Otherwise, I like to use apply if I am applying a function to a single column. For example, I don't like the way pandas calculates years between today's date and a previous date, so I created my own and use .apply(), mainly for calculating ages.
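The commenter's own age function isn't shown; a hypothetical sketch of the idea (function name and reference date are made up) might look like:

```python
import datetime as dt
import pandas as pd

def years_between(born, today=dt.date(2024, 6, 1)):
    # Whole years elapsed, subtracting one if the birthday hasn't occurred yet
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

df = pd.DataFrame({"dob": [dt.date(1990, 7, 15), dt.date(2000, 1, 2)]})
df["age"] = df["dob"].apply(years_between)
```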
I don't know if there are performance benefits to using .pipe() over other ways (memory or speed wise)? So, if anyone can shed some light on that, it would be great.
manecamaneco@reddit
How cool, it feels like the R language
fasnoosh@reddit
Yep, I thought the same thing. Might be more elegant to define a pipe operator instead of calling .pipe() every time
Like this:
df_final = (
df |> filter_by_price |> calculate_total |> sort_by_total
)
But honestly, SQL is better, more declarative, and likely more efficient. If you’re able to do this type of thing in-database / in your data platform, it’s usually better. And easier to maintain.
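Python has no built-in |> operator, but a rough equivalent of the idea above can be sketched by overloading the reflected-or hook on a small wrapper (the names here are assumptions, not a real library):

```python
class Pipeable:
    """Wrap a function so `value | wrapped` applies it to the value."""

    def __init__(self, func):
        self.func = func

    def __ror__(self, value):
        # int/DataFrame | Pipeable falls back here, calling the function
        return self.func(value)

double = Pipeable(lambda x: x * 2)
increment = Pipeable(lambda x: x + 1)

result = 5 | double | increment  # (5 * 2) + 1
```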
SearchAtlantis@reddit
Only because these are simple columnar operations. The problem with SQL is re-usability and test-ability.
marr75@reddit
The general term is a fluent interface. It's mildly degenerative to use pipe to call a single method of dataframe, though (just wraps and unwraps the method without improving readability).
It's a nice way to see the order of named operations at a glance but there are other ways to do it and it's not worth other sacrifices (keeping external data around for long operations, hiding dependencies, etc) just to force the pipe pattern to work.
Salfiiii@reddit
Could you elaborate a little more why you consider it „degenerative“ and how you would solve „keeping external data around for longer and hiding dependencies“?
marr75@reddit
*mildly degenerative. In the very simple case OP used, you can already call .sort_values and .assign fluently, inline, without introducing the function and calling it in the pipe method. So you're adding 2 calls to the stack, a function definition, and a pipe call to get... symmetry?
Keeping external data around: I wouldn't force long chained pipe calls to the exclusion of other concerns. If I needed to load some data as a complement to the main frame, I would do so tightly scoped and not let fluent aesthetics get in the way. If I had hidden dependencies in my functions I would promote them to input variables, which would in many cases stop me using pipe 🤷 (I could fix that with partial application, but I wouldn't pursue it just to keep piping). I'm not accusing OP of these specifically of course - they are just pitfalls of getting tied to specific aesthetics over functional qualities.
Salfiiii@reddit
Thanks, sounds reasonable.
I like the pipe approach as well but agree with the caveats mentioned and others.
It's good to have a „default" approach, but it should always be possible to go other ways if needed.
It's especially useful for testing when most reusable functions have the same parameters.
marr75@reddit
Frankly, if you do this stuff as the main point of a lot of projects, you should just pick a pipelining framework with pandas support rather than trying to have pandas manage the pipeline. There are some good ones that flexibly support DAGs with SQL, pandas, and polars compatibility built in.
Salfiiii@reddit
Could you give me some names?
We looked at Kedro for example, but all those projects have the same problem for me: they make too many assumptions and force you to do everything „their way". They abstract too much and are not flexible enough in the end.
The Python code is run via Airflow and k8s; we have the DAGs for some kind of abstraction, but one task usually consists of more than one transformation before data needs/should be persisted.
Reasonable-Ladder300@reddit
I honestly stopped using pandas a while back and mostly use duckdb and polars on occasion. I find that writing sql on datasets feels 10x more intuitive than using pandas.
likethevegetable@reddit
Polars all day
gsilbr@reddit
import polars as pd