Would a Pandas-compatible API powered by Polars be useful?
Posted by try-except-finally@reddit | Python | View on Reddit | 75 comments
Hello, I don't know if this already exists, but I believe it would be great if there were a library that gives you the same API as pandas but uses Polars under the hood where possible.
I saw how powerful Polars is, but data scientists still use pandas a lot, and it's difficult to change habits. What do you think?
andy4015@reddit
Narwhals might be of interest to you
https://github.com/narwhals-dev/narwhals
SneekyRussian@reddit
How does this compare to Ibis?
britishbanana@reddit
It's meant as a compatibility layer, not something that your average analyst / dataframe user would use. It's meant to be used by libraries whose APIs accept pandas dataframes, so the library can take advantage of polars while maintaining backwards compatibility with its pandas API. It also gives a more stable dataframe API, so library maintainers are less likely to have to update their usage of polars as the API changes - they just write against the narwhals API.
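The pattern, roughly (a minimal sketch with made-up names, not code from any particular library):

```python
import narwhals as nw

def double_a(df_native):
    # accepts pandas, Polars, etc., and returns the same type it was given
    df = nw.from_native(df_native)
    df = df.with_columns((nw.col("a") * 2).alias("a_doubled"))
    return df.to_native()
```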
SneekyRussian@reddit
Thank you. Hopefully changes to the polars api will slow down now that they are past version 1. Would love to see something like Dask get support for the polars api. Pandas api is just painful.
britishbanana@reddit
> Pandas api is just painful.
it's so bad, I've never understood how some people claim to love it... masochists who don't know any better
aexia@reddit
It's a pretty tremendous improvement if you're coming from R.
SneekyRussian@reddit
Just what you’re used to I guess. The people who made it probably like it lol
marcogorelli@reddit
Dask is supported in Narwhals, at least to the point that we're able to execute all 22 TPC-H queries in Narwhals with the Dask backend
So if you want to write Polars syntax and have Dask as the engine, you might be interested in looking into Narwhals (especially now that Ibis have dropped Dask as an engine) https://github.com/narwhals-dev/narwhals
tutuca_@reddit
Seems to be able to use ibis as backend too. Interesting little library.
BaggiPonte@reddit
it's also becoming quite popular. lots of projects are adopting it. altair has it. there's work in progress for plotly as well as nixtla. and I'm surely forgetting some!
ritchie46@reddit
Narwhals uses a subset of the Polars API. It would not help OP write pandas code.
marcogorelli@reddit
Agree, but I'll take the free publicity
jjolla888@reddit
how does this answer OP? narwhals doesn't translate pandas to anything
anentropic@reddit
I have a big chunk of complicated pandas code with negligible test coverage that I would love to convert to polars, if anyone knows of such a library
Failing that, can anyone share experience of switching to modin to get multicore?
Shakakai@reddit
Have you tried using cursor with Claude to do this conversion for you? I bet it would do a solid job.
anentropic@reddit
I would really love a well-tested library explicitly designed to have the same API and behaviour though
An LLM can help a lot with rewriting the code, but I don't have much confidence there aren't subtle differences that would go unnoticed
anentropic@reddit
Guess what showed up in my news feed today...?!
https://hwisnu.bearblog.dev/fireducks-pandas-but-100x-faster/
BidWestern1056@reddit
the multicore should work fine out of the box as long as you're not passing class objects in apply procedures, since it has difficulty serializing them or whatever
Ok_Raspberry5383@reddit
"Came for the speed, stayed for the syntax" is what most Polars converts say. In short, no
arden13@reddit
What do you like most about the syntax? I would have a hard time giving up multi indexing
marsupiq@reddit
Polars doesn’t have indexes, in particular no multiindexes… But can you tell me one case where you actually need it? When you need to access a single element, just apply .filter() and access the row. But for chains of transformations, I’ve always felt like pandas indexes are a mess, where you just end up resetting and setting the index a thousand times… chances are you can formulate your operations much more concisely in polars using window functions.
If you give me a pandas example, I would be happy to think about a polars solution.
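To give a flavour of what I mean by window functions (a sketch with made-up column names):

```python
import polars as pl

# each town's deviation from its county average, no (multi)index juggling
df = df.with_columns(
    (pl.col("temp") - pl.col("temp").mean().over(["state", "county"]))
    .alias("temp_vs_county_avg")
)
```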
B-r-e-t-brit@reddit
Let's say you have 2 dataframes with min and max temps by state, county and town
With multiindexes you can get the avg temp like this:
(mindf + maxdf) / 2
In polars it would look like this (written on mobile, so treat as pseudo code):

    (
        mindf
        .join(maxdf, on=["state", "county", "town"])
        .with_columns(((pl.col("tmp") + pl.col("tmp_right")) / 2).alias("tmp"))
        .select(["state", "county", "town", "tmp"])
    )
arden13@reddit
I work with a lot of scientific data so for me it's handy to have the multi index. I'm working with a dataset now from an instrument that always outputs a consistent data file of 24 capillaries. Sheet 1 contains sample metadata (name, etc) while sheet 2 contains 24 sets of 3 columns each concatenated horizontally.
Use case 1: parsing awfully structured data from the instrument.
For me it's easier to parse sheet 2 information by taking the columns as a 2-layer multi index and then melt it down. Afterwards I can join to the first sheet on capillary to scrape any of the metadata I need.
Use case 2: accessing data with a known name
With the above dataset I may want to access based on a known filename + capillary number. I can do that with the multiindex.
robotoast@reddit
I noticed you melt your data in use case 1. For use case 2, have you thought about sticking with melted data there as well?
You can achieve the same lookup functionality by filtering rows using columns like Filename, Capillary, and Measurement. This keeps things explicit and avoids the extra step of managing indexes. I think Polars would be a great fit for this and might feel intuitive once you get the hang of its syntax.
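Something like this, roughly (hypothetical column names):

```python
import polars as pl

# look up one capillary's data in the melted table
row = df.filter(
    (pl.col("filename") == "run_042.xlsx") & (pl.col("capillary") == 7)
)
```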
arden13@reddit
For later modeling the data must be re-pivoted. Otherwise the melted data is fine to work with
rosecurry@reddit
You don't like chaining three or four .reset_index(drop=True) calls in your transformations?
PurepointDog@reddit
Multi-indexing comes in handy for a very, very small subset of problems (namely, generating dense tables for scientific reports).
I have never otherwise come across a problem that I couldn't solve using regular table semantics exactly how I wanted
unixtreme@reddit
Sadly, for my use cases no index is a big no-no. But I should give polars a chance next time I spin up a weekend pet project.
PurepointDog@reddit
You just pick the column to act as the index though? It's not that there's "no" index, it's that every column is an index
arden13@reddit
I do indeed work in a scientific space. I'd argue it's also handy for computation using groupby functions, but polars has to have that, right?
PurepointDog@reddit
Yes, obviously polars has group-by
jjrreett@reddit
There are less foot guns and ambiguity. The api is simpler and more well thought out.
Years of bloat vs a fresh clean interface
pythosynthesis@reddit
This is the kind of argument Esperanto advocates were making.
Not advocating strongly either way, if polars is truly superior, not just as speed, it will emerge dominant. But arguments based on "new, clean and shiny approach to [insert your favorite problematic issue]" are a coin flip at best.
jjrreett@reddit
i haven't been using it long enough to have very strong technical arguments. but it's got the vibe. There are a few small hurdles you have to jump, but after that it's great. No more bugs about selecting axis 1, or .loc vs bare getitem. It's very declarative and fast.
yrubooingmeimryte@reddit
fewer*
Ok_Raspberry5383@reddit
I hate the way that changing the index changes the outcome of operations. If I write a function that accepts a `df`, I need to know the index, otherwise my function is nondeterministic. This can't be communicated through typing and requires me knowing the columns in the `df`, which is not ideal if I want my func to be very generic. I'd argue it's plain un-pythonic.
nraw@reddit
The syntax of polars? I feel like it's heavily verbose to do the most basic of stuff or am I doing something wrong? All the with_columns and pl.col feel much more verbose than just the pandas assignments
maltedcoffee@reddit
Generally all the with_columns, filter and select contexts can be combined into single blocks:
    df.with_columns(
        first=col('c1').str.slice(0, 1),
        last=col('c1').str.slice(-1),
    )
I find this pretty readable and it helps organize my code blocks.
As for pl.col, I do a "from polars import col, lit" at the top to make things just a little less verbose.
nraw@reddit
Would you know if there's a list of best practices with polars?
I'm not sure I'm a fan of stacking more commands into a single line. Debugging sounds messy that way.
maltedcoffee@reddit
Are you talking about method chaining? See for example this video, which shows a couple of examples where it can make code (subjectively) more readable than a long series of 'df = df.foo()' lines. For eager computations it may also be more performant than assigning back to the variable after each computation, but that shouldn't matter in Lazyspace.
In my experience I find method chaining has made my code more readable and that I 'vibe' with it better, but consider it more of a style choice. The comments in the video point out some drawbacks such as how logging intermediate results is more difficult. Some people aren't fond of method chaining and that's okay.
For a more general treatise, I cut my teeth on Modern Polars which I think is a great "10 minutes to polars" tutorial, but it's opinionated to the point of being off-putting, and considers method chaining to be self-obviously superior, which imo it ain't.
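To make the two styles concrete (a toy sketch, assuming the bare `from polars import col` trick from earlier):

```python
from polars import col

# reassign-every-step style
df = df.filter(col("x") > 0)
df = df.with_columns(y=col("x").log())
df = df.sort("y")

# method-chaining style: one expression, read top to bottom
df = (
    df.filter(col("x") > 0)
      .with_columns(y=col("x").log())
      .sort("y")
)
```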
nraw@reddit
Thanks for the detailed answer! I'll take a look at the resources tomorrow.
I feel like method chaining brings me back to my R era, back when my software engineering practices were way lower compared to what they are now. I had these pretty chains that would be very readable but when something went wrong it was quite the surgery procedure to understand what and where was off.
PurepointDog@reddit
You ever rename a pandas dataframe and miss one of the references, and then all hell breaks loose as you assign a misaligned column from one dataframe to another?
It's an insane problem that only pandas has.
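A minimal sketch of the footgun, for anyone who hasn't hit it (pandas silently aligns on index labels, not row position):

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2, 3]})                      # index 0, 1, 2
b = pd.DataFrame({"y": [10, 20, 30]}, index=[1, 2, 3])  # index 1, 2, 3

a["y"] = b["y"]  # aligns on index labels: row 0 silently becomes NaN
print(a)
#    x     y
# 0  1   NaN
# 1  2  10.0
# 2  3  20.0
```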
ColdPorridge@reddit
Pandas was a great step forward for data science in Python but it’s the past, not the future.
Woah-Dawg@reddit
Think this is a bit of an exaggeration. Pandas is a mature product
Verochio@reddit
Had to bug-fix some legacy pandas code this week. I’ve been a polars convert for so long it was horribly jarring going back. What do you mean I have to specify “axis=1”?! Why is “reset_index” in pretty much every step? 🤮
Valuable-Benefit-524@reddit
Isn't the point of polars that it doesn't have an inconsistent, pandas-like API?
marr75@reddit
Maybe prior to GitHub Copilot et al. Most conversions are pretty trivial today, and tests (or manual inspections if you don't have tests) can handle the rest.
There's also "come as you are" libraries like Ibis that support just about any backend you might want and let you drop-in/drop-out of pandas, polars, SQL, etc. as you feel like it.
trial_and_err@reddit
I second Ibis. If you know SQL well, you know Ibis. Ibis basically serves as a SQL builder providing a nice Python API. And SQL has already solved the problem of how to do complex aggregations with a simple declarative syntax. No need to reinvent relational algebra and analytic functions.
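A tiny sketch of what that looks like (made-up table and column names; recent Ibis versions default to a DuckDB backend):

```python
import ibis

t = ibis.memtable({"grp": ["a", "a", "b"], "x": [1, 2, 3]})
expr = t.group_by("grp").aggregate(total=t.x.sum())

print(ibis.to_sql(expr))  # inspect the SQL it builds
print(expr.to_pandas())   # execute and pull the result back as pandas
```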
marr75@reddit
My teams were using pandas for so long. A new project came up where some flexibility between persistent and in-memory data was desirable; we checked out ibis and are never planning to start new projects in pandas or polars again (ibis can input from and output to both). Switching backends for free, faster execution, simpler persistence, less memory usage, better dev experience.
Duckdb being the default backend has been the hidden bonus we didn't know we needed, too. It's on the leading edge of performance and I wouldn't be surprised if they expanded their vectorized execution engine from SIMD to CUDA.
Why perform complex set operations and ETL in Python memory when an in-memory database can do them faster and with less memory churn?
trial_and_err@reddit
Also works great for testing. We store a local DuckDB database with some test data in our repo and use that one in our tests instead of BigQuery / Snowflake.
I also find it easy to debug, as I can always check out the raw SQL (I recommend using the .alias() method for readability if you're generating large queries, as this will split your query into CTEs).
The official Ibis docs are good but could be better (it took me, for example, a while to find out how to generate JSON columns - it's in the docs, but you won't find it by just searching for "JSON" or "Map")
marr75@reddit
We've got very similar patterns. Also, very easy to get your data out of duckdb and into Snowflake, Bigquery, or pg later. Parquet files is your worst case and that ain't bad.
The docs are really for getting started. I've had to read the source pretty frequently to get further but, that's why I love Python. Easiest to read source in the world.
BaggiPonte@reddit
I noticed AI assistants struggle a bit with Polars, but if you include even just one example in the prompt, everything works much more smoothly.
marr75@reddit
This is generally true of MANY tasks. There's even a fun paper about how a frontier model had good knowledge of the Pokemon battle system, but if you talk it through a few types and newer Pokemon, its performance increases dramatically. "In-Context Learning Creates Task Vectors" and "How to Think Step by Step" are fantastic foundational papers for understanding how LLMs "solve problems" and how context participates.
try-except-finally@reddit (OP)
Using Cursor + Claude + indexing the Polars API is the best setup right now
BaggiPonte@reddit
Neat!
No_Departure_1878@reddit
https://ibis-project.org/
TesNikola@reddit
No. Just no.
try-except-finally@reddit (OP)
From the feedback below:
“We are happy with Polars, don't need a Pandas-API wrapper on top of it”
Thank you, Reddit people
unfair_pandah@reddit
Hasn't the syntax/API been one of the things people traditionally complain the most about regarding Pandas?
Personally I've never been bothered by it, but I think that's some sort of Stockholm syndrome. I find Polars so much more enjoyable to write. You've just got to dive off the deep end and fully transition to polars to change your Pandas habits!
try-except-finally@reddit (OP)
I'm good with Polars, I just see data scientists still using pandas a lot, despite Polars having been around for years
marsupiq@reddit
If that’s what they prefer to be doing… frankly, my days were a lot more relaxed when I had to wait for results and tolerate crashes and freezes. 😅
unfair_pandah@reddit
If it works, gets the job done, and people are happy, then all the power to them for using Pandas!
try-except-finally@reddit (OP)
The problem is that it's code that I have to deploy in production, and often it's too slow or uses too much memory, so I have to rewrite everything in Polars
marsupiq@reddit
Yes, been there. I had written a prototype in pandas and XGBoost that I had only tested on a small dataset. It required around 100GB of memory to run with the production workload, and it was terribly slow. Replacing pandas with polars and XGBoost with LightGBM, I was able to reduce it to 10GB and also make it much faster.
But I should say that at my company we don’t make a distinction (in most teams at least) between Data Scientists and Machine Learning Engineers. So if my code is inefficient, that’s my problem and not someone else’s. Not sure what I would do in your case...
big_data_mike@reddit
Pandas is changing some things under the hood in the latest versions to save memory (e.g. Arrow-backed dtypes and copy-on-write). They are borrowing ideas from polars.
ArabicLawrence@reddit
have you tried modin? https://github.com/modin-project/modin
try-except-finally@reddit (OP)
Yes, but it's not nearly as fast and efficient as polars if you don't use a back-end like Dask or Ray
sinnayre@reddit
My first thought too. Only need to change one line.
import modin.pandas as pd
marsupiq@reddit
I doubt it would be possible (at least not without a significant loss of performance), because pandas relies on eager evaluation, whereas polars is inherently lazy (in fact, the eager API uses the lazy API under the hood, but expressions are still always evaluated lazily, even in eager mode). Perhaps you could come up with some adapter layer that would be compatible 80% of the time (but it would still have to evaluate everything in horribly inefficient ways). But in the end, I've seen people using pandas in some pretty "creative" ways…
It would be easy, on the other hand, to provide a polars-compatible interface based on pandas. But then again, it would be completely useless.
It’s easy to change habits. Many people have adopted polars over the last 2+ years (and I’m proud to say: I was using polars before it was cool 😎). But in the end, everybody is free to choose what they want. And there are other ways to speed up your processing, including pandas-compatible ones like dask or rapids/cuDF…
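For reference, the lazy flavour I mean looks roughly like this (made-up file and column names):

```python
import polars as pl

lf = pl.scan_csv("data.csv")   # lazy: nothing is read or computed yet
result = (
    lf.filter(pl.col("x") > 0)
      .group_by("grp")
      .agg(pl.col("x").mean())
      .collect()               # the whole plan is optimized and run here
)
```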
ReadyAndSalted@reddit
I understand the sentiment behind this; maybe more people would get the speed advantage of polars if it were more accessible to pandas users out of the box. However:
1. Polars doesn't use index columns (thank god), so you'd have to think about how to design around that.
2. Polars syntax (while more verbose) is almost universally appreciated for how much easier it is to learn and to read.
So I think it would be a mistake to try and force the pandas API onto polars, when the Polars API is so much better, and when it would require so much rethinking of the pandas API to even make it work.
commandlineluser@reddit
It's not Polars-based, but `fireducks` has a similar end goal.
DataPastor@reddit
Absolutely not. Pandas’ syntax was a mistake.
pool007@reddit
One of polars benefit that made me convert was clean api, though.
VovaViliReddit@reddit
No, pandas' syntax is terribly outdated.
cocomaiki@reddit
You could check out `Narwhals`
One of the recent episodes of the `RealPython` podcast covered `Narwhals`, and you'll get plenty of information there.
From the 11th minute: https://realpython.com/podcasts/rpp/224/