I wrote a post on why you should start using polars in 2025, based on personal experience
Posted by lrtDam@reddit | Python | View on Reddit | 42 comments
There have been some discussions about pandas and polars on and off. I have been working in data analytics and machine learning for 8 years, and most of that time I've been using Python and pandas.
After trying polars last year, I strongly suggest you use polars in your next analytical project; this post explains why.
tldr:
1. faster performance
2. no inplace=True and reset_index
3. better type system
I'm still very new to writing this kind of technical post, and English is not my native language, so please let me know if and how you think the content/tone/writing can be improved.
unhinged_peasant@reddit
I did my first project in polars this week and I had a hard time with basic stuff. I guess pandas is more forgiving in some ways? Not sure. But I need to write a "Quick start" for Polars as I did with Pandas.
spurius_tadius@reddit
The good news is polars docs are excellent and the tool itself is consistent and predictable. The trade-off is that it's a bit turgid with syntax, especially for those of us who are coming from R-Tidyverse.
I am hoping the LLMs get better at Polars; the library has seen some rapid changes and it takes a while for LLMs to get good at it.
Doomtrain86@reddit
R data.table is the best data handling syntax ever invented. Succinct, fast, clear. The more I have to use Python, the more I appreciate how amazing it was.
BrisklyBrusque@reddit
I heard the polars website has its own LLM for exactly this reason.
spurius_tadius@reddit
Wait, what ?
That would be awesome, but I can't seem to find it. All I see is this: https://docs.pola.rs/user-guide/misc/polars_llms/
They do give some advice on getting help for Polars from LLMs, but it's not their own LLM.
I do expect that in the future, software projects like libraries, big APIs, and frameworks will end up training LLMs to help their users. Haven't seen that yet, but I hope it's coming.
commandlineluser@reddit
It's the "Ask AI" button on the bottom right of the Python API reference pages.
LNGBandit77@reddit
Not needed unless you're working with Facebook-level datasets.
whoEvenAreYouAnyway@reddit
You should use Ibis instead. That way you can use any query engine you want, including polars, and you only ever need to manage one interface and syntax.
commandlineluser@reddit
How does that help you use Polars features?
e.g. how would you do
pl.sum_horizontal()
in ibis?
marr75@reddit
You can materialize a polars frame anytime, but just expressing sum_horizontal in ibis expressions is another answer (the quickest I can think of is a column-wise reduction using addition).
commandlineluser@reddit
Thank you for the reply.
I just don't understand why that workflow would be suggested over using Polars directly.
techwizrd@reddit
I would like those features in Ibis, personally.
No_Dig_7017@reddit
Agree with OP. Polars is a far superior tabular data library to pandas.
Speed is the most visible factor, but for me the most important difference is the clarity of the API. Polars is built to perform complex operations by combining a few well-defined building blocks, as opposed to having separate methods, each with its own parameter naming convention, for specific tasks.
This means you need to go to the documentation a lot less frequently, since you only have to remember those building blocks, and in turn you can be more productive.
I find this invaluable when working with data, when you are deep in thought and any distraction can make you lose track.
astrok0_0@reddit
I have the misery of having to go back to Pandas in my new job, after switching to Polars at my previous place about 2 years ago. Just wtf man. My daily frustration level has been so high ever since. Speed really does not matter; I would choose Polars even if it were slower than Pandas, just for its superior API. Fighting with Pandas' nonsense in a legacy codebase is driving me crazy.
lrtDam@reddit (OP)
I think your summary is better than mine: working with polars is less mentally taxing for me. Most operations just work the way I intuitively expect.
spookytomtom@reddit
What's the matter with inplace=True? You don't even need to use it if you don't want to.
marr75@reddit
It's inconsistent as hell, for one thing (sometimes it avoids copying, sometimes it does not). For another, it's rough design that all of your methods are both queries and mutators.
BrisklyBrusque@reddit
Yes, I feel like it's a violation of the core Python principle "Explicit is better than implicit."
pandas does a lot: copies vs. in-place modification, not to mention views.
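A minimal example of the query-vs-mutator ambiguity (made-up data):

```python
import pandas as pd

df = pd.DataFrame({"x": [3, 1, 2]})

# Same method, two behaviours: one is a query returning a new
# frame, the other mutates df and returns None
sorted_copy = df.sort_values("x")           # df is left unchanged
result = df.sort_values("x", inplace=True)  # df is mutated; result is None
```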
Unhappy_Papaya_1506@reddit
I lost interest in Polars pretty much instantly after trying DuckDB.
_snif@reddit
Have you tried ibis?
marr75@reddit
To spell it out for people, Ibis is a Python data frame library that abstracts different execution backends, so the same piece of code can use most major SQL databases, pandas, and polars as interchangeable execution backends. As an even bigger advantage, you are mostly leaving the data in the SQL database and not serializing it over the wire.
improbabble@reddit
I keep wanting to like duckdb as an old MonetDB user, but it's always been really slow in all of my testing. Substantially slower than pandas.
marr75@reddit
Ibis used to use pandas as its default backend and recommended duckdb for speed. They maintain extensive benchmarks on all of their execution backends. DuckDB is generally the fastest (polars is very competitive, especially for mid-size data), so I would have to assume there was a problem in your setup.
Unhappy_Papaya_1506@reddit
That makes absolutely no sense
commandlineluser@reddit
That seems strange - my experience has been the complete opposite.
Do you maybe have an example of such a test?
If I take a 1_000_000 row parquet file with 1 string column, extract a substring, and cast to date.
For 10_000_000 rows.
maigpy@reddit
how do you df.apply() in duckdb?
Unhappy_Papaya_1506@reddit
It's not really a data frame way of thinking. You need to be relatively comfortable with SQL.
maigpy@reddit
Sometimes I have to carry out transformations that require me to run Python code, and SQL doesn't cut it. What do you do in those cases?
Say you start with a list of URLs from a sitemap, scrape some data, and then create folders and files based on the content of some of the scraped data. This works very well when keeping all the data in a dataframe; it'd be much more cumbersome to move it in and out of SQL tables in duckdb. And I'm a SQL lover. I'd rather spin up a Postgres container if I need SQL and have the freedom to do that; if I don't, I see the use for duckdb.
Unhappy_Papaya_1506@reddit
You're probably not working with larger-than-memory datasets, I'm guessing.
maigpy@reddit
What does larger than memory have to do with it? You still need that data, whether paged or not, in memory, to perform some actions on it.
Dr_Quacksworth@reddit
Sorry if I'm missing something, but don't most SQL flavors support an apply command?
BrisklyBrusque@reddit
R has a library called duckplyr that runs tidyverse commands on a duckdb backend.
Python has a library called Ibis that has yet another API, reminiscent of both SQL and tidyverse, against a duckdb backend.
Frankly I am surprised there is no library yet that combines a pandas frontend with a duckdb backend. I am sure it’s on the way.
guycalledsrijan@reddit
Can we use that AI tracer in VS Code at the office? Will it be legal, as per client data laws?
hugthemachines@reddit
Is this what you meant to ask?
"Is it legal to use AI-based tools like tracers or code assistants in VS Code, considering client data privacy laws?"
and in that case, why ask that comment on this post?
commandlineluser@reddit
With regards to your complaints:
Attribute notation is supported for valid Python identifiers, e.g.
pl.col.event_date
is
pl.col("event_date")
Some people seem to be using
from polars import col as c
so they can just write
c.event_date
Not sure if I understand your code for your date filter correctly.
From the text description it sounds like you want something like:
The
pl.Int8
type for the
.dt
methods can be a bit of a footgun.
lrtDam@reddit (OP)
Thanks for the advice! I do use
c = pl.col
sometimes, or
some_col = pl.col(column_name)
if that column is frequently used.
First time seeing
pl.any_horizontal
, will check that out
commandlineluser@reddit
It's an alternative way of expressing
|
chains.
pl.any_horizontal(foo, bar)
is
foo | bar
- but it also allows you to create the chains "programmatically". I also find it cleaner for larger expressions that would require lots of parens.
pl.all_horizontal()
is the same but for
&
chains.
BidWestern1056@reddit
nah why learn something new when old thing works just fine
missurunha@reddit
For people who work with devops and that type of task, learning the tool is the interesting part of the job, so they switch between different libs/frameworks as fast as they can.
BidWestern1056@reddit
yea i know im just being pessimistically sarcastic
internerd91@reddit
Hey, thanks for your post. I started learning it this week, actually.
chat-lu@reddit
People with perfect / near perfect English need to stop apologizing for their English level. Do you see the unilinguals apologizing?