How do I optimize SQL queries for large datasets as a beginner data analyst?
Posted by Only-Economist1887@reddit | learnprogramming | View on Reddit | 6 comments
I'm a beginner data analyst (about 6 months into learning SQL) and I'm working with a retail sales dataset that has around 500,000 rows. My queries are running quite slow and I'm not sure where to start with optimization.
Here's a typical query I'm running:
SELECT product_category, SUM(sales_amount) as total_sales
FROM sales_data
WHERE sale_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY product_category
ORDER BY total_sales DESC;
This takes about 8-10 seconds to run. I've heard about indexing but I'm not sure how and where to apply it. Any tips or resources would be really helpful!
abbasovdev@reddit
Add “Explain” keyword at the beginning of the query. It will show execution details, based on that you will see what is the real bottleneck.
Regarding indexing, I would recommend this free resource: https://use-the-index-luke.com/
Aggressive_Ad_5454@reddit
This is the way. Learn to read execution plans and read the Luke book.
born_zynner@reddit
Matthew mark and john not relevant?
omfghi2u@reddit
Yeah you're on the right track, you need to index the table on the date column. Searching for dates within a range is the only thing here that is doing anything remotely heavy. You have to do that to the table itself. It's not something you do in a query, it's database optimization. If you don't have access to modify the table definition, there's not much you can do.
BNfreelance@reddit
I would guess the bottleneck is occurring from the
WHERE sale_date BETWEEN, as it’s running that against all 500k rowsyou would benefit from creating an index, or better still, a composite index
The above aligns with how you run your query atm: it allows the database to filter by
sale_dateand then work with rows already ordered byproduct_categoryfor the grouping.i would index
WHEREcolumns first, and thenGROUP BYFor sanity sake you’d want to check that
sale_dateis correctly stored as a proper formatted date / time, and not a stringMeLittleThing@reddit
The query looks fine, 8 - 10 secs is huge for such query. What type is
sale_date? isproduct_categorya foreign key?