Optimizing query

Hi!

I’m trying to speed up some queries in my app, running Rails 3.2 and need some help. I’m running the app at Heroku on postgresql. I’m new to postgresql and need some help to optimize a query so it effectively uses indices.

This is the query I’m currently working on:

http://snipt.net/Linuus/rails-query?key=ee73173643e8d21a5a487d8a329c7a26

In the link above you can also see the full EXPLAIN (query plan) as well as my current indices. It doesn’t seem to use any of my indices at the moment…

So, any help is very appreciated.

Regards

Linus

Would probably help if you posted the generated SQL as well as the output of EXPLAIN ANALYZE, which shows not just what the planner thinks the costs are, but what the costs actually were.

You could also try running ANALYZE on the tables in question, or ANALYZE VERBOSE and checking if the estimates are pretty close to reality, e.g.

ANALYZE VERBOSE categories;

Jim Crate

Thank you. I have added the generated SQL to the snipt in my first post.

Regarding the EXPLAIN ANALYZE, can I somehow run that on Heroku? I’m using their shared database so I don’t have access to any psql console…

I did run it on my local machine though. The thing is that it differs in database version (Local is PostgreSQL 9.1.2 while Heroku still uses 8.3 or something) and also in size.

Anyway, the EXPLAIN ANALYZE from my local machine can be found here:

http://snipt.net/Linuus/postgresql?key=e4728f444b24839e3f80adf3829bcba9

Regards

Linus

Thank you. I have added the generated SQL to the snipt in my first post.

Regarding the EXPLAIN ANALYZE, can I somehow run that on Heroku? I’m using their shared database so I don’t have access to any psql console…

I did run it on my local machine though. The thing is that it differs in database version (Local is PostgreSQL 9.1.2 while Heroku still uses 8.3 or something) and also in size.

On Debian/Ubuntu you typically have a choice of an 8.4 or a 9.1 …

Try

$ apt-cache search postgresql-8.4

and then you could install that alongside 9.1.

It will run on a different port

(one on 5432 and one on 5433 IIRC, check $ netstat -atn).

That said, I have no information that the optimization would be

different for such a “simple” query.

Anyway, the EXPLAIN ANALYZE from my local machine can be found here:

http://snipt.net/Linuus/postgresql?key=e4728f444b24839e3f80adf3829bcba9

Just out of curiosity … what problem are you trying to solve ? Did you

actually see any slow queries ? I don’t see obvious problems with the

query … Not sure if this is the area that deserves your first attention

wrt performance optimization … there is this whole story about

“premature optimization” (Google it to find many rants about it …).

HTH,

Peter

Well, maybe it’s not necessary… It is the slowest of my queries as far as I can see anyway.

I’m using Rails 3.2 and I have enabled automatic EXPLAIN on slow queries (>0.5s). This query is currently being “explained” in my console when I run it on my localhost. “SQL (556.5ms) …”

556.5ms seems a bit slow to me, isn’t it?

Then a bunch of other stuff is happening and I get this:

“Completed 200 OK in 2737ms (Views: 813.5ms | ActiveRecord: 703.2ms)”

So, 550ms of the total 703ms is the above query.

Maybe I’m just picky? :slight_smile:

Indeed.

TL;DR

I had a similar database structure here, filled it with 10,000 records (you seem to have around 2,200 ?),

postgresql 9.1 on Ubuntu and ran a similar query:

cl1 = ContactLine.includes(:contact) ; nil

cl2 = cl1.includes(:contact => :person) ; nil

cl3 = cl2.where(“people.first_name = ‘M’ OR people.first_name = ‘longer_word’”) ; nil # 8 times ‘M’ and 8 times ‘longer_word’

cl4 = cl3.order(“contacts.email”) ; nil

167:0> puts cl4.explain

SQL (16.9ms) SELECT “contact_lines”.“id” AS …

FROM “contact_lines”

LEFT OUTER JOIN “contacts” ON “contacts”.“id” = “contact_lines”.“contact_id”

LEFT OUTER JOIN “people” ON “people”.“id” = “contacts”.“person_id”

WHERE (people.first_name = ‘M’ OR people.first_name = ‘longer_word’)

ORDER BY contacts.email

EXPLAIN (1.4ms) EXPLAIN SELECT “contact_lines”.“id” …

EXPLAIN for: SELECT “contact_lines”.“id” …

QUERY PLAN

Hi

I tested to remove the .order(…) part and indeed, the query time goes down to ~100ms. However, it doesn’t help to add indices, at least not as I did :slight_smile:

add_index :categories, :name

add_index :subcategories, :name

Did some more testing and if I keep the .order… but don’t join the products table I get a query that runs at about ~55ms. So the bottleneck seems to be the products table.

The query that I’m running looks like this:

Category.eager_load(:subcategories)

.joins(“INNER JOIN products AS p ON resellercategories.id = p.resellercategory_id”)

.order(“categories.name ASC, subcategories.name ASC”)

(Skipping the gender here…)

What I have is Categories and Subcategories. They are related to each other through a Resellercategories table. Products are related to Resellercategories.

So, the reason that I want to join the products as well is because I only want to show categories and subcategories that actually have some products (there are some empty categories/subcategories still).

So the above query is what we came up with in another thread here in the group.

  • Maybe there is a better way to check if a category/subcategory has products without joining the entire products table?

BTW, in my dev database there is about 8700 rows in products table, 2200 rows in resellercategories, 5 rows in categories and 45 rows in subcategories.

Also, another reason for the query that I forgot to mention is that if a user filters the products for female products only for instance, it should only show categories and subcategories that contains products for that gender.

The gender is specified in the products table…

It is possible to add a :counter_cache , but then you need to make sure you use the proper methods for each product that you add or remove from the association.

Alternative to the default counter cache (from Rails), you could build your own

logic as in:

  • has_male_products

changing you query to …

Category.eager_load(:subcategories). where(:has_male_products => true). order(…)

Then you would need to set the cache on the appropriate categories in

an after_save on the product you are creating/updating/deactivating/(deleting ?).

Both ideas would probably be faster for querying, but certainly more

complex for making sure that cache is always correct.

HTH,

Peter

Sorry to reply to my own post.

TL;DR Is there pagination? Then a smaller set may return much faster.

I was thinking over my reply and may have forgotten a fundamental

aspect … If you say 2200 categories, 8000 products.

How many entries does you query return ?

(replace .add with .count at the end).

How many do you need ?

What happens when you add .limit(20) to your query ?

By which “primary object” do you want to sort and paginate ?

(I will assume ‘Product’ in the discussion below).

With the includes that are currently implemented, you may

have to redo the whole query into 2 qeuries …

  1. for fetching the “primary objects”

(e.g. exactly 20 Products, no additional “has_many” data, because

that would increase the number of returned rows for 1 product and

make proper pagination in the database impossible; including

“belongs_to” here is no problem)

  1. a second query for fetching eventual “has_many” data

on those 20 “primary products” (is that “Reification” ?)

If the performance problem could be solved by taking the

pagination into account, that would be a much better

solution that building cache columns in this early phase

of your project.

HTH,

Peter

This query is only to get the appropriate Categories & Subcategories. There are 5 Categories and 45 Subcategories.

My products are imported from webshops which are using different categories for the same things (“tshirt”, “t-shirts”, “t-shirt”, “short sleeved shirts” may all be the same). To cope with this issue I have the “Resellercategories” that I relate to a specific Category and Subcategory.

Category/Subcategory → Resellercategory → Product

(Resellercategories are never seen in the frontend)

If I filter the products for “male” I only want to display (non-empty) Categories/Subcategories that have male products.

To answer your question, there is no pagination on the Categories and Subcategories that I’m fetching here with this query. The query only returns the appropriate Categories/Subcategories. So if filtered by “male” it returns 3 Categories and 12 Subcategories (In my dev environment).

I am using pagination for the actual products though, but that is a separate query.

OK, I understand (small amount of categories, much more products).

If you find no other way, then the caching the “has_male_products” etc

in the Category/Subcategory may be the best remaining way.

I was also thinking, maybe it is acceptable that the set of Categories

with “male_products” etc. is only update every hour or so. That would

avoid the complexity of real-time update of that cache column. On the

other hand, an after_save on products is not that difficult.

Just as a test, does your performance improve significantly if you add

a “has_male_products” column on categories and filter on that?

HTH,

Peter