blip

blip : Blog of Isaac & Jason :

The nuances of JOIN

August 1st, 2003 by Jason · No Comments

Gerg and I were banging our heads one day because our application had become extremely slow. We had added a bunch of test data (100,000+ rows) and now our list page was taking 30 seconds to load. So we began trying to re-formulate our query in all the possible ways to find out what part of it was slow. In the end it came down to one word: INNER, which is one of the several ways to join two tables together.

Take the following example DB schema for a basic shopping cart application. There is a carts table that lists each user’s cart, when it was created, etc. Then there is a cart_items table which has the items that are in the cart. Our page was to list what carts used a particular item (e.g. find out all the people that had put Linux (it em id of 5) in their cart ;). The original query was:


SELECT t1.id, t1.user
FROM carts AS t1
LEFT JOIN cart_items AS t2 ON t1.id = t2.cart_id
WHERE t1.status = 'ACTIVE' AND t2.item_id = 5
GROUP BY t1.id

With 100,000 carts and 10 times that in the cart_items table this starts to crawl, EVEN if item 5 has only been i n 1 cart. The slownewss is in the LEFT join. Here is where the nuances of different joins comes in. A LEFT join must return all rows from the left table, and fill them with NULL values if it can’t find a matching row on the right. So, in our example the database engine goes through the 100,000 carts trying to join them up with the some kind of item on the right. If you try do this by hand, you’ll begin to see why this is such a slow operation ;).

The solution is found in an INNER join. An INNER join is a much more basic join and thus much faster. An INNER join will not return all the rows from the left table; it will only return those rows that match between the two tables. So, if you need all the rows from the left table, even when they don’t have a corresponding row on the right table you have to use a LEFT join. But in our case we don’t care about carts that don’t have matching entries in the right table. We just want the ones that have used item 5. So when we replace the LEFT JOIN with an INNER JOIN the query becomes blazingly fast. The reason is that the database engine doesn’t have to worry about returning all the rows from the left table so it is free to chop out rows using the WHERE clause before doing the join. So, in our example it is able to remove all but 1 row from the right table (assuming there is only one cart that uses item 5). Now, the left table only has to join on one row!

You might think, why couldn’t the LEFT JOIN have known to pare down the right table based on the WHERE statement? It’s because it has no way of knowing if you are relying on the fact that NULL values will be present for the unmatched rows on the left table. The key is that a LEFT (or RIGHT) join can only prune from the primary table before beginning the join process, while an INNER can prune rows from both tables. It’s these one word nuances that keep programmers in business!

Tags: technology