Introductory Transact-SQL, Page 6
JoinsA query that can access all the data it needs in a single table is a pretty rare one. John Donne said "no man is an island," and, in relational databases, no table is, either. Usually, a query will have to go to two or more tables to find all the information it requires. This is the way of things with relational databases. Data is intentionally spread out to keep it as modular as possible. There are lots of good reasons for this modularization (formally known as normalization) that I won't go into here, but one of its downsides is that what might be a single conceptual entity (an invoice, for example) is often split into multiple physical entities when constructed in a relational database.
Dealing with this fragmentation is where joins come in. A join consolidates the data in two tables into a single result set. The tables aren't actually merged; they just appear to be in the rows returned by the query. Multiple joins can consolidate multiple tables—it's quite common to see joins that are multiple levels deep involving scads of tables.
A join between two tables is established by linking a column or columns in one table with those in another (CROSS JOINs are an exception, but more on them later). The expression used to join the two tables constitutes the join condition or join criterion. When the join is successful, data in the second table is combined with the first to form a composite result set—a set of rows containing data from both tables. In short, the two tables have a baby, albeit an evanescent one.
There are two basic types of joins, inner joins and outer joins. The key difference between them is that outer joins include rows in the result set even when the join condition isn't met, while an inner join doesn't. How is this? What data ends up in the result set when the join condition fails? When the join criteria in an outer join aren't met, columns in the first table are returned normally, but columns from the second table are returned with no value—as NULLs. This is handy for finding missing values and broken links between tables.
There are two families of syntax for constructing joins—legacy and ANSI /ISO SQL-92 compliant. The legacy syntax dates back to SQL Server's days as a joint venture between Sybase and Microsoft. It's more succinct than the ANSI syntax and looks like this:
SELECT customers.CustomerNumber, orders.AmountFROM customers, ordersWHERE customers.CustomerNumber=orders.CustomerNumberCustomerNumber Amount-------------- --------1 123.452 678.903 86753.091 678.90
Note the use of the WHERE clause to join the customers and orders tables together. This is an inner join. If an order doesn't exist for a given customer, that customer is omitted completely from the list. Here's the ANSI version of the same query:
SELECT customers.CustomerNumber, orders.AmountFROM customers JOIN ordersON (customers.CustomerNumber=orders.CustomerNumber)
This one's a bit loquacious, but the end result is the same: customers and orders are merged using their respective CustomerNumber columns.
As I mentioned earlier, it's common for queries to construct multilevel joins. Here's an example of a multilevel join that uses the legacy syntax:
SELECT customers.CustomerNumber, orders.Amount, items.DescriptionFROM customers, orders, itemsWHERE customers.CustomerNumber=orders.CustomerNumberAND orders.ItemNumber=items.ItemNumberCustomerNumber Amount Description-------------- -------- -----------1 123.45 WIDGET A2 678.90 WIDGET B3 86753.09 WIDGET C1 678.90 WIDGET B
This query joins the composite of the customers table and the orders table with the items table. Note that the exact ordering of the WHERE clause is unimportant. In order to allow servers to fully optimize queries, SQL requires that the ordering of the predicates in a WHERE clause must not affect the result set. They must be associative—the query must return the same result regardless of the order in which they're processed.
As with the two-table join, the ANSI syntax for multitable inner joins is similar to the legacy syntax. Here's the ANSI syntax for the multitable join above:
SELECT customers.CustomerNumber, orders.Amount, items.DescriptionFROM customers JOIN orders ON (customers.CustomerNumber=orders.CustomerNumber)JOIN items ON (orders.ItemNumber=items.ItemNumber)
Again, it's a bit wordier, but it performs the same function.
Thus far, there hasn't been a stark contrast between the ANSI and legacy join syntaxes. Though not syntactically identical, they seem to be functionally equivalent.
This all changes with outer joins. The ANSI outer join syntax addresses ambiguities inherent in using the WHERE clause—whose terms are by definition associative—to perform table joins. Here's an example of the legacy syntax that contains such ambiguities:
-- Bad SQL - Don't runSELECT customers.CustomerNumber, orders.Amount, items.DescriptionFROM customers, orders, itemsWHERE customers.CustomerNumber*=orders.CustomerNumberAND orders.ItemNumber*=items.ItemNumber
Don't bother trying to run this—SQL Server won't allow it. Why? Because WHERE clause terms are required to be associative, but these aren't. If customers and orders are joined first, those rows where a customer exists but has no orders will be impossible to join with the items table since their ItemNumber column will be NULL. On the other hand, if orders and items are joined first, the result set will include ITEM records it likely would have otherwise missed. So the order of the terms in the WHERE clause is significant when constructing multilevel joins using the legacy syntax.
It's precisely because of this ambiguity—whether the ordering of WHERE clause predicates is significant—that the SQL-92 standard moved join construction to the FROM clause. Here's the above query rewritten using valid ANSI join syntax:
SELECT customers.CustomerNumber, orders.Amount, items.DescriptionFROM customers LEFT OUTER JOIN orders ON(customers.CustomerNumber=orders.CustomerNumber)LEFT OUTER JOIN items ON (orders.ItemNumber=items.ItemNumber)CustomerNumber Amount Description-------------- -------- -----------1 123.45 WIDGET A1 678.90 WIDGET B2 678.90 WIDGET B3 86753.09 WIDGET C
Here, the ambiguities are gone, and it's clear that the query is first supposed to join the customers and orders tables, then join the result with the items table. (Note that the OUTER keyword is optional.)
To understand how this shortcoming in the legacy syntax can affect query results, consider the following query. We'll set it up initially so that the outer join works as expected:
SELECT customers.CustomerNumber, orders.AmountFROM customers, ordersWHERE customers.CustomerNumber*=orders.CustomerNumberAND orders.Amount>600CustomerNumber Amount-------------- --------1 678.902 678.903 86753.09
Since every row in customers finds a match in orders, the problem isn't obvious. Now let's change the query so that there are a few mismatches between the tables, like so:
SELECT customers.CustomerNumber+2, orders.AmountFROM customers, ordersWHERE customers.CustomerNumber+2*=orders.CustomerNumberAND orders.Amount>600
This version simply adds 2 to CustomerNumber to ensure that at least a few of the joins will fail and the columns in orders will be returned as NULLs. Here's the result set:
CustomerNumber Amount-------------- --------3 86753.094 NULL5 NULL
See the problem? Those last two rows shouldn't be there. Amount is NULL in those rows (because there are no orders for customers 4 and 5), and whether it exceeds $600 is unknown. The query is supposed to return only those rows whose Amount column is known to exceed $600, but that's not the case. Here's the ANSI version of the same query:
SELECT customers.CustomerNumber+2, orders.AmountFROM customers LEFT OUTER JOIN orders ON(customers.CustomerNumber+2=orders.CustomerNumber)WHERE orders.Amount>600CustomerNumber Amount-------------- --------3 86753.09
The SQL-92 syntax correctly omits the rows with a NULL Amount. The reason the legacy query fails here is that the predicates in its WHERE clause are evaluated together. When Amount is checked against the >600 predicate, it has not yet been returned as NULL, so it's erroneously included in the result set. By the time it's set to NULL, it's already in the result set, effectively negating the >600 predicate.
Though the inner join syntax you choose is largely a matter a preference, you should still use the SQL-92 syntax whenever possible. It's hard enough keeping up with a single way of joining tables, let alone two different ways. And, as we've seen, there are some real problems with the legacy outer join syntax. Moreover, Microsoft strongly recommends the use of the ANSI syntax and has publicly stated that the legacy outer join syntax will be dropped in a future release of the product. Jumping on the ANSI /ISO bandwagon also makes sense from another perspective: interoperability. Given the way in which the DBMS world—like the real world—is shrinking, it's not unusual for an application to communicate with or rely upon more than one vendor's DBMS. Heterogeneous joins, passthrough queries, and vendor-to-vendor replication are now commonplace. Knowing this, it makes sense to abandon proprietary syntax elements in favor of those that play well with others.