Joining Data Tables: The Basics
07/16/2018 • by Nicole Hitner • 0 comments
A relational database is ultimately just a collection of data tables, each of which is useful on its own but exponentially more useful when joined to its neighbors. Data analysts will have a leg up core reporting concepts if they begin their BI training with a basic understanding of the four essential join types, which not only merge data on output but can also filter it to display a specific subset of records. Before we explore how joins define the relationships between tables, however, we first need to understand what a key is.
We will discuss two types of keys: primary keys and foreign keys. A primary key is a field or group of fields that uniquely identifies each row or record in the table. A table must have exactly one primary key to qualify as relational, but that key can be composed of multiple fields/columns.
A foreign key, by contrast, is one or more fields or columns that corresponds to the primary key of another table. Foreign keys are what make it possible to join tables to each other.
In the set of tables below, for example, course_id is the primary key for COURSE and rm_id is a foreign key corresponding to a field of the same name in ROOM. Note that ROOM does not have a foreign key for COURSE.
You might wonder why course isn’t the primary key for the COURSE table. Its values are indeed unique per record, so it could be used as the primary key, but the practice of employing data values as keys (called “natural keys”) is generally discouraged. What happens if we begin adding records to the table? As soon as another section of Biology 1 opens up, course name stops being unique per record. If the course name is our primary key, we are forced to give the new Biology 1 class a slightly different name, perhaps “Biology 1B,” and then update this key in all tables that reference it. This method is labor intensive and is apt to cause confusion. (If there’s no difference between Biology 1 and Biology 1B, why do they have different names? And whatever happened to Biology 1A?) It is for this reason that synthetic keys are typically used instead of natural keys. This way, the key has only one job: to locate records.
Even if course_id remains the primary key for the COURSE table, we are presented with a problem when a new section of Biology 1 is added to the table, and that is that both classes share the same course number. Universities resolve this problem by identifying individual classes by both their course numbers and their section numbers. In our example, then, we would need to add a course_section column and create a composite primary key out of the course number and section number columns. Alternatively, we could adjust the course_id to reflect sections as well:
Because COURSE has a foreign key corresponding to a primary key in ROOM (rm_id), we know that we can join these two tables together to make one. How that happens, however, will depend on the join type.
Full Outer Joins
The most basic join type is what’s known as a full outer join. The output contains all the primary ids in both tables and all fields in both tables, regardless of whether the records have matches in the opposite table. If, in the diagram below, we represent data that will appear in the output using the color blue, we can see that all records from both tables will be returned.
Let’s refer back to our original university coursework tables. If we do a full outer join on rm_id, keeping all primary ids in both tables as well as all columns, ordering by course_id ascending, we end up with the following output:
Because Developmental Psychology 2 has not yet been assigned a room, it does not have a match in the ROOM table. HUM311 has likewise not been assigned a course, so it has no match in the COURSE table. As a result, there are lots of null or empty values in the output, and this is fairly typical of full outer joins. The goal is to be all-inclusive, which isn’t the case for other join types. If, for example, we were only interested in courses with rooms, we might use a left outer join instead.
Left and Right Outer Joins
Left outer joins use all the primary ids of the left table and all fields of both tables. In the diagram below, we see that all records of the left table appear in the output, along with those in the right table as long as they have matches in the left.
Referring back to our collegiate example, a left outer join of the two original tables would yield the following:
Note that HUM311 is nowhere to be found on this table. That is because it neither has a primary key in the left table (because it’s a room, not a course), nor is it associated with a course. We’re beginning to see how joins can act as filters by restricting which records return from the query.
The mirror opposite of a left outer join is, as you might have guessed, a right outer join. In a right outer join, all the primary ids of the right table appear in the output along with all columns from both tables.
This time, it’s the PSY302 class that doesn’t make the cut.
All the rooms automatically make it into the output, but the only courses that will be returned, in this case, are those with rooms assigned. PSY302 has not yet been assigned a room and is therefore excluded from the resulting table.
The fourth and final basic join type is an inner join. Here, the only records we want returned on output are those referenced in both tables.
So in our courses and rooms example, courses without rooms are out, as are rooms without courses. This leaves us with just the science courses and their respective rooms:
Semi Joins and Anti Joins
Inner, left outer, right outer, and full outer are the four basic join types to know when you’re just getting into SQL, but there are other less common joins to explore as well. Semi joins, unlike the joins we’ve looked at so far, don’t return all columns of both tables. Instead, they just return the columns of the left table and records with matches in the right table. If we were to diagram this concept, it might look something like this:
The output would look similar as the output resulting from an inner join, except columns specific to Table B (ROOM) would be missing:
An anti join, by contrast, is the opposite of a semi join: we only return records from Table A that do not have matches in Table B.
Since only the Psychology course has no assigned room, it would be the only record to appear on output:
Once you’ve got these basic join types down pat, you can explore formulaic joins, such as those containing conditional statements, ranges, and calculations! SQL is such a versatile language that joining options are virtually endless. Understanding these four primary concepts will help inform your use of business intelligence solutions and other analytics tools.