The goal of this content is to provide A good foundational understanding of Oracle SQL for querying databases to get you at a level to perform analysis on data.
It won't cover management of data within databases, or any way to automate your scripts and code, that is beyond the scope of this content.
A free software produced by Oracle it has a basic and clean interface but more complex features for power users.
An alternative SQL IDE, a license is required for usage.
SQL is just instructions that are sent to the database, so you can edit that plain text code in any text editor, and many tools offer database connections, including VSCode, RStudio, and many more!
A comment is the most important thing in any language.
They let you write notes in and around your code that the computer ignores.
A comment is the most important thing in any language.
They let you write notes in and around your code that the computer ignores.
-- Comments in SQL are done like this
A comment is the most important thing in any language.
They let you write notes in and around your code that the computer ignores.
-- Comments in SQL are done like this
/* Or multi-line comments
are done like this */
A comment is the most important thing in any language.
They let you write notes in and around your code that the computer ignores.
-- Comments in SQL are done like this
/* Or multi-line comments
are done like this */
select * from table; -- you can comment at the end of a line
select * /* or even in the middle*/ table;
Keywords in SQL are case insensitive so you can type them however you like.
Keywords in SQL are case insensitive so you can type them however you like.
SELECT ORIGIN, DEST FROM FLIGHTS;
Keywords in SQL are case insensitive so you can type them however you like.
Whiteshape is also ignored in SQL, so you can layout your code however you like.
SELECT ORIGIN, DEST FROM FLIGHTS;
Keywords in SQL are case insensitive so you can type them however you like.
Whiteshape is also ignored in SQL, so you can layout your code however you like.
SELECT ORIGIN, DEST FROM FLIGHTS;
select
origin,
dest
from
flights;
However, being consistent and making your code easy to read is important.
SELECT ORIGIN, DEST FROM FLIGHTS;
select
origin,
dest
from
flights;
However, being consistent and making your code easy to read is important.
/* Three equivalent queries despite case and whitespace */
SELECT ORIGIN, DEST FROM FLIGHTS;
select
origin,
dest
from
flights;
SeLeCt OrIgIn,DeSt fRoM fLiGhTs;
Error messages try to tell you what you've done wrong in your code when it doesn't work.
Learning to read them and understand what they mean is invaluable, and we will see some as we go through.
Finally, empty values in SQL have a special type call null.
This isn't an empty string or 0 or NA, it is a special thing all by itself.
When you write a query, you are asking the database to produce you a table. This table need to have columns, so we need to tell the database which variables we need.
Some of these variables might already exist, some might need to be created - we'll learn how to do both.
All SQL queries take the following minimum form.
SELECT
variables
FROM
DATABASE.SCHEMA.TABLE;
All SQL queries take the following minimum form.
Every query has a FROM keyword followed by a reference to a table, this tells the database where to look for our variables.
SELECT
variables
FROM
DATABASE.SCHEMA.TABLE;
All SQL queries take the following minimum form.
Every query has a SELECT keyword followed by list of variables; these might be existing columns in the table or ones created.
SELECT
variables
FROM
DATABASE.SCHEMA.TABLE;
All SQL queries take the following minimum form.
Every query ends with a semi-colon. This is like a full stop for the database so it knows where one query stops and the next one begins.
SELECT
variables
FROM
DATABASE.SCHEMA.TABLE;
We can select every variable from our table using the * special character.
SELECT
*
FROM
schema_name.flights;
We can select every veriable from our table using the * special character.
Or we can swap out the * and list the column names ourselves, seperated by a comma
SELECT
*
FROM
schema_name.flights;
Or we can swap out the * and list the column names ourselves, seperated by a comma
SELECT
year,
month,
day,
dep_time,
sched_dep_time,
dep_delay,
arr_time,
sched_arr_time,
arr_delay,
carrier,
flight,
tailnum,
origin,
dest,
air_time,
distance,
hour,
minute,
time_hour
FROM
schema_name.flights;
Or we can just select some of them
SELECT
year,
month,
day,
dep_time,
sched_dep_time,
carrier,
flight,
tailnum,
origin,
dest
FROM
schema_name.flights;
Run some of these queries so far and see the results.
We can ask the database for variables that aren't in our table, we just need to tell it what the data is. In this case, that data is the value 123.
SELECT
year,
month,
day,
dep_time,
sched_dep_time,
carrier,
flight,
tailnum,
origin,
dest,
123
FROM
schema_name.flights;
Note that we had to add a comma to the end of the list before our new variable.
It would be nice to give this variable a name, we can do that using the as keyword and call it whatever we like (within reason). This is known as aliasing.
SELECT
year,
month,
day,
dep_time,
sched_dep_time,
carrier,
flight,
tailnum,
origin,
dest,
123 as new_column
FROM
schema_name.flights;
Column names should start with a letter, have only _ as punctuation and be less than 30 characters long. Some of these rules can but should not be broken.
Technically we don't need to use as but for column names it is good practice.
SELECT
year,
month,
day,
dep_time,
sched_dep_time,
carrier,
flight,
tailnum,
origin,
dest,
123 new_column
FROM
schema_name.flights;
We'll see in a second you can alias a table as well, but you cannot use the as word there!
Run some of these queries so far and see the results.
NOTE that we can't mix * and extra variables.
-- This code will error
SELECT
*,
123 as new_column
FROM
schema_name.flights;
NOTE that we can't mix * and extra variables.
This is because * doesn't mean everything from our table, it means EVERYTHING, and we can't ask for more than everything (it's like taking infinity+1).
-- This code will error
SELECT
*,
123 as new_column
FROM
schema_name.flights;
What we need to do it tell the database we want everything from the table and note the universe. So first, it's important to understand the following are the same query, one is shorthand for the other.
SELECT
tailnum
FROM
schema_name.flights;
SELECT
schema_name.flights.tailnum
FROM
schema_name.flights;
So if we want everything from the table, we could write the following, and this will work, but is pretty annoying to type.
SELECT
schema_name.flights.*,
123 as new_column
FROM
schema_name.flights;
So if we want everything from the table, we could write the following, and this will work, but is pretty annoying to type.
What if the same way we aliased columns we alias our table? Let's call it a
SELECT
a.*,
123 as new_column
FROM
schema_name.flights a;
SIDENOTE: You can break some of those column name rules using double quotes ("), these are used when you talk about database objects (tables, columns etc). Use single quotes (') for text data in the database.
SELECT
a.*,
123 as "new column"
FROM
schema_name.flights a;
Run some of these queries so far and see the results.
Attempt questions 1-4.
Having a column with all the same value isn't really that useful.
Let's create create some variables using existing ones next.
From now on for simplicity we will only show the relevant parts of the code, not all the variables as shown in the training file.We can start by doing some basic maths on a numeric column.
SELECT
month,
month + 1 AS next_month,
...
FROM
schema_name.flights;
We can start by doing some basic maths on a numeric column.
We don't even have to keep the original column.
SELECT
month + 1 AS next_month,
...
FROM
schema_name.flights;
We CAN'T use our new column in another calculation (we'll see way to do this later).
-- This code will error
SELECT
month + 1 AS next_month,
next_month + 1 AS next_next_month
...
FROM
schema_name.flights;
Now let's take a look at some string functions that will be useful for when we're working with text.
SELECT
carrier,
...
FROM
schema_name.flights;
lower takes a string and makes it lowercase.
SELECT
carrier,
lower(carrier),
...
FROM
schema_name.flights;
upper takes a string and makes it uppercase.
SELECT
carrier,
lower(carrier),
upper(lower(carrier)),
...
FROM
schema_name.flights;
Notice how we've nested functions here, this is a useful trick.
substr takes 3 arguments, the first 2 are required and are the string to cut and the start position. The third argument is optional and is the length of the substring.
SELECT
carrier,
lower(carrier),
upper(lower(carrier)),
tailnum,
substr(tailnum, 3),
substr(tailnum, 1, 3),
substr(tailnum, 3, 2),
...
FROM
schema_name.flights;
This is like LEFT, MID, and RIGHT from Excel all in one.
length tells you have many characters are in a string.
SELECT
carrier,
lower(carrier),
upper(lower(carrier)),
tailnum,
substr(tailnum, 3),
substr(tailnum, 1, 3),
substr(tailnum, 3, 2),
length(tailnum)
...
FROM
schema_name.flights;
Run some of these queries so far and see the results.
Attempt questions 5-7.
Sometimes we don't want to do the same thing to every record, we might want to do different calculations based on some conditions.
We do this by something called a case statement.
Case statements take this form:
CASE
WHEN condition THEN value_1
WHEN condition THEN value_2
...
WHEN condition THEN value_n
ELSE value_default
END
Case statements take this form:
CASE
WHEN condition THEN value_1
WHEN condition THEN value_2
...
WHEN condition THEN value_n
ELSE value_default
END
Case statements take this form:
CASE
WHEN condition THEN value_1
WHEN condition THEN value_2
...
WHEN condition THEN value_n
ELSE value_default
END
Case statements take this form:
CASE
WHEN condition THEN value_1
WHEN condition THEN value_2
...
WHEN condition THEN value_n
ELSE value_default
END
Case statements take this form:
CASE
WHEN condition THEN value_1
WHEN condition THEN value_2
...
WHEN condition THEN value_n
ELSE value_default
END
REMEMBER: every CASE must have an END and every WHEN must have a THEN.
CASE
WHEN condition THEN value_1
WHEN condition THEN value_2
...
WHEN condition THEN value_n
ELSE value_default
END
To write these boolean/logical tests we use comparison operators like the following:
Symbol | Comparison |
---|---|
= | Equal |
!=, <>, ^= | Unequal (not equal) |
<, <= | Less than, Less than or equal to |
>, >= | Greater than, Greater than or equal to |
[NOT] IN | Set membership (is the element in the set/list provided). Set provided in ('Elem1', 'Elem2', ..., 'ElemN') format |
[NOT] LIKE | Pattern match (is the element like the provided pattern). Uses % as 0-Inf character wildcard, _ for single character wildcard |
[NOT] BETWEEN | Shorthand for x <= y and y <= z becomes y between x and z |
And we can combine them using the logical operators:
Operation | Meaning |
---|---|
AND | |
OR | |
NOT | |
() |
A | B | A AND B | A OR B | NOT A |
---|---|---|---|---|
TRUE | TRUE | TRUE | TRUE | FALSE |
TRUE | FALSE | FALSE | TRUE | - |
FALSE | TRUE | FALSE | TRUE | TRUE |
FALSE | FALSE | FALSE | FALSE | - |
And we can combine them using the logical operators:
Operation | Meaning |
---|---|
AND | Logical AND, both statements must be TRUE |
OR | Logical OR, at least one statement must be TRUE |
NOT | Logical negation, converts TRUE to FALSE and vice versa |
() | Brackets, force the evaluation of the logical statement inside before any outside, i.e. (2+7)*5 != 2+7*5 |
Let's look at some examples
CASE
WHEN sched_dep_time < 1200 THEN 'AM'
ELSE 'PM'
END AS DEP_AM_PM,
CASE
WHEN sched_dep_time BETWEEN 0 AND 0800 THEN 'Early Morning'
WHEN sched_dep_time BETWEEN 0801 AND 1159 THEN 'Morning'
WHEN sched_dep_time BETWEEN 1200 AND 1800 THEN 'Afternoon'
WHEN sched_dep_time BETWEEN 1801 AND 2100 THEN 'Evening'
WHEN sched_dep_time BETWEEN 2100 AND 2359 THEN 'Night'
END AS DEP_TIME_DETAIL, -- Notice no ELSE
CASE
WHEN length(sched_arr_time) != 4 THEN 'AM'
WHEN substr(sched_arr_time, 1, 2) < 12 THEN 'AM'
WHEN substr(sched_arr_time, 1, 2) > 12 THEN 'PM'
ELSE 'Midday'
END AS ARR_AM_PM,
...
Now examples using in
CASE
WHEN dest in ('JFK', 'LGA', 'EWR') THEN 'NYC'
ELSE 'Other'
END AS dest_nyc_check_in,
-- see how much easier it is than writing ors
CASE
WHEN dest = 'JFK' or dest = 'LGA' or dest = 'EWR' THEN 'NYC'
ELSE 'Other'
END AS dest_nyc_check_ors
...
Notice that comparisons always have to have a left and a right hand side, even when it's the same LHS
String values are case sensitive
CASE
WHEN origin in ('JFK', 'LGA', 'EWR') THEN 'NYC'
ELSE 'Other'
END AS origin_nyc_check1,
CASE
WHEN lower(origin) in ('JFK', 'LGA', 'EWR') THEN 'NYC'
ELSE 'Other'
END AS origin_nyc_check2,
CASE
WHEN origin in ('JFK', 'lga', 'ewr') THEN 'NYC'
ELSE 'Other'
END AS origin_nyc_check3
...
And finally the like matches.
-- Any number of character (including 0) before, and after a 7
CASE
WHEN tailnum LIKE '%7%' THEN 1
ELSE 0
END AS contains_7,
-- Any number of characters before a capital Q
CASE
WHEN tailnum LIKE '%Q' THEN 1
ELSE 0
END AS ends_in_q,
-- A NOT acting on the output of a logical statement
CASE
WHEN NOT (tailnum LIKE '%Q') THEN 1
ELSE 0
END AS ends_in_not_q,
-- A NOT LIKE, specifically any record that doesn't match the pattern,
-- slightly different to above
CASE WHEN tailnum NOT LIKE 'N5_5%' THEN 1
ELSE 0
END AS weird_test
...
Run some of these queries so far and see the results.
Attempt questions 8-11.
Finally let's look at dates in Oracle SQL.
Date data types include both a date and a time, there is no time only value in Oracle.
There is another type called a timestamp that contains milliseconds and timezones, most functions that work on dates also work on timestamps.
Just like in Excel, dates in Oracle are just numbers - meaning a difference of 24 hours is 1, 12 hours is 0.5 etc.
However unlike excel because they are a specific datatype there's some specific functions to work with them.
Function | Description |
---|---|
TO_DATE(variable, fmt) | Converts the variable (or string) into a date by giving it the format the variable is in |
TRUNC(date [, fmt]) | Sets the time component of the date to 00:00:00 by default, or will round down to the format if provided |
sysdate | A system variable that is the current (local) date and time |
NEXT_DAY(date, day) | Returns the next date of the given day, after the date provided e.g. takes a Saturday at 00:00:01 to the NEXT Saturday |
TO_CHAR(variable, fmt) | Converts the variable (usually a date or a number) into the given format. This is the method to format a date in a given way NOT substr |
EXTRACT(value FROM date) | Returns the given value e.g. year from the date and returns it as a number. To extract units less than a day you must use a timestamp |
TO_TIMESTAMP(variable, fmt) | Converts the variable (or string) into a timestamp by giving it the format the variable is in |
SELECT
time_hour,
time_hour + 1, -- add 1 day
time_hour + 5/24/60, -- add 5 minutes
TRUNC(time_hour), -- set time component to 0
TRUNC(time_hour, 'MM'), -- set to first of month
NEXT_DAY(time_hour, 'Saturday') - 1, -- Set to Friday inclusive for a week
TO_CHAR(time_hour, 'HH24:MI DAY DD MONTH YYYY'), -- format date in a specific format
TO_CHAR(time_hour, 'HH24:MI Day dd month yyyy'), -- notice how case changes the output
-- convert our columns into a date
TO_DATE(year||'-'||month||'-'||day||' '||hour||':'||minute, 'YYYY-MM-DD HH24:MI') -- convert our columns into a date
FROM
schema_name.flights;
Using TO_DATE every time can be annoying, so we can specify a date in another way. This is easier to read and more efficient for the database as well.
SELECT
DATE '2019-01-20' AS fixed_date
FROM
schema_name.flights;
The format is ALWAYS YYYY-MM-DD and is a string.
You should NEVER use substr on a date because the result you gets depend on how the user has their system set up.
-- Set a specific date format for your local session and sub-string our dates
ALTER SESSION SET NLS_DATE_FORMAT = 'DD-MM-YYYY HH24:MI:SS';
SELECT
time_hour,
substr(time_hour, 9, 2)
FROM
schema_name.flights;
-- returns 2 digit year
-- Change the format and check the results
ALTER SESSION SET NLS_DATE_FORMAT = 'YYYY-DD-MM HH24:MI:SS';
SELECT
time_hour,
substr(time_hour, 9, 2)
FROM
schema_name.flights;
-- returns 2 digit month
Always use to_char instead.
Run some of these queries so far and see the results.
Attempt questions 12-15.
We will start by learning how to filter records, we do this using a WHERE clause, which goes after the FROM clause.
SELECT
columns
FROM
tables
WHERE
condition
;
The condition in this clause is the same type as those we saw in the CASE statements.
Let's take a look at some examples:
SELECT
*
FROM
schema_name.flights
WHERE
arr_delay = 0
AND dep_delay = 0
;
Let's take a look at some examples:
Bonus tip: You can use an always true condition to make it easier to comment out individual checks when you're still developing your code
SELECT
*
FROM
schema_name.flights
WHERE
1 = 1
--AND arr_delay = 0
AND dep_delay = 0;
;
Let's take a look at some examples:
and of course we can use our favourite boolean logic!
SELECT
*
FROM
schema_name.flights
WHERE
1 = 1
-- flights scheduled to depart in the first 10 or last 5 hours of a day
AND (hour < 10 OR hour > 19)
AND dep_delay = 0 -- left on time
;
Let's take a look at some examples:
A reminder that we can use IS [NOT] NULL as a comparison. Useful when we want to check how many records have missing (or not) data
SELECT
*
FROM
schema_name.flights
WHERE
dep_delay IS NULL
;
Just like before, we can't use a renamed column in a WHERE clause.
-- Both of these will error
SELECT
TRUNC(time_hour) AS flight_day
FROM
schema_name.flights
WHERE
flight_day > DATE '2013-06-01';
SELECT
tailnum AS tailnumber
FROM
schema_name.flights
WHERE
tailnumber LIKE 'N%';
Just like before, we can't use a renamed column in a WHERE clause.
The reason for this is because the WHERE clause runs before the SELECT does (even though it's written after).
An upside of this is that you can filter on a column that you don't actually output.
SELECT
tailnum
FROM
schema_name.flights
WHERE
origin = 'JFK'; -- only flights out of JFK airport
Run some of these queries so far and see the results.
Attempt questions 16-19.
Next let's look at how we can order our data
Ordering data is one of the most inefficient things a computer can do and databases are not designed to be fast at it, try to avoid ordering unless you have to, only do it once, or consider doing it in some other tool after exporting the data
SELECT
columns
FROM
tables
WHERE
condition
ORDER BY
columns;
We can order by a single column
SELECT
*
FROM
schema_name.flights
WHERE
dep_delay > 0
ORDER BY
dep_delay;
Or multiple columns for when there is a tie
SELECT
*
FROM
schema_name.flights
WHERE
dep_delay > 0
ORDER BY
-- sort by departure delay THEN arrival delay if there is a tie.
dep_delay,
arr_delay;
By default records are ordered in ascending order
SELECT
*
FROM
schema_name.flights
WHERE
dep_delay > 0
ORDER BY
-- Least delayed departing, ties are sorted by least delayed when arriving.
dep_delay,
arr_delay;
By default records are ordered in ascending order
but we can change that
SELECT
*
FROM
schema_name.flights
WHERE
dep_delay > 0
ORDER BY
-- MOST delayed departing, ties are sorted by least delayed when arriving.
dep_delay DESC,
arr_delay;
By default records are ordered in ascending order
but we can change that
and we have to specify it for each column.
SELECT
*
FROM
schema_name.flights
WHERE
dep_delay > 0
ORDER BY
-- MOST delayed departing, ties are sorted by MOST delayed when arriving.
dep_delay DESC,
arr_delay DESC;
Finally, we can be a bit lazy and because the ORDER BY runs after the SELECT we can actually just use numbers instead of column names.
This is also why we can use column aliases in the ORDER BY and can't use columns we didn't SELECT
SELECT
flight,
dep_delay,
arr_delay
FROM
schema_name.flights
WHERE
dep_delay > 0
ORDER BY
2 desc,
3;
Generally this is a bad idea, it's hard to read and if you change the order in your SELECT it will break.
It's rare that you want duplicates in your data. Usually it means you've done a join (next section) wrong, or your source data has an issue, or you removed too many variables.
If you are comfortable as to why you have duplicates but want to remove them you can use a SELECT DISTINCT to just get unique records.
SELECT DISTINCT
origin,
dest
FROM
schema_name.flights;
DISTINCT acts on all selected columns, not just the first one.
Run some of these queries so far and see the results.
Attempt questions 20-21.
Sometimes we want to combine our records in one way or another, if they are from the same table there is often a clever way to do this using ORs and CASE statements, but not always. Other times we might want to:
/* UNION records without duplicates */
SELECT
*
FROM
schema_name.airlines
WHERE
carrier in ('9E', 'AA', 'AS')
UNION
SELECT
*
FROM
schema_name.airlines
WHERE
carrier in ('AS', 'B6', 'DL');
/* UNION ALL records with duplicates */
SELECT
*
FROM
schema_name.airlines
WHERE
carrier in ('9E', 'AA', 'AS')
UNION ALL
SELECT
*
FROM
schema_name.airlines
WHERE
carrier in ('AS', 'B6', 'DL');
/* Find INTERESECTion of records */
SELECT
*
FROM
schema_name.airlines
WHERE
carrier in ('9E', 'AA', 'AS')
INTERSECT
SELECT
*
FROM
schema_name.airlines
WHERE
carrier in ('AS', 'B6', 'DL');
/* MINUS any matching records from the second table from the first */
SELECT
*
FROM
schema_name.airlines
WHERE
carrier in ('9E', 'AA', 'AS')
MINUS
SELECT
*
FROM
schema_name.airlines
WHERE
carrier in ('AS', 'B6', 'DL');
Though it is very rare, sometimes you might only want to return a specific number of rows.
For this we use the keyword FETCH and it's associated terms.
FETCH is the very last thing a database will run as it has to complete ordering first, so it would be any quicker to just return 5 rows instead of 500
SELECT
columns
FROM
tables
WHERE
condition
ORDER BY
columns
[OFFSET number ROWS]
FETCH NEXT number [PERCENT] ROWS ONLY/WITH TIES;
Though it is very rare, sometimes you might only want to return a specific number of rows.
For this we use the keyword FETCH and it's associated terms.
SELECT
*
FROM
schema_name.flights
ORDER BY
distance desc
OFFSET 5 ROWS
FETCH NEXT 5 ROWS WITH TIES;
This will return more than 5 records because of the option with ties.
Though it is very rare, sometimes you might only want to return a specific number of rows.
For this we use the keyword FETCH and it's associated terms.
SELECT
*
FROM
schema_name.flights
ORDER BY
distance desc
OFFSET 5 ROWS
FETCH NEXT 5 ROWS ONLY;
Run some of these queries so far and see the results.
Attempt questions 22-24
Sometimes we want to test our query or a function and don't want to query a whole table to do it.
To do this we can use a special table called dual which has a single row and a single column, with a value of 'X'.
SELECT
*
FROM
dual;
One good use for this is often to test out some date or string functions on dummy data to make sure you get it right
SELECT
--Some random date manipulation as a test
TRUNC(SYSDATE) - 5 + 1/24 + 14/24/60,
-- should return Friday
NEXT_DAY(TRUNC(SYSDATE), 'SAT') - 1 AS week_end_date
FROM
dual;
We've seen multiple times that we can't use a column we create/alias in a query within the same query (except in ordering). This was because of the order the query actualy runs in vs how we see it written.
But we know that a query returns a table, so we could use that table as the input to another query!
SELECT
columns
FROM
table
WHERE
condition;
We've seen multiple times that we can't use a column we create/alias in a query within the same query (except in ordering). This was because of the order the query actualy runs in vs how we see it written.
But we know that a query returns a table, so we could use that table as the input to another query!
SELECT
columns
FROM
(SELECT
columns
FROM
tables
WHERE
condition
)
WHERE
condition;
This is known as a nested query because the inner query is nested entirely within the outer one.
We can do this as many times as we like, but the database does have to complete an inner query before it can run the outer one, so filter as early (more nested) as possible, and order as late as possible.
SELECT
columns
FROM
(SELECT
columns
FROM
tables
WHERE
condition
)
WHERE
condition;
We need a way to refer to the attributes in the output of this inner query, we do that by aliasing the inner query - just like we would a table.
SELECT
b.*
FROM
(
SELECT
a.*,
distance * 1.60934 AS distance_km
FROM
schema_name.flights a
) b
WHERE
distance_km > 1000;
People often use a/b/c etc. for aliases - for short code this is fine but for longer code try to use a short but descriptive alias for each table instead.
Nested queries are great when the inner query is short and you only need to use it once. But what about if you want to use that inner query in multiple places, or the nesting gets very deep and complex?
The solution is Common Table Expressions (CTEs) which allow you to alias a whole query and refer to it at any later point in your code without having to write it out in full every time. These use the WITH keyboard
WITH alias AS (QUERY),
alias2 AS (QUERY),
alias3 AS (QUERY)
SELECT
columns
FROM
tables;
WITH km_table as
(
SELECT
a.*,
distance * 1.60934 AS distance_km
FROM
schema_name.flights a
)
SELECT
*
FROM
km_table
WHERE
distance_km > 1000
UNION
SELECT
*
FROM
km_table
WHERE
distance_km < 1000;
Run some of these queries so far and see the results.
Attempt questions 25-28
Finally we can keep a table for a longer period by storing it permanently in the database. To do this we use the CREATE TABLE statement.
Remember that a statement is different to a a query - a query reads the database and returns a table, a statement can interact with the database in other ways.
-- Nologging optionally turns off logs for the table creation
CREATE TABLE your_table_name [NOLOGGING] AS
query;
It is possible to use CREATE TABLE to create tables without a query by just specifying column names and types, but that is beyond the scope of this training.
Once created you can query the table like any other. Because you created it in your schema, you don't have to specify the schema name.
Remember only you can see this table, to share it with others you have to grant them access.
CREATE TABLE flights_km NOLOGGING AS
SELECT
a.*,
distance * 1.60934 AS distance_km
FROM
schema_name.flights a;
SELECT
*
FROM
flights_km;
Once we're done we want to DROP the table to delete it from the database.
DROP TABLE your_table_name [PURGE];
PURGE removes it from your recycling bin, you can leave it out but make sure to empty the bin in the case!
DROP TABLE flights_km PURGE;
SELECT
*
FROM
flights_km; -- the table no longer exist.
There are plenty of other statements in SQL that interact with tables, you can use TRUNCATE to empty a table, INSERT to add new records, DELETE to remove specific records, and UPDATE to alter existing records.
However, these are beyond the scope of this training.
Run some of these queries so far and see the results.
Attempt questions 29
We now come to one of the most powerful parts of using a relational databases, joining tables.
Joins take place in the FROM part of the query and have the following form:
SELECT
columns
FROM
table1
join_type JOIN table2
ON condition;
There are 4 types of joins you can do:
Join type | Description |
---|---|
INNER JOIN | Only return records where they exist in both tables |
LEFT/RIGHT JOIN | Return all records in the left/right table and any match in the right/left one if it exists |
FULL OUTER JOIN | Return all records in both tables and any match if it exists |
CROSS JOIN | Return all combinations of rows across both tables (the Cartesian product of the rows). Not often useful. |
There are 4 types of joins you can do:
When you specify a condition in your join that defines where a match is returned to the record or not, when you specify it in the WHERE this specifies whether the whole record is returned or not.
In the case of an inner join these are the same thing, but for other joins they are not.
First define your join logic, then define your filter criteria separately.
For the join examples we're going to use a slim version of the planes table so we can see how each join behaves; let's create that first.
CREATE TABLE planes_short NOLOGGING AS
SELECT
*
FROM
schema_name.planes
WHERE
tailnum = 'N108UW';
First let's look at an inner join, where only records that find a match are returned
SELECT
a.*,
b.*
FROM
schema_name.flights a
INNER JOIN planes_short b
/* In this case the columns had the same names,
but that won't always (in fact will rarely) be the case */
ON a.tailnum = b.tailnum;
Next the left join, where all records from the first table are returned, plus any matching data from the second
SELECT
a.*,
b.*
FROM
schema_name.flights a
LEFT JOIN planes_short b
ON a.tailnum = b.tailnum;
And a right join, just the other way round to the left.
SELECT
a.*,
b.*
FROM
schema_name.flights a
RIGHT JOIN planes_short b
ON a.tailnum = b.tailnum;
Despire the planes_short table having 1 row per plane, because of the join we now have multiple - we'll talk about this in a bit!
And a full outer join - we'll use a dummy table to create a fake plane to show that it returns all records from both tables regardless of if they match.
SELECT
a.*,
b.*
FROM
schema_name.flights a
FULL OUTER JOIN (SELECT 'FAKETAIL' AS tailnum_2 FROM dual) b
ON a.tailnum = b.tailnum_2
-- Defaults to nulls first for descending.
ORDER BY a.flight DESC;
Lastly the cross join - the least used of them all. We'll use 2 tables with 3 rows each to show we get all matches back in 9 rows (3x3).
SELECT
*
FROM
(SELECT '1' AS num FROM DUAL
UNION
SELECT '2' AS num FROM DUAL
UNION
SELECT '3' AS num FROM DUAL) A
CROSS JOIN
(SELECT 'A' AS let FROM DUAL
UNION
SELECT 'B' AS let FROM DUAL
UNION
SELECT 'C' AS let FROM DUAL) B;
Joins are the place you are most likely to make a mistake that leads to duplicates in your data and going from millions of records to billions.
The reason is if your join is not unique on at least one of the tables (i.e. the condition you give can be used to define each record and distinguish it from others) then each record in both tables will join to multiple in the other, getting out of hand very fast.
Always ask yourself; does my join link 1 row to 1/many rows in my other table? If the answer is no, you need to rethink your logic.
Run some of these queries so far and see the results.
Attempt questions 30-32
Often we want to get aggregate information of our data and group it by values in another column - this is similar to what we do in PivotTables in Excel.
To do this we use the GROUP BY clause and summary functions. This goes after WHERE but before ORDER BY.
SELECT
group_columns,
summary_columns
FROM
tables
WHERE
condition
GROUP BY
group_columns
ORDER BY
columns;
Any non-summarised variable must be listed in your GROUP BY otherwise it will error. You can group by something you don't select but it's a bad idea.
There are plenty of aggergation functions and the most common are listed below:
Agg type | Description |
---|---|
COUNT(*) | The total number of records per group |
COUNT(column) | The total number of records that are non-null in that variable |
SUM(column) | The sum of that variable, must be numeric |
AVG(column) | The mean of that variable, must be numeric |
MAX(column) | The maximum of that variable |
MIN(column) | The minimum of that variable |
Let's look at information about flights out of each airport
SELECT
origin,
COUNT(*) AS num_flights,
COUNT(dep_time) as num_departed_flights, -- doesn't count records with a null departure time
SUM(air_time) AS total_airtime,
AVG(air_time) AS mean_airtime,
MAX(dep_delay) AS latest_dep,
MIN(dep_delay) AS earliest_dep
FROM
schema_name.flights
GROUP BY
origin;
Let's look at information about flights out of each airport
AND by each destination airport.
SELECT
origin,
dest,
COUNT(*) AS num_flights,
COUNT(dep_time) as num_departed_flights,
SUM(air_time) AS total_airtime,
AVG(air_time) AS mean_airtime,
MAX(dep_delay) AS latest_dep,
MIN(dep_delay) AS earliest_dep
FROM
schema_name.flights
GROUP BY
origin,
dest;
We don't have to group by everything if we want details about the whole dataset.
SELECT
COUNT(*) AS num_flights,
MAX(dep_delay) AS latest_dep,
MIN(dep_delay) AS earliest_dep
FROM
schema_name.flights;
These aggergations becomes even more powerful when combined with a CASE statement.
SELECT
origin,
SUM( -- SUM uses 1 and 0 as the values
CASE
WHEN dep_delay = 0 AND arr_delay = 0
THEN 1
ELSE 0
END
) AS perfect_flights,
COUNT( -- COUNT uses anything and null as the values
CASE
WHEN dep_delay = 0 AND arr_delay = 0
THEN 1
ELSE null
END
) AS perfect_flights2
FROM
schema_name.flights
GROUP BY
origin;
You can filter on the results on an aggretation by using the HAVING keyword (as a WHERE filters records before they are aggregated).
SELECT
tailnum,
COUNT(*)
FROM
schema_name.flights
GROUP BY
tailnum
HAVING
-- only planes that had at least 50 flights in 2013
COUNT(*) > 50;
Run some of these queries so far and see the results.
Attempt questions 33-34
There are times we want to add aggregate information to our data but keep it at the record level. We could group and then join back on, but this is inefficient.
Instead we can use something called WINDOW FUNCTIONS (also known as ANALYTICAL FUNCTIONS).
Pretty much all of them have the same format:
WINDOW_FUNCTION(column) OVER (PARTITION BY columns ORDER BY columns)
Not all of the functions have an argument to the main function, or an ORDER BY clause, but most do.
The PARTITION BY defines your group for that record, and the ORDER BY defines the order within that group.
Most GROUP BY functions have an equivalent window function, but there are a few extras as well:
Window Function | Description |
---|---|
ROW_NUMBER() | Numbers records in group from 1 to N (number in group) based on order by |
RANK(column) | Rank value within that group, same value for ties (with gaps) |
DENSE_RANK(column) | Rank value within that group, same value for ties (with no gaps) |
FIRST/LAST_VALUE(column) | The first/last value within that group. |
LEAD/LAG(column) | The next/previous value within that group. |
Let's look at an example.
SELECT
a.*,
ROW_NUMBER() OVER (
PARTITION BY year, month, day, carrier
ORDER BY sched_dep_time
) AS day_carrier_flight,
COUNT(*) OVER (
PARTITION BY year, month, day, carrier
) AS day_carrier_flight_total,
AVG(dep_delay) OVER (
PARTITION BY year, month, day, carrier
) AS day_carrier_avg_dep_delay
FROM
schema_name.flights a;
Run some of these queries so far and see the results.
Attempt questions 35-36
We have seen that often the order we write our code in is not the same order that the database runs it in. Let's quickly review the order of writing a full query, then compare it with how the database will (usually) execute it.
A crime has taken place and the detective needs your help. The detective gave you the crime scene report, but you somehow lost it. You vaguely remember that the crime was a murder that occurred sometime on Jan.15, 2018 and that it took place in SQL City. Start by retrieving the corresponding crime scene report from the police department’s database.
All the clues to this mystery are buried in a huge database, and you need to use SQL to navigate through this vast network of information. Your first step to solving the mystery is to retrieve the corresponding crime scene report from the police department’s database.