Table Transformation

While working with databases, we often need to transform data from one format to achieve a desired result. In SQL, this is often called data transformation or table transformation.

We’ll begin learning about table transformation through subqueries. Subqueries, sometimes referred to as inner queries or nested queries, are used to transform table data by nesting one query within another query.

Non-Correlated Subqueries I

a subquery is nested within another query to give us our desired result.

We first create an inner query, or subquery, that finds the airports with elevation greater than 2000 from the airports table:

SELECT code 
  FROM airports 
  WHERE elevation > 2000;

Next, we take the result set of the inner query and use it to filter on the flights table, to find the flight detail that meets the elevation criteria.

SELECT * 
FROM flights 
WHERE origin in (
    SELECT code 
    FROM airports 
    WHERE elevation > 2000);

SELECT * FROM flights
WHERE origin in (
  SELECT code FROM airports
  WHERE elevation < 2000
);

Non-Correlated Subqueries II

A non-correlated subquery is a subquery that can be run independently of the outer query and as we saw, can be used to complete a multi-step transformation.

SELECT * FROM flights
WHERE origin in (
  SELECT code 
  FROM airports 
  WHERE faa_region = 'ASO'
);

Non-Correlated Subqueries III

we can also perform transformations on a single table. For instance, sometimes we need to aggregate in multiple steps - like taking an average of a count.

Imagine you’d like to know how many flights there are on average, for all Fridays in a given month from the flights table. First, we’d need to calculate the number of flights per day, and then we’d need to calculate the average based on the daily flight count for each day of the week. We can do this all in one step using a subquery:

SELECT a.dep_month,
       a.dep_day_of_week,
       AVG(a.flight_count) AS average_flights
  FROM (
        SELECT dep_month,
              dep_day_of_week,
               dep_date,
               COUNT(*) AS flight_count
          FROM flights
         GROUP BY 1,2,3
       ) a
 GROUP BY 1,2
 ORDER BY 1,2;

Using a subquery, find the average total distance flown by day of week and month.

Be sure to alias the outer query as average_distance and the inner query as flight_distance.

SELECT a.dep_month, a.dep_day_of_week, AVG(a.flight_distance) AS average_flights
FROM (
  SELECT dep_month, dep_day_of_week, dep_date,
  SUM(distance) AS  flight_distance
  FROM flights
  GROUP BY 1, 2, 3
) a
GROUP BY 1, 2
ORDER BY 1, 2

Correlated Subqueries I

The order of operations is important in a correlated subquery:

A row is processed in the outer query. Then, for that particular row in the outer query, the subquery is executed.

This means that for each row processed by the outer query, the subquery will also be processed for that row. In this example, we will find the list of all flights whose distance is above average for their carrier.

SELECT id
FROM flights AS f
WHERE distance > (
 SELECT AVG(distance)
 FROM flights
 WHERE carrier = f.carrier);

In the above query the inner query has to be re-executed for each flight. Correlated subqueries may appear elsewhere besides the WHERE clause, they can also appear in the SELECT.

Find the id of the flights whose distance is below average for their carrier.

SELECT id 
FROM flights AS f
WHERE distance < (
  SELECT AVG(distance) 
  FROM flights
  WHERE carrier = f.carrier
)

Correlated Subqueries II

It would also be interesting to order flights by giving them a sequence number based on time, by carrier.

For instance, assuming flight_id increments with each additional flight, we could use the following query to view flights by carrier, flight id, and sequence number:

Using the same pattern, write a query to view flights by origin, flight id, and sequence number. Alias the sequence number column as flight_sequence_number.

SELECT origin, id, (
  SELECT COUNT(*) FROM flights f
  WHERE f.id < flights.id
  AND f.origin = flights.origin
) + 1 AS flight_sequence_number

FROM flights;

sum up (1)

Subqueries are used to complete an SQL transformation by nesting one query within another query.
A non-correlated subquery is a subquery that can be run independently of the outer query and can be used to complete a multi-step transformation.
A correlated subquery is a subquery that cannot be run independently of the outer query. The order of operations in a correlated subquery is as follows:

A row is processed in the outer query.
Then, for that particular row in the outer query, the subquery is executed.

Set Operations

Sometimes, in order to answer certain questions based on data, we need to merge two tables together and then query the merged result.

Merge the rows, called a join.
Merge the columns, called a union.

SELECT column_name(s) FROM table1
UNION
SELECT column_name(s) FROM table2;

Each SELECT statement within the UNION must have the same number of columns with similar data types. The columns in each SELECT statement must be in the same order. By default, the UNION operator selects only distinct values.

ex

Select a complete list of brand names from the legacy_products and new_products tables.

SELECT brand FROM legacy_products 
UNION 
SELECT brand FROM new_products;

Union All

What if we wanted to allow duplicate values? We can do this by using the ALL keyword with UNION, with the following syntax:

SELECT column_name(s) FROM table1
UNION ALL
SELECT column_name(s) FROM table2;

Using the same pattern, utilize a subquery to find the average sale price over both order_items and order_items_historic tables.

SELECT id, AVG(a.sale_price) FROM (
SELECT id, sale_price FROM order_items
UNION ALL
SELECT id, sale_price FROM order_items_historic
) AS a
GROUP BY 1

Intersect

INTERSECT is used to combine two SELECT statements, but returns rows only from the first SELECT statement that are identical to a row in the second SELECT statement. This means that it returns only common rows returned by the two SELECT statements.

SELECT column_name(s) FROM table1

INTERSECT

SELECT column_name(s) FROM table2;

SELECT category FROM new_products
INTERSECT
SELECT category FROM legacy_products;

Except

returns distinct rows from the first SELECT statement that aren’t output by the second SELECT statement.

SELECT column_name(s) FROM table1

EXCEPT

SELECT column_name(s) FROM table2;

ex

select the items in the category column that are in the legacy_products table and not in the new_products table.

SELECT category FROM legacy_products
EXCEPT
SELECT category FROM new_products;

sum up (2)

The UNION clause allows us to utilize information from multiple tables in our queries.
The UNION ALL clause allows us to utilize information from multiple tables in our queries, including duplicate values.
INTERSECT is used to combine two SELECT statements, but returns rows only from the first SELECT statement that are identical to a row in the second SELECT statement.
EXCEPT returns distinct rows from the first SELECT statement that aren’t output by the second SELECT statement

https://adventofcode.com/2021/leaderboard/private 1442589-943acb67

CONDITIONAL AGGREGATES

CASE WHEN

SELECT
    CASE
        WHEN elevation < 500 THEN 'Low'
        WHEN elevation BETWEEN 500 AND 1999 THEN 'Medium'
        WHEN elevation >= 2000 THEN 'High'
        ELSE 'Unknown'
    END AS elevation_tier
    , COUNT(*)
FROM airports
GROUP BY 1;

In the above statement, END is required to terminate the statement, but ELSE is optional. If ELSE is not included, the result will be NULL. Also notice the shorthand method of referencing columns to use in GROUP BY, so we don’t have to rewrite the entire Case Statement.

COUNT(CASE WHEN )

Sometimes you want to look at an entire result set, but want to implement conditions on certain aggregates.

For instance, maybe you want to identify the total amount of airports as well as the total amount of airports with high elevation in the same result set. We can accomplish this by putting a CASE WHEN statement in the aggregate.

SELECT  state, 
    COUNT(CASE WHEN elevation >= 2000 THEN 1 ELSE NULL END) as count_high_elevation_aiports 
FROM airports 
GROUP BY state;

ex

Using the same pattern, write a query to count the number of low elevation airports by state where low elevation is defined as less than 1000 ft.

Be sure to alias the counted airports as count_low_elevation_airports.

SELECT state, 
       COUNT (CASE WHEN elevation < 1000 THEN 1 ELSE NULL END) AS count_low_elevation_airports
       FROM airports
       GROUP BY 1

SUM(CASE WHEN )

We can do that same thing for other aggregates like SUM(). For instance, if we wanted to sum the total flight distance and compare that to the sum of flight distance from a particular airline (in this case, United Airlines) by origin airport, we could run the following query:

SELECT origin, sum(distance) as total_flight_distance, sum(CASE WHEN carrier = 'UA' THEN distance ELSE 0 END) as total_united_flight_distance 
FROM flights 
GROUP BY origin;

Combining aggregates

Oftentimes we’d like to combine aggregates, to create percentages or ratios.

In the instance of the last query, we might want to find out the percent of flight distance that is from United by origin airport. We can do this simply by using the mathematical operators we need in SQL:

SELECT origin, 
    (100.0*(sum(CASE WHEN carrier = 'UN' THEN distance ELSE 0 END))/sum(distance)) as percentage_flight_distance_from_united FROM flights 
GROUP BY origin;

SELECT origin, 
  (100.0*(sum(CASE WHEN carrier = 'DL' THEN distance ELSE 0 END)) / SUM(distance)) AS 
  percentage_flight_distance_from_delta
FROM flights 
GROUP BY origin;

Combining aggregates II

Find the percentage of high elevation airports (elevation >= 2000) by state from the airports table.

In the query, alias the percentage column as percentage_high_elevation_airports.

SELECT state, 100.0 * sum(CASE WHEN elevation >= 2000 THEN 1 ELSE 0 END) / count(*)  as percentage_high_elevation_airports FROM airports GROUP BY state;

sum up (3)

Conditional Aggregates are aggregate functions the compute a result set based on a given set of conditions.

NULL can be used to denote an empty field value
CASE statements allow for custom classification of data
CASE statements can be used inside aggregates (like SUM() and COUNT()) to provide filtered measures

Date, Time and String Functions

Dates

Dates are often written in the following format

Date: YYYY-MM-DD
Datetime or Timestamp: YYYY-MM-DD hh:mm:ss

We can use SQL’s date functions to transform data into a desired format. Since date functions can be database specific, verify the functions that exist on your relational database management system.

// Would return the date and time for the manufacture_time column.
SELECT DATETIME(manufacture_time)
FROM baked_goods;

Dates II

let’s assume that we have a column in our baked_goods table named manufacture_time in the format YYYY-MM-DD hh:mm:ss.

We’d like to know the number of baked_goods manufactured by day, and not by second. We can use the DATE() function to easily convert timestamps to dates and complete the following query:

SELECT DATE(manufacture_time), count(*) as count_baked_goods
FROM baked_goods
GROUP BY DATE(manufacture_time);

Similarly, we can query the time with

SELECT TIME(manufacture_time), count(*) as count_baked_goods
FROM baked_goods
GROUP BY TIME(manufacture_time);

Find the number of baked goods by date of delivery.

Be sure to alias the total count of baked goods as count_baked_goods

SELECT DATE(delivery_time), COUNT(*) AS count_baked_goods
FROM baked_goods
GROUP BY DATE(delivery_time);

Dates III

Given a datepart and a column of date or timestamp data type, we can increment date or timestamp values by a specified interval.

Imagine that each dessert in our baked_goods table is inspected 2 hours, 30 minutes, and 1 day after the manufacture time. To derive the inspection date for each baked good, we can use the following query

SELECT DATETIME(manufacture_time, '+2 hours', '30 minutes', '1 day') as inspection_time
FROM baked_goods;

Each of the baked goods is packaged by Baker’s Market exactly five hours, twenty minutes, and two days after the delivery (designated by delivery_time). Create a query returning all the packaging times for the goods in the baked_goods table

Be sure to alias the package time column as package_time.

SELECT DATETIME(delivery_time, '+5 hours', '20 minutes', '2 day') as package_time
FROM baked_goods;

Numbers

SELECT (number1 + number2);: Returns the sum of two numbers. Similar, SQL can be used for subtraction, multiplication, and division.
SELECT CAST(number1 AS REAL) / number3;: Returns the result as a real number by casting one of the values as a real number, rather than an integer.
SELECT ROUND(number, precision);: Returns the numeric value rounded off to the next value specified.

ex

Find the bakery’s distance from the market rounded to two decimal places.

Be sure to alias the column as distance_from_market.

select round(distance, 2) as distance_from_market
from bakeries;

Numbers II

MAX(n1,n2,n3,...): returns the greatest value in the set of the input numeric expressions
MIN(n1,n2,n3,...): returns the least value in the set of the input numeric expressions

In our baked_goods table, in addition to the numeric ingredients_cost we have information about the packaging cost located in the packaging_cost column. We can use the MAX function to determine the overall greatest value of cost for each item using the following query:

SELECT id, MAX(ingredients_cost, packaging_cost)
FROM baked_goods;

SELECT id, MAX(cook_time, cool_down_time)
FROM baked_goods;

SELECT id, MIN(cook_time, cool_down_time)
FROM baked_goods;

Strings

A common use case for string manipulation in SQL is concatenation of strings. In SQLite, this is written as

SELECT string1 || ' ' || string2;

For example, the bakeries table contains both city and state columns. In order to create a route for these columns, we use the || function to concatenate them as in the following query:

SELECT city || ' ' || state as location
FROM bakeries;

String functions are again, very database specific, and it is best practice to consult documentation before proceeding.

ex

Combine the first_name and last_name columns from the bakeries table as the full_name to identify the owners of the bakeries.

Be sure to add a space between the names in the full_name as shown in the example.

SELECT first_name || ' ' || last_name AS full_name
FROM bakeries

Strings II

Another useful string function in SQL is REPLACE():

REPLACE(string,from_string,to_string)

The function returns the string string with all occurrences of the string from_string replaced by the string to_string.

ex

Any time enriched_flour appears in the ingredients list, we’d like to replace it with just flour.

Apply this transformation and be sure to rename the column item_ingredients.

SELECT REPLACE(ingredients, 'enriched_flour', 'flour')
FROM baked_goods

sum up (4)

Date Functions:

DATETIME; Returns the date and time of the column specified. This can be modified to return only the date or only the time.
DATETIME(time1, +X hours, Y minutes, Z days): Increments the specificed column by a given number of hours, minutes, or days.

Numeric Functions:

(number1 + number2);: Returns the sum of two numbers, or other mathematical operations, accordingly.
CAST(number1 AS REAL) / number2;: Returns the result as a real number by casting one of numeric inputs as a real number
ROUND(number, precision);: Returns the numeric value rounded off to the next value specified.

String Functions:

'string1' || ' ' || 'string2';: Concatenates string1 and string 2, with a space between.
REPLACE(string,from_string,to_string): Returns the string with all occurrences of the string from_string replaced by the string to_string.

資料來源 : How to Transform Tables with SQLpro-logo (codecademy)