Section 30.2 Grouping
In the last chapter, we learned how to aggregate data across all of the records retrieved by a query - to find the last (MAX) date of a ride taken, or the total number (COUNT) of rides taken at a particular station. As a refresher, here are the common aggregations:
COUNT
counts the rows. We can also do COUNT(DISTINCT column_name) to get the number of different values that appear in a column.
SUM
adds the values of a numeric column
MIN
calculates the minimum of a numeric column
MAX
calculates the maximum of a numeric column
AVG
calculates the mean of a numeric column
MEDIAN
calculates the median (middle value) of a numeric column
MODE
calculates the mode (most common value) of a numeric column
Sometimes while aggregating data, you want to aggregate records in groups. Maybe you want to find the longest trip that was taken from each station. Or maybe you want to find the number of trips taken on each bike.
The GROUP BY
statement allows you to group data as you aggregate it. The following query counts the number of trips (COUNT(*)
) for each member type (GROUP BY member_type
).
The result of this query has one row for each distinct member_type
, and the column named n_trips
counts the number of rows for each type. Recall that AS _______
gives a name to a calculated value - any time we do an aggregation, it will be helpful to name the result so we know what we are seeing in the output. By giving the aggregated value a name, we can use it in the ORDER BY
to sort by the total number of trips we calculated.
Once you start grouping records, it only makes sense to talk about aggregated data. You can’t group the records by member_type
and try to select the start_date
. There is no one start date for all of those records. You would instead have to ask about the MIN(start_station)
or MODE(start_station)
or COUNT(DISTINCT start_station)
.
The query below calculates the minimum and maximum trip duration for each start station. Because we are grouping on start_station
, we SELECT
start_station
. All the other columns we are selecting data from (duration
which is selected from twice) have an aggregation function (MAX
or MIN
in this case).
You can still use other columns that you are not selecting to filter the data that gets used to calculate the aggregated value. You just need to make sure to put the WHERE
before the GROUP BY
. For example, you can calculate the number of trips started from each station for just the bike ’W00174’ with this query:
Checkpoint 30.2.2.
The following code will calculate the mean trip duration for trips by each member type and sort the results so that the member type with the longest average trip is displayed first.
Put the instructions in the right order and indentation. You will use all of the blocks.
SELECT
---
member_type,
AVG(duration) AS average_time
---
FROM
---
trip_data
---
GROUP BY
---
member_type
---
ORDER BY
---
average_time DESC
Checkpoint 30.2.3.
For each bike, display the bike number and the total trip count. Sort the results so that the most used bikes come first.
Hints:
You need to group the data by bikes.
You need to find the count of trips made.
You need to order the results.
You have attempted
of
activities on this page.