...

Let’s write an SQL query that selects only the title column from the articles table.

Code Block

language	sql

SELECT title
FROM articles;

...

We have capitalized the words SELECT and FROM because they are SQL keywords. SQL is case-insensitive, but it helps for readability, and is good style.

If we want more information, we can add a new column to the list of fields, right after SELECT:

Code Block

language	sql

SELECT title, authors, issns, date
FROM articles;

Or we can select all of the columns in a table using the wildcard ‘*’

Code Block

language	sql

SELECT *
FROM articles;

Unique values

If we want only the unique values so that we can quickly see the ISSNs of journals included in the collection, we use DISTINCT

Code Block

language	sql

SELECT DISTINCT issns
FROM articles;

If we select more than one column, then the distinct pairs of values are returned

Code Block

language	sql

SELECT DISTINCT issns, day, month, year
FROM articles;

...

Calculated values

We can also do calculations with the values in a query. For example, if we wanted to look at the relative popularity of an article, so we divide by 10 (because we know the most popular article has 10 citations).

Code Block

language	sql

SELECT first_author, citation_count/10.0
FROM articles;

When we run the query, the expression citation_count / 10.0 is evaluated for each row and appended to that row, in a new column. Expressions can use any fields, any arithmetic operators (+, -, *, and /) and a variety of built-in functions. For example, we could round the values to make them easier to read.

...

Code Block

language	sql

SELECT first_author, title, ROUND(author_count/16.0, 2)
FROM articles;

...

Exercise

Write a query that returns the title, first_author, citation_count, author_count, month and year

Filtering

Databases can also filter data – selecting only the data meeting certain criteria. For example, let’s say we only want data for a specific ISSN for the Theory and Applications of Mathematics & Computer Science journal, which has a ISSN code 2067-2764|2247-6202. We need to add a WHERE clause to our query:

Code Block

language	sql

SELECT *
FROM articles
WHERE issns='2067-2764|2247-6202';

We can use more sophisticated conditions by combining tests with AND and OR. For example, suppose we want the data on Theory and Applications of Mathematics & Computer Science published after June:

Code Block

language	sql

SELECT *
FROM articles
WHERE (issns='2067-2764|2247-6202') AND (month > 06);

Note that the parentheses are not needed, but again, they help with readability. They also ensure that the computer combines AND and OR in the way that we intend.

If we wanted to get data for the Humanities and Religions journals, which have ISSNs codes 2076-0787 and 2077-1444, we could combine the tests using OR:

Code Block

language	sql

SELECT *

SELECT *


FROM articles
WHERE (issns = '2076-0787') OR (issns = '2077-1444');

...

Exercise

Write a query that returns the title, first_author, issns, month and year for all single author papers with more than 4 citations

Building more complex queries

Now, lets combine the above queries to get data for the 3 journals from June on. This time, let’s use IN as one way to make the query easier to understand. It is equivalent to saying WHERE (issns = '2076-0787') OR (issns = '2077-1444') OR (issns = '2067-2764|2247-6202'), but reads more neatly:

Code Block

language	sql

SELECT *
FROM articles
WHERE (month > 06) AND (issns IN ('2076-0787', '2077-1444', '2067-2764|2247-6202'));

We started with something simple, then added more clauses one by one, testing their effects as we went along. For complex queries, this is a good strategy, to make sure you are getting what you want. Sometimes it might help to take a subset of the data that you can easily see in a temporary database to practice your queries on before working on a larger or more complicated database.

When the queries become more complex, it can be useful to add comments. In SQL, comments are started by --, and end at the end of the line. For example, a commented version of the above query can be written as:

Code Block

language	sql

-- Get post June data on selected journals
-- These are in the articles table, and we are interested in all columns
SELECT * FROM articles
-- Sampling month is in the column `month`, and we want to include
-- everything after June
WHERE (month > 06)
-- selected journals have the `issns` 2076-0787, 2077-1444, 2067-2764|2247-6202
AND (issns IN ('2076-0787', '2077-1444', '2067-2764|2247-6202'));

...

Although SQL queries often read like plain English, it is always useful to add comments; this is especially true of more complex queries.

...

We can also sort the results of our queries by using ORDER BY. For simplicity, let’s go back to the articles table and alphabetize it by issns.

Code Block

language	sql

SELECT *
FROM articles
ORDER BY issns ASC;

The keyword ASC tells us to order it in Ascending order. We could alternately use DESC to get descending order.

Code Block

language	sql

SELECT *
FROM articles
ORDER BY first_author DESC;

ASC is the default.

We can also sort on several fields at once. To truly be alphabetical, we might want to order by genus then species.

Code Block

language	sql

SELECT *
FROM articles
ORDER BY issns DESC, first_author ASC;

...

Challenge
Write a query that returns title, first_author, issns and citation_count from the articles table, sorted with the most cited article at the top and alphabetically

Order of execution

Another Another note for ordering. We don’t actually have to display a column to sort by it. For example, let’s say we want to order the articles by their ISSN, but we only want to see Authors and Titles.

Code Block

language	sql

SELECT authors, title
FROM articles
WHERE issns = '2067-2764|2247-6202'
ORDER BY date ASC, first_author ASC;

We can do this because sorting occurs earlier in the computational pipeline than field selection.

...

Clauses are written in a fixed order: SELECT, FROM, WHERE, then ORDER BY. It is possible to write a query as a single line, but for readability, we recommend to put each clause on its own line.

...

Aggregation

...

next episode

Overview
Teaching: 30 min
Exercises: 30 min
Questions
How do you aggregate records in SQL?
Objectives
to understand how to aggregate records in SQL

`COUNT` and `GROUP BY`

Aggregation allows us to combine results by grouping records based on value and calculating combined values in groups.

Let’s go to the articles table and find out how many entries there are. Using the wildcard simply counts the number of records (rows)

SELECT COUNT(*)
FROM articles;

We can also find out how many authors have participated in these articles.

SELECT COUNT(*), SUM(author_count)
FROM articles;

There are many other aggregate functions included in SQL including MAX, MIN, and AVG.

Challenge

Now, let’s see how many articles were published in each journal. We do this using a GROUP BY clause

SELECT issns, COUNT( * )
FROM articles
GROUP BY issns;

GROUP BY tells SQL what field or fields we want to use to aggregate the data. If we want to group by multiple fields, we give GROUP BY a comma separated list.

Challenge

The `HAVING` keyword

In the previous lesson, we have seen the keywords WHERE, allowing to filter the results according to some criteria. SQL offers a mechanism to filter the results based on aggregate functions, through the HAVING keyword.

For example, we can adapt the last request we wrote to only return information about articles with a 10 or more published articles:

SELECT issns, COUNT( * )
FROM articles
GROUP BY issns
HAVING COUNT( * ) >= 10;

The HAVING keyword works exactly like the WHERE keyword, but uses aggregate functions instead of database fields.

If you use AS in your query to rename a column, HAVING can use this information to make the query more readable. For example, in the above query, we can call the COUNT(*) by another name, like occurrences. This can be written this way:

SELECT issns, COUNT( * ) AS occurrences
FROM articles
GROUP BY issns
HAVING occurrences >= 10;

Note that in both queries, HAVING comes after GROUP BY. One way to think about this is: the data are retrieved (SELECT), can be filtered (WHERE), then joined in groups (GROUP BY); finally, we only select some of these groups (HAVING).

Challenge

Ordering aggregated results.

We can order the results of our aggregation by a specific column, including the aggregated column. Let’s count the number of articles published in each journal, ordered by the count

SELECT issns, COUNT( * )
FROM articles
GROUP BY issns
ORDER BY COUNT( * ) DESC;

Saving queries for future use

It is not uncommon to repeat the same operation more than once, for example for monitoring or reporting purposes. SQL comes with a very powerful mechanism to do this: views. Views are a form of query that is saved in the database, and can be used to look at, filter, and even update information. One way to think of views is as a table, that can read, aggregate, and filter information from several places before showing it to you.

Creating a view from a query requires to add CREATE VIEW viewname AS before the query itself. For example, if we want to save the query giving the number of journals in a view, we can write

CREATE VIEW journal_counts AS
SELECT issns, COUNT(*)
FROM articles
GROUP BY issns;

Now, we will be able to access these results with a much shorter notation:

SELECT *
FROM journal_counts;

Assuming we do not need this view anymore, we can remove it from the database almost as we would a table:

DROP VIEW journal_counts;

You can also add a view using Create View in the View menu and see the results in the Views tab just like a table

Challenge

Key Points
SQL is ideal for aggregating database records

Space shortcuts

Page tree

Versions Compared

Old Version 3

New Version 4

Key

Unique values

Calculated values

Exercise

Filtering

Exercise

Building more complex queries

Challenge

Order of execution

Order of execution

Aggregation

next episode

Overview

`COUNT` and `GROUP BY`

Challenge

Challenge

The `HAVING` keyword

Challenge

Ordering aggregated results.

Saving queries for future use

Challenge

Key Points

previous episode

Space shortcuts

Page tree

Page History

Versions Compared

Old Version 3

New Version 4

Key

Unique values

Calculated values

Exercise

Filtering

Exercise

Building more complex queries

Challenge

Order of execution

Order of execution

Aggregation

next episode

Overview

COUNT and GROUP BY

Challenge

Challenge

The HAVING keyword

Challenge

Ordering aggregated results.

Saving queries for future use

Challenge

Key Points

previous episode

`COUNT` and `GROUP BY`

The `HAVING` keyword