Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

 

Introduction to SQL

What is SQL?

SQL (Structured Query Language) is a powerful language used to interrogate and manipulate relational databases. It is highly specialised. It is not a general programming language that you can use to write an entire program. However, SQL queries can be embedded in other programming languages to let any program work with databases. There are several different kinds of SQL, but all support the same basic statements that we will be covering today.

...

Introduction to SQLite Manager

Let’s all open the database we downloaded in SQLite Manager by clicking on the open file icon.

You can see the tables in the database by looking at the left hand side of the screen under Tables.

To see the contents of a table, click on that table and then click on the Browse and search tab in the right hand section of the screen.

If we want to write a query, we click on the Execute SQL tab.

Dataset Description

The data we will be using is a catalogue of journal articles from 51 different journals published during 2015. Articles are published in different languages, by different publishers and under different licences.

Import

  1. Make sure you installed SQLite (download and run the appropriate sqlite-tools executable from https://sqlite.org/download.html)
  2. Open Firefox.
  3. Add the extension for SQLite Manager (see https://addons.mozilla.org/en-US/firefox/addon/sqlite-manager/)
  4. Now go to Firefox > Add-ons > SQLite Manager to start that up.

Import Some Data into a new SQLite Database

  1. Download the CSV files from this figshare link.
  2. In the interface, start
  3. Download the CSV files from Figshare
  4. Start a New Database Database -> New Database
  5. Start the import Database -> Import
  6. Select the file to import (articles.csv).
  7. Give the table a name that matches the file name (articles, journals, licences, languages publishers), or use the default
  8. If the first row has column headings, check the appropriate box
  9. Make sure the delimiter and quotation options are appropriate for the CSV files. Ensure ‘Ignore trailing Separator/Delimiter’ is left unchecked.
  10. Press OK
  11. When asked if you want to modify the table, click OK
  12. Set the data types for each field: choose TEXT for fields with text (e.g. TitleAuthorsDOI, etc.) and INT for fields with numbers (e.g. Citation_CountAuthor_CountDay, etc.)

...

Data typeAccessSQLServerOracleMySQLPostgreSQL
booleanYes/NoBitByteN/ABoolean
integerNumber (integer)IntNumberInt / IntegerInt / Integer
floatNumber (single)Float / RealNumberFloatNumeric
currencyCurrencyMoneyN/AN/AMoney
string (fixed)N/ACharCharCharChar
string (variable)Text (<256) / Memo (65k+)VarcharVarchar / Varchar2VarcharVarchar
binary object OLE Object Memo Binary (fixed up to 8K)Varbinary (<8K)Image (<2GB) LongRaw BlobText Binary

Varbinary

Key Points

  • SQL is a powerful language used to interrogate and manipulate relational databases

...

SQL Queries

Writing my first query

Let’s start by using the articles table. Here we have data on every article that has been published, including the title of the article, the authors, date of publication, etc.

Let’s write an SQL query that selects only the title column from the articles table.

SELECT title
FROM articles;

We have capitalized the words SELECT and FROM because they are SQL keywords. SQL is case-insensitive, but it helps for readability, and is good style.

If we want more information, we can add a new column to the list of fields, right after SELECT:

SELECT title, authors, issns, date
FROM articles;

Or we can select all of the columns in a table using the wildcard ‘*’

SELECT *
FROM articles;

Unique values

If we want only the unique values so that we can quickly see the ISSNs of journals included in the collection, we use DISTINCT

SELECT DISTINCT issns
FROM articles;

If we select more than one column, then the distinct pairs of values are returned

SELECT DISTINCT issns, day, month, year
FROM articles;

Calculated values

We can also do calculations with the values in a query. For example, if we wanted to look at the relative popularity of an article, so we divide by 10 (because we know the most popular article has 10 citations).

SELECT first_author, citation_count/10.0
FROM articles;

When we run the query, the expression citation_count / 10.0 is evaluated for each row and appended to that row, in a new column. Expressions can use any fields, any arithmetic operators (+-*, and /) and a variety of built-in functions. For example, we could round the values to make them easier to read.

SELECT first_author, title, ROUND(author_count/16.0, 2)
FROM articles;

Challenge 

Write a query that returns the title, first_author, citation_count, author_count, month and year

Filtering

Databases can also filter data – selecting only the data meeting certain criteria. For example, let’s say we only want data for a specific ISSN for the Theory and Applications of Mathematics & Computer Science journal, which has a ISSN code 2067-2764|2247-6202. We need to add a WHERE clause to our query:

SELECT *
FROM articles
WHERE issns='2067-2764|2247-6202';

We can use more sophisticated conditions by combining tests with AND and OR. For example, suppose we want the data on Theory and Applications of Mathematics & Computer Science published after June:

SELECT *
FROM articles
WHERE (issns='2067-2764|2247-6202') AND (month > 06);

Note that the parentheses are not needed, but again, they help with readability. They also ensure that the computer combines AND and OR in the way that we intend.

If we wanted to get data for the Humanities and Religions journals, which have ISSNs codes 2076-0787 and 2077-1444, we could combine the tests using OR:

SELECT *
FROM articles
WHERE (issns = '2076-0787') OR (issns = '2077-1444');

Challenge 

Write a query that returns the title, first_author, issns, month and year for all single author papers with more than 4 citations

Building more complex queries

Now, lets combine the above queries to get data for the 3 journals from June on. This time, let’s use IN as one way to make the query easier to understand. It is equivalent to saying WHERE (issns = '2076-0787') OR (issns = '2077-1444') OR (issns = '2067-2764|2247-6202'), but reads more neatly:

SELECT *
FROM articles
WHERE (month > 06) AND (issns IN ('2076-0787', '2077-1444', '2067-2764|2247-6202'));

We started with something simple, then added more clauses one by one, testing their effects as we went along. For complex queries, this is a good strategy, to make sure you are getting what you want. Sometimes it might help to take a subset of the data that you can easily see in a temporary database to practice your queries on before working on a larger or more complicated database.

When the queries become more complex, it can be useful to add comments. In SQL, comments are started by --, and end at the end of the line. For example, a commented version of the above query can be written as:

-- Get post June data on selected journals
-- These are in the articles table, and we are interested in all columns
SELECT * FROM articles
-- Sampling month is in the column `month`, and we want to include
-- everything after June
WHERE (month > 06)
-- selected journals have the `issns` 2076-0787, 2077-1444, 2067-2764|2247-6202
AND (issns IN ('2076-0787', '2077-1444', '2067-2764|2247-6202'));

Although SQL queries often read like plain English, it is always useful to add comments; this is especially true of more complex queries.

Sorting

We can also sort the results of our queries by using ORDER BY. For simplicity, let’s go back to the articles table and alphabetize it by issns.

SELECT *
FROM articles
ORDER BY issns ASC;

The keyword ASC tells us to order it in Ascending order. We could alternately use DESC to get descending order.

SELECT *
FROM articles
ORDER BY first_author DESC;

ASC is the default.

We can also sort on several fields at once. To truly be alphabetical, we might want to order by genus then species.

SELECT *
FROM articles
ORDER BY issns DESC, first_author ASC;

Challenge 

Write a query that returns title, first_author, issns and citation_count from the articles table, sorted with the most cited article at the top and alphabetically

Order of execution

Another note for ordering. We don’t actually have to display a column to sort by it. For example, let’s say we want to order the articles by their ISSN, but we only want to see Authors and Titles.

SELECT authors, title
FROM articles
WHERE issns = '2067-2764|2247-6202'
ORDER BY date ASC, first_author ASC;

We can do this because sorting occurs earlier in the computational pipeline than field selection.

The computer is basically doing this:

  1. Filtering rows according to WHERE
  2. Sorting results according to ORDER BY
  3. Displaying requested columns or expressions.

Clauses are written in a fixed order: SELECTFROMWHERE, then ORDER BY. It is possible to write a query as a single line, but for readability, we recommend to put each clause on its own line.

Challenge 

Let’s try to combine what we’ve learned so far in a single query. Using the articles table write a query to display the three date fields, issn, and citation_count, for articles published after June, ordered alphabetically by first author name. Write the query as a single line, then put each clause on its own line, and see how more legible the query becomes!