Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Table of Contents

0:

Data Scripting Bootcamp

Introduction

0.1

Overall

Bootcamp Schedule

  1. (Meta)Data Basics - including touching modeling, representations, and structures (1 hour)
  2. Using the command line (in the *nix Shell and Bash) to interact with (meta)data (1 hour 30 minutes)
  3. Versioning & collaborating on (meta)data using Git & GitHub (1 hour 15 minutes)
  4. Querying & updating (meta)data contained in traditional MySQL databases (a popular database selection at CUL & elsewhere) (1 hour)
  5. Wrap-up & Requests for future workshops (15 minutes)

0.2 Workshop Logistics

0.3: Short Discussion on Jargon(s)

This group task is an opportunity for everyone to get help understanding terms, phrases, or ideas around metadata, data science, and scripting.

  • As a full group, make a big list of all the problem terms, phrases, and ideas that come up.
  • Taking common words as a starting point, work together to try to explain what a term, phrase, or idea means.
  • What will we cover today? What can we cover in a future MWG meeting?
  • Add these to a shared CUL MWG Glossary.

11: Data Modeling

1.1: What is "Data Modeling"?

 

Data model: Represents the fundamental concepts that are relevant to a system. Concepts are often represent by classes that can have attributes. The combination of these create a data model, as they encapsulate the data that will be used by your system.

  • Data Model is sometimes an Application Artifact
  • Data Models act as a specification of a
  • Data Model is sometimes an Application Artifact
  • Data Models act as a specification of a System.
  • Model documents serve as documentation for design team, developers, and users.
  • Data Models are ultimately a way to communicate understandings that bridge the conceptual and the functional.
  • Data model: represents the fundamental concepts that are relevant to a system. In object-oriented programming, concepts are often represent by classes that can have attributes. The combination of these create a data model, as they encapsulate the data that will be used by your system.

 

 

1.2: Why do we "Data Model"?

1.2: Why do we "Data Model"?

  • Modeling for Design Feedback
  • Provides common language and point of discussion for programmers, designers & domain experts.
  • Early opportunity to address ethical issues:
    • How does what we model in our systems create affordances and restrictions for our users?
    • Falsehoods Programmers Believe

1.3: What 1.3: What Data Modeling Can Look Like?

...

2: Data Structures & Scripting

Data structures: A data structure is a particular way of organizing data in a computer so that it can be used effectively. The idea is to reduce the space and time complexities of different tasks. 

Data structures intersect with Data Models. Data models represent a concept to humans interacting with systems and computers. At a lower level than models, any type of data must be represented internally to the computer, and this is by means of data structures. For example, an array is a data structure , as well as dictionaries(a number of elements in a specific order, typically all of the same type), as well as dictionaries (collection of (key, value) pairs, such that each possible key appears at most once in the collection). Also, the classes that form your data model, are data structures too, any representation of a specific data object has to be in form of a data structure.

...

The following is taken mostly from the work of Library Carpentry's Data Intro for Librarians module. It is "some foundation level stuff - a combination of best practice and generic skills"

Some Points on Scripting

...

  • The Computer is Stupid. It can only do (whether process errors or explain errors of processes) what humans tell it to do.
  • Borrow, Borrow, and Borrow again; this is a mainstay of programming and a practice common to all skill levels. Borrow other people's code. Reuse and adapt existing scripts.
  • The correct (programming) language to learn is the one that works in your local context. 

  • Consider the role of programming in professional development.

  • Knowing (even a little) code helps you evaluate projects that use code. 

  • Automate some of your metadata work to make the time to do something else! 

Keyboard shortcuts are your friend

2.2: Simple Place to Start for Enabling Data Scripting

Keyboard shortcuts are your friend

Starting with relatively simple things like keyboard shortcuts can both help with automation and introduce programming concepts. This also extends to macros and spreadsheet formulas - these are scripting and mean you already have your start in programming. Honestly, many library technologists and programmers often started with these sort of tasks.

Plain text formats are your friend

...

Most standard office suites include the option to save files in .txt, .csv and .tsv formats, meaning you can continue to work with familiar software and still take appropriate action to make your work accessible. Compared to .doc or .xls, these formats have the additional benefit of containing only machine-readable elements. Whilst using Machine-readable elements usually mean not including textual formats like bold, italics, and colouring coloring. This type of formatting can be helpful to signify headings or to make a visual connection between data elements is common practice, these display-orientated annotations are not (easily) machine-readable and hence can neither be queried and searched nor are appropriate for large quantities of information (the rule of thumb is if you can’t find it by CTRL+F, it isn’t machine readable).

Though it is likely that notation schemes will emerge from existing individual practice, existing schema are available to represent headers, breaks, et al. if you need them. One such scheme is Markdown, a lightweight markup language. Markdown files are as .md, are machine readable, human readable, and used in many contexts - GitHub for example, renders text via Markdown. An excellent Markdown cheat sheet is available on GitHub for those who wish to follow – or adapt – this existing schema. Notepad++ http://notepad-plus-plus.org/ is recommended for Windows users as a tool to write Markdown text in, though it is by no means essential for working with .md files. Mac or Unix users may find Komodo Edit, Text Wrangler, Kate, or Atom helpful. Combined with pandoc, a markdown file can be exported to PDF, LaTeX or other formats, so it is a great way to create machine-readable, easily searchable documents that can be repurposed in many ways. It is a universal document converter.

...

Working with data is made easier by structuring your stuff files and directories in a consistent and predictable manner.

...

A typical example of the former (containing semantic elements) are the URLs used by news websites or blogging services. WordPress URLs follow the format:

...

In archival catalogues among other repository websites, URLs structured by a single data element are often used. The NLA’s TROVE structures its online archive using the format:

...

What we learn from these examples is that a combination of semantic description and data elements make for consistent and predictable data structures that are readable both by humans and machines. Transferring this to your stuff makes it easier to browse, to search, and to query using both the standard tools provided by operating systems and by the more advanced tools Library Carpentry will cover.

In practice, the structure of a good archive data directory might look something like this:

...

All this should help you remember something you were working on when you come back to it later (call it real world preservation)or need to work on a large number of files at once.

The crucial bit for our purposes, however, is the file naming convention you choose. The name of a file is important to ensuring it and its contents are easy to identify. ‘Data.xslx’ doesn’t fulfil this purpose. A title that describes the data does. And adding dating convention to the file name, associating derived data with base data through file names, and using directory structures to aid comprehension strengthens those connection.

3

...

: Data Queries & Regular Expressions

Filenaming conventions become important when you want to use some simple automation and scripting tools to interact with your files and data.

Consistent and semantic filenaming and directory structure meansOne of the reason why I have stressed the value of consistent and predictable directory and filenaming conventions is that working in this way enables you to use the computer to select files based on the characteristics of their file name. So, for example, if you have a bunch of files where the first four digits are the year and you only want to do something with files from '2014', then you can. Or if you have 'journal' somewhere in a filename when you have data about journals, you can use the computer to select just those files then do something with them.

Equally, using plain text formats means that you can go further and select files or elements of files based on characteristics of the data within files.

3.1: Regular Expressions

A powerful means of doing this selecting & querying based on file or data characteristics is to use regular expressions, often abbreviated to regex. A regular expression is a sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. "find and replace"-like operations. Regular expressions are typically surrounded by / characters, though we will (mostly) ignore those for ease of comprehension. Regular expressions will let you:

...

As most computational software has regular expression functionality built in and as many computational tasks in libraries are built around complex matching, it is good place for Library Carpentry to start in earnestfor data scripting here.

A very simple use of a regular expression would be to locate the same word spelled two different ways. For example the regular expression organi[sz]e matches both "organise" and "organize".

But it would also match reorganisereorganizeorganisesorganizesorganisedorganized, et cetera, because we've not specified the beginning or end of our string. So there are a bunch of special syntax that help us be more precise.

3.2: Regular Expressions Syntax (Starter)

The first we've seen: square brackets can be used to define a list or range of characters to be found. So:

...

  • . matches any character
  • \d matches any single digit
  • \w matches any part of word character (equivalent to [A-Za-z0-9])
  • \s matches any space, tab, or newline
  • \ NB: this is also used to escape the following character when that character is a special character. So, for example, a regular expression that found .com would be \.com because . is a special character that matches any character.
  • ^ asserts the position at the start of the line. So what you put after it will only match the first characters of a line or contents of a cell.
  • $ asserts the position at the end of the line. So what you put after it will only match the last character of a line of contents of a cell.
  • \b adds a word boundary. Putting this either side of a stops the regular expression matching longer variants of words. So:
    • the regular expression foobar will match foobar and find 666foobarfoobar7778thfoobar8th et cetera
    • the regular expression \bfoobar will match foobar and find foobar777
    • the regular expression foobar\b will match foobar and find 666foobar
    • the regular expression \bfoobar\b will find foobar

Exercises

4.1.1: Using special characters in regular expression matches

Can you guess what the regular expression ^[Oo]rgani.e\b will match?

Solution

organise
organize
Organise
Organize
organife
Organike
Or, any other string that starts a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, and ends with the letter e.

Other useful special characters are:

  • * matches when the preceding character appears any number of times including zero
  • + matches when the preceding character appears any number of times excluding zero
  • ? matches when the preceding character appears one or zero times
  • {VALUE} matches the preceding character the number of times define by VALUE; ranges can be specified with the syntax {VALUE,VALUE}
  • | means or.

4.1.2: Using special characters in regular expression matches

Can you guess what the regular expression ^[Oo]rgani.e\w* will match?

Solution

organise
Organize
organifer
Organi2ed111
Or, any other string that starts a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, follows with letter e and zero or more characters from the range [A-Za-z0-9].
[Oo]rgani.e\w+$

...

Other useful special characters are:

  • * matches when the preceding character appears any number of times including zero
  • + matches when the preceding character appears any number of times excluding zero
  • ? matches when the preceding character appears one or zero times
  • {VALUE} matches the preceding character the number of times define by VALUE; ranges can be specified with the syntax {VALUE,VALUE}
  • | means or.

3.3: Regular Expression Exercises

Below are some examples to get a feeling for using Regular Expressions. We will go through the first few together to get a feeling for their logic - which will be extremely helpful for running matches for certain filenames, call values in a table, or matches in the data itself.

If there is time, run through the following examples by yourself or in small groups. You do not need to get through all of these, and we will be referring back to these in the next session.

If you want to check your Regular Expression logic in real-time, use regex101, regexrmyregexpregex pal, or regexper.com: the first four help you see what text your regular expression will match, the latter shows the workflow of a regular expression.

3.3.1: Using special characters in regular expression matches

Expand
titleCan you guess what the regular

...

expression ^[Oo]rgani.e\

...

b will match?

...

(Click to see the answer)
organiseorganize
Organise
Organize
organife
Organike
 

Or, any other string that starts a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, and ends with the letter e.

3.3.2

Solution

organiser
Organized
organifer
Organi2ed111
Or, any other string that ends a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, follows with letter e and one or more characters from the range [A-Za-z0-9].

...

: Using special characters in regular expression matches: \w*

Expand
titleCan you guess what the regular

...

expression ^[Oo]rgani.e\w

...

* will match?

...

Solution

organise
Organized
organifer
Organi2ek

...

(Click to see answer)
organise
Organize
organifer
Organi2ed111
 

Or, any other string that starts a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, follows with letter e and zero or more characters from the range [A-Za-z0-9].

...

[Oo]rgani.e\w+$

3.3.3

...

: Using special characters in regular expression matches: \w+$

Expand
titleCan you guess what the regular

...

expression [Oo]rgani.e\w

...

+$

...

will match?

...

Solution

...

(Click here for answer)
organiser
Organized
organifer

...

Organi2ed111
 

Or,

...

any

...

other

...

string

...

that

...

ends

...

a

...

line,

...

begins

...

with

...

a

...

letter o in

...

lower

...

or

...

capital

...

case,

...

proceeds

...

with rgani,

...

has

...

any

...

character

...

in

...

the

...

7th

...

position,

...

follows

...

with

...

letter e and

...

one or more characters from the range [A-Za-z0-9].

...

3.

...

3.

...

4: Using special characters in regular expression matches: \w?\b

Expand
titleCan you guess what the regular

...

expression ^[Oo]rgani.e\w

...

?\b

...

will match?

...

(Click here for answer)
organise
Organized
organifer
Organi2ek
 

Solution

...

Or,

...

any

...

other

...

string

...

that

...

starts a line, begins

...

with

...

a

...

letter o in

...

lower

...

or

...

capital

...

case

...

,

...

proceeds

...

with rgani,

...

has

...

any

...

character

...

in

...

the

...

7th

...

position,

...

follows

...

with

...

letter e,

...

and

...

ends

...

with zero or one characters from the range [A-Za-z0-9].

...

3.

...

3.

...

5: Using special characters in regular expression matches: \w?$

Expand
titleCan you guess what the regular

...

expression ^[Oo]rgani.e\w

...

?$ will match?

...

(Click for answer)
organise
Organized
organifer
Organi2ek
 

Solution

...

Or,

...

any

...

other

...

string

...

that starts and ends a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, follows with letter e and zero or one characters from the range [A-Za-z0-9].

3.3.6: Using special characters in regular expression matches: \w{2}\b

Expand
titleCan you guess what the regular expression \b[Oo]rgani.e\w{2}\b will match? (Click for answer)
organisers
Organizers
organifers
Organi2ek1
 

Or, any other string that begins with a letter o in lower or capital case after a word boundary, proceeds with rgani, has any character in the 7th position, follows with letter e, and ends with two characters from the range [A-Za-z0-9].

This logic is super useful when you have lots of files in a directory, when those files have logical file names, and when you want to isolate a selection of files. Or for looking at cells in spreadsheets for certain values. Or for extracting some data from a column of a spreadsheet to make new columns. I could go on. The point is, it is super useful in many contexts. To embed this knowledge we won't - however - be using computers. Instead we'll use pen and paper. I want you to work in teams of 4 to work through the exercises in the handout. I have an answer sheet over here if you want to check where you've gone wrong. When you finish, I'd like you to split your team into two groups and write each other some tests. These should include a) strings you want the other team to write regex for and b) regular expressions you want the other team to work out what they would match. Then test each other on the answers. If you want to check your logic, use regex101myregexp, or regex pal regexper.com: the first three help you see what text your regular expression will match, the latter visualises the workflow of a regular expression.

4.1.8: Using Square Brackets

Can you guess what the regular expression Fr[ea]nc[eh] will match?

Solution

French
France
Frence
Franch
This will also find words where there are characters either side of the solutions above, such as Francer, foobarFrench, and Franch911.

4.1.9: Using dollar signs

Can you guess what the regular expression Fr[ea]nc[eh]$ will match?

Solution

French
France
Frence
Franch
This will also find strings at the end of a line. It will find words where there were characters before these, for example foobarFrench.

4.1.1: Introducing options

What would match the strings French and France only that appear at the beginning of a line?

Solution

^France|^French
This will also find words where there were characters after French such as Frenchness.

4.1.1: Case insensitivity

How do you match the whole words colour and color (case insensitive)?

Solutions

\b[Cc]olou?r\b|\bCOLOU?R\b
/colou?r/i
In real life, you should only come across the case insensitive variations colour, color, Colour, Color, COLOUR, and COLOR (rather than, say, coLour). So based on what we know, the logical regular expression is \b[Cc]olou?r\b|\bCOLOU?R\b. An alternative more elegant option we've not discussed is to take advantage of the / delimiters and add an ignore case flag: so /colou?r/i will match all case insensitive variants of colour and color.

4.1.1: Word boundaries

How would you find the whole-word headrest and or the 2-gram head rest but not head rest (that is, with two spaces between head and rest?

Solution

\bhead ?rest\b
Note that although \bhead\s?rest\b does work, it will also match zero or one tabs or newline characters between head and rest. So again, although in most real world cases it will be fine, it isn't strictly correct. 

4.1.1: Matching non-linguistic patterns

How would you find a string that ends with 4 letters preceded by at least one zero?

Solution

0+[a-z]{4}\b

4.1.1: Matching digits

How do you match any 4 digit string anywhere?

Solution

\d{4}
Note this will also match 4 digit strings within longer strings of numbers and letters.

4.1.1: Matching dates

How would you match the date format dd-MM-yyyy?

Solution

\b\d{2}-\d{2}-\d{4}\b
Depending on your data, you may chose to remove the word bounding.

4.1.1: Matching multiple date formats

How would you match the date format dd-MM-yyyy or dd-MM-yy at the end of a string only?

Solution

\d{2}-\d{2}-\d{2,4}$
Note this will also find strings such as 31-01-198 at the end of a line, so you may wish to check your data and revise the expression to exclude false positives. Depending on your data, you may chose to add word bounding at the start of the expression.

4.1.1: Matching publication formats

How would you match publication formats such as British Library : London, 2015 and Manchester University Press: Manchester, 1999?

Solution

.* : .*, \d{4}

...

3.3.7: Using special characters in regular expression matches: \w{1}\b

Expand
titleCan you guess what the regular expression \b[Oo]rgani.e\b|\b[Oo]rgani.e\w{1}\b will match? (Click for answer)
organise
Organi1e
Organizer
organifed
 

Or, any other string that begins with a letter o in lower or capital case after a word boundary, proceeds with rgani, has any character in the 7th position, and end with letter e, or any other string that begins with a letter o in lower or capital case after a word boundary, proceeds with rgani, has any character in the 7th position, follows with letter e, and ends with a single character from the range [A-Za-z0-9].

3.3.8: Using Square Brackets

Expand
titleCan you guess what the regular expression Fr[ea]nc[eh] will match? (Click for answer)
French 
France 
Frence 
Franch
 

This will also find words where there are characters either side of the solutions above, such as Francer, foobarFrench, and Franch911.

3.3.9: Using dollar signs

Expand
titleCan you guess what the regular expression Fr[ea]nc[eh]$ will match? (Click for answer)
French 
France 
Frence 
Franch
 

This will also find strings at the end of a line. It will find words where there were characters before these, for example foobarFrench.

3.3.10: Introducing options

Expand
titleWhat would match the strings French and France only that appear at the beginning of a line? (Click for answer)
^France|^French
 

This will also find words where there were characters after French such as Frenchness. 

3.3.11: Case insensitivity

Expand
titleHow do you match the whole words colour and color (case insensitive)? (Click for answer)
\b[Cc]olou?r\b|\bCOLOU?R\b /colou?r/i
 

In real life, you should only come across the case insensitive variations colour, color, Colour, Color, COLOUR, and COLOR (rather than, say, coLour). So based on what we know, the logical regular expression is \b[Cc]olou?r\b|\bCOLOU?R\b. An alternative more elegant option we've not discussed is to take advantage of the / delimiters and add an ignore case flag: so /colou?r/i will match all case insensitive variants of colour and color.

3.3.12: Word boundaries

Expand
titleHow would you find the whole-word headrest and or the 2-gram head rest but not head rest (that is, with two spaces between head and rest? (Click for answer)
\bhead ?rest\b
 

Note that although \bhead\s?rest\b does work, it will also match zero or one tabs or newline characters between head and rest. So again, although in most real world cases it will be fine, it isn't strictly correct.

3.3.13: Matching non-linguistic patterns (Write the Regex for the Query)

Expand
titleHow would you find a string that ends with 4 letters preceded by at least one zero? (Click for answer)
0+[a-z]{4}\b

3.3.14: Matching digits (Write the Regex for the Query)

Expand
titleHow do you match any 4 digit string anywhere? (Click here for answer)
\d{4}
 

Note this will also match 4 digit strings within longer strings of numbers and letters. 

3.3.15: Matching dates  (Write the Regex for the Query)

Expand
titleHow would you match the date format dd-MM-yyyy? (Click for answer)
\b\d{2}-\d{2}-\d{4}\b
 

Depending on your data, you may chose to remove the word bounding. 

3.3.16: Matching multiple date formats (Write the Regex for the Query)

Expand
titleHow would you match the date format dd-MM-yyyy or dd-MM-yy at the end of a string only? (Click for answer)
\d{2}-\d{2}-\d{2,4}$
 
Note this will also find strings such as 31-01-198 at the end of a line, so you may wish to check your data and revise the expression to exclude false positives. Depending on your data, you may chose to add word bounding at the start of the expression.

3.3.17: Matching publication formats (Write the Regex for the Query)

Expand
titleHow would you match publication formats such as British Library : London, 2015 and Manchester University Press: Manchester, 1999? (Click for answer)
.* : .*, \d{4}
 

Without word boundaries you will find that this matches any text you put before British or Manchester. Nevertheless, the regular expression does a good job on the first look up and may be need to be refined on a second depending on your data.


References & Further Reading