Table of Contents

0: Data Scripting Bootcamp Introduction

0.1 Overall Bootcamp Schedule

  1. (Meta)Data Basics - including a brief look at modeling, representations, and structures
  2. Using the command line (in the *nix Shell and Bash) to interact with (meta)data
  3. Versioning & collaborating on (meta)data using Git & GitHub
  4. Querying & updating (meta)data contained in traditional MySQL databases (a popular database selection at CUL & elsewhere)

0.2 Workshop Logistics

  • CUL Metadata Working Group 2016-2017 Blurb
  • Audience Introductions / Comfort Level
  • Curriculum location online
    • Parts of this will be self-directed
    • We want to stay informal, so please let us know if you need help
  • This is a Bootcamp, so it won't go in depth. We just want to help you get acclimated to these technologies.
    • Use your Google-fu skills wisely.
    • Please be patient with yourself + others. 
  • Hacker School Rules, please.

0.3: Short Discussion on Jargon(s)

...

1.1: What is "Data Modeling"?

 

  • Data Model is sometimes an Application Artifact
  • Data Models act as a specification of a System.
  • Model documents serve as documentation for design team, developers, and users.
  • Data Models are ultimately a way to communicate understandings that bridge the conceptual and the functional.
  • Data model: represents the fundamental concepts that are relevant to a system. In object-oriented programming, concepts are often represented by classes that can have attributes. The combination of these creates a data model, as they encapsulate the data that will be used by your system.

 

 

1.2: Why do we "Data Model"?

...

  • . matches any character
  • \d matches any single digit
  • \w matches any single word character (equivalent to [A-Za-z0-9_])
  • \s matches any space, tab, or newline
  • \ NB: this is also used to escape the following character when that character is a special character. So, for example, a regular expression that found .com would be \.com because . is a special character that matches any character.
  • ^ asserts the position at the start of the line. So what you put after it will only match the first characters of a line or of the contents of a cell.
  • $ asserts the position at the end of the line. So what you put before it will only match the last characters of a line or of the contents of a cell.
  • \b asserts a word boundary. Putting this either side of a word stops the regular expression matching longer variants of words. So:
    • the regular expression foobar will match foobar and find 666foobar, foobar777, 8thfoobar8th et cetera
    • the regular expression \bfoobar will match foobar and find foobar777
    • the regular expression foobar\b will match foobar and find 666foobar
    • the regular expression \bfoobar\b will find foobar
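The word-boundary behaviour listed above can be checked with a short Python sketch. It assumes Python's built-in re module, whose \b follows the same rule for these ASCII strings; the helper name matching is made up for this illustration.

```python
import re

# The sample strings from the list above.
samples = ["foobar", "666foobar", "foobar777", "8thfoobar8th"]


def matching(pattern, strings):
    """Return the strings in which the pattern is found anywhere."""
    return [s for s in strings if re.search(pattern, s)]


print(matching(r"foobar", samples))      # no boundaries: found in every sample
print(matching(r"\bfoobar", samples))    # boundary required before foobar
print(matching(r"foobar\b", samples))    # boundary required after foobar
print(matching(r"\bfoobar\b", samples))  # boundary required on both sides
```

Digits and letters are all word characters, so there is no boundary between 666 and foobar, which is why \bfoobar skips 666foobar.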

Regular Expression Exercises

4.1.1: Using special characters in regular expression matches

Can you guess what the regular expression ^[Oo]rgani.e\b will match?

Solution

organise
organize
Organise
Organize
organife
Organike

Or, any other string that starts a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, and ends with the letter e.

Other useful special characters are:

...

4.1.2: Using special characters in regular expression matches

Can you guess what the regular expression ^[Oo]rgani.e\w* will match?

Solution

organise
Organize
organifer
Organi2ed111

Or, any other string that starts a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, follows with letter e and zero or more characters from the range [A-Za-z0-9_].

4.1.3: Using special characters in regular expression matches

Can you guess what the regular expression [Oo]rgani.e\w+$ will match?

Solution

organiser
Organized
organifer
Organi2ed111

Or, any other string that ends a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, follows with letter e and one or more characters from the range [A-Za-z0-9_].

4.1.4: Using special characters in regular expression matches

Can you guess what the regular expression ^[Oo]rgani.e\w?\b will match?

Solution

organise
Organized
organifer
Organi2ek

Or, any other string that starts a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, follows with letter e, and ends with zero or one characters from the range [A-Za-z0-9_].

4.1.5: Using special characters in regular expression matches

Can you guess what the regular expression ^[Oo]rgani.e\w?$ will match?

Solution

organise
Organized
organifer
Organi2ek

Or, any other string that starts and ends a line, begins with a letter o in lower or capital case, proceeds with rgani, has any character in the 7th position, follows with letter e and zero or one characters from the range [A-Za-z0-9_].

4.1.6: Using special characters in regular expression matches

Can you guess what the regular expression \b[Oo]rgani.e\w{2}\b will match?

Solution

organisers
Organizers
organifers
Organi2ek1

Or, any other string that begins with a letter o in lower or capital case after a word boundary, proceeds with rgani, has any character in the 7th position, follows with letter e, and ends with two characters from the range [A-Za-z0-9_].

4.1.7: Using special characters in regular expression matches

Can you guess what the regular expression \b[Oo]rgani.e\b|\b[Oo]rgani.e\w{1}\b will match?

Solution

organise
Organi1e
Organizer
organifed

Or, any other string that begins with a letter o in lower or capital case after a word boundary, proceeds with rgani, has any character in the 7th position, and ends with letter e, or any other string that begins with a letter o in lower or capital case after a word boundary, proceeds with rgani, has any character in the 7th position, follows with letter e, and ends with a single character from the range [A-Za-z0-9_].
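The quantifiers used across the exercises above (\w*, \w+, \w? and \w{2}) differ only in how many word characters they allow after the e. A minimal Python sketch of that difference follows. It assumes Python's re module and uses re.fullmatch, which for these single-line strings behaves like anchoring the pattern with ^ and $; the helper name full_matches and the word list are made up for this illustration.

```python
import re

# Test words chosen for this illustration; each starts with the same stem.
words = ["organise", "organiser", "organisers", "Organi2ed111"]

# The part of the pattern shared by all the exercises.
stem = r"[Oo]rgani.e"


def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [w for w in candidates if re.fullmatch(pattern, w)]


print(full_matches(stem + r"\w*", words))    # zero or more word characters
print(full_matches(stem + r"\w+", words))    # one or more word characters
print(full_matches(stem + r"\w?", words))    # zero or one word character
print(full_matches(stem + r"\w{2}", words))  # exactly two word characters
```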

This logic is super useful when you have lots of files in a directory, when those files have logical file names, and when you want to isolate a selection of files. Or for looking at cells in spreadsheets for certain values. Or for extracting some data from a column of a spreadsheet to make new columns. I could go on. The point is, it is super useful in many contexts.

To embed this knowledge we won't, however, be using computers. Instead we'll use pen and paper. I want you to work in teams of 4 to work through the exercises in the handout. I have an answer sheet over here if you want to check where you've gone wrong. When you finish, I'd like you to split your team into two groups and write each other some tests. These should include a) strings you want the other team to write regex for and b) regular expressions you want the other team to work out what they would match. Then test each other on the answers.

If you want to check your logic, use regex101, myregexp, regex pal, or regexper.com: the first three help you see what text your regular expression will match, the last visualises the workflow of a regular expression.
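For anyone who wants to try the file-name case at a keyboard after the pen-and-paper session, here is a minimal Python sketch. The file names are hypothetical, invented for this illustration, and the pattern is just one way to isolate a selection of them.

```python
import re

# Hypothetical file names, invented for this illustration.
file_names = [
    "2016-01-report.csv",
    "2016-02-report.csv",
    "2016-02-notes.txt",
    "report-final.csv",
    "2017-01-report.csv",
]

# Start of the name, the year 2016, a two-digit month, then "-report.csv"
# at the end. The dot is escaped so that it only matches a literal dot.
pattern = re.compile(r"^2016-\d{2}-report\.csv$")

selected = [name for name in file_names if pattern.search(name)]
print(selected)
```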

4.1.8: Using Square Brackets

 

Can you guess what the regular expression Fr[ea]nc[eh] will match?

Solution

French
France
Frence
Franch

This will also find words where there are characters either side of the solutions above, such as Francer, foobarFrench, and Franch911.

 

4.1.9: Using dollar signs

 

Can you guess what the regular expression Fr[ea]nc[eh]$ will match?

Solution

French
France
Frence
Franch

This will only find these strings at the end of a line. It will also find words where there are characters before them, for example foobarFrench.
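The effect of adding $ to the square-bracket pattern can be seen by running both versions over the same strings. This is a small sketch assuming Python's re module; Frinch is an extra string added here to show a non-match.

```python
import re

# Strings taken from exercises 4.1.8 and 4.1.9, plus "Frinch" as a non-match.
samples = [
    "French", "France", "Frence", "Franch",
    "Francer", "foobarFrench", "Franch911", "Frinch",
]

unanchored = re.compile(r"Fr[ea]nc[eh]")     # may match anywhere in the string
end_anchored = re.compile(r"Fr[ea]nc[eh]$")  # must finish at the end of the line

found_anywhere = [s for s in samples if unanchored.search(s)]
found_at_end = [s for s in samples if end_anchored.search(s)]

print(found_anywhere)
print(found_at_end)
```

Francer and Franch911 drop out of the second list because characters follow the match, while foobarFrench stays because $ places no restriction on what comes before.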

 

4.1.10: Introducing options

 

What would match the strings French and France only when they appear at the beginning of a line?

Solution

^France|^French

This will also find words where there are characters after French, such as Frenchness.
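A short Python sketch of the alternation above, assuming Python's re module. The sample lines are made up for this illustration, and the second pattern (with \b added) is one possible refinement rather than part of the exercise answer.

```python
import re

# Sample lines invented for this illustration.
lines = ["France is large", "French bread", "Frenchness", "In France", "Frence"]

# The exercise answer: either word, but only at the start of a line.
starts_with = [l for l in lines if re.search(r"^France|^French", l)]

# Adding a word boundary after each option excludes longer words.
whole_word = [l for l in lines if re.search(r"^France\b|^French\b", l)]

print(starts_with)
print(whole_word)
```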
 

4.1.11: Case insensitivity

 

How do you match the whole words colour and color (case insensitive)?

Solutions

\b[Cc]olou?r\b|\bCOLOU?R\b

/colou?r/i

In real life, you should only come across the case insensitive variations colour, color, Colour, Color, COLOUR, and COLOR (rather than, say, coLour). So based on what we know, the logical regular expression is \b[Cc]olou?r\b|\bCOLOU?R\b. A more elegant alternative that we've not discussed is to take advantage of the / delimiters and add an ignore case flag: /colou?r/i will match all case insensitive variants of colour and color. Add \b on either side (/\bcolou?r\b/i) if you need whole words only.
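The two approaches can be compared side by side in Python. Python's re module does not use / delimiters, so this sketch passes the ignore case option as re.IGNORECASE instead of a trailing i. The sample sentence is made up for this illustration, and \b is added to the second pattern to keep whole-word matching.

```python
import re

# Sample sentence invented for this illustration.
text = "Colour, color, COLOUR and coLour; but not colourful or discolor."

# Spelling out the expected capitalisations by hand.
explicit = re.compile(r"\b[Cc]olou?r\b|\bCOLOU?R\b")

# Letting the ignore case flag cover every capitalisation.
ignore_case = re.compile(r"\bcolou?r\b", re.IGNORECASE)

print(explicit.findall(text))
print(ignore_case.findall(text))
```

The flag version also picks up the unusual coLour, which the hand-written alternation deliberately leaves out.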

 

4.1.12: Word boundaries

 

How would you find the whole word headrest and/or the 2-gram head rest, but not head  rest (that is, with two spaces between head and rest)?

Solution

\bhead ?rest\b

Note that although \bhead\s?rest\b does work, it will also match zero or one tabs or newline characters between head and rest. So again, although in most real world cases it will be fine, it isn't strictly correct.
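The difference between a literal space and \s shows up as soon as a tab appears in the data. A minimal Python sketch, assuming Python's re module and sample strings made up for this illustration:

```python
import re

# Sample strings invented for this illustration; "\t" is a tab character.
samples = ["headrest", "head rest", "head  rest", "head\trest", "headrests"]

literal_space = re.compile(r"\bhead ?rest\b")    # zero or one space character
any_whitespace = re.compile(r"\bhead\s?rest\b")  # zero or one space, tab, or newline

with_space = [s for s in samples if literal_space.search(s)]
with_whitespace = [s for s in samples if any_whitespace.search(s)]

print(with_space)
print(with_whitespace)
```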

...

4.1.13: Matching non-linguistic patterns

How would you find a string that ends with 4 letters preceded by at least one zero?

Solution

0+[a-z]{4}\b

4.1.14: Matching digits

 

How do you match any 4 digit string anywhere?

Solution

\d{4}

Note this will also match 4 digit strings within longer strings of numbers and letters.
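A quick Python sketch of that caveat, assuming Python's re module and a sample sentence made up for this illustration. The second pattern adds word boundaries as one possible way to exclude digits embedded in longer strings.

```python
import re

# Sample sentence invented for this illustration.
text = "Published 1999, reprinted 2015; catalogue no. AB123456, tel. 555."

any_four = re.findall(r"\d{4}", text)         # four digits anywhere
bounded_four = re.findall(r"\b\d{4}\b", text)  # four digits standing alone

print(any_four)
print(bounded_four)
```

The unbounded pattern picks up 1234 from inside AB123456, which the bounded version leaves out.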
 

4.1.15: Matching dates

 

How would you match the date format dd-MM-yyyy?

Solution

\b\d{2}-\d{2}-\d{4}\b

Depending on your data, you may choose to remove the word boundaries.
 

4.1.16: Matching multiple date formats

 

How would you match the date format dd-MM-yyyy or dd-MM-yy at the end of a string only?

Solution

\d{2}-\d{2}-\d{2,4}$

Note this will also find strings such as 31-01-198 at the end of a line, so you may wish to check your data and revise the expression to exclude false positives. Depending on your data, you may choose to add a word boundary at the start of the expression.
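The false positive described above can be reproduced in Python, along with one possible revision. This sketch assumes Python's re module; the sample strings are made up, and the stricter pattern is an illustration built from the | option and \b, not the workshop's official answer.

```python
import re

# Sample strings invented for this illustration.
samples = [
    "Due 31-01-1985",
    "Due 31-01-85",
    "Due 31-01-198",
    "31-01-1985 was the date",
    "ref 131-01-1985",
]

# The exercise answer: two to four digits for the year.
loose = re.compile(r"\d{2}-\d{2}-\d{2,4}$")

# One possible revision: exactly four or exactly two year digits,
# with a word boundary at the start of each option.
strict = re.compile(r"\b\d{2}-\d{2}-\d{4}$|\b\d{2}-\d{2}-\d{2}$")

loose_hits = [s for s in samples if loose.search(s)]
strict_hits = [s for s in samples if strict.search(s)]

print(loose_hits)
print(strict_hits)
```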

4.1.17: Matching publication formats

 

How would you match publication formats such as British Library : London, 2015 and Manchester University Press : Manchester, 1999?

Solution

.* : .*, \d{4}

Without word boundaries you will find that this matches any text you put before British or Manchester. Nevertheless, the regular expression does a good job on the first look up and may need to be refined on a second, depending on your data.
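A minimal Python sketch of how this expression behaves on a few records, assuming Python's re module. The records are variations on the examples in the question, made up for this illustration, and the anchored pattern is one possible refinement.

```python
import re

# Records invented for this illustration, based on the examples in the question.
records = [
    "British Library : London, 2015",
    "Manchester University Press : Manchester, 1999",
    "Reprinted by British Library : London, 2015 (facsimile)",
    "Manchester University Press: Manchester, 1999",  # no space before the colon
]

loose = re.compile(r".* : .*, \d{4}")       # the exercise answer
anchored = re.compile(r"^.* : .*, \d{4}$")  # the whole record must fit the format

loose_hits = [r for r in records if loose.search(r)]
anchored_hits = [r for r in records if anchored.search(r)]

# The loose pattern also swallows any text that comes before the publisher.
matched_text = loose.search(records[2]).group(0)

print(loose_hits)
print(anchored_hits)
print(matched_text)
```

The last record is skipped by both patterns because the expression requires a space on each side of the colon, which is the kind of detail worth checking against your own data.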