
...

Code Block

language	bash

$ cd 
$ cd Desktop/CUL-MWG-Workshop-master
$ pwd
/Users/Christina/Desktop/CUL-MWG-Workshop-master

...

Code Block

language	bash

$ cd CUL-MWG-Workshop-master

Remember, if at any time you are not sure where you are in your directory structure, use the pwd command to find out:

...

Expand

title	Click to see the answer

From man wc, you will see that there is a -w flag to print the number of words:

     -w      The number of words in each input file is written to the standard
             output.

So to print the word counts of the .csv files:

$ wc -w *.csv

 1804072 2017-ecommons-CU-etds.csv
  452320 2017-ecommons-CUL-community.csv
   15220 CUlecturetapes_metadata_ingest-ready.csv
 2271612 total

And to sort the lines numerically:

$ wc -w *.csv | sort -n

   15220 CUlecturetapes_metadata_ingest-ready.csv
  452320 2017-ecommons-CUL-community.csv
 1804072 2017-ecommons-CU-etds.csv
 2271612 total


1.7: Mining or searching

Searching for something in one or more files is something we'll often need to do, so let's introduce a command for doing that: grep (short for global regular expression print). As the name suggests, it supports regular expressions and is therefore only limited by your imagination, the shape of your data, and - when working with thousands or millions of files - the processing power at your disposal.

To begin using grep, first navigate to the CUL-MWG-Workshop-master directory if you are not already there. Then create a new directory for your results (here named cmh329_results):

Code Block

language	bash

$ mkdir cmh329_results

Now let's try our first search:

Code Block

language	bash

$ grep metadata *.csv

Remember that the shell will expand *.csv to a list of all the .csv files in the directory. grep will then search these for instances of the string "metadata" and print the matching rows.
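To see the wildcard expansion for yourself, the following is a minimal sketch that uses a scratch directory and made-up file names (a.csv, b.csv, notes.txt), not the workshop data. echo simply prints whatever the shell hands to it after expanding *.csv.

```bash
# Scratch directory with hypothetical files; nothing here touches the workshop data
demo_dir=$(mktemp -d)
touch "$demo_dir/a.csv" "$demo_dir/b.csv" "$demo_dir/notes.txt"

# The shell replaces *.csv with the matching file names before echo runs
expanded=$(cd "$demo_dir" && echo *.csv)
echo "$expanded"
```

grep receives the same expanded list of file names, which is why it searches every csv file in the directory and ignores notes.txt.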

1.8: Strings

A string is a sequence of characters, or "a piece of text".

1.8.1: Introduction to grep

Press the up arrow once in order to cycle back to your most recent action. Amend grep metadata *.csv to grep -c metadata *.csv and hit enter.

Code Block

language	bash

$ grep -c metadata *.csv
2017-ecommons-CU-etds.csv:3
2017-ecommons-CUL-community.csv:139
CUlecturetapes_metadata_ingest-ready.csv:0

The shell now prints the number of lines in each file that contain the string metadata.
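Note that -c counts matching lines rather than individual occurrences. The sketch below illustrates the difference on a small made-up file (the file contents are an assumption for illustration only); it uses the -o flag, which prints each match on its own line, together with wc -l.

```bash
# Hypothetical three-line sample; the first line contains the string twice
demo_file=$(mktemp)
printf '%s\n' 'metadata, metadata' 'no match here' 'more metadata' > "$demo_file"

# -c counts lines that contain at least one match
line_count=$(grep -c metadata "$demo_file")

# -o prints every match on its own line, so wc -l counts individual matches
# (tr strips the padding that wc adds on some systems)
match_count=$(grep -o metadata "$demo_file" | wc -l | tr -d ' ')

echo "lines: $line_count, matches: $match_count"
```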

We will try another search:

Code Block

language	bash

$ grep -c 'Digital Archive' *.csv
2017-ecommons-CU-etds.csv:0
2017-ecommons-CUL-community.csv:3
CUlecturetapes_metadata_ingest-ready.csv:0

We got back a count of the lines containing the string 'Digital Archive' in each file. Now, amend the above command to the below and observe how the output of each is different:

Code Block

language	bash

$ grep -ci 'Digital Archive' *.csv
2017-ecommons-CU-etds.csv:0
2017-ecommons-CUL-community.csv:7
CUlecturetapes_metadata_ingest-ready.csv:0

This repeats the query, but prints a case insensitive count (including instances of both digital archive and Digital Archive and other variants). Note how the count for 2017-ecommons-CUL-community.csv has risen from 3 to 7. As before, cycling back and adding > cmh329_results/, followed by a filename (ideally in .txt format), will save the results to a data file. Go ahead and do this on your own.
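As a rough sketch of both steps, the example below compares a case sensitive and a case insensitive count and then redirects the result to a file. The sample file, its contents, and the results/count.txt name are made up for illustration.

```bash
# Hypothetical sample data in a scratch directory
demo_dir=$(mktemp -d)
printf '%s\n' 'Digital Archive' 'digital archive' 'DIGITAL ARCHIVE' 'print archive' > "$demo_dir/sample.csv"
mkdir "$demo_dir/results"

# Without -i only the exact capitalisation matches; with -i all variants do
sensitive=$(grep -c 'Digital Archive' "$demo_dir/sample.csv")
insensitive=$(grep -ci 'Digital Archive' "$demo_dir/sample.csv")
echo "case sensitive: $sensitive, case insensitive: $insensitive"

# > sends the output to a file instead of the screen
grep -ci 'Digital Archive' "$demo_dir/sample.csv" > "$demo_dir/results/count.txt"
saved=$(cat "$demo_dir/results/count.txt")
```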

1.8.2: CSV results from grep

So far we have counted strings in files and printed those counts to the shell or to a file. But the real power of grep is that you can also use it to create subsets of tabulated data (or indeed any data) from one or multiple files.

Code Block

language	bash

$ grep -i metadata *.csv

This command looks in the defined files and prints any lines containing metadata (without regard to case) to the shell.

Code Block

language	bash

$ grep -i metadata *.csv > cmh329_results/2017-02-15_metadata-ecommons.csv

This saves the subsetted data to file.

1.8.3: Whole words (-w) grep

Sometimes you need to capture the whole word only in that form (so revolution, but not revolutionary). The -w flag instructs grep to look for whole words only, giving us greater precision in our search.

Code Block

language	bash

$ grep -iw revolution *.csv > cmh329_results/DATE_JAiw-revolution.csv

This command looks in the defined files and exports any lines containing the whole word revolution (without regard to case) to the specified csv file.

We can show the difference between the files we created.

Code Block

language	bash

$ wc -l cmh329_results/*.csv
 186 cmh329_results/2017-02-15_metadata-ecommons.csv
 162 cmh329_results/DATE_JAiw-revolution.csv
 348 total
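The effect of -w is easier to see on a tiny example. This sketch uses four made-up lines (not the workshop data) and counts how many match with and without the flag.

```bash
# Hypothetical sample lines: two contain the whole word, two only contain it as part of a longer word
demo_file=$(mktemp)
printf '%s\n' 'The French Revolution' 'a revolutionary idea' 'Revolutions per minute' 'industrial revolution' > "$demo_file"

# -i alone matches the string anywhere, including inside longer words
substring=$(grep -ci revolution "$demo_file")

# adding -w keeps only lines where revolution stands alone as a word
whole=$(grep -ciw revolution "$demo_file")

echo "any occurrence: $substring, whole word: $whole"
```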

Finally, we'll use the regular expression syntax covered earlier to search for similar words.

1.9: Basic and extended regular expressions

There are unfortunately both "basic" and "extended" regular expressions. This is a common cause of confusion, since most tutorials, including ours, teach extended regular expressions, but grep uses basic by default. Unless you want to remember the details, make your life easy by always using extended regular expressions (-E flag) when doing something more complex than searching for a plain string.
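One small illustration of the difference, as a sketch on made-up data: in basic mode a plain + is treated as a literal plus sign, while with -E it means "one or more of the preceding character".

```bash
# Hypothetical sample: only the last line contains a literal plus sign
demo_file=$(mktemp)
printf '%s\n' 'ab' 'abbb' 'ab+' > "$demo_file"

# Basic (default): 'ab+' only matches the literal text ab+
basic=$(grep -c 'ab+' "$demo_file")

# Extended (-E): 'ab+' matches a followed by one or more b, so every line matches
extended=$(grep -cE 'ab+' "$demo_file")

echo "basic: $basic, extended: $extended"
```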

The regular expression 'fr[ae]nc[eh]' will match "france", "french", but also "frence" and "franch". It's generally a good idea to enclose the expression in single quotation marks, since that ensures the shell sends it directly to grep without any processing (such as trying to expand the wildcard operator *).

Code Block

language	bash

$ grep -iwE 'fr[ae]nc[eh]' *.csv

The shell will print out each matching line.

We include the -o flag to print only the matching part of each line (handy for isolating and checking results):

Code Block

language	bash

$ grep -iwEo 'fr[ae]nc[eh]' *.csv
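To check what the combined flags do, here is a minimal sketch on three made-up lines. It assumes nothing about the workshop files; the sample text is invented so that one line contains the pattern only inside a longer word.

```bash
# Hypothetical sample text; "franchise" contains "franch" but not as a whole word
demo_file=$(mktemp)
printf '%s\n' 'Trade between France and england' 'French and FRENCH sources' 'franchise agreements' > "$demo_file"

# -i ignores case, -w requires whole words, -E enables extended syntax, -o prints only the matches
matches=$(grep -iwEo 'fr[ae]nc[eh]' "$demo_file")
echo "$matches"

match_total=$(echo "$matches" | wc -l | tr -d ' ')
```

Each match is printed on its own line, so a line with two matches produces two lines of output.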

Pair up with your neighbor and work on these exercises:

1.10: Case sensitive search

Search for all case sensitive instances of a word you choose in all of the csv files in this directory. Print your results to the shell.

Code Block

language	bash

$ grep DCAPS *.csv

1.11: Exercises

Run some of the now-possible searches on the sample metadata. How can you combine them with regular expressions and the output options above? For example:

How do you run a search only on csv files that are from 2017?
How do you count all case sensitive instances of a word you choose in select csv files?
How do you count all case insensitive instances of a word you choose in select csv files?
Search for all case insensitive instances of a select word in the csv files in this directory. Print your results to a new csv file.
Search for all case insensitive instances of a whole word in the csv files in this directory. Print your results to a new csv file.
Use regular expressions to find all ISSN numbers, UNIs, or other identifiers with a regular structure in the csv files.

1.12: Searching with regular expressions

Use regular expressions to find all ISSN numbers (four digits followed by hyphen followed by four digits) in 2014-01_JA.tsv and print the results to a file results/issns.tsv.

Solution

$ grep -E '[0-9]{4}-[0-9]{4}' 2014-01_JA.tsv > results/issns.tsv

If you came up with something more advanced, perhaps including word boundaries, please share your result on the Etherpad and give yourself a pat on the shoulder.
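One possible way to add word boundaries, offered as a sketch rather than the intended answer, is to reuse the -w flag from earlier together with -o. The sample lines below are made up; the second one contains a longer number that should not be mistaken for an ISSN.

```bash
# Hypothetical sample: line 2 has five digits on each side of the hyphen
demo_file=$(mktemp)
printf '%s\n' 'ISSN 1234-5678 first' 'code 12345-67890' 'issn 0000-1111 and 1234-5678' > "$demo_file"

# Without -w the pattern also matches inside the longer number on line 2
loose=$(grep -Eo '[0-9]{4}-[0-9]{4}' "$demo_file" | wc -l | tr -d ' ')

# With -w a match must not be directly preceded or followed by a letter, digit, or underscore
strict=$(grep -Eow '[0-9]{4}-[0-9]{4}' "$demo_file" | wc -l | tr -d ' ')

echo "without -w: $loose, with -w: $strict"
```

The bracket expression [0-9] is used here because it is understood by every grep in extended mode.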

Finding unique values

If you pipe something to the uniq command, it will filter out adjacent duplicate lines, so sort the input first if you want only unique lines returned. Try piping the output from the command in the last exercise to sort, then to uniq, and then to wc -l to count the number of unique ISSN values.

Code Block

language	bash

$ grep -Eo '[0-9]{4}-[0-9]{4}' 2014-01_JA.tsv | sort | uniq | wc -l
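The reason sorting matters is that uniq only compares each line with the one directly before it. The following sketch shows this on three made-up values where the duplicates are not next to each other.

```bash
# Hypothetical values: the first and last lines are duplicates but are not adjacent
demo_file=$(mktemp)
printf '%s\n' '1234-5678' '0000-1111' '1234-5678' > "$demo_file"

# uniq alone leaves all three lines, because no two neighbouring lines are equal
adjacent_only=$(uniq "$demo_file" | wc -l | tr -d ' ')

# sorting first brings duplicates together so uniq can collapse them
sorted_first=$(sort "$demo_file" | uniq | wc -l | tr -d ' ')

echo "uniq only: $adjacent_only, sort then uniq: $sorted_first"
```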

1.13: Counting number of files, part II

In the earlier counting exercise in this lesson, we tried counting the number of files and directories in the current directory.

Recall that the command ls -l | wc -l took us quite far, but the result was one too high because it included the "total" line in the line count.
With the knowledge of grep, can you figure out how to exclude the "total" line from the ls -l output?
Hint: You want to exclude any line starting with the text "total". The hat character (^) is used in regular expressions to indicate the start of a line.

Solution

To find any lines starting with "total", we would use:

...
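One possible way to apply the hint, sketched on a scratch directory with made-up files, is shown below. It assumes the -v flag, which has not been introduced above: -v inverts the match so that grep prints only the lines that do not match. LC_ALL=C is set so that ls prints the word "total" in English regardless of system language.

```bash
# Scratch directory with three hypothetical files
demo_dir=$(mktemp -d)
touch "$demo_dir/a.csv" "$demo_dir/b.csv" "$demo_dir/c.txt"

# ls -l prints a "total" line first, so the plain line count is one too high
with_total=$(LC_ALL=C ls -l "$demo_dir" | wc -l | tr -d ' ')

# ^total matches lines starting with "total"; -v drops them before counting
without_total=$(LC_ALL=C ls -l "$demo_dir" | grep -v '^total' | wc -l | tr -d ' ')

echo "with total line: $with_total, without: $without_total"
```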


Versions Compared

Old Version 4

New Version 5
