...
Code Block | ||
---|---|---|
| ||
$ cd
$ cd Desktop/CUL-MWG-Workshop-master
$ pwd
/Users/Christina/Desktop/CUL-MWG-Workshop-master |
...
Code Block | ||
---|---|---|
| ||
$ cd CUL-MWG-Workshop-master |
Remember, if at any time you are not sure where you are in your directory structure, use the pwd command to find out:
...
Expand | ||
---|---|---|
| ||
From man wc, you will see that there is a -w flag to print the number of words:
-w The number of words in each input file is written to the standard output.
So to print the word counts of the .csv files:
$ wc -w *.csv
1804072 2017-ecommons-CU-etds.csv
And to sort the lines numerically:
$ wc -w *.csv | sort -n
15220 CUlecturetapes_metadata_ingest-ready.csv |
total |
1.7: Mining or searching
Searching for something in one or more files is something we'll often need to do, so let's introduce a command for doing that: grep (short for global regular expression print). As the name suggests, it supports regular expressions and is therefore only limited by your imagination, the shape of your data, and, when working with thousands or millions of files, the processing power at your disposal.
To begin using grep, first navigate to the CUL-MWG-Workshop-master directory if not already there. Then create a new results directory (cmh329_results in the examples below):
Code Block | ||
---|---|---|
| ||
$ mkdir cmh329_results |
Now let's try our first search:
Code Block | ||
---|---|---|
| ||
$ grep metadata *.csv |
Remember that the shell will expand *.csv to a list of all the .csv files in the directory. grep will then search these for instances of the string "metadata" and print the matching rows.
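To see the expansion for yourself, here is a minimal sketch. It assumes a throwaway directory with made-up file names (a.csv, b.csv, notes.txt), not the workshop data. echo prints exactly the arguments the shell hands to a command after the wildcard has been expanded:

```shell
# Hypothetical scratch directory and file names, for illustration only.
demo_dir=$(mktemp -d)
touch "$demo_dir/a.csv" "$demo_dir/b.csv" "$demo_dir/notes.txt"

# The shell replaces *.csv with the matching file names before the
# command (here echo, but the same is true for grep) ever runs.
expanded=$(cd "$demo_dir" && echo *.csv)
echo "$expanded"

rm -r "$demo_dir"
```

The file notes.txt is left out because it does not match the pattern.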
1.8: Strings
A string is a sequence of characters, or "a piece of text".
1.8.1: Introduction to grep
Press the up arrow once in order to cycle back to your most recent action. Amend grep metadata *.csv to grep -c metadata *.csv and hit enter.
Code Block | ||
---|---|---|
| ||
$ grep -c metadata *.csv
2017-ecommons-CU-etds.csv:3
2017-ecommons-CUL-community.csv:139
CUlecturetapes_metadata_ingest-ready.csv:0 |
The shell now prints the number of lines in each file that contain the string metadata. Compare this with the output from the previous command to see which fields the matches come from.
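It is worth knowing that -c counts matching lines rather than individual occurrences. The sketch below illustrates this on a small made-up file (sample.csv is an assumption, not part of the workshop data). It also uses the -o flag, which prints each match on its own line, to count occurrences for comparison:

```shell
# Hypothetical sample file, for illustration only.
demo_dir=$(mktemp -d)
printf 'metadata, more metadata\nno match here\nmetadata again\n' > "$demo_dir/sample.csv"

# -c counts lines: the first line contains the string twice but counts once.
line_count=$(grep -c metadata "$demo_dir/sample.csv")

# -o prints every match on a separate line, so wc -l counts occurrences.
# tr strips the padding that some versions of wc add.
occurrence_count=$(grep -o metadata "$demo_dir/sample.csv" | wc -l | tr -d ' ')

echo "matching lines: $line_count"     # 2
echo "occurrences: $occurrence_count"  # 3

rm -r "$demo_dir"
```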
We will try another search:
Code Block | ||
---|---|---|
| ||
$ grep -c 'Digital Archive' *.csv
2017-ecommons-CU-etds.csv:0
2017-ecommons-CUL-community.csv:3
CUlecturetapes_metadata_ingest-ready.csv:0 |
We got back the number of lines containing the string 'Digital Archive' within each file. Now, amend the above command to the below and observe how the output of each is different:
Code Block | ||
---|---|---|
| ||
$ grep -ci 'Digital Archive' *.csv
2017-ecommons-CU-etds.csv:0
2017-ecommons-CUL-community.csv:7
CUlecturetapes_metadata_ingest-ready.csv:0 |
This repeats the query, but prints a case insensitive count (including instances of both digital archive and Digital Archive and other variants). Note how the count for 2017-ecommons-CUL-community.csv has increased from 3 to 7. As before, cycling back and adding > cmh329_results/, followed by a filename (ideally in .txt format), will save the results to a data file. Go ahead and do this on your own.
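The following is a minimal sketch of the same two steps on a made-up file. The names sample.csv and matches.txt are assumptions for illustration, not workshop files. It compares a case sensitive count with a case insensitive one, then redirects the matching lines to a file:

```shell
# Hypothetical sample file, for illustration only.
demo_dir=$(mktemp -d)
printf 'Digital Archive\ndigital archive\nDIGITAL ARCHIVE\nprint archive\n' > "$demo_dir/sample.csv"

# Without -i only the exact capitalisation matches.
case_sensitive=$(grep -c 'Digital Archive' "$demo_dir/sample.csv")

# With -i all three capitalisation variants match.
case_insensitive=$(grep -ci 'Digital Archive' "$demo_dir/sample.csv")

# > sends the matching lines to a file instead of the screen.
grep -i 'Digital Archive' "$demo_dir/sample.csv" > "$demo_dir/matches.txt"
saved_lines=$(wc -l < "$demo_dir/matches.txt" | tr -d ' ')

echo "case sensitive: $case_sensitive"      # 1
echo "case insensitive: $case_insensitive"  # 3
echo "lines saved: $saved_lines"            # 3

rm -r "$demo_dir"
```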
1.8.2: CSV results from grep
So far we have counted strings in files and printed those counts to the shell or to a file. But the real power of grep is that you can also use it to create subsets of tabulated data (or indeed any data) from one or multiple files.
Code Block | ||
---|---|---|
| ||
$ grep -i metadata *.csv |
This command looks in the defined files and prints any lines containing metadata (without regard to case) to the shell.
Code Block | ||
---|---|---|
| ||
$ grep -i metadata *.csv > cmh329_results/2017-02-15_metadata-ecommons.csv |
This saves the subsetted data to file.
1.8.3: Whole words (-w) grep
Sometimes you need to capture the whole word only in that form (so revolution, but not revolutionary). The -w flag instructs grep to look for whole words only, giving us greater precision in our search.
Code Block | ||
---|---|---|
| ||
$ grep -iw revolution *.csv > cmh329_results/DATE_JAiw-revolution.csv |
This command looks in the defined files and exports any lines containing the whole word revolution (without regard to case) to the specified .csv file.
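To make the effect of -w concrete, here is a small sketch on a made-up file (sample.csv and its contents are assumptions for illustration). It counts lines that contain the string anywhere, then lines that contain it as a whole word:

```shell
# Hypothetical sample file, for illustration only.
demo_dir=$(mktemp -d)
printf 'the revolution began\na revolutionary idea\nRevolution!\ncounterrevolution\n' > "$demo_dir/sample.csv"

# Without -w every line matches, because each contains the substring.
substring_lines=$(grep -ci revolution "$demo_dir/sample.csv")

# With -w the match must be bounded by non-word characters (or line ends),
# so "revolutionary" and "counterrevolution" no longer count.
whole_word_lines=$(grep -ciw revolution "$demo_dir/sample.csv")

echo "substring matches: $substring_lines"    # 4
echo "whole word matches: $whole_word_lines"  # 2

rm -r "$demo_dir"
```

Punctuation such as the exclamation mark does not count as part of a word, which is why "Revolution!" still matches.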
We can show the difference between the files we created.
Code Block | ||
---|---|---|
| ||
$ wc -l cmh329_results/*.csv
186 cmh329_results/2017-02-15_metadata-ecommons.csv
162 cmh329_results/DATE_JAiw-revolution.csv
348 total |
Finally, we'll use the regular expression syntax covered earlier to search for similar words.
1.9: Basic and extended regular expressions
Unfortunately, there are both "basic" and "extended" regular expressions. This is a common cause of confusion, since most tutorials, including ours, teach extended regular expressions, but grep uses basic by default. Unless you want to remember the details, make your life easy by always using extended regular expressions (-E flag) when doing something more complex than searching for a plain string.
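One place where the two flavours differ is the repetition syntax {n}. The sketch below uses a made-up two-line file (sample.txt is an assumption) to show the same pattern behaving differently with and without -E:

```shell
# Hypothetical sample file, for illustration only.
demo_dir=$(mktemp -d)
printf 'aa\na{2}\n' > "$demo_dir/sample.txt"

# Basic syntax: the braces are ordinary characters, so the pattern
# matches the literal text a{2}.
basic_match=$(grep 'a{2}' "$demo_dir/sample.txt")

# Extended syntax: {2} means "exactly two of the preceding item",
# so the pattern matches aa.
extended_match=$(grep -E 'a{2}' "$demo_dir/sample.txt")

echo "basic:    $basic_match"     # a{2}
echo "extended: $extended_match"  # aa

rm -r "$demo_dir"
```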
The regular expression 'fr[ae]nc[eh]' will match "france", "french", but also "frence" and "franch". It's generally a good idea to enclose the expression in single quotation marks, since that ensures the shell sends it directly to grep without any processing (such as trying to expand the wildcard operator *).
Code Block | ||
---|---|---|
| ||
$ grep -iwE 'fr[ae]nc[eh]' *.csv |
The shell will print out each matching line.
We include the -o flag to print only the matching part of each line (handy for isolating and checking results):
Code Block | ||
---|---|---|
| ||
$ grep -iwEo 'fr[ae]nc[eh]' *.csv |
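As a quick check of how these flags combine, the sketch below runs the same pattern on a made-up file (sample.csv and its three lines are assumptions for illustration):

```shell
# Hypothetical sample file, for illustration only.
demo_dir=$(mktemp -d)
printf 'Trade between France and the French colonies\nBenjamin Franklin\nthe franchise\n' > "$demo_dir/sample.csv"

# -i ignores case, -w requires whole words, -E enables extended syntax,
# and -o prints each match on its own line instead of the whole line.
matches=$(grep -iwEo 'fr[ae]nc[eh]' "$demo_dir/sample.csv")
echo "$matches"

rm -r "$demo_dir"
```

This prints France and French on separate lines. "Franklin" does not fit the pattern at all, and "franchise" is rejected by -w because the match would stop in the middle of a word.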
Pair up with your neighbor and work on these exercises:
1.10: Case sensitive search
Search for all case sensitive instances of a word you choose in all the csv files in this directory. Print your results to the shell.
Code Block | ||
---|---|---|
| ||
$ grep DCAPS *.csv |
1.11: Exercises
Run some of the now-possible searches on the sample metadata. How can you combine them with regular expressions as well as the output options above? Such as:
- How do you run a search only on files that are from 2017 and a csv?
- How do you count all case sensitive instances of a word you choose in select csv files?
- How do you count all case insensitive instances of a word you choose in select csv files?
- Search for all case insensitive instances of a select word in the csv files in this directory. Print your results to a new csv file.
- Search for all case insensitive instances of a whole word in the csv files in this directory. Print your results to a new csv file.
- Use regular expressions to find all ISSN numbers, unis, or other type identifiers with a regular structure in the csv files.
1.12: Searching with regular expressions
Use regular expressions to find all ISSN numbers (four digits followed by hyphen followed by four digits) in 2014-01_JA.tsv and print the results to a file results/issns.tsv.
Solution
$ grep -E '[0-9]{4}-[0-9]{4}' 2014-01_JA.tsv > results/issns.tsv
If you came up with something more advanced, perhaps including word boundaries, please share your result on the Etherpad and give yourself a pat on the shoulder.
Finding unique values
The uniq command filters out repeated lines, but only when they are adjacent, so sort the input first. Try piping the output from the command in the last exercise to sort, then to uniq, and then to wc -l to count the number of unique ISSN values.
Code Block | |
---|---|
| |
$ grep -Eo '[0-9]{4}-[0-9]{4}' 2014-01_JA.tsv | sort | uniq | wc -l |
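Because uniq only collapses duplicates that sit next to each other, unsorted input can give a misleading count. A minimal sketch on made-up identifiers (issns.txt and its values are assumptions, not real ISSNs from the data):

```shell
# Hypothetical list of identifiers, for illustration only.
demo_dir=$(mktemp -d)
printf '1234-5678\n0000-1111\n1234-5678\n' > "$demo_dir/issns.txt"

# The two copies of 1234-5678 are not adjacent, so uniq keeps both.
without_sort=$(uniq "$demo_dir/issns.txt" | wc -l | tr -d ' ')

# Sorting first puts the copies next to each other, so uniq removes one.
with_sort=$(sort "$demo_dir/issns.txt" | uniq | wc -l | tr -d ' ')

echo "uniq alone: $without_sort"     # 3
echo "sort then uniq: $with_sort"    # 2

rm -r "$demo_dir"
```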
1.13: Counting number of files, part II
In the earlier counting exercise in this lesson, we tried counting the number of files and directories in the current directory.
Recall that the command ls -l | wc -l took us quite far, but the result was one too high because it included the "total" line in the line count.
With the knowledge of grep, can you figure out how to exclude the "total" line from the ls -l output?
Hint: You want to exclude any line starting with the text "total". The hat character (^) is used in regular expressions to indicate the start of a line.
Solution
To find any lines starting with "total", we would use:
...