File encodings can be tricky. These days, most files are UTF-8 encoded. To understand some of the magic behind UTF-8, check out this excellent video. Nonetheless, there are times when we receive a file that isn’t in this format, which can lead to some wonky attempts at swapping the encoding. Here, iconv is a lifesaver: it is a simple program that takes text in one encoding and outputs it in another.

 
Code Block
languagebash
# Convert from (-f) Latin-1 (ISO-8859-1)
# to (-t) UTF-8
iconv -f ISO-8859-1 -t UTF-8 < input.txt > output.txt

 

Useful options:

Code Block
languagebash
# List all known encodings
iconv -l
 
# Silently discard characters that cannot be converted
iconv -c

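One way to convince yourself that a conversion worked is to build a tiny file in a known encoding and compare byte counts before and after. The following is only a sketch: it assumes your iconv accepts the names ISO-8859-1 and UTF-8 (check with iconv -l), and the scratch directory and file names are made up for illustration.

```shell
# Scratch directory so the example does not touch real data
workdir=$(mktemp -d)

# "café" in Latin-1: é is the single byte 0xE9 (octal 351)
printf 'caf\351\n' > "$workdir/latin1.txt"

# Convert from Latin-1 to UTF-8
iconv -f ISO-8859-1 -t UTF-8 < "$workdir/latin1.txt" > "$workdir/utf8.txt"

# é takes two bytes in UTF-8, so the file grows by one byte
latin1_bytes=$(wc -c < "$workdir/latin1.txt" | tr -d ' ')
utf8_bytes=$(wc -c < "$workdir/utf8.txt" | tr -d ' ')
echo "latin1: $latin1_bytes bytes, utf8: $utf8_bytes bytes"

# Converting back should reproduce the original file byte for byte
iconv -f UTF-8 -t ISO-8859-1 < "$workdir/utf8.txt" \
    | cmp - "$workdir/latin1.txt" && echo "round trip OK"
```

If the byte counts do not change for a file that contains accented characters, the source encoding given to -f is probably not the right one.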
 

HEAD

If you are a frequent Pandas user, then head will be familiar. Often when dealing with new data, the first thing we want to do is get a sense of what exists. This leads to firing up Pandas, reading in the data and then calling df.head(): strenuous, to say the least. Head, without any flags, prints the first 10 lines of a file. The true power of head lies in testing out cleaning operations. For instance, if we wanted to change the delimiter of a file from commas to pipes, one quick test would be: head mydata.csv | sed 's/,/|/g'.

Code Block
languagebash
# Prints out first 10 lines
head filename.csv
 
# Prints first 3 lines
head -n 3 filename.csv

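The delimiter test mentioned above can be tried end to end on a throwaway file. This is a minimal sketch; the sample rows and the scratch directory are invented for illustration, and mydata.csv stands in for whatever file you are exploring.

```shell
# Scratch directory with a small hypothetical CSV: a header plus four rows
workdir=$(mktemp -d)
printf 'id,name,city\n1,Ann,Oslo\n2,Bob,Lima\n3,Cy,Rome\n4,Di,Pune\n' > "$workdir/mydata.csv"

# Try the substitution on the first 3 lines only; the file itself is not modified
preview=$(head -n 3 "$workdir/mydata.csv" | sed 's/,/|/g')
echo "$preview"
# id|name|city
# 1|Ann|Oslo
# 2|Bob|Lima
```

Once the preview looks right, the same sed expression can be run on the whole file with its output redirected to a new file.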
 

Useful options:

Code Block
languagebash
# Print a specific number of lines
head -n
 
# Print a specific number of bytes
head -c

 

TR

tr is short for translate. This powerful utility is a workhorse for basic file cleaning; an ideal use case is swapping out the delimiters within a file.

Code Block
languagebash
# Convert a tab-delimited file into a comma-delimited one
cat tab_delimited.txt | tr '\t' ',' > comma_delimited.csv

 

Another feature of tr is the set of built-in [:class:] character classes at your disposal. These include:

Code Block
languagebash
[:alnum:] all letters and digits
[:alpha:] all letters
[:blank:] all horizontal whitespace
[:cntrl:] all control characters
[:digit:] all digits
[:graph:] all printable characters, not including space
[:lower:] all lower case letters
[:print:] all printable characters, including space
[:punct:] all punctuation characters
[:space:] all horizontal or vertical whitespace
[:upper:] all upper case letters
[:xdigit:] all hexadecimal digits

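As a small illustration of the classes listed above, the sketch below filters and maps characters by class. The input strings are invented; -c (complement the set) and -d (delete) are standard tr options.

```shell
# Keep only digits: -c complements the set, -d deletes everything in it
digits=$(printf 'Order #A-1047, qty 3\n' | tr -cd '[:digit:]')
echo "$digits"        # 10473

# Upper-case a string by mapping one class onto the other
shout=$(printf 'hello world\n' | tr '[:lower:]' '[:upper:]')
echo "$shout"         # HELLO WORLD
```
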
You can chain a variety of these together to compose powerful programs. The following is a basic word frequency program you could use to check your READMEs for overused words.

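Code Block
languagebash
# Put each word on its own line, lower-case it, then count occurrences (most frequent first)
cat README.md | tr "[:punct:][:space:]" "\n" | tr "[:upper:]" "[:lower:]" | grep . | sort | uniq -c | sort -nr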

Another example, this time using character ranges (tr works on sets of characters, not regular expressions):

Code Block
languagebash
# Convert all upper case letters to lower case
cat filename.csv | tr 'A-Z' 'a-z'

 

Useful options:

Code Block
languagebash
# Delete characters
tr -d

# Squeeze repeated characters into one
tr -s

# Escape sequences recognized inside a set:
# \b    backspace
# \f    form feed
# \v    vertical tab
# \NNN  character with octal value NNN

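To make the -d and -s options concrete, here is a short sketch on made-up input: the first command strips Windows carriage returns, and the second collapses runs of spaces.

```shell
# Remove carriage returns so the line ends in a plain newline
unix_line=$(printf 'a,b,c\r\n' | tr -d '\r')
echo "$unix_line"     # a,b,c

# Collapse each run of spaces into a single space
squeezed=$(printf 'too    many   spaces\n' | tr -s ' ')
echo "$squeezed"      # too many spaces
```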
 

WC

wc is short for word count, but its value comes primarily from the -l flag, which gives you the line count.

Code Block
languagebash
# Will return number of lines in CSV
wc -l gigantic_comma.csv

 

This tool comes in handy to confirm the output of various commands. So, if we were to convert the delimiters within a file and then run wc -l we would expect the total lines to be the same. If not, then we know something went wrong.

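The check described above can be scripted. The sketch below uses a tiny invented tab-delimited file in a scratch directory; reading through < makes wc print only the number, without the file name.

```shell
# Scratch directory with a hypothetical three-line tab-delimited file
workdir=$(mktemp -d)
printf 'a\tb\n1\t2\n3\t4\n' > "$workdir/tab_delimited.txt"

# Convert tabs to commas
tr '\t' ',' < "$workdir/tab_delimited.txt" > "$workdir/comma_delimited.csv"

# Count lines before and after the conversion
before=$(wc -l < "$workdir/tab_delimited.txt" | tr -d ' ')
after=$(wc -l < "$workdir/comma_delimited.csv" | tr -d ' ')

if [ "$before" -eq "$after" ]; then
    echo "line counts match: $after"
else
    echo "line count changed: $before -> $after" >&2
fi
```
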
Useful options:

Code Block
languagebash
# Print the byte counts
wc -c
 
# Print the character counts
wc -m
 
# Print length of longest line
wc -L
 
# Print word counts
wc -w

 

SPLIT

File sizes can vary dramatically, and depending on the job it can be beneficial to break a file into smaller pieces; hence split. The basic syntax for split is:

Code Block
languagebash
# Split our CSV into pieces of 500 lines each, named with the prefix new_filename_
split -l 500 filename.csv new_filename_

# Resulting files (filename.csv itself is left untouched):
# filename.csv
# new_filename_aa
# new_filename_ab
# new_filename_ac

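A quick way to confirm that nothing was lost is to concatenate the pieces and compare them with the original. This is a sketch on generated data: the 1200-line file is invented, and seq is assumed to be available (it is on most Linux and macOS systems).

```shell
# Scratch directory with a hypothetical 1200-line file
workdir=$(mktemp -d)
seq 1 1200 > "$workdir/filename.csv"

# Split into pieces of at most 500 lines each
split -l 500 "$workdir/filename.csv" "$workdir/new_filename_"

# Line count per piece: 500, 500 and the remaining 200
wc -l "$workdir"/new_filename_*

# The glob expands in sorted order, which matches the order of split's suffixes
cat "$workdir"/new_filename_* | cmp - "$workdir/filename.csv" \
    && echo "pieces reassemble to the original"
```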
 

Two quirks are the naming convention and the lack of file extensions. The suffixes can be made numeric with the -d flag. To add file extensions, you’ll need to run the following find command. It renames ALL files in the current directory and its subdirectories by appending .csv, including the original file, so be careful.

Code Block
languagebash
find . -type f -exec mv '{}' '{}'.csv \;


# filename.csv.csv
# new_filename_aa.csv
# new_filename_ab.csv
# new_filename_ac.csv

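Because the find command renames every file it can reach, a more conservative option is to loop over only the pieces that split produced. The sketch below assumes the new_filename_ prefix from the example and works in a scratch directory on generated data. GNU split can also add the extension itself with --additional-suffix=.csv, but that flag is not available in every implementation.

```shell
# Scratch directory with a hypothetical 1200-line file, split as before
workdir=$(mktemp -d)
seq 1 1200 > "$workdir/filename.csv"
split -l 500 "$workdir/filename.csv" "$workdir/new_filename_"

# Rename only the split pieces; filename.csv is left alone
for piece in "$workdir"/new_filename_*; do
    mv "$piece" "$piece.csv"
done

ls "$workdir"
```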
 

Useful options:

Code Block
languagebash
# Split by certain byte size
split -b
 
# Generate suffixes of length N
split -a
 
# Split using hex suffixes
split -x
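
As an illustration of the -b and -a options, the following sketch splits a file by size and asks for three-character suffixes. The file name, prefix and sizes are invented, and head -c is assumed to be available as listed in the HEAD section.

```shell
# Scratch directory with 2500 bytes of filler
workdir=$(mktemp -d)
head -c 2500 /dev/zero > "$workdir/blob.bin"

# 1000-byte pieces with three-character suffixes: chunk_aaa, chunk_aab, chunk_aac
split -b 1000 -a 3 "$workdir/blob.bin" "$workdir/chunk_"

# Byte count per piece: 1000, 1000 and the remaining 500
wc -c "$workdir"/chunk_*
```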