...

File encodings can be tricky. These days most files are UTF-8 encoded. To understand some of the magic behind UTF-8, check out this excellent video. Nonetheless, there are times when we receive a file that isn’t in this format, which can lead to some wonky attempts at swapping the encoding scheme. Here, iconv is a lifesaver: it is a simple program that takes text in one encoding and outputs it in another.

Code Block

language	bash

# Convert from (-f) Latin-1 (ISO-8859-1)
# to (-t) UTF-8
iconv -f ISO-8859-1 -t UTF-8 < input.txt > output.txt

Useful options:

Code Block

language	bash

# List all known encodings
iconv -l
 
# Silently discard characters that cannot be converted
iconv -c
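To see what the conversion and the -c flag actually do to the bytes, here is a small sketch. The file names are made up for illustration, and it assumes your iconv build knows the ISO-8859-1, UTF-8 and ASCII encodings (check with iconv -l).

```shell
# Hypothetical file names; work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Write "café" as ISO-8859-1: the é is the single byte 0xE9 (octal 351).
printf 'caf\351\n' > latin1_sample.txt

# Convert to UTF-8; the é becomes the two bytes 0xC3 0xA9.
iconv -f ISO-8859-1 -t UTF-8 < latin1_sample.txt > utf8_sample.txt

# Byte counts: 5 for the Latin-1 file, 6 for the UTF-8 file.
wc -c < latin1_sample.txt
wc -c < utf8_sample.txt

# ASCII has no é, so -c drops it instead of aborting the conversion.
# Some iconv builds may still exit non-zero when they drop characters,
# so the exit status is ignored here.
iconv -c -f UTF-8 -t ASCII < utf8_sample.txt > ascii_sample.txt || true
cat ascii_sample.txt
```

Comparing byte counts before and after is a cheap way to confirm that a conversion really changed the encoding rather than just copying the file.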

HEAD

If you are a frequent Pandas user then head will be familiar. Often when dealing with new data, the first thing we want to do is get a sense of what it contains. This leads to firing up Pandas, reading in the data and then calling df.head(), which is strenuous, to say the least. Without any flags, head prints the first 10 lines of a file. The true power of head lies in testing out cleaning operations. For instance, if we wanted to change the delimiter of a file from commas to pipes, one quick test would be: head mydata.csv | sed 's/,/|/g'.

Code Block

language	bash

# Prints out first 10 lines
head filename.csv
 
# Prints first 3 lines
head -n 3 filename.csv
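The delimiter test described earlier can be tried on a tiny made-up file before touching real data. The file name and its contents below are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical four-line CSV standing in for real data.
printf 'id,name,city\n1,Ann,Oslo\n2,Bo,Lima\n3,Cy,Rome\n' > mydata.csv

# Preview the change on the first two lines only; mydata.csv is not modified.
head -n 2 mydata.csv | sed 's/,/|/g'
# id|name|city
# 1|Ann|Oslo
```

Because head stops reading after the requested lines, the preview stays fast even when the file is very large.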

Useful options:

Code Block

language	bash

# Print a specific number of lines
head -n
 
# Print a specific number of bytes
head -c

TR

tr is short for translate. This powerful utility is a workhorse for basic file cleaning. An ideal use case is swapping out the delimiters within a file.

Code Block

language	bash

# Converting a tab-delimited file into a comma-delimited one
cat tab_delimited.txt | tr '\t' ',' > comma_delimited.csv

Another feature of tr is the set of built-in [:class:] character classes at your disposal. These include:

Code Block

language	bash

[:alnum:] all letters and digits
[:alpha:] all letters
[:blank:] all horizontal whitespace
[:cntrl:] all control characters
[:digit:] all digits
[:graph:] all printable characters, not including space
[:lower:] all lower case letters
[:print:] all printable characters, including space
[:punct:] all punctuation characters
[:space:] all horizontal or vertical whitespace
[:upper:] all upper case letters
[:xdigit:] all hexadecimal digits

You can chain a variety of these together to compose powerful programs. The following is a basic word-frequency program you could use to check your READMEs for overused words.

Code Block
cat README.md | tr "[:punct:][:space:]" "\n" | tr "[:upper:]" "[:lower:]" | grep . | sort | uniq -c | sort -nr
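To make the stages of that pipeline concrete, here is a sketch run against a one-line stand-in for a README. The file sample.md and its text are invented for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for a README.
printf 'The cat saw the dog. The dog ran!\n' > sample.md

# 1. Turn punctuation and whitespace into newlines (one word per line).
# 2. Lowercase everything so "The" and "the" count as the same word.
# 3. grep . drops the empty lines left behind by step 1.
# 4. sort | uniq -c counts each distinct word.
# 5. sort -nr puts the most frequent word first; head keeps only that one.
tr '[:punct:][:space:]' '\n' < sample.md \
  | tr '[:upper:]' '[:lower:]' \
  | grep . \
  | sort \
  | uniq -c \
  | sort -nr \
  | head -n 1
# Top entry: "the" with a count of 3
```

The amount of padding uniq -c puts in front of the count differs between implementations, so it is safer to parse the output by field than by column position.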

Another example, this time using character ranges (tr works on sets of characters, not regular expressions):

Code Block

language	bash

# Converting all upper case letters to lower case
# (no square brackets needed: tr would treat them as literal characters)
cat filename.csv | tr 'A-Z' 'a-z'

Useful options:

Code Block

language	bash

# Delete characters
tr -d
 
# Squeeze repeated characters
tr -s
 
# Escape sequences tr understands inside a set:
# \b    backspace
# \f    form feed
# \v    vertical tab
# \NNN  character with octal value NNN
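A few quick illustrations of -s and -d on throwaway input; the strings are arbitrary examples rather than anything from a real data set.

```shell
# Squeeze runs of spaces down to a single space.
printf 'a   b    c\n' | tr -s ' '
# a b c

# Delete every digit.
printf 'abc123def\n' | tr -d '[:digit:]'
# abcdef

# Strip carriage returns from Windows-style line endings, then count the
# remaining bytes: 10 bytes go in, 8 come out.
printf 'one\r\ntwo\r\n' | tr -d '\r' | wc -c
```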

WC

wc is short for word count. Most of its value comes from the -l flag, which gives you the line count.

Code Block

language	bash

# Will return number of lines in CSV
wc -l gigantic_comma.csv

This tool comes in handy to confirm the output of various commands. So, if we were to convert the delimiters within a file and then run wc -l we would expect the total lines to be the same. If not, then we know something went wrong.
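That sanity check can be scripted. The sketch below builds a small hypothetical tab-delimited file, converts it with tr, and compares the line counts before and after.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical tab-delimited input with three lines.
printf 'a\tb\n1\t2\n3\t4\n' > tab_delimited.txt

# Convert tabs to commas.
tr '\t' ',' < tab_delimited.txt > comma_delimited.csv

# Reading from stdin makes wc print only the number, without the file name.
before=$(wc -l < tab_delimited.txt)
after=$(wc -l < comma_delimited.csv)

if [ "$before" -eq "$after" ]; then
  echo "line counts match: $after"
else
  echo "line count changed: $before -> $after" >&2
fi
```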

Useful options:

Code Block

language	bash

# Print the byte counts
wc -c
 
# Print the character counts
wc -m
 
# Print length of longest line
wc -L
 
# Print word counts
wc -w

SPLIT

File sizes can vary dramatically, and depending on the job it can be beneficial to break a file into smaller pieces, hence split. The basic syntax for split is:

Code Block

language	bash

# Split the CSV into pieces of 500 lines each,
# named with the prefix new_filename_
split -l 500 filename.csv new_filename_


# filename.csv

# new_filename_aa
# new_filename_ab
# new_filename_ac
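One way to confirm that nothing was lost is to concatenate the pieces and compare the result with the original. This is a minimal sketch on a generated 1,200-line file; the contents are hypothetical, and it assumes seq and cmp are available.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical 1,200-line file.
seq 1 1200 > filename.csv

# 500 lines per piece gives three pieces: 500, 500 and 200 lines.
split -l 500 filename.csv new_filename_
wc -l new_filename_*

# The shell expands the glob in sorted order, so concatenating the pieces
# should reproduce the original byte for byte.
cat new_filename_* | cmp - filename.csv && echo "pieces match the original"
```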

Two quirks are the naming convention and the lack of file extensions. The suffixes can be made numeric via the -d flag. To add file extensions, you’ll need to run the following find command. It will rename ALL files within the current directory, including those in subdirectories and the original filename.csv, by appending .csv, so be careful.

Code Block

language	bash

find . -type f -exec mv '{}' '{}'.csv \;


# filename.csv.csv
# new_filename_aa.csv
# new_filename_ab.csv
# new_filename_ac.csv
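If renaming every file in the directory is too blunt, a loop over just the split pieces is a gentler alternative. This is a sketch, not the only way to do it; the file names mirror the example above and the input is generated.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical input, split into three pieces.
seq 1 1200 > filename.csv
split -l 500 filename.csv new_filename_

# Rename only the files that match the split prefix, leaving filename.csv alone.
for piece in new_filename_*; do
  mv "$piece" "$piece.csv"
done

ls
```

With GNU split, passing --additional-suffix=.csv should produce the same names in a single step, but that flag is not available in every split implementation.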

Useful options:

Code Block

language	bash

# Split by certain byte size
split -b
 
# Generate suffixes of length N
split -a
 
# Split using hex suffixes
split -x
