...
File encodings can be tricky. These days most files are UTF-8 encoded. To understand some of the magic behind UTF-8, check out this excellent video. Nonetheless, there are times when we receive a file that isn’t in this format, which can lead to some wonky attempts at swapping the encoding scheme. Here, iconv is a lifesaver: a simple program that takes text in one encoding and outputs it in another.
Code Block | ||
---|---|---|
| ||
# Converting -f (from) latin1 (ISO-8859-1)
# -t (to) standard UTF-8
iconv -f ISO-8859-1 -t UTF-8 < input.txt > output.txt |
Useful options:
- iconv -l list all known encodings
- iconv -c silently discard characters that cannot be converted
Code Block | ||
---|---|---|
| ||
# List all known encodings
iconv -l
# Silently discard characters that cannot be converted
iconv -c |
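As a rough illustration of the conversion above, the sketch below builds a tiny Latin-1 file, converts it, and then checks that the result is well-formed UTF-8. The temporary directory and the file names `latin1.txt` and `utf8.txt` are made up for the example, and the validity check relies on iconv exiting non-zero on malformed input, which is how common implementations behave.

```shell
# Hypothetical scratch files for the demo
workdir=$(mktemp -d)

# Byte 0xE9 (octal 351) is "é" in ISO-8859-1
printf 'caf\351\n' > "$workdir/latin1.txt"

# Convert from Latin-1 to UTF-8
iconv -f ISO-8859-1 -t UTF-8 < "$workdir/latin1.txt" > "$workdir/utf8.txt"

# Converting UTF-8 to UTF-8 works as a validity check:
# iconv fails if the input is not well-formed UTF-8
if iconv -f UTF-8 -t UTF-8 < "$workdir/utf8.txt" > /dev/null; then
  status="valid UTF-8"
else
  status="invalid UTF-8"
fi
echo "$status"
```

The converted file is one byte longer than the original because "é" takes two bytes in UTF-8.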
HEAD
If you are a frequent Pandas user then head will be familiar. Often when dealing with new data, the first thing we want to do is get a sense of what exists. This leads to firing up Pandas, reading in the data, and then calling df.head(), which is strenuous, to say the least. Head, without any flags, prints the first 10 lines of a file. The true power of head lies in testing out cleaning operations. For instance, if we wanted to change the delimiter of a file from commas to pipes, one quick test would be: head mydata.csv | sed 's/,/|/g'.
Code Block | ||
---|---|---|
| ||
# Prints out first 10 lines
head filename.csv
# Prints first 3 lines
head -n 3 filename.csv |
Useful options:
Code Block | ||
---|---|---|
| ||
# Print a specific number of lines
head -n
# Print a specific number of bytes
head -c |
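The delimiter test mentioned earlier can be tried on a throwaway file before touching real data. This is only a sketch: the sample rows and the scratch copy of `mydata.csv` are invented for the example.

```shell
# Hypothetical sample data in a scratch directory
workdir=$(mktemp -d)
printf 'id,name,city\n1,Ada,London\n2,Linus,Helsinki\n3,Grace,Arlington\n' > "$workdir/mydata.csv"

# Preview the cleaning step on the first two lines only
preview=$(head -n 2 "$workdir/mydata.csv" | sed 's/,/|/g')
echo "$preview"
```

Because head stops after the requested lines, the preview stays fast even when the real file is very large.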
TR
Tr is short for translate. This powerful utility is a workhorse for basic file cleaning. An ideal use case is swapping out the delimiters within a file.
Code Block | ||
---|---|---|
| ||
# Converting a tab delimited file into commas
cat tab_delimited.txt | tr "\t" "," > comma_delimited.csv |
Another feature of tr is the set of built-in [:class:] character classes at your disposal. These include:
Code Block | ||
---|---|---|
| ||
[:alnum:] all letters and digits
[:alpha:] all letters
[:blank:] all horizontal whitespace
[:cntrl:] all control characters
[:digit:] all digits
[:graph:] all printable characters, not including space
[:lower:] all lower case letters
[:print:] all printable characters, including space
[:punct:] all punctuation characters
[:space:] all horizontal or vertical whitespace
[:upper:] all upper case letters
[:xdigit:] all hexadecimal digits |
You can chain a variety of these together to compose powerful programs. The following is a basic word frequency program you could use to check your READMEs for overused words.
Code Block |
---|
cat README.md | tr "[:punct:][:space:]" "\n" | tr "[:upper:]" "[:lower:]" | grep . | sort | uniq -c | sort -nr |
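To see what the pipeline produces, here is a small sketch that runs the same stages on an inline sentence instead of a README. The sample text and the variable names are invented for the example, and awk is used only to pull the word out of the top line.

```shell
# Hypothetical sample text standing in for a README
text='The cat. The dog, the bird!'

# Same stages as above: split into words, lowercase, drop blanks, count, rank
counts=$(printf '%s\n' "$text" \
  | tr "[:punct:][:space:]" "\n" \
  | tr "[:upper:]" "[:lower:]" \
  | grep . \
  | sort \
  | uniq -c \
  | sort -nr)

# The first line holds the most frequent word; field 2 is the word itself
top_word=$(printf '%s\n' "$counts" | head -n 1 | awk '{print $2}')
echo "$top_word"
```

The sort before uniq -c matters: uniq only collapses adjacent duplicates, so unsorted input would be undercounted.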
Another example using character ranges:
Code Block | ||
---|---|---|
| ||
# Converting all upper case letters to lower case
cat filename.csv | tr 'A-Z' 'a-z' |
Useful options and escape sequences:
Code Block | ||
---|---|---|
| ||
# Delete characters
tr -d
# Squeeze characters
tr -s
# Backspace
\b
# Form feed
\f
# Vertical tab
\v
# Characters with octal value NNN
\NNN |
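A short sketch of these in action, using inline strings rather than files. The sample inputs and variable names are made up for illustration.

```shell
# -d deletes every occurrence of the listed characters (here: carriage returns)
cleaned=$(printf 'a,b\r\nc,d\r\n' | tr -d '\r')

# -s squeezes each run of a repeated character down to one (here: spaces)
squeezed=$(printf 'a    b   c\n' | tr -s ' ')

# \NNN names a character by its octal value: \011 is a tab
commas=$(printf 'x\ty\n' | tr '\011' ',')

printf '%s\n' "$cleaned" "$squeezed" "$commas"
```

Deleting carriage returns this way is a common fix for files that arrive with Windows line endings.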
WC
Word count. Its value is primarily derived from the -l flag, which will give you the line count.
Code Block | ||
---|---|---|
| ||
# Will return number of lines in CSV
wc -l gigantic_comma.csv |
This tool comes in handy to confirm the output of various commands. So, if we were to convert the delimiters within a file and then run wc -l we would expect the total lines to be the same. If not, then we know something went wrong.
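That check might look something like the sketch below, which converts a small tab-delimited file and compares line counts before and after. The scratch files `in.tsv` and `out.csv` are hypothetical.

```shell
# Hypothetical three-line tab-delimited sample
workdir=$(mktemp -d)
printf 'a\tb\n1\t2\n3\t4\n' > "$workdir/in.tsv"

# Swap the delimiter
tr '\t' ',' < "$workdir/in.tsv" > "$workdir/out.csv"

# Reading from stdin keeps the file name out of wc's output;
# the arithmetic expansion strips any padding some wc versions add
before=$(( $(wc -l < "$workdir/in.tsv") ))
after=$(( $(wc -l < "$workdir/out.csv") ))

if [ "$before" -eq "$after" ]; then
  echo "line counts match: $after"
else
  echo "line count changed: $before -> $after" >&2
fi
```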
Useful options:
Code Block | ||
---|---|---|
| ||
# Print the byte counts
wc -c
# Print the character counts
wc -m
# Print length of longest line
wc -L
# Print word counts
wc -w |
SPLIT
File sizes can vary dramatically, and depending on the job, it can be beneficial to split a file into smaller pieces, hence split. The basic syntax for split is:
Code Block | ||
---|---|---|
| ||
# Split our CSV into 500-line chunks named with the prefix new_filename_
split -l 500 filename.csv new_filename_
# filename.csv
# new_filename_aa
# new_filename_ab
# new_filename_ac |
Two quirks are the naming convention and the lack of file extensions. The suffixes can be made numeric via the -d flag. To add file extensions, you’ll need to run the following find command. It renames ALL files within the current directory and its subdirectories by appending .csv, so be careful.
Code Block | ||
---|---|---|
| ||
find . -type f -exec mv '{}' '{}'.csv \;
# filename.csv.csv
# new_filename_aa.csv
# new_filename_ab.csv
# new_filename_ac.csv |
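If renaming every file is too blunt, one alternative is to loop over only the chunks that split produced. This is a sketch rather than the article's method: it swaps the find command for a shell glob, and the scratch directory and the 1200-line sample file are invented for the example.

```shell
# Hypothetical 1200-line input in a scratch directory
workdir=$(mktemp -d)
seq 1 1200 > "$workdir/filename.csv"

# Split into 500-line chunks that share the prefix new_filename_
split -l 500 "$workdir/filename.csv" "$workdir/new_filename_"

# Rename only files matching the prefix, leaving filename.csv alone
for chunk in "$workdir"/new_filename_*; do
  mv -- "$chunk" "$chunk.csv"
done

ls "$workdir"
```

Matching on the prefix means the original file and anything else in the directory keep their names.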
Useful options:
Code Block | ||
---|---|---|
| ||
# Split by certain byte size
split -b
# Generate suffixes of length N
split -a
# Split using hex suffixes
split -x |
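As a small sketch of the first two options together, the example below splits a file by size and asks for three-letter suffixes. The scratch directory, the `blob.bin` file, and the `part_` prefix are hypothetical, and `head -c` is used only to fabricate 2500 bytes of input.

```shell
# Hypothetical 2500-byte input file
workdir=$(mktemp -d)
head -c 2500 /dev/zero > "$workdir/blob.bin"

# -b 1024 caps each chunk at 1024 bytes; -a 3 asks for three-letter suffixes
split -b 1024 -a 3 "$workdir/blob.bin" "$workdir/part_"

# Expect two full chunks and one smaller remainder
chunk_count=$(ls "$workdir"/part_* | wc -l | tr -d ' ')
echo "$chunk_count"
```

Note that splitting by bytes can cut a line in half, so -l is usually the safer choice for delimited text.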