On Linux, unlike on the VAX, file names and directory names are case-sensitive. You can press Tab to complete a file or directory name instead of typing it in full. Press Tab twice to list the possible completions.

Working with files and directories from the Linux command line:

Command	Description
pwd	Show current directory
cd ~	Change to your home directory (same as your W: drive on Windows)
cd dirname	Change directory to a subfolder named dirname
cd ..	Go up a directory
ls	List files and folders in the current directory (with color highlighting where ls is aliased to ls --color=auto)
dir	List files and folders in the current directory without color highlighting
mkdir dirname	Make a new subfolder named dirname
rmdir dirname	Delete the subfolder named dirname (the folder must be empty)
rm filename	Delete the file named filename
rm -rf dirname	Recursively delete the non-empty folder named dirname and all files and folders it contains, without asking for confirmation
mv oldname newname	Rename a file or directory from oldname to newname
mv src destination	Move a file or directory from src to destination
cp src destination	Copies a file from src to destination
cp -av src destination	Recursively copies a folder from src to destination
xdg-open .	Open a GUI file browser in the current directory
xdg-open file	Open a file

Command Line Tricks for Data Scientists

This is reproduced from: https://medium.com/@kadek/command-line-tricks-for-data-scientists-c98e0abe5da .

ICONV

File encodings can be tricky. Most files these days are UTF-8 encoded. To understand some of the magic behind UTF-8, check out this excellent video. Nonetheless, there are times when we receive a file that isn't in this format, which can lead to some wonky attempts at swapping the encoding. Here, iconv is a lifesaver: it is a simple program that takes text in one encoding and outputs the same text in another.

# Convert from (-f) Latin-1 (ISO-8859-1)
# to (-t) standard UTF-8
iconv -f ISO-8859-1 -t UTF-8 < input.txt > output.txt

Useful options:

# List all known encodings
iconv -l
 
# Silently discard characters that cannot be converted
iconv -c
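
To see what the conversion actually does to the bytes, here is a small sketch that builds a one-line Latin-1 file by hand and converts it. The file names latin1.txt and utf8.txt are made up for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Write "café" in ISO-8859-1: the é is the single byte 0xE9 (octal 351).
printf 'caf\351\n' > latin1.txt

# Convert to UTF-8, where é becomes the two bytes 0xC3 0xA9.
iconv -f ISO-8859-1 -t UTF-8 < latin1.txt > utf8.txt

# Compare the sizes: the UTF-8 file is one byte longer.
wc -c < latin1.txt
wc -c < utf8.txt
```

Converting utf8.txt back with -f UTF-8 -t ISO-8859-1 should reproduce latin1.txt byte for byte, which is a quick way to check that nothing was lost.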

HEAD

Often when dealing with new data the first thing we want to do is get a sense of what exists. Head, without any flags, will print out the first 10 lines of a file. The true power of head lies in testing out cleaning operations, for instance previewing a change of delimiter from commas to pipes on a few lines before running it on the whole file.

# Prints out first 10 lines
head filename.csv
 
# Prints first 3 lines
head -n 3 filename.csv

Useful options:

# Print a specific number of lines
head -n
 
# Print a specific number of bytes
head -c
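
The delimiter preview mentioned above could look like the following sketch. The sample data is invented, and the approach assumes no field contains a quoted comma, since tr replaces every comma it sees.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A tiny stand-in for a real CSV file
printf 'id,name,score\n1,ada,90\n2,bob,85\n3,cy,70\n' > filename.csv

# Try the cleaning step on the first 3 lines only
head -n 3 filename.csv | tr ',' '|'
```

Once the preview looks right, the same tr command can be run on the whole file.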

TR

Tr is short for translate. This powerful utility is a workhorse for basic file cleaning. An ideal use case is swapping out the delimiters within a file.

# Converting a tab-delimited file into a comma-delimited one
cat tab_delimited.txt | tr '\t' ',' > comma_delimited.csv

Another feature of tr is the set of built-in [:class:] character classes at your disposal. These include:

[:alnum:] all letters and digits
[:alpha:] all letters
[:blank:] all horizontal whitespace
[:cntrl:] all control characters
[:digit:] all digits
[:graph:] all printable characters, not including space
[:lower:] all lower case letters
[:print:] all printable characters, including space
[:punct:] all punctuation characters
[:space:] all horizontal or vertical whitespace
[:upper:] all upper case letters
[:xdigit:] all hexadecimal digits

You can chain a variety of these together to compose powerful programs. The following is a basic word count program you could use to check your READMEs for overused words.

cat README.md | tr "[:punct:][:space:]" "\n" | tr "[:upper:]" "[:lower:]" | grep . | sort | uniq -c | sort -nr

Another example, using character ranges (tr works on sets of characters, not regular expressions):

# Converting all upper case letters to lower case
cat filename.csv | tr 'A-Z' 'a-z'

Useful options and escape sequences:

# Delete characters
tr -d
 
# Squeeze characters
tr -s
 
# Backspace
\b
 
# Form feed
\f
 
# Vertical tab
\v
 
# Characters with octal value NNN
\NNN
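
The -d and -s options are easiest to understand on small inputs. The strings below are invented purely for illustration.

```shell
# Delete carriage returns, e.g. from a file with Windows line endings
printf 'a,b\r\nc,d\r\n' | tr -d '\r'

# Squeeze each run of spaces down to a single space
printf 'too    many   spaces\n' | tr -s ' '

# Translate and squeeze in one step: runs of spaces become one comma
printf 'x   y  z\n' | tr -s ' ' ','
```

With two sets, -s squeezes after translating, so the last command prints x,y,z rather than a run of commas.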

WC

Word count. Its value is primarily derived from the -l flag, which will give you the line count.

# Will return number of lines in CSV
wc -l gigantic_comma.csv

This tool comes in handy to confirm the output of various commands. So, if we were to convert the delimiters within a file and then run wc -l we would expect the total lines to be the same. If not, then we know something went wrong.

Useful options:

# Print the byte counts
wc -c
 
# Print the character counts
wc -m
 
# Print length of longest line
wc -L
 
# Print word counts
wc -w
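
The line-count check described above can be scripted. The following is a minimal sketch with invented sample data. Reading the files through < makes wc print only the number, without the file name.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in input: three tab-delimited lines
printf 'a\tb\nc\td\ne\tf\n' > tab_delimited.txt

# The operation we want to verify
tr '\t' ',' < tab_delimited.txt > comma_delimited.csv

# Count lines before and after
before=$(wc -l < tab_delimited.txt)
after=$(wc -l < comma_delimited.csv)

if [ "$before" -eq "$after" ]; then
    echo "OK: $after lines"
else
    echo "Line count changed: $before -> $after" >&2
fi
```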

SPLIT

File sizes can vary dramatically, and depending on the job it can be beneficial to break a file into smaller pieces, which is what split does. The basic syntax for split is:

# We will split our CSV into new_filename every 500 lines
split -l 500 filename.csv new_filename_


# filename.csv

# new_filename_aa
# new_filename_ab
# new_filename_ac

Two quirks are the naming convention and the lack of file extensions. By default the suffixes are two letters (aa, ab, ...); they can be made numeric with the -d flag. To add file extensions, you can run the following find command. It appends .csv to the name of EVERY file in the current directory and its subdirectories, including the original filename.csv, so be careful.

find . -type f -exec mv '{}' '{}'.csv \;


# filename.csv.csv
# new_filename_aa.csv
# new_filename_ab.csv
# new_filename_ac.csv

Useful options:

# Split by certain byte size
split -b
 
# Generate suffixes of length N
split -a
 
# Split using hex suffixes
split -x
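
If renaming every file with find feels too risky, two narrower alternatives are sketched below. The first assumes GNU coreutils, whose split accepts --additional-suffix; the second uses a plain shell loop that only touches the pieces. The prefix part_ and the seq-generated data are made up for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir"
seq 1 1200 > filename.csv     # stand-in data: 1200 lines

# Option 1 (GNU split): numeric suffixes plus an extension in one step.
# Creates new_filename_00.csv, new_filename_01.csv, new_filename_02.csv
split -l 500 -d --additional-suffix=.csv filename.csv new_filename_

# Option 2: split normally, then rename only the pieces.
# The glob part_?? matches part_aa, part_ab, part_ac and nothing else.
split -l 500 filename.csv part_
for f in part_??; do
    mv -- "$f" "$f.csv"
done

ls
```

Either way the original filename.csv keeps its name, and wc -l on the pieces should add up to the line count of the original.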
