On Linux, unlike on the VAX, file names and directory names are case-sensitive. You can use tab completion instead of typing a whole file or directory name: press Tab once to complete the name, or twice to list the possible completions.
Working with files and directories from the Linux command line:
Command | Description |
---|---|
pwd | Show current directory |
cd ~ | Change to your home directory (same as your W: drive on Windows) |
cd dirname | Change directory to a subfolder named dirname |
cd .. | Go up a directory |
ls | List files and folders in the current directory (with color highlighting) |
dir | List files and folders in the current directory without color highlighting |
mkdir dirname | Make a new subfolder named dirname |
rmdir dirname | Deletes the subfolder named dirname (folder must be empty) |
rm filename | Deletes the file named filename |
rm -rf dirname | Recursively deletes the non-empty folder named dirname and all contained files and folders, without asking for confirmation |
mv oldname newname | Rename a file or directory from oldname to newname |
mv src destination | Move a file or directory from src to destination |
cp src destination | Copies a file from src to destination |
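cp -av src destination | Recursively copies a folder from src to destination, preserving permissions and timestamps (-a) and listing each file as it is copied (-v) |
xdg-open . | Open a GUI file browser in the current directory |
xdg-open file | Open a file in its default application |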
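The commands in the table can be strung together into a short session. The following is only a sketch: it works inside a throwaway directory created with mktemp, and the file and folder names (project, notes.txt, archive.txt) are made up for illustration.

```shell
# Work in a throwaway directory so none of your real files are touched.
workdir=$(mktemp -d)
cd "$workdir"

mkdir project                      # make a subfolder
cd project                         # move into it
printf 'hello\n' > notes.txt       # create a small file (hypothetical name)
cp notes.txt notes_backup.txt      # copy it
mv notes_backup.txt archive.txt    # rename the copy
ls                                 # now shows archive.txt and notes.txt

cd ..                              # go back up one level
cp -av project project_copy        # recursive copy, preserving attributes
rm project_copy/archive.txt        # delete a single file
rm -rf project_copy                # delete the folder and everything left in it
listing=$(ls)                      # only "project" remains
echo "$listing"
```

Because rm -rf does not ask for confirmation, it is worth running pwd and ls first to confirm you are where you think you are.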
Command Line Tricks for Data Scientists
This is reproduced from: https://medium.com/@kadek/command-line-tricks-for-data-scientists-c98e0abe5da
ICONV
File encodings can be tricky. For the most part, files these days are UTF-8 encoded. To understand some of the magic behind UTF-8, check out this excellent video. Nonetheless, there are times when we receive a file that isn’t in this format, which can lead to some wonky attempts at swapping the encoding scheme. Here, iconv is a life saver. Iconv is a simple program that takes text in one encoding and outputs the same text in another.
    # Convert -f (from) Latin-1 (ISO-8859-1) -t (to) standard UTF-8
    iconv -f ISO-8859-1 -t UTF-8 < input.txt > output.txt
Useful options:
    # List all known encodings
    iconv -l
    # Silently discard characters that cannot be converted
    iconv -c
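To see what the conversion actually does to the bytes, here is a small sketch. It builds its own one-line Latin-1 input with printf (the word and file names are only examples) and compares the file sizes before and after.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Build a one-line Latin-1 file: "café", with é stored as the single byte 0xE9 (octal 351).
printf 'caf\351\n' > input.txt

# Convert from ISO-8859-1 to UTF-8, where é becomes the two bytes 0xC3 0xA9.
iconv -f ISO-8859-1 -t UTF-8 < input.txt > output.txt

# The UTF-8 file is therefore one byte longer than the Latin-1 original.
in_bytes=$(wc -c < input.txt)
out_bytes=$(wc -c < output.txt)
echo "input: $in_bytes bytes, output: $out_bytes bytes"
```

A size difference like this is expected whenever the input contains non-ASCII characters; pure ASCII text comes out byte-for-byte identical.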
HEAD
Often when dealing with new data, the first thing we want to do is get a sense of what exists. Head, without any flags, will print out the first 10 lines of a file. The true power of head lies in testing out cleaning operations. For instance, if we wanted to change the delimiter of a file from commas to pipes, we could try the change on the first few lines before running it on the whole file.
    # Print the first 10 lines
    head filename.csv
    # Print the first 3 lines
    head -n 3 filename.csv
Useful options:
    # Print a specific number of lines
    head -n
    # Print a specific number of bytes
    head -c
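The idea of testing a cleaning operation on a preview could look roughly like this. The sample CSV is invented for the example, and the sketch assumes no field contains a comma of its own, since tr replaces every comma it sees.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A small stand-in for a large CSV (hypothetical data).
printf 'id,name,score\n1,ann,90\n2,bob,85\n3,cy,70\n4,di,60\n' > filename.csv

# Try the comma-to-pipe conversion on the first 3 lines only.
preview=$(head -n 3 filename.csv | tr ',' '|')
echo "$preview"
```

If the preview looks right, the same tr command can be pointed at the full file.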
TR
Tr is analogous to translate. This powerful utility is a workhorse for basic file cleaning. An ideal use case is for swapping out the delimiters within a file.
    # Convert a tab-delimited file into a comma-delimited one
    cat tab_delimited.txt | tr "\t" "," > comma_delimited.csv
Another feature of tr is the set of built-in [:class:] character classes at your disposal. These include:
    [:alnum:]   all letters and digits
    [:alpha:]   all letters
    [:blank:]   all horizontal whitespace
    [:cntrl:]   all control characters
    [:digit:]   all digits
    [:graph:]   all printable characters, not including space
    [:lower:]   all lower case letters
    [:print:]   all printable characters, including space
    [:punct:]   all punctuation characters
    [:space:]   all horizontal or vertical whitespace
    [:upper:]   all upper case letters
    [:xdigit:]  all hexadecimal digits
You can chain a variety of these together to compose powerful programs. The following is a basic word count program you could use to check your READMEs for overused words.
    cat README.md | tr "[:punct:][:space:]" "\n" | tr "[:upper:]" "[:lower:]" | grep . | sort | uniq -c | sort -nr
Another example using character ranges (tr works on sets of characters, not regular expressions, so no square brackets are needed):
    # Convert all upper case letters to lower case
    cat filename.csv | tr 'A-Z' 'a-z'
Useful options:
    # Delete characters
    tr -d
    # Squeeze repeated characters
    tr -s
    # Backspace
    \b
    # Form feed
    \f
    # Vertical tab
    \v
    # Character with octal value NNN
    \NNN
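As an illustration of -d and -s, here are three small cleaning steps. The input strings are invented for the example, and the behaviour shown is that of GNU tr; other implementations may differ slightly in how they pad the second set.

```shell
# Delete carriage returns left over from Windows line endings.
cleaned=$(printf 'a,b\r\nc,d\r\n' | tr -d '\r')
echo "$cleaned"

# Squeeze runs of spaces down to a single space.
squeezed=$(printf 'too    many   spaces\n' | tr -s ' ')
echo "$squeezed"

# Translate and squeeze together: every run of blanks becomes one comma.
delimited=$(printf 'x   y \t z\n' | tr -s '[:blank:]' ',')
echo "$delimited"
```

With two sets and -s, tr first translates and then squeezes repeats of the characters in the second set, which is why a mixed run of spaces and tabs collapses to a single comma.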
WC
Word count. Its value is primarily derived from the -l flag, which will give you the line count.
    # Return the number of lines in the CSV
    wc -l gigantic_comma.csv
This tool comes in handy to confirm the output of various commands. So, if we were to convert the delimiters within a file and then run wc -l we would expect the total lines to be the same. If not, then we know something went wrong.
Useful options:
    # Print the byte counts
    wc -c
    # Print the character counts
    wc -m
    # Print length of longest line
    wc -L
    # Print word counts
    wc -w
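The line-count sanity check described in this section might be scripted as follows. This is a sketch with a tiny made-up input; the file names simply mirror the ones used in the tr example.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A tiny tab-delimited file (hypothetical data), converted to commas.
printf 'a\tb\n1\t2\n3\t4\n' > tab_delimited.txt
tr '\t' ',' < tab_delimited.txt > comma_delimited.csv

# Reading from stdin makes wc print only the number, without the file name.
before=$(wc -l < tab_delimited.txt)
after=$(wc -l < comma_delimited.csv)

if [ "$before" -eq "$after" ]; then
    echo "OK: both files have $before lines"
else
    echo "Mismatch: $before lines before, $after lines after" >&2
fi
```

Matching line counts do not prove the conversion is correct, but a mismatch is a cheap and reliable sign that something went wrong.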
SPLIT
File sizes can vary dramatically, and depending on the job it can be beneficial to split a file up, hence split. The basic syntax for split is:
    # Split our CSV into pieces of 500 lines each, named with the prefix new_filename_
    split -l 500 filename.csv new_filename_
    # filename.csv
    # new_filename_aa
    # new_filename_ab
    # new_filename_ac
Two quirks are the naming convention and lack of file extensions. By default the suffixes are two letters long (aa, ab, ...); the suffix convention can be numeric via the -d flag. To add file extensions, you’ll need to run the following find command. It will change the names of ALL files within the current directory and its subdirectories by appending .csv, so be careful.
    find . -type f -exec mv '{}' '{}'.csv \;
    # filename.csv.csv
    # new_filename_aa.csv
    # new_filename_ab.csv
    # new_filename_ac.csv
Useful options:
    # Split by a certain byte size
    split -b
    # Generate suffixes of length N
    split -a
    # Split using hex suffixes
    split -x
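If GNU split is available, the renaming step can probably be avoided altogether. The sketch below assumes GNU coreutils, since --additional-suffix is a GNU extension and is not part of every split implementation; the 1,200-line sample file is generated just for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Build a 1,200-line sample file (hypothetical data).
seq 1 1200 > filename.csv

# -d uses numeric suffixes; --additional-suffix appends .csv to each piece (GNU split only).
# Creates new_filename_00.csv, new_filename_01.csv and new_filename_02.csv.
split -l 500 -d --additional-suffix=.csv filename.csv new_filename_

# Show the pieces that were created, one per line.
pieces=$(ls new_filename_*)
echo "$pieces"

# Concatenating the pieces in order should reproduce the original exactly.
cat new_filename_*.csv | cmp - filename.csv && echo "pieces match the original"
```

Comparing the reassembled pieces against the original with cmp is a stricter check than counting lines, and it only touches the files that split itself created.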