On Linux, unlike the VAX, file names and directory names are case sensitive. You can use tab completion instead of typing a whole file or directory name: press Tab once to complete a name, and twice to list the possible completions.

Working with files and directories from the Linux command line:

Command                   Description
pwd                       Show the current directory
cd ~                      Change to your home directory (same as your W: drive on Windows)
cd dirname                Change directory to a subfolder named dirname
cd ..                     Go up a directory
ls                        List files and folders in the current directory (with color highlighting)
dir                       List files and folders in the current directory without color highlighting
mkdir dirname             Make a new subfolder named dirname
rmdir dirname             Delete the subfolder named dirname (the folder must be empty)
rm filename               Delete the file named filename
rm -rf dirname            Recursively delete the non-empty folder named dirname and all contained files and folders
mv oldname newname        Rename a file or directory from oldname to newname
mv src destination        Move a file or directory from src to destination
cp src destination        Copy a file from src to destination
cp -av src destination    Recursively copy a folder from src to destination
xdg-open .                Open a GUI file browser in the current directory
xdg-open file             Open a file
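
To see how these commands fit together, here is a minimal sketch of a short session. It runs inside a throwaway directory created with mktemp, and every file and folder name in it (project, notes.txt and so on) is made up for illustration.

```shell
# Work in a throwaway directory so nothing real is touched.
# All names below are hypothetical examples.
workdir=$(mktemp -d)
cd "$workdir"

mkdir project                       # make a subfolder
cd project                          # move into it
echo "hello" > notes.txt            # create a small file
cp notes.txt notes_backup.txt       # copy the file
mv notes_backup.txt old_notes.txt   # rename the copy
cd ..                               # go back up one level
cp -a project project_copy          # recursively copy the folder
rm project/old_notes.txt            # delete one file from the original

listing=$(ls project_copy)          # the copy still has both files
remaining=$(ls project)             # the original has one file left
echo "$listing"
echo "$remaining"

# Clean up: step out of the directory before deleting it
cd /
rm -rf "$workdir"
```

Stepping out with cd / before rm -rf avoids deleting the directory the shell is currently sitting in.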

 

Command Line Tricks for Data Scientists

This is reproduced from: https://medium.com/@kadek/command-line-tricks-for-data-scientists-c98e0abe5da

ICONV

File encodings can be tricky. For the most part, files these days are UTF-8 encoded. To understand some of the magic behind UTF-8, check out this excellent video. Nonetheless, there are times when we receive a file that isn’t in this format, which can lead to some wonky attempts at swapping the encoding scheme. Here, iconv is a life saver. Iconv is a simple program that takes text in one encoding and outputs the text in another.

# Converting -f (from) latin1 (ISO-8859-1)
# -t (to) standard UTF-8
iconv -f ISO-8859-1 -t UTF-8 < input.txt > output.txt

 

Useful options:

# List all known encodings
iconv -l
 
# Silently discard characters that cannot be converted
iconv -c
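
To make the conversion concrete, the following sketch builds a tiny Latin-1 file and compares its raw bytes before and after iconv. The file names are placeholders, and the byte dump uses od, which is assumed to be available alongside iconv.

```shell
# Build a tiny Latin-1 file: \351 is the octal byte for "é" in ISO-8859-1.
# File names are placeholders for illustration.
tmpdir=$(mktemp -d)
printf 'caf\351\n' > "$tmpdir/latin1.txt"

# Convert it to UTF-8
iconv -f ISO-8859-1 -t UTF-8 < "$tmpdir/latin1.txt" > "$tmpdir/utf8.txt"

# Dump the raw bytes as hex, before and after the conversion
before=$(od -An -tx1 < "$tmpdir/latin1.txt" | tr -d ' \n')
after=$(od -An -tx1 < "$tmpdir/utf8.txt" | tr -d ' \n')
echo "latin1 bytes: $before"
echo "utf-8 bytes:  $after"

rm -rf "$tmpdir"
```

The single byte e9 becomes the two-byte sequence c3 a9, which is how UTF-8 encodes the same character.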

 

HEAD

Often when dealing with new data, the first thing we want to do is get a sense of what exists. Head, without any flags, will print out the first 10 lines of a file. The true power of head lies in testing out cleaning operations: for instance, if we wanted to change the delimiter of a file from commas to pipes, we could try the change on the first few lines before running it on the whole file.

# Prints out first 10 lines
head filename.csv
 
# Prints first 3 lines
head -n 3 filename.csv
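
As a rough sketch of the delimiter test mentioned above, the snippet below previews a comma-to-pipe change on the first three lines only. The file sample.csv and its contents are invented for the example.

```shell
# Create a small sample file (hypothetical data)
tmpdir=$(mktemp -d)
printf 'id,name,score\n1,ann,90\n2,bob,85\n3,cy,70\n4,dee,60\n' > "$tmpdir/sample.csv"

# Try the delimiter change on the first 3 lines only
preview=$(head -n 3 "$tmpdir/sample.csv" | tr ',' '|')
echo "$preview"

rm -rf "$tmpdir"
```

If the preview looks right, the same tr command can be pointed at the whole file.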

 

Useful options:

# Print a specific number of lines
head -n
 
# Print a specific number of bytes
head -c

 

TR

Tr is short for translate. This powerful utility is a workhorse for basic file cleaning. An ideal use case is swapping out the delimiters within a file.

# Converting a tab delimited file into commas
cat tab_delimited.txt | tr '\t' ',' > comma_delimited.csv

 

Another feature of tr is the set of built-in [:class:] character classes at your disposal. These include:

[:alnum:] all letters and digits
[:alpha:] all letters
[:blank:] all horizontal whitespace
[:cntrl:] all control characters
[:digit:] all digits
[:graph:] all printable characters, not including space
[:lower:] all lower case letters
[:print:] all printable characters, including space
[:punct:] all punctuation characters
[:space:] all horizontal or vertical whitespace
[:upper:] all upper case letters
[:xdigit:] all hexadecimal digits

You can chain a variety of these together to compose powerful programs. The following is a basic word count program you could use to check your READMEs for overused words.

cat README.md | tr "[:punct:][:space:]" "\n" | tr "[:upper:]" "[:lower:]" | grep . | sort | uniq -c | sort -nr

Another example using character ranges (tr matches sets of characters, not regular expressions):

# Converting all upper case letters to lower case
cat filename.csv | tr 'A-Z' 'a-z'

 

Useful options and escape sequences:

# Delete characters
tr -d
 
# Squeeze characters
tr -s
 
# Backspace
\b
 
# Form feed
\f
 
# Vertical tab
\v
 
# Characters with octal value NNN
\NNN
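
The options and escapes listed above can be tried on small inline strings. This is only a sketch with made-up input: it deletes carriage returns, squeezes repeated spaces, and uses the octal escape \011 (a tab) as the character to replace.

```shell
# Delete carriage returns, e.g. from Windows line endings
cleaned=$(printf 'a,b\r\nc,d\r\n' | tr -d '\r')

# Squeeze runs of spaces down to a single space
squeezed=$(printf 'too    many   spaces\n' | tr -s ' ')

# Replace tabs with commas, naming the tab by its octal value
commas=$(printf 'x\ty\n' | tr '\011' ',')

echo "$cleaned"
echo "$squeezed"
echo "$commas"
```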

 

WC

Word count. Its value is primarily derived from the -l flag, which will give you the line count.

# Will return number of lines in CSV
wc -l gigantic_comma.csv

 

This tool comes in handy to confirm the output of various commands. So, if we were to convert the delimiters within a file and then run wc -l we would expect the total lines to be the same. If not, then we know something went wrong.

Useful options:

# Print the byte counts
wc -c
 
# Print the character counts
wc -m
 
# Print length of longest line
wc -L
 
# Print word counts
wc -w
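
The line-count check described above might look like the following sketch. The file names and data are invented, and the tr -d ' ' step is there because some versions of wc pad the count with spaces.

```shell
# Make a small tab-delimited file (hypothetical data)
tmpdir=$(mktemp -d)
printf 'a\tb\n1\t2\n3\t4\n' > "$tmpdir/in.tsv"

# Convert tabs to commas
tr '\t' ',' < "$tmpdir/in.tsv" > "$tmpdir/out.csv"

# Count lines before and after; reading from stdin keeps the file name out of the output
lines_in=$(wc -l < "$tmpdir/in.tsv" | tr -d ' ')
lines_out=$(wc -l < "$tmpdir/out.csv" | tr -d ' ')

if [ "$lines_in" -eq "$lines_out" ]; then
    status="ok"
else
    status="mismatch"
fi
echo "in=$lines_in out=$lines_out status=$status"

rm -rf "$tmpdir"
```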

 

SPLIT

File sizes can vary dramatically, and depending on the job it can be beneficial to split a file up, hence split. The basic syntax for split is:

# We will split our CSV into new_filename every 500 lines
split -l 500 filename.csv new_filename_


# filename.csv

# new_filename_aa
# new_filename_ab
# new_filename_ac

 

Two quirks are the naming convention and the lack of file extensions. The suffixes can be made numeric with the -d flag. To add file extensions, you’ll need to run the following find command. It will rename ALL files in the current directory and its subdirectories by appending .csv, including the original file, so be careful.

find . -type f -exec mv '{}' '{}'.csv \;


# filename.csv.csv
# new_filename_aa.csv
# new_filename_ab.csv
# new_filename_ac.csv
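
If renaming every file is too broad, one possible alternative is a loop over just the pieces that split produced. This is a sketch rather than the original author's method. It assumes the pieces share the new_filename_ prefix used above, and it runs in a temporary directory with made-up data.

```shell
# Set up a small file to split (hypothetical data)
tmpdir=$(mktemp -d)
cd "$tmpdir"
printf '1\n2\n3\n4\n5\n' > filename.csv

# Split into pieces of 2 lines each
split -l 2 filename.csv new_filename_

# Rename only the pieces; filename.csv is left alone
for piece in new_filename_*; do
    mv "$piece" "$piece.csv"
done

result=$(ls)
echo "$result"

cd /
rm -rf "$tmpdir"
```

Because the glob is expanded once before the loop starts, files that have already been renamed are not picked up a second time.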

 

Useful options:

# Split by certain byte size
split -b
 
# Generate suffixes of length N
split -a
 
# Split using hex suffixes
split -x
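
As an illustration of the -a option, the sketch below asks for three-letter suffixes and then checks that the pieces can be stitched back into the original file. The names data.csv, part_ and rebuilt.csv are placeholders.

```shell
# Set up a small file to split (hypothetical data)
tmpdir=$(mktemp -d)
cd "$tmpdir"
printf 'h\n1\n2\n3\n4\n' > data.csv

# Split every 2 lines, using suffixes of length 3
split -l 2 -a 3 data.csv part_
pieces=$(ls part_*)
echo "$pieces"

# Concatenate the pieces in order and compare with the original
cat part_* > rebuilt.csv
if cmp -s data.csv rebuilt.csv; then
    same="yes"
else
    same="no"
fi
echo "identical: $same"

cd /
rm -rf "$tmpdir"
```

The pieces sort alphabetically, so cat part_* puts them back in the order split wrote them.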