3.3 Decompressing Files
If the original data set is very large or consists of many files, it may be distributed as a (compressed) archive. Data sets that contain many repeated values (such as the words in a text file or the keys in a JSON file) are especially well suited for compression.
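To get a feel for why repeated values compress well, here is a minimal sketch that writes the same JSON-like record 1,000 times and compares the byte counts before and after gzip. The record, the file name repeated.json, and the scratch directory are made up for illustration; the exact sizes will depend on your version of gzip.

```shell
# Build a 1,000-line file in which every line is the same JSON-like record
# (hypothetical content, written to a scratch directory).
tmpdir=$(mktemp -d)
i=0
while [ "$i" -lt 1000 ]; do
  echo '{"level": "info", "message": "ok"}'
  i=$((i + 1))
done > "$tmpdir/repeated.json"

# Compare the size in bytes before and after gzip compression.
original_size=$(wc -c < "$tmpdir/repeated.json")
compressed_size=$(gzip -c "$tmpdir/repeated.json" | wc -c)

echo "original:   $original_size bytes"
echo "compressed: $compressed_size bytes"
```

Because every line is identical, the compressed size should come out as a small fraction of the original. Real data sets are less repetitive, so expect a smaller gain there.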
Common file extensions of compressed archives are .tar.gz, .zip, and .rar. To decompress these, you would use the command-line tools tar (Bailey, Eggert, and Poznyakoff 2014), unzip (Smith 2009), and unrar (Asselstine, Scheurer, and Winkelmann 2014), respectively. There are a few other, less common, file extensions for which you would need yet other tools. For example, in order to extract a file named logs.tar.gz, you would use:
$ cd ~/book/ch03
$ tar -xzvf data/logs.tar.gz
Indeed, tar is notorious for its many command-line arguments. In this case, the four command-line arguments x, z, v, and f specify that tar should extract files from an archive, use gzip as the decompression algorithm, be verbose, and read the archive from the file data/logs.tar.gz. In time, you’ll get used to typing these four characters, but there’s a more convenient way.
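If the single-letter options are hard to remember, the long option names may help. The following sketch assumes GNU tar (bsdtar accepts the same long names, as far as we know). It builds a small stand-in archive in a scratch directory, lists its contents with t, and then extracts it with the long equivalents of x, z, v, and f. The file example.log and the directory names are hypothetical.

```shell
# Create a small stand-in archive (hypothetical file names) in a scratch directory.
workdir=$(mktemp -d)
mkdir "$workdir/logs"
echo "first entry" > "$workdir/logs/example.log"
tar -czf "$workdir/logs.tar.gz" -C "$workdir" logs

# List the contents without extracting anything (t = list).
listing=$(tar -tzf "$workdir/logs.tar.gz")
echo "$listing"

# Extract with long options, equivalent to -xzvf.
# -C selects the directory to extract into.
mkdir "$workdir/out"
tar --extract --gzip --verbose --file="$workdir/logs.tar.gz" -C "$workdir/out"
```

Listing an archive before extracting it is a cheap way to check where the files will end up.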
Rather than remembering the different command-line tools and their options, you can use a handy script called unpack (Brisbin 2013), which decompresses many different formats. unpack looks at the extension of the file that you want to decompress and calls the appropriate command-line tool.
The unpack tool is part of the Data Science Toolbox. Remember that you can look up how to install it in the appendix. Example 3.1 shows the source of unpack. Although Bash scripting is not the focus of this book, it’s still useful to take a moment to figure out how it works.
Example 3.1 (Decompress various file formats)
#!/usr/bin/env bash
# unpack: Extract common file formats

# Display usage if no parameters given
if [[ -z "$@" ]]; then
    echo " ${0##*/} <archive> - extract common file formats)"
    exit
fi

# Required program(s)
req_progs=(7z unrar unzip)
for p in ${req_progs[@]}; do
    hash "$p" 2>&- || \
    { echo >&2 " Required program \"$p\" not installed."; exit 1; }
done

# Test if file exists
if [ ! -f "$@" ]; then
    echo "File "$@" doesn't exist"
    exit
fi

# Extract file by using extension as reference
case "$@" in
    *.7z      ) 7z x "$@" ;;
    *.tar.bz2 ) tar xvjf "$@" ;;
    *.bz2     ) bunzip2 "$@" ;;
    *.deb     ) ar vx "$@" ;;
    *.tar.gz  ) tar xvf "$@" ;;
    *.gz      ) gunzip "$@" ;;
    *.tar     ) tar xvf "$@" ;;
    *.tbz2    ) tar xvjf "$@" ;;
    *.tar.xz  ) tar xvf "$@" ;;
    *.tgz     ) tar xvzf "$@" ;;
    *.rar     ) unrar x "$@" ;;
    *.zip     ) unzip "$@" ;;
    *.Z       ) uncompress "$@" ;;
    *         ) echo " Unsupported file format" ;;
esac
Now, in order to decompress this same file, you would simply use:
$ unpack data/logs.tar.gz
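The heart of Example 3.1 is the case statement, which runs the branch of the first pattern that matches the file name. That is why *.tar.gz has to be listed before *.gz: otherwise a file such as logs.tar.gz would be handed to gunzip instead of tar. The sketch below isolates this behavior. The function tool_for is hypothetical and not part of unpack; it only prints the name of the tool that would be chosen and extracts nothing.

```shell
# tool_for is a hypothetical helper: it prints which tool a pattern-based
# dispatch would pick for a file name, without touching any file.
tool_for() {
  case "$1" in
    *.tar.gz | *.tgz ) echo "tar" ;;          # must come before *.gz
    *.gz )             echo "gunzip" ;;
    *.zip )            echo "unzip" ;;
    * )                echo "unsupported" ;;  # fallback when nothing matches
  esac
}

tool_for logs.tar.gz   # prints "tar"
tool_for logs.gz       # prints "gunzip"
tool_for logs.csv      # prints "unsupported"
```

If you swap the first two branches, the first call prints gunzip instead, which shows how much the order of the patterns matters.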