7.4 Creating Visualizations
In this section we’re going to discuss how to create visualizations at the command line. We’ll be looking at two different software packages: Gnuplot and ggplot2. First, we’ll introduce both packages. Then, we’ll demonstrate how to create several different types of visualizations using both of them.
7.4.1 Introducing Gnuplot and Feedgnuplot
The first visualization package that we’re discussing in this chapter is Gnuplot. Gnuplot has been around since 1986. Despite being rather old, its visualization capabilities are quite extensive. As such, it’s impossible to do it justice in a single section. There are other good resources available, including Gnuplot in Action by Janert (2009).
To demonstrate its flexibility (and its archaic notation), consider Example 7.2, which is copied from the Gnuplot website (http://gnuplot.sourceforge.net/demo/histograms.6.gnu).
Example 7.2 (Creating a histogram using Gnuplot)
# set terminal pngcairo transparent enhanced font "arial,10" fontscale 1.0 size
# set output 'histograms.6.png'
set border 3 front linetype -1 linewidth 1.000
set boxwidth 0.75 absolute
set style fill solid 1.00 border lt -1
set grid nopolar
set grid noxtics nomxtics ytics nomytics noztics nomztics \
nox2tics nomx2tics noy2tics nomy2tics nocbtics nomcbtics
set grid layerdefault linetype 0 linewidth 1.000, linetype 0 linewidth 1.000
set key outside right top vertical Left reverse noenhanced autotitles columnhead
set style histogram columnstacked title offset character 0, 0, 0
set datafile missing '-'
set style data histograms
set xtics border in scale 1,0.5 nomirror norotate offset character 0, 0, 0 auto
set xtics norangelimit
set xtics ()
set ytics border in scale 0,0 mirror norotate offset character 0, 0, 0 autojust
set ztics border in scale 0,0 nomirror norotate offset character 0, 0, 0 autoju
set cbtics border in scale 0,0 mirror norotate offset character 0, 0, 0 autojus
set rtics axis in scale 0,0 nomirror norotate offset character 0, 0, 0 autojust
set title "Immigration from Northern Europe\n(columstacked histogram)"
set xlabel "Country of Origin"
set ylabel "Immigration by decade"
set yrange [ 0.00000 : * ] noreverse nowriteback
i = 23
plot 'immigration.dat' using 6 ti col, '' using 12 ti col, '' using 13 ti c
Please note that the script has been trimmed to 80 characters wide, so some of the longer lines are cut off. The full script generates the following image:
Figure 7.1: Immigration Plot by Gnuplot
Gnuplot is different from most command-line tools we’ve been using for two reasons. First, it uses a script instead of command-line arguments. Second, the output is usually written to a file rather than printed to standard output.
One great advantage of Gnuplot being around for so long, and the main reason we’ve included it in this book, is that it’s able to produce visualizations for the command line. That is, it’s able to print its output to the terminal without the need for a graphical user interface (GUI). Even then, you would need to set up a script.
Luckily, there is a command-line tool called feedgnuplot (Kogan 2014), which can help us with setting up a script for Gnuplot. feedgnuplot is entirely configurable through command-line arguments. Plus, it reads from standard input. After we have introduced ggplot2, we’re going to create a few visualizations using feedgnuplot.
One great feature of feedgnuplot that we would like to mention here is that it allows you to plot streaming data. The following is a snapshot of a continuously updated plot based on random input data:
$ while true; do echo $RANDOM; done | sample -d 10 | feedgnuplot --stream \
> --terminal 'dumb 80,25' --lines --xlen 10
30000 ++-----+------------+-------------+-------------+------------+-----++
| + * + + + |
| : ** : ******* : *
25000 ++.................*.*..........................*.....*............+*
| : *: * : *: * : *|
| : *: * : *: * : *|
| : * : * : * : * : * |
20000 ++................*....*......................*.........*.........*++
| : * : * : * : * : * |
| : * : * : * : * : * |
15000 ++....**.........*.......*..................*............*.......*.++
| **** :* * : * : * : * : * |
** :* * : * **** * : * : * |
10000 ++.......*......*.........*....**....*.....*..............*.....*..++
| : * * : * ** : * * : * : * |
| : * * : ** : ** * : * : * |
| : * * : : * : * : * |
5000 ++..........*..*.........................*..................*.*....++
| : * * : : : *:* |
| + ** + + + * |
0 ++-----+------*-----+-------------+-------------+------------*-----++
2350 2352 2354 2356 2358
7.4.2 Introducing ggplot2
A more modern software package for creating visualizations is ggplot2, which is an implementation of the grammar of graphics in R (Wickham 2009).
Thanks to the grammar of graphics and sensible defaults, ggplot2 commands tend to be very short and expressive. When used through Rio, this is a very convenient way of creating visualizations from the command line.
To demonstrate its expressiveness, we’ll recreate the histogram plot generated above by Gnuplot, with the help of Rio. Because Rio expects the data set to be comma-delimited, and because ggplot2 expects the data in long format, we first need to scrub and transform the data a little bit:
$ < data/immigration.dat sed -re '/^#/d;s/\t/,/g;s/,-,/,0,/g;s/Region/'\
> 'Period/' | tee data/immigration.csv | head | cut -c1-80
Period,Austria,Hungary,Belgium,Czechoslovakia,Denmark,France,Germany,Greece,Irel
1891-1900,234081,181288,18167,0,50231,30770,505152,15979,388416,651893,26758,950
1901-1910,668209,808511,41635,0,65285,73379,341498,167519,339065,2045877,48262,1
1911-1920,453649,442693,33746,3426,41983,61897,143945,184201,146181,1109524,4371
1921-1930,32868,30680,15846,102194,32430,49610,412202,51084,211234,455315,26948,
1931-1940,3563,7861,4817,14393,2559,12623,144058,9119,10973,68028,7150,4740,3960
1941-1950,24860,3469,12189,8347,5393,38809,226578,8973,19789,57661,14860,10100,1
1951-1960,67106,36637,18575,918,10984,51121,477765,47608,43362,185491,52277,2293
1961-1970,20621,5401,9192,3273,9201,45237,190796,85969,32966,214111,30606,15484,
The sed expression consists of four parts, delimited by semicolons:
1. Remove lines that start with #.
2. Convert tabs to commas.
3. Change dashes (missing values) into zeros.
4. Change the feature name Region into Period.
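To see the four parts at work without the full data file, the same expression can be run on a tiny sample. The following is a minimal sketch: the function name scrub and the sample lines are made up for illustration (the sample only mimics the shape of immigration.dat), and it assumes GNU sed, which understands \t in a pattern.

```shell
# scrub: hypothetical helper wrapping the four-part sed expression from the text.
# Assumes GNU sed (-E for extended regular expressions, \t for a tab).
scrub() {
  sed -E '/^#/d;s/\t/,/g;s/,-,/,0,/g;s/Region/Period/'
}

# Illustrative sample: a comment line, a tab-separated header,
# and one row in which a missing value is written as a dash.
printf '# a comment line\nRegion\tAustria\tHungary\tBelgium\tCzechoslovakia\tDenmark\n1891-1900\t234081\t181288\t18167\t-\t50231\n' |
  scrub
```

On this sample the comment line disappears, the tabs become commas, the dash becomes 0, and the header starts with Period. Note that the pattern ,-, only matches a dash that has a comma on both sides, so a dash in the last column, or the second of two adjacent dashes, would be left as is. Whether that matters depends on the data.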
We then select only the columns that matter using csvcut and subsequently convert the data from a wide format to a long one using Rio and the melt function, which is part of the R package reshape2:
$ < data/immigration.csv csvcut -c Period,Denmark,Netherlands,Norway,\
> Sweden | Rio -re 'melt(df, id="Period", variable.name="Country", '\
> 'value.name="Count")' | tee data/immigration-long.csv | head | csvlook
|------------+-------------+--------|
| Period | Country | Count |
|------------+-------------+--------|
| 1891-1900 | Denmark | 50231 |
| 1901-1910 | Denmark | 65285 |
| 1911-1920 | Denmark | 41983 |
| 1921-1930 | Denmark | 32430 |
| 1931-1940 | Denmark | 2559 |
| 1941-1950 | Denmark | 5393 |
| 1951-1960 | Denmark | 10984 |
| 1961-1970 | Denmark | 9201 |
| 1891-1900 | Netherlands | 26758 |
|------------+-------------+--------|
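If the difference between wide and long format is unclear, the reshaping step can be imitated with a few lines of awk. This is only a sketch of the idea, not how melt is implemented: the function name wide_to_long is hypothetical, the output column names Country and Count are hard-coded to match the example above, and the input is assumed to be simple CSV without quoted fields.

```shell
# wide_to_long: hypothetical awk imitation of the wide-to-long reshape.
# The first column is kept as the identifier; every other column becomes
# a block of (identifier, column name, value) rows, one column at a time.
wide_to_long() {
  awk -F, -v OFS=, '
    NR == 1 { ncol = NF; for (i = 1; i <= NF; i++) name[i] = $i; next }
    { nrow++; for (i = 1; i <= NF; i++) cell[nrow, i] = $i }
    END {
      print name[1], "Country", "Count"
      for (i = 2; i <= ncol; i++)
        for (r = 1; r <= nrow; r++)
          print cell[r, 1], name[i], cell[r, i]
    }
  '
}

# Two periods and two countries, using values shown in the output above.
printf 'Period,Denmark,Netherlands\n1891-1900,50231,26758\n1901-1910,65285,48262\n' |
  wide_to_long
```

Each wide row with two country columns turns into two long rows, so the two-by-two sample produces a header and four data rows, with all Denmark rows listed before the Netherlands rows.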
Now we can use Rio again, this time with an expression that builds up a ggplot2 visualization:
$ < data/immigration-long.csv Rio -ge 'g + geom_bar(aes(Country, Count,'\
> ' fill=Period), stat="identity") + scale_fill_brewer(palette="Set1") '\
> '+ labs(x="Country of origin", y="Immigration by decade", title='\
> '"Immigration from Northern Europe\n(columstacked histogram)")' | display
Figure 7.2: Immigration plot by Rio and ggplot2
The -g command-line argument indicates that Rio should load the ggplot2 package. The output is an image in PNG format. You can either view the PNG image via display, which is part of ImageMagick (LLC 2009), or you can redirect the output to a PNG file. If you’re on a remote terminal then you probably won’t be able to see any graphics. A workaround for this is to start a webserver from a particular directory:
$ python -m SimpleHTTPServer 8000
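Make sure that you have access to the port (8000 in this case). If you save the PNG image to the directory from which the webserver was launched, then you can access the image from your browser at http://localhost:8000/file.png. Note that SimpleHTTPServer is a Python 2 module; under Python 3 the equivalent command is python3 -m http.server 8000.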
7.4.3 Histograms
Using Rio:
$ < data/tips.csv Rio -ge 'g+geom_histogram(aes(bill))' | display
Figure 7.3: Histogram
Using feedgnuplot:
$ < data/tips.csv csvcut -c bill | feedgnuplot --terminal 'dumb 80,25' \
> --histogram 0 --with boxes --ymin 0 --binwidth 1.5 --unset grid --exit
25 ++----+------+-----+--***-+-----+------+-----+------+-----+------+----++
+ + + +*** * + + + + + + + +
| * * * |
| *** * * * |
20 ++ * * * * * ++
| **** * * * * |
| * ** *** * * *** |
| * ** * * * * * * |
15 ++ * ** * * * * * * ++
| * ** * * * * * * |
| * ** * * * * * * |
| * ** * * * * * * *** |
10 ++ * ** * * * *** *** * ++
| * ** * * * * * * * * |
| *** ** * * * * * * * ***** *** |
| * * ** * * * * * * * * * *** * |
5 ++ *** * ** * * * * * * * * * * * * *** ++
| * * * ** * * * * * * * * * * * * *** * |
| * * * ** * * * * * * * * * * * *** * ******** *** *** |
+ ***+*** * * ** *+* * * * * * * * * *+* * *+** * *+* ***+* * * *** +
0 ++-***+***********************************************-*****-***-***--++
0 5 10 15 20 25 30 35 40 45 50 55
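The --binwidth option controls how wide each bar of the histogram is. To get a feel for what fixed-width binning does to a column of numbers, the counting can be sketched in awk. The function name bin_counts and the input values are made up for illustration, and the bins here start at multiples of the width, which is an assumption: feedgnuplot may place its bin edges differently.

```shell
# bin_counts WIDTH: hypothetical helper that reads one non-negative number
# per line and prints "bin_start count" for each non-empty bin.
# Assumption: bins start at multiples of WIDTH.
bin_counts() {
  awk -v w="$1" '
    { b = int($1 / w) * w; n[b]++ }       # int() truncates, which equals floor for values >= 0
    END { for (b in n) print b, n[b] }
  ' | sort -n
}

# Six made-up values, binned with the same width as in the command above.
printf '%s\n' 1.0 1.2 2.9 3.1 3.4 4.4 | bin_counts 1.5
```

For these six values the sketch reports three bins, starting at 0, 1.5, and 3, with two, one, and three values respectively. A larger width merges neighbouring bars, while a smaller one gives a more detailed but noisier picture.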
7.4.4 Bar Plots
Using Rio:
$ < data/tips.csv Rio -ge 'g+geom_bar(aes(factor(size)))' | display
Figure 7.4: Bar Plot
Using feedgnuplot:
$ < data/tips.csv csvcut -c size | header -d | feedgnuplot --terminal \
> 'dumb 80,25' --histogram 0 --with boxes --unset grid --exit
160 ++--------+----***********----+---------+---------+---------+--------++
+ + * + * + + + + +
140 ++ * * ++
| * * |
| * * |
120 ++ * * ++
| * * |
100 ++ * * ++
| * * |
| * * |
80 ++ * * ++
| * * |
60 ++ * * ++
| * * |
| * * |
40 ++ * ********************* ++
| * * * * |
20 ++ * * * * ++
| * * * * |
+ *********** + * + * + ********************* +
0 ++---*************************************************************---++
0 1 2 3 4 5 6 7
7.4.5 Density Plots
Using Rio:
$ < data/tips.csv Rio -ge 'g+geom_density(aes(tip / bill * 100, fill=sex), '\
> 'alpha=0.3) + xlab("percent")' | display
Figure 7.5: Density Plot
Since feedgnuplot cannot generate density plots, it’s best to just generate a histogram.
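A histogram of the tip percentage needs the derived value first, because the tip / bill * 100 calculation in the Rio example happens inside R. One way to compute it on the command line is sketched below. The function name tip_percent is hypothetical, and the sketch assumes a simple CSV file (no quoted fields) whose header contains columns named bill and tip, as in the Rio expressions.

```shell
# tip_percent: hypothetical helper that prints tip / bill * 100 for each row.
# Columns are looked up by header name, so their position does not matter.
tip_percent() {
  awk -F, '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    $(col["bill"]) > 0 { printf "%.2f\n", $(col["tip"]) / $(col["bill"]) * 100 }
  '
}

# Two made-up rows: a 3.00 tip on a 20.00 bill and a 2.50 tip on a 10.00 bill.
printf 'bill,tip\n20.00,3.00\n10.00,2.50\n' | tip_percent
```

The sample prints 15.00 and 25.00. With the real data, the output could then be piped into feedgnuplot using the histogram options from Section 7.4.3, for example < data/tips.csv tip_percent | feedgnuplot --terminal 'dumb 80,25' --histogram 0 --with boxes --unset grid --exit. Unlike the density plot, this does not separate the values by sex.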
7.4.6 Box Plots
Using Rio:
$ < data/tips.csv Rio -ge 'g+geom_boxplot(aes(time, bill))' | display
Figure 7.6: Box Plot
Drawing a box plot is unfortunately not possible with feedgnuplot.
7.4.7 Scatter Plots
Using Rio:
$ < data/tips.csv Rio -ge 'g+geom_point(aes(bill, tip, color=time))' | display
Figure 7.7: Scatter Plot
Using feedgnuplot:
$ < data/tips.csv csvcut -c bill,tip | tr , ' ' | header -d | feedgnuplot \
> --terminal 'dumb 80,25' --points --domain --unset grid --exit --style 'pt' '14'
10 ++----+------+-----+------+-----+------+-----+------+-----+------+A---++
+ + + + + + + + + + + +
9 ++ A ++
| |
8 ++ ++
| A |
| |
7 ++ A A ++
| A A |
6 ++ A A A ++
| A A |
5 ++ A A A A A AA A AA A A A ++
| A A A A |
4 ++ A A AAAA AAA A A A A A ++
| A AAAAA AAA AA A A |
| A AAAAAAA AA A A AA A AA |
3 ++ A AAAAAAAAAAA A A AA AA A ++
| AAAAAAA AA A A A A A |
2 ++ AA AAAAAAAAA A A A AA A A A ++
+ + AAAAAAAA +A AA+ + A + + + + + +
1 ++--A-+A-A---+--AA-+--A---+-----+------+--A--+------+-----+------+----++
0 5 10 15 20 25 30 35 40 45 50 55
7.4.8 Line Graphs
$ < data/immigration-long.csv Rio -ge 'g+geom_line(aes(x=Period, '\
> 'y=Count, group=Country, color=Country)) + theme(axis.text.x = '\
> 'element_text(angle = -45, hjust = 0))' | display
Figure 7.8: Line Graph
$ < data/immigration.csv csvcut -c Period,Denmark,Netherlands,Norway,Sweden |
> header -d | tr , ' ' | feedgnuplot --terminal 'dumb 80,25' --lines \
> --autolegend --domain --legend 0 "Denmark" --legend 1 "Netherlands" \
> --legend 2 "Norway" --legend 3 "Sweden" --xlabel "Period" --unset grid --exit
250000 ++-----%%%-------+-------+--------+-------+-------+--------+------++
+ %%%% + % + + + + + Denmark+****** +
|%% % Netherlands ###### |
| % Norway $$$$$$ |
200000 ++ % Sweden %%%%%%++
| $ % |
| $ $ % |
| $ $ % |
150000 ++ $$ $ % ++
| $ $ % |
| $ $ % |
100000 ++$ $ % ++
|$ $ %%%%%%%%%% |
| $ % |
| *********** $$$$$$$$$$$% |
50000 +**** #########** $%% ####### ++
| #### ******** $$% ### ## |
|## ******## ##$$$$$$$$$$$$# |
+ + + + **###########$$************* +
0 ++------+--------+-------+--------*************---+--------+------++
1890 1900 1910 1920 1930 1940 1950 1960 1970
Period
7.4.9 Summary
Both Rio with ggplot2 and feedgnuplot with Gnuplot have their advantages. The plots generated by Rio are obviously of much higher quality. Rio also offers a consistent syntax that lends itself well to the command line. The only downside is that the output is not viewable from the command line. This is where feedgnuplot may come in handy. Each plot has roughly the same command-line arguments. As such, it would be straightforward to create a small Bash script that would make generating plots from and for the command line even easier. After all, with the command line having such a low resolution, we don’t need a lot of flexibility.
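As a rough idea of what such a script could look like, here is a minimal sketch. The function name termplot, the plot type names, and the TERMPLOT_DRY_RUN variable are all made up for illustration. The feedgnuplot options are only those used earlier in this section. The code sticks to POSIX shell syntax, so it also runs in Bash.

```shell
# termplot: hypothetical wrapper around feedgnuplot for terminal plots.
# Usage: termplot hist|scatter|line [extra feedgnuplot options]
# Set TERMPLOT_DRY_RUN=1 to print the command instead of running it.
termplot() {
  if [ "$#" -lt 1 ]; then
    echo "usage: termplot hist|scatter|line [extra feedgnuplot options]" >&2
    return 1
  fi
  kind=$1
  shift
  # Put the options specific to the plot type in front of any extra options.
  case $kind in
    hist)    set -- --histogram 0 --with boxes "$@" ;;
    scatter) set -- --points --domain "$@" ;;
    line)    set -- --lines --domain "$@" ;;
    *)       echo "termplot: unknown plot type: $kind" >&2; return 1 ;;
  esac
  # Add the options that every terminal plot in this section shares.
  set -- feedgnuplot --terminal 'dumb 80,25' --unset grid --exit "$@"
  if [ -n "${TERMPLOT_DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Show the command that would be run for a histogram with a bin width of 1.5.
TERMPLOT_DRY_RUN=1 termplot hist --binwidth 1.5
```

With feedgnuplot installed, a call such as < data/tips.csv csvcut -c bill | termplot hist --ymin 0 --binwidth 1.5 should be roughly equivalent to the histogram command in Section 7.4.3. Keeping the shared options in one place is the main benefit; anything more specific can still be passed through as extra arguments.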