13.3 Tidy the input data
The German government provides gridded census data at either 1 km or 100 m resolution. The following code chunk downloads, unzips and reads in the 1 km data.
download.file("https://tinyurl.com/ybtpkwxz",
destfile = "census.zip", mode = "wb")
unzip("census.zip") # unzip the files
census_de = readr::read_csv2(list.files(pattern = "Gitter.csv"))
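The choice of read_csv2() over read_csv() matters here: read_csv2() expects semicolon-separated fields with a comma as the decimal mark. The following minimal sketch illustrates that convention with base R's read.csv2(), which uses the same defaults. The first three column names are taken from the census file; the demo_ratio column and all values are invented for illustration only.

```r
# Hypothetical two-row extract: semicolons separate fields, commas mark decimals.
# demo_ratio is an invented column, used only to show the decimal handling.
csv_text = "x_mp_1km;y_mp_1km;Einwohner;demo_ratio\n4335500;2689500;3;2,5\n4336500;2689500;-1;0,75"

# read.csv2() defaults to sep = ";" and dec = ",", like readr::read_csv2()
toy = read.csv2(text = csv_text)

# Inspect the parsed column types: demo_ratio should be numeric, not character
str(toy)
```

Reading the same text with read.csv() would instead treat each row as a single field, because no commas separate the columns in the header line.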
As a convenience to the reader, the corresponding data has been put into spDataLarge and can be accessed as follows:
data("census_de", package = "spDataLarge")
The census_de object is a data frame containing 13 variables for more than 300,000 grid cells across Germany. For our work, we only need a subset of these: Easting (x) and Northing (y), number of inhabitants (population; pop), mean age (mean_age), proportion of women (women) and average household size (hh_size). These variables are selected and renamed from German into English in the code chunk below and summarized in Table 13.1. Further, mutate_all() is used to convert the values -1 and -9 (meaning unknown) to NA.
# pop = population, hh_size = household size
input = dplyr::select(census_de, x = x_mp_1km, y = y_mp_1km, pop = Einwohner,
women = Frauen_A, mean_age = Alter_D,
hh_size = HHGroesse_D)
# set -1 and -9 to NA
input_tidy = dplyr::mutate_all(input, list(~ifelse(. %in% c(-1, -9), NA, .)))
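To make the effect of the recoding step concrete, here is a minimal sketch in base R on a toy data frame. The column names mirror the renamed census variables, but the values are invented and the helper name unknown_to_na is hypothetical; this is an illustration of the idea, not the chapter's own code.

```r
# Toy stand-in for the census table; all values are invented for illustration
toy = data.frame(
  pop      = c(3, -1, 5, 2),
  women    = c(-9, 2, 3, -1),
  mean_age = c(1, 5, 4, 2)
)

# Helper (hypothetical name): replace the "unknown" codes -1 and -9 with NA
unknown_to_na = function(col) ifelse(col %in% c(-1, -9), NA, col)

# Apply the helper to every column and rebuild the data frame
toy_tidy = as.data.frame(lapply(toy, unknown_to_na))

# Count the missing values introduced in each column
colSums(is.na(toy_tidy))
```

In dplyr 1.0.0 and later, mutate_all() is superseded by across(), so the same step could presumably also be written as dplyr::mutate(input, dplyr::across(dplyr::everything(), ~ifelse(.x %in% c(-1, -9), NA, .x))).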
Class | Population | % female | Mean age | Household size |
---|---|---|---|---|
1 | 3-250 | 0-40 | 0-40 | 1-2 |
2 | 250-500 | 40-47 | 40-42 | 2-2.5 |
3 | 500-2000 | 47-53 | 42-44 | 2.5-3 |
4 | 2000-4000 | 53-60 | 44-47 | 3-3.5 |
5 | 4000-8000 | >60 | >47 | >3.5 |
6 | >8000 | | | |