library(arealDB)
library(tidyverse)

DBDir <- tempdir()
gazDir <- "directory/to/gazetteer.rds"
ontoDir <- "directory/to/ontology.rds"

1 Rationale

Areal data are data that summarise some aspect of a particular region. Such data are relevant in many applications in the environmental and socioeconomic sciences, such as biodiversity checklists, agricultural statistics, or socioeconomic surveys. For applications that surpass the spatial, temporal or thematic scope of any single data source, data must be integrated from several heterogeneous sources. Inconsistent concepts and definitions, or messy data tables, make this a tedious and error-prone process.

arealDB was developed to provide an easy-to-use way of integrating areal data, together with the associated geometries, into a standardised database. In the current, revised version, it makes use of the ontologics R-package to harmonise the names of territories (the geometries) and the target variables (the tables).

A previous version of this tutorial can be found in the replication script (appendix/source) of the pre-print. This version is not an exact replication script, because it uses dummy code wherever paths on the hard disk would be needed. It should nevertheless illustrate how to use the current version of arealDB.

2 The basics

To ensure that databases built with arealDB are properly standardised, the R-package makes use of a particular data structure in which concepts and their hierarchical relationships are recorded. The idea for this structure is based on the "Semantic Web", a broader effort to make the internet more interoperable and machine-readable. This is useful both for the names of territories (which typically differ across languages and across statistical agencies) and for the values of the variables that are recorded. For both registers, arealDB makes use of the R-package ontologics, which allows various concepts to be recorded in such a data structure.

2.1 Ontology

[…] More simply, an ontology is a way of showing the properties of a subject area and how they are related, by defining a set of concepts and categories that represent the subject. [wikipedia]

Any target variable can (and should) be recorded in a standardised fashion. Identifying a domain-specific standard can be extremely tricky, because this usually involves social processes (people need to agree upon it); that is beyond the scope of this tutorial. For example, biological species are part of a complex taxonomy of terms that identify kingdoms, families, species, varieties and many more. Species are nested within families, and other hierarchical and synonymous relationships exist in such taxonomies. An ontology addresses all of these issues.
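
To make the idea of nested and synonymous concepts more tangible, the following is a minimal sketch in base R. It is not the data structure used by ontologics; the table layout, the `synonyms` vector and the `harmonise()` helper are hypothetical names chosen for illustration only.

```r
# Hypothetical toy taxonomy in plain base R (not the ontologics data structure).
taxa <- data.frame(
  concept = c("Poaceae", "Triticum aestivum", "Zea mays", "Fabaceae", "Glycine max"),
  class   = c("family", "species", "species", "family", "species"),
  broader = c(NA, "Poaceae", "Poaceae", NA, "Fabaceae"),
  stringsAsFactors = FALSE
)

# Alternative labels that should all resolve to one harmonised concept.
synonyms <- c("maize" = "Zea mays", "corn" = "Zea mays", "soybean" = "Glycine max")

# Resolve a label to its harmonised concept and look up the broader concept.
harmonise <- function(label) {
  concept <- if (label %in% names(synonyms)) synonyms[[label]] else label
  parent  <- taxa$broader[match(concept, taxa$concept)]
  c(concept = concept, broader = parent)
}

harmonise("corn")
```

Both "maize" and "corn" resolve to the same concept here, which in turn is nested within its family. An ontology records relationships of this kind in a systematic way.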

2.2 Gazetteer

A gazetteer is a geographical index or directory used in conjunction with a map or atlas… [wikipedia]

Territory names are recorded in a gazetteer, which can be regarded as an ontology of territorial names. Territorial names are typically nested within regions at several levels, such as counties that are nested in a nation. To showcase how a gazetteer (and, by extension, an ontology) should be configured, we build one for the United Nations geoscheme here.
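
Before the gazetteer itself is built, the nesting of territories can be sketched with a small lookup table in base R. This is only an illustration under simplifying assumptions: the three-row table and the `lineage()` helper are hypothetical and are not part of arealDB or ontologics.

```r
# Toy excerpt of the UN geoscheme in plain base R (not the ontologics data structure).
territories <- data.frame(
  concept = c("EUROPE", "Northern Europe", "Estonia"),
  class   = c("un_region", "un_subregion", "nation"),
  broader = c(NA, "EUROPE", "Northern Europe"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: follow the 'broader' links from a concept up to the top level.
lineage <- function(concept, table) {
  path <- concept
  parent <- table$broader[match(concept, table$concept)]
  while (!is.na(parent)) {
    path <- c(parent, path)
    parent <- table$broader[match(parent, table$concept)]
  }
  path
}

lineage("Estonia", territories)
```

Each territory stores only its direct parent, and the full nesting is recovered by walking up the `broader` column. This mirrors the `concept` and `broader` columns of the table defined in the code that follows.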

# The regions and subregions we want to include into the gazetteer
un_region <- c("AFRICA", "AMERICAS", "ANTARCTICA", "ASIA", "EUROPE", "OCEANIA")
                         
un_subregion <- tibble(concept = c(
  "Eastern Africa", "Middle Africa", "Northern Africa", "Southern Africa", "Western Africa",
  "Caribbean", "Central America", "Northern America", "Southern America",
  "Antarctica",
  "Central Asia", "Eastern Asia", "Southeastern Asia", "Southern Asia", "Western Asia",
  "Eastern Europe", "Northern Europe", "Southern Europe", "Western Europe",
  "Australia and New Zealand", "Melanesia", "Micronesia", "Polynesia"),
  broader = c(rep(un_region[1], 5), rep(un_region[2], 4), rep(un_region[3], 1),
             rep(un_region[4], 5), rep(un_region[5], 4), rep(un_region[6], 4)))

# To start building the gazetteer, we first need to record some meta-data ...
gazetteer <- 
  start_ontology(name = "gazetteer", 
                 path = paste0(DBDir, "/tables/"),
                 version = "1.0.0",
                 code = ".xxx",
                 description = "a UN geoscheme gazetteer",
                 homepage = "https://en.wikipedia.org/wiki/United_Nations_geoscheme",
                 license = "CC-BY-4.0",
                 notes = "This gazetteer nests each nation into the United Nations geoscheme.")

# ... then we can define some new source of information from which external concepts would be taken.
gazetteer <- 
  new_source(name = "gadm",
             date = Sys.Date(),
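             # paste() keeps the description on one line, without embedded line breaks and indentation
             description = paste("GADM wants to map the administrative areas of all countries, at all levels",
                                 "of sub-division. We provide data at high spatial resolutions that includes",
                                 "an extensive set of attributes."),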
             homepage = "https://gadm.org/index.html",
             license = "CC-BY",
             ontology = gazetteer)

# All items stored in an ontology must have some class that is ontology-specific and thus needs to be
# defined manually. Here we define the three classes 'un_region', 'un_subregion' and 'nation'.
gazetteer <- 
  new_class(new = "un_region", 
            target = NA,
            description = "region according to the UN geoscheme", 
            ontology = gazetteer) %>%
  new_class(new = "un_subregion", 
            target = "un_region",                               # un_subregion is nested into un_region
            description = "sub-region acco