How to add a sentence annotation to an indexed corpus

Andreas Blätte (andreas.blaette@uni-due.de)

2021-06-11

About

Adding an annotation of sentences as a structural attribute to an existing is a frequent scenario. This vignette offers a basic recipe for a corpus that already includes a part-of-speech annotation. The GermaParlSamplee corpus serves as an example.

Getting started

In addition to the cwbtools package, we use the functionality of the RcppCWB package to decode the p-attribute ‘pos’. The same could be achieved using the higher-level get_token_stream() method of the polmineR package, but we want to avoid creating an additional dependency of this package.

library(cwbtools)
library(RcppCWB)

For the purposes of this vigneette, ee will work with a temporary corpy of the corpus we wish to augment, so we create a temporary corpus directory.

registry_dir_tmp <- fs::path(tempdir(), "registry_dir_tmp")
corpus_dir_tmp <- fs::path(tempdir(), "corpus_dir_tmp")

dir.create(path = registry_dir_tmp)
dir.create(path = corpus_dir_tmp)

regdir_envvar <- Sys.getenv("CORPUS_REGISTRY")
Sys.setenv(CORPUS_REGISTRY = registry_dir_tmp)

The GermaParlSample corpus can be downloaded from Zenodo as follows.

is_corpus_available <- cwbtools::corpus_install(
  doi = "10.5281/zenodo.3823245",
  registry_dir = registry_dir_tmp, corpus_dir = corpus_dir_tmp,
  verbose = FALSE
)
list.files(file.path(corpus_dir_tmp, "germaparlsample"))
##  [1] "date.avs"         "date.avx"         "date.rng"         "interjection.avs"
##  [5] "interjection.avx" "interjection.rng" "party.avs"        "party.avx"       
##  [9] "party.rng"        "pos.corpus.cnt"   "pos.crc"          "pos.crx"         
## [13] "pos.hcd"          "pos.huf"          "pos.huf.syn"      "pos.lexicon"     
## [17] "pos.lexicon.idx"  "pos.lexicon.srt"  "speaker.avs"      "speaker.avx"     
## [21] "speaker.rng"      "template.json"    "word.corpus.cnt"  "word.crc"        
## [25] "word.crx"         "word.hcd"         "word.huf"         "word.huf.syn"    
## [29] "word.lexicon"     "word.lexicon.idx" "word.lexicon.srt"

Recipe

We generate the data for the sentence annotation from the part-of-speech annotation that is already present.

At first, we decode the p-attribute “pos”.

germaparl_size <- cl_attribute_size(
  corpus = "GERMAPARLSAMPLE",
  attribute = "word", attribute_type = "p"
)
cpos_vec <- seq.int(from = 0L, to = germaparl_size - 1L)
ids <- cl_cpos2id(corpus = "GERMAPARLSAMPLE", p_attribute = "pos", cpos = cpos_vec)
pos <- cl_id2str(corpus = "GERMAPARLSAMPLE", p_attribute = "pos", id = ids)

The pos-tag for Stuttgart Tuebingen Tag Set (STTS) is a “$.”. From this information, we can generate a region matrix with start and end corpus positions of sentences easily.

sentence_end <- grep("\\$\\.", pos)
sentence_factor <- cut(x = cpos_vec, breaks = c(0L, sentence_end), include.lowest = TRUE, right = FALSE)
sentences_cpos <- unname(split(x = cpos_vec, f = sentence_factor))
region_matrix <- do.call(rbind, lapply(sentences_cpos, function(cpos) c(cpos[1L], cpos[length(cpos)])))

So let us see what this looks like …

head(region_matrix)
##      [,1] [,2]
## [1,]    0    9
## [2,]   10   25
## [3,]   26   35
## [4,]   36   52
## [5,]   53   64
## [6,]   65   86

And this is how the new annotation layer is written back to the corpus.

s_attribute_encode(
  values = as.character(seq.int(from = 0L, to = nrow(region_matrix) - 1L)),
  data_dir = registry_file_parse(corpus = "GERMAPARLSAMPLE")[["home"]],
  s_attribute = "s",
  corpus = "GERMAPARLSAMPLE",
  region_matrix = region_matrix,
  method = "R",
  registry_dir = Sys.getenv("CORPUS_REGISTRY"),
  encoding = registry_file_parse(corpus = "GERMAPARLSAMPLE")[["properties"]][["charset"]],
  delete = TRUE,
  verbose = TRUE
)
## ... adding s-attribute 's' to registry
## Corpus to delete (ID): GERMAPARLSAMPLE
## Corpus name: GermaParlSample: Sample subset of the GermaParl corpus of plenary protocols of the German Bundestag
## Number of loads before reset: 5
## Number of loads resetted: 1

Checking the result

To see whether everything has worked, we get the left and right boundaries of the sentence with corpus position 60.

left <- cl_cpos2lbound("GERMAPARLSAMPLE", cpos = 60, s_attribute = "s")
right <- cl_cpos2rbound("GERMAPARLSAMPLE", cpos = 60, s_attribute = "s")
ids <- cl_cpos2id("GERMAPARLSAMPLE", cpos = left:right, p_attribute = "word")
cl_id2str("GERMAPARLSAMPLE", p_attribute = "word", id = ids)
##  [1] "Ich"      "bin"      "am"       "Sonntag"  ","        "dem"     
##  [7] "1."       "Dezember" "1935"     ","        "geboren"  "."

Cleaning up

As a matter of housekeeping, we remove temporary directories and restore value of environment variable CORPUS_REGISTRY.

unlink(corpus_dir_tmp)
unlink(registry_dir_tmp)

Sys.setenv(CORPUS_REGISTRY = regdir_envvar)

Next Steps

Using the part-of-speech annotation is a basic approach to obtain the data for annotation sentences. An alternative would be to use the NLP annotation machinery of an integrated tool such as Stanford CoreNLP, or OpenNLP. But that’s a different story to be told.