postGGIR

Wei Guo

January 06, 2022

Introduction

What is GGIR?

GGIR is an R package for processing multi-day raw accelerometer data for physical activity and sleep research. GGIR writes all output files into two sub-directories, ./meta and ./results. GGIR is increasingly used by academic institutes across the world.

What is postGGIR?

postGGIR is an R package for processing accelerometer data after GGIR has been run. First, all R/Rmd/shell files needed for the processing are generated. In part 1, all csv files in the GGIR output directory are read, transformed and merged. In part 2, the GGIR output files are checked and summarized in one Excel sheet. In part 3, the merged data is cleaned according to the number of valid hours on each night and the number of valid days for each subject. In part 4, the cleaned activity data is imputed by the average ENMO over all the valid days for each subject. Finally, in parts 5-7, a comprehensive report of the data processing is created using R Markdown; the report includes a few exploratory plots and multiple commonly used features extracted from minute-level actigraphy data. This vignette provides a general introduction to postGGIR.
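The imputation rule of part 4 can be illustrated with a small sketch. It assumes that the data for one subject is held as a day-by-epoch matrix with NA marking missing epochs, and it reads "average over the valid days" as the mean of the same time-of-day epoch across valid days. The function name impute_by_day_mean and this layout are illustrative assumptions, not the package's internal implementation.

```r
# Hypothetical helper: fill missing ENMO values for ONE subject with the
# average of the same epoch (column) across that subject's valid days.
# enmo:       numeric matrix, rows = days, columns = epochs of the day
# valid_days: logical vector, TRUE for days that passed quality control
impute_by_day_mean <- function(enmo, valid_days = rep(TRUE, nrow(enmo))) {
  stopifnot(is.matrix(enmo), length(valid_days) == nrow(enmo))
  # Column means over valid days only, ignoring missing values
  epoch_mean <- colMeans(enmo[valid_days, , drop = FALSE], na.rm = TRUE)
  # Positions (row, col) of all missing cells
  missing <- which(is.na(enmo), arr.ind = TRUE)
  enmo[missing] <- epoch_mean[missing[, "col"]]
  enmo
}

# Toy example: 3 days x 4 epochs, one missing value on day 2
x <- matrix(c(10, 20, 30, 40,
              12, NA, 28, 44,
              14, 24, 32, 36), nrow = 3, byrow = TRUE)
imputed <- impute_by_day_mean(x)
imputed[2, 2]  # mean of 20 and 24
```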

Software Architecture

The R package postGGIR has been released under the open-source GPL-3 license on CRAN and runs on Windows and Linux. Parallel computing on Linux is recommended because of the memory required to read in multiple large data files. The package contains one primary function for users which, when run, generates all R/R Markdown/shell files needed for processing accelerometer data after running GGIR. These files load, read, transform and merge the long activity data; examine and summarize the GGIR outputs; clean the merged activity data according to the number of valid hours per night and the number of valid days per subject; impute the activity data by taking the average across the valid days for each subject; build a comprehensive report of data processing and exploratory plots; and extract multiple commonly used features and study the feature structure by covariance decomposition. Figure 1 presents a flowchart of the steps in this process, which are described in greater detail below. The procedure, R functions, inputs, and outputs are all described in this package vignette. More documentation and example data can be found in the postGGIR repository on GitHub (URL: https://github.com/dora201888/postGGIR).

Mirroring the GGIR structure of processing individual data files in multiple parts, the postGGIR package is split into seven parts, numbered 1 to 7, that group functionality in logical processing order. Parts 1 to 4 are dedicated to data processing. Parts 5 to 7 produce R Markdown reports of data cleaning, feature extraction, and unsupervised covariance decomposition via the joint and individual variation explained (JIVE) method, respectively. The seven parts are carried out sequentially, with milestone data automatically saved locally.

To use postGGIR, users first install and load the package and then run the create.postGGIR() function, which creates a single R script. Users then edit this script, Studyname_part0.maincall.R, to specify the arguments relevant for each of the seven parts. All optional arguments and their defaults are described in the package vignette. In addition, for users with access to a cluster, a shell script named part9_swarm.sh is created, which can parallelize the processing of individual files after minor modifications by the user. These modifications are described in the package vignette.

Computationally, part 1, in which the activity data in .csv format is transformed and merged, is the most time-consuming task, taking at least 60% of the processing time. Generally, part 1 takes about 10 to 30 minutes to process a file with 14 days of data recorded at 30 Hz on a GENEActiv device, using 36 processor cores at 2.3 GHz (Intel Gold 6140). All output created by each part is described in the package vignette. Briefly, the part 1 and part 2 output is saved in a directory structure with a depth of two, containing output data and summaries for all participants. The reports for parts 5 to 7 are saved in .html format and are generated from R Markdown (.Rmd) files. These .Rmd files are included in the output, giving users the flexibility to adapt the source code to their research purpose.
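The idea of processing files in bundles on a cluster can be sketched as follows. The bundle size of 20 corresponds to the default of the NfileEachBundle parameter described later in this vignette; the helper name make_bundles and the file names are made up for illustration, and the package's own bundling code may differ.

```r
# Hypothetical helper: split a vector of file names into bundles of
# at most `n_per_bundle` files, so each bundle can be processed as one job.
make_bundles <- function(files, n_per_bundle = 20) {
  stopifnot(n_per_bundle >= 1)
  bundle_id <- ceiling(seq_along(files) / n_per_bundle)
  split(files, bundle_id)
}

files <- sprintf("subject%03d.csv", 1:45)   # made-up file names
bundles <- make_bundles(files, n_per_bundle = 20)
lengths(bundles)   # bundle sizes: 20, 20, 5
```

Each bundle could then be merged in its own cluster job before the per-bundle results are combined.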

Figure 1: Overview of main steps and output in postGGIR workflow.

Software Dependencies

All postGGIR code is written in R, and the reports are generated in R Markdown. The R packages ActFrag and ActCR are used for the calculation of certain physical activity and circadian rhythmicity features. The R package r.jive is used to perform the feature interaction analysis and to study the joint and individual variation structure by JIVE.

Setting up your work environment

Install R and RStudio

Download and install R

Download and install RStudio (optional, but recommended)

Install postGGIR with its dependencies; you can do this with one command from the R console:

install.packages("postGGIR", dependencies = TRUE)

Prepare folder structure

  1. folder of .bin files for GGIR, or a file listing all .bin files
    • the R program checks for files missing from the GGIR output by comparing it with all raw .bin files
  2. folder of the GGIR output with two sub-folders
    • meta (./basic, ./csv, etc)
    • results (partsummary.csv)
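Before running postGGIR, the folder structure above can be checked with a few lines of R. This is a minimal sketch: the helper names check_ggir_output and find_unprocessed are hypothetical, and the matching rule (a processed file's name contains the raw file's name) is an assumption that may not match how postGGIR itself detects missing output.

```r
# Hypothetical helper: verify that a GGIR output folder contains the two
# expected sub-folders, meta and results.
check_ggir_output <- function(outputdir) {
  needed <- file.path(outputdir, c("meta", "results"))
  found <- dir.exists(needed)
  if (!all(found)) {
    stop("Missing sub-folder(s): ", paste(needed[!found], collapse = ", "))
  }
  invisible(TRUE)
}

# Hypothetical helper: list raw .bin files with no matching GGIR output.
# ASSUMPTION: a processed file's name contains the raw file's name.
find_unprocessed <- function(bin_files, ggir_files) {
  has_output <- vapply(basename(bin_files), function(b) {
    any(grepl(b, basename(ggir_files), fixed = TRUE))
  }, logical(1))
  bin_files[!has_output]
}

# Toy example with made-up file names
bins <- c("0001_a.bin", "0002_b.bin")
outs <- c("0001_a.bin.csv")
find_unprocessed(bins, outs)   # the raw file lacking output
```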

Quick start

Create a template R script of postGGIR

library(postGGIR)
create.postGGIR()

The function creates a template R script of postGGIR in the current directory, named STUDYNAME_part0.maincall.R.

cat STUDYNAME_part0.maincall.R
options(width=2000) 
argv = commandArgs(TRUE);  
print(argv) 
print(paste("length=",length(argv),sep=""))  
mode<-as.numeric(argv[1])  
print(c("mode =", mode))
# (Note) Please remove the above lines if you are running this within R console 
#        instead of submitting jobs to a cluster.
 
#########################################################################   
# (user-define 1) you need to redefine this according different study!!!!
######################################################################### 
# example 1 
filename2id.1<-function(x)  unlist(strsplit(x,"\\."))[1] 
 
#  example 2 (use csv file =c("filename","newID")) 
filename2id.2<-function(x) {
  d<-read.csv("./postGGIR/inst/example/filename2id.csv",head=1,stringsAsFactors=F)
  y1<-which(d[,"filename"]==x)
  if (length(y1)==0) stop(paste("Missing ",x," in filename2id.csv file",sep=""))
  if (length(y1)>=1) y2<-d[y1[1],"newID"] 
  return(as.character(y2))
} 


#########################################################################  
#  main call
######################################################################### 
  
call.afterggir<-function(mode,filename2id=filename2id.1){ 

library(postGGIR) 
#########################################################################  
# (user-define 2) Fill in parameters of your ggir output
########################################################################## 
  
currentdir = 
studyname =
bindir = 
outputdir =  
setwd(currentdir) 

rmDup=FALSE   # keep all subjects in postGGIR
PA.threshold=c(50,100,400)
part5FN="WW_L50M125V500_T5A5" 
epochIn = 5
epochOut = 5
flag.epochOut = 60
use.cluster = FALSE
log.multiplier = 9250
QCdays.alpha = 7
QChours.alpha = 16 
useIDs.FN<-NULL 
Rversion="R" 
desiredtz="US/Eastern" 
RemoveDaySleeper=FALSE 
QCnights.feature.alpha = c(0,0)
NfileEachBundle=20 
trace=FALSE
#########################################################################  
#   remove duplicate sample IDs for plotting and feature extraction 
######################################################################### 
if (mode==3 & rmDup){
# step 1: read ./summary/*remove_temp.csv file (output of mode=2)
keep.last<-TRUE #keep the latest visit for each sample
sumdir<-paste(currentdir,"/summary",sep="")  
setwd(sumdir)  
inFN<-paste(studyname,"_samples_remove_temp.csv",sep="")
useIDs.FN<-paste(sumdir,"/",studyname,"_samples_remove.csv",sep="") 

#########################################################################  
# (user-define 3 as rmDup=TRUE)  create useIDs.FN file
######################################################################### 
# step 2: create the ./summary/*remove.csv file manually or by R commands
d<-read.csv(inFN,head=1,stringsAsFactors=F)
d<-d[order(d[,"Date"]),]
d<-d[order(d[,"newID"]),]
d[which(is.na(d[,"newID"])),]
S<-duplicated(d[,"newID"],fromLast=keep.last) #keep the last copy for nccr
d[S,"duplicate"]<-"remove"
write.csv(d,file=useIDs.FN,row.names=F)

} 

#########################################################################  
#   call afterggir
######################################################################### 

setwd(currentdir)  
afterggir(mode=mode,
          useIDs.FN=useIDs.FN,
          currentdir=currentdir,
          studyname=studyname,
          bindir=bindir,
          outputdir=outputdir,
          epochIn=epochIn,
          epochOut=epochOut,
          flag.epochOut=flag.epochOut,
          log.multiplier=log.multiplier,
          use.cluster=use.cluster,
          QCdays.alpha=QCdays.alpha,
          QChours.alpha=QChours.alpha,
          QCnights.feature.alpha=QCnights.feature.alpha, 
          Rversion=Rversion,
          filename2id=filename2id,
          PA.threshold=PA.threshold,
          desiredtz=desiredtz,
          RemoveDaySleeper=RemoveDaySleeper,
          part5FN=part5FN,
          NfileEachBundle=NfileEachBundle,
          trace=trace) 

} 
#########################################################################
          call.afterggir(mode)   
######################################################################### 

#   Note:   call.afterggir(mode)
#        mode = 0 : create R/Rmd/sh files
#        mode = 1 : data transform using cluster or not
#        mode = 2 : summary
#        mode = 3 : clean 
#        mode = 4 : impute

Edit the R script

Three places are marked as “user-define” in the STUDYNAME_part0.maincall.R file and need to be edited by the user. After editing, please rename the file, replacing STUDYNAME with your real study name.

1. Define the function filename2id( )

This user-defined function converts the filename of the raw accelerometer file to a short ID. For example, the first example is intended to change “0002__026907_2016-03-11 13-05-59.bin” to the new ID “0002”; note that splitting on “.” alone only strips the file extension, so adapt the split pattern to your file naming scheme. If you prefer to define the new ID in another way, you can create a .CSV file that includes at least the columns “filename” and “newID” and then define this function as in the second example. The new variable “newID”, included in the output files, can be used as the key ID in the summary report of postGGIR and also to define duplicate samples.
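As an illustration, a function of the following kind would produce the ID “0002” for the file name quoted above. It is a sketch that assumes file names start with the subject ID followed by an underscore; the name filename2id.underscore is hypothetical and not part of the template. Splitting on “.” alone would only drop the file extension.

```r
# Hypothetical variant: take the text before the first underscore as the ID.
# ASSUMPTION: file names start with the subject ID followed by "_".
filename2id.underscore <- function(x) {
  unlist(strsplit(basename(x), "_", fixed = TRUE))[1]
}

filename2id.underscore("0002__026907_2016-03-11 13-05-59.bin")  # "0002"
```

Such a function could then be passed to call.afterggir() through its filename2id argument in place of filename2id.1.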

2. Parameters of the R script

The user needs to define the following parameters:

| Variables | Description |
|---|---|
| rmDup | Set rmDup = TRUE to remove some samples, such as duplicates. Set rmDup = FALSE to keep all samples. |
| mode | Specify which of the five parts needs to be run. When mode = 0, all R/Rmd/sh files are generated for the other parts. When mode = 1, all csv files in the GGIR output directory are read, transformed and merged. When mode = 2, the GGIR output files are checked and summarized in one Excel sheet. When mode = 3, the merged data is cleaned according to the number of valid hours on each night and the number of valid days for each subject. When mode = 4, the cleaned data is imputed. |
| useIDs.FN | File name, with or without directory, of the sample information in CSV format, which must include at least “filename” and “duplicate” in the header. If duplicate=“remove”, the accelerometer file will not be used in the data analysis of parts 5-7. Default is NULL, which means that all accelerometer files are used in parts 5-7. |
| currentdir | Directory where the output needs to be stored. Note that this directory must exist. |
| studyname | Study name used in the output file names. |
| bindir | Directory where the accelerometer files are stored, or a file listing them. |
| outputdir | Directory where the GGIR output is stored. |
| epochIn | Epoch size (seconds) to which acceleration was averaged in the GGIR output. Default is 5 seconds. |
| epochOut | Epoch size (seconds) to which acceleration is averaged in part 1. Default is 5 seconds. |
| flag.epochOut | Epoch size (seconds) to which acceleration is averaged in part 3. Default is 60 seconds. |
| log.multiplier | The coefficient used in the log transformation of the ENMO data, i.e. log(log.multiplier * ENMO + 1), which is used in parts 5-7. Default is 9250. |
| use.cluster | Specify whether part 1 is done by parallel computing. Default is TRUE, in which case the CSV files in the GGIR output are first merged in bundles of 20 files and then combined. |
| QCdays.alpha | Minimum required number of valid days in subject-specific analysis as a quality control step in part 2. Default is 7 days. |
| QChours.alpha | Minimum required number of valid hours in day-specific analysis as a quality control step in part 2. Default is 16 hours. |
| QCnights.feature.alpha | Minimum required number of valid nights in day-specific mean and SD analysis as a quality control step in the JIVE analysis. Default is c(0,0), i.e. no additional data cleaning in this step. |
| Rversion | R version, e.g. “R/3.6.3”. Default is “R”. |
| filename2id | User-defined function for converting filenames to sample IDs. Default is NULL. |
| PA.threshold | Thresholds for light, moderate and vigorous physical activity. Default is c(50,100,400). |
| desiredtz | Desired timezone: see also https://en.wikipedia.org/wiki/Zone.tab. Used in g.inspectfile(). Default is “US/Eastern”. |
| RemoveDaySleeper | Specify whether the daysleeper nights are removed from the calculation of the number of valid days for each subject. Default is FALSE. |
| part5FN | Specify which output of GGIR part 5 is used. Default is “WW_L50M125V500_T5A5”, which means that part5_daysummary_WW_L50M125V500_T5A5.csv and part5_personsummary_WW_L50M125V500_T5A5.csv are used in the analysis. |
| NfileEachBundle | Number of files in each bundle when the csv data is read and processed in a cluster. Default is 20. |
| trace | Specify whether intermediate results are printed when the function is executed. Default is FALSE. |
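The log transformation controlled by log.multiplier can be written out directly from the formula above, log(log.multiplier * ENMO + 1). The function name log_enmo and the input values are made up for illustration; this is a sketch of the formula, not the package's own code.

```r
# Log transformation of ENMO as described for log.multiplier:
# log(log.multiplier * ENMO + 1)
log_enmo <- function(enmo, log.multiplier = 9250) {
  stopifnot(is.numeric(enmo), all(enmo >= 0, na.rm = TRUE))
  log(log.multiplier * enmo + 1)
}

enmo <- c(0, 0.01, 0.05, NA)   # made-up ENMO values
log_enmo(enmo)                 # 0 maps to 0; NA stays NA
```

Adding 1 inside the logarithm keeps epochs with zero activity at zero instead of minus infinity.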

3. Subset of samples (optional)

The postGGIR package not only transforms and merges the activity and sleep data, but can also carry out some preliminary data analysis such as principal component analysis and feature extraction. Therefore, basic data cleaning is performed first, as follows:

  • clean the data by removing invalid days/samples, defined by the minimum required number of valid hours/days in the activity data
  • remove duplicate samples
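The first cleaning rule can be sketched with the QChours.alpha and QCdays.alpha thresholds from the parameter table. The data layout (one row per subject-day with columns id and valid_hours) and the helper name qc_filter are assumptions made for this illustration; the package works on its own merged data files.

```r
# Hypothetical sketch of the two-level quality control rule.
# days: data frame with one row per subject-day and columns
#       id (subject ID) and valid_hours (valid hours on that day)
qc_filter <- function(days, QChours.alpha = 16, QCdays.alpha = 7) {
  # Step 1: keep days with enough valid hours
  ok_days <- days[which(days$valid_hours >= QChours.alpha), , drop = FALSE]
  # Step 2: keep subjects with enough remaining valid days
  n_valid <- table(ok_days$id)
  keep_ids <- names(n_valid)[n_valid >= QCdays.alpha]
  ok_days[ok_days$id %in% keep_ids, , drop = FALSE]
}

# Toy data: subject A has 7 good days, subject B has only 3
days <- data.frame(
  id = c(rep("A", 8), rep("B", 4)),
  valid_hours = c(24, 20, 18, 16, 23, 22, 17, 10,  24, 24, 24, 5),
  stringsAsFactors = FALSE
)
cleaned <- qc_filter(days)
unique(cleaned$id)   # only subject A remains
nrow(cleaned)        # 7 valid days
```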

If you prefer to use all samples, just skip this part and keep the default rmDup=FALSE. Otherwise, if you want to remove some samples such as duplicates, there are two ways to do so:

  • Edit the R code of “step 2” in this part. For example, the template keeps the latest copy of duplicate samples.
  • Remove the R code of “step 2” in this part, and create the studyname_samples_remove.csv file by filling in “remove” in the “duplicate” column of the template file studyname_samples_remove_temp.csv. The data will be kept unless duplicate=“remove”.
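The “step 2” logic of the template, which keeps the latest copy of each duplicated ID, can be tried on a toy sample sheet. The column names follow the template (“filename”, “newID”, “Date”, “duplicate”); the values are made up for illustration.

```r
# Toy sample sheet with made-up values; subject "0002" was recorded twice
d <- data.frame(
  filename = c("a.bin", "b.bin", "c.bin"),
  newID    = c("0001", "0002", "0002"),
  Date     = c("2016-03-01", "2016-03-11", "2016-09-20"),
  duplicate = "",
  stringsAsFactors = FALSE
)

# Sort by ID and date, then flag every copy except the latest one
d <- d[order(d$newID, d$Date), ]
is_dup <- duplicated(d$newID, fromLast = TRUE)
d$duplicate[is_dup] <- "remove"
d[, c("filename", "newID", "duplicate")]
```

With fromLast = TRUE, duplicated() marks the earlier rows of each ID, so the earlier recording b.bin is flagged and the latest one, c.bin, is kept.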

Run R script

call.afterggir(mode,filename2id)   
| Variables | Description |
|---|---|
| mode | Specify which of the five parts needs to be run. When mode = 0, all R/Rmd/sh files are generated for the other parts. When mode = 1, all csv files in the GGIR output directory are read, transformed and merged. When mode = 2, the GGIR output files are checked and summarized in one Excel sheet. When mode = 3, the merged data is cleaned according to the number of valid hours on each night and the number of valid days for each subject. When mode = 4, the cleaned data is imputed. |
| filename2id | User-defined function that converts the filename of the raw accelerometer file to the short ID for the purpose of identifying duplicate IDs. |
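When working in the R console rather than on a cluster, the five modes can be run one after another. The sketch below uses a stand-in function in place of the call.afterggir() defined in the edited template, so that it runs on its own; the helper run_modes is hypothetical, and stopping at the first failure is a design choice based on the parts being carried out sequentially.

```r
# Stand-in for the call.afterggir() defined in the edited template, used
# here only so the sketch runs on its own; it just reports the mode.
call.afterggir.stub <- function(mode, filename2id = NULL) {
  message("running mode ", mode)
  invisible(mode)
}

# Hypothetical helper: run the modes in order and stop at the first
# failure, returning the modes that completed.
run_modes <- function(fun, modes = 0:4, ...) {
  done <- integer(0)
  for (m in modes) {
    ok <- tryCatch({ fun(m, ...); TRUE },
                   error = function(e) {
                     message("mode ", m, " failed: ", conditionMessage(e))
                     FALSE
                   })
    if (!ok) break
    done <- c(done, m)
  }
  done
}

completed <- run_modes(call.afterggir.stub)
completed   # modes that finished
```

In real use, the stand-in would be replaced by the call.afterggir() function from the edited script, with the command-line argument lines at the top of the script removed as noted in the template.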

Run script in a cluster

#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
  source ~/.bash_profile

   cd /postGGIR/inst/example/afterGGIR; 
   module load R ; 
     R --no-save --no-restore --args  < studyname_ggir9s_postGGIR.pipeline.maincall.R  0
     R --no-save --no-restore --args  < studyname_ggir9s_postGGIR.pipeline.maincall.R  1
     R --no-save --no-restore --args  < studyname_ggir9s_postGGIR.pipeline.maincall.R  2
     R --no-save --no-restore --args  < studyname_ggir9s_postGGIR.pipeline.maincall.R  3
     R --no-save --no-restore --args  < studyname_ggir9s_postGGIR.pipeline.maincall.R  4 

     R -e "rmarkdown::render('part5_studyname_postGGIR.report.Rmd'   )" 
     R -e "rmarkdown::render('part6_studyname_postGGIR.nonwear.report.Rmd'   )" 
     R -e "rmarkdown::render('part7a_studyname_postGGIR_JIVE_1_somefeatures.Rmd'   )" 
     R -e "rmarkdown::render('part7b_studyname_postGGIR_JIVE_2_allfeatures.Rmd'