GGIR is an R package to process multi-day raw accelerometer data for physical activity and sleep research. GGIR writes all output files into two sub-directories, ./meta and ./results. GGIR is increasingly used by academic institutes across the world.
postGGIR is an R package for data processing after running GGIR on accelerometer data. In detail, it generates all necessary R/Rmd/shell files for this post-processing. In part 1, all csv files in the GGIR output directory are read, transformed and then merged. In part 2, the GGIR output files are checked and summarized in one Excel sheet. In part 3, the merged data is cleaned according to the number of valid hours on each night and the number of valid days for each subject. In part 4, the cleaned activity data is imputed by the average ENMO over all the valid days for each subject. Finally, in parts 5-7, a comprehensive report of data processing is created using R Markdown; the report includes a few exploratory plots and multiple commonly used features extracted from minute-level actigraphy data. This vignette provides a general introduction to postGGIR.
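The part-4 imputation rule described above can be sketched in a few lines of base R. The days-by-minutes matrix below is simulated for illustration; this is a sketch of the rule, not postGGIR's internal code:

```r
# Sketch of the part-4 imputation rule: a missing minute is filled with the
# average activity at that minute across the subject's valid days.
act <- matrix(c(0.10, 0.20, 0.30,
                0.12,   NA, 0.28,
                0.08, 0.24,   NA), nrow = 3, byrow = TRUE)  # days x minutes

minute.mean <- colMeans(act, na.rm = TRUE)   # per-minute average over valid days
for (j in seq_len(ncol(act))) {
  miss <- is.na(act[, j])
  act[miss, j] <- minute.mean[j]             # impute missing entries column-wise
}
act[2, 2]  # 0.22, the mean of 0.20 and 0.24
```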
The R package postGGIR has been released under an open-source GPL-3 license on CRAN, and postGGIR can run on both Windows and Linux. Parallel computing on Linux is recommended due to the memory required to read multiple large data files. The package contains one primary function for users which, when run, generates all necessary R/R Markdown/shell executable files for data processing after running GGIR for accelerometer data; loads, reads, transforms and merges long activity data; examines and summarizes GGIR outputs; cleans the merged activity data according to the number of valid hours per night and the number of valid days per subject; imputes activity data by taking the average across the valid days for each subject; builds a comprehensive report of data processing and exploratory plots; and extracts multiple commonly used features and studies the feature structure by covariance decomposition. Figure 1 presents a flowchart for each step in this process, which is described in greater detail below. The procedure, R functions, inputs, and outputs are all described in this package vignette. In addition, more documentation and example data can be found in the postGGIR repository on GitHub (URL: https://github.com/dora201888/postGGIR).
Mirroring the GGIR structure of processing individual data files in multiple parts, the postGGIR package is split into seven parts, grouping functionalities in logical processing order. The parts are numbered from 1 to 7. Parts 1 to 4 are dedicated to data processing. Parts 5 to 7 are dedicated to producing R Markdown reports of data cleaning, feature extraction, and unsupervised covariance decomposition via the joint and individual variance explained (JIVE) method, respectively. These seven parts are carried out sequentially, with milestone data automatically saved locally. To use postGGIR, the first step is to install and load the postGGIR package. Then, users run the create.postGGIR() function, which creates a single R script. The newly created R script, Studyname_part0.maincall.R, is then edited by users, allowing for the specification of arguments relevant to each of the seven parts. All optional arguments and their defaults are described in the package vignette. In addition, for users with access to a cluster for parallel processing, a shell script named part9_swarm.sh is created which can parallelize the processing of individual files with minor modifications by the user. These modifications are described in the package vignette. Computationally, part 1, in which the activity data in .csv format are transformed and merged, is the most time-consuming task, taking up at least 60% of the processing time. Generally, part 1 takes about 10~30 minutes to process a file with 14 days of data recorded at 30 Hz on a GeneActiv device, using 36 × 2.3 GHz processor cores (Intel Gold 6140). All output created in each part is described in the package vignette. Briefly, part 1 and part 2 output is saved using a directory structure with a depth of two, containing output data and summaries for all participants. The reports for parts 5 to 7 are saved in .html format and are generated from R Markdown (.Rmd) files.
These .Rmd files are included in the output, giving users the flexibility to adapt the source code to their research purposes.
Figure 1: Overview of main steps and output in postGGIR workflow.
All postGGIR code is written in R, and all reports are generated in R Markdown. The R packages ActFrag and ActCR are used for the calculation of certain physical activity and circadian rhythmicity features. The R package r.jive is used to perform the feature interaction analysis and to study the joint and individual variation structure by JIVE.
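As a minimal illustration of the kind of circadian rhythmicity feature these packages compute, relative amplitude (RA) can be derived from minute-level activity as below. This is an illustrative base-R re-implementation on simulated data, not the ActCR code path:

```r
# Relative amplitude (RA) from one day of minute-level activity (simulated).
set.seed(1)
minutes <- 1440
enmo <- pmax(0, sin(2 * pi * (1:minutes) / minutes) * 50 + rnorm(minutes, 0, 5))

# Rolling mean over a window of w minutes, via cumulative sums.
roll.mean <- function(x, w) {
  cs <- cumsum(c(0, x))
  (cs[(w + 1):length(cs)] - cs[1:(length(cs) - w)]) / w
}

M10 <- max(roll.mean(enmo, 600))  # most active 10 consecutive hours
L5  <- min(roll.mean(enmo, 300))  # least active 5 consecutive hours
RA  <- (M10 - L5) / (M10 + L5)    # bounded in [0, 1]
print(RA)
```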
Download and install RStudio (optional, but recommended)
Download postGGIR with its dependencies; you can do this with one command from the R console:
install.packages("postGGIR", dependencies = TRUE)
library(postGGIR)
create.postGGIR()
The function will create a template shell script for postGGIR in the current directory, named STUDYNAME_part0.maincall.R.
cat STUDYNAME_part0.maincall.R
options(width=2000)
argv = commandArgs(TRUE);
print(argv)
print(paste("length=",length(argv),sep=""))
mode<-as.numeric(argv[1])
print(c("mode =", mode))
# (Note) Please remove the above lines if you are running this within R console
# instead of submitting jobs to a cluster.
#########################################################################
# (user-define 1) you need to redefine this according different study!!!!
#########################################################################
# example 1
filename2id.1<-function(x) unlist(strsplit(x,"\\_"))[1]

# example 2 (use csv file =c("filename","ggirID"))
filename2id.2<-function(x) {
  d<-read.csv("./postGGIR/inst/example/filename2id.csv",head=1,stringsAsFactors=F)
  y1<-which(d[,"filename"]==x)
  if (length(y1)==0) stop(paste("Missing ",x," in filename2id.csv file",sep=""))
  if (length(y1)>=1) y2<-d[y1[1],"newID"]
  return(as.character(y2))
}
#########################################################################
# main call
#########################################################################
call.afterggir<-function(mode,filename2id=filename2id.1){
library(postGGIR)
#########################################################################
# (user-define 2) Fill in parameters of your ggir output
##########################################################################
currentdir =
studyname =
bindir =
outputdir =
setwd(currentdir)

rmDup=FALSE   # keep all subjects in postGGIR
PA.threshold=c(50,100,400)
part5FN="WW_L50M125V500_T5A5"
epochIn = 5
epochOut = 5
flag.epochOut = 60
use.cluster = FALSE
log.multiplier = 9250
QCdays.alpha = 7
QChours.alpha = 16
QCnights.feature.alpha = c(0,0)
useIDs.FN<-NULL
Rversion="R"
desiredtz="US/Eastern"
RemoveDaySleeper=FALSE
NfileEachBundle=20
trace=FALSE
#########################################################################
# remove duplicate sample IDs for plotting and feature extraction
#########################################################################
if (mode==3 & rmDup){
# step 1: read ./summary/*remove_temp.csv file (output of mode=2)
keep.last<-TRUE   # keep the latest visit for each sample
sumdir<-paste(currentdir,"/summary",sep="")
setwd(sumdir)
inFN<-paste(studyname,"_samples_remove_temp.csv",sep="")
useIDs.FN<-paste(sumdir,"/",studyname,"_samples_remove.csv",sep="")

#########################################################################
# (user-define 3 as rmDup=TRUE) create useIDs.FN file
#########################################################################
# step 2: create the ./summary/*remove.csv file manually or by R commands
d<-read.csv(inFN,head=1,stringsAsFactors=F)
d<-d[order(d[,"Date"]),]
d<-d[order(d[,"newID"]),]
d<-d[which(!is.na(d[,"newID"])),]   # drop rows with a missing newID
S<-duplicated(d[,"newID"],fromLast=keep.last)   # keep the last copy for each newID
d[S,"duplicate"]<-"remove"
write.csv(d,file=useIDs.FN,row.names=F)
}
#########################################################################
# call afterggir
#########################################################################
setwd(currentdir)
afterggir(mode=mode,
useIDs.FN=useIDs.FN,
currentdir=currentdir,
studyname=studyname,
bindir=bindir,
outputdir=outputdir,
epochIn=epochIn,
epochOut=epochOut,
flag.epochOut=flag.epochOut,
log.multiplier=log.multiplier,
use.cluster=use.cluster,
QCdays.alpha=QCdays.alpha,
QChours.alpha=QChours.alpha,
QCnights.feature.alpha=QCnights.feature.alpha,
Rversion=Rversion,
filename2id=filename2id,
PA.threshold=PA.threshold,
desiredtz=desiredtz,
RemoveDaySleeper=RemoveDaySleeper,
part5FN=part5FN,
NfileEachBundle=NfileEachBundle,
trace=trace)
}
#########################################################################
call.afterggir(mode)
#########################################################################
# Note: call.afterggir(mode)
# mode = 0 : create sw/Rmd files
# mode = 1 : data transform (using a cluster or not)
# mode = 2 : summary
# mode = 3 : clean
# mode = 4 : imputation
Three places are marked as “user-define” and need to be edited by the user in the STUDYNAME_part0.maincall.R file. Please rename the file by replacing STUDYNAME with your actual study name after editing.
This user-defined function changes the filename of the raw accelerometer file to a short ID. For example, the first example changes “0002__026907_2016-03-11 13-05-59.bin” to the new ID “0002”. If you prefer to define the new ID in another way, you can create a .CSV file including at least the columns “filename” and “newID”, and then define this function as in the second example. The new variable “newID”, included in the output files, can be used as the key ID in the summary report of postGGIR, and is also used to identify duplicate samples.
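As a quick sanity check, the first example function can be tried directly in the R console on a filename of the form shown above:

```r
# Convert a raw accelerometer filename to a short ID by taking the text
# before the first underscore (the "example 1" approach).
filename2id.1 <- function(x) unlist(strsplit(x, "\\_"))[1]

filename2id.1("0002__026907_2016-03-11 13-05-59.bin")
# "0002"
```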
The user needs to define the following parameters:
Variables | Description |
---|---|
rmDup | Set rmDup = TRUE if the user wants to remove some samples such as duplicates. Set rmDup = FALSE if the user wants to keep all samples. |
mode | Specify which of the five parts needs to be run, e.g. mode = 0 generates all R/Rmd/sh files for the other parts. When mode = 1, all csv files in the GGIR output directory are read, transformed and then merged. When mode = 2, the GGIR output files are checked and summarized in one Excel sheet. When mode = 3, the merged data is cleaned according to the number of valid hours on each night and the number of valid days for each subject. When mode = 4, the cleaned data is imputed. |
useIDs.FN | Filename (with or without directory) of the sample information file in CSV format, which includes at least the columns “filename” and “duplicate”. If duplicate=“remove”, the corresponding accelerometer files are not used in the data analysis of parts 5-7. Default is NULL, in which case all accelerometer files are used in parts 5-7. |
currentdir | Directory where the output needs to be stored. Note that this directory must exist. |
studyname | Specify the study name that is used in the output file names |
bindir | Directory where the accelerometer files are stored, or a list of such files |
outputdir | Directory where the GGIR output was stored. |
epochIn | Epoch size to which acceleration was averaged (seconds) in the GGIR output. Default is 5 seconds. |
epochOut | Epoch size to which acceleration is averaged (seconds) in part 1. Default is 5 seconds. |
flag.epochOut | Epoch size to which acceleration is averaged (seconds) in part 3. Default is 60 seconds. |
log.multiplier | The coefficient used in the log transformation of the ENMO data, i.e. log(log.multiplier * ENMO + 1), which is used in parts 5-7. Default is 9250. |
use.cluster | Specify whether part 1 is run with parallel computing. Default is TRUE, in which case the CSV files in the GGIR output are first merged in bundles of 20 files, and then combined across all bundles. |
QCdays.alpha | Minimum required number of valid days in the subject-specific analysis, as a quality-control step in part 2. Default is 7 days. |
QChours.alpha | Minimum required number of valid hours in the day-specific analysis, as a quality-control step in part 2. Default is 16 hours. |
QCnights.feature.alpha | Minimum required number of valid nights in the day-specific mean and SD analysis, as a quality-control step in the JIVE analysis. Default is c(0,0), i.e. no additional data cleaning in this step. |
Rversion | R version, e.g. “R/3.6.3”. Default is “R”. |
filename2id | User-defined function for converting filenames to sample IDs. Default is NULL. |
PA.threshold | Threshold for light, moderate and vigorous physical activity. Default is c(50,100,400). |
desiredtz | desired timezone: see also https://en.wikipedia.org/wiki/Zone.tab. Used in g.inspectfile(). Default is “US/Eastern”. |
RemoveDaySleeper | Specify if the daysleeper nights are removed from the calculation of number of valid days for each subject. Default is FALSE. |
part5FN | Specify which output is used from the GGIR part 5 results. Default is “WW_L50M125V500_T5A5”, which means that part5_daysummary_WW_L50M125V500_T5A5.csv and part5_personsummary_WW_L50M125V500_T5A5.csv are used in the analysis. |
NfileEachBundle | Number of files in each bundle when the csv data were read and processed in a cluster. Default is 20. |
trace | Specify whether intermediate results are printed when the function is executed. Default is FALSE. |
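To make the epochIn/flag.epochOut and log.multiplier parameters concrete, the following base-R sketch averages 5-second epochs into 60-second epochs and applies the log transformation used in parts 5-7. The ENMO values are simulated; this illustrates the parameter meanings, not postGGIR's internal code:

```r
# Simulated 5-second ENMO values for one minute of wear time.
set.seed(42)
enmo.5s <- abs(rnorm(12, mean = 0.05, sd = 0.02))   # 12 x 5s = 60s

# Aggregate from epochIn = 5 seconds to a 60-second epoch by averaging.
epochIn  <- 5
epochOut <- 60
n.per.epoch <- epochOut / epochIn                   # 12 input epochs per output epoch
enmo.60s <- mean(enmo.5s[1:n.per.epoch])

# Log transformation used in parts 5-7: log(log.multiplier * ENMO + 1).
log.multiplier <- 9250
enmo.log <- log(log.multiplier * enmo.60s + 1)
print(c(enmo.60s, enmo.log))
```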
The postGGIR package does not simply transform and merge the activity and sleep data; it can also perform some preliminary data analysis such as principal component analysis and feature extraction. Therefore, basic data cleaning is performed first, as follows.
If you prefer to use all samples, just skip this part and use rmDup=FALSE as the default. Otherwise, if you want to remove some samples such as duplicates, there are two ways, as follows.
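The automated route mirrors the step-2 code in the template script. A self-contained sketch of the same duplicate-flagging logic, using a made-up sample table rather than the actual *_samples_remove_temp.csv contents, is:

```r
# Flag duplicate newIDs for removal, keeping the latest visit per subject.
d <- data.frame(filename  = c("0001_a.bin", "0001_b.bin", "0002_a.bin"),
                newID     = c("0001", "0001", "0002"),
                Date      = c("2016-01-01", "2016-06-01", "2016-01-15"),
                duplicate = "",
                stringsAsFactors = FALSE)

d <- d[order(d[, "newID"], d[, "Date"]), ]
S <- duplicated(d[, "newID"], fromLast = TRUE)   # TRUE marks all but the last visit
d[S, "duplicate"] <- "remove"
d[, c("newID", "Date", "duplicate")]
# 0001's 2016-01-01 row is flagged "remove"; the later visit is kept.
```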
call.afterggir(mode,filename2id)
Variables | Description |
---|---|
mode | Specify which of the five parts needs to be run, e.g. mode = 0 generates all R/Rmd/sh files for the other parts. When mode = 1, all csv files in the GGIR output directory are read, transformed and then merged. When mode = 2, the GGIR output files are checked and summarized in one Excel sheet. When mode = 3, the merged data is cleaned according to the number of valid hours on each night and the number of valid days for each subject. When mode = 4, the cleaned data is imputed. |
filename2id | This user-defined function changes the filename of the raw accelerometer file to a short ID, for the purpose of identifying duplicate IDs. |
#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
source ~/.bash_profile
cd /postGGIR/inst/example/afterGGIR;
module load R ;
R --no-save --no-restore --args < studyname_ggir9s_postGGIR.pipeline.maincall.R 0
R --no-save --no-restore --args < studyname_ggir9s_postGGIR.pipeline.maincall.R 1
R --no-save --no-restore --args < studyname_ggir9s_postGGIR.pipeline.maincall.R 2
R --no-save --no-restore --args < studyname_ggir9s_postGGIR.pipeline.maincall.R 3
R --no-save --no-restore --args < studyname_ggir9s_postGGIR.pipeline.maincall.R 4
R -e "rmarkdown::render('part5_studyname_postGGIR.report.Rmd' )"
R -e "rmarkdown::render('part6_studyname_postGGIR.nonwear.report.Rmd' )"
R -e "rmarkdown::render('part7a_studyname_postGGIR_JIVE_1_somefeatures.Rmd' )"
R -e "rmarkdown::render('part7b_studyname_postGGIR_JIVE_2_allfeatures.Rmd' )"