The package forceR was originally written for insect bite force data preparation and analysis, but it can be used for any kind of time series measurement. Functions include file loading, plotting, cropping, amplifier and baseline drift corrections, frequency reduction, peak finding, and peak shape analysis.
This vignette guides you through all functions of the package. A limited data set of 8 bite force measurement files is used to get the ideas across while allowing for fast calculations.
At the end of this vignette you will find a self-sufficient workflow example that runs all forceR commands after file loading and drift corrections. Instead of loading files, bite series are simulated and then analyzed.
Please cite the following publication when using the forceR package:
Rühr, PT & Blanke, A (2022): forceX and forceR: a mobile setup and R package to measure and analyze a wide range of animal closing forces. Methods in Ecology and Evolution XXX: pp. XX-XX. doi: 10.1111/2041-210X.13909
You can install the official release of forceR from CRAN:
install.packages('forceR')
The development version of forceR is available at GitHub:
require(devtools)
devtools::install_github("https://github.com/Peter-T-Ruehr/forceR")
Load forceR and the other packages used in this vignette:
library(magrittr)
library(dplyr)
library(stringr)
library(purrr)
library(ggplot2)
library(readr)
library(forceR)
From here on, we will use data stored in a local folder. Please download the example_data.zip file from Zenodo onto your hard drive, unzip it, and store the folder location in the variable data.folder. This folder then contains every file that is needed and has been produced during the creation of this vignette, so you can always restore the files from Zenodo in case you have accidentally overwritten anything you wanted to keep.
If you decide to use your own files, consider that forceR generally expects file names to start with leading digits specifying the measurement number (e.g. 0001_G_maculatus.csv). The number (in this case 0001) is used to keep data files, log files, and PDF files of the same measurement associated with each other. If this is not already the case, we strongly recommend renaming the raw data files accordingly before continuing with this forceR vignette.
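A minimal renaming sketch could look like this (the "./raw_data" folder and the four-digit padding are hypothetical; adjust both to your data):
# prepend zero-padded measurement numbers to raw file names (sketch)
raw.files <- list.files("./raw_data", pattern = "csv$", full.names = TRUE)
new.names <- file.path(dirname(raw.files),
                       paste0(sprintf("%04d", seq_along(raw.files)), "_",
                              basename(raw.files)))
file.rename(raw.files, new.names)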
First, we set the folder containing the example data as the data.folder (make sure this fits your local path, e.g. "./example_data"):
data.folder <- "./example_data"
In case you want to use your own measurements, and if they were
recorded by the LJStream software (LabJack Corporation, Lakewood,
Colorado, US; tested for v1.19) as suggested in the forceX
instructions, you can use the convert_measurement()
function of forceR
to convert your files and save them in
the data.folder
. See ?convert_measurement()
for details.
The measurement file should have the time steps (preferably in msec, since all code of this package expects this) in the first column and the measurements in the second column. Column names do not matter, and everything but the first two columns will be ignored.
file <- file.path(data.folder, "0982.csv")
plot_measurement(file,
                 columns = c(1:2))
The function crop_measurement() can be used to crop a measurement. Since we want to work with the cropped file later, we define path.data when calling the function to store the result as a new file. Before running the function, create the folder on your hard drive, e.g. at "./example_data/cropped".
cropped.folder <- "./example_data/cropped"
If a measurement contains two distinct regions of bites with a lot of unnecessary data and/or measurement artefacts in-between (such as user-made peaks), I recommend manually copying the RAW data files, giving the copy a new measurement number (as if it were actually a separate measurement), and cropping the distinct parts containing actual bites separately from the two copies of the file. For more distinct regions, create more copies.
I recommend not cropping the files too much in case baseline corrections are needed later, because otherwise the baseline_corr() function will not be able to figure out where the actual baseline might be. Leaving several seconds before and after the first and last bite of a series will prevent such problems.
file.cropped <- crop_measurement(file,
                                 path.data = cropped.folder)
Select two points at the beginning and end of the part of the
measurement we want to keep. If more than two points are selected, only
the last two points will be considered. This allows for corrections of
erroneous clicks. The position of the points on the y-axis is
irrelevant. Make sure that the start and end points are placed on the
graph between peaks, i.e. on the baseline. The y-values of the graph at
these positions will be used as reference points in the following
amp_drift_corr()
function.
The resulting file will be saved with the suffix
"_cropped"
. We can plot it again:
<- "./example_data/cropped"
cropped.folder
<- file.path(cropped.folder, "0982_cropped.csv")
file plot_measurement(file)
To remove the systemic, asymptotic drift characteristic of charge amplifiers with resistor–capacitor (RC) circuits, we can use the amp_drift_corr() function. This function requires a few parameters; ?amp_drift_corr brings up the full explanation of all values.
Basically, we only have to decide how we want to plot the data, define the correct folder with our measurement data, provide the time constant (tau, \(\tau\)) of our charge amplifier, and define a folder where we want to store the results. In contrast to the previous functions, this function takes a folder as input, because when the same charge amplifier was used for all measurements, the same settings can be applied to all measurement files.
# create file list with all cropped files that need amplifier drift corrections.
# Here we use the cropped.folder with the cropped measurements.
file.list <- list.files(cropped.folder,
                        pattern = "csv$",
                        full.names = TRUE)

# define folder where to save the amplifier drift corrected files.
# If this folder does not yet exist, it will be created.
ampdriftcorr.folder <- "./example_data/cropped/ampdriftcorr"

for(filename in file.list){
  print(filename)
  amp_drift_corr(filename = filename,
                 tau = 9400,
                 res.reduction = 10,
                 plot.to.screen = FALSE,
                 write.data = TRUE,
                 write.PDFs = TRUE,
                 write.logs = TRUE,
                 output.folder = ampdriftcorr.folder,
                 show.progress = FALSE)
  print("***********")
}
Because we set the arguments write.data, write.PDFs and write.logs to TRUE, this creates the output.folder ("./example_data/cropped/ampdriftcorr") and two new folders inside the output.folder:
"logs/", where log files containing info on the amp_drift_corr() settings used are stored
"pdfs/", where the PDFs of the time series before and after the corrections are stored.
The resulting graph of one corrected measurement may look like this:
In gray, we see the raw measurements, and in green the corrected data. A linear drift in the corrected data may occur in case the first value of the measurement is not exactly 0. This drift is also corrected for by the function. Since write.data was set to TRUE, the final result for each file is also saved as a new CSV file in the ampdriftcorr folder.
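For background: an RC high-pass with time constant \(\tau\) attenuates a signal according to V_out = V_in - (1/\(\tau\)) * the integral of V_out over time, so the original signal can be recovered by adding back the integrated signal. A conceptual sketch of such a correction (for illustration only; see amp_drift_corr() for forceR's actual implementation):
# conceptual RC drift correction: add back the integrated signal scaled by 1/tau
rc_drift_corr <- function(y, t, tau = 9400){
  dt <- c(0, diff(t)) # time step sizes (msec; tau must be in the same unit)
  y + cumsum(y * dt) / tau # reconstructed signal
}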
If the baseline (zero-line) of a measurement is unstable (e.g. due to temperature fluctuations, wind, …), it needs to be continually adjusted throughout the measurement. Ideally, this is done automatically by the function baseline_corr(). This automatic adjustment of the baseline uses a sliding window approach, during which the ‘minimum’ within each window is stored. A ‘minimum’ is defined by the parameter quantile.size. If quantile.size is set to 0.05, the value below which 5% of the measurement data within the current window lies is treated as the current window’s ‘minimum’. Not taking the actual minimum within each window prevents short undershoots (artifacts created during the measurement) from being treated as minima. In a second iteration, another sliding window calculates the average of the ‘minima’ within each window. The resulting ‘minimal’ values are subtracted from the original time series.
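Conceptually, the automatic mode does something like the following (a minimal sketch of the idea, not forceR's actual code; window size and edge handling are simplified):
# sketch: robust 'minima' via a sliding quantile, smoothed, then subtracted
sliding_quantile_baseline <- function(y, window.size = 1000, quantile.size = 0.05){
  n <- length(y)
  mins <- numeric(n)
  for(i in seq_len(n)){
    win <- y[max(1, i - window.size/2):min(n, i + window.size/2)]
    mins[i] <- quantile(win, probs = quantile.size) # robust 'minimum'
  }
  # second pass: smooth the 'minima' with a sliding mean
  baseline <- stats::filter(mins, rep(1/window.size, window.size), sides = 2)
  baseline[is.na(baseline)] <- mins[is.na(baseline)] # pad edges
  y - as.numeric(baseline) # corrected time series
}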
This approach works well for time series with relatively short peaks. For longer peaks, window.size.mins should be increased. However, the greater window.size.mins gets, the smaller the baseline-correcting effect of the function becomes.
For the file "./example_data/cropped/ampdriftcorr/1068_ampdriftcorr.csv", we choose a window.size.mins of 2000:
filename <- file.path(ampdriftcorr.folder, "1068_ampdriftcorr.csv")
plot_measurement(filename)
<- "./cropped/ampdriftcorr/baselinecorr"
baselinecorr.folder
<- baseline_corr(filename = filename,
file.baselinecorr corr.type = "auto",
window.size.mins = 2000,
window.size.means = NULL,
quantile.size = 0.05,
y.scale = 0.5,
res.reduction = 10,
Hz = 100,
plot.to.screen = TRUE,
write.data = TRUE,
write.PDFs = TRUE,
write.logs = TRUE,
output.folder = baselinecorr.folder,
show.progress = FALSE)
Again, new folders, PDFs and log files are created. The resulting plot shows the original data in gray, the ‘minima’ of the sliding windows in blue, the sliding mean of the ‘minima’ in bright green and the corrected time series in dark green:
If the automatic approach does not yield acceptable results, an
interactive manual approach to correct the baseline can be performed
instead. We do the manual correction for the file
"./example_data/cropped/ampdriftcorr/1174_ampdriftcorr.csv"
,
which contains measurements with long, plateau-like peaks where the
automatic baseline correction would fail.
filename <- file.path(ampdriftcorr.folder, "1174_ampdriftcorr.csv")
plot_measurement(filename)
file.baselinecorr <- baseline_corr(filename = filename,
                                   corr.type = "manual",
                                   plot.to.screen = TRUE,
                                   write.data = TRUE,
                                   write.PDFs = TRUE,
                                   write.logs = TRUE,
                                   output.folder = baselinecorr.folder,
                                   show.progress = FALSE)
The green line represents a spline along the manually defined points and is subtracted from the original data (gray). The corrected data is plotted in dark green again.
We have used baseline_corr() on the files 1041_ampdriftcorr.csv, 1068_ampdriftcorr.csv, 1174_ampdriftcorr.csv, and 1440_ampdriftcorr.csv.
After all original files have been processed, we need to separate the resulting files that we want to keep from files that have been created along the way. The function sort_files() looks through a vector of folders (data.folders) to identify files with the same measurement name (i.e., the same character string in the file names before the first underscore) and keeps the version of the file that is found in the folder closest to the end of the list of data.folders. Thus, it is important that the folders are parsed in the correct order. In our example, this order would look like this:
"./example_data/"
"./example_data/cropped/"
"./example_data/cropped/ampdriftcorr/"
"./example_data/cropped/ampdriftcorr/baselinecorr"
If a file is present in "./example_data/cropped/ampdriftcorr/baselinecorr", this version will be copied (move = FALSE) or moved (move = TRUE) to a folder specified in path.corrected, e.g., "./example_data/corrected/". Versions of the same measurement in the other folders will be ignored. This is iteratively repeated for the rest of the folders in reverse order of the data.folders vector.
data.folders <- c(data.folder,
                  file.path(data.folder, "/cropped"),
                  file.path(data.folder, "/cropped/ampdriftcorr"),
                  file.path(data.folder, "/cropped/ampdriftcorr/baselinecorr"))

results.folder <- file.path(data.folder, "/corrected/")

sort_files(data.folders = data.folders,
           results.folder = results.folder,
           move = FALSE)
Now we will load the first file in the results folder using load_single(). The function removes all columns except the ones defined in columns = c(1:2) and renames them to t and y. Then, it adds a filename column based on the file name of the data.
file.list <- list.files(results.folder, pattern = "csv", full.names = TRUE)
df.1 <- load_single(file = file.list[1],
                    columns = c(1:2))
df.1
#> # A tibble: 129,001 x 3
#> t y filename
#> <dbl> <dbl> <chr>
#> 1 0 0.00447 0979_ampdriftcorr
#> 2 1 0.00954 0979_ampdriftcorr
#> 3 2 0.0146 0979_ampdriftcorr
#> 4 3 0.0146 0979_ampdriftcorr
#> 5 4 0.00951 0979_ampdriftcorr
#> 6 5 0.00951 0979_ampdriftcorr
#> 7 6 0.00950 0979_ampdriftcorr
#> 8 7 0.00949 0979_ampdriftcorr
#> 9 8 0.00948 0979_ampdriftcorr
#> 10 9 0.00948 0979_ampdriftcorr
#> # ... with 128,991 more rows
unique(df.1$filename)
#> [1] "0979_ampdriftcorr"
Alternatively, we can rapidly load all files in the results folder by using load_mult():
df.all <- load_mult(folder = results.folder,
                    columns = c(1:2),
                    show.progress = TRUE)
df.all
#> # A tibble: 615,724 x 3
#> t y filename
#> <dbl> <dbl> <chr>
#> 1 0 0.00447 0979_ampdriftcorr
#> 2 1 0.00954 0979_ampdriftcorr
#> 3 2 0.0146 0979_ampdriftcorr
#> 4 3 0.0146 0979_ampdriftcorr
#> 5 4 0.00951 0979_ampdriftcorr
#> 6 5 0.00951 0979_ampdriftcorr
#> 7 6 0.00950 0979_ampdriftcorr
#> 8 7 0.00949 0979_ampdriftcorr
#> 9 8 0.00948 0979_ampdriftcorr
#> 10 9 0.00948 0979_ampdriftcorr
#> # ... with 615,714 more rows
unique(df.all$filename)
#> [1] "0979_ampdriftcorr"
#> [2] "0980_ampdriftcorr"
#> [3] "0981_ampdriftcorr"
#> [4] "0982_ampdriftcorr"
#> [5] "1041_baselinecorr"
#> [6] "1068_baselinecorr"
#> [7] "1174_baselinecorr"
#> [8] "1440_baselinecorr"
The new tibble contains the measurements of all files in the selected folder.
To allow users to test the package and follow the analysis steps of this vignette without having to load actual files, we have simulated data using the forceR function simulate_bites() and stored them within forceR. We will replace our previous df.all tibble with the internal forceR tibble. You may, however, continue with the version of df.all you just created.
# create/replace df.all
df.all <- forceR::df.all
head(df.all)
#> # A tibble: 6 × 3
#> t y measurement
#> <int> <dbl> <chr>
#> 1 1 0 m_01
#> 2 2 0 m_01
#> 3 3 0 m_01
#> 4 4 0 m_01
#> 5 5 0 m_01
#> 6 6 0 m_01
Now let us plot the simulated measurements to get a feeling for them:
# plot simulated measurements
ggplot(df.all,
aes(x = t ,
y = y,
colour=measurement)) +
geom_line()
We can see that some bites have more (reddish colors) or less (blueish colors) plateau-like peaks, whereas others have more sinusoidal curves with early (orange) or late (green) peaks.
df.all has a sampling frequency of 1000 Hz. This is more than is often needed to analyse a time series. To reduce the sampling frequency of all measurements that are above a desired frequency of e.g. 200 Hz, we can use the function reduce_frq(). However, please note that the simulated bites are only 50 ms long, which is much shorter than, e.g., actual insect bites. This will keep all calculations quick, but will result in measurement series that are more downsampled than we would recommend. For real insect bites, though, a sampling frequency of 200 Hz is sufficient, so we will proceed as if we were working with such real bites:
# reduce frequency to 200 Hz
df.all.200 <- reduce_frq(df = df.all,
                         Hz = 200,
                         measurement.col = "measurement")
head(df.all.200)
#> # A tibble: 6 × 3
#> # Groups: t [6]
#> t y measurement
#> <dbl> <dbl> <chr>
#> 1 0 0 m_01
#> 2 5 0 m_01
#> 3 10 0.0439 m_01
#> 4 15 0.299 m_01
#> 5 20 0.644 m_01
#> 6 25 0.821 m_01
If measurement.col
is not defined, the whole input data
frame will be treated as if it was just one single time series. This is
okay for data frames like df.1
that only contain one
measurement, but for data frames with multiple time series measurements
(like df.all
), we need to define a grouping column - in our
case the column measurement
.
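Conceptually, the frequency reduction averages all y-values that fall into each new time bin of one measurement (a sketch of the idea only, not reduce_frq()'s actual code):
# sketch: reduce a 1000 Hz series (t in msec) to 200 Hz by averaging 5-msec bins
reduce_frq_sketch <- function(t, y, Hz = 200){
  bin <- floor(t / (1000 / Hz)) * (1000 / Hz) # new time stamp per bin
  out <- aggregate(y, by = list(t = bin), FUN = mean)
  names(out) <- c("t", "y")
  out
}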
For further analyses, we need a classifier
in which data
on the species, specimens and measurements of our dataset are
stored:
# create a classifier
number_of_species <- 4
number_of_specimens_per_species <- 3
number_of_measurements_per_specimen <- 2
number_of_rows <- number_of_species *
  number_of_specimens_per_species *
  number_of_measurements_per_specimen

species <- sort(rep(paste0("species_", LETTERS[1:number_of_species]),
                    length = number_of_rows))
specimens <- sort(rep(paste0("speciemen_", letters[1:(number_of_species*number_of_specimens_per_species)]),
                      length = number_of_rows))

classifier <- tibble(species = species,
                     specimen = specimens,
                     measurement = paste0("m_", str_pad(string = 1:number_of_rows, width = 2, pad = "0")),
                     amp = c(rep(0.5, number_of_rows/2), rep(2, number_of_rows/2)),
                     lever.ratio = rep(0.5, number_of_rows))
head(classifier)
#> # A tibble: 6 × 5
#> species specimen measurement amp lever.ratio
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 species_A speciemen_a m_01 0.5 0.5
#> 2 species_A speciemen_a m_02 0.5 0.5
#> 3 species_A speciemen_b m_03 0.5 0.5
#> 4 species_A speciemen_b m_04 0.5 0.5
#> 5 species_A speciemen_c m_05 0.5 0.5
#> 6 species_A speciemen_c m_06 0.5 0.5
To convert a voltage measurement to a force measurement [Newton], we need an amplification value and, depending on the measurement setup, the lever ratio of the lever forwarding the force from the force plate to the sensor.
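In its simplest form this is plain arithmetic; a hand-rolled sketch might look like the following (whether amp divides or multiplies depends on whether your amplification is given in V/N or N/V, so check ?y_to_force for the convention used by the package):
# sketch: convert voltage to force, assuming amp is given in V/N
v_to_force <- function(y, amp, lever.ratio = 1){
  (y / amp) * lever.ratio
}
v_to_force(y = 0.5, amp = 0.5, lever.ratio = 0.5) # 0.5 V -> 0.5 N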
Now we can convert the y-values of df.all.200
to force
data and add the specimen
and species
info
from the classifier
using the y_to_force()
function:
df.all.200.tax <- y_to_force(df = df.all.200,
                             classifier = classifier,
                             measurement.col = "measurement")
head(df.all.200.tax)
#> # A tibble: 6 × 4
#> # Groups: t [6]
#> measurement specimen t force
#> <chr> <chr> <dbl> <dbl>
#> 1 m_01 speciemen_a 0 0
#> 2 m_01 speciemen_a 5 0
#> 3 m_01 speciemen_a 10 0.0439
#> 4 m_01 speciemen_a 15 0.299
#> 5 m_01 speciemen_a 20 0.644
#> 6 m_01 speciemen_a 25 0.821
Finally, we can start with the actual analysis of the data. First, we want to calculate the minimum, maximum and standard deviation of force for each measurement and, in case there were several measurements per specimen, also for each specimen.
From the previous steps, df.all.200.tax already contains the columns measurement and specimen; the species column is re-added from the classifier below. Now, we can use the function
summarize_measurements()
to create a summary tibble by
adding the column names for which we want a summary in var1
and var2
. var1
must name the column that
stores the unique measurement IDs, e.g. measurement number, to calculate
minimal and maximal force values per measurement. As var2
we will use ‘specimen’ to calculate the summary data per specimen from
the min. and max. values of the measurements of that specimen. Thus, the
resulting tibble will contain summaries of all measurements that were
performed with the same specimen.
# add species to df.all.200.tax
df.all.200.tax <- df.all.200.tax %>%
  left_join(classifier %>%
              select(species, measurement))

var1 = "measurement"
var2 = "specimen"

df.summary.specimen <- summarize_measurements(df.all.200.tax,
                                              var1,
                                              var2)
head(df.summary.specimen)
#> # A tibble: 6 × 8
#> measurement specimen species max.F.meas…¹ mean.…² max.F…³ sdv.m…⁴ n.mea…⁵
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 m_01 speciemen_a species_A 1.24 1.24 1.24 0.00116 2
#> 2 m_02 speciemen_a species_A 1.24 1.24 1.24 0.00116 2
#> 3 m_03 speciemen_b species_A 1.13 1.13 1.13 0.00658 2
#> 4 m_04 speciemen_b species_A 1.12 1.13 1.13 0.00658 2
#> 5 m_05 speciemen_c species_A 1.09 1.18 1.26 0.117 2
#> 6 m_06 speciemen_c species_A 1.26 1.18 1.26 0.117 2
#> # … with abbreviated variable names ¹max.F.measurement, ²mean.F.specimen,
#> # ³max.F.specimen, ⁴sdv.max.F.specimen, ⁵n.measurements.in.specimen
# boxplot of maximum force in specimens
ggplot(data = df.summary.specimen, mapping = aes(x=specimen,y=max.F.measurement)) +
geom_jitter(aes(color='blue'),alpha=0.7, width = 0.2, height = 0.0) +
geom_boxplot(fill="bisque",color="black",alpha=0.3) +
# scale_y_log10() +
labs(y="max(F)/specimen") +
guides(color="none") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust=1))
To get a summary of species-wise info, we have to calculate with the summarized specimen info. We are not using the summarize_measurements() function here because this would ignore the fact that some measurements of one species may come from the same specimen; we only want to consider one maximum force value per specimen, not one per measurement.
df.summary.species <- df.summary.specimen %>%
  # find max Fs of species
  group_by(species) %>%
  # calculate force values for each species
  mutate(max.F.species = max(max.F.specimen),
         mean.F.species = round(mean(max.F.specimen), 6),
         sdv.max.F.species = sd(max.F.specimen)) %>%
  ungroup() %>%
  # count specimens / species
  group_by(species) %>%
  mutate(n.specimens.in.species = length(unique(specimen))) %>%
  ungroup()

df.summary.species
#> # A tibble: 24 × 12
#> measurement specimen species max.F…¹ mean.…² max.F…³ sdv.m…⁴ n.mea…⁵ max.F…⁶
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
#> 1 m_01 specieme… specie… 1.24 1.24 1.24 0.00116 2 1.26
#> 2 m_02 specieme… specie… 1.24 1.24 1.24 0.00116 2 1.26
#> 3 m_03 specieme… specie… 1.13 1.13 1.13 0.00658 2 1.26
#> 4 m_04 specieme… specie… 1.12 1.13 1.13 0.00658 2 1.26
#> 5 m_05 specieme… specie… 1.09 1.18 1.26 0.117 2 1.26
#> 6 m_06 specieme… specie… 1.26 1.18 1.26 0.117 2 1.26
#> 7 m_07 specieme… specie… 2.19 2.10 2.19 0.125 2 2.44
#> 8 m_08 specieme… specie… 2.01 2.10 2.19 0.125 2 2.44
#> 9 m_09 specieme… specie… 2.27 2.28 2.29 0.0125 2 2.44
#> 10 m_10 specieme… specie… 2.29 2.28 2.29 0.0125 2 2.44
#> # … with 14 more rows, 3 more variables: mean.F.species <dbl>,
#> # sdv.max.F.species <dbl>, n.specimens.in.species <int>, and abbreviated
#> # variable names ¹max.F.measurement, ²mean.F.specimen, ³max.F.specimen,
#> # ⁴sdv.max.F.specimen, ⁵n.measurements.in.specimen, ⁶max.F.species
# boxplot of maximum force in species
ggplot(data = df.summary.species, mapping = aes(x=species,y=max.F.specimen)) +
geom_jitter(aes(color='blue'),alpha=0.7, width = 0.2, height = 0.0) +
geom_boxplot(fill="bisque",color="black",alpha=0.3) +
# scale_y_log10() +
labs(x='species', y="max(F)/specimen") +
guides(color="none") +
theme_minimal()
Now we come to one of the core functions of forceR
:
find_strongest_peaks()
. This function identifies the
strongest peaks of each measurement. But first, we need to create some
folders to store the data and plots that are produced in all subsequent
analyses of this example:
# create folders to save df and results
path.plots <- paste0(data.folder, "/plots/")
ifelse(!dir.exists(path.plots), dir.create(path.plots), "./plots already exists")

path.plots.initial_peak_finding <- paste0(data.folder, "/plots/initial_peak_finding/")
ifelse(!dir.exists(path.plots.initial_peak_finding), dir.create(path.plots.initial_peak_finding), "./plots/initial_peak_finding already exists")

path.data <- paste0(data.folder, "/data/")
ifelse(!dir.exists(path.data), dir.create(path.data), "./data already exists")
Now we let find_strongest_peaks()
identify all actual
peak curves in all measurements:
peaks.df <- find_strongest_peaks(df = df.all.200.tax,
                                 no.of.peaks = 5,
                                 path.data = path.data,
                                 path.plots = path.plots,
                                 show.progress = TRUE)
head(peaks.df)
#> # A tibble: 4 × 4
#> species measurements starts ends
#> <chr> <chr> <chr> <chr>
#> 1 species_A m_06; m_01; m_02; m_01; m_06 265; 265; 265; 335; 70 325; 325; 325; …
#> 2 species_B m_11; m_11; m_12; m_10; m_09 140; 5; 205; 5; 340 200; 65; 265; 6…
#> 3 species_C m_17; m_16; m_18; m_16; m_18 5; 70; 70; 200; 5 65; 130; 130; 2…
#> 4 species_D m_19; m_22; m_24; m_21; m_20 135; 0; 265; 330; 330 200; 65; 330; 3…
The initial peak finding in the first iteration of find_strongest_peaks() identifies peaks via the initial.threshold, which is by default set to 0.05 (= 5%) of the maximum force of the respective measurement. The threshold is indicated as a green line in the plots. Bite starts are indicated as blue points, peak ends as orange points:
In a second iteration, find_strongest_peaks() optimizes the peak starts and ends found in the first iteration. Starting from each start, a sliding window of size slope.length.start (in time steps) moves backwards in time and calculates the slope within the current window. The parameter slope.thresh.start defines below which slope the process is stopped. The time point where the threshold is reached is treated as the actual start of that peak. The same is done in forward direction for the peak ends. Here, slope.length.end (in time steps) defines the sliding window size, and slope.thresh.end defines the slope threshold. All these settings have default values that have proved to be good choices for insect bites. The peak optimization is not done for all peaks, but only for the strongest peaks per species (not per measurement!), specified by no.of.peaks. If fewer than no.of.peaks peaks are found for a species, the results table will contain fewer peaks for this species. No warning is issued.
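The backward search for a peak start can be pictured like this (a conceptual sketch with illustrative threshold values, not forceR's implementation):
# sketch: move backwards from a coarse start until the local slope flattens out
refine_peak_start <- function(y, start, slope.length.start = 5, slope.thresh.start = 0.02){
  i <- start
  while(i > slope.length.start){
    window <- y[(i - slope.length.start):i]
    slope <- coef(lm(window ~ seq_along(window)))[2]
    if(slope < slope.thresh.start) break # slope flat enough: treat i as actual start
    i <- i - 1
  }
  i
}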
The results are not automatically plotted. This can be done with the
function plot_peaks()
:
plot_peaks(df.peaks = peaks.df,
df.data = df.all.200.tax,
additional.msecs = 2000,
path.plots = path.plots)
The first page of the resulting PDF looks like this, showing the five strongest peaks of species_A, which was measured in measurements m_01 - m_06, and the strongest peak of species_B, which occurred in measurement m_11. Plots of further peaks of species_B and the peaks of the other species follow on the next pages.
We strongly recommend saving df.peaks as a csv file on your computer and keeping a copy of it safe. The following procedures overwrite values in this table, so the peak-finding step would need to be repeated to restore the original values.
For some measurements, the automatic identification of peak starts and ends may fail. Changing the parameters of find_strongest_peaks() might help, but in some cases, the starts and ends have to be defined manually. For these cases, the function correct_peak() has been written. Set path.data to the folder where you stored peaks.df before. This will overwrite the info of the bite you want to correct (here: bite 1 of measurement m_01)!
peaks.df <- correct_peak(df.peaks = peaks.df,
                         df.data = df.all.200.tax,
                         measurement = "m_01",
                         peak = 1,
                         additional.msecs = 100,
                         path.data = path.data)
[1] "Select new peak start and end. If more than two points are selected, the operation is automatically terminated."
This function prompts the user to define a new start and end of the peak. Define the measurement ID (as a character string) in which the peak to correct is located, and the peak number (as a numerical value). Both can be found in the respective title of the peak curve plot created by plot_peaks(). If the desired start and/or end lies outside of the plot window, increase additional.msecs.
The function overwrites the start and end values of the selected peak in the selected measurement and plots the corrected peak curve. Again, note that this example is taken from real measurements and not the simulated data we use in the other analysis steps.
Note that the peak in the figures above looks different from the one in your plots, but the principles are the same.
In the next step, we will use rescale_peaks()
to rescale
the time and force data so that they both range from 0 to 1. This
function takes the data frame that contains the individual peak starts
and ends, and the measurement data. Both data frames are linked by the
‘measurement’ column, so this column needs to be present in both data
frames.
peaks.df.norm <- rescale_peaks(df.peaks = peaks.df,
                               df.data = df.all.200.tax,
                               plot.to.screen = TRUE,
                               path.data = path.data,
                               show.progress = TRUE)
head(peaks.df.norm)
#> # A tibble: 6 × 6
#> measurement peak t.norm force.norm specimen species
#> <chr> <int> <dbl> <dbl> <chr> <chr>
#> 1 m_06 1 0 0 speciemen_c species_A
#> 2 m_06 1 0.0833 0 speciemen_c species_A
#> 3 m_06 1 0.167 0.0764 speciemen_c species_A
#> 4 m_06 1 0.25 0.467 speciemen_c species_A
#> 5 m_06 1 0.333 0.836 speciemen_c species_A
#> 6 m_06 1 0.417 1 speciemen_c species_A
Now we will plot the resulting data. You will notice that the peaks are not very smooth. This is because the simulated peaks of df.all were so short that the reduction to 200 Hz in one of the previous steps resulted in very few time steps per bite. It is, thus, important to choose the sampling frequency wisely. For our example, however, these few time steps are sufficient.
# plot all normalized peaks
ggplot(peaks.df.norm %>%
mutate(color.column = paste0(measurement, "__bite_", peak)),
aes(x = t.norm ,
y = force.norm,
colour=color.column)) +
geom_line()
peaks.df.norm now contains the same data as df.all.200.tax, only with the additional columns t.norm and force.norm, which contain the rescaled time and force data, both ranging from 0 to 1.
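The rescaling itself is a standard min-max normalization applied per peak; a minimal sketch:
# sketch: min-max normalization to the range [0, 1]
rescale01 <- function(v){
  (v - min(v)) / (max(v) - min(v))
}
rescale01(c(10, 15, 20)) # returns 0.0 0.5 1.0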
To reduce the data to 100 observations (time steps), we will use the
function red_peaks_100()
. Note that, because the peaks of
the example dataset are so short, we now actually increase the
number of time steps. In real data, however,
red_peaks_100()
usually boils down the number of
observations.
peaks.df.norm.100 <- red_peaks_100(df = peaks.df.norm,
                                   plot.to.screen = TRUE,
                                   path.data = path.data,
                                   path.plots = path.plots,
                                   show.progress = TRUE)
head(peaks.df.norm.100)
#> # A tibble: 6 × 6
#> species measurement specimen peak index force.norm.100
#> <chr> <chr> <chr> <int> <dbl> <dbl>
#> 1 species_A m_06 speciemen_c 1 1 0
#> 2 species_A m_06 speciemen_c 1 2 0.00568
#> 3 species_A m_06 speciemen_c 1 3 0.00891
#> 4 species_A m_06 speciemen_c 1 4 0.0101
#> 5 species_A m_06 speciemen_c 1 5 0.00972
#> 6 species_A m_06 speciemen_c 1 6 0.00814
Again, we will plot the result:
# plot normalized peaks: 5 bites per measurement
ggplot(peaks.df.norm.100 %>%
mutate(color.column = paste0(measurement, "-bite", peak)),
aes(x = index ,
y = force.norm.100,
colour=color.column)) +
geom_line()
The resulting tibble consists of 100 rows per peak, and within each
peak, the index
column runs from 1 to 100, and the
force.norm.100
column consists of the rescaled peak curves,
each of whose values range from 0 to 1. Since we included 4 species in
our dataset, the resulting tibble is 2,000 rows long
(species * peaks/species * time steps per peak = 4 * 5 * 100 = 2,000
).
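Conceptually, forcing every peak onto 100 time steps can be done with linear interpolation (a toy sketch of the idea; red_peaks_100()'s exact method may differ):
# sketch: interpolate a toy normalized peak onto 100 evenly spaced time steps
t.norm <- seq(0, 1, length.out = 13) # toy peak with 13 time steps
force.norm <- sin(t.norm * pi) # toy force values
interp <- approx(x = t.norm, y = force.norm, n = 100)
force.norm.100 <- (interp$y - min(interp$y)) / (max(interp$y) - min(interp$y))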
Next, we will calculate the average peak curve per species and normalize the curves again:
peaks.df.100.avg <- avg_peaks(df = peaks.df.norm.100,
                              path.data = path.data)
head(peaks.df.100.avg)
#> # A tibble: 6 × 3
#> # Groups: species [1]
#> species index force.norm.100.avg
#> <chr> <dbl> <dbl>
#> 1 species_A 1 0.00531
#> 2 species_A 2 0.00267
#> 3 species_A 3 0.000855
#> 4 species_A 4 0
#> 5 species_A 5 0.000247
#> 6 species_A 6 0.00173
# plot averaged normalized curves per species
ggplot(peaks.df.100.avg, aes(x = index ,
y = force.norm.100.avg,
colour=species)) +
geom_line()
To find out the minimum number of coefficients that describe all curves well, we use the function find_best_fits(). It fits polynomial functions with 1 to 20 coefficients and uses the Akaike Information Criterion (AIC) to evaluate the goodness of the fits. A model is considered a good fit when the percentage of change from one model to the next (e.g. from a model with 6 coefficients to a model with 7 coefficients) is <= 5%. The first four models meeting this criterion are plotted as colored graphs, and the AICs of these models are visualized in a second plot for each curve. For each curve, the numbers of coefficients of the first four models that fulfill the criterion are stored, and in the end a histogram of how often each number of coefficients provided a good fit is plotted as well. The function returns the numerical value of the number of coefficients that fulfilled the criterion of a good fit in most curves. If write.PDFs == TRUE, the plots are saved as PDFs in path.plots. Resulting data is saved in path.data.
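The underlying idea can be reproduced in a few lines (a conceptual sketch, not find_best_fits() itself; note that a polynomial of degree d has d + 1 coefficients including the intercept):
# sketch: fit polynomials of increasing degree and track the AIC change
x <- peaks.df.100.avg$index[peaks.df.100.avg$species == "species_A"]
y <- peaks.df.100.avg$force.norm.100.avg[peaks.df.100.avg$species == "species_A"]
aics <- sapply(1:20, function(d) AIC(lm(y ~ poly(x, d))))
perc.change <- abs(diff(aics)) / abs(aics[-length(aics)]) * 100
which(perc.change <= 5)[1] # first model followed by an AIC change <= 5%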
best.fit.poly <- find_best_fits(df = peaks.df.100.avg,
                                plot.to.screen = TRUE,
                                path.data = path.data,
                                path.plots = path.plots)
best.fit.poly
#> [1] 10
Both 4 and 8 coefficients were most often among the first four models that were followed by an AIC change below 5%:
The first image shows the polynomial fits to the average force curves of species_A and species_B. The second image shows a frequency distribution of how often a polynomial function with n coefficients was followed by a function with n+1 coefficients that brought an AIC change below 5%.
Given that we only analyzed four species in this example, the histogram looks a bit sparse (n = 4 species * 4 coefficients = 16). If more species are analyzed, it may rather look like this:
In the last step of this vignette, we convert the average peak curve shapes into polynomial models that all have the same number of coefficients (defined by the previously identified best.fit.poly):
models <- peak_to_poly(df = peaks.df.100.avg,
                       coeff = best.fit.poly,
                       path.data = path.data,
                       show.progress = TRUE)
# create tibble with model data
models.df <- NULL
for(i in 1:length(models)){
  model.df <- tibble(species = rep(names(models)[i], 100),
                     index = 1:100,
                     y = predict(models[[i]]))
  models.df <- rbind(models.df, model.df)
}
# plot all polynomial models
ggplot(models.df,
aes(x = index ,
y = y,
colour=species)) +
geom_line()
Congratulations - you have made it to the end of this vignette. We
hope this package helps you with your data. In case of any remarks or
questions, please feel free to contact Peter T. Rühr at
peter.ruehr at gmail.com
.
forceR workflow example
Below we have compiled a self-sufficient R script to run all functions of forceR after file loading and drift corrections. Instead of loading data from in vivo measurements, we simulate bite series with the function simulate_bites() and store them in df.all. The variable names used are the same as in the rest of this vignette. They are also stored internally in the forceR package and can be restored by calling, e.g., df.all <- forceR::df.all.
## ---------------------------
##
## PURPOSE OF SCRIPT:
## Simulate data and test various functions of the forceR package.
##
##
## AUTHOR:
## Peter T. Rühr
##
## DATE CREATED:
## 2022-04-07
##
## Copyright (c) Peter T. Rühr, 2022
## Email: peter.ruehr@gmail.com
##
## ---------------------------
##
## NOTES:
##
##
##
## -------------------------------------------------------------------------
## DEPENDENCIES
# to use viewport()
require(grid)
# forceR
install.packages("C:/Users/pruehr.EVOLUTION/Documents/forceR",
repos = NULL,
type = "source")
library(forceR)
# various plotting functions
require(ggplot2)
# load tidyverse for its various conveniences
require(tidyverse)
## -------------------------------------------------------------------------
## FUNCTIONS
# get date as string for saving files
today <- function(){
  date.string <- gsub("-", "_", substring(as.character(as.POSIXct(Sys.time())), 1, 10))
  return(date.string)
}
# plot linear regression and return relevant values
plot.linear.regression <- function(x, y,
                                   logarithmic = F,
                                   x.axis.label = "x",
                                   title = NULL,
                                   x.lim = NULL, y.lim = NULL){
  if(logarithmic == "10"){x <- log10(x); y <- log10(y)}
  if(logarithmic == "e"){x <- log(x); y <- log(y)}
  lin.mod <- lm(y ~ x)
  lin.mod.sum <- summary(lin.mod)
  lin.mod.r2 <- lin.mod.sum$adj.r.squared
  lin.mod.p <- lin.mod.sum$coefficients[2,4]
  lin.mod.intercept <- lin.mod$coefficients[1]
  lin.mod.slope <- lin.mod$coefficients[2]
  lin.mod.label.r2 <- bquote(italic(R)^2 == .(format(lin.mod.r2, digits = 3)))
  if(lin.mod.p > 0.05) {lin.mod.p.ast <- lin.mod.p}
  if(lin.mod.p <= 0.05 & lin.mod.p > 0.01) {lin.mod.p.ast <- "*"}
  if(lin.mod.p <= 0.01 & lin.mod.p > 0.001) {lin.mod.p.ast <- "**"}
  if(lin.mod.p <= 0.001) {lin.mod.p.ast <- "***"}
  lin.mod.label.p <- bquote(italic(p) == .(lin.mod.p.ast))

  if(is.null(x.lim)){
    x.lim = c(min(x), max(x))
  }
  if(is.null(y.lim)){
    y.lim = c(min(y), max(y))
  }

  if(lin.mod.p >= 0.001) p.print <- paste0("p = ", round(lin.mod.p, 3))
  if(lin.mod.p < 0.001) p.print <- "p < 0.001"
  plot(x, y, pch = 16, xlab = paste0(x.axis.label, ": ", p.print,
                                     "; R2 = ", round(lin.mod.r2, 3),
                                     "; m = ", round(lin.mod.slope, 3),
                                     "; y = ", round(lin.mod.intercept, 3)),
       xlim = x.lim, ylim = y.lim)
  abline(lin.mod, col = "red")
  if(!is.null(title)){
    title(main = title)
  }
  print(x.axis.label)
  print(paste0(p.print))
  print(paste0("R2 = ", round(lin.mod.r2, 3)))
  print(paste0("m = ", round(lin.mod.slope, 3)))
  print(paste0("y = ", round(lin.mod.intercept, 3)))
  res <- tibble(p = lin.mod.p,
                r2 = lin.mod.r2,
                intercept = lin.mod.intercept,
                slope = lin.mod.slope)
  return(res)
}
# add leading zeros to number and return as string
add_eading_zeros <- function(number, length){
  require(stringr)
  str_pad(number, length, pad = "0")
}
## -------------------------------------------------------------------------
## Simulate bite force data
# set seed for randomization so results are reproducible
set.seed(1)
## -------------------------------------------------------------------------
## Create classifier to store data (1 row per measurement)
classifier <- tibble(bite.type = c(rep("sinusoidal", 5), rep("plateau", 3), rep("inter", 4)),
                     peak.position = c("early","early","center","late","late","center","center","center","center","center","early","late"),
                     species = NA,
                     specimen = NA,
                     measurement = NA,
                     body_mass = NA,
                     force_in = NA,
                     length.of.bite = 4000,
                     peak.pos = c(20, 25, 50, 65, 70, rep(NA, 7)),
                     slope.perc.starts = c(rep(NA, 5), 10,15,20,30,40,20,50),
                     slope.perc.ends = c(rep(NA, 5), 10,15,20,30,40,50,20),
                     type = c(rep("sin", 5), rep("plat", 7)),
                     no.of.bites = 7,
                     amp = 1,
                     lever.ratio = 1,
                     length.of.series = 35000)
# amend classifier
classifier <- classifier %>%
  mutate(no = row_number())
classifier
# define maximum forces per bite simulation type and add additional rows to classifier
forces <- c(1, 5, 15)
for(i in 1:nrow(classifier)){
  classifier$force_in[i] <- forces[1]
  classifier <- classifier %>%
    add_row(classifier[i, ])
  classifier$force_in[nrow(classifier)] <- forces[1]
  for(f in 2:3){
    classifier <- classifier %>%
      add_row(classifier[i, ])
    classifier$force_in[nrow(classifier)] <- forces[f]
    classifier <- classifier %>%
      add_row(classifier[i, ])
    classifier$force_in[nrow(classifier)] <- forces[f]
  }
}
# arrange classifier by original bite type and remove the numbering column
classifier <- classifier %>%
  arrange(no) %>%
  select(-no)
# create vector of species names and add to classifier
species.names <- NULL
for(i in 1:(nrow(classifier)/2)){
  species.names <- c(species.names,
                     rep(paste0("S", add_eading_zeros(i, 2)), 2))
}
classifier$species <- species.names

classifier$specimen <- paste0("s", add_eading_zeros(1:nrow(classifier), 3))
classifier$measurement <- paste0("m", add_eading_zeros(1:nrow(classifier), 3))

# assign body mass according to the maximum force
classifier$body_mass[classifier$force_in == forces[1]] <- 1
classifier$body_mass[classifier$force_in == forces[2]] <- 6
classifier$body_mass[classifier$force_in == forces[3]] <- 25
# add jitter to force and body mass to replicate biological variation
for(i in 1:nrow(classifier)){
  classifier$force_in[i] <- round(classifier$force_in[i] +
                                    ((rnorm(1, -0.2, 0.2)) * classifier$force_in[i]), 2)
  classifier$body_mass[i] <- round(classifier$body_mass[i] +
                                     ((rnorm(1, -0.2, 0.2)) * classifier$body_mass[i]), 2)
}
# -------------------------------------------------------------------------
# data processing
# get overview of input data before simulating bite series
BFQ.regression_in <- plot.linear.regression(x = classifier$body_mass,
                                            y = classifier$force_in,
                                            logarithmic = "10",
                                            x.axis.label = "body mass")
# jitter for variation in maximum bite force within a bite series
# this was set to 0 when checking if the package finds the correct max. force values
# and to 15 to increase bite shape diversity when checking if the package can tell the different
# bite shapes apart
max.y.jit <- 0 # 0 15
# jitter to make the bite curve more unstable
# this was set to 0 when checking if the package finds the correct max. force values
# and to 1 to increase bite shape diversity when checking if the package can tell the different
# bite shapes apart
jit <- 0 # 0 1
# create tibble with simulated time series with different
# bite characteristics for each measurement, specimen and species
# path.plots <- "Z:/PAPERS/PTR_Bite force METHODS/R/package_tests/plots/"
# print(paste0("Saving plots at ", path.plots, "/", today(),"_bite_series.pdf..."))
# pdf(paste0(path.plots, "/", today(),"_bite_series.pdf..."), onefile = TRUE, paper = "a4", height = 14)
par(mfrow = c(3,2))
df.all <- NULL
for(i in 1:nrow(classifier)){