#### 2022-06-10

Olink® Analyze is an R package that provides a versatile toolbox for fast and easy handling of Olink® NPX data in proteomics research. It includes functions for importing Olink® NPX datasets exported from NPX Manager, quality control (QC) plot functions, and functions for various statistical tests. The package is meant to provide a convenient pipeline for your Olink® NPX data analysis.

# Installation

You can install Olink® Analyze from CRAN.

install.packages("OlinkAnalyze")

# List of functions

Preprocessing

• olink_plate_randomizer Randomize samples on plate

Statistical analysis

• olink_ttest Function which performs a t-test per protein
• olink_wilcox Function which performs a Mann-Whitney U Test per protein
• olink_anova Function which performs an ANOVA per protein
• olink_anova_posthoc Function which performs an ANOVA post-hoc test per protein
• olink_one_non_parametric Function which performs a Kruskal-Wallis Test or Friedman Test per protein
• olink_one_non_parametric_posthoc Function which performs post-hoc test for one way non-parametric test
• olink_ordinalRegression Function which performs an ordinal regression per protein
• olink_ordinalRegression_posthoc Function which performs an ordinal regression post-hoc test per protein
• olink_lmer Function which performs a linear mixed model per protein
• olink_lmer_posthoc Function which performs a linear mixed model post-hoc per protein
• olink_pathway_enrichment Function which performs GSEA or ORA pathway enrichment using outcome from other statistical tests

Visualization

• olink_boxplot Function which plots boxplots of a selected variable
• olink_dist_plot Function to plot the NPX distribution by panel
• olink_lmer_plot Function which performs a point-range plot per protein on a linear mixed model
• olink_pathway_visualization Function which plots a bar graph for pathways of interest
• olink_pathway_heatmap Function which plots estimates of proteins associated with pathways of interest
• olink_pca_plot Function to plot a PCA of the data
• olink_qc_plot Function to plot an overview of a sample cohort per Panel
• olink_heatmap_plot Function which generate a heatmap over all proteins
• set_plot_theme Function to set plot theme

Sample datasets

• npx_data1 NPX Data in Long format
• npx_data2 NPX Data in Long format, Follow-up
• manifest A sample manifest including Sample ID, Subject ID and clinical variables

# Usage

# Load OlinkAnalyze
library(OlinkAnalyze)

# Load other libraries used in Vignette
library(dplyr)
library(ggplot2)
library(stringr)

# Preprocessing

The read_NPX function imports a wide-format NPX file exported from Olink® NPX Manager and converts it to long format, which is the preferred layout for analysis in R. Wide format is the most common way Olink® delivers data for Olink® Target 96. For the function to work as expected, the NPX Manager output must not be altered beforehand.
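The wide-to-long conversion described above can be pictured with a toy table. This is only a sketch in base R: the assay columns (IL6, TNF) and values are made up, and the real read_NPX additionally parses the multi-row header of the NPX Manager export, which is not reproduced here.

```r
# Hypothetical wide table: one row per sample, one column per assay
wide <- data.frame(
  SampleID = c("S1", "S2", "S3"),
  IL6 = c(5.1, 4.8, 6.0),
  TNF = c(2.3, 2.9, 2.5)
)

# Every column except SampleID is treated as an assay
assay_cols <- setdiff(names(wide), "SampleID")

# Long format: one row per sample and assay, with the value in NPX
long <- data.frame(
  SampleID = rep(wide$SampleID, times = length(assay_cols)),
  Assay = rep(assay_cols, each = nrow(wide)),
  NPX = unlist(wide[assay_cols], use.names = FALSE)
)

print(long)
```

In long format every measurement is its own row, which is what grouping by assay in the statistical functions relies on.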

### Function arguments

• filename: Path to the NPX Manager output file.
data <- read_NPX("~/NPX_file_location.xlsx")

### Function output

A tibble in long format containing:

• SampleID: Sample names or IDs.
• Index: Unique number for each SampleID. It is used to make up for non-unique sample IDs.
• OlinkID: Unique ID for each assay assigned by Olink. If the assay is included in more than one panel, it will have a different OlinkID in each one.
• UniProt: UniProt ID.
• Assay: Common gene name for the assay.
• MissingFreq: Missing frequency for the OlinkID, i.e. frequency of samples with NPX value below limit of detection (LOD).
• Panel_Version: Version of the panel. A new panel version might include some different or improved assays.
• PlateID: Name of the plate.
• LOD: Limit of detection (LOD) is the minimum level of an individual protein that can be measured. LOD is defined as 3 times the standard deviation over background.
• NPX: Normalized Protein eXpression is Olink’s unit of protein expression level on a log2 scale. Most functions in this package use NPX values for calculations. Read more about NPX here: https://www.olink.com/faq/what-is-npx/.
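Two of the columns above lend themselves to a quick calculation. The sketch below uses made-up values and placeholder IDs (OID_A, OID_B) to show how a missing frequency follows from NPX and LOD, and what a difference of 1 NPX means on the log2 scale. It illustrates the definitions only and is not how the package derives these columns.

```r
# Hypothetical long-format data for two assays and four samples
npx_long <- data.frame(
  SampleID = rep(c("S1", "S2", "S3", "S4"), times = 2),
  OlinkID = rep(c("OID_A", "OID_B"), each = 4),
  NPX = c(3.2, 0.4, 2.8, 0.6, 5.0, 5.5, 4.9, 6.1),
  LOD = rep(c(0.8, 1.0), each = 4)
)

# MissingFreq: fraction of samples with NPX below LOD, per OlinkID
missing_freq <- tapply(npx_long$NPX < npx_long$LOD, npx_long$OlinkID, mean)
print(missing_freq)

# NPX is on a log2 scale, so a difference of 1 NPX is a two-fold change
fold_change <- 2^(5.5 - 4.5)
print(fold_change)
```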

# Statistical analysis

## Post-hoc ANOVA analysis (olink_anova_posthoc)

olink_anova_posthoc performs a post-hoc ANOVA test using the function emmeans from the R library emmeans with Tukey p-value adjustment per assay (by OlinkID) at confidence level 0.95.

The function handles both factor and numerical variables and/or covariates. The post-hoc test for a numerical variable compares the difference in means of the outcome variable (default: NPX) for 1 standard deviation (SD) difference in the numerical variable, e.g. mean NPX at mean (numerical variable) versus mean NPX at mean (numerical variable) + 1*SD (numerical variable).
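The 1 SD comparison for a numerical variable can be reproduced by hand on simulated data. The sketch below uses a plain linear model from base R rather than emmeans, and the variable names (age, npx) are hypothetical. It shows that the difference between the predicted NPX at the mean and at the mean + 1 SD equals the slope multiplied by the SD.

```r
set.seed(1)

# Simulated numerical variable and outcome (hypothetical names and values)
age <- rnorm(50, mean = 60, sd = 10)
npx <- 2 + 0.05 * age + rnorm(50, sd = 0.3)

fit <- lm(npx ~ age)

# Predicted NPX at mean(age) and at mean(age) + 1 * SD(age)
at_mean <- predict(fit, newdata = data.frame(age = mean(age)))
at_mean_plus_sd <- predict(fit, newdata = data.frame(age = mean(age) + sd(age)))

# The post-hoc estimate is the difference between the two predictions
diff_1sd <- unname(at_mean_plus_sd - at_mean)

# For a linear term this equals slope * SD
slope_times_sd <- unname(coef(fit)[["age"]] * sd(age))
print(c(diff_1sd = diff_1sd, slope_times_sd = slope_times_sd))
```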

### Function arguments

• df: NPX data frame in long format. It should minimally contain protein name (Assay), OlinkID, UniProt, Panel and an outcome factor with at least 3 levels.
• olinkid_list: Character vector of OlinkIDs on which to perform the post-hoc analysis. If not specified, all assays in df are used.
• variable: Single character value or character array. A single character value should name a column in df. If length > 1, the included variable names are used in crossed analyses. The notations ‘:’ and ‘*’ are also accepted.
• covariates: Single character value or character array. Default: NULL. Confounding factors to include in the analysis. A single character value should name a column in df. The notations ‘:’ and ‘*’ are also accepted, but crossed analysis will not be inferred from main effects.
• outcome: Name of the column from df that contains the dependent variable. Default: NPX.
• effect: Term on which to perform the post-hoc analysis. Character vector. Must be a subset of or identical to variable.
• mean_return: Logical. If TRUE, returns the mean of each factor level rather than the difference in means (default). Note that no p-value is returned for mean_return = TRUE and no adjustment is performed.
• verbose: Logical. Default: TRUE. Whether to print information about removed samples, factor conversion and the final model formula to the console.

# calculate the p-value for the ANOVA
anova_results_oneway <- olink_anova(df = npx_data1,
                                    variable = 'Site')
# extracting the significant proteins
anova_results_oneway_significant <- anova_results_oneway %>%
  filter(Threshold == 'Significant') %>%
  pull(OlinkID)
# performing the post-hoc analysis on the significant proteins
anova_posthoc_oneway_results <- olink_anova_posthoc(df = npx_data1,
                                                    olinkid_list = anova_results_oneway_significant,
                                                    variable = 'Site',
                                                    effect = 'Site')

### Function output

A tibble with the following columns:

• Assay <chr>: Assay name.
• UniProt <chr>: UniProt ID.
• term <chr>: Name of the variable that was used for the p-value calculation. The “:” between variables indicates interaction between variables.
• contrast <chr>: Levels of the variable(s) (in term) that are compared.
• estimate <dbl>: Difference in mean NPX between the compared levels (from contrast).
• conf.low <dbl>: Lower bound of the confidence interval for the mean.
• conf.high <dbl>: Upper bound of the confidence interval for the mean.
• Threshold <chr>: Text indication if assay is significant (adjusted p-value < 0.05).
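To see how the columns above relate to a Tukey-adjusted post-hoc test, the sketch below runs one on a single simulated assay. It uses TukeyHSD from base R instead of emmeans, so it is an analogue rather than the package's own code path. The site labels and values are made up, and the Adjusted_pval column name is an assumption.

```r
set.seed(42)

# Simulated NPX values for one assay measured at three hypothetical sites
one_assay <- data.frame(
  Site = factor(rep(c("Site_A", "Site_B", "Site_C"), each = 10)),
  NPX = c(rnorm(10, mean = 5), rnorm(10, mean = 5.2), rnorm(10, mean = 7))
)

# One-way ANOVA followed by Tukey's pairwise comparisons at 0.95 confidence
fit <- aov(NPX ~ Site, data = one_assay)
tukey <- TukeyHSD(fit, "Site", conf.level = 0.95)$Site

# Rearranged to resemble the output described above
posthoc <- data.frame(
  term = "Site",
  contrast = rownames(tukey),
  estimate = tukey[, "diff"],
  conf.low = tukey[, "lwr"],
  conf.high = tukey[, "upr"],
  Adjusted_pval = tukey[, "p adj"],  # column name is an assumption
  row.names = NULL
)
posthoc$Threshold <- ifelse(posthoc$Adjusted_pval < 0.05,
                            "Significant", "Non-significant")
print(posthoc)
```

Three site levels give three pairwise contrasts, each with its own estimate, interval and adjusted p-value.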

## Post-hoc one way non-parametric analysis (olink_one_non_parametric_posthoc)

olink_one_non_parametric_posthoc performs a post-hoc Wilcoxon test using the function wilcox_test from the R library rstatix with Benjamini & Hochberg p-value adjustment per assay (by OlinkID) at confidence level 0.95. The function handles both factor and numerical variables and/or covariates.
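The idea of pairwise Wilcoxon tests with Benjamini & Hochberg adjustment can be sketched in base R. The example uses pairwise.wilcox.test from the stats package rather than wilcox_test from rstatix, treats the groups as independent, and runs on simulated data with hypothetical time point labels.

```r
set.seed(7)

# Simulated NPX values for one assay at three hypothetical time points
time <- factor(rep(c("Baseline", "Week.6", "Week.12"), each = 12))
npx <- c(rnorm(12, mean = 4), rnorm(12, mean = 4.5), rnorm(12, mean = 6))

# All pairwise Wilcoxon rank-sum tests, p-values adjusted with BH
res <- pairwise.wilcox.test(npx, time, p.adjust.method = "BH")

# Lower-triangular matrix of adjusted p-values, one per pair of levels
print(res$p.value)

# Flag the pairs that pass the 0.05 threshold
significant_pairs <- res$p.value < 0.05
print(significant_pairs)
```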

### Function arguments

• df: NPX data frame in long format. It should minimally contain protein name (Assay), OlinkID, UniProt, Panel and an outcome factor with at least 3 levels.
• olinkid_list: Character vector of OlinkIDs on which to perform the post-hoc analysis. If not specified, all assays in df are used.
• variable: Single character value or character array. A single character value should name a column in df.
• verbose: Logical. Default: TRUE. Whether to print information about removed samples, factor conversion and the final model formula to the console.

# Friedman test
Friedman_results <- olink_one_non_parametric(npx_df, "Time", dependence = TRUE)

# Filtering out significant and relevant results
significant_assays <- Friedman_results %>%
  filter(Threshold == 'Significant') %>%
  distinct(OlinkID) %>%
  pull(OlinkID)

# Post-hoc test for the results from the Friedman test
friedman_posthoc_results <- olink_one_non_parametric_posthoc(npx_df, variable = c("Time"), olinkid_list = significant_assays)

### Function output

A tibble with the following columns:

• Assay <chr>: Assay name.
• UniProt <chr>: UniProt ID.
• term <chr>: Name of the variable that was used for the p-value calculation.
• contrast <chr>: Levels of the variable (in term) that are compared.
• estimate <dbl>: Difference in mean NPX between the compared levels (from contrast).
• conf.low <dbl>: Lower bound of the confidence interval for the location parameter.
• conf.high <dbl>: Upper bound of the confidence interval for the location parameter.
• Threshold <chr>: Text indication if assay is significant (adjusted p-value < 0.05).

## Post-hoc of regression models for ordinal data analysis (olink_ordinalRegression_posthoc)

olink_ordinalRegression_posthoc performs a post-hoc ANOVA test using the function emmeans from the R library emmeans with Tukey p-value adjustment per assay (by OlinkID) at confidence level 0.95. The function handles both factor and numerical variables and/or covariates.

### Function arguments

• df: NPX data frame in long format. It should minimally contain protein name (Assay), OlinkID, UniProt, Panel and an outcome factor with at least 3 levels.
• olinkid_list: Character vector of OlinkIDs on which to perform the post-hoc analysis. If not specified, all assays in df are used.
• variable: Single character value or character array. A single character value should name a column in df. If length > 1, the included variable names are used in crossed analyses. The notations ‘:’ and ‘*’ are also accepted.
• covariates: Single character value or character array. Default: NULL. Confounding factors to include in the analysis. A single character value should name a column in df. The notations ‘:’ and ‘*’ are also accepted, but crossed analysis will not be inferred from main effects.
• outcome: Name of the column from df that contains the dependent variable. Default: NPX.
• effect: Term on which to perform the post-hoc analysis. Character vector. Must be a subset of or identical to variable.
• mean_return: Logical. If TRUE, returns the mean of each factor level rather than the difference in means (default). Note that no p-value is returned for mean_return = TRUE and no adjustment is performed.
• verbose: Logical. Default: TRUE. Whether to print information about removed samples, factor conversion and the final model formula to the console.

# Two-way Ordinal Regression
ordinalRegression_results <- olink_ordinalRegression(df = npx_data1,
                                                     variable = "Treatment:Time")
# extracting the significant proteins
significant_assays <- ordinalRegression_results %>%
  filter(Threshold == 'Significant' & term == 'Treatment:Time') %>%
  distinct(OlinkID) %>%
  pull(OlinkID)
# Post-hoc test for the model NPX~Treatment*Time
ordinalRegression_posthoc_results <- olink_ordinalRegression_posthoc(df = npx_data1,
                                                                     olinkid_list = significant_assays,
                                                                     variable = c("Treatment:Time"),
                                                                     covariates = "Site",
                                                                     effect = "Treatment:Time")
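The ‘:’ and ‘*’ notations used in variable follow R's formula syntax. As a small illustration using only base R (not the package's own formula handling), the terms a formula expands to can be listed directly:

```r
# '*' expands to both main effects plus their interaction
crossed <- attr(terms(NPX ~ Treatment * Time), "term.labels")
print(crossed)

# ':' requests only the interaction term
interaction_only <- attr(terms(NPX ~ Treatment:Time), "term.labels")
print(interaction_only)
```

This is why a term such as 'Treatment:Time' can be filtered on in the results and passed as effect in the post-hoc call.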

### Function output

A tibble with the following columns:

• Assay <chr>: Assay name.
• UniProt <chr>: UniProt ID.
• term <chr>: Name of the variable that was used for the p-value calculation. The “:” between variables indicates interaction between variables.
• contrast <chr>: Levels of the variable(s) (in term) that are compared.
• estimate <dbl>: Difference in mean NPX between the compared levels (from contrast).
• Threshold <chr>: Text indication if assay is significant (adjusted p-value < 0.05).

## Post-hoc linear mixed effects model analysis (olink_lmer_posthoc)

The olink_lmer_posthoc function is similar to olink_lmer but performs a post-hoc analysis based on a linear mixed effects model, using the function lmer from the R library lmerTest and the function emmeans from the R library emmeans. The function handles both factor and numerical variables and/or covariates. Differences in estimated marginal means are calculated for all pairwise levels of a given variable. Degrees of freedom are estimated using Satterthwaite’s approximation. The post-hoc test for a numerical variable compares the difference in means of the outcome variable (default: NPX) for 1 standard deviation difference in the numerical variable, e.g. mean NPX at mean(numerical variable) versus mean NPX at mean(numerical variable) + 1*SD(numerical variable). The output tibble is arranged by ascending adjusted p-values.
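The kind of estimate this post-hoc test reports can be illustrated with a random-intercept model on simulated data. The sketch below deliberately swaps in lme from the nlme package, which ships with standard R installations, instead of lmerTest and emmeans. It therefore does not use Satterthwaite's approximation, and all names and values are hypothetical. With a two-level factor, the fixed-effect coefficient is the difference between the two estimated marginal means.

```r
library(nlme)

set.seed(123)

# 12 hypothetical subjects, 3 visits each, 6 subjects per treatment arm
n_subj <- 12
d <- data.frame(
  Subject = factor(rep(seq_len(n_subj), each = 3)),
  Treatment = factor(rep(c("Untreated", "Treated"), each = 3 * n_subj / 2),
                     levels = c("Untreated", "Treated"))
)

# Simulated NPX: treatment shift + subject-specific intercept + noise
subj_effect <- rnorm(n_subj, sd = 0.5)
d$NPX <- 5 + 1 * (d$Treatment == "Treated") +
  subj_effect[as.integer(d$Subject)] + rnorm(nrow(d), sd = 0.3)

# Random intercept per subject, Treatment as fixed effect
fit <- lme(NPX ~ Treatment, random = ~ 1 | Subject, data = d)

# Difference in estimated means (Treated - Untreated) with its 95% interval
estimate <- fixef(fit)[["TreatmentTreated"]]
ci <- intervals(fit, level = 0.95, which = "fixed")$fixed["TreatmentTreated", ]
print(c(estimate = estimate, conf.low = ci[["lower"]], conf.high = ci[["upper"]]))
```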

### Function arguments

• df: NPX data frame in long format. It should minimally contain protein name (Assay), OlinkID, UniProt, Panel, 1-2 variables with at least 2 levels, and subject ID.
• variable: Single character value or character array. A single character value should name a column in df. If length > 1, the included variable names are used in crossed analyses. The notations ‘:’ and ‘*’ are also accepted.
• olinkid_list: Character vector of OlinkIDs on which to perform the post-hoc analysis. If not specified, all assays in df are used.
• effect: Term on which to perform the post-hoc analysis. Character vector. Must be a subset of or identical to variable.
• outcome: Name of the column from df that contains the dependent variable. Default: NPX.
• random: Single character value or character array with random effects.
• covariates: Single character value or character array. Default: NULL. Confounding factors to include in the analysis. A single character value should name a column in df. The notations ‘:’ and ‘*’ are also accepted, but crossed analysis will not be inferred from main effects.
• mean_return: Logical. If TRUE, returns the mean of each factor level rather than the difference in means (default). Note that no p-value is returned for mean_return = TRUE and no adjustment is performed.
• verbose: Logical. Default: TRUE. Whether to print information about removed samples, factor conversion and the final model formula to the console.

# Linear mixed model with two variables.
lmer_results_twoway <- olink_lmer(df = npx_data1,
                                  variable = c('Site', 'Treatment'),
                                  random = 'Subject')
# extracting the significant proteins
lmer_results_twoway_significant <- lmer_results_twoway %>%
  filter(Threshold == 'Significant', term == 'Treatment') %>%
  pull(OlinkID)
# performing post-hoc analysis
lmer_posthoc_twoway_results <- olink_lmer_posthoc(df = npx_data1,
                                                  olinkid_list = lmer_results_twoway_significant,
                                                  variable = c('Site', 'Treatment'),
                                                  random = 'Subject',
                                                  effect = 'Treatment')

### Function output

A tibble with the following columns:

• Assay <chr>: Assay name.
• UniProt <chr>: UniProt ID.
• term <chr>: Name of the variable that was used for the p-value calculation. The “:” between variables indicates interaction between variables.
• contrast <chr>: Levels of the variable(s) (in term) that are compared.
• estimate <dbl>: Difference in mean NPX between the compared levels (from contrast).
• conf.low <dbl>: Lower bound of the confidence interval for the mean.
• conf.high <dbl>: Upper bound of the confidence interval for the mean.
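
# Applying the package plot theme to a ggplot
# (the plotted variables are an illustrative assumption)
npx_data1 %>%
  ggplot(aes(x = Treatment, y = NPX)) +
  geom_boxplot() +
  set_plot_theme()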