HDX-MS is a high-information-content method. It is both a curse and a blessing. A lot can be learned, but data analysis is tedious, as each experiment in HDX-MS provides information on hundreds of peptides. Each experiment should be done at least in triplicate. Very often, multiple time points for each protein state are performed. Finally, multiple protein states are compared, as the goal of performing HDX-MS is often to compare how ligand binding or mutation affects the wild-type apo protein state. This can easily lead to a dataset with multiple thousand peptides that need to be matched between the sets compared and deemed statistically relevant or not. HDXBoxeR provides a statistical framework to identify peptides that are significantly different between given protein states. HDXBoxeR automatically matches and compares sets with each other. It calculates if one protein state is different from another using Welch’s T-test and the Critical Interval statistical framework (Hageman and Weis; Anal. Chem. 2019, 91, 13, 8008–8016). It allows for fast data export and generates scripts for Pymol. Finally, it returns all the calculations required for publication, as recommended by Masson et al. in Nature Methods volume 16, pages 595–602 (2019).
HDXBoxeR facilitates multiple aspects of HDX data analysis:
The vignette is divided into several sections that introduce different aspects of analysis:
The HDXBoxeR requires data! The package can be tested using the data included as a reference and example with the package, so don’t worry. However, to be able to perform your analysis, you need to import your data into R. Input for the HDXBoxeR should be generated using HDXExaminer from Sierra Scientifics.
Input preparation: 1. Open your HDexaminer file. 2. Under the Peptides tab > Pool > right-click peptide: delete peptides with no usable data. 3. Tools > Options > Display > Deuteration Table: Display all columns (we did not select show low-confidence results). 4. Tools > Export > All results table.
HDXBoxeR significantly facilitates the comparison of multiple Protein States in the time series and the number of replicates. We recommend using at least 3 replicates for a robust statistical analysis. All the Protein States imported are required to have the same number of replicates. Another recommendation is to use only peptides the user is confident in. There is no point in analyzing bad data.
The example looks as follows:
The input file should have the following columns: Protein State, Deut Time, Experiment, Start, End, Sequence, Charge, Search RT, # Deut, Deut %. If any of these columns is missing data will not be loaded. Another thing to make sure of is that the columns are properly aligned, meaning there is the name number of headers and columns, as sometimes the Confidence column gets shifted in the AllResultsTable.csv file. We keep typically just peptides with high and medium confidence. Example data compares two Protein States: Unbound and Bound. Each of the sets has three replicates. Data also should have one replicate of undeuterated and fully deuterated sets labeled as 0s and FD respectively for allowing full analysis.
First, ensure that you have R and RStudio installed on your computer. Then you can proceed with the installation of the HDXBoxeR package. There are two methods to install the HDXBoxeR package:
To install, use the following command:
####installation using CRAN
install.packages("HDXBoxeR") #execute only once
The HDXBoxeR package is available on GitHub and can be installed using the devtools package. This method of installation is an alternative to the CRAN method.
####installation using Github.
#run only if devtools package is not installed on your machine
install.packages("devtools")
library(devtools) #run next two commends only once
devtools::install_github("mkajano/HDXBoxeR")
#Once installed, you can load the HDXBoxeR package using the following command:
library(HDXBoxeR)
To load data into R, you will need input as described in section 1. Example input is provided as a reference, and to start the fun. Data, when loaded, will be reprocessed into different formats depending on the type of analysis to be performed. There are several options in which input can be processed, as listed below. Function names reflect the chosen option.
Input can be formatted to:
output_tp
vs output_tc
functions).percent=F/T
).states
and times
,
respectively).replicates
option in
the function).match_seq
.Examples:
# Path to example input
library(HDXBoxeR)
file_nm<-system.file("extdata", "All_results_table.csv", package = "HDXBoxeR")
### Later change the path to the path you your data!!!
##########Deuteration uptake vs percent deuteration file preparation
#### Input for uptake deuteration. Default will use all common states, all common Deuteration Times, max. common number of replicates, no csv will be written, and peptide sequences will be matched.
input_uptake_timepoint<-output_tp(filepath = file_nm)
#> Protein.States used: Unbound bound
#>
#> Deut.times used: 3.00s 60.00s 1800.00s 72000.00s
#>
#> Number of replicates used: 3
#>
#> no csv file written
# input for percent deuteration
input_deut_timepoint<-output_tp(filepath = file_nm, percent=T)
#> Protein.States used: Unbound bound
#>
#> Deut.times used: 3.00s 60.00s 1800.00s 72000.00s
#>
#> Number of replicates used: 3
#>
#> no csv file written
#################
####Time courses
#Deuteration uptake
input_uptake_timecourse<-output_tc(file_nm)
#> Protein.States used: Unbound bound
#>
#> Deut.times used: 3.00s 60.00s 1800.00s 72000.00s
#>
#> Number of replicates used: 3
#>
#> no csv file written
# Time courses comparisons
input_deut_timecourse<-output_tc(file_nm, percent=T)
#> Protein.States used: Unbound bound
#>
#> Deut.times used: 3.00s 60.00s 1800.00s 72000.00s
#>
#> Number of replicates used: 3
#>
#> no csv file written
#################
####Analyze selected protein.states, deuteration times, replicates.
names_states<- nm_states(file_nm) ## returns names of the states in file.
input_two_states_two_timepoints<-output_tp(filepath=file_nm, replicates=3, states=names_states[c(1,2)], times=c("3.00s", "72000.00s"), percent=FALSE)
#> Deut.times used: 3.00s 72000.00s
#>
#> Number of replicates used: 3
#>
#> no csv file written
###add option to match sequence and to save as csv (here not used)
input_subset_states<-output_tp(filepath=file_nm, replicates=3, states=names_states[2], times=c("3.00s", "72000.00s"), percent=FALSE, seq_match=FALSE, csv="NA")
#> Deut.times used: 3.00s 72000.00s
#>
#> Number of replicates used: 3
#>
#> no csv file written
Now that the data is loaded let’s start the fun!
Great! Now that the hardest part is done, let’s delve into the actual analysis.
HDXBoxeR provides a statistical framework to identify peptides that are significantly different between given protein states. To conduct statistics, we recommend providing at least 3 replicates for each protein state (duplicate data also work). The number of replicates and Deut Time will be trimmed to be the same for all Protein States in the analysis.
HDXBoxeR employs a hybrid statistical approach using the Welch T-test and Critical interval statistical framework proposed by Hageman and Weis (Hageman and Weis; Anal. Chem. 2019, 91, 13, 8008–8016). To be deemed significantly different, peptides must meet significance criteria on both the individual peptide level (Welch T-test) and surpass the global critical interval for all peptides in the set.
For detailed information, please refer to the cited paper (Hageman and Weis; Anal. Chem. 2019, 91, 13, 8008–8016). In short, the Welch T-test determines if the Null hypothesis can be rejected (Null hypothesis: the average of two means for individual peptides are equal). If the individual critical interval was below the p-value, the null hypothesis was rejected – peptides were deemed significant from an individual peptide standpoint. For the global significance threshold, the propagated standard error of the pooled standard deviation of means for all peptides in two sets was calculated and scaled to the Student distribution. This involves calculating the pooled standard deviation for two sets separately, propagating the error overall standard deviation of these two sets, and finally, scaling it by T-distribution to determine the global critical interval that is the global threshold. If the peptides passed both criteria for the whole set and individual peptides, they were deemed significant.
Most of the statistical analysis is already implemented in plot and other functions, but it is good to have a way to do it yourself if needed. HDXBoxeR allows for the facile calculation of averages and the standard deviation for each set. Additionally, it allows for the calculation of global and individual critical intervals.
Individual p-values for each peptide are calculated against
the first protein state based on the location from the input file. If
the user desires calculation against different protein states, it can be
done by loading HDXExaminer, taking advantage of the states
parameter in output_tp()
or output_tc()
functions.
library(HDXBoxeR)
file_nm<-system.file("extdata", "All_results_table.csv", package = "HDXBoxeR")
input_uptake_timepoint<-output_tp(filepath = file_nm)
#> Protein.States used: Unbound bound
#>
#> Deut.times used: 3.00s 60.00s 1800.00s 72000.00s
#>
#> Number of replicates used: 3
#>
#> no csv file written
###average calculation for all peptides
av1<-ave_timepoint(input_uptake_timepoint)
head(av1)
#> Deut.Time Start End Sequence Search.RT Charge av_Unbound_X..Deut
#> 1 3.00s 40 58 GPLGSKAVVPGPAEHPLQY 6.83 2 8.327667
#> 2 3.00s 40 59 GPLGSKAVVPGPAEHPLQYN 6.44 2 8.717333
#> 3 3.00s 47 63 VVPGPAEHPLQYNYTFW 10.86 2 3.974667
#> 4 3.00s 62 77 FWYSRRTPGRPTSSQS 6.12 3 7.820667
#> 5 3.00s 63 77 WYSRRTPGRPTSSQS 5.33 3 7.625333
#> 6 3.00s 64 77 YSRRTPGRPTSSQS 4.64 2 7.799333
#> av_bound_X..Deut
#> 1 7.013667
#> 2 7.468667
#> 3 2.785667
#> 4 7.779333
#> 5 7.788667
#> 6 7.635333
###average differences (against first Protein State in the file)
da1<-dif_ave(av1)
###standard deviation calculation for all peptides
sd1<-sd_timepoint(input_uptake_timepoint)
head(sd1)
#> Deut.Time Start End Sequence Search.RT Charge sd_Unbound_X..Deut
#> 1 3.00s 40 58 GPLGSKAVVPGPAEHPLQY 6.83 2 0.13838473
#> 2 3.00s 40 59 GPLGSKAVVPGPAEHPLQYN 6.44 2 0.15430597
#> 3 3.00s 47 63 VVPGPAEHPLQYNYTFW 10.86 2 0.10868456
#> 4 3.00s 62 77 FWYSRRTPGRPTSSQS 6.12 3 0.13059990
#> 5 3.00s 63 77 WYSRRTPGRPTSSQS 5.33 3 0.22095324
#> 6 3.00s 64 77 YSRRTPGRPTSSQS 4.64 2 0.05689757
#> sd_bound_X..Deut
#> 1 0.05379901
#> 2 0.10442382
#> 3 0.04764801
#> 4 0.08114391
#> 5 0.01357694
#> 6 0.03754109
### Global critical interval for 1 protein sets
CI_single(s1 =sd1[,7], replicates = 3 )
#> [1] 0.1789317
##Global critical interval for 2 protein sets
CI_2pts(s1 = sd1[,7], s2=sd1[,8], replicates = 3)
#> [1] 0.2706825
##individual peptide p-value calculation against first set in the input file
pv1<-pv_timepoint(input_uptake_timepoint)
head(pv1)
#> Deut.Time Start End Sequence Search.RT Charge pv_Unbound_X..Deut
#> 1 3.00s 40 58 GPLGSKAVVPGPAEHPLQY 6.83 2 1
#> 2 3.00s 40 59 GPLGSKAVVPGPAEHPLQYN 6.44 2 1
#> 3 3.00s 47 63 VVPGPAEHPLQYNYTFW 10.86 2 1
#> 4 3.00s 62 77 FWYSRRTPGRPTSSQS 6.12 3 1
#> 5 3.00s 63 77 WYSRRTPGRPTSSQS 5.33 3 1
#> 6 3.00s 64 77 YSRRTPGRPTSSQS 4.64 2 1
#> pv_bound_X..Deut
#> 1 0.0012982057
#> 2 0.0006305205
#> 3 0.0006958101
#> 4 0.6701607913
#> 5 0.3287217910
#> 6 0.0188970633
####Analyze selected protein.states, deuteration times, replicates.
names_states<- nm_states(file_nm) ## returns names of the states in file.
### choosing different state as a control for analysis
input_states_reversed<-output_tp(file_nm, states=names_states[c(2,1)] )
#> Deut.times used: 3.00s 60.00s 1800.00s 72000.00s
#>
#> Number of replicates used: 3
#>
#> no csv file written
pv2<-pv_timepoint(input_states_reversed)
head(pv2)
#> Deut.Time Start End Sequence Search.RT Charge pv_bound_X..Deut
#> 1 3.00s 40 58 GPLGSKAVVPGPAEHPLQY 6.83 2 1
#> 2 3.00s 40 59 GPLGSKAVVPGPAEHPLQYN 6.44 2 1
#> 3 3.00s 47 63 VVPGPAEHPLQYNYTFW 10.86 2 1
#> 4 3.00s 62 77 FWYSRRTPGRPTSSQS 6.12 3 1
#> 5 3.00s 63 77 WYSRRTPGRPTSSQS 5.33 3 1
#> 6 3.00s 64 77 YSRRTPGRPTSSQS 4.64 2 1
#> pv_Unbound_X..Deut
#> 1 0.0012982057
#> 2 0.0006305205
#> 3 0.0006958101
#> 4 0.6701607913
#> 5 0.3287217910
#> 6 0.0188970633
There are a few variables that are going to be useful while analyzing the data. It might be a good idea to have them saved in the memory. One can call them as follows:
###load data
states<-arguments_call1(filepath=file_nm)
times<-arguments_call2(filepath=file_nm, states=states)
replicates<-arguments_call3(filepath=file_nm, states=states, times=times)
Sometimes, we do not want to work on or plot the whole imported
dataset. One can either load a smaller subset or choose a part of the
subset from the loaded dataset to work on. The already loaded data can
be subsetted using the select_indices
function or by
choosing desired columns from the loaded data. In the loaded data, the
first 6 columns always describe peptides, and the rest of the columns
contain data.
#################
####Time courses
#Deuteration uptake
input_uptake_timecourse<-output_tc(file_nm)
#> Protein.States used: Unbound bound
#>
#> Deut.times used: 3.00s 60.00s 1800.00s 72000.00s
#>
#> Number of replicates used: 3
#>
#> no csv file written
# Time courses comparisons
small_df1<-input_uptake_timepoint[,c(1:6, 10:12)]
#if one wants to focus just on the few first peptides in the data it is possible to select a few rows from the input data.
small_df2<-input_uptake_timepoint[1:3,]
#To have more control of the selecting one can also use the select_indices functions as follows:
##for output_tp function variables that can be used are:
#below one time of 60s was used for peptides that started at residue 50, ended at residue 100 and at the max 12 residues long.
#not all parameters need to be used.
inda<-select_indices(input_uptake_timepoint, times = c("60.00s"),start = 50, end=100, length=12)
#after the indices are selected the input can be subsetted as normal
small_df3<-input_uptake_timepoint[inda,]
##for the output_tc function allowed are the following parameters
timecourse_output<-output_tc(file_nm)
#> Protein.States used: Unbound bound
#>
#> Deut.times used: 3.00s 60.00s 1800.00s 72000.00s
#>
#> Number of replicates used: 3
#>
#> no csv file written
indb<-select_indices(timecourse_output, states = "bound",start = 50, end=100, length=12)
head(timecourse_output[indb,1:6])
#> Sequence Protein.State Start End Search.RT Charge
#> 16 YEQNIKQIGTF bound 78 88 7.61 2
#> 18 IKQIGTF bound 82 88 6.81 2
#> 20 ASVEQFWRF bound 89 97 10.72 2
HDXBoxeR features functionality to create various types of plots, including volcano plots, heat maps, and robot plots (modified butterfly plots).
Information on how to load input into HDXBoxeR can be found in section 3. If the returned plots have truncated labels, adjust the figure sizes.
Sometimes, it is just useful to look at raw-ish data. Uptake plots are the answer here.
###load data
library(HDXBoxeR)
file_nm<-system.file("extdata", "All_results_table.csv", package = "HDXBoxeR")
input_deut_timepoint<-output_tc(file_nm, percent=TRUE)
#> Protein.States used: Unbound bound
#>
#> Deut.times used: 3.00s 60.00s 1800.00s 72000.00s
#>
#> Number of replicates used: 3
#>
#> no csv file written
# define the timepoint using in the experiment. Here the times in seconds
x<-c(3, 60,1800, 72000)
uptake_plots(input_deut_timepoint[1:2,],x)
Boxplots enable quick comparisons of average uptake or percent deuteration for all protein states.
###load data
# library(HDXBoxeR)
# file_nm<-system.file("extdata", "All_results_table.csv", package = "HDXBoxeR")
# input_uptake_timepoint<-output_tp(file_nm)
### Returns boxplots for all the time points and all protein states.
boxplot_tp(input_uptake_timepoint, col= c("gold2", "dodgerblue"))
###load data
library(HDXBoxeR)
file_nm<-system.file("extdata", "All_results_table.csv", package = "HDXBoxeR")
# ## input for deuteration uptakeinput_uptake_timepoint<-output_tp(file_nm)
# # input for procent deuteration
input_deut_timepoint<-output_tp(file_nm, percent=T)
#> Protein.States used: Unbound bound
#>
#> Deut.times used: 3.00s 60.00s 1800.00s 72000.00s
#>
#> Number of replicates used: 3
#>
#> no csv file written
## deuteration uptake for time points
plots_av_tp(input_uptake_timepoint)
### average plots with colors chosen by user.
plots_av_tp(input_uptake_timepoint,replicates=3, cola=c(1:10))
### average percent deuteration
plots_av_tp_proc(input_deut_timepoint)
The average difference plots depict deuteration uptake (or percent
deuteration) between different Protein states. By default, the first
Protein State from the input file is chosen as a state from which other
values are subtracted. For example, if the input file has three Protein
States: State1, State2, State2, average difference plots will show two
curves for the difference in deuteration uptake: (1) State1-State2, (2)
State1-State3. If other permutations are desired, use the
states
parameter in output_tp
or
output_tc
functions. All time-points are drawn on separate
plots. The function is useful when comparing multiple Protein
States.
# ###load data
# file_nm<-system.file("extdata", "All_results_table.csv", package = "HDXBoxeR")
# ## input for deuteration uptake
# input_uptake_timepoint<-output_tp(file_nm)
## difference in uptake deuterations between two sets
plots_diff_tp(input_uptake_timepoint)
#Load deut data
# input for procent deuteration
# input_deut_timepoint<-output_tp(file_nm, percent=T)
### difference in procent deuteration
plots_diff_tp_proc(input_deut_timepoint,replicates=3, cola=4)
#input with different order states
names_states<- nm_states(file_nm)
input_reversed_states<-output_tp(file_nm, states = rev(names_states))
#> Deut.times used: 3.00s 60.00s 1800.00s 72000.00s
#>
#> Number of replicates used: 3
#>
#> no csv file written
### average plots for deuteration uptake where control state was chosen differently
plots_diff_tp(input_reversed_states, col="darkgreen")
Volcano plots were introduced by the Weis laboratory (Analytical Chemistry 2019, 91 (13), 8008-8016 DOI: 10.1021/acs.analchem.9b01325). The strength of the volcano plots is that they allow for a quick view to see how many peptides are between the sets and provide statistics for the differences. The Volcano plots take advantage of the p-value and critical interval calculation. The p-value is calculated per each peptide using the Welch t-test for the desired number of replicates. The global critical interval is calculated as described in Hageman and Weis (Analytical Chemistry 2019, 91 (13), 8008-8016 DOI: 10.1021/acs.analchem.9b01325). As the default alpha is set to 0.99, the p-value cut off is marked by horizontal lines, and the global critical interval by vertical lines. Significantly different peptides between the sets are located in the shaded area. The volcano plots are only prepared for differences in deuterons uptake. The critical intervals for all time-points are returned in the terminal.
###load data
file_nm<-system.file("extdata", "All_results_table.csv", package = "HDXBoxeR")
## input for deuteration uptake
a<-output_tp(file_nm)
#> Protein.States used: Unbound bound
#>
#> Deut.times used: 3.00s 60.00s 1800.00s 72000.00s
#>
#> Number of replicates used: 3
#>
#> no csv file written
### basic volcano plot
plots_vol_tp(a)
#> CI @ 3.00s : 0.35
#> CI @ 60.00s : 0.27
#> CI @ 1800.00s : 0.24
#> CI @ 72000.00s : 0.2
# change colors for significant peptides in volcano plots
plots_vol_tp(a, cola=c(2,3), replicates=3)
#> CI @ 3.00s : 0.35
#> CI @ 60.00s : 0.27
#> CI @ 1800.00s : 0.24
#> CI @ 72000.00s : 0.2
#change pv_cutoff from 0.01 to 0.1
plots_vol_tp(a, pv_cutoff = 0.1)
#> CI @ 3.00s : 0.16
#> CI @ 60.00s : 0.13
#> CI @ 1800.00s : 0.11
#> CI @ 72000.00s : 0.09
The significant peptides plots draw peptides along the covered sequence of the protein. Peptides deemed significantly different using the Welch test and the global critical interval criterion are colored in a red-blue scheme. Red-colored peptides are more exposed compared to the control set, while blue colors correspond to increased protection compared to the control set. Non-significant peptides are colored black.
The plots can be drawn using either percent deuteration or
deuteration uptake input sets. If the percent deuteration input is used,
the difference between the two sets is colored according to the chosen
scheme (function plot_peptide_sig_tp_proc
). If using
deuteration uptake input, data is converted to percents using the
equation:
(Uptake(state1) - Uptake(state2)) / Uptake(state1) * 100%
(function: plot_peptide_sig_tp
).
Percent differences are divided into ranges, each corresponding to a specific color in the red-blue color scheme. Range values can be changed as desired within the red-blue color scheme. In the terminal, ranges for each set are returned to help choose ranges of the color scheme.
Plots are generated separately for each time-point and each pair of
the difference values (State1-State2, State1-State3…). The
plot_peptide_sig_tp
generates multiple plots, which usually
calls for using the plotting parameter mfrow
.
###load data
file_nm<-system.file("extdata", "All_results_table.csv", package = "HDXBoxeR")
## input for deuteration uptake
a<-output_tp(file_nm)
#> Protein.States used: Unbound bound
#>
#> Deut.times used: 3.00s 60.00s 1800.00s 72000.00s
#>
#> Number of replicates used: 3
#>
#> no csv file written
### significantly different peptides are colored in red-blue scheme.
#par(mfrow=c(4,1)) # uncomment if desired
plot_peptide_sig_tp(a,replicates = 3)
#> For timepoint 3.00s
#> values range from 0 to 5.68 %
#> Unbound bound range 0 5.68
#> For timepoint 60.00s
#> values range from -0.28 to 4.04 %
#> Unbound bound range -0.28 4.04
#> For timepoint 1800.00s
#> values range from -0.31 to 1.96 %
#> Unbound bound range -0.31 1.96
#> For timepoint 72000.00s
#> values range from -0.99 to 6.01 %
#> Unbound bound range -0.99 6.01
### Plot where 18 peptides per row are drawn (nb_pep_row=18, default=50),
#p-value&critial interval was made more stingent (0.001)
# % ranges are colored were changed.
#par(mfrow=c(4,1)) #uncomment if desired
plot_peptide_sig_tp(a,nb_pep_row = 18, ranges = c(-Inf, -10, -2.5,0,2.5, 10, Inf), pv_cutoff = 0.001)
#> For timepoint 3.00s
#> values range from 0 to 5.68 %
#> Unbound bound range 0 5.68
#> For timepoint 60.00s
#> values range from 0 to 4.04 %
#> Unbound bound range 0 4.04
#> For timepoint 1800.00s
#> values range from 0 to 1.96 %
#> Unbound bound range 0 1.96
#> For timepoint 72000.00s
#> values range from -0.99 to 1.49 %
#> Unbound bound range -0.99 1.49
## Legend for significant peptides plot
#default ranges for figures does not require argument
legend_sig_peptides()
### Using different range scheme
legend_sig_peptides(ranges = c(-Inf, -10, -2.5,0,2.5, 10, Inf))
Differential Heat Maps are plots where peptides significantly different between the sets are translated into residue-specific information. Heat maps are a fast, highly informative, and elegant way to see which regions are perturbed in the sets. However, it means that some information is lost (coverage). There are two options on how these calculations are performed:
Differential heat maps can be calculated using the difference in percent deuteration between two sets or uptake of (control - other sets) / (uptake of control) * 100.
The example scheme for how average uptake and maximum uptake per residue are generated is shown below.