---
title: "Indicator species analysis"
author: "Miquel De Cáceres, CREAF"
date: "`r Sys.Date()`"
output:
rmarkdown::html_vignette:
toc: TRUE
bibliography: indicspecies.bib
vignette: >
%\VignetteIndexEntry{Indicator species analysis}
%\VignetteEngine{knitr::rmarkdown}
%\VignettePackage{indicspecies}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
## Introduction
Determining the occurrence or abundance of a small set of indicator species, as an alternative to sampling the entire community, has been particularly useful in longterm environmental monitoring for conservation or ecological management. Species are chosen as indicators if they:
i. reflect the biotic or abiotic state of the environment;
ii. provide evidence for the impacts of environmental change; or
iii. predict the diversity of other species, taxa or communities within an area.
In this tutorial we will show how to use the functions included in package `indicspecies` to conduct indicator species analysis. This package was originally created as a supplementary material to @DeCaceres2009, but has been developing since then and now `indicspecies` updates are distributed from CRAN and GitHub repositories.
Before doing anything else, we need to load the functions of the package:
```{r}
library(indicspecies)
```
## Data required for indicator species analysis
Indicator species are often determined using an analysis of the relationship between the species occurrence or abundance values from a set of sampled sites and the classification of the same sites into site groups, which may represent habitat types, community types, disturbance states, etc. Thus, there are two data elements in an indicator species analysis: (1) the community data matrix; and (2) the vector that describes the classification of sites into groups.
### The community data matrix
This is a matrix (or a data frame) with **sites in rows** and **species in columns**. Normally, we will use functions like `read.table()` or `read.csv()` to read our data set from a file. In this example we load our example dataset into the workspace using:
```{r}
data(wetland)
```
The wetland data set describes the vegetation of the Adelaide river alluvial plain (Australia), as sampled by @Bowman1987. It contains the abundance values of 33 species (columns) in 41 sites (rows).
### Defining the classification of sites
In order to run an indicator species analysis we need a vector containing the classification of the sites into groups. The intepretation of these site groups is left to the user. Like community data, site classifications can be read from a data file. A vector of site groups can be also created manually, for example, using the R functions `c()` and `rep()`:
```{r}
groups <- c(rep(1, 17), rep(2, 14), rep(3,10))
groups
```
Alternatively, one can obtain a classification using non-hierarchical cluster analysis:
```{r}
wetkm <- kmeans(wetland, centers=3)
groupskm <- wetkm$cluster
groupskm
```
If the site classification vector is obtained independently of species data, the significance of statistical tests carried out on the indicator species will be meaningful. For example, one could classify the sites using environmental data before indicator species analysis. An example is found in @Borcard2011.
## Indicator species analysis using `multipatt()`
Function `multipatt()` is the most commonly used function of `indicspecies`. It allows determining lists of species that are associated to particular groups of sites (or combinations of those). Once we have the two data components mentioned in the previous section, we are ready to run an indicator species analysis using `multipatt()`.
### Indicator Value analysis with site group combinations
When the aim is to determine which species can be used as indicators of certain site group an approach commonly used in ecology is the *Indicator Value* [@Dufrene1997]. These authors defined an Indicator Value (IndVal) index to measure the association between a species and a site group. The method of @Dufrene1997 calculates the IndVal index between the species and each site group and then looks for the group corresponding to the highest association value. Finally, the statistical significance of this relationship is tested using a permutation test. IndVal is the default index used to measure the association between a species and a group of sites in `multipatt()`. However, by default `multipatt()` uses an extension of the original Indicator Value method, because the function looks for indicator species of both individual site groups *and* combinations of site groups, as explained in @DeCaceres2010.
Indicator species analysis (with site group combinations) can be run using:
```{r}
indval <- multipatt(wetland, groups,
control = how(nperm=999))
```
As mentioned before, by default `multipatt()` uses the IndVal index (`func = "IndVal.g"`) as test statistic. Actually, the square root of IndVal is returned by the `multipatt()` function. The option `control = how(nperm=999)` allows choosing the number of random permutations required for the permutational test (this number affects the precision of the p-value). Function `how()` from the `permute` package allows defining more complex permutational designs.
#### Displaying the results
When the indicator species analysis is completed, we can obtain the list of indicator species for each site group (or site group combination) using:
```{r}
summary(indval)
```
In our wetland community data, 'Ludads' is strongly and significantly associated with Group 1, whereas 'Pancam' would be a good indicator of Group 3. In addition, there are some species whose patterns of abundance are more associated with a combination of groups. For example, 'Melcor' is strongly associated with the combination of Groups 2 and 3.
It is important to stress that the indicator species analysis is conducted for each species independently, although the results are often summarized for all species. User should bear in mind possible problems of multiple testing when making community-level statements such as '*the number of indicator species is X*' [@DeCaceres2009; @Legendre2012].
#### Examining the indicator value components
If the association index used in `multipatt()` is `func = "IndVal"` or `func = "IndVal.g"`, one can also inspect the indicator value components when displaying the results. Indeed, the indicator value index is the product of two components, called 'A' and 'B' [@Dufrene1997; @DeCaceres2009]:
1. Component 'A' is sample estimate of the probability that the surveyed site belongs to the target site group given the fact that the species has been found. This conditional probability is called the *specificity* or *positive predictive value* of the species as indicator of the site group.
2. Component 'B' is sample estimate of the probability of finding the species in sites belonging to the site group. This second conditional probability is called the *fidelity* or *sensitivity* of the species as indicator of the target site group.
To display the indicator value components 'A' and 'B' one simply uses:
```{r}
summary(indval, indvalcomp=TRUE)
```
This gives us additional information about why species can be used as indicators. For example, 'Ludads' is a good indicator of Group 1 because it occurs in sites belonging to this group only (i.e., A = 1.0000), although not all sites belonging to Group 1 include the species (i.e., B = 0.8235). In contrast, 'Pancam' can be used to indicate Group 3 because it appears in all sites belonging to this group (i.e., B = 1.0000) and it is largely (but not completely) restricted to it (i.e., A = 0.8278).
#### Inspecting the indicator species analysis results for all species
In our previous calls to `summary()` only the species that were significantly associated with site groups (or site group combinations) were shown. One can display the result of the indicator species analysis for all species, regardless of whether the permutational test was significant or not. This is done by changing the significance level in the summary:
```{r}
summary(indval, alpha=1)
```
Parameter `alpha` is by default set to `alpha = 0.05`, and hides all species association that are not significant at this level. By setting `alpha = 1` we say we want to display the group to which each species is associated, regardless of whether the association significant or not. However, note that in our example we obtain the results of 29 (21+8) species. As there are 33 species in the data set, there are still four species missing in this summary. This happens because those species have their highest IndVal value for the set of all sites. In other words, those species occur in sites belonging to all groups. The association with the set of all sites cannot be statistically tested, because there is no external group for comparison. In order to know which species are those, one has to inspect the object `sign` returned by `multipatt()`:
```{r}
indval$sign
```
After accessing the object `indval$sign`, we know that the four species whose highest IndVal corresponded to the set of all sites were 'Valind', 'Sessp.', 'Helind' and 'Ipoaqu', as indicated by the missing values in the `p.value` column of the data frame. The first columns of \texttt{sign} indicate (with ones and zeroes) which site groups were included in the combination preferred by the species. Then, the column \texttt{index} indicates the index of the site group combination (see subsection Excluding site group combinations in `multipatt()` below). The remaining two columns are the association statistic and the p-value of the permutational test.
### Analyzing species ecological preferences with correlation indices
Several other indices can be used to analyze the association between a species and a group of sites [@DeCaceres2009]. Diagnostic (or indicator) species are an important tool in vegetation science, because these species can be used to characterize and indicate specific plant community types. A statistic commonly used to determine the association (also known as *fidelity*, not to be confounded with the indicator value component) between species and vegetation types is Pearson's *phi coefficient of association* [@Chytry2002a]. This coefficient is a measure of the correlation between two binary vectors. It is possible to calculate the phi coefficient in `multipatt()` after transforming our community data to presence-absence:
```{r}
wetlandpa <- ifelse(wetland>0,1,0)
phi <- multipatt(wetlandpa, groups, func = "r",
control = how(nperm=999))
```
What would be the association index if we had used abundance values instead of presence and absences (i.e. `wetland` instead of `wetlandpa`)? The abundance-based counterpart of the phi coefficient is called the *point biserial correlation coefficient*.
It is a good practice to correct the phi coefficient for the fact that some groups have more sites than others [@Tichy2006]. To do that, we need to use `func = "r.g"` instead of `func = "r"`:
```{r}
phi <- multipatt(wetlandpa, groups, func = "r.g",
control = how(nperm=999))
```
Remember that the default association index of `multipatt()` is `func = "IndVal.g"`, which also includes ".g". In fact, the Indicator Value index defined by @Dufrene1997 already incorporated a correction for unequal group sizes. It is possible to avoid this correction by calling `multipatt()` with `func = "IndVal"`. However, in general we recommend using either `func = "IndVal.g"` or `func = "r.g"` for indicator species analysis.
Indicator value and correlation indices usually produce similar results. Indeed, if we display the results of the phi coefficient of association we see that they are qualitatively similar to those of IndVal:
```{r}
summary(phi)
```
Nevertheless, there are some differences between indicator values and correlation indices [@DeCaceres2008; @DeCaceres2009]. Correlation indices are used for determining the ecological preferences of species among a set of alternative site groups or site group combinations. Indicator value indices are used for assessing the predictive values of species as indicators of the conditions prevailing in site groups, e.g. for field determination of community types or ecological monitoring.
An advantage of the phi and point biserial coefficients is that they can take negative values. When this happens, the value of the index is expressing the fact that a species tends to 'avoid' particular environmental conditions. We will find negative association values if we inspect the strength of association in the results of `multipatt()` when these coefficients are used:
```{r}
round(head(phi$str),3)
```
In contrast, indicator values are always non-negative:
```{r}
round(head(indval$str),3)
```
The fact that correlation indices take negative values means that they can be used to detect *negative associations* (i.e. site groups or site group combinations that are statistically avoided by the species). Function `multipatt()` can be used for this purpose without modification. The reason is that strength of correlation for a given site group combination has always the same absolute value than the correlation for the complementary combination (you can verify this in the table above displaying the strength of association results for `phi`). This entails that the site group combination that maximizes positive association is the complement of the site group combination that minimizes (i.e. maximizes the negative value) negative association. For example, in species 'Pancam' the maximum positive phi correlation is for Group 3 and the minimum negative phi correlation is for the combination of Group 1 and Group 2 (i.e. the complement of Group 3). Hence, one could use the complement of site group combinations indicated by `multipatt()` for positive associations as results for negative associations. Following the same example, species 'Pancam' would be negatively associated to the combination of Group 1 and Group 2 with the same p-value as obtained for its positive association with Group 3.
Unlike with indicator value coefficients, the set of all sites can never be considered with the phi or point biserial coefficients, because these coefficients always require a set of sites for comparison, besides the target site group or site group combination of interest.
### Excluding site group combinations in `multipatt()`
There are two situations where site group combinations may be excluded from the analysis:
a. When conducting indicator species analysis, it may happen that some combinations of site groups are **difficult to interpret ecologically**. In those cases, we may decide to exclude those combinations from the analysis, so our species may appear associated to other (more interpretable) ecological conditions.
b. The combinations to be explored will need to be restricted if **the initial number of groups is large** (e.g. more than 15), because the number of possible combinations grows exponentially and calculations will become computationally unfeasible if combinations are not restricted.
There are three ways to restrict the site group combinations to be considered in `multipatt()`, as detailed in the following subsections.
#### Indicator species analysis without site groups combinations
The original Indicator Value method of @Dufrene1997 did not consider combinations of site groups. In other words, the only site group combinations permitted in the original method were singletons. When using `multipatt()` it is possible to avoid considering site group combinations, as in the original method, by using `duleg = TRUE`:
```{r}
indvalori <- multipatt(wetland, groups, duleg = TRUE,
control = how(nperm=999))
summary(indvalori)
```
#### Restricting the order of site groups combinations
The second way to exclude site group combinations from a `multipatt()` analysis is to indicate the maximum order of the combination to be considered. Using the option `max.order` we can restrict site group combinations to be, for example, singletons (`max.order = 1`, which is equal to `duleg=TRUE`), singletons and pairs (`max.order = 2`), or singletons, pairs and triplets (`max.order = 3`). In the follow example, only singletons and pairs are considered:
```{r}
indvalrest <- multipatt(wetland, groups, max.order = 2,
control = how(nperm=999))
summary(indvalrest)
```
In this case the output looks like a the output of an unrestricted `multipatt()` execution, because the only combination that is excluded is the set of all sites, which cannot be tested for significance and thus never appears in the summary.
Limiting the maximum order of combinations is the commonest way to restrict combinations to avoid computational problems derived from a large number of site groups.
#### Specifying the site groups combinations to be considered
There is a third, more flexible, way of restricting site group combinations. The input parameter vector `restcomb` allows specifying the combinations of site groups that are permitted in `multipatt()`. In order to learn how to use parameter `restcomb`, we must first understand that inside `multipatt()` site groups and site group combinations are referred to with integers. Site group combinations are numbered starting with single groups and then increasing the order of combinations. For example, if there are three site groups, the first three integers '1' to '3' identify those groups. Then, '4' identifies the combination of Group 1 and Group 2, '5' identifies the combination of Group 1 and Group 3, and '6' identifies the combination of Group 2 and Group 3. Finally, '7' identifies the combination of all three groups.
The numbers composing the vector passed to `restcomb` indicate the site groups and site group combinations that we want `multipatt()` to considered as valid options. For example, if we do not want to consider the combination of Group 1 and Group 2, we will exclude combination '4' from vector `restcomb`:
```{r}
indvalrest <- multipatt(wetland, groups, restcomb = c(1,2,3,5,6),
control = how(nperm=999))
summary(indvalrest)
```
If we compare these last results with those including all possible site group combinations, we will realize that species 'Elesp.' was formerly an indicator of Group 1 and Group 2, and now it does not appear in the list of indicator species. If fact, if we examine the results more closely we see that the highest IndVal for 'Elesp' is achieved for group 1, but this relationship is not significant:
```{r}
indvalrest$sign
```
Restricting site group combinations is also possible with the phi and point biserial coefficients.
## Additional functions to estimate and test the association between species and groups of sites
Although `multipatt()` is a user-friendly function for indicator species analysis, other functions are also useful to study the association between species and site groups.
### The function `strassoc()`
Function `strassoc()` allows calculating a broad hand of association indices, described in @DeCaceres2009. For example, we can focus on the 'A' component of IndVal:
```{r}
prefstat <- strassoc(wetland, cluster=groups, func="A.g")
round(head(prefstat),3)
```
A feature of `strassoc()` that is lacking in `multipatt()` is the possibility to obtain confidence interval limits by bootstrapping. In this case, the function returns a list with three elements: `stat`, `lowerCI` and `upperCI`:
```{r}
prefstat <- strassoc(wetland, cluster=groups, func="A.g",
nboot.ci = 199)
round(head(prefstat$lowerCI),3)
round(head(prefstat$upperCI),3)
```
For example, the 95\% confidence interval for the 'A' component of the association between 'Pancam' and Group 3 is [`r round(prefstat$lowerCI[4,3],3)`, `r round(prefstat$upperCI[4,3],3)`].
### The function `signassoc()`
As we explained before, `multipatt()` statistically tests the association between the species and its more strongly associated site group (or site group combination). By contrast, `signassoc()` allows one to test the association between the species and each group of sites, regardless of whether the association value was the highest or not. Moreover, the function allows one to test both one-sided and two-sided hypotheses. For example, the following line tests whether the frequency of the species in each site group is higher or lower than random:
```{r}
prefsign <- signassoc(wetland, cluster=groups, alternative = "two.sided",
control = how(nperm=199))
head(prefsign)
```
The last columns of the results indicate the group for which the p-value was the lowest, and the p-value corrected for multiple testing using the Sidak method.
## Determining how well target site groups are covered by indicators
Besides knowing what species can be useful indicators of site groups (or site group combinations), it is sometimes useful to know *the proportion of sites of a given site group where one or another indicator is found*. We call this quantity *coverage* of the site group. Determining the coverage of site groups can be useful for habitat or vegetation types encompassing a broad geographic area [@DeCaceres2012], because there may exist some areas where none of the valid indicators can be found.
### The function `coverage()`
The coverage can be calculated for all the site groups of a `multipatt()` object using the function `coverage()`:
```{r}
coverage(wetland, indvalori)
```
Note that to obtain the coverage we need to input both the community data set and the object of class `multipatt()`. In this case the coverage was complete (i.e. 100\%) for site groups '1' and '3'. In contrast, group '2' has a lower coverage because only one species, 'Phynod', can be considered indicator of the site group, and this species does not always occur in sites of the group.
The coverage of site groups depends on how many and which indicators are considered as valid. Statistical significance (i.e., `alpha`) determined in `multipatt()` can be used to determine what indicators are valid, but we can add more requirements to the validity of indicator species by specifying additional parameters to the function `coverage()`. For example, if we want to know the coverage of our site groups with indicators that are significant and whose 'A' value is equal or higher than 0.8, we can use:
```{r}
coverage(wetland, indvalori, At = 0.8, alpha = 0.05)
```
Note that, after adding this extra requirement, group '2' has 0\% coverage and the coverage of group '1' has also decreased.
### The function `plotcoverage()`
It is possible to know how the coverage changes with 'A' threshold used to select good indicators. This is obtained by drawing the coverage values corresponding to different threshold values. This is what the `plotcoverage()` function does for us:
```{r, fig = TRUE, fig.width = 5, fig.height = 5}
plotcoverage(x=wetland, y=indvalori, group="1", lty=1)
plotcoverage(x=wetland, y=indvalori, group="2", lty=2, col="blue", add=TRUE)
plotcoverage(x=wetland, y=indvalori, group="3", lty=3, col="red", add=TRUE)
legend(x = 0.1, y=30,
legend=c("group 1","group 2", "group 3"),
lty=c(1,2,3), col=c("black","blue","red"), bty="n")
```
As you can see in the example, function `plotcoverage()` has to be called for one group at a time. However, several plots can be drawn one onto the other using the option `add=TRUE`.
## Species combinations as indicators of site groups
Ecological indicators can be of many kinds. @DeCaceres2012 recently explored the indicator value of *combinations of species* instead of just considering individual species. The rationale behind this approach is that two or three species, when found together, bear more ecological information than a single one.
### Generating species combinations
The association between species combinations and groups of sites is studied in the same way as for individual species. However, instead of analyzing a site-by-species matrix, we need a matrix with as many rows as there are sites and as many columns as there are species combinations. We can obtain that matrix using the function `combinespecies()`:
```{r}
wetcomb <- combinespecies(wetland, max.order = 2)$XC
dim(wetcomb)
```
The resulting data frame has the same number of sites (i.e. `r nrow(wetcomb)`) but as many columns as species combinations (in this case `r nrow(wetcomb)` columns). Each element of the data frame contains an abundance value, which is the minimum abundance value among all the species forming the combination, for the corresponding site. In our example, we used `max.order = 2` to limit the order of combinations. Therefore, only pairs of species were considered. Once we have this new data set, we can use it in `multipatt()`:
```{r}
indvalspcomb <- multipatt(wetcomb, groups, duleg = TRUE,
control = how(nperm=999))
summary(indvalspcomb, indvalcomp = TRUE)
```
The best indicators for both Group 1 and Group 3 are individual species ('Ludads' and 'Pancam'). However, Group 2 is best indicated if we find, in the same community, 'Phynod' and 'Elesp'. Note that the species forming the indicator combination do not need to be good single-species indicators themselves. In our example 'Phynod' is a good indicator of Group 2 but 'Elesp.' is not. As an aside, note that we used the option `duleg=TRUE` in this last example, hence excluding site group combinations, for simplicity.
### The function `indicators()`
In the previous example, there were many combinations of species that were significantly associated any of the site groups. There is another, more efficient, way of exploring the potential indicators for a given target site group. Say, for example, that we want to determine indicators for our Group 2, and we want to consider not only species pairs but also species trios. We can run the indicator analysis using:
```{r}
sc <- indicators(X=wetland, cluster=groups, group=2,
max.order = 3, verbose=TRUE,
At=0.5, Bt=0.2)
```
We can discard species combinations with low indicator values by setting thresholds for components A and B (in our example using `At=0.5` and `Bt=0.2`). The parameter `verbose = TRUE` allowed us to obtain information about the analysis process. Note that, by default, the `indicators()` function will consider species combinations up to an order of 5 (i.e. `max.order = TRUE`). This can result in long computation times if the set of candidate species is not small. Similarly to `multipatt()`, we can print the results of `indicators()` for the most useful indicators, using:
```{r}
print(sc, sqrtIVt = 0.6)
```
The species combinations are listed in decreasing indicator value order. Function `indicators()` also calculates the statistical significance of indicator combinations, by making an internal call to `signassoc()`. In this case, we obtain that a combination of 'Phynod', 'Helind' and 'Elesp.' is even a better indicator than 'Phynod' and 'Elesp.'. Note that the 'A' and IndVal values for the pair 'Phynod' and 'Elesp.' do not exactly match those obtained before. This is because `indicators()` uses 'IndVal' and not 'IndVal.g' as default association statistic. In contrast to `multipatt()`, in `indicators()` association statistics are restricted to 'IndVal' and 'IndVal.g'.
Although we did not show it here, `indicators()` can be run with the option requesting for bootstrap confidence intervals (using option `nboot.ci`). This allow us to know the reliability of the indicator value estimates, which is specially important for site groups of small size [@DeCaceres2012].
### Determining the coverage for objects of class `indicators()`
We can determine the proportion of sites of the target site group where one or another indicator is found, using, as shown before for objects of class `multipatt()`, the function `coverage()`:
```{r}
coverage(sc)
```
In this case the coverage was complete (i.e. 100\%). Like we did for objects of class `multipatt()`, we can add requirements to the validity of indicators. For example:
```{r}
coverage(sc, At=0.8, alpha =0.05)
```
While we do not show here, in the case of objects of class `indicators()` we recommend to explore the coverage using the lower boundary of confidence intervals, using option `type` of function `coverage()`.
Finally, we can also plot coverage values corresponding to different thresholds:
```{r, fig = TRUE, fig.width = 5, fig.height = 5}
plotcoverage(sc)
plotcoverage(sc, max.order=1, add=TRUE, lty=2, col="red")
legend(x=0.1, y=20, legend=c("Species combinations","Species singletons"),
lty=c(1,2), col=c("black","red"), bty="n")
```
The coverage plot tells us that if we want to use a very large `A' threshold (i.e., if we want to be very strict to select valid indicators), we won't have enough valid indicators to cover all the area of our target site group. This limitation is more severe if only single species are considered.
### The function `pruneindicators()`
As there may be many species combinations that could be used as indicators, the function `pruneindicators()` helps us to reduce the possibilities. First, the function selects those indicators (species or species combinations) that are valid according to the input thresholds `At`, `Bt` and `alpha`. Second, the function discards those valid indicators whose occurrence patterns are nested within other valid indicators. Third, the function evaluates the coverage of the remaining set of indicators. Finally, it explores subsets of increasing number of indicators, until the same coverage as the coverage of the complete set is attained and the subset of indicators is returned.
```{r}
sc2 <- pruneindicators(sc, At=0.8, Bt=0.2, verbose=TRUE)
print(sc2)
```
In our example, and using these thresholds, the best indicators for group '2' would be: (a) the combination of 'Phynod' and 'Elesp.'; (b) 'Cyprot' and (c) 'Echpas' and 'Phynod'. The three indicators together cover 93\% of the sites belonging to the target site group.
### The function `predict.indicators()`
The function `indicators()` provides a model that can be used to predict a target site group. Once a given combination of species has been found, the corresponding 'A' value is an estimate of the probability of being in the target site group given the combination of species has been found. Therefore, the set of indicators could be used to predict the likelihood of the target site group in a new data set. The function `predict.indicators()` has exactly this role. It takes the results of `indicators()` and a new community data matrix as input. Given this input, the function calculates the probability of the target site group for each site. The following code exemplifies the use of `predict.indicators()` with the same data that was used for the calibration of the indicator model (a new community data matrix should be used in normal applications):
```{r}
p <- predict(sc2, wetland)
```
Of course, this will return biased estimates, as the data set used to build the model is the same as the target data set for which probabilities are desired. The same would be obtained using:
```{r}
p <- predict(sc2)
```
which uses the original data set storied in the `indicators()` object. If we want to evaluate these probability estimates in a cross-validated fashion (i.e., excluding each target site for the calculation of positive predictive values and the probability of the target site group), we can use:
```{r}
pcv <- predict(sc2, cv=TRUE)
```
We can compare the probabilities (evaluated by resubstitution and cross-validation) with the original group memberships:
```{r}
data.frame(Group2 = as.numeric(wetkm$cluster==2), Prob = p, Prob_CV = pcv)
```
Cross-validated probabilities will tend to be lower than probabilities evaluated by resubstitution for sites originally belonging to the target site group and higher for other sites.
## Bibliography