Classification metrics in `yardstick` where both the `truth` and
`estimate` columns are factors are implemented for the binary and the
multiclass case. The multiclass implementations use `micro`, `macro`,
and `macro_weighted` averaging where applicable, and some metrics
have their own specialized multiclass implementations.

Macro averaging reduces your multiclass predictions down to multiple
sets of binary predictions, calculates the corresponding metric for each
of the binary cases, and then averages the results together. As an
example, consider `precision` for the binary case.

\[ Pr = \frac{TP}{TP + FP} \]

In the multiclass case, if there were levels `A`, `B`, `C` and `D`,
macro averaging reduces the problem to multiple one-vs-all comparisons.
The `truth` and `estimate` columns are recoded such that the only two
levels are `A` and `other`, and then precision is calculated based on
those recoded columns, with `A` being the “relevant” level. This process
is repeated for the other 3 levels to get a total of 4 precision values.
The results are then averaged together.

The formula representation looks like this. For `k` classes:

\[ Pr_{macro} = \frac{Pr_1 + Pr_2 + \ldots + Pr_k}{k} = Pr_1 \frac{1}{k} + Pr_2 \frac{1}{k} + \ldots + Pr_k \frac{1}{k} \]

where \(Pr_1\) is the precision calculated from recoding the multiclass
predictions down to just `class 1` and `other`.
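The one-vs-all recoding and averaging can be sketched in a few lines. This is a minimal illustration in plain Python, not `yardstick` code: the toy `truth` and `estimate` vectors and the helper `one_vs_all_precision()` are made up for the example, and the helper assumes every level is predicted at least once so that no denominator is zero.

```python
# Toy data (hypothetical): 10 observations over the levels A, B, C, D.
truth    = ["A", "A", "A", "A", "A", "B", "B", "C", "C", "D"]
estimate = ["A", "A", "A", "A", "B", "B", "C", "C", "D", "D"]


def one_vs_all_precision(truth, estimate, level):
    """Recode to `level` vs. other, then compute binary precision."""
    pairs = list(zip(truth, estimate))
    tp = sum(t == level and e == level for t, e in pairs)
    fp = sum(t != level and e == level for t, e in pairs)
    # Assumes `level` is predicted at least once (tp + fp > 0).
    return tp / (tp + fp)


levels = sorted(set(truth))
per_class = {lvl: one_vs_all_precision(truth, estimate, lvl) for lvl in levels}

# Every class contributes with the same weight, 1 / k.
macro_precision = sum(per_class.values()) / len(levels)

print(per_class)        # {'A': 1.0, 'B': 0.5, 'C': 0.5, 'D': 0.5}
print(macro_precision)  # 0.625
```

Here the well-predicted majority class `A` and the single-observation class `D` count equally toward the result.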

Note that in macro averaging, all classes get equal weight when
contributing their portion of the precision value to the total (here
`1/4`, since the example has 4 levels). This might not be a realistic
calculation when you have a large amount of class imbalance. In that
case, a *weighted macro average* might make more sense, where the
weights are given by the frequency of each class in the `truth` column.

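\[ Pr_{weighted-macro} = Pr_1 \frac{\#Obs_1}{N} + Pr_2 \frac{\#Obs_2}{N} + \ldots + Pr_k \frac{\#Obs_k}{N} \]

where \(\#Obs_i\) is the number of observations of class \(i\) in the
`truth` column and \(N\) is the total number of observations.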
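As a rough illustration of the weighting, the sketch below computes a weighted macro precision in plain Python. It is not `yardstick` code: the toy data and the helper `precision_for()` are hypothetical, and each level is assumed to be predicted at least once.

```python
from collections import Counter

# Toy data (hypothetical): class A makes up half of the observations.
truth    = ["A", "A", "A", "A", "A", "B", "B", "C", "C", "D"]
estimate = ["A", "A", "A", "A", "B", "B", "C", "C", "D", "D"]


def precision_for(level):
    """One-vs-all precision for a single level."""
    pairs = list(zip(truth, estimate))
    tp = sum(t == level and e == level for t, e in pairs)
    fp = sum(t != level and e == level for t, e in pairs)
    return tp / (tp + fp)  # assumes tp + fp > 0


class_counts = Counter(truth)  # number of observations per class in truth
n = len(truth)

# Each class is weighted by its share of the truth column.
weighted_macro_precision = sum(
    precision_for(lvl) * class_counts[lvl] / n for lvl in class_counts
)

print(weighted_macro_precision)
```

On this data the per-class precisions are 1.0, 0.5, 0.5 and 0.5, so the unweighted macro average is 0.625, while the weighted version comes out at about 0.75 because class `A` carries a weight of 5/10.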

Micro averaging treats the entire set of data as an aggregate result,
and calculates 1 metric rather than `k` metrics that get averaged
together.

For precision, this works by summing the true positives across all classes to form the numerator, and summing the true positives and false positives across all classes to form the denominator.

\[ Pr_{micro} = \frac{TP_1 + TP_2 + \ldots + TP_k}{(TP_1 + TP_2 + \ldots + TP_k) + (FP_1 + FP_2 + \ldots + FP_k)} \]

In this case, rather than each *class* having equal
weight, each *observation* gets equal weight. This gives the
classes with the most observations more power.
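A minimal sketch of the pooled calculation, again in plain Python rather than `yardstick` code, with hypothetical toy data:

```python
# Toy data (hypothetical): 10 observations over the levels A, B, C, D.
truth    = ["A", "A", "A", "A", "A", "B", "B", "C", "C", "D"]
estimate = ["A", "A", "A", "A", "B", "B", "C", "C", "D", "D"]

levels = sorted(set(truth) | set(estimate))
pairs = list(zip(truth, estimate))

# Pool the one-vs-all counts over every class before dividing.
total_tp = sum(sum(t == lvl and e == lvl for t, e in pairs) for lvl in levels)
total_fp = sum(sum(t != lvl and e == lvl for t, e in pairs) for lvl in levels)

micro_precision = total_tp / (total_tp + total_fp)

print(total_tp, total_fp)  # 7 3
print(micro_precision)     # 0.7
```

Every observation lands in exactly one of the pooled counts, as a true positive when it is predicted correctly and as a false positive for the predicted class otherwise. When each observation has a single predicted class, the denominator is therefore the number of observations.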

Some metrics have known analytical multiclass extensions, and do not need to use averaging to get an estimate of multiclass performance.

Accuracy and Kappa use the same definitions as their binary counterparts. Accuracy is the proportion of observations whose predicted class matches the true class, and kappa is computed from two accuracy values: the observed accuracy and the accuracy expected by chance.
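To make the two definitions concrete, here is a small sketch using the standard unweighted Cohen's kappa formula, \((p_o - p_e) / (1 - p_e)\). It is plain Python on hypothetical toy data, not `yardstick` code, and the variable names are made up for the example.

```python
from collections import Counter

# Toy data (hypothetical): 10 observations over the levels A, B, C, D.
truth    = ["A", "A", "A", "A", "A", "B", "B", "C", "C", "D"]
estimate = ["A", "A", "A", "A", "B", "B", "C", "C", "D", "D"]

n = len(truth)
levels = sorted(set(truth) | set(estimate))

# Observed accuracy: share of observations predicted correctly.
accuracy = sum(t == e for t, e in zip(truth, estimate)) / n

# Chance accuracy: agreement expected if truth and estimate were
# independent, given their marginal class frequencies.
truth_counts = Counter(truth)
estimate_counts = Counter(estimate)
expected_accuracy = sum(
    (truth_counts[lvl] / n) * (estimate_counts[lvl] / n) for lvl in levels
)

kappa = (accuracy - expected_accuracy) / (1 - expected_accuracy)

print(accuracy)  # 0.7
print(kappa)
```

On this data the chance accuracy is about 0.3, so kappa is roughly 0.57. Neither quantity requires recoding the classes into binary comparisons.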

Matthews correlation coefficient (MCC) has a known multiclass generalization as well, sometimes called the \(R_K\) statistic. Refer to this page for more details.

ROC AUC is an interesting metric in that it intuitively makes sense
to perform macro averaging, which computes a multiclass AUC as the
average of the area under multiple binary ROC curves. However, this
loses an important property of the ROC AUC statistic in that its binary
case is insensitive to class distribution. To combat this, a multiclass
metric was created that retains insensitivity to class distribution, but
does not have an easy visual interpretation like macro averaging. This
is implemented as the `"hand_till"` method, and is the default for this
metric.
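The sketch below shows the pairwise idea behind the measure described by Hand and Till (2001): for every pair of classes, compute an AUC in each direction using only the observations from those two classes, average the two, then average over all pairs. It is a plain Python illustration under stated assumptions, not `yardstick`'s implementation. The toy probabilities and the helpers `binary_auc()` and `hand_till_auc()` are hypothetical names chosen for this example.

```python
from itertools import combinations

# Toy data (hypothetical): true classes and predicted class probabilities.
truth = ["A", "A", "B", "B", "C", "C"]
probs = {
    "A": [0.7, 0.4, 0.3, 0.2, 0.1, 0.5],
    "B": [0.2, 0.4, 0.6, 0.3, 0.2, 0.1],
    "C": [0.1, 0.2, 0.1, 0.5, 0.7, 0.4],
}


def binary_auc(pos_scores, neg_scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 1/2."""
    wins = sum((p > q) + 0.5 * (p == q) for p in pos_scores for q in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def hand_till_auc(truth, probs):
    """Average the pairwise, direction-averaged AUCs over all class pairs."""
    pair_aucs = []
    for i, j in combinations(sorted(probs), 2):
        rows_i = [r for r, t in enumerate(truth) if t == i]
        rows_j = [r for r, t in enumerate(truth) if t == j]
        # A(i | j): how well the probability of class i separates i from j.
        a_ij = binary_auc([probs[i][r] for r in rows_i],
                          [probs[i][r] for r in rows_j])
        # A(j | i): how well the probability of class j separates j from i.
        a_ji = binary_auc([probs[j][r] for r in rows_j],
                          [probs[j][r] for r in rows_i])
        pair_aucs.append((a_ij + a_ji) / 2)
    return sum(pair_aucs) / len(pair_aucs)


auc = hand_till_auc(truth, probs)
print(auc)  # 0.875
```

Because each pairwise AUC only looks at observations from the two classes involved, changing how common a third class is does not change that pair's contribution.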