---
title: "Expectation Maximization"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Expectation Maximization}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
message = FALSE,
warning = FALSE,
comment = "#>",
eval = FALSE,
include = TRUE
)
```
The Expectation-Maximization (EM) algorithm is available through the function `survival_ln_mixture_em` and it's a frequentist method in alternative to the Bayesian approach. It handles better big data situations when the Bayesian approach will run out of memory or take a lot of time to finish.
Using it is similar to using the Bayesian method, `survival_ln_mixture`, sharing a lot of similar parameters and specifications. Here follows a really basic code to fit the model using the EM algorithm.
```{r eval=TRUE, include=TRUE}
library(lnmixsurv)
library(tidyr)
library(dplyr)
library(ggplot2)
library(readr)
set.seed(8)
data <- simulate_data(6000,
mixture_components = 3, k = 2,
percentage_censored = 0.3
)$data |>
rename(x = cat, y = t)
model_em <- survival_ln_mixture_em(Surv(y, delta) ~ x,
data = data,
iter = 200,
starting_seed = 20,
number_em_search = 0
)
gg <- plot_fit_on_data(model_em, data)$ggplot
```
The parameters `number_em_search` is used to find initial values closer to the maximum likelihood, avoid local maximas. Here, we are just disabling it to show it's impact.
Unlike the Bayesian approach, which samples from the posteriori via Gibbs sampler, the EM algorithm is a maximum likelihood method, moving, in each iteration, closer the parameters values closer to the model's maximum likelihood. The function `plot` can be used to visualize the iterations of the algorithm, assessing for convergence.
```{r fig.width=7, eval = TRUE}
plot(model_em)
```
When using the Expectation-Maximization algorithm to fit the model, you can use the function `plot_fit_on_data()` to quickly visualize the model's estimated survival (or hazard) on the data used to fit the model.
```{r}
plot_fit_on_data(model_em, data = data, type = "survival")$ggplot
```
```{r fig.width = 7, echo=FALSE, eval = TRUE}
gg
```
As expected, the fitted model isn't that great. Increasing the parameter `number_em_search` helps to find better initial values, and thus, better fits, in exchange of computational time. We can see how the initials values likelihood change setting the parameter `show_progress = TRUE`.
```{r eval = TRUE}
model_em <- survival_ln_mixture_em(Surv(y, delta) ~ x,
data = data,
iter = 200,
starting_seed = 20,
number_em_search = 200,
show_progress = TRUE
)
```
```{r include=FALSE, eval = TRUE}
gg <- plot_fit_on_data(model_em, data)$ggplot
```
Now, we have a maximum likelihood estimator that tries to avoid local maximas. As before, we can use the function `plot_fit_on_data()` to visualize the model's estimated survival on the data used to fit the model.
```{r}
plot_fit_on_data(model_em, data = data, type = "survival")$ggplot
```
```{r fig.width = 7, echo = FALSE, eval = TRUE}
gg
```
In fact, the model is now much better. The EM algorithm is a good alternative to the Bayesian approach when dealing with big data, but it's important to note that it's a maximum likelihood method, and thus, it doesn't provide credible nor confidence intervals.