Introduction to openEBGM

Background

William DuMouchel (1, 2) created an empirical Bayes (EB) data mining approach for finding “interestingly large” counts in contingency tables. DuMouchel’s approach works well even when most of the counts are zero or one (i.e., a sparse table). The benefit of DuMouchel’s model over simpler approaches such as the relative reporting ratio, $$RR$$, is that Bayesian shrinkage corrects for the high variability in $$RR$$ that results from small counts.

The rows and columns of the table represent levels of two different variables, such as food or drug products and adverse events:

            Event 1   Event 2   Event 3
Product A      0         1         0
Product B      4         0         0
Product C      1         0         9

The relative reporting ratio is calculated as $$RR=\frac{N}{E}$$, where $$N$$ is the actual count for a cell in the table and $$E$$ is the expected count under the assumption of independence between rows and columns. When $$RR = 1$$, you observe the exact count you would expect to observe if no association exists between the two variables. When $$RR > 1$$, you observe a larger count than expected. This approach works well for large counts; however, small counts cause $$RR$$ to become quite unstable. For instance, an actual count of $$N = 1$$ with an expected count of $$E = 0.01$$ gives us $$RR = 100$$ – which seems large – but a single event could easily occur simply by chance. The EB approach shrinks large $$RR$$s that result from small counts to a value much closer to the “null hypothesis” value of 1. The shrinkage is smaller for larger counts and negligible for very large counts. Shrinkage gives results that are more stable than the simple $$RR$$ measurement.
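To make the instability concrete, here is a minimal sketch that computes $$E$$ and $$RR$$ for the small example table above. It assumes the simplest case, in which every report falls in exactly one cell, so that $$E$$ for a cell is (row total × column total) / grand total. The function name is illustrative and is not part of openEBGM.

```python
# Counts from the example table (rows: Product A, B, C; columns: three events).
counts = [
    [0, 1, 0],  # Product A
    [4, 0, 0],  # Product B
    [1, 0, 9],  # Product C
]


def relative_reporting_ratios(counts):
    """Return (expected, rr) tables under row/column independence.

    Assumes every row total and column total is positive, so no
    expected count is zero.
    """
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    grand_total = sum(row_totals)

    # E_ij = (row total i) * (column total j) / (grand total)
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    # RR_ij = N_ij / E_ij
    rr = [
        [n / e for n, e in zip(n_row, e_row)]
        for n_row, e_row in zip(counts, expected)
    ]
    return expected, rr


expected, rr = relative_reporting_ratios(counts)
for label, rr_row in zip(["Product A", "Product B", "Product C"], rr):
    print(label, [round(value, 2) for value in rr_row])
```

In this table, the single report for Product A yields $$RR = 15$$, while the count of 9 for Product C yields only $$RR = 1.5$$. The smallest count produces the largest ratio, which is the behavior that shrinkage is meant to temper.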

DuMouchel’s model uses a Poisson($$\mu_{ij}$$) likelihood (i.e., data distribution) for the actual cell counts, $$N_{ij}$$, in row $$i$$ and column $$j$$. The expected cell counts, $$E_{ij}$$, are treated as constants. We are interested in the ratio $$\lambda_{ij}=\frac{\mu_{ij}}{E_{ij}}$$, which is analogous to $$RR=\frac{N}{E}$$. The prior on $$\lambda$$ is a mixture of two gamma distributions, resulting in a posterior distribution for $$\lambda$$ that is also a mixture of two gamma distributions. Hence, the model is sometimes referred to as the Gamma-Poisson Shrinker (GPS) model. The posterior distribution of $$\lambda$$ can be thought of as a Bayesian representation of $$RR$$. Summary statistics from the posterior distribution are used as attenuated versions of $$RR$$.
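The prior-to-posterior step can be written out explicitly. The symbols $$\alpha_1, \beta_1, \alpha_2, \beta_2, P$$ for the hyperparameters, and $$Q_n$$ for the posterior mixing weight, are notation assumed here for illustration; Gamma($$\alpha$$, $$\beta$$) denotes a gamma distribution with shape $$\alpha$$ and rate $$\beta$$, and cell subscripts are dropped for readability. The prior is

$$\lambda \sim P \cdot \text{Gamma}(\alpha_1, \beta_1) + (1 - P) \cdot \text{Gamma}(\alpha_2, \beta_2)$$

After observing $$N = n$$ with expected count $$E$$, gamma-Poisson conjugacy updates each component separately:

$$\lambda \mid N = n \sim Q_n \cdot \text{Gamma}(\alpha_1 + n, \beta_1 + E) + (1 - Q_n) \cdot \text{Gamma}(\alpha_2 + n, \beta_2 + E)$$

The posterior weight $$Q_n$$ compares how well each component predicts the observed count:

$$Q_n = \frac{P \, f(n; \alpha_1, \beta_1, E)}{P \, f(n; \alpha_1, \beta_1, E) + (1 - P) \, f(n; \alpha_2, \beta_2, E)}$$

where $$f$$ is the negative binomial marginal probability of $$n$$ under a single gamma component:

$$f(n; \alpha, \beta, E) = \left(1 + \frac{\beta}{E}\right)^{-n} \left(1 + \frac{E}{\beta}\right)^{-\alpha} \frac{\Gamma(\alpha + n)}{\Gamma(\alpha) \, n!}$$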

Each cell in the contingency table will have its own posterior distribution determined both by that cell’s actual and expected counts (the data) and by the distribution of actual and expected counts in the entire table (the prior). Often, the Empirical Bayes Geometric Mean $$(EBGM)$$ of the posterior distribution is used in place of $$RR$$. Alternatively, the more conservative percentiles (5th, 10th, etc.) can be used. The percentiles can also be used to construct Bayesian credible intervals. Similar to $$RR$$, an $$EBGM$$ (or lower percentile) much bigger than 1 represents an actual count much bigger than expected. Such cases might represent signals of interest, and the product/event pair can be further examined by subject matter experts to determine if the association might actually be causal in nature.
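The following sketch shows how $$EBGM$$ behaves for small and large counts. It is a standalone Python illustration, not openEBGM’s implementation. It assumes a two-component gamma mixture posterior with shape/rate parameters, takes $$EBGM$$ to be the exponential of the posterior mean of $$\log \lambda$$, and uses hyperparameter values that are purely hypothetical (in practice they are estimated from the whole table). The helper `digamma` is a simple approximation written for this example.

```python
import math


def digamma(x):
    """Approximate psi(x) for x > 0: recurrence up to x >= 6, then asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def log_marginal(n, e, alpha, beta):
    """Log negative binomial probability of count n under one gamma component."""
    return (
        -n * math.log1p(beta / e)
        - alpha * math.log1p(e / beta)
        + math.lgamma(alpha + n)
        - math.lgamma(alpha)
        - math.lgamma(n + 1)
    )


def ebgm(n, e, alpha1, beta1, alpha2, beta2, p):
    """Geometric mean of the two-component gamma mixture posterior for lambda."""
    log_w1 = math.log(p) + log_marginal(n, e, alpha1, beta1)
    log_w2 = math.log1p(-p) + log_marginal(n, e, alpha2, beta2)
    diff = log_w2 - log_w1
    # Posterior weight of the first component; guard against overflow in exp().
    q = 1.0 / (1.0 + math.exp(diff)) if diff < 700 else 0.0
    # E[log(lambda)] for Gamma(shape, rate) is digamma(shape) - log(rate).
    mean_log = q * (digamma(alpha1 + n) - math.log(beta1 + e)) + (1 - q) * (
        digamma(alpha2 + n) - math.log(beta2 + e)
    )
    return math.exp(mean_log)


# Hypothetical hyperparameters, chosen only for illustration.
HYPER = dict(alpha1=0.2, beta1=0.1, alpha2=2.0, beta2=4.0, p=1.0 / 3.0)

small = ebgm(n=1, e=0.01, **HYPER)     # RR = 100 from a single report
large = ebgm(n=1000, e=10.0, **HYPER)  # RR = 100 from a thousand reports
print(f"N=1,    E=0.01: RR=100, EBGM={small:.2f}")
print(f"N=1000, E=10:   RR=100, EBGM={large:.2f}")
```

Both cells have $$RR = 100$$, but under these assumed hyperparameters the single-report cell is shrunk to a value far below 100, while the cell with a thousand reports stays close to its raw ratio.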

An extension of the GPS model, the Multi-Item Gamma-Poisson Shrinker (MGPS) model (2001), is currently being used by the U.S. Food and Drug Administration (FDA) to find higher-than-expected reporting of adverse events associated with food, drugs, etc. For instance, FDA’s Center for Food Safety and Applied Nutrition (CFSAN) uses the MGPS model to mine data from the CFSAN Adverse Events Reporting System (CAERS): https://www.fda.gov/Food/ComplianceEnforcement. (The variables forming the rows and columns of the contingency table are product and adverse event.) MGPS allows for product interactions, unlike the GPS model implemented in openEBGM (3), which can only use individual product-event pairs.

Purpose

The openEBGM package implements DuMouchel’s approach with some small differences. For example, the expected counts are calculated by counting unique “transactions” (2) in each row and column, not actual marginal totals. In the CAERS data, a unique report is a transaction. In some applications, a single transaction could occur in several cells. For instance, a single CAERS report might mention multiple products and/or adverse events. Using simple marginal totals would then count a single report multiple times.
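The difference between counting unique transactions and using simple marginal totals can be sketched as follows. This is a Python illustration with made-up data, not openEBGM code. It assumes the expected count keeps the independence form (row count × column count / total), with each count taken over unique report IDs; the report IDs, event names, and function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical long-format data: one row per (report ID, product, event) mention.
# Report 1 mentions two events; report 2 mentions two products.
rows = [
    (1, "Product A", "Headache"),
    (1, "Product A", "Nausea"),
    (2, "Product A", "Nausea"),
    (2, "Product B", "Nausea"),
    (3, "Product B", "Headache"),
    (4, "Product A", "Nausea"),
]


def expected_by_unique_ids(rows):
    """Map (product, event) -> (N, E), counting each report ID once per margin."""
    ids_by_product = defaultdict(set)
    ids_by_event = defaultdict(set)
    ids_by_cell = defaultdict(set)
    all_ids = set()
    for report_id, product, event in rows:
        ids_by_product[product].add(report_id)
        ids_by_event[event].add(report_id)
        ids_by_cell[(product, event)].add(report_id)
        all_ids.add(report_id)
    total = len(all_ids)
    return {
        cell: (
            len(ids),
            len(ids_by_product[cell[0]]) * len(ids_by_event[cell[1]]) / total,
        )
        for cell, ids in ids_by_cell.items()
    }


def expected_by_row_totals(rows):
    """Same calculation, but treating every row as a separate observation."""
    n_product = Counter(product for _, product, _ in rows)
    n_event = Counter(event for _, _, event in rows)
    n_cell = Counter((product, event) for _, product, event in rows)
    total = len(rows)
    return {
        cell: (n, n_product[cell[0]] * n_event[cell[1]] / total)
        for cell, n in n_cell.items()
    }


unique = expected_by_unique_ids(rows)
naive = expected_by_row_totals(rows)
cell = ("Product A", "Nausea")
print("unique transactions:", unique[cell])
print("simple marginal totals:", naive[cell])
```

For the Product A / Nausea cell, both methods give $$N = 3$$, but the expected count is 2.25 when reports are counted once and about 2.67 when reports 1 and 2 are each counted twice in the margins.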

This document teaches you how to prepare your data for use by openEBGM’s functions. Other vignettes give explanations and examples of more advanced topics:

• Raw data processing: Process your data to find counts and simple disproportionality measures.

• Hyperparameter estimation: Estimate the hyperparameters of the prior distribution.

• Empirical Bayes metrics: This is the ultimate goal. Calculate Empirical Bayes metrics ($$EBGM$$ and quantile scores) based on the posterior distribution.

• Object-oriented features: Create objects of a special class (openEBGM) to use with generic functions such as plot().