Introduction

The skim() function summarizes data types contained within data frames. It comes with a set of default summary functions for a wide variety of data types, but that set is not comprehensive. Package authors can add support for skimming their own data types in their packages, and they can provide different defaults in their own summary functions.

This vignette illustrates the process by adding support for the sf object produced by the “sf: Simple Features for R” package. For any object this involves two required steps and one optional step:

• experiment with interactive changes
• create methods to get_skimmers for different objects within this package
• if needed, define any custom statistics
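
As a rough sketch of the first step, candidate statistics can be tried interactively with skim_with() before any method is written. The example assumes only that skimr is installed. It uses the built-in numeric type and the mtcars data so that it runs without sf, and the statistic names iqr and spread are illustrative. For an sfc column, the argument would be named sfc rather than numeric.

```r
library(skimr)

# Build a customized skim function. The new statistics are added to the
# defaults for numeric columns.
my_skim <- skim_with(
  numeric = sfl(
    iqr = IQR,                   # a plain function
    spread = ~ max(.) - min(.)   # a formula, with `.` standing for the column
  )
)

# Skim two numeric columns of a built-in data set.
result <- my_skim(mtcars[, c("mpg", "wt")])

# In the underlying data frame the columns carry the type as a prefix.
result$numeric.iqr
```

Once the statistics look right, the same sfl can be moved into a get_skimmers method.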

If you are adding skim support to a package, you will also need to add skimr to the list of imports. Note that running the code in this vignette requires the sf package. Rather than installing sf just for this purpose, we suggest substituting whatever package you are working with.

library(skimr)
library(sf)
## Warning: package 'sf' was built under R version 4.0.2
## Linking to GEOS 3.8.1, GDAL 3.1.4, PROJ 6.3.1
nc <- st_read(system.file("shape/nc.shp", package = "sf"))
## Reading layer `nc' from data source `/Library/Frameworks/R.framework/Versions/4.0/Resources/library/sf/shape/nc.shp' using driver `ESRI Shapefile'
## Simple feature collection with 100 features and 14 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## geographic CRS: NAD27
class(nc)
## [1] "sf"         "data.frame"
class(nc$geometry)
## [1] "sfc_MULTIPOLYGON" "sfc"

Unlike the example of having a new type of data in a column of a simple data frame in the “Using skimr” vignette, this is a different type of object with special attributes. In this object there is also a column of a class that does not have default skimmers. By default, skimr falls back to use the sfl for character variables.

skim(nc$geometry)
## Warning: Couldn't find skimmers for class: sfc_MULTIPOLYGON, sfc; No user-
## defined sfl provided. Falling back to character.
Name nc$geometry
Number of rows 100
Number of columns 1
_______________________
Column type frequency:
sfc 1
________________________
Group variables None

Variable type: sfc

skim_variable n_missing complete_rate n_unique valid crs
geometry 0 1 100 100 epsg: proj4string: ’’

While this works for any data type and you can also include it within any package (assuming your users load skimr), there is an even better approach in this case. To take full advantage of skimr, we’ll dig a bit into its API.

Adding new methods

skimr has a lookup mechanism, based on the function get_skimmers(), to find default summary functions for each class. This is based on the S3 class system. You can learn more about it in Advanced R. This requires that you add skimr to your list of dependencies.

To export a new set of defaults for a data type, create a method for the generic function get_skimmers. Each of those methods returns an sfl, a skimr function list. This is the same list-like data structure that skim_with() takes. But note! There is one key difference. When adding a method we also want to identify the skim_type in the sfl. You will probably want to use skimr::get_skimmers.sfc() but that will not work in a vignette.

#' @importFrom skimr get_skimmers
#' @export
get_skimmers.sfc <- function(column) {
  sfl(
    skim_type = "sfc",
    n_unique = n_unique,
    valid = ~ sum(sf::st_is_valid(.)),
    # get_crs is a custom statistic that must be defined separately
    crs = get_crs
  )
}

The same strategy follows for other data types.

• create a method
• return an sfl
• make sure that the skim_type is there

Users of your package should load skimr to get the skim() function (although you could import and reexport it). Once loaded, a call to get_default_skimmer_names() will return defaults for your data types as well!

get_default_skimmer_names()
## $AsIs
## [1] "n_unique"   "min_length" "max_length"
##
## $Date
## [1] "min"      "max"      "median"   "n_unique"
##
## $POSIXct
## [1] "min"      "max"      "median"   "n_unique"
##
## $Timespan
## [1] "min"      "max"      "median"   "n_unique"
##
## $character
## [1] "min"        "max"        "empty"      "n_unique"   "whitespace"
##
## $complex
## [1] "mean"
##
## $difftime
## [1] "min"      "max"      "median"   "n_unique"
##
## $factor
## [1] "ordered"    "n_unique"   "top_counts"
##
## $list
## [1] "n_unique"   "min_length" "max_length"
##
## $logical
## [1] "mean"  "count"
##
## $numeric
## [1] "mean" "sd"   "p0"   "p25"  "p50"  "p75"  "p100" "hist"
##
## $sfc
## [1] "n_unique" "valid"    "crs"
##
## $ts
##  [1] "start"      "end"        "frequency"  "deltat"     "mean"
##  [6] "sd"         "min"        "max"        "median"     "line_graph"
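
When only one type is of interest, the lookup can be narrowed. The following is a small sketch, assuming the skim_type argument of get_default_skimmer_names() and the helper get_one_default_skimmer_names() behave as described in the skimr reference. The numeric type is used because it ships with skimr, so the example runs without sf.

```r
library(skimr)

# Names of the default summary functions for a single type.
numeric_defaults <- get_default_skimmer_names("numeric")
numeric_defaults

# A companion helper that looks up exactly one type.
numeric_names <- get_one_default_skimmer_names("numeric")
numeric_names
```

Once a get_skimmers method for a new class is available, passing that class name in place of "numeric" is a quick way to confirm the new defaults are picked up.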

Users of your package will then be able to call skim() directly.

skim(nc)
## although coordinates are longitude/latitude, st_union assumes that they are planar
## although coordinates are longitude/latitude, st_union assumes that they are planar
## although coordinates are longitude/latitude, st_union assumes that they are planar
Name nc
Number of rows 100
Number of columns 15
_______________________
Column type frequency:
character 2
numeric 12
sfc 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
NAME 0 1 3 12 0 100 0
FIPS 0 1 5 5 0 100 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
AREA 0 1 0.13 0.05 0.04 0.09 0.12 0.15 0.24 ▆▇▆▃▂
PERIMETER 0 1 1.67 0.48 1.00 1.32 1.61 1.86 3.64 ▇▇▂▁▁
CNTY_ 0 1 1985.96 106.52 1825.00 1902.25 1982.00 2067.25 2241.00 ▇▆▆▅▁
CNTY_ID 0 1 1985.96 106.52 1825.00 1902.25 1982.00 2067.25 2241.00 ▇▆▆▅▁
FIPSNO 0 1 37100.00 58.02 37001.00 37050.50 37100.00 37149.50 37199.00 ▇▇▇▇▇
CRESS_ID 0 1 50.50 29.01 1.00 25.75 50.50 75.25 100.00 ▇▇▇▇▇
BIR74 0 1 3299.62 3848.17 248.00 1077.00 2180.50 3936.00 21588.00 ▇▁▁▁▁
SID74 0 1 6.67 7.78 0.00 2.00 4.00 8.25 44.00 ▇▂▁▁▁
NWBIR74 0 1 1050.81 1432.91 1.00 190.00 697.50 1168.50 8027.00 ▇▁▁▁▁
BIR79 0 1 4223.92 5179.46 319.00 1336.25 2636.00 4889.00 30757.00 ▇▁▁▁▁
SID79 0 1 8.36 9.43 0.00 2.00 5.00 10.25 57.00 ▇▂▁▁▁
NWBIR79 0 1 1352.81 1976.00 3.00 250.50 874.50 1406.75 11631.00 ▇▁▁▁▁

Variable type: sfc

skim_variable n_missing complete_rate n_unique valid crs
geometry 0 1 100 100 epsg: proj4string: ’’
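
The same pattern can be sketched for a class that has nothing to do with sf. In the example below, the temp_c class, the n_freezing() statistic, and the warmest entry are hypothetical names made up for illustration. Only sfl() and get_skimmers() come from skimr. The method is called directly so that the returned sfl can be inspected.

```r
library(skimr)

# Hypothetical custom statistic: how many values are at or below zero.
n_freezing <- function(x) sum(unclass(x) <= 0, na.rm = TRUE)

# Method for the hypothetical class `temp_c`. The name follows the S3
# convention get_skimmers.<class>, and skim_type is set explicitly.
get_skimmers.temp_c <- function(column) {
  sfl(
    skim_type = "temp_c",
    n_freezing = n_freezing,
    warmest = ~ max(unclass(.), na.rm = TRUE)
  )
}

# A small vector carrying the hypothetical class.
temps <- structure(c(-3.5, 0, 12.1, 25.4), class = "temp_c")

# S3 dispatch selects the method above and returns the sfl.
skimmers <- get_skimmers(temps)
skimmers$skim_type
names(skimmers$funs)
```

In a package, the method would also carry the roxygen tags used for get_skimmers.sfc so that it is registered and exported.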

Conclusion

This is a very simple example. For a package such as sf the custom statistics will likely be much more complex. The flexibility of skimr allows you to manage that.

Thanks to Jakub Nowosad, Tiernan Martin, Edzer Pebesma, Michael Sumner, and Kyle Butts for inspiring and helping with the development of this code.