# Create missingness indicator variables and non-informatively fill in missing values

Source:`R/fill.NAs.R`

`fill.NAs.Rd`

Given a `data.frame`

or `formula`

and data,
`fill.NAs()`

returns an expanded data frame, including a new
missingness flag for each variable with missing values and
replacing each missing entry with a value representing a reasonable
default for missing values in its column. Functions in the formula
are supported, with transformations happening before `NA`

replacement. The expanded data frame is useful for propensity
modeling and balance checking when there are covariates with
missing values.

## Arguments

- x
Can be either a data frame (in which case the data argument should be

`NULL`

) or a formula (in which case data must be a data.frame)- data
If x is a formula, this must be a data.frame. Otherwise it will be ignored.

- all.covs
Should the response variable be imputed? For formula

`x`

, this is the variable on the left hand side. For`data.frame`

`x`

, the response is considered the first column.- contrasts.arg
(from

`model.matrix`

) A list, whose entries are values (numeric matrices or character strings naming functions) to be used as replacement values for the`contrasts`

replacement function and whose names are the names of columns of`data`

containing`factor`

s.

## Value

A `data.frame`

with all `NA`

values replaced with
mean values and additional indicator columns for each column
including missing values. Suitable for directly passing to

`lm`

or other model building functions to build
propensity scores.

## Details

`fill.NAs`

prepares data for use in a model or matching
procedure by filling in missing values with minimally invasive
substitutes. Fill-in is performed column-wise, with each column
being treated individually. For each column that is missing, a new
column is created of the form “ColumnName.NA” with
indicators for each observation that is missing a value for
“ColumnName”. Rosenbaum and Rubin (1984, Sec. 2.4 and
Appendix B) discuss propensity score models using this data
structure.

The replacement value used to fill in a missing value is simple
mean replacement. For transformations of variables, e.g. ```
y ~
x1 * x2
```

, the transformation occurs first. The transformation
column will be `NA`

if any of the base columns are
`NA`

. Fill-in occurs next, replacing all missing values with
the observed column mean. This includes transformation columns.

Data can be passed to `fill.NAs`

in two ways. First, you can
simply pass a `data.frame`

object and `fill.NAs`

will
fill every column. Alternatively, you can pass a `formula`

and
a `data.frame`

. Fill-in will only be applied to columns
specifically used in the formula. Prior to fill-in, any functions
in the formula will be expanded. If any arguments to the functions
are `NA`

, the function value will also be `NA`

and
subject to fill-in.

By default, `fill.NAs`

does not impute the response
variable. This is to encourage more sophisticated imputation
schemes when the response is a treatment indicator in a matching
problem. This behavior can be overridden by setting ```
all.covs
= TRUE
```

.

## References

Rosenbaum, Paul R. and Rubin, Donald B. (1984) ‘Reducing
Bias in Observational Studies using Subclassification on the
Propensity Score,’ *Journal of the American Statistical
Association*, **79**, 516 -- 524.

Von Hipple, Paul T. (2009) ‘How to impute interactions,
squares, and other transformed variables,’ *Sociological
Methodology*, **39**(1), 265 -- 291.

## Examples

```
data(nuclearplants)
### Extract some representative covariates:
np.missing <- nuclearplants[c('t1', 't2', 'ne', 'ct', 'cum.n')]
### create some missingness in the covariates
n <- dim(np.missing)[1]
k <- dim(np.missing)[2]
for (i in 1:n) {
missing <- rbinom(1, prob = .1, size = k)
if (missing > 0) {
np.missing[i, sample(k, missing)] <- NA
}
}
### Restore outcome and treatment variables:
np.missing <- data.frame(nuclearplants[c('cost', 'pr')], np.missing)
### Fit a propensity score but with missing covariate data flagged
### and filled in, as in Rosenbaum and Rubin (1984, Appendix):
np.filled <- fill.NAs(pr ~ t1 * t2, np.missing)
# Look at np.filled to establish what missingness flags were created
head(np.filled)
#> pr t1 t2 t1:t2 t1.NA t2.NA
#> H 0 14.00000 46.00000 644.0000 FALSE FALSE
#> I 0 10.00000 73.00000 730.0000 FALSE FALSE
#> A 1 10.00000 85.00000 850.0000 FALSE FALSE
#> J 0 13.89655 67.00000 848.0769 TRUE FALSE
#> B 1 11.00000 78.00000 858.0000 FALSE FALSE
#> K 0 13.00000 63.31034 848.0769 FALSE TRUE
(np.glm <- glm(pr ~ ., family=binomial, data=np.filled))
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#>
#> Call: glm(formula = pr ~ ., family = binomial, data = np.filled)
#>
#> Coefficients:
#> (Intercept) t1 t2 `t1:t2` t1.NATRUE t2.NATRUE
#> -47.7116 2.1598 0.5996 -0.0247 -19.8688 -29.9756
#>
#> Degrees of Freedom: 31 Total (i.e. Null); 26 Residual
#> Null Deviance: 39.75
#> Residual Deviance: 21.72 AIC: 33.72
(glm(pr ~ t1 + t2 + `t1:t2` + t1.NA + t2.NA,
family=binomial, data=np.filled))
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#>
#> Call: glm(formula = pr ~ t1 + t2 + `t1:t2` + t1.NA + t2.NA, family = binomial,
#> data = np.filled)
#>
#> Coefficients:
#> (Intercept) t1 t2 `t1:t2` t1.NATRUE t2.NATRUE
#> -47.7116 2.1598 0.5996 -0.0247 -19.8688 -29.9756
#>
#> Degrees of Freedom: 31 Total (i.e. Null); 26 Residual
#> Null Deviance: 39.75
#> Residual Deviance: 21.72 AIC: 33.72
# In a non-interactive session, the following may help, as long as
# the formula passed to `fill.NAs` (plus any missingness flags) is
# the desired formula for the glm.
(glm(formula(terms(np.filled)), family=binomial, data=np.filled))
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#>
#> Call: glm(formula = formula(terms(np.filled)), family = binomial, data = np.filled)
#>
#> Coefficients:
#> (Intercept) t1 t2 `t1:t2` t1.NATRUE t2.NATRUE
#> -47.7116 2.1598 0.5996 -0.0247 -19.8688 -29.9756
#>
#> Degrees of Freedom: 31 Total (i.e. Null); 26 Residual
#> Null Deviance: 39.75
#> Residual Deviance: 21.72 AIC: 33.72
### produce a matrix of propensity distances based on the propensity model
### with fill-in and flagging. Then perform pair matching on it:
pairmatch(match_on(np.glm, data=np.filled), data=np.filled)
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> H I A J B K L M C N O P Q R S T
#> <NA> 1.2 1.1 <NA> 1.2 <NA> <NA> <NA> 1.3 1.4 <NA> <NA> 1.8 <NA> <NA> 1.6
#> U D V E W F X G Y Z d e f a b c
#> 1.7 1.4 1.9 1.5 1.1 1.6 1.5 1.7 1.10 <NA> <NA> <NA> 1.3 1.8 1.9 1.10
## fill NAs without using treatment contrasts by making a list of contrasts for
## each factor ## following hints from https://stackoverflow.com/a/4569239/161808
np.missing$t1F<-factor(np.missing$t1)
cov.factors <- sapply(np.missing[,c("t1F","t2")],is.factor)
cov.contrasts <- lapply(
np.missing[,names(cov.factors)[cov.factors],drop=FALSE],
contrasts, contrasts = FALSE)
## make a data frame filling the missing covariate values, but without
## excluding any levels of any factors
np.noNA2<-fill.NAs(pr~t1F+t2,data=np.missing,contrasts.arg=cov.contrasts)
```