Recoding Variables

The recodes() function makes it easy to recode one or more variables in your data frame. The format is

newdata <- recodes(olddata, variables, from values, to values)

Original dataset

Consider the following data set. Let's make the following changes.

sex  race  outcome  Q1  Q2  age  rating
1    b     better   20  15  12   1
2    w     worse    30  23  20   2
1    a     same     44  18  33   5
2    b     same     15  86  55   3
2    w     better   50  99  30   4
2    h     worse    99  35  100  5
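
To follow along, the data set can be typed in by hand. This is a minimal sketch in base R; it assumes the columns are stored as plain numeric and character vectors, since the storage types of the original data are not stated.

```r
# Rebuild the example data set as a data frame.
# Assumption: numeric codes are numeric, text codes are character (not factor).
df <- data.frame(
  sex     = c(1, 2, 1, 2, 2, 2),
  race    = c("b", "w", "a", "b", "w", "h"),
  outcome = c("better", "worse", "same", "same", "better", "worse"),
  Q1      = c(20, 30, 44, 15, 50, 99),
  Q2      = c(15, 23, 18, 86, 99, 35),
  age     = c(12, 20, 33, 55, 30, 100),
  rating  = c(1, 2, 5, 3, 4, 5),
  stringsAsFactors = FALSE
)

# Inspect the column types before recoding.
str(df)
```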

sex

For sex, set 1 to “Male” and 2 to “Female”.

df <- recodes(data=df, vars="sex",
              from=c(1,2), to=c("Male", "Female"))

sex     race  outcome  Q1  Q2  age  rating
Male    b     better   20  15  12   1
Female  w     worse    30  23  20   2
Male    a     same     44  18  33   5
Female  b     same     15  86  55   3
Female  w     better   50  99  30   4
Female  h     worse    99  35  100  5
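
For comparison, the same value-to-value mapping can be sketched in base R with a named lookup vector. This is not how recodes() is implemented; it only illustrates the idea of pairing each from value with a to value. The names lookup and sex_recoded are made up for this example.

```r
# The sex column from the original data set.
sex <- c(1, 2, 1, 2, 2, 2)

# Each name is a "from" value, each element is the matching "to" value.
lookup <- c("1" = "Male", "2" = "Female")

# Index the lookup by the old values (as character) to get the new values.
sex_recoded <- unname(lookup[as.character(sex)])
sex_recoded
```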

race

Recode race to “White” vs. “Other”.

df <- recodes(data=df, vars="race",
              from=c("w", "b", "a", "h"),
              to=c("White", "Other", "Other", "Other"))

sex     race   outcome  Q1  Q2  age  rating
Male    Other  better   20  15  12   1
Female  White  worse    30  23  20   2
Male    Other  same     44  18  33   5
Female  Other  same     15  86  55   3
Female  White  better   50  99  30   4
Female  Other  worse    99  35  100  5

outcome

Recode outcome to 1 (better) vs. 0 (not better).

df <- recodes(data=df, vars="outcome",
              from=c("better", "same", "worse"),
              to=c(1, 0, 0))

sex     race   outcome  Q1  Q2  age  rating
Male    Other  1        20  15  12   1
Female  White  0        30  23  20   2
Male    Other  0        44  18  33   5
Female  Other  0        15  86  55   3
Female  White  1        50  99  30   4
Female  Other  0        99  35  100  5
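
A two-category recode like this one is just a test of a single condition. As a rough base R sketch (the name outcome_binary is made up here), the 1/0 indicator can be built from a logical comparison:

```r
# The outcome column from the original data set.
outcome <- c("better", "worse", "same", "same", "better", "worse")

# TRUE where the outcome is "better", FALSE otherwise; as.integer() maps
# TRUE to 1 and FALSE to 0.
outcome_binary <- as.integer(outcome == "better")
outcome_binary
```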

Q1 and Q2

For Q1 and Q2 set values of 86 and 99 to missing.

df <- recodes(data=df, vars=c("Q1", "Q2"),
              from=c(86, 99), to=NA)
#> Note: 'from' is longer than 'to', so 'to' was recycled.

sex     race   outcome  Q1  Q2  age  rating
Male    Other  1        20  15  12   1
Female  White  0        30  23  20   2
Male    Other  0        44  18  33   5
Female  Other  0        15  NA  55   3
Female  White  1        50  NA  30   4
Female  Other  0        NA  35  100  5
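
Here one to value (NA) is applied to every from value, in both variables. A minimal base R sketch of that idea, using made-up names (missing_codes, vars) and only the two affected columns, might look like this:

```r
# Only the two columns being recoded, copied from the original data set.
df <- data.frame(Q1 = c(20, 30, 44, 15, 50, 99),
                 Q2 = c(15, 23, 18, 86, 99, 35))

missing_codes <- c(86, 99)   # values that stand for "missing"
vars <- c("Q1", "Q2")        # columns to clean

# Apply the same rule to each listed column and put the results back.
df[vars] <- lapply(df[vars], function(x) {
  x[x %in% missing_codes] <- NA
  x
})
df
```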

age

For age, set values

• less than 20 or greater than 90 to missing,
• 20 <= age <= 30 to “Younger”,
• 30 < age <= 50 to “Middle Aged”, and
• 50 < age <= 90 to “Older”.

You can use expressions in your from fields. When they are TRUE, the corresponding to values will be applied. We will use the dollar sign ($) to represent the variable (age in this case). The symbols | and & mean OR and AND, respectively.

df <- recodes(data=df, vars="age",
              from=c("$ <   20 | $> 90",
                     "$ >=  20 & $<= 30",
                     "$ >   30 & $<= 50",
                     "$ >   50 & $<= 90"),
              to=c(NA, "Younger", "Middle Aged", "Older"))

We can also write this as

df <- recodes(data=df, vars="age",
              from=c("$ < 20", "$<= 30", "$ <= 50", "$<= 90", "$ > 90"),
              to=  c(NA, "Younger", "Middle Aged", "Older", NA))

This works because the criteria are checked from left to right, and an observation is recoded by the first criterion that is TRUE for its age value. It isn't changed again by later criteria in the same recodes statement.

sex     race   outcome  Q1  Q2  age          rating
Male    Other  1        20  15  NA           1
Female  White  0        30  23  Younger      2
Male    Other  0        44  18  Middle Aged  5
Female  Other  0        15  NA  Older        3
Female  White  1        50  NA  Younger      4
Female  Other  0        NA  35  NA           5
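
The "first TRUE criterion wins" rule is what makes the shorter version work. The base R sketch below imitates that rule with a loop; it is an illustration of the behavior described above, not the code inside recodes(), and the names conditions, new_values, age_group, and done are made up.

```r
# The age column from the original data set.
age <- c(12, 20, 33, 55, 30, 100)

# Criteria in left-to-right order, mirroring the shorter form above.
conditions <- list(age < 20, age <= 30, age <= 50, age <= 90, age > 90)
new_values <- c(NA, "Younger", "Middle Aged", "Older", NA)

age_group <- rep(NA_character_, length(age))
done <- rep(FALSE, length(age))   # TRUE once an observation has been recoded

for (i in seq_along(conditions)) {
  # Only observations not yet recoded can be matched by this criterion.
  hit <- conditions[[i]] & !done
  age_group[hit] <- new_values[i]
  done <- done | hit
}
age_group
```

Without the done flag, an age of 12 would also satisfy age <= 30 and be relabeled "Younger" by the second criterion.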

rating

Finally, for the rating variable, reverse the scoring so that 1 to 5 becomes 5 to 1.

df <- recodes(data=df, vars="rating", from=1:5, to=5:1)

sex     race   outcome  Q1  Q2  age          rating
Male    Other  1        20  15  NA           5
Female  White  0        30  23  Younger      4
Male    Other  0        44  18  Middle Aged  1
Female  Other  0        15  NA  Older        3
Female  White  1        50  NA  Younger      2
Female  Other  0        NA  35  NA           1
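
Reverse scoring has a simple arithmetic check: on a scale from 1 to 5, the reversed score is (1 + 5) minus the old score. The short base R sketch below (names are made up for the example) confirms that the arithmetic and a from/to style lookup agree.

```r
# The rating column from the original data set.
rating <- c(1, 2, 5, 3, 4, 5)

# Arithmetic version: (lowest + highest) - old score.
reversed <- (1 + 5) - rating

# Lookup version: find each old score in 1:5, take the same position in 5:1.
reversed_lookup <- (5:1)[match(rating, 1:5)]

reversed
all(reversed == reversed_lookup)
```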

Note

Remember that recodes returns a data frame, not a variable.

• df <- recodes(data=df, vars="rating", from=1:5, to=5:1) is correct.

• df$rating <- recodes(data=df, vars="rating", from=1:5, to=5:1) is not.

This allows you to apply the same recoding scheme to more than one variable at a time (e.g., Q1 and Q2 above).
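
To see why returning the whole data frame is convenient, here is a small hypothetical helper, recode_sketch(), written in base R. It is not the real recodes() function and handles only exact-value matches; it just shows how looping over vars and returning the data frame lets one call recode several columns.

```r
# Hypothetical helper for illustration only; NOT the recodes() implementation.
recode_sketch <- function(data, vars, from, to) {
  # Recycle a shorter 'to' so that every 'from' value has a partner.
  to <- rep(to, length.out = length(from))
  for (v in vars) {
    old <- data[[v]]
    new <- old
    for (i in seq_along(from)) {
      # Always compare against the untouched column, so a value that has
      # already been recoded is not recoded a second time.
      new[old %in% from[i]] <- to[i]
    }
    data[[v]] <- new
  }
  data   # the whole data frame comes back, not a single column
}

df <- data.frame(Q1 = c(20, 99), Q2 = c(86, 15), rating = c(1, 5))
df <- recode_sketch(df, c("Q1", "Q2"), from = c(86, 99), to = NA)
df <- recode_sketch(df, "rating", from = 1:5, to = 5:1)
df
```

Assigning the result to df$rating instead would try to store an entire data frame in one column, which is why the second form in the note above is wrong.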

And that’s it (APPLAUSE, APPLAUSE)!