This list is a collection of commands for specific, often-needed (at least by me) data cleaning and preparation tasks in R.
Most of the commands use the ‘tidyverse’ packages, so to make sure the commands work you need the corresponding packages installed. The simplest way to get them all is install.packages("tidyverse") (note the quotes around the package name), followed by library(tidyverse) to load them.
Also check out the data-wrangling cheat sheets on RStudio’s website.
Conventions: The data is assumed to be a data frame called dat. I’m using the pipe-operator notation (%>%), the outcome of interest is called outcome, and variables are var1, var2, etc., or just referred to as varname.
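To try the commands in this list without real data, a small toy data frame that follows these conventions can help. The sketch below is only an illustration: the column values are made up, and only the names (dat, outcome, var1, var2, ...) come from the conventions above.

```r
library(dplyr)

# Hypothetical toy data; the names follow the conventions above,
# the values are invented purely for illustration.
dat <- data.frame(
  var1 = c(1, 2, 3, 4),
  var2 = c("a", "b", "c", "d"),
  var3 = c(10, 20, NA, 40),
  outcome = c("infctd", "healthy", "infctd", "healthy"),
  stringsAsFactors = FALSE
)

# Quick look at the structure (column names and types)
str(dat)
```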
newdat <- dat %>% rename(newvarname = oldvarname)
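As a quick check of the rename() call above, here is a minimal sketch on a made-up two-column data frame. Note that the new name goes on the left of the equals sign and the old name on the right.

```r
library(dplyr)

# Hypothetical data frame with a column we want to rename
dat <- data.frame(oldvarname = 1:3, var2 = c("a", "b", "c"))

# rename() uses new_name = old_name
newdat <- dat %>% rename(newvarname = oldvarname)

names(newdat)
# "newvarname" "var2"
```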
In this example, the value infctd in varname is replaced by ‘infected’
newdat <- dat %>% mutate(varname = replace(varname, varname == 'infctd', 'infected'))
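The recoding step can be tried on a small example. This sketch assumes varname is a character column; if it were a factor, assigning a value that is not an existing level would not work this way, and the column would need to be converted first (for example with as.character()).

```r
library(dplyr)

# Hypothetical data with a misspelled category
dat <- data.frame(
  varname = c("infctd", "healthy", "infctd"),
  stringsAsFactors = FALSE  # keep varname as character, not factor
)

# Replace every 'infctd' entry with 'infected'; other values are untouched
newdat <- dat %>%
  mutate(varname = replace(varname, varname == "infctd", "infected"))

newdat$varname
# "infected" "healthy" "infected"
```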
While the position of a variable usually doesn’t matter, it is sometimes more convenient to have, for instance, the outcome at the beginning. This can be done as follows:
newdat <- dat %>% select(outcome, everything() )
More general sorting/arranging can be done like this (if you don’t have too many variables):
newdat <- dat %>% select(var1, var2, var3, var4, ... )
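Here is a small sketch of both reordering approaches on made-up data. It also shows relocate(), which is not used in the commands above; it is available in dplyr 1.0.0 and later and by default moves the named columns to the front.

```r
library(dplyr)

# Hypothetical data with the outcome in the last column
dat <- data.frame(var1 = 1:3, var2 = 4:6, outcome = c(0, 1, 0))

# Move outcome to the front, keep everything else in its original order
newdat <- dat %>% select(outcome, everything())
names(newdat)
# "outcome" "var1" "var2"

# Fully explicit ordering by listing every column
newdat2 <- dat %>% select(var2, outcome, var1)
names(newdat2)
# "var2" "outcome" "var1"

# Alternative (assumes dplyr >= 1.0.0): relocate() moves columns to the front
newdat3 <- dat %>% relocate(outcome)
```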
Either just pick the ones you want to keep
newdat <- dat %>% select ( c(var1,var2,var3))
Or delete the ones you don’t want to keep
newdat <- dat %>% select ( -c(var4,var5,var6) )
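The two approaches are mirror images of each other, which a toy example makes easy to see. In this sketch the six-column data frame is hypothetical; keeping var1 to var3 and dropping var4 to var6 give the same columns.

```r
library(dplyr)

# Hypothetical data frame with six variables
dat <- data.frame(
  var1 = 1:2, var2 = 3:4, var3 = 5:6,
  var4 = 7:8, var5 = 9:10, var6 = 11:12
)

# Keep only the listed columns
kept <- dat %>% select(c(var1, var2, var3))

# Drop the listed columns, keep all others
dropped <- dat %>% select(-c(var4, var5, var6))

names(kept)
# "var1" "var2" "var3"
names(dropped)
# "var1" "var2" "var3"
```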
newdat <- dat %>% mutate( newvar = var3*var4)
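The mutate() line above adds a new variable computed from existing ones. A minimal sketch with invented numbers shows the result, including what happens when one of the inputs is missing:

```r
library(dplyr)

# Hypothetical data; var4 has one missing value
dat <- data.frame(var3 = c(1, 2, 3), var4 = c(10, 20, NA))

# newvar is computed row by row; an NA input gives an NA result
newdat <- dat %>% mutate(newvar = var3 * var4)

newdat$newvar
# 10 40 NA
```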
newdat <- dat %>% filter( var == 3)
newdat <- dat %>% filter( !is.na(var) )
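The two filter() calls above keep only the rows that satisfy a condition: the first keeps rows where var equals 3, the second keeps rows where var is not missing. A small sketch on made-up data follows. One detail worth knowing is that filter() also drops rows where the condition itself evaluates to NA.

```r
library(dplyr)

# Hypothetical data with one missing value in var
dat <- data.frame(
  var = c(3, 1, NA, 3),
  var2 = c("a", "b", "c", "d")
)

# Keep rows where var equals 3; the NA row is dropped as well
equal3 <- dat %>% filter(var == 3)
nrow(equal3)
# 2

# Keep rows where var is not missing
nona <- dat %>% filter(!is.na(var))
nrow(nona)
# 3
```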
List the number of NAs for each column/variable:
print(colSums(is.na(dat)))
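To see what this count looks like, here is a sketch on a hypothetical data frame. It uses only base R; the final line, which picks out the columns that contain at least one NA, is an addition for illustration and not part of the command above.

```r
# Hypothetical data with missing values in two of three columns
dat <- data.frame(
  var1 = c(1, NA, 3),
  var2 = c(NA, NA, "x"),
  var3 = 1:3
)

# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(dat))
print(na_counts)
# var1 var2 var3
#    1    2    0

# Names of the columns that have at least one missing value
cols_with_na <- names(na_counts[na_counts > 0])
```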
A nice graphical way to see the missing values, using the Amelia package:
library(Amelia); missmap(dat)
add