Each statistics and biostatistics department around the world, in conjunction with its collaborators (for us at the University of Washington, this includes departments like Epidemiology and Global Health), must choose a statistical software package to teach in introductory courses. At the University of Washington, STATA has been taught for many years. While STATA is a powerful package with relatively user-friendly syntax, it is not very flexible: only one data set can be read in at a time, there is only one data type, and simulations are especially difficult to run. While STATA is an important tool to know, we argue that learning the fundamentals of R opens up many possibilities and a great deal of power.
We have written another document, "An Introduction to R", which serves as an introduction to the R philosophy of programming and lays out some of the basic data manipulation strategies. In this document, we assume that the reader has read "An Introduction to R" or is at least familiar with the basic R data types, data manipulation, basic functions for descriptive statistics, and installing and loading packages. We now present a detailed walkthrough of R functions that perform the same tasks as STATA commands, with syntax provided.
Many of these functions are available in base R, which is automatically installed along with R and automatically loaded each time you start R. However, some functions are only available in other packages, which you must install and load manually. In particular, one of the goals of the uwIntroStats package is to facilitate easy adoption of R by STATA users. We leave the introduction of this package to a separate document entitled "Using the uwIntroStats Package".
Quick view of functions:
STATA     | R (base)
----------|----------------------------------------------
summarize | summary(), mean(), sd(), var(), min(), max()
In STATA, the summarize command calculates the number of observations, mean, standard deviation, minimum, and maximum for each variable in the data set, and conveniently displays all of these summary statistics with one command. In R, this takes a bit more work. To find the mean or the median, we simply type mean(data) or median(data). Many of the other simple functions (sum() computes a sum, dim() returns the dimensions of an object) operate in a similar way. For our example, let's use the mri data set. In STATA, we need to download the file from the internet and read it in. We can also do this in R (see "An Introduction to R" for more information), but it is much easier and faster to load it from the uwIntroStats package.
Example 1
Goal: Calculate summary statistics (mean, standard deviation, minimum, and maximum) for variables in the mri data.
Data: mri
STATA Code and Output
* First read in the mri data *
infile ptid mridate age male race weight height packyrs yrsquit alcoh physact chf chd stroke diabetes genhlth ldl alb crt plt sbp aai fev dsst atrophy whgrd numinf volinf obstime death using "http://www.emersonstatistics.com/datasets/mri.txt"
* This will throw "errors" due to the headers in the file *
* Delete this first row of NAs *
drop in 1
/* Now call summarize. Note that normally this will print all variables, but we have not included them all */
summarize
Variable |  Obs      Mean       Std. Dev.     Min      Max
---------+-------------------------------------------------
    ptid |  735      368        212.3205        1      735
 mridate |  735   76422.93      31896.42    10192   123191
     age |  735   74.56599      5.451364       65       99
    male |  735   .4979592      .5003363        0        1
    race |  735   1.318367      .6659304        1        4
R Code and Output:
## Load the uwIntroStats package
library(uwIntroStats)
##
## Attaching package: 'uwIntroStats'
##
## The following object is masked from 'package:base':
##
## tabulate
## Load the mri data
data(mri)
## Calculate summary statistics
mean(mri$age)
## [1] 74.56599
median(mri$age)
## [1] 74
min(mri$age)
## [1] 65
max(mri$age)
## [1] 99
sd(mri$age)
## [1] 5.451364
length(mri$age)
## [1] 735
The R functions give us flexible options in case we only need a subset of these summary statistics. If we want to apply one of these functions to every variable in the data set at once, we can use the apply() function (we display only the first five elements for the sake of space):
head(apply(mri, 2, mean, na.rm = TRUE), n = 5)
## ptid mridate age male race
## 3.680000e+02 7.642293e+04 7.456599e+01 4.979592e-01 1.318367e+00
The last argument, na.rm = TRUE, is passed to the mean() function and ensures that missing values are removed from the variable before the mean is calculated. Otherwise, the function will return NA.
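For example, with a short made-up vector containing a missing value, the two behaviors are:

```r
x <- c(1, 2, NA, 4)
## By default the NA propagates through the calculation
mean(x)
## [1] NA
## With na.rm = TRUE the missing value is dropped first
mean(x, na.rm = TRUE)
## [1] 2.333333
```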
Quick view of functions:
STATA          | R (base)
---------------|--------------------------
tabulate, chi2 | table(), chisq.test()
cc             | (no base R equivalent)
cs             | (no base R equivalent)
In STATA, tabulate displays tables of frequency counts. If you add the chi2 option, a chi-squared test is performed as well. The chi-squared test in this case is a descriptive statistic: it measures how well the observations match the distribution expected based on the sample proportions. The cc and cs commands also compute chi-squared tests, on case-control and cohort data respectively.
The base R function chisq.test() performs the chi-squared test on tables, matrices, and vectors. The function table() returns a frequency table of the variables entered. However, in order to calculate odds ratios, risk ratios, or other statistics (for example the likelihood ratio, the Mantel-Haenszel statistic, and others) you must use functions from other R packages and piece the information together. For example, the Exact package and the uwIntroStats package both have functions for calculating these ratios. These packages give more output than the base R functions, and in the case of uwIntroStats follow a syntax similar to STATA's.
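That said, the point estimates themselves are easy to compute directly in base R. A sketch using the 2x2 counts from the example below (the Wald confidence interval here is approximate, so its endpoints will differ slightly from STATA's exact interval):

```r
## 2x2 table: rows = cases/controls, columns = exposed/unexposed
tab <- matrix(c(53, 313,
                26, 343), nrow = 2, byrow = TRUE,
              dimnames = list(c("Cases", "Controls"),
                              c("Exposed", "Unexposed")))
## Odds ratio: (a*d) / (b*c)
or <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
or
## [1] 2.233841
## Approximate (Wald) 95% CI on the log-odds scale
se.log <- sqrt(sum(1 / tab))
exp(log(or) + c(-1, 1) * qnorm(0.975) * se.log)
```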
Example 1
Goal: Calculate chi-squared statistics on the tabulation of male and diabetes.
Data: mri
STATA Code and Output
cc male diabetes
                 | Exposed   Unexposed |   Total    Proportion Exposed
        ---------+---------------------+------------------------------
           Cases |    53        313    |    366         0.1448
        Controls |    26        343    |    369         0.0705
        ---------+---------------------+------------------------------
           Total |    79        656    |    735         0.1075
                 |
                 |   Point estimate    |    [95% Conf. Interval]
        ---------+---------------------+------------------------------
      Odds ratio |    2.233841         |    1.333028    3.814119 (exact)
 Attr. frac. ex. |    .5523406         |    .2498281    .7378163 (exact)
 Attr. frac. pop |    .0799837         |
        ---------+----------------------------------------------------
                               chi2(1) = 10.59    Pr>chi2 = 0.0011
cs male diabetes

                 |      diabetes       |
                 | Exposed   Unexposed |   Total
        ---------+---------------------+----------
           Cases |    53        313    |    366
        Noncases |    26        343    |    369
        ---------+---------------------+----------
           Total |    79        656    |    735
                 |                     |
            Risk | .6708861   .4771341 | .4979592
                 |                     |
                 |   Point estimate    |    [95% Conf. Interval]
        ---------+---------------------+---------------------------
 Risk difference |    .1937519         |    .0833099    .304194
      Risk ratio |    1.406074         |    1.181535    1.673285
 Attr. frac. ex. |    .2888            |    .1536436    .4023731
 Attr. frac. pop |    .0418208         |
        ---------+-------------------------------------------------
                               chi2(1) = 10.59    Pr>chi2 = 0.0011
R Code and Output
table(mri$male, mri$diabetes)
##
## 0 1
## 0 343 26
## 1 313 53
## Use the "correct = FALSE" argument to halt application of continuity correction
chisq.test(mri$male, mri$diabetes, correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: mri$male and mri$diabetes
## X-squared = 10.588, df = 1, p-value = 0.001138
Quick view of functions:
STATA     | R (base)
----------|----------
graph box | boxplot()
The boxplot is usually used as a method of visualizing data quickly, since it reduces the data to essentially five numbers: the minimum, first quartile, median, third quartile, and maximum. The syntax for boxplots is quite similar between STATA and R, and each gives approximately the same output. However, it is important to read the documentation for each function to see how "outliers" are computed. In both cases, the default rule flags as outliers those observations which lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. Knowing the rule is important when deciding whether to classify these observations as outliers or not.
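The 1.5 x IQR rule is easy to check by hand. A small sketch on simulated data (the simulated values are illustrative only):

```r
set.seed(1)
x <- c(rnorm(100), 8)  # append one clear outlier
q <- quantile(x, c(0.25, 0.75))
## Observations beyond 1.5 * IQR from the quartiles are drawn as points
fences <- c(q[1] - 1.5 * IQR(x), q[2] + 1.5 * IQR(x))
x[x < fences[1] | x > fences[2]]
```

For comparison, boxplot.stats(x)$out reports the points boxplot() itself would flag; it uses hinges rather than quantiles, so the two sets can differ slightly at the boundary.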
Example 1
Goal: Produce a boxplot to describe the distribution of forced expiratory volume (FEV) by smoking group.
Data: fev
STATA Code and Output
* Read in the data *
use "https://courses.washington.edu/b511/Data/FEV1ClinTrial.dta"
* Rename the variables *
rename Y0 FEV1base
rename Y1 FEV1wk24
rename T smoke
* Make the boxplot *
graph box FEV1base, by(smoke) ytitle(FEV)
R Code and Output
## Read in the data
fev <- read.table("http://courses.washington.edu/b511/Data/FEV1ClinTrial.dat", sep = "")
## Rename the variables
names(fev) <- c("FEV1wk24", "FEV1base", "smoke")
## Create the boxplot, with one box per smoking group
boxplot(FEV1base ~ smoke, data = fev, ylab = "FEV")
Quick view of functions:
STATA     | R (base)
----------|---------
histogram | hist()
The syntax for creating a histogram is very similar between R and STATA. However, if the STATA data set has labels for the variables, STATA automatically puts these labels on the plot. In R, we have to label the plot manually.
Example 1
Goal: Produce a histogram of FEV at week 24.
Data: fev
STATA Code and Output
* Read in the data, rename variables *
use "https://courses.washington.edu/b511/Data/FEV1ClinTrial.dta"
rename Y0 FEV1base
rename Y1 FEV1wk24
rename T smoke
* Create the histogram *
histogram FEV1wk24
R Code and Output
## Read in the data, rename the variables
fev <- read.table("http://courses.washington.edu/b511/Data/FEV1ClinTrial.dat", sep = "")
names(fev) <- c("FEV1wk24", "FEV1base", "smoke")
## Create the histogram
hist(fev$FEV1wk24, xlab = "FEV1 at week 24", main = "Histogram of FEV1 at week 24")
Quick view of functions:
STATA   | R (base)
--------|--------------------------
scatter | plot(), scatter.smooth()
The base R version of the scatterplot takes an x and a y variable (which must be of the same length) and plots the points specified by these coordinates. If you wish to add least squares lines or other features to the plot, you must use the abline() function, or use the scatter.smooth() function. The STATA version of the scatterplot takes a y-variable followed by an x-variable, followed by any options.
Example 1
Goal: Produce a scatterplot of FEV at 24 weeks against FEV at baseline, with loess and least squares lines fit to the graph.
Data: fev
STATA Code and Output
* Read in the data, rename variables *
use "https://courses.washington.edu/b511/Data/FEV1ClinTrial.dta"
rename Y0 FEV1base
rename Y1 FEV1wk24
rename T smoke
* Create the scatterplot *
scatter FEV1wk24 FEV1base || lfit FEV1wk24 FEV1base || lowess FEV1wk24 FEV1base
R Code and Output
## Read in the data, rename the variables
fev <- read.table("http://courses.washington.edu/b511/Data/FEV1ClinTrial.dat", sep = "")
names(fev) <- c("FEV1wk24", "FEV1base", "smoke")
## Create the scatterplot (this one has no loess curve)
plot(fev$FEV1base, fev$FEV1wk24, xlab = "FEV1 at baseline", ylab = "FEV1 at week 24", main = "Scatterplot of FEV1 at baseline vs week 24")
## scatter.smooth() takes x first, then y, and adds a loess curve
scatter.smooth(fev$FEV1base, fev$FEV1wk24, xlab = "FEV1 at baseline", ylab = "FEV1 at week 24", main = "Scatterplot of FEV1 at baseline vs week 24")
## Add the least squares line (don't worry about how we calculated it for now)
fev1.lm <- lm(FEV1wk24 ~ FEV1base, data = fev)
abline(fev1.lm, col = "red")
## Add a legend
legend(.5, 3.5, legend = c("Loess", "Least Squares"), fill = c("black", "red"))
Quick view of functions:
STATA     | R (base)
----------|---------------
correlate | cor(), cov()
In STATA, the command to compute a correlation matrix is correlate. In base R, there are a few different functions to compute correlation, covariance, or variance. The cor() function computes the correlation between two vectors or matrices, and the cov() function computes the covariance. In both cases, if matrices are entered, the correlation (or covariance) is computed between the columns of the matrices. Last, var() computes the variance as discussed above, but var() applied to two matrices returns the same result as cov().
Example 1
Goal: Compute the correlation and covariance between FEV at week 24 and FEV at baseline.
Data: fev
STATA Code and Output
* Read in the data, rename variables *
use "https://courses.washington.edu/b511/Data/FEV1ClinTrial.dta"
rename Y0 FEV1base
rename Y1 FEV1wk24
rename T smoke
* Compute *
correlate FEV1wk24 FEV1base
(obs=520)

             |  FEV1wk24  FEV1base
    ---------+---------------------
    FEV1wk24 |   1.0000
    FEV1base |   0.8932    1.0000

correlate FEV1wk24 FEV1base, cov

(obs=520)

             |  FEV1wk24  FEV1base
    ---------+---------------------
    FEV1wk24 |  .413977
    FEV1base |  .343671   .357618
Notice that when we asked for the covariances, the STATA output gives the covariance in the lower triangle and the variances of the individual variables along the diagonal.
R Code and Output
## Read in the data, rename variables
fev <- read.table("http://courses.washington.edu/b511/Data/FEV1ClinTrial.dat", sep = "")
names(fev) <- c("FEV1wk24", "FEV1base", "smoke")
## Compute
cor(fev$FEV1wk24, fev$FEV1base)
## [1] 0.8931929
cov(fev$FEV1wk24, fev$FEV1base)
## [1] 0.3436712
var(fev$FEV1wk24)
## [1] 0.413977
var(fev$FEV1base)
## [1] 0.3576181
To run pairwise correlations in R, add the argument use = "pairwise.complete.obs". To use a different method of correlation calculation, set the method argument to "pearson" (the default), "spearman", or "kendall".
In STATA, use the pwcorr command, which takes all of the same arguments as the correlate command.
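A short sketch of both arguments on simulated data (the variables here are made up):

```r
set.seed(7)
m <- cbind(a = rnorm(50), b = rnorm(50))
m[1:5, "b"] <- NA  # introduce some missing values
## Default: any NA makes the corresponding correlation NA
cor(m)["a", "b"]
## [1] NA
## Pairwise deletion uses the complete pairs for each comparison
cor(m, use = "pairwise.complete.obs")["a", "b"]
## Rank-based (Spearman) correlation
cor(m, method = "spearman", use = "pairwise.complete.obs")["a", "b"]
```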
Quick view of functions:
STATA                           | R (base)
--------------------------------|----------------------
display distnameprob(statistic) | pdistname(statistic)
Say we calculate a test statistic (z-score, t-statistic, chi-squared statistic) and want to find the p-value. Recall that for a p-value we want the probability of values at least as extreme as our value, given that the null hypothesis is true. If we calculate a z-score of 1 for our data, then we know that we are dealing with the normal distribution. To find the p-value, we call the cumulative distribution function command in R or in STATA. Example 1 shows the correct code for this example. The other distributions follow the convention in this example, and for that reason distname stands in for the distribution name in the "Quick view" section.
Example 1
Goal: Calculate the pvalue associated with a zscore of 1.
Data: None.
STATA Code and Output
display 1-normprob(1)
.15865525
R Code and Output
1 - pnorm(1)
## [1] 0.1586553
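For a two-sided test we want the probability of a value at least as extreme in either direction, so the one-sided tail area is doubled:

```r
z <- 1
## Two-sided p-value for a z-score of 1
2 * (1 - pnorm(abs(z)))
## [1] 0.3173105
## Equivalent form using the lower tail
2 * pnorm(-abs(z))
## [1] 0.3173105
```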
Quick view of functions:
STATA | R (base)
------|---------
ttest | t.test()
While the commands for the t-test look very similar in STATA and in R, there are a few differences. Most noticeably, the output is very different. Also, STATA provides inference for each one-sided hypothesis individually (both that the mean is less than the null hypothesized value, and that it is greater), while R defaults to providing inference for the two-sided test. The alternative argument to t.test() lets the user request a one-sided test by supplying either "less" or "greater".
Example 1
Goal: Test that the mean atrophy score is equal to 30.
Data: mri
STATA Code and Output
ttest atrophy == 30
One-sample t test
------------------------------------------------------------------------------
Variable |  Obs       Mean     Std. Err.  Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
 atrophy |  735    35.98367   .4766699    12.92294     35.04787    36.91947
------------------------------------------------------------------------------
    mean = mean(atrophy)                                          t = 12.5531
Ho: mean = 30                                    degrees of freedom =     734

   Ha: mean < 30            Ha: mean != 30              Ha: mean > 30
 Pr(T < t) = 1.0000      Pr(|T| > |t|) = 0.0000      Pr(T > t) = 0.0000
R Code and Output
## Two-sided test
t.test(mri$atrophy, mu = 30)
##
## One Sample t-test
##
## data: mri$atrophy
## t = 12.553, df = 734, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 30
## 95 percent confidence interval:
## 35.04787 36.91947
## sample estimates:
## mean of x
## 35.98367
## One-sided tests
t.test(mri$atrophy, mu = 30, alternative = "less")
##
## One Sample t-test
##
## data: mri$atrophy
## t = 12.553, df = 734, p-value = 1
## alternative hypothesis: true mean is less than 30
## 95 percent confidence interval:
## -Inf 36.76872
## sample estimates:
## mean of x
## 35.98367
t.test(mri$atrophy, mu = 30, alternative = "greater")
##
## One Sample t-test
##
## data: mri$atrophy
## t = 12.553, df = 734, p-value < 2.2e-16
## alternative hypothesis: true mean is greater than 30
## 95 percent confidence interval:
## 35.19863 Inf
## sample estimates:
## mean of x
## 35.98367
Quick view of functions:
STATA | R (base)
------|---------
ttest | t.test()
Both the ttest command in STATA and the t.test() function in R are equipped to handle two-sample inference as well. By default, STATA's ttest assumes the two groups have equal variances, while R's t.test() performs the Welch (unequal-variance) test; setting var.equal = TRUE in R reproduces STATA's default behavior. Both also provide an option for a matched-pairs test.
Example 1
Goal: Test whether the mean atrophy score is different between males and females.
Data: mri
STATA Code and Output
ttest atrophy, by(male)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |  Obs       Mean     Std. Err.  Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |  369    32.90515   .6363956    12.22476     31.65372    34.15658
       1 |  366    39.08743   .6733026    12.88104     37.76339    40.41147
---------+--------------------------------------------------------------------
combined |  735    35.98367   .4766699    12.92294     35.04787    36.91947
---------+--------------------------------------------------------------------
    diff |       -6.182283   .9262663                 -8.000734   -4.363831
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t = -6.6744
Ho: diff = 0                                     degrees of freedom =     733

   Ha: diff < 0             Ha: diff != 0               Ha: diff > 0
 Pr(T < t) = 0.0000      Pr(|T| > |t|) = 0.0000      Pr(T > t) = 1.0000
R Code and Output
t.test(mri$atrophy[mri$male == 0], mri$atrophy[mri$male == 1])
##
## Welch Two Sample t-test
##
## data: mri$atrophy[mri$male == 0] and mri$atrophy[mri$male == 1]
## t = -6.673, df = 730.33, p-value = 4.956e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.001133 -4.363432
## sample estimates:
## mean of x mean of y
## 32.90515 39.08743
The t-test in R requires that we either subset the data into the two groups, as above, or use the formula interface: t.test(atrophy ~ male, data = mri).
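Notice also that the R output above is headed "Welch Two Sample t-test": R did not pool the variances, which is why its results differ slightly from STATA's. A sketch of the relevant options on simulated data (the numbers are made up to resemble the atrophy example):

```r
set.seed(3)
y <- c(rnorm(40, mean = 33, sd = 12), rnorm(40, mean = 39, sd = 13))
g <- rep(0:1, each = 40)
## Welch test: R's default, variances not assumed equal
t.test(y ~ g)
## Pooled-variance test, matching STATA's default behavior
t.test(y ~ g, var.equal = TRUE)
## Matched-pairs test (the pairing here is artificial, for illustration only)
t.test(y[g == 0], y[g == 1], paired = TRUE)
```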
Quick view of functions:
STATA   | R (base)
--------|---------
regress | lm()
In STATA, the regress command takes the dependent variable (y) followed by the independent variables (x). In R, we need to give lm() a formula: an object containing a "~" character. For example, if we wanted to regress atrophy on age in the mri data, the formula would be
atrophy ~ age
We also need to supply lm() with a data set, using the data = argument.
The call to lm() creates an lm object, and in order to print the full inference (as STATA does by default) we need to use the summary() function. Once we have done this, the output agrees.
Example 1
Goal: Determine the association between age and cerebral atrophy by regressing atrophy on age.
Data: mri
STATA Code and Output
regress atrophy age
      Source |       SS       df       MS              Number of obs =     735
-------------+------------------------------           F(  1,   733) =   69.58
       Model |   10626.648     1   10626.648           Prob > F      =  0.0000
    Residual |  111953.156   733  152.732819           R-squared     =  0.0867
-------------+------------------------------           Adj R-squared =  0.0854
       Total |  122579.804   734  167.002458           Root MSE      =  12.359

------------------------------------------------------------------------------
     atrophy |    Coef.    Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  .6979831   .0836783    8.34   0.000      .5337054    .8622609
       _cons | -16.06213   6.256186   -2.57   0.010     -28.34431   -3.779947
------------------------------------------------------------------------------
R Code and Output
atr.lm <- lm(atrophy ~ age, data = mri)
summary(atr.lm)
##
## Call:
## lm(formula = atrophy ~ age, data = mri)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.870 -8.589 -0.870 7.666 51.203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -16.06213 6.25619 -2.567 0.0104 *
## age 0.69798 0.08368 8.341 3.64e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.36 on 733 degrees of freedom
## Multiple R-squared: 0.08669, Adjusted R-squared: 0.08545
## F-statistic: 69.58 on 1 and 733 DF, p-value: 3.635e-16
Quick view of functions:
STATA                | R (base)
---------------------|----------------
glm, family() link() | glm(family = )
stcox                | coxph()
Generalized linear models allow the user to perform logistic regression, Poisson regression, and proportional hazards regression, to name a few. In both STATA and R, for all of the types besides proportional hazards regression, the function is glm. In these cases, we need to supply the function with a family so that the correct type of regression is run. We also sometimes need to specify the correct link function. The link function specifies what function of the parameter of interest (usually the mean response) is linear in the coefficients. For example, in logistic regression we use the logit link, because our assumption is that (for a model with only one predictor): \[
\log \left( \frac{E(Y)}{1 - E(Y)} \right) = \beta_0 + \beta_1 X
\] In both cases, binomial is the correct family for logistic regression. However, in STATA, we need to specify a link function to differentiate logistic regression (link(logit), the default for the binomial family) from, for example, probit regression (link(probit)) or log-link ("relative risk") regression (link(log)). In R, we use family = "poisson" to run Poisson regression.
In both languages, proportional hazards regression requires a special function. In STATA this is stcox, while in R it is coxph() (from the survival package). Both of these functions follow the same general syntax as the linear regression functions in their respective languages.
Example 1
Goal: Examine the relationship between diabetes and age using logistic regression.
Data: mri
STATA Code and Output
glm diabetes age, family(binomial) link(log)
Iteration 0: log likelihood = -347.07514
Iteration 1: log likelihood = -251.86635
Iteration 2: log likelihood = -250.64888
Iteration 3: log likelihood = -250.64853
Iteration 4: log likelihood = -250.64853

Generalized linear models                        No. of obs      =       735
Optimization     : ML                            Residual df     =       733
                                                 Scale parameter =         1
Deviance         =  501.2970586                  (1/df) Deviance =  .6838978
Pearson          =  734.9984355                  (1/df) Pearson  =  1.002726

Variance function: V(u) = u*(1-u)                [Bernoulli]
Link function    : g(u) = ln(u)                  [Log]

                                                 AIC             =   .687479
Log likelihood   = -250.6485293                  BIC             = -4336.408

------------------------------------------------------------------------------
             |                 OIM
    diabetes |    Coef.    Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age | -.0108457   .0201345   -0.54   0.590     -.0503086    .0286172
       _cons | -1.423417   1.498109   -0.95   0.342     -4.359657    1.512823
------------------------------------------------------------------------------

R Code and Output
diabetes.glm <- glm(diabetes ~ age, data = mri, family = "binomial")
summary(diabetes.glm)
##
## Call:
## glm(formula = diabetes ~ age, family = "binomial", data = mri)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5032 -0.4863 -0.4780 -0.4619 2.2011
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.21557 1.66965 -0.728 0.467
## age -0.01211 0.02242 -0.540 0.589
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 501.59 on 734 degrees of freedom
## Residual deviance: 501.30 on 733 degrees of freedom
## AIC: 505.3
##
## Number of Fisher Scoring iterations: 4
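A caveat about this example: the STATA call above specified link(log), while R's binomial family defaults to the logit link, which is why the two sets of coefficient estimates disagree. Passing the link explicitly reproduces a log-link fit; a sketch on simulated data (simulated because log-link binomial models can need explicit starting values to converge):

```r
set.seed(11)
age <- round(runif(300, 65, 99))
diabetes <- rbinom(300, 1, 0.11)
## Logit link: R's default for the binomial family
fit.logit <- glm(diabetes ~ age, family = binomial)
family(fit.logit)$link
## [1] "logit"
## Log link, as in the STATA call; start at the intercept-only fit
fit.log <- glm(diabetes ~ age, family = binomial(link = "log"),
               start = c(log(mean(diabetes)), 0))
family(fit.log)$link
## [1] "log"
```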
Quick view of functions:
STATA   | R (base)
--------|-------------------------
lincom  | (no base R equivalent)
predict | predict()
Postestimation, for the most part, deals with analyzing combinations of coefficients or predicting new values. Once you have run a regression analysis, these functions can supply additional insight into the relationship of interest.
In STATA, the main command for analyzing combinations of coefficients is lincom. This command computes point estimates, t-statistics, confidence intervals, and p-values for a linear combination of the regression coefficients tested against some null hypothesis. For example, if we were exploring the relationship between atrophy, age, and sex in the mri data, we could test whether the coefficients for age and male sum to zero. Base R has no function to do this; however, the lincom() function in the uwIntroStats package does.
Example 1
Goal: Using the results from a linear regression model modeling atrophy
on age
and male
, test whether the coefficient estimates of these two variables sum to zero.
Data: mri
STATA Code and Output
lincom age+male
( 1)  age + male = 0

------------------------------------------------------------------------------
     atrophy |    Coef.    Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |  6.645784   .8870476    7.49   0.000      4.904324    8.387245
------------------------------------------------------------------------------
R Code and Output
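Although base R has no lincom command, the same quantities can be computed from any fitted model with coef() and vcov(). A sketch on simulated data (the variables are made-up stand-ins for the mri example):

```r
set.seed(42)
n <- 200
age <- rnorm(n, mean = 75, sd = 5)
male <- rbinom(n, 1, 0.5)
atrophy <- -16 + 0.7 * age + 6 * male + rnorm(n, sd = 12)
fit <- lm(atrophy ~ age + male)

## Contrast vector picking out the combination age + male
cmb <- c(0, 1, 1)
est <- sum(cmb * coef(fit))                       # point estimate
se  <- drop(sqrt(t(cmb) %*% vcov(fit) %*% cmb))   # standard error
tstat <- est / se
pval  <- 2 * pt(-abs(tstat), df = fit$df.residual)
c(estimate = est, se = se, t = tstat, p = pval)
```

The same pattern works for any linear combination of coefficients; lincom() in uwIntroStats wraps this computation with STATA-like output.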
Next, suppose we wish to predict the atrophy score for a new participant who is 70 years old and female. We need to use the predict
command in STATA or the predict()
function in base R. Regardless of the type of regression we have run, both of these functions will work.
Example 2
Goal: Using the results from a linear regression model modeling atrophy
on age
and male
, predict the atrophy score of a 70 year old female.
Data: mri
STATA Code and Output
predict preds
list preds in 1/10
Now we have to match the correct predictions to the age and sex combinations, which is a bit tedious.
R Code and Output
## regress() comes from the uwIntroStats package
reg <- regress("mean", atrophy ~ age + male, data = mri)
newdata <- c(70, 0)
newdata <- as.data.frame(newdata)
## we need to switch the rows and columns to make this work
newdata <- as.data.frame(t(newdata))
colnames(newdata) <- c("age", "male")
predict(reg, newdata)
## newdata
## 29.90039
While we have had to do some extra work in R to create a data frame, once we have this data frame the prediction is quite simple. As is the case in STATA, the new data must have the same column names as the variables used to build the regression model.
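The same pattern works with a base-R lm() fit, and the new data frame can be built in one step with data.frame(). A self-contained sketch on simulated stand-ins for the mri variables:

```r
set.seed(5)
age <- rnorm(200, mean = 75, sd = 5)
male <- rbinom(200, 1, 0.5)
atrophy <- -16 + 0.7 * age + 6 * male + rnorm(200, sd = 12)
fit <- lm(atrophy ~ age + male)

## Column names must match the variables used to fit the model
newdata <- data.frame(age = 70, male = 0)
predict(fit, newdata)
## interval = "prediction" adds an interval for a new individual's response
predict(fit, newdata, interval = "prediction")
```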