In the uwIntroStats
package, we have set out to make regression and analysis easier by:
This function is regress()
. The basic arguments to this function, which unlock all of its potential, are
fnctl
- the functionalformula
- the formula for the linear modeldata
- the data to use for the modelid
- the identification variable in data with repeated measurementsWe use the concept of a functional to handle our first goal. A functional takes a function as its argument and returns a number - hence the mean is a functional, because it takes a distribution as its argument and returns a single number. The allowed functionals to regress()
are
Functional | Type of Regression | Previous command (package) |
---|---|---|
"mean" |
Linear Regression | lm() (stats - base R) |
"geometric mean" |
Linear Regression on logarithmically transformed Y | lm() , with Y log transformed (stats - base R) |
"odds" |
Logistic Regression | glm(family = binomial) (stats - base R) |
"rate" |
Poisson Regression | glm(family = poisson) (stats - base R) |
"hazard" |
Proportional Hazards Regression | coxph() (survival ) |
The formula to regress()
is the same as a formula given to lm()
or any of the other regression commands from base R, survival
, or geepack
, but with one small addition. To address our second goal of allowing the user to specify multiple-partial F-tests, we have added a special function - U()
- which can be added to the formula. The U()
function is documented more fully in “User Specified Multiple-Partial F-tests in Regression”.
The data argument is exactly the same as that in lm()
or any of the other regression commands.
Last, id allows the user to fit a generalized estimating equations (GEE) model while using the same syntax as any of the functionals to regress()
. The GEE framework is a useful way to model correlated data, which often comes in the form of repeated measurements.
As a first example, we run a linear regression analysis of atrophy on age and male, from the mri
data. This dataset is included in the uwIntroStats
package, and its documentation can be found on Scott Emerson’s website here.
## Preparing our R session
library(uwIntroStats)
##
## Attaching package: 'uwIntroStats'
##
## The following object is masked from 'package:base':
##
## tabulate
data(mri)
regress("mean", atrophy ~ age + male, data = mri)
##
## Call:
## regress(fnctl = "mean", formula = atrophy ~ age + male, data = mri)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.765 -8.582 -0.356 7.344 52.100
##
## Coefficients:
## Estimate Naive SE Robust SE 95%L 95%H
## [1] Intercept -17.83 6.081 6.557 -30.70 -4.959
## [2] age 0.6819 0.08129 0.08769 0.5097 0.8540
## [3] male 5.964 0.8857 0.8845 4.227 7.700
## F stat df Pr(>F)
## [1] Intercept 7.40 1 0.0067
## [2] age 60.47 1 < 0.00005
## [3] male 45.46 1 < 0.00005
##
## Residual standard error: 12 on 732 degrees of freedom
## Multiple R-squared: 0.14, Adjusted R-squared: 0.1376
## F-statistic: 52.18 on 2 and 732 DF, p-value: < 2.2e-16
This call automatically prints the coefficients table. First, notice that by default robust standard error estimates (calculated using the sandwich
package) are returned, in addition to the naive estimates. The robust estimates are also used to perform the inference - thus the confidence intervals, statistics, and p-values use these estimates of the standard error.
If we did not use robust standard error estimates, then in the case of linear regression we would be assuming that all groups have the same variance. Then any inference we make could be on the fact that the variances are different, rather than just on the means - which is usually what we want in linear regression.
F-statistics are also displayed by default. This allows us to display multiple-partial F-tests within the coefficients table, and is more in line with teaching philosophy at the University of Washington.
All of the usual inferential statements apply to our output. Thus we would say:
We analyzed the association between cerebral atrophy, age, and sex by running a linear regression of cerebral atrophy modeled as a continuous variable on age - modeled as a continuous variable - and sex - modeled as a binary variable. We allowed for unequal variances among groups by computing robust standard error estimates using the Huber-White procedure. We calculated 95% confidence intervals and p-values using these same robust standard error estimates.
Based on this linear regression analysis, we estimate that for each one year increase in age, the mean cerebral atrophy score increases by 0.682 units. Based on a 95% confidence interval, the data suggest that this estimate is not unreasonable if the true coefficient for age was in the interval from 0.509 to 0.854. We also estimate that males have an average cerebral atrophy score of 5.96 units higher than females. Based on a 95% confidence interval, the data suggest that this estimate is not unreasonable if the true coefficient for sex was in the interval from 4.227 to 7.700.
However, in this case we did not adjust for multiple comparisons in calculating the individual p-values. If we wanted to make inferential claims using these, we would have to use a correction. We could also use a multiple-partial F-test to test both coefficients simultaneously.
In normal linear regression, we are comparing the mean of the response variable across groups defined by the predictors. However, if we were to log transform the response, we would be comparing the geometric mean across groups. In regress()
, we simply have to use the "geometric mean"
functional.
Using the same function, and the same syntax, we can also run generalized linear regression. For example, if we wanted to examine the odds of having diabetes between males and females, we would run a logistic regression.
regress("odds", diabetes ~ male, data = mri)
##
## Call:
## regress(fnctl = "odds", formula = diabetes ~ male, data = mri)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5593 -0.5593 -0.3823 -0.3823 2.3034
##
## Coefficients:
##
## Raw Model:
## Estimate Naive SE Robust SE F stat df
## [1] Intercept -2.580 0.2034 0.2037 160.39 1
## [2] male 0.8037 0.2519 0.2522 10.15 1
## Pr(>F)
## [1] Intercept < 0.00005
## [2] male 0.0015
##
## Transformed Model:
## e(Est) e(95%L) e(95%H) F stat df
## [1] Intercept 0.07580 0.05082 0.1131 160.39 1
## [2] male 2.234 1.361 3.665 10.15 1
## Pr(>F)
## [1] Intercept < 0.00005
## [2] male 0.0015
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 501.59 on 734 degrees of freedom
## Residual deviance: 490.82 on 733 degrees of freedom
## AIC: 494.82
##
## Number of Fisher Scoring iterations: 5
In all of the generalized linear regression output (and in the output from regression on the geometric mean), by default we see two tables. The Raw Model
table displays coefficients and standard errors for the model after we have transformed the response variable, but we have not transformed back. Recall that in most generalized linear regression cases, we need to back-transform our results to get them in the original units. This is due to using a link function to model the regression. If you want to suppress printing this table, set suppress = TRUE
.
The Transformed Model
table does the back-transform for you. It exponentiates all of the coefficients and the confidence intervals so that they are in the original units.
If we were doing survival analysis, we would perhaps want to look at the relationship between time to death, age, and atrophy. First, we need to create a Surv
variable to reflect time to death, because some observations have been censored - we didn’t observe their death due to the end of the study or because we lost them to followup. To create a survival variable we have to load the appropriate package first.
library(survival)
mri$ttodth <- Surv(mri$obstime, mri$death)
Now that we have our time to death variable, we can examine our relationship of interest.
regress("hazard", ttodth ~ age + atrophy, data = mri)
## Call:
## regress(fnctl = "hazard", formula = ttodth ~ age + atrophy, data = mri)
##
## n= 735, number of events= 133
##
## Raw Model:
## Estimate Naive SE Robust SE F stat df Pr(>F)
## [1] age 0.0492 0.0143 0.0156 9.95 1 0.0017
## [2] atrophy 0.0227 0.00665 0.00692 10.73 1 0.0011
##
## Transformed Model:
## e(Est) e(95%L) e(95%H) F stat df Pr(>F)
## [1] age 1.05 1.02 1.08 9.95 1 0.0017
## [2] atrophy 1.02 1.01 1.04 10.73 1 0.0011
##
## Concordance= 0.638 (se = 0.026 )
## Rsquare= 0.045 (max possible= 0.902 )
## Likelihood ratio test= 33.52 on 2 df, p=5.273e-08
## Wald test = 38.96 on 2 df, p=3.469e-09
## Score (logrank) test = 38.38 on 2 df, p=4.636e-09
We can analyze the results from the Transformed Model
table similar to the results in the Linear Regression section, because this table returns us to the original units.
There are three special functions in uwIntroStats
which allow us to re-parameterize variables:
dummy
- create dummy variableslspline
- create linear splinespolynomial
- create a polynomialEach of these three functions is used to great effect in regress()
. Also, each will give a multiple-partial F-test of the entire variable. This allows you to determine if the variable should be included in the model, rather than having only the coefficient estimates.
For example, we can model race as dummy variables to examine the differences in the odds of having diabetes between races. This allows us to better make comparisons, because modeling as dummy variables essentially creates indicator variables against a reference group.
regress("odds", diabetes ~ dummy(race), data = mri)
##
## Call:
## regress(fnctl = "odds", formula = diabetes ~ dummy(race), data = mri)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6165 -0.4539 -0.4539 -0.4539 2.3459
##
## Coefficients:
##
## Raw Model:
## Estimate Naive SE Robust SE F stat df
## [1] Intercept -2.221 0.1407 0.1411 247.78 1
## dummy(race) 2.11 3
## [2] race.2 0.6568 0.2949 0.2957 4.93 1
## [3] race.3 -0.4648 0.6131 0.6147 0.57 1
## [4] race.4 0.6113 0.7873 0.7894 0.60 1
## Pr(>F)
## [1] Intercept < 0.00005
## dummy(race) 0.0977
## [2] race.2 0.0267
## [3] race.3 0.4498
## [4] race.4 0.4390
##
## Transformed Model:
## e(Est) e(95%L) e(95%H) F stat df
## [1] Intercept 0.1085 0.08227 0.1432 247.78 1
## dummy(race) 2.11 3
## [2] race.2 1.929 1.079 3.446 4.93 1
## [3] race.3 0.6282 0.1879 2.100 0.57 1
## [4] race.4 1.843 0.3912 8.681 0.60 1
## Pr(>F)
## [1] Intercept < 0.00005
## dummy(race) 0.0977
## [2] race.2 0.0267
## [3] race.3 0.4498
## [4] race.4 0.4390
##
## Dummy terms calculated from race, reference = 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 501.59 on 734 degrees of freedom
## Residual deviance: 495.55 on 731 degrees of freedom
## AIC: 503.55
##
## Number of Fisher Scoring iterations: 5
First, notice that below the table regress()
tells us how the dummy variables were calculated. In this case the reference was 1, corresponding to white
in this data set. Next we see the multiple-partial F-test, which is on its own line for dummy(race)
. The coefficient estimates are nested beneath this line to indicate that these coefficients all come from the same variable (race
) but we have modeled them as three variables.
As we mentioned above, the formula in regress()
allows you to specify multiple-partial F-tests. This comes in handy if you want to test a subset of variables all at once in your regression.
For example, say that we are interested in the relationship between atrophy, age, sex, and race. However, we also want to include the sex-age interaction and the sex-race interaction. We also want to model race as dummy variables. Last, we want to determine if all of the variables involved in the sex-age interaction (male
, age
, and the interaction) should be in the model, and similar for the sex-race interaction. We use the U()
function to accomplish this goal.
regress("mean", atrophy ~ U(ma = ~male*age) + U(mr = ~male*dummy(race)), data = mri)
##
## Call:
## regress(fnctl = "mean", formula = atrophy ~ U(ma = ~male * age) +
## U(mr = ~male * dummy(race)), data = mri)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.682 -8.236 -0.236 7.158 52.546
##
## Coefficients:
## Estimate Naive SE Robust SE 95%L
## [1] Intercept -8.448 8.881 9.005 -26.13
## ma
## [2] male -11.40 12.19 12.88 -36.69
## [3] age 0.5558 0.1193 0.1208 0.3187
## [4] male:age 0.2414 0.1632 0.1730 -0.09826
## mr
## male -11.40 12.19 12.88 -36.69
## dummy(race)
## [5] race.2 -1.003 1.796 1.825 -4.585
## [6] race.3 1.713 2.463 2.067 -2.345
## [7] race.4 2.070 6.040 8.518 -14.65
## male:dummy(race)
## [8] male:race.2 -2.202 2.560 2.488 -7.086
## [9] male:race.3 -2.883 3.664 3.688 -10.12
## [10] male:race.4 -7.558 7.415 9.366 -25.95
## 95%H F stat df Pr(>F)
## [1] Intercept 9.231 0.88 1 0.3485
## ma 34.35 3 < 0.00005
## [2] male 13.89 0.78 1 0.3764
## [3] age 0.7929 21.18 1 < 0.00005
## [4] male:age 0.5811 1.95 1 0.1634
## mr 1.04 7 0.4038
## male 13.89 0.78 1 0.3764
## dummy(race) 0.40 3 0.7539
## [5] race.2 2.579 0.30 1 0.5826
## [6] race.3 5.770 0.69 1 0.4076
## [7] race.4 18.79 0.06 1 0.8081
## male:dummy(race) 0.61 3 0.6084
## [8] male:race.2 2.682 0.78 1 0.3764
## [9] male:race.3 4.357 0.61 1 0.4346
## [10] male:race.4 10.83 0.65 1 0.4200
##
## Dummy terms calculated from race, reference = 1
##
## Residual standard error: 12 on 725 degrees of freedom
## Multiple R-squared: 0.1488, Adjusted R-squared: 0.1382
## F-statistic: 12.15 on 9 and 725 DF, p-value: < 2.2e-16
We have also made use of functionality in U()
which allows us to name the groups for the multiple-partial F-tests. Thus we can see that the F-statistic for simultaneously testing whether male
, age
, and the interaction male:age
are equal to zero is 34.35, with the correct degrees of freedom for the test (3) and a small p-value. We would conclude (without adjusting for multiple comparisons) that (some of) these variables should be in the model. On the other hand, we see that the F-statistic for simultaneously testing whether male
, race
, and the interaction are equal to zero is 1.04. We would conclude (again without adjusting for multiple comparisons) that we cannot reject the null hypothesis that these are all equal to zero.
We also get individual coefficient estimates for each, where we have again nested the estimates within their corresponding groups defined by calls to U()
. Note that we have repeated the line for male
since it appears in both groups. Other than that, the coefficients table is the same as it would be if we had run the formula atrophy ~ male*age + male*dummy(race)
.