In the uwIntroStats package, we have set out to make regression and analysis easier by letting the user specify the summary measure (the functional) being modeled and by allowing user-specified multiple-partial F-tests directly in the regression formula.

We do this with a single overarching function, regress(). The basic arguments to this function, which unlock all of its potential, are fnctl (the functional), formula, data, and, optionally, id.
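
A schematic call looks like this (not run; y, x1, x2, mydata, and subject are hypothetical placeholders):

## 'fnctl' names the functional, 'formula' and 'data' work as in lm(),
## and the optional 'id' identifies clusters and requests a GEE fit
regress(fnctl = "mean", formula = y ~ x1 + x2, data = mydata, id = subject)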

We use the concept of a functional to handle our first goal. A functional takes a function as its argument and returns a number - hence the mean is a functional, because it takes a distribution as its argument and returns a single number. The functionals allowed by regress() are:

Functional          Type of Regression                                     Previous command (package)
"mean"              Linear Regression                                      lm() (stats - base R)
"geometric mean"    Linear Regression on logarithmically transformed Y     lm(), with Y log transformed (stats - base R)
"odds"              Logistic Regression                                    glm(family = binomial) (stats - base R)
"rate"              Poisson Regression                                     glm(family = poisson) (stats - base R)
"hazard"            Proportional Hazards Regression                        coxph() (survival)

The formula to regress() is the same as a formula given to lm() or any of the other regression commands from base R, survival, or geepack, but with one small addition. To address our second goal of allowing the user to specify multiple-partial F-tests, we have added a special function - U() - which can be added to the formula. The U() function is documented more fully in “User Specified Multiple-Partial F-tests in Regression”.
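
As a preview (a full worked example appears in the last section of this vignette), a U() term names a group of predictors so that they are tested jointly with a multiple-partial F-test:

## output omitted; 'ma' is simply a label we chose for the grouped terms
regress("mean", atrophy ~ U(ma = ~male*age), data = mri)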

The data argument is exactly the same as that in lm() or any of the other regression commands.

Last, id allows the user to fit a generalized estimating equations (GEE) model while using the same syntax as any of the functionals to regress(). The GEE framework is a useful way to model correlated data, which often comes in the form of repeated measurements.
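
To make the correspondence with base R concrete, the first call below (using the mri data introduced in the next section) should fit the same underlying logistic regression model as the second; only the presentation of the output and the default (robust) standard errors differ:

## the "odds" functional in regress() ...
regress("odds", diabetes ~ male, data = mri)
## ... should fit the same model as the base-R call
glm(diabetes ~ male, family = binomial, data = mri)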

Linear Regression

As a first example, we run a linear regression analysis of atrophy on age and male, from the mri data. This dataset is included in the uwIntroStats package, and its documentation can be found on Scott Emerson's website.

## Preparing our R session
library(uwIntroStats)
## 
## Attaching package: 'uwIntroStats'
## 
## The following object is masked from 'package:base':
## 
##     tabulate
data(mri)
regress("mean", atrophy ~ age + male, data = mri)
## 
## Call:
## regress(fnctl = "mean", formula = atrophy ~ age + male, data = mri)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.765  -8.582  -0.356   7.344  52.100 
## 
## Coefficients:
##                  Estimate  Naive SE  Robust SE    95%L      95%H     
## [1] Intercept     -17.83     6.081     6.557       -30.70    -4.959  
## [2] age            0.6819   0.08129   0.08769       0.5097    0.8540 
## [3] male           5.964     0.8857    0.8845       4.227     7.700  
##                     F stat    df Pr(>F)   
## [1] Intercept            7.40 1    0.0067 
## [2] age                 60.47 1  < 0.00005
## [3] male                45.46 1  < 0.00005
## 
## Residual standard error: 12 on 732 degrees of freedom
## Multiple R-squared:   0.14,  Adjusted R-squared:  0.1376 
## F-statistic: 52.18 on 2 and 732 DF,  p-value: < 2.2e-16

This call automatically prints the coefficients table. First, notice that by default robust standard error estimates (calculated using the sandwich package) are returned, in addition to the naive estimates. The robust estimates are also used to perform the inference - thus the confidence intervals, statistics, and p-values use these estimates of the standard error.

If we did not use robust standard error estimates, then in the case of linear regression we would be assuming that all groups have the same variance. Any inference we made could then be driven by differences in the variances rather than differences in the means - and the means are usually what we care about in linear regression.
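
For comparison, the naive standard errors reported above should match those from the corresponding base-R fit of the same model, which assumes equal variances across groups:

## classical linear regression fit of the same model (equal-variance assumption);
## its standard errors should match the "Naive SE" column above
summary(lm(atrophy ~ age + male, data = mri))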

F-statistics are also displayed by default. This allows us to display multiple-partial F-tests within the coefficients table, and is more in line with the teaching philosophy at the University of Washington.

All of the usual inferential statements apply to our output. Thus we would say:

We analyzed the association between cerebral atrophy, age, and sex by running a linear regression of cerebral atrophy (modeled as a continuous variable) on age (modeled as a continuous variable) and sex (modeled as a binary variable). We allowed for unequal variances among groups by computing robust standard error estimates using the Huber-White procedure. We calculated 95% confidence intervals and p-values using these same robust standard error estimates.

Based on this linear regression analysis, we estimate that for each one-year increase in age, the mean cerebral atrophy score increases by 0.682 units. Based on a 95% confidence interval, the data suggest that this estimate would not be unreasonable if the true coefficient for age were in the interval from 0.509 to 0.854. We also estimate that males have an average cerebral atrophy score 5.96 units higher than females. Based on a 95% confidence interval, the data suggest that this estimate would not be unreasonable if the true coefficient for sex were in the interval from 4.227 to 7.700.

However, in this case we did not adjust for multiple comparisons in calculating the individual p-values. If we wanted to make inferential claims using these, we would have to use a correction. We could also use a multiple-partial F-test to test both coefficients simultaneously.
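
Using the U() syntax introduced above, such a joint test of both coefficients could be requested directly in the formula (output omitted):

## 'joint' is just a label we chose; this requests one multiple-partial F-test of age and male together
regress("mean", atrophy ~ U(joint = ~age + male), data = mri)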

Regression on the Geometric Mean

In normal linear regression, we are comparing the mean of the response variable across groups defined by the predictors. However, if we were to log transform the response, we would be comparing the geometric mean across groups. In regress(), we simply have to use the "geometric mean" functional.
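
For example, a call of the following form would compare geometric mean atrophy across groups defined by age and sex (output omitted; note that the response must be strictly positive for the log transformation to be defined):

## regression on the geometric mean: the response is log-transformed internally,
## and the Transformed Model table reports back-transformed estimates
regress("geometric mean", atrophy ~ age + male, data = mri)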

Generalized Linear Regression

Using the same function, and the same syntax, we can also run generalized linear regression. For example, if we wanted to compare the odds of having diabetes between males and females, we would run a logistic regression.

regress("odds", diabetes ~ male, data = mri)
## 
## Call:
## regress(fnctl = "odds", formula = diabetes ~ male, data = mri)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5593  -0.5593  -0.3823  -0.3823   2.3034  
## 
## Coefficients:
## 
## Raw Model:
##                  Estimate  Naive SE  Robust SE       F stat    df
## [1] Intercept     -2.580     0.2034    0.2037           160.39 1 
## [2] male           0.8037    0.2519    0.2522            10.15 1 
##                  Pr(>F)   
## [1] Intercept    < 0.00005
## [2] male           0.0015 
## 
## Transformed Model:
##                  e(Est)    e(95%L)   e(95%H)         F stat    df
## [1] Intercept     0.07580   0.05082    0.1131           160.39 1 
## [2] male           2.234     1.361     3.665             10.15 1 
##                  Pr(>F)   
## [1] Intercept    < 0.00005
## [2] male           0.0015 
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 501.59  on 734  degrees of freedom
## Residual deviance: 490.82  on 733  degrees of freedom
## AIC: 494.82
## 
## Number of Fisher Scoring iterations: 5

In all of the generalized linear regression output (and in the output from regression on the geometric mean), by default we see two tables. The Raw Model table displays coefficients and standard errors on the transformed scale - that is, for the model of the link-transformed response, before we transform back. Recall that in most generalized linear regression cases we need to back-transform our results to get them into the original units; this is a consequence of using a link function in the model. If you want to suppress printing of this table, set suppress = TRUE.

The Transformed Model table does the back-transform for you. It exponentiates all of the coefficients and the confidence intervals so that they are in the original units.
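
For instance, the odds ratio reported for male in the Transformed Model table is just the exponentiated raw coefficient:

exp(0.8037)   ## approximately 2.234, matching e(Est) for male above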

Proportional Hazards Regression

If we were doing survival analysis, we might want to look at the relationship between time to death, age, and atrophy. First, we need to create a Surv variable to reflect time to death, because some observations are censored - we did not observe the death, either because the study ended first or because the subject was lost to follow-up. To create a survival variable we have to load the appropriate package first.

library(survival)
mri$ttodth <- Surv(mri$obstime, mri$death)

Now that we have our time to death variable, we can examine our relationship of interest.

regress("hazard", ttodth ~ age + atrophy, data = mri)
## Call:
## regress(fnctl = "hazard", formula = ttodth ~ age + atrophy, data = mri)
## 
##   n= 735, number of events= 133 
## 
## Raw Model:
##                Estimate  Naive SE  Robust SE       F stat    df Pr(>F)   
## [1] age          0.0492    0.0143    0.0156             9.95 1    0.0017 
## [2] atrophy      0.0227   0.00665   0.00692            10.73 1    0.0011 
## 
## Transformed Model:
##                e(Est)    e(95%L)   e(95%H)         F stat    df Pr(>F)   
## [1] age           1.05      1.02      1.08              9.95 1    0.0017 
## [2] atrophy       1.02      1.01      1.04             10.73 1    0.0011 
## 
## Concordance= 0.638  (se = 0.026 )
## Rsquare= 0.045   (max possible= 0.902 )
## Likelihood ratio test= 33.52  on 2 df,   p=5.273e-08
## Wald test            = 38.96  on 2 df,   p=3.469e-09
## Score (logrank) test = 38.38  on 2 df,   p=4.636e-09

We can interpret the results from the Transformed Model table in the same way as the results in the Linear Regression section, because this table returns us to the original units.

Accounting for Correlated Data

A common way to account for correlated data - for example longitudinal data, where we have multiple measurements on the same subjects over time - is to use the Generalized Estimating Equations (GEE) framework. We chose this framework for regress() because it is flexible and lets us keep the same syntax and functionals described above. To make use of this functionality, as we mentioned above, you simply need to specify the id argument in regress().

For an example, we turn to the salary data, again hosted on Scott Emerson's website. More information can be found in the documentation, but these data essentially deal with university-level faculty. In 1995, a survey asked the faculty to recall their salaries from past years. Each faculty member has records dating back either to their starting year at the university or to 1976, whichever is more recent. Thus we have repeated measurements on these individuals, and the variable of interest (salary) is highly correlated across measurements.

One interesting question to ask of these data involves discrimination based on sex. We will not model these data in the most sophisticated way (we leave that to the enterprising reader), but we can regress salary on sex and year, and include their interaction.

salary <- read.table("http://www.emersonstatistics.com/datasets/salary.txt", header = TRUE, stringsAsFactors = FALSE)

## create an indicator variable
salary$female <- ifelse(salary$sex == "F", 1, 0)
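
A quick check that the new indicator is coded as intended (this assumes sex takes values such as "F" and "M"):

## cross-tabulate the original sex variable against the new indicator
table(salary$sex, salary$female)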

Now that the data are read in, we can run the regression. First we run an ordinary linear regression, which does not properly account for the correlated data, and then we run the GEE-based regression.

## linear regression
regress("mean", salary ~ female*year, data = salary)
## 
## Call:
## regress(fnctl = "mean", formula = salary ~ female * year, data = salary)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3727.9  -907.4  -230.0   719.2  7605.5 
## 
## Coefficients:
##                    Estimate  Naive SE  Robust SE    95%L      95%H     
## [1] Intercept       -17417     176.5     167.6       -17745    -17088  
## [2] female            4102     420.7     299.5         3515      4689  
## [3] year             255.5     2.021     2.005        251.6     259.5  
## [4] female:year     -57.88     4.757     3.541       -64.82    -50.94  
##                       F stat    df Pr(>F)   
## [1] Intercept          10802.86 1  < 0.00005
## [2] female               187.59 1  < 0.00005
## [3] year               16235.87 1  < 0.00005
## [4] female:year          267.10 1  < 0.00005
## 
## Residual standard error: 1423 on 19788 degrees of freedom
## Multiple R-squared:  0.487,  Adjusted R-squared:  0.4869 
## F-statistic:  7166 on 3 and 19788 DF,  p-value: < 2.2e-16

From this call to regress(), it appears that we have 19792 independent observations (19788 residual degrees of freedom plus 4 estimated coefficients), which is quite a lot of faculty to have at a university. If we compare to the results from the GEE-based regression, we will see how far off we are.

## GEE
regress("mean", salary ~ female*year, id = id, data = salary)
## 
## Call:
## regress(fnctl = "mean", formula = salary ~ female * year, data = salary, 
##     id = id)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3727.9   -907.4   -230.0    719.2   7605.5  
## 
## Coefficients:
##                    Estimate  Std Err   95%L         95%H         Wald     
## [1] Intercept       -17417     283.6    -17973       -16861        3771.66
## [2] female            4102     528.7      3065         5138          60.20
## [3] year             255.5     3.536     248.6        262.5        5222.05
## [4] female:year     -57.88     6.370    -70.36       -45.39          82.56
##                    df Pr(>|W|) 
## [1] Intercept      1  < 0.00005
## [2] female         1  < 0.00005
## [3] year           1  < 0.00005
## [4] female:year    1  < 0.00005
## 
##  Estimated Scale Parameters: 
##             Estimate  Std.err
## (Intercept)  2024610 77997.57
## 
##  Correlation: Structure =  independence 
## 
##  Number of Clusters:  1597 
## 
##  Maximum Cluster Size:  20

Now regress() tells us that we only have 1597 faculty members, which is much more reasonable. Also, it tells us that the “maximum cluster size” - i.e. the largest number of observations on any one id - is 20. This makes sense due to the sampling scheme, where some faculty have been at the university for 20 years by the time the survey was taken.
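
We can confirm the cluster count directly from the id variable:

length(unique(salary$id))   ## should equal 1597, the number of clusters reported above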

While our estimates of the coefficients are the same between the two calls to regress(), our standard error estimates and inference are drastically different. When we overestimate how many independent observations we have (as in the naive linear regression), our standard error estimates are much smaller than they should be, and thus we get much larger F statistics and smaller p-values. While in this case our conclusions would be the same, since the p-values are on the same order of magnitude in both cases, failing to account for the correlation in these data is still a mistake.

Re-parameterizations of a Variable

There are three special functions in uwIntroStats which allow us to re-parameterize variables: dummy(), polynomial(), and lspline().

Each of these three functions is used to great effect in regress(). Each will also give a multiple-partial F-test of the entire re-parameterized variable, which allows you to determine whether the variable should be included in the model rather than looking only at the individual coefficient estimates.

For example, we can model race as dummy variables to examine the differences in the odds of having diabetes between races. This makes comparisons easier, because modeling a variable with dummy() essentially creates indicator variables against a reference group.

regress("odds", diabetes ~ dummy(race), data = mri)
## 
## Call:
## regress(fnctl = "odds", formula = diabetes ~ dummy(race), data = mri)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6165  -0.4539  -0.4539  -0.4539   2.3459  
## 
## Coefficients:
## 
## Raw Model:
##                    Estimate  Naive SE  Robust SE       F stat    df
## [1] Intercept       -2.221     0.1407    0.1411           247.78 1 
##     dummy(race)                                             2.11 3 
## [2]    race.2        0.6568    0.2949    0.2957             4.93 1 
## [3]    race.3       -0.4648    0.6131    0.6147             0.57 1 
## [4]    race.4        0.6113    0.7873    0.7894             0.60 1 
##                    Pr(>F)   
## [1] Intercept      < 0.00005
##     dummy(race)      0.0977 
## [2]    race.2        0.0267 
## [3]    race.3        0.4498 
## [4]    race.4        0.4390 
## 
## Transformed Model:
##                    e(Est)    e(95%L)   e(95%H)         F stat    df
## [1] Intercept        0.1085   0.08227    0.1432           247.78 1 
##     dummy(race)                                             2.11 3 
## [2]    race.2        1.929     1.079     3.446              4.93 1 
## [3]    race.3        0.6282    0.1879    2.100              0.57 1 
## [4]    race.4        1.843     0.3912    8.681              0.60 1 
##                    Pr(>F)   
## [1] Intercept      < 0.00005
##     dummy(race)      0.0977 
## [2]    race.2        0.0267 
## [3]    race.3        0.4498 
## [4]    race.4        0.4390 
## 
##  Dummy terms calculated from race, reference = 1 
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 501.59  on 734  degrees of freedom
## Residual deviance: 495.55  on 731  degrees of freedom
## AIC: 503.55
## 
## Number of Fisher Scoring iterations: 5

First, notice that below the table regress() tells us how the dummy variables were calculated. In this case the reference was 1, corresponding to white in this data set. Next we see the multiple-partial F-test, which appears on its own line for dummy(race). The coefficient estimates are nested beneath this line to indicate that they all come from the same variable (race), which we have modeled as three indicator variables.
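
Aside from the presentation and the robust standard errors, this fit should correspond to the base-R logistic regression with race treated as a factor, since factor() also creates indicator variables against the reference level (race = 1):

## base-R equivalent of modeling race with dummy variables
glm(diabetes ~ factor(race), family = binomial, data = mri)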

User-specified Multiple-partial F-tests

As we mentioned above, the formula in regress() allows you to specify multiple-partial F-tests. This comes in handy if you want to test a subset of variables all at once in your regression.

For example, say that we are interested in the relationship between atrophy, age, sex, and race. However, we also want to include the sex-age interaction and the sex-race interaction, and we want to model race as dummy variables. Last, we want to determine whether all of the variables involved in the sex-age interaction (male, age, and the interaction) should be in the model, and similarly for the sex-race interaction. We use the U() function to accomplish this goal.

regress("mean", atrophy ~ U(ma = ~male*age) + U(mr = ~male*dummy(race)), data = mri)
## 
## Call:
## regress(fnctl = "mean", formula = atrophy ~ U(ma = ~male * age) + 
##     U(mr = ~male * dummy(race)), data = mri)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.682  -8.236  -0.236   7.158  52.546 
## 
## Coefficients:
##                            Estimate  Naive SE  Robust SE    95%L     
## [1]  Intercept              -8.448     8.881     9.005       -26.13  
##      ma                                                              
## [2]    male                 -11.40     12.19     12.88       -36.69  
## [3]    age                   0.5558    0.1193    0.1208       0.3187 
## [4]    male:age              0.2414    0.1632    0.1730     -0.09826 
##      mr                                                              
##        male                 -11.40     12.19     12.88       -36.69  
##        dummy(race)                                                   
## [5]       race.2            -1.003     1.796     1.825       -4.585  
## [6]       race.3             1.713     2.463     2.067       -2.345  
## [7]       race.4             2.070     6.040     8.518       -14.65  
##        male:dummy(race)                                              
## [8]       male:race.2       -2.202     2.560     2.488       -7.086  
## [9]       male:race.3       -2.883     3.664     3.688       -10.12  
## [10]      male:race.4       -7.558     7.415     9.366       -25.95  
##                            95%H         F stat    df Pr(>F)   
## [1]  Intercept               9.231           0.88 1    0.3485 
##      ma                                     34.35 3  < 0.00005
## [2]    male                  13.89           0.78 1    0.3764 
## [3]    age                   0.7929         21.18 1  < 0.00005
## [4]    male:age              0.5811          1.95 1    0.1634 
##      mr                                      1.04 7    0.4038 
##        male                  13.89           0.78 1    0.3764 
##        dummy(race)                           0.40 3    0.7539 
## [5]       race.2             2.579           0.30 1    0.5826 
## [6]       race.3             5.770           0.69 1    0.4076 
## [7]       race.4             18.79           0.06 1    0.8081 
##        male:dummy(race)                      0.61 3    0.6084 
## [8]       male:race.2        2.682           0.78 1    0.3764 
## [9]       male:race.3        4.357           0.61 1    0.4346 
## [10]      male:race.4        10.83           0.65 1    0.4200 
## 
##  Dummy terms calculated from race, reference = 1 
## 
## Residual standard error: 12 on 725 degrees of freedom
## Multiple R-squared:  0.1488, Adjusted R-squared:  0.1382 
## F-statistic: 12.15 on 9 and 725 DF,  p-value: < 2.2e-16

We have also made use of functionality in U() which allows us to name the groups for the multiple-partial F-tests. Thus we can see that the F-statistic for simultaneously testing whether male, age, and the interaction male:age are equal to zero is 34.35, with the correct degrees of freedom for the test (3) and a small p-value. We would conclude (without adjusting for multiple comparisons) that (some of) these variables should be in the model. On the other hand, the F-statistic for simultaneously testing whether male, the race indicators, and the interactions are equal to zero is 1.04, on 7 degrees of freedom. We would conclude (again without adjusting for multiple comparisons) that we cannot reject the null hypothesis that these coefficients are all equal to zero.

We also get individual coefficient estimates for each term, nested within the corresponding groups defined by the calls to U(). Note that the line for male is repeated since it appears in both groups. Other than that, the coefficients table is the same as it would be if we had run the formula atrophy ~ male*age + male*dummy(race).
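
That is, stripped of the U() groupings, the same fit could have been obtained with the call below (output omitted); the difference is that the named joint tests for ma and mr would no longer appear in the coefficients table:

## same model without the user-specified multiple-partial F-tests
regress("mean", atrophy ~ male*age + male*dummy(race), data = mri)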