This document is meant to give the user an introduction to R, illustrate the use of some of the major `uwIntroStats`

functions, and go through an example data analysis. The examples will be shown with R code and output. For more options, see the help files provided with the package.

We also assume that the user is using RStudio, found here - however, most of the output should be the same regardless of the graphical user interface. Here we introduce the format for each section and subsection below:

*Goal:* Load the required packages (first install them if you have not already done so).

*Function(s):* The `library()`

and `attach()`

functions (and `install.packages()`

if we need to install a package).

*Interpretation:* `uwIntroStats`

uses some functions from other packages, and builds upon others. The packages that these functions come from are: `Exact`

, `plyr`

, `geepack`

, `sandwich`

, and `survival`

. They must be installed for `uwIntroStats`

to work. However, you do not need to load them for the functions in `uwIntroStats`

to work. Some, though, are not at their full potential unless one of these is loaded.

`library(uwIntroStats)`

```
##
## Attaching package: 'uwIntroStats'
##
## The following object is masked from 'package:base':
##
## tabulate
```

You must load `uwIntroStats`

each time that you start a new R session, but you only have to install packages once. If you do not install one of the packages, R will give you errors.

`fev`

dataset*Goal:* We want to load the `fev`

dataset into R for our analysis.

*Functions:* We will be working with the `fev`

dataset. To prepare that (we will see why we name this `fevDat`

later):

```
fevDat <- read.table("http://www.emersonstatistics.com/Datasets/fev.txt", header = TRUE)
attach(fevDat)
```

We use the `read.table()`

function, the `attach()`

function, and the `View()`

function. *Interpretation:* We have now loaded the data into R, and can use it in an analysis. The reference manual for the `fev`

dataset can be found on here. Make sure that we have read in the data correctly by using the `View(fev)`

function. This will bring up a window showing the data. Everything looks good, so we can continue. *Note: Generally it is good practice to clean up the data immediately upon reading it in. However, since we are introducing methods, we will go through data cleanup in the following sections.*

*Goal:* Describe the variables in the dataset.

*Function(s):* The `descrip()`

function will do all of this for us.

*Interpretation:* If we wanted to na"ively get a description of the whole dataset, we could type

`descrip(fevDat)`

```
## N Msng Mean Std Dev Min 25% Mdn
## seqnbr: 654 0 327.5 188.9 1.000 164.2 327.5
## subjid: 654 0 37170 23691 201.0 15811 36071
## age: 654 0 9.931 2.954 3.000 8.000 10.00
## fev: 654 0 2.637 0.8671 0.7910 1.981 2.547
## height: 654 0 61.14 5.704 46.00 57.00 61.50
## sex: 654 0 1.486 0.5002 1.000 1.000 1.000
## smoke: 654 0 1.901 0.2994 1.000 2.000 2.000
## 75% Max
## seqnbr: 490.8 654.0
## subjid: 53638 90001
## age: 12.00 19.00
## fev: 3.119 5.793
## height: 65.50 74.00
## sex: 2.000 2.000
## smoke: 2.000 2.000
```

Notice that we get output for `seqnbr`

and `subjid`

. If we read the documentation for the FEV dataset, we would know that these two variables are label variables, and thus none of the descriptives are of interest. For example, the standard deviation for `subjid`

is 23691, because we have 654 patients with different subject ids. Thus we can ignore these two labelling variables for descriprive purposes. However they, and all of the other variables, show us that there are no missing data as well. Now if we are interested in seeing that all subjects were unique:

```
cnt <- length(unique(subjid))
print(cnt)
```

`## [1] 654`

`cnt`

`## [1] 654`

We already knew that there were 654 observations, and now that we know `cnt`

is 654, we know that all of the observations are unique (in other words, no subject was counted twice). Notice that here we typed two different things to display `cnt`

. This illustrates that when we type `cnt`

into R, it actuall calls `print(cnt)`

. That is why R is called a *functional* language.

We might also want to check for outliers within the distribution of a single variable. Look at the distribution of `age`

:

`descrip(age)`

```
## N Msng Mean Std Dev Min 25% Mdn
## age: 654 0 9.931 2.954 3.000 8.000 10.00
## 75% Max
## age: 12.00 19.00
```

The minimum value is 3. This might be odd, since why would a 3 year old be enrolled in a smoking study? A 3 year old is incredibly unlikely to smoke. How can we tell if this is an outlier? If we look at the mean and the median, we see that they are quite close together - 9.931 and 10, respectively. A large difference between the mean and median would indicate the presence of an outlier, because the mean is heavily influenced by outliers. To illustrate, notice that the formula for the mean \[ \frac{1}{n}\sum_{i=1}^n X_i, \] where the \(X_i\) are each personâ€™s age in this example, is heavily influenced by a single point. If we made one age 100, then the mean would be pulled to the right. The median, however, is robust to outliers. The middle value will not change if we add in an age of 100.

Now letâ€™s look closer at `sex`

and `smoke`

.

`descrip(sex, smoke)`

```
## N Msng Mean Std Dev Min 25% Mdn
## sex: 654 0 1.486 0.5002 1.000 1.000 1.000
## smoke: 654 0 1.901 0.2994 1.000 2.000 2.000
## 75% Max
## sex: 2.000 2.000
## smoke: 2.000 2.000
```

Each of these variables takes on either a 1 or a 2. Notice that the mean of `sex`

is 1.486. The interpretation of this value is that 48.6% of the values in `sex`

are 2. Similarly, 90.1% of the values in `smoke`

are 2. This is easy to see, because if all of the values were 1 then the mean would be 1, and if one value is 2 it pulls the mean up. However, having the values coded as 1â€™s and 2â€™s is not very useful to us, because we have to remember the documentation each time we look at the data to remember whether `sex = 1`

denotes a male or a female. Thus good practice is to create *indicator variables* to reflect the actual data. Usually we say that a 1 is the trait that we want (i.e.Â if we make a variable called `male`

, then a 1 will reflect a male) while a 0 reflects the opposite. Now in the FEV dataset, we know from the documentation that `sex`

is coded with 1 = male, so we can recode as follows:

`male <- ifelse(sex == 1, 1, 0)`

which is simply saying that if `sex`

is a 1 at a given position, set `male`

at that position equal to 1. If `sex = 2`

, set `male = 0`

. We want to create a similar indicator variable out of `smoke`

. In this case we want to know if someone smokes or not, so our indicator variable will take on a 1 if they smoke and a 0 if not. In this data, `smoke = 2`

for a nonsmoker, so we have to be careful when recoding:

`smoker <- ifelse(smoke == 2, 0, 1)`

Our data now have a much more intuitive interpretation. It is always good practice to clean up your data before running any analysis. If we run `descrip()`

again we can see this:

`descrip(male, smoker)`

```
## N Msng Mean Std Dev Min 25% Mdn
## male: 654 0 0.5138 0.5002 0.0000 0.0000 1.000
## smoker: 654 0 0.09939 0.2994 0.0000 0.0000 0.0000
## 75% Max
## male: 1.000 1.000
## smoker: 0.0000 1.000
```

See now that the means have changed. In fact, we see that 51% of the data in `sex`

are males, which is a much more informative description than before. However, in practice, the only appropriate descriptives from this output are the count and the number of missing observations.

*Goal:* We want to check for bimodality in any of the variables.

*Function(s):* We use histograms, which plot the frequency of each unique value in the data (or places the data into bins decided by the function). The `hist()`

function does all of this.

*Interpretation:* We can add the arguments `main="maintitle"`

, `xlab="x label"`

, and `ylab="y label"`

to the `hist()`

function to set a title, x-axis label, and y-axis label, respectively. By default R will set the x and y-axis labels to the variable(s) entered. Notice that we are allowed to use the name `fev`

here because we have named the dataset `fevDat`

. If we had named it `fev`

, we would be in trouble because R would attempt to create a histogram of the whole dataset.

`hist(age, main = "Age")`

`hist(height, main = "Height")`

`hist(fev, main = "FEV")`

There is no bimodality in the data (each variable has only one peak) and thus we do not need to account for this in our analysis.

*Goal:* Sometimes we believe that a variable has an effect on another. For example, we believe that sex has an effect on height. To see this effect, we need to stratify the height variable - that is, examine height in each sex individually.

*Function(s):* The `descrip()`

function allows us to do this easily. We can stratify on any number of variables by adding them to the `strata`

argument.

*Interpretation:*

`descrip(age, height, strata = male)`

```
## N Msng Mean Std Dev Min 25%
## age: All 654 0 9.931 2.954 3.000 8.000
## age: Str 0 318 0 9.843 2.933 3.000 8.000
## age: Str 1 336 0 10.01 2.976 3.000 8.000
## height: All 654 0 61.14 5.704 46.00 57.00
## height: Str 0 318 0 60.21 4.792 46.00 57.50
## height: Str 1 336 0 62.03 6.331 47.00 57.00
## Mdn 75% Max
## age: All 10.00 12.00 19.00
## age: Str 0 10.00 12.00 19.00
## age: Str 1 10.00 12.00 19.00
## height: All 61.50 65.50 74.00
## height: Str 0 61.00 63.50 71.00
## height: Str 1 62.00 67.50 74.00
```

This produces the same descriptive statistics as our call to `descrip()`

above, but this time breaks down by sex category.

*Goal:* We want a graphical representation of the data with the mean, standard deviation, range, and potential outliers displayed.

*Function(s):* The `bplot()`

function builds on the base R function `boxplot()`

. Now we can overlay data, means, and standard deviations onto the boxplot.

*Interpretation* First, the na"ive box plot for both age and height by sex - which means use the `male`

variable (we only present code for the last plot, with the default values):

However, we can add jittered data to make this plot more informative. Jittered data are created by taking the raw data and adding a tiny bit of random â€śnoiseâ€ť in both the *x* and *y* dimension, so that the points are more easily distinguished from each other. The `bplot()`

function does this by default, though we can turn off one of the dimensions if we choose. We can also set the x-axis labels:

We can also add the mean and standard deviation onto our box plot (`bplot()`

does this by default as well) to get the maximum information from the plot:

`bplot(age, strata = male, main = "Age by Sex", xlab = "Sex (0 denotes Female, 1 denotes Male)")`

`bplot(height, strata = male, main = "Height by Sex", xlab = "Sex (0 denotes Female, 1 denotes Male)")`

Similar to the `bplot()`

function, the `scatter()`

function defaults to include some useful information. However, we will go through the derivation of the default values, so the only code shown will be the last (with defaults). The most basic version of the scatterplot does not differentiate between ties in the data (and thus we see much less than the 654 points on the plot):

However, we like to jitter the data so that we can see all of the points:

This allows us to break the ties - the places where multiple data points lie on the same *(x,y)* coordinate point - and see all of the data, but still see the discreteness in the sampling. If we wanted to see a line of best fit, we could create one with the `plotLsfit`

argument to our `scatter()`

function:

If we look closely at the graph, the data doesnâ€™t look quite linear. We can solve this issue by plotting a lowess smooth, which will draw attention to the curves in the data:

`scatter(height, age)`

There might be potential confounders in the data. For instance, if smoking stunts growth, and only older kids smoke, then at higher ages we will have two populations separated by some vertical distance, while at lower ages we have only one population. It is generally useful to plot a stratified scatterplot with lowess smooths to check for confounders. By default, `scatter()`

will use different colors for the strata. We can also label the axes and/or the graph:

`scatter(height, age, strata = smoker, main = "Height vs Age by Smoking Status")`

The legend gives us the color and type of point plotted for each stratum. Now if we wanted to see the lowess smooths for males and females, we can add them as a stratification variable:

`scatter(height, age, strata = cbind(smoker, male), main = "Height vs Age by Smoking Status")`

Now notice that we can compare groups both across sex (`sex 1`

is male and `sex 2`

is female) or across smoking.

Convinced that smoking is not too much of a confounder, we now look at age vs fev. If we do so first with no lowess smooth, it is hard to tell what the trend in the data is:

`scatter(fev, age, plotLowess = FALSE, main = "FEV vs Age")`

Any pattern is incredibly hard to find. We might think that the data is linear, but putting a lowess smooth on the plot shows us differently:

`scatter(fev, age, main = "FEV vs Age")`

Now we can see a hint of an S-shaped trend in the data. We believe this, because if we believe that FEV is actually best predicted by height, then at young ages height is roughly linear with age but at older ages height is flat. In order to really understand what is going on, we must look at the relationship between FEV and height:

`scatter(fev, height, main = "FEV vs Height")`

This is definitely not a linear relationship. In fact, it looks like a cubic relationship. Now if we look at the plot stratified by sex, we donâ€™t see too much of a sex effect:

`scatter(fev, height, strata = male, main = "FEV vs Height by Sex")`

If we create a new variable called `htcub`

(for height cubed) we can see that there is more of a straight line relationship between cubed height and FEV:

```
htcub <- height ^ 3
scatter(fev, htcub, main = "FEV by Cubed Height")
```

There is still heteroscedasticity, however. To remedy this, we can try plotting the cube root of FEV vs Height:

```
cubrtfev <- fev ^ (1/3)
scatter(cubrtfev, height, main = "Cube Root of FEV by Height")
```

`fev`

dataset revisitedNow that we know all of these methods, what would we do if presented with the `fev`

dataset? All of our data cleanup would happen immediately when we read in the data, based on the documentation and what we can see in R.

*Goal:* Read in the `fev`

dataset and make it ready for use and analysis.

*Function(s):* We first read in the dataset, renaming it as `fevDat`

because we know that one of the variables in the data is named `fev`

and we donâ€™t want confusion when we attach the data:

```
fevDat <- read.table("http://www.emersonstatistics.com/Datasets/fev.txt", header = TRUE)
attach(fevDat)
```

```
## The following objects are masked from fevDat (pos = 3):
##
## age, fev, height, seqnbr, sex, smoke, subjid
```

Now we view the data:

`View(fevDat)`

After noticing that `sex`

and `smoke`

are not coded as indicator variables, we recode them (using the documentation online for reference):

```
male <- ifelse(sex == 1, 1, 0)
smoker <- ifelse(smoke == 2, 0, 1)
```

We also would know which variables are likely confounders or effect modifiers before starting the data analysis. Good practice in statistics dictates the choice of these variables before analysis, so that we are not using our data for exploratory purposes when we want to do inference.

*Interpretation:* Now we have loaded the dataset, cleaned it up, and know what our confounding variables and effect modifiers are. Now we are ready to create plots and run our analysis.