## Computing correlation in R

*Author: Lenka Fiřtová*

This article explains how to compute the correlation coefficient between two variables, and the correlation matrix between multiple variables.

## How to compute the correlation coefficient

In our calculations, we are going to use the *trees* dataset, which is built into *R*. This dataset contains information about 31 trees, namely their girth, height and volume (of wood). First, let's take a look at the first few rows.

    > head(trees)
      Girth Height Volume
    1   8.3     70   10.3
    2   8.6     65   10.3
    3   8.8     63   10.2
    4  10.5     72   16.4
    5  10.7     81   18.8
    6  10.8     83   19.7

To compute correlation in *R*, we use the *cor* function.

If we want to compute the correlation of two variables (for example, the girth of the trees and their height), we simply pass these two variables to the *cor* function. A variable can be referenced either as the name of the dataset, a dollar sign ($) and the name of the variable, or alternatively as the name of the dataset followed by [ , number of the column]. No other argument is needed.

By default, this function computes the so-called Pearson's correlation coefficient, which is the correlation coefficient we usually have in mind when talking about "correlation". It is the covariance of the variables divided by the product of their standard deviations. The *cor* function can also compute Spearman's rank correlation coefficient and Kendall's correlation coefficient, which are, however, not the subject of this article.

    > cor(trees$Girth, trees$Height)
    [1] 0.5192801

Alternatively:

    > cor(trees[ , 1], trees[ , 2])
    [1] 0.5192801

As we would expect, the correlation is positive: taller trees tend to have a larger girth.
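As a quick check of what *cor* computes by default, Pearson's coefficient can be reproduced by hand as the covariance divided by the product of the two standard deviations. The following is only a minimal sketch; the names `x`, `y`, `r_manual` and `r_builtin` are illustrative and not part of the dataset.

```r
# Pearson's r computed by hand from covariance and standard deviations
x <- trees$Girth
y <- trees$Height

r_manual  <- cov(x, y) / (sd(x) * sd(y))  # definition of Pearson's r
r_builtin <- cor(x, y)                    # default method is "pearson"

print(r_manual)
print(r_builtin)

# The two values should agree up to floating-point tolerance
print(all.equal(r_manual, r_builtin))
```

Both `cov` and `sd` use the same n - 1 denominator, so it cancels out and the ratio matches the result of *cor*.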

A problem may arise when the dataset contains missing values (*NA*). Let us create a new dataset, *trees2*, into which we add a new row using the *rbind* function. This row will contain a missing value in the *Girth* column (the values of the remaining variables are just made up).

    > trees2 = rbind(trees, c(NA, 90, 60))

As expected, the *cor* function cannot compute the coefficient. It does not raise an error, but returns *NA*.

    > cor(trees2$Girth, trees2$Height)
    [1] NA

Therefore, in case of missing observations, we have to specify that only complete observations (i.e. those without any missing values) should be used. This is done by adding one more argument when calling the function: *use = "complete.obs"*.

    > cor(trees2$Girth, trees2$Height, use = "complete.obs")
    [1] 0.5192801
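To make it clearer what using only complete observations means, the sketch below counts the missing values and then drops the incomplete rows by hand before calling *cor*. It assumes the same made-up row as above; the names `ok`, `r_filtered` and `r_complete` are illustrative.

```r
# Recreate the dataset with one made-up row containing a missing girth
trees2 <- rbind(trees, c(NA, 90, 60))

# Number of missing values in each column
print(colSums(is.na(trees2)))

# TRUE for rows where both girth and height are present
ok <- complete.cases(trees2$Girth, trees2$Height)

# Correlation on the manually filtered rows
r_filtered <- cor(trees2$Girth[ok], trees2$Height[ok])

# Correlation with the built-in handling of missing values
r_complete <- cor(trees2$Girth, trees2$Height, use = "complete.obs")

print(all.equal(r_filtered, r_complete))
```

Because the added row is the only incomplete one, both approaches use the original 31 trees and give the same coefficient as before.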

## Correlation matrix

Let us go back to the original dataset, *trees*. We want to display the correlation coefficients for each pair of variables at the same time. To do this, we simply enter more variables into the *cor* function, or even the whole dataset (when it only contains numeric variables).

    > cor(trees)
               Girth    Height    Volume
    Girth  1.0000000 0.5192801 0.9671194
    Height 0.5192801 1.0000000 0.5982497
    Volume 0.9671194 0.5982497 1.0000000

If the *trees* dataset contained another, non-numeric variable (for example the location of the trees), we would have to specify that we only want to use the first three columns:

    > cor(trees[ , 1:3])
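When the positions of the numeric columns are not known in advance, they can also be selected programmatically. The sketch below is one possible approach, not the only one; the dataset `trees3` and its `Location` column are hypothetical and filled with made-up values purely for illustration.

```r
# Hypothetical copy of the dataset with an extra non-numeric column
trees3 <- trees
trees3$Location <- rep(c("north", "south"), length.out = nrow(trees3))  # made-up values

# Logical vector marking which columns are numeric
numeric_cols <- sapply(trees3, is.numeric)
print(numeric_cols)

# Correlation matrix of the numeric columns only
m <- cor(trees3[ , numeric_cols])
print(m)
```

This gives the same matrix as `cor(trees)`, because the non-numeric column is left out before the computation.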

When there are more than two variables, the *cor* function returns the so-called correlation matrix. On its main diagonal the elements are equal to one (the correlation of the variable with itself), the other elements are the respective correlation coefficients. The matrix is symmetrical (the elements above and below the main diagonal are identical).

For example, we can see that the volume of the trees correlates more strongly with their girth (correlation of about 0.97) than with their height (correlation of about 0.60).
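The properties of the correlation matrix described above can be inspected directly in *R*. This is a small sketch using base functions; the name `m` is illustrative.

```r
# Store the correlation matrix so that it can be inspected
m <- cor(trees)

# Main diagonal: correlation of each variable with itself
print(diag(m))

# The matrix equals its own transpose
print(isSymmetric(m))

# A single coefficient can be picked out by row and column name
print(m["Volume", "Girth"])
print(m["Volume", "Height"])

# Rounding makes the matrix easier to read
print(round(m, 2))
```

Indexing by name rather than by position keeps the code readable and independent of the column order.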

## Testing the significance of the correlation coefficient

When we want to test whether the correlation coefficient is significant, we use the *cor.test* function. This test is used to find out if the correlation is “as high as it is just by chance” (i.e. in our specific sample), or if we can make a general conclusion that there is indeed a non-zero correlation in the whole population.

Let us explore the significance of the correlation coefficient between the variables *Girth* and *Height*.

    > cor.test(trees$Girth, trees$Height)

*R* returns the output shown below. The test-statistic value (*t*) is 3.2722. We could compare it with the critical value, but there is a simpler way. The function also displays the *p*-value, so we can compare the *p*-value with the significance level, which is usually set to 0.05. If the *p*-value is smaller than 0.05 (as is true in our case), we reject the null hypothesis of zero correlation and conclude that there is a statistically significant linear relationship between the variables.

    data:  trees$Girth and trees$Height
    t = 3.2722, df = 29, p-value = 0.002758
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     0.2021327 0.7378538
    sample estimates:
          cor
    0.5192801

We can see that 0.002758 is smaller than 0.05. The girth and the height of the trees are significantly correlated.
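The individual numbers from the test can also be extracted from the object returned by *cor.test*, and the test statistic can be recomputed by hand. The sketch below assumes the default two-sided Pearson test; the names `res`, `r`, `n`, `t_manual` and `p_manual` are illustrative.

```r
# Store the test result instead of only printing it
res <- cor.test(trees$Girth, trees$Height)

print(res$estimate)   # sample correlation coefficient
print(res$statistic)  # t statistic
print(res$parameter)  # degrees of freedom
print(res$p.value)    # p-value of the two-sided test
print(res$conf.int)   # confidence interval for the correlation

# Recompute the t statistic from r and the sample size:
# t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom
r <- unname(res$estimate)
n <- nrow(trees)
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_manual), df = n - 2)

print(t_manual)
print(p_manual)

# Decision at the 0.05 significance level
print(res$p.value < 0.05)
```

Storing the result is convenient when the *p*-value or the confidence interval is needed in further calculations rather than just read from the console.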