Data analysis often calls for testing whether two or more sets of numbers are related in some way (see Correlation_and_dependence). Here are some methods to test for or describe a relationship (usually a linear one) between random variables, or two sets of data.



Pearson's product-moment correlation test

This is a common test for a linear relationship between two variables which yields a correlation coefficient between 1 and -1. This test should give the same significance (p-value) as a simple linear regression on the same data.

  • Use cor.test in R.
  • Use scipy.stats.pearsonr or pandas.DataFrame.corr(method='pearson') in python

Simple linear regression

Linear regression is related to correlation and a sample correlation can be calculated as the square root of the R^2^ (Coefficient_of_determination), with the sign of the slope of the regression line (the coefficient of x).

  • Use lm in R
  • Use regress in MATLAB.
  • Numpy has polyfit and Scipy has linregress
  • See other notes here.


If the relationship is non-linear or variance is not normally distributed, these may be useful. Both of the tests below can be used with cor.test in R. Python pandas also has these tests in pandas.DataFrame.corr(method='spearman OR kendall') in python

Spearman's rank test

Ranks x and y datapoints and then does a correlation test between the ranked data. May demonstrate a correlation when a Pearson or linear model test do not.

Kendall's Tau

Looks at pairs of data points and tests how frequently the relationship between the pair goes in one direction or the other.

This paper uses Kendall's Tau for analyzing climate trends.

Corrections for significance

It can be hard to get a good significance value if you are doing multiple comparisons. Lots of people recommend against this, but the Bonferroni correction can be applied. Also, it may be useful to construct 95% confidence intervals around a correlation, and use that instead (does it include 0?).

Further reading on this