Data analysis often calls for testing whether two or more sets of numbers are related in some way (see Correlation_and_dependence). Here are some methods to test for or describe a relationship (usually a linear one) between random variables, or two sets of data.
Pearson's product-moment correlation test
This is a common test for a linear relationship between two variables; it yields a correlation coefficient between -1 and 1. This test should give the same significance (p-value) as a simple linear regression on the same data.
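A minimal sketch with SciPy's pearsonr (the data here are invented for illustration):

```python
# Pearson correlation on toy, nearly-linear data.
# scipy.stats.pearsonr returns the coefficient r and a two-sided p-value.
from scipy.stats import pearsonr

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

r, p = pearsonr(x, y)
print(f"r = {r:.3f}, p = {p:.4g}")
```

Since the toy data are almost perfectly linear, r comes out close to 1 with a small p-value.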
Simple linear regression
Linear regression is closely related to correlation: the sample correlation can be calculated as the square root of R^2^ (Coefficient_of_determination), with the sign taken from the slope of the regression line (the coefficient of x).
- NumPy has `polyfit` and SciPy has `scipy.stats.linregress`.
- See other notes here.
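A sketch of the R^2^ relationship above, using invented data: `linregress` reports r directly, and the same value falls out of sign(slope) times the square root of R^2^.

```python
# Fit a line two ways and recover r from R^2 and the slope's sign.
import numpy as np
from scipy.stats import linregress

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

slope, intercept = np.polyfit(x, y, 1)   # degree-1 fit: y ~ slope*x + intercept
fit = linregress(x, y)                   # has .slope, .intercept, .rvalue, .pvalue

# r = sign(slope) * sqrt(R^2)
r_from_r2 = np.sign(fit.slope) * np.sqrt(fit.rvalue ** 2)
print(fit.rvalue, r_from_r2)
```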
If the relationship is non-linear or the data are not normally
distributed, these rank-based tests may be useful. Both of the tests below
are available via cor.test in R (with the method argument), and in Python
via pandas.DataFrame.corr(method='spearman') or corr(method='kendall').
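The pandas route looks like this (toy data invented here; note that `DataFrame.corr` returns only the coefficient matrix, not p-values, so use SciPy when you need significance):

```python
# Rank-based correlation matrices with pandas.
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [1, 4, 9, 16, 25, 36],   # monotone but non-linear
})

rho = df.corr(method="spearman").loc["x", "y"]
tau = df.corr(method="kendall").loc["x", "y"]
print(rho, tau)   # both 1.0: the relationship is perfectly monotone
```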
Spearman's rank test
Ranks the x and y data points and then does a correlation test on the ranked data. May demonstrate a correlation when a Pearson or linear-model test does not.
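A toy illustration of that point (invented data): for a monotone but non-linear relationship, Spearman's rho is exactly 1 while Pearson's r falls short of it.

```python
# Spearman vs Pearson on exponentially growing data.
from scipy.stats import spearmanr, pearsonr

x = [1, 2, 3, 4, 5, 6, 7]
y = [1, 2, 4, 8, 16, 32, 64]   # monotone, strongly non-linear

rho, p_s = spearmanr(x, y)
r, p_p = pearsonr(x, y)
print(rho, r)   # rho = 1.0; r is positive but below 1
```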
Kendall's tau rank test
Looks at pairs of data points and tests how frequently the relationship within each pair goes in one direction or the other.
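A small sketch of that pair-counting idea (data invented): with 7 concordant and 3 discordant pairs out of 10, tau is (7 - 3) / 10 = 0.4.

```python
# Kendall's tau counts concordant vs discordant pairs.
from scipy.stats import kendalltau

x = [1, 2, 3, 4, 5]
y = [2, 3, 1, 5, 4]   # mostly increasing, with a couple of swaps

tau, p = kendalltau(x, y)
print(tau, p)   # tau = 0.4
```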
This paper uses Kendall's Tau for analyzing climate trends.
Corrections for significance
It can be hard to get a meaningful significance value when you are doing multiple comparisons. Many people recommend avoiding multiple comparisons altogether, but when they are necessary the Bonferroni correction can be applied. Alternatively, it may be useful to construct a 95% confidence interval around each correlation and ask whether it includes 0.
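A minimal sketch of the Bonferroni idea (the p-values and number of tests here are invented): divide the significance threshold by the number of comparisons m.

```python
# Bonferroni correction: require p < alpha / m instead of p < alpha.
m = 5                                   # number of correlation tests performed
alpha = 0.05
raw_pvalues = [0.003, 0.012, 0.04, 0.2, 0.6]

adjusted_alpha = alpha / m              # 0.01
significant = [p < adjusted_alpha for p in raw_pvalues]
print(significant)   # only the smallest p-value survives the correction
```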
Further reading on this
- Gotelli and Ellison, Chapter 10
- SE question