Correlation provides a good initial indication of association; however, correlation values are often thrown around without any consideration of their statistical significance. Although there is debate regarding what levels of correlation qualify as strong, moderate, or weak, we should also be aware that sample size influences whether a correlation is statistically significant. Just this week at work, I had a conversation that highlighted this.
I was running some initial exploratory data analysis on a smaller data set and reporting some simple correlations to a client. I identified weak to moderate correlations and outlined which were statistically significant (at p < .05) and which were not. This caught my client off guard, and he asked me to help him “understand how a relatively low correlation produces a very significant p-value.”
I proceeded to illustrate with a simple example: I generated four different data sets, all with correlation coefficients (r) ≈ .30 but with n ranging from 25 to 1,000¹. The low level of correlation exists in each data set, but as more observations are added you are better able to discern whether the correlation is statistically different from zero.
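A quick way to reproduce this effect is to simulate it. The sketch below is my own illustration, not the original analysis: it draws four samples of increasing size from a bivariate normal distribution with a true correlation of .30 (the specific n values and random seed are arbitrary) and reports the estimated r and p-value for each:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
rho = 0.30  # true underlying correlation in every data set

# Same modest correlation, increasingly large samples
for n in (25, 100, 300, 1000):
    xy = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
    r, p = stats.pearsonr(xy[:, 0], xy[:, 1])
    print(f"n = {n:>4}: r = {r:5.2f}, p = {p:.4f}")
```

At the smaller sample sizes the p-value will often exceed .05 even though the underlying correlation is the same; by n = 1,000 it is reliably well below it.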
I explained that although sample size is influential, the level of correlation determines just how influential n is. At high levels of correlation (|r| > .50), sample size has less impact; except at very small sample sizes, a strong correlation will rarely be paired with a large p-value. As the strength of correlation becomes smaller, sample size influences the results more. Therefore, you can have weak to moderate levels of correlation (|r| < .50) and a p-value that is either large or small depending on sample size. The test statistic for determining statistical significance² is based solely on the correlation coefficient (r) and the sample size (n):

t = r√(n − 2) / √(1 − r²), with n − 2 degrees of freedom
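As a sketch of that computation (the function name is mine), the p-value can be recovered from r and n alone:

```python
import math
from scipy import stats

def correlation_p_value(r: float, n: int) -> float:
    """Two-sided p-value for H0: rho = 0, using the Pearson test
    statistic t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 df."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return 2 * stats.t.sf(abs(t), df=n - 2)

# The same r = .30 moves from non-significant to significant as n grows
for n in (25, 50, 100):
    print(f"n = {n:>3}: p = {correlation_p_value(0.30, n):.4f}")
```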
From this we can determine the mix of r and n required to obtain statistically significant results. The following plot provides a reference chart indicating the level of correlation and sample size required to obtain p < .05, suggesting the relationship is statistically different from zero. When your correlation and sample size place you above the curve, you can be confident the results are statistically significant. However, if your results fall near or below the curve, you should formally test whether the relationship is statistically significant.
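The curve itself can be generated by inverting the test statistic: for a given n, the smallest significant |r| works out to t_crit / √(t_crit² + n − 2). A sketch (assuming a two-sided test at α = .05):

```python
import math
from scipy import stats

def min_significant_r(n: int, alpha: float = 0.05) -> float:
    """Smallest |r| reaching significance at the given alpha for a
    sample of size n, by inverting t = r*sqrt(n-2)/sqrt(1-r^2)."""
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return t_crit / math.sqrt(t_crit ** 2 + n - 2)

for n in (10, 25, 50, 100, 500, 1000):
    print(f"n = {n:>4}: |r| must exceed {min_significant_r(n):.3f}")
```

Plotting min_significant_r against n reproduces the reference curve described above.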
Keep in mind this only illustrates how the correlation coefficient and sample size combine to produce a significant p-value. Whether your data meets the required assumptions (for the Pearson test: a linear relationship, roughly bivariate normal data, and no extreme outliers) determines whether your results are valid. And, obviously, whether the relationship between the variables makes substantive sense determines whether the correlation is practically important.
1. This example addresses the Pearson correlation coefficient, which is the most widely used correlation method. Therefore, this test statistic is specific to the Pearson method and differs from those for the Spearman and Kendall’s tau methods. ↩
2. Researchers have offered rules of thumb for interpreting the meaning of correlation coefficients, but these rules of thumb are often domain specific (i.e. what is a “strong” correlation in the medical field may be considered a “weak” correlation in retail marketing). ↩