Pearson correlation measures the strength and direction of a linear relationship between two variables. Two variables are said to be correlated if a change in one variable is accompanied by a change in the other, either in the same direction or in the opposite direction. Remember that a small (non-significant) correlation coefficient does not mean that the two variables are unrelated. Two variables may have a strong relationship even though the correlation coefficient is close to zero, for example when the relationship is non-linear.
Thus, the correlation coefficient only measures the strength of a linear relationship, not a non-linear one. It should also be remembered that a strong linear relationship between variables does not necessarily imply a cause-and-effect (causal) relationship.
Introduction
Researchers often observe several parameters from the same sampling or observational unit. For example, in research testing a certain type of fertilizer, in addition to recording rice yield, the researcher may also record several other responses, such as number of grains, weight of 100 seeds, number of tillers, nitrogen uptake, and potassium uptake. If only two variables are recorded, the data are said to be bivariate; if there are more, multivariate. A recorded variable takes random values, so it is called a random variable. In contrast, the fertilizer dose was determined in advance, so it is called a fixed variable. Besides examining the relationship between fertilizer dose (factor) and rice yield (response), the researcher may also want to examine the relationships among the pairs of response variables he observed. Does an increase in nitrogen uptake go hand in hand with an increase in yield, or the other way around, and how strong is the relationship? The strength and direction of the linear relationship between two variables can be described by a statistical measure called the "correlation coefficient".
Data exploration
Before analyzing the correlation between variables, we should first explore the data graphically. The pattern of the relationship between two variables is usually examined by plotting the pairs of sample data on a Cartesian diagram called a scatterplot or scatter diagram. Each data pair (x, y) is plotted as a single point.
An example of a scatter diagram can be seen in the following figure.

At a glance we can see the pattern of the relationship from the graphs. In Graphs a, b, and c, the increase in the value of y is in line with the increase in the value of x: if x increases, y also increases, and vice versa. From Graph a to Graph c, the points lie ever closer to a straight line, showing that the linear relationship between x and y grows stronger (synergistic).
The opposite happens in Graphs d, e, and f. The increase in y runs counter to the increase in x (antagonistic): an increase in one variable is accompanied by a decrease in the other. Again, the strength of the relationship between the two variables grows from d to f.
Unlike the previous graphs, Graph g shows no linear relationship pattern between the two variables, indicating that there is no correlation between them. Finally, Graph h shows that the two variables are related, but the pattern is quadratic rather than linear.
Covariance and Correlation
To understand linear correlation between two variables, there are two elements we must examine: a measure of how the two variables vary together (covariance) and a standardization process.
Covariance
One way to measure the strength of a linear relationship between two continuous random variables is to determine how much the two variables co-vary, that is, vary together. If one variable increases (or decreases) as its partner variable increases (or decreases), the two variables are said to covary. If one variable does not change as the other increases (or decreases), the variables do not covary. The statistic that measures how much two variables covary in a sample of observations is the covariance.
$$\text{Covariance}=S_{xy}=\frac{\sum{(x_i-\bar{x})(y_i-\bar{y})}}{n-1}$$
In addition to measuring the strength of the relationship between two variables, covariance also determines the direction of the relationship between the two variables.
- A positive covariance means that when the value of x is above its mean, the value of y tends to be above its mean as well, and vice versa (same direction).
- A negative covariance means that when the value of x is above its mean, the value of y tends to be below its mean (opposite direction).
- Finally, a covariance close to zero indicates that the two variables are not (linearly) related.
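As a concrete check of the formula above, here is a minimal pure-Python sketch of the sample covariance; the data pairs are hypothetical, chosen only for illustration.

```python
def covariance(x, y):
    """Sample covariance: S_xy = sum((xi - xbar)(yi - ybar)) / (n - 1)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

# Hypothetical paired observations (x increases, y tends to increase)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

print(covariance(x, y))  # → 2.0, positive: the variables move in the same direction
```

Note that covariance is symmetric: `covariance(x, y)` equals `covariance(y, x)`.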
Standardization
One limitation of covariance as a measure of the strength of a linear relationship is that its magnitude depends on the units of the two variables. For example, the covariance between N uptake (%) and rice yield (tonnes) becomes much larger if we convert % (1/100) to ppm (1/million). So that the value does not depend on the units of each variable, we first standardize the covariance by dividing it by the standard deviations of the two variables, so that the result lies between -1 and +1. This statistical measure is known as the Pearson product-moment correlation, which measures the strength of the linear (straight-line) relationship between the two variables. The linear correlation coefficient is sometimes called the Pearson correlation coefficient in honor of Karl Pearson (1857-1936), who first developed this statistical measure.
Covariance:
$$\text{Covariance}=S_{xy}=\frac{\sum\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)}{n-1}$$
Standard Deviation of X and Y variables:
$$S_x=\sqrt{\frac{\sum\left(x_i-\bar{x}\right)^2}{n-1}}\quad\text{and}\quad S_y=\sqrt{\frac{\sum\left(y_i-\bar{y}\right)^2}{n-1}}$$
Correlation
The covariance value is standardized by dividing the covariance value by the standard deviation value of the two variables.
$$\text{Correlation}=r_{xy}=\frac{S_{xy}}{S_x\cdot S_y}$$
$$\text{Correlation}=r_{xy}=\frac{\sum\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)}{\sqrt{\sum\left(x_i-\bar{x}\right)^2}\cdot\sqrt{\sum\left(y_i-\bar{y}\right)^2}}$$
or
$$\text{Correlation}=r_{xy}=\frac{\sum x_iy_i-\frac{\sum x_i\sum y_i}{n}}{\sqrt{\sum x_i^2-\frac{\left(\sum x_i\right)^2}{n}}\cdot\sqrt{\sum y_i^2-\frac{\left(\sum y_i\right)^2}{n}}}$$
or
$$\text{Correlation}=r_{xy}=\frac{n\sum x_iy_i-\sum x_i\sum y_i}{\sqrt{n\sum x_i^2-\left(\sum x_i\right)^2}\cdot\sqrt{n\sum y_i^2-\left(\sum y_i\right)^2}}$$
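The formulas above are algebraically equivalent. A small sketch, using hypothetical data, computes r with both the definitional form and the raw-sums (computational) form and shows they agree:

```python
from math import sqrt

def pearson_def(x, y):
    # Definitional form: centered cross-products over the product
    # of the root sums of squared deviations.
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def pearson_raw(x, y):
    # Raw-sums form: avoids computing the means explicitly.
    n = len(x)
    sx, sy = sum(x), sum(y)
    return (n * sum(a * b for a, b in zip(x, y)) - sx * sy) / sqrt(
        (n * sum(a * a for a in x) - sx ** 2) * (n * sum(b * b for b in y) - sy ** 2)
    )

# Hypothetical paired observations
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

print(round(pearson_def(x, y), 4), round(pearson_raw(x, y), 4))  # → 0.8528 0.8528
```

The raw-sums form is convenient for hand calculation from column totals, as in the applied example later in this article.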
Correlation coefficient
The correlation coefficient measures the strength and direction of the linear relationship between two variables. Remember that a small (non-significant) correlation coefficient does not mean that the two variables are unrelated. Two variables may have a strong relationship even though the correlation coefficient is close to zero, for example when the relationship is non-linear. Thus, the correlation coefficient only measures the strength of a linear relationship, not a non-linear one.
It should also be remembered that a strong linear relationship between variables does not necessarily mean there is a cause-and-effect (causal) relationship. A pair of variables x and y may have a high correlation coefficient as a result of some third factor z. For example, temperature (x) and air pressure (y) may be highly correlated without one causing the other: the correlation could arise solely from changes in the altitude (z) of a place, since both temperature and pressure decrease as altitude increases (although theoretically temperature and pressure are related through PV = nRT). Thus, correlation only describes the strength of a relationship without establishing causality, that is, which variable is affected and which is influencing. The two variables can each act as Variable X or Variable Y.
Correlation characteristics
- The value of r always lies between -1 and +1
- The value of r does not change if all the data in the x variable, the y variable, or both are multiplied by a positive constant c (c > 0); multiplying by a negative constant flips the sign of r but not its magnitude.
- The value of r does not change if all the data on the variable x, variable y, or both are added to a certain constant (c) value.
- The value of r will not be affected by determining which is the variable x and which is the variable y. The two variables are interchangeable.
- The value of r only measures the strength of linear relationships; it is not designed to measure non-linear relationships.
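These properties can be verified numerically. The sketch below, using hypothetical data, checks boundedness, shift invariance, scaling behaviour, and the interchangeability of x and y:

```python
from math import isclose, sqrt

def r(x, y):
    # Pearson correlation, definitional form
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - xbar) ** 2 for a in x) *
                      sum((b - ybar) ** 2 for b in y))

# Hypothetical paired observations
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 6.0]

assert -1.0 <= r(x, y) <= 1.0                       # r always lies in [-1, +1]
assert isclose(r(x, y), r([a + 10 for a in x], y))  # adding a constant: unchanged
assert isclose(r(x, y), r([3 * a for a in x], y))   # positive scaling: unchanged
assert isclose(r(x, y), -r([-a for a in x], y))     # negative scaling flips the sign
assert isclose(r(x, y), r(y, x))                    # x and y are interchangeable
print("all properties hold")
```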
Assumption
Assumptions for correlation analysis:
- The paired data sample (x, y) comes from a random sample and is quantitative data.
- The data pairs (x, y) should follow a (bivariate) normal distribution.
It must be remembered that correlation analysis is very sensitive to outliers!
Assumptions can be checked visually by using:
- Boxplots, histograms & univariate scatterplots for each variable
- Bivariate scatterplots
If the assumptions are not met, for example the data are not normally distributed (or there are outliers), we can use the Spearman rank correlation, a non-parametric alternative.
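Spearman's rank correlation is simply the Pearson correlation computed on the ranks of the data. A pure-Python sketch (using average ranks for ties, and hypothetical data showing why ranks help with monotone but non-linear relationships):

```python
from math import sqrt

def ranks(v):
    # Assign 1-based ranks; tied values share the average of their rank positions.
    order = sorted(range(len(v)), key=lambda i: v[i])
    out = [0.0] * len(v)
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def pearson(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - xbar) ** 2 for a in x) *
                      sum((b - ybar) ** 2 for b in y))

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

# A monotone but non-linear relationship: Spearman detects it perfectly,
# while Pearson understates it.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]   # y = x**3

print(spearman(x, y))     # → 1.0
```

Because ranks are unaffected by any monotone transformation of the data, Spearman's coefficient is also far less sensitive to outliers than Pearson's.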
Coefficient of Determination
The correlation coefficient, r, provides only a measure of the strength and direction of a linear relationship between two variables. It does not tell us what proportion of the variation in the dependent variable (Y) can be explained by its linear relationship with the independent variable (X). Values of r cannot be compared directly; for example, we cannot say that r = 0.8 represents twice the strength of r = 0.4.
Fortunately, the square of r measures exactly this proportion, and this statistic is called the Coefficient of Determination, r². Thus, the coefficient of determination can be defined as the proportion of the variance of Y that can be explained by the linear relationship between the X and Y variables.
For example, if the correlation (r) between N uptake and yield is 0.8, then r² = 0.8 × 0.8 = 0.64 = 64%. This means that 64% of the variability in rice yield can be explained by the level of N uptake. The remaining 36% may be caused by other factors and/or experimental error.
Correlation Coefficient Test
There are two methods commonly used to test the significance of the correlation coefficient. The first method uses the t-test and the second method uses the r table.
Flowchart for hypothesis testing:

Notes:
The critical values of r can be seen in the table below. The complete table of critical values of r can be downloaded at the following link: critical value of r table:

Factors that affect the outcome of the correlation test: the magnitude of the correlation coefficient and the sample size.
Applied Example
The following is data on age, weight, and blood pressure.
| Individual | Age | Weight | Systolic Pressure |
|------------|-----|--------|-------------------|
| A | 34 | 45 | 108 |
| B | 43 | 44 | 129 |
| C | 49 | 56 | 126 |
| D | 58 | 57 | 149 |
| E | 64 | 65 | 168 |
| F | 73 | 63 | 161 |
| G | 78 | 55 | 174 |
In this case, we want to know whether there is a linear relationship between age and systolic blood pressure. The significance level used is 5%.
Hypothesis
H0: ρ = 0 vs H1: ρ ≠ 0
Data Exploration

Based on the scatterplot, it appears that the distribution of the dots follows a linear pattern with a positive slope, which means that there is a consistent relationship between age and systolic blood pressure. Thus, we can use the correlation coefficient to determine whether the linear relationship between the two variables is meaningful or not. If the relationship pattern is not linear, it is not appropriate to use the correlation coefficient because the value of r is only to measure the strength and direction of the linear relationship between the two quantitative variables.
Assumption
Both variables are quantitative. Next, we check whether each variable is normally distributed.
Formal Test
H0: data is normally distributed
H1: data is not normally distributed

Interpretation
If the sig value (p-value) ≤ 0.05, reject H0, which means the data are not normally distributed.
If the sig value (p-value) > 0.05, fail to reject H0, which means the data can be regarded as normally distributed.
In the case above, the p-value for both variables is > 0.05, so we conclude that the data are normally distributed.
The normality requirement is therefore met for both variables, whether the Kolmogorov-Smirnov test or the Shapiro-Wilk test is used.
Graphic


Graphically it also appears that the two variables are normally distributed. The use of box plots to see whether the distribution of data is normally distributed or not is described on the topic: Getting to know Box Plots
Calculation of correlation coefficient value (r)
| No | Age (X) | Systolic Pressure (Y) | X² | Y² | XY |
|----|---------|-----------------------|----|----|----|
| 1 | 34 | 108 | 1156 | 11664 | 3672 |
| 2 | 43 | 129 | 1849 | 16641 | 5547 |
| 3 | 49 | 126 | 2401 | 15876 | 6174 |
| 4 | 58 | 149 | 3364 | 22201 | 8642 |
| 5 | 64 | 168 | 4096 | 28224 | 10752 |
| 6 | 73 | 161 | 5329 | 25921 | 11753 |
| 7 | 78 | 174 | 6084 | 30276 | 13572 |
| Sum | 399 | 1015 | 24279 | 150803 | 60112 |
| Average | 57 | 145 | | | |
$$r_{xy}=\frac{n\sum{x_iy_i-\sum{x_i\sum y_i}}}{\sqrt{{n\Sigma x}_i^2-\left(\Sigma x_i\right)^2}\cdot\sqrt{{n\Sigma y}_i^2-\left(\Sigma y_i\right)^2}}$$
$$r_{xy}=\frac{7\left(60112\right)-(399)(1015)}{\sqrt{7\left(24279\right)-\left(399\right)^2}\cdot\sqrt{7\left(150803\right)-\left(1015\right)^2}}$$
$$r_{xy}=\frac{15799}{103.69\times159.36}=0.9561$$
Hypothesis test
Method 1
$$t=\frac{r}{\sqrt{\frac{1-r^2}{n-2}}}=\frac{0.9561}{\sqrt{\frac{1-\left(0.9561\right)^2}{7-2}}}=7.30$$
Determine the t-table value with significance level (α) = 5% and df = n - 2.
From the distribution table t, we get: t(0.05/2, 5)= 2.57
Compare t-count with t-table:
From the calculation we obtain t-count = 7.30 and t-table = 2.57. Clearly |t-count| > t-table, so we reject H0 and accept H1. Thus, we can state that there is a linear relationship between age and systolic blood pressure.
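The hand calculation above can be reproduced in a few lines of Python, using the article's age and systolic pressure data and the raw-sums form of r:

```python
from math import sqrt

age = [34, 43, 49, 58, 64, 73, 78]
systolic = [108, 129, 126, 149, 168, 161, 174]

n = len(age)
sx, sy = sum(age), sum(systolic)                       # 399, 1015
sxy = sum(a * b for a, b in zip(age, systolic))        # 60112
sxx = sum(a * a for a in age)                          # 24279
syy = sum(b * b for b in systolic)                     # 150803

# Correlation coefficient (raw-sums form) and its t statistic
r = (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
t = r / sqrt((1 - r ** 2) / (n - 2))

print(round(r, 4), round(t, 2))  # → 0.9561 7.3
```

Both values match the hand calculation (r = 0.9561, t-count = 7.30), confirming the column totals in the worksheet above.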
Method 2
Compare |r| with the critical table value of r for n = 7. The critical table value of r = 0.754.
From the calculation, r = 0.956. Clearly |r| > 0.754, so we conclude that there is a linear relationship between age and systolic blood pressure.
Output Analysis using SPSS

We can report the result as follows:
Correlation between age and systolic blood pressure: r(5) = 0.956; p < 0.01 (the number in parentheses is the degrees of freedom, n - 2 = 5)
Coefficient of Determination
$$\text{Coefficient of determination}\ \left(r^2\right)=0.9561^2=0.91$$
The coefficient of determination above represents the proportion of the variance in systolic blood pressure that can be explained by its linear relationship with age. Based on the results of the analysis, about 91% of the variation in systolic blood pressure is accounted for by a person's age; the remainder is due to other factors and/or error.