
Principal Component Analysis

Principal Component Analysis (PCA) is a statistical technique that combines multiple, possibly correlated variables into a smaller set of new variables called principal components. It is often used for dimensionality reduction, that is, reducing the number of variables used in an analysis.

In the context of dimensionality reduction, PCA tries to retain as much information as possible from the original data. This is done by finding new dimensions (that is, combinations of the original variables) that maximize the variance of the original data.

In practice, PCA can help extract useful information from high-dimensional data by reducing the level of noise and redundancy. This can help in various applications, ranging from face recognition to genetic analysis.

Case Example

A classic case example for Principal Component Analysis (PCA) is the Iris dataset, a multivariate dataset introduced by the statistician and biologist Ronald Fisher in a 1936 article. This dataset is very popular in many fields, including Machine Learning and Statistics.

The Iris dataset contains 150 samples from three species of iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and width of the sepals, and the length and width of the petals.

Iris Dataset: Originally published at UCI Machine Learning Repository

This small dataset from 1936 is often used for testing machine learning algorithms and visualizations. It is a classification dataset containing three classes of 50 instances each, where each class refers to a type of iris plant: Setosa, Versicolor, and Virginica. Each row of the table represents an iris flower, recording its species and the dimensions of its botanical parts: sepal length, sepal width, petal length, and petal width (in centimeters).

Principal Component Analysis Dataset

Author: R.A. Fisher (1936)

Source: UCI Machine Learning Repository

Steps for Principal Component Analysis (PCA):

  1. Activate the worksheet (Sheet) to be analyzed.
  2. Place the cursor on the Dataset (to create a Dataset, see the Data Preparation method).
  3. If the active cell is not on the Dataset, SmartstatXL will automatically try to determine the Dataset.
  4. Activate the SmartstatXL Tab.
  5. Click the Multivariate > Principal Component Analysis menu.
    Principal Component Analysis Menu Dialog Box
  6. SmartstatXL will display a dialog box to confirm whether the dataset is correct or not (usually the cell address of the dataset is automatically selected correctly).
    Dataset Source Dialog Box
  7. If it is correct, click the Next button.
  8. The Principal Component Analysis dialog box will then appear:
    Principal Component Analysis Dialog Box
  9. Select the Variable, Analysis Method, Extraction Method, and label for the biplot (optional). In this case study, we set:
    • Variable: Sepal length, Sepal width, Petal length, and Petal width
    • Analysis Method: Correlation
    • Extraction Method: Based on Eigenvalue
    • Label: Species

    The details can be seen in the following dialog box:
    PCA analysis and extraction method dialog box

    Analysis Method

    Analysis based on covariance is typically used when the units of the observed variables are the same or when the absolute scale of these variables is important. Conversely, analysis based on correlation is used when the units of the observed variables are different or when we are only interested in the relationships between variables, not their absolute differences.

    Extraction Method

    In Principal Component Analysis (PCA), the component extraction method typically involves finding the eigenvectors and eigenvalues of the covariance or correlation matrix. These eigenvalues represent the amount of variance explained by each principal component, and the eigenvectors represent how much of the original variables contribute to that principal component.

    Here are two common approaches in determining how many components or factors to extract:

    • Based on Eigenvalue: This approach, also known as the "eigenvalue-greater-than-one rule" or "Kaiser criterion", suggests that we should only retain the principal components that have an eigenvalue greater than 1. The rationale is that, when the analysis is based on the correlation matrix, each standardized variable has a variance of 1, so a retained component should explain more variance than a single original variable.
    • Based on a Fixed Number of Components: This approach involves determining the number of components or factors to retain based on prior knowledge or the purpose of the analysis. For example, if the goal of the analysis is to reduce the dimensionality of the data to two or three for visualization purposes, then we might choose to retain only two or three principal components.

    Both of these approaches have advantages and disadvantages. The eigenvalue-based approach is a commonly used general rule, but it may not always yield the most appropriate number of components for the purpose of the analysis. The fixed number of components approach may require more knowledge about the data and the purpose of the analysis. Therefore, the best approach might depend on the context of the analysis.

  10. Press the "Output" tab.
  11. Select the Principal Component Analysis output as shown below by pressing the Select All button:
    Principal Component Analysis output selection
  12. Press the OK button to generate the output in the Output Sheet.
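Outside SmartstatXL, the Analysis Method choice in step 9 (covariance vs. correlation) can be illustrated with a short sketch. This is only an illustrative example, not part of the SmartstatXL workflow, and the two variables are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical, roughly independent variables on very different scales,
# e.g. a length in centimeters and a mass in grams.
length_cm = rng.normal(5.0, 0.5, 200)
mass_g = rng.normal(500.0, 50.0, 200)
X = np.column_stack([length_cm, mass_g])

# Analysis based on covariance: the large-scale variable dominates PC1.
cov_eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

# Analysis based on correlation: equivalent to standardizing each variable
# first, so differences in units no longer matter.
corr_eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]

print(cov_eigvals / cov_eigvals.sum())
print(corr_eigvals / corr_eigvals.sum())
```

Under the covariance method the gram-scale variable absorbs almost all of the variance, while under the correlation method both variables enter on an equal footing. This is why the correlation method is chosen here: the four Iris measurements, although all in centimeters, are compared on the basis of their relationships rather than their absolute spreads.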

Analysis Results

Here is the Output of Principal Component Analysis (PCA):

Information on Principal Component Analysis

Principal Component Analysis

Kaiser-Meyer-Olkin (KMO) Test

The KMO is a measure that indicates how well the data fits for Principal Component Analysis. Its value ranges between 0 and 1. A high KMO value (close to 1) indicates that the correlation patterns among the variables are well suited to PCA. A low KMO value (below 0.5) indicates that the correlation patterns among the variables may not be suitable for PCA.

In this case, the KMO value is 0.536, indicating that the data is fairly suitable for PCA. However, this value is somewhat low and could be considered sub-optimal, as typically KMO values above 0.6 are considered supportive of PCA.

Bartlett's Test of Sphericity

This test checks the hypothesis that the original variables are independent (uncorrelated) in the population; in other words, that the population correlation matrix is an identity matrix. A Chi-Square value that is high relative to the degrees of freedom, with a p-value less than 0.05, indicates that the test is significant, i.e., the variables are likely correlated and therefore suitable for PCA.

In this case, the Chi-Square value of Bartlett's test is 706.361 with 6 degrees of freedom, and significance (p-value) is 0.000, far below 0.05. This indicates that there is strong evidence that the variables are correlated and therefore suitable for PCA.

So, based on both of these tests, the data from the Iris dataset are quite suitable for PCA, although the KMO value is somewhat low and could be considered sub-optimal.
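Both diagnostics can be reproduced approximately with a short sketch. The off-diagonal correlations below follow the correlation matrix table later in this article; the sepal length-sepal width correlation is not quoted in the text, so the value -0.118 from the standard Iris data is assumed:

```python
import numpy as np

# Correlation matrix of the four Iris variables, in the order sepal length,
# sepal width, petal length, petal width. Off-diagonal values follow the
# correlation matrix table in this article; the sepal length-sepal width
# correlation (-0.118) is an assumed value from the standard Iris data.
R = np.array([
    [ 1.000, -0.118,  0.872,  0.818],
    [-0.118,  1.000, -0.421, -0.357],
    [ 0.872, -0.421,  1.000,  0.963],
    [ 0.818, -0.357,  0.963,  1.000],
])
n, p = 150, 4  # number of observations and of variables

# Kaiser-Meyer-Olkin measure: compares squared correlations with squared
# partial correlations (obtained from the inverse correlation matrix).
Rinv = np.linalg.inv(R)
d = np.sqrt(np.diag(Rinv))
partial = -Rinv / np.outer(d, d)
off = ~np.eye(p, dtype=bool)
kmo = np.sum(R[off] ** 2) / (np.sum(R[off] ** 2) + np.sum(partial[off] ** 2))

# Bartlett's test of sphericity: chi-square statistic and degrees of freedom.
chi2 = -(n - 1 - (2 * p + 5) / 6.0) * np.log(np.linalg.det(R))
df = p * (p - 1) // 2

print(round(kmo, 3), round(chi2, 3), df)
```

With the rounded correlations used here, the results should come out close to the KMO of 0.536 and the Bartlett Chi-Square of about 706 (df = 6) reported above; small deviations are due to the rounding of the inputs.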

Correlation Matrix Table

Correlation Matrix Table

The following tables provide several views of the relationships among the variables in the dataset:

Correlation Matrix Table

This correlation matrix shows the relationship between each pair of variables. Its values range from -1 to 1, where -1 means perfect negative correlation, 1 means perfect positive correlation, and 0 means no correlation. In this matrix, we can see that sepal length is strongly correlated with petal length (0.872) and petal width (0.818). In other words, longer sepals tend to have longer and wider petals. The correlation between sepal width and petal length (-0.421) and petal width (-0.357) is negative, indicating that wider sepals tend to have shorter and narrower petals. The strongest correlation is observed between petal length and petal width (0.963), indicating that petal length and width are very closely correlated.

Reproduction Correlation Matrix Table

This table presents an estimate of the original correlation matrix reconstructed from the extracted principal components. Each number in it is a correlation value reproduced from the principal components.

Error Correlation Matrix Table

This table shows the difference between the original and reproduction correlation matrix, which can tell us how well our PCA model can reproduce the original correlations. Lower values indicate that our model is doing a good job of reproducing the original correlations. In this case, the errors seem relatively small, indicating that the PCA model is doing a fairly good job.

Eigenvalue and Screeplot

Eigenvalue and Screeplot

The "Total Variance Explained" table is an important output from the principal component analysis. This table shows how much information (in terms of variance) is explained by each principal component.

  • The first principal component (first row) has an eigenvalue of 2.911 and explains 72.8% of the total variance in the data. This means that this first principal component encapsulates almost 73% of the information in the original data set.
  • The second principal component (second row) has an eigenvalue of 0.921 and explains 23% of the total variance. So, if we combine the first two principal components, they will explain almost 96% (72.8% + 23%) of the total variance.
  • The third principal component (third row) has an eigenvalue of 0.147 and explains 3.7% of the total variance. So, if we combine the first three principal components, they will explain almost 99.5% (72.8% + 23% + 3.7%) of the total variance.
  • The fourth principal component (fourth row) has an eigenvalue of 0.021 and explains 0.5% of the total variance. If we combine all four principal components, they will explain 100% of the total variance, which means they summarize all the information in the original data set.

Based on this table and the Screeplot, we can see that the first two principal components explain almost all the variance in the data. Therefore, we may be able to reduce our data to two dimensions while retaining almost all the original information.
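The "Total Variance Explained" table can be reconstructed approximately from the correlation matrix. The off-diagonal correlations below follow the correlation matrix table in this article; the sepal length-sepal width correlation (-0.118) is an assumed value from the standard Iris data:

```python
import numpy as np

# Correlation matrix of the four Iris variables (sepal length, sepal width,
# petal length, petal width). Off-diagonal values follow the correlation
# matrix table in this article; the sepal length-sepal width correlation
# (-0.118) is an assumed value from the standard Iris data.
R = np.array([
    [ 1.000, -0.118,  0.872,  0.818],
    [-0.118,  1.000, -0.421, -0.357],
    [ 0.872, -0.421,  1.000,  0.963],
    [ 0.818, -0.357,  0.963,  1.000],
])

eigvals = np.linalg.eigvalsh(R)[::-1]   # eigenvalues, largest first
pct = 100.0 * eigvals / eigvals.sum()   # % of total variance per component
cum_pct = np.cumsum(pct)                # cumulative % of total variance

# Kaiser criterion ("Based on Eigenvalue"): retain components with
# eigenvalue greater than 1.
kaiser_k = int(np.sum(eigvals > 1.0))

print(np.round(eigvals, 3))
print(np.round(pct, 1), np.round(cum_pct, 1))
print(kaiser_k)
```

Note that the second eigenvalue (0.921) falls just below 1, so a strict Kaiser criterion would retain only one component; this is presumably why the analysis below switches to a fixed number of two components.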

In the next discussion, we will change the extraction method to a fixed number of components (2 components), as shown in the following dialog box.

extraction method based on the number of fixed components

Eigenvector Table

Eigenvector Table

The table shows the eigenvector for each principal component. Eigenvectors are vectors whose direction is unchanged by a linear transformation; they are only scaled by a factor (the eigenvalue). In the context of PCA, they are the directions in which the data's variability is maximized.

Each column represents a principal component (PC1, PC2, PC3, PC4), and each row represents an original variable (Sepal length, Sepal width, Petal length, Petal width). These values show how much each variable contributes to each principal component.

For example, for the first principal component (PC1), Sepal length, Petal length, and Petal width make significant positive contributions (0.522, 0.581, and 0.566), while Sepal width makes a negative contribution (-0.263). This means that PC1 might represent the overall size of the iris flower, as these three variables are positively correlated with PC1.

For the second principal component (PC2), Sepal width is the largest contributor (0.926), and the other variables contribute less. This means that PC2 may represent more specific characteristics related to the sepal width.

These values can be used to formulate each principal component in the form of a linear equation from the original variables. For example, PC1 can be formulated as:

PC1 = 0.522*(Sepal length) - 0.263*(Sepal width) + 0.581*(Petal length) + 0.566*(Petal width)

Overall, the interpretation of this eigenvector table will depend on the context of the data and your analysis objectives.
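The linear formula for PC1 given above can be evaluated directly. The sketch below applies it to a hypothetical standardized observation (z-scores, since the analysis is based on the correlation matrix); the z-values are made up for illustration:

```python
import numpy as np

# PC1 eigenvector from the eigenvector table (variable order: sepal length,
# sepal width, petal length, petal width).
pc1 = np.array([0.522, -0.263, 0.581, 0.566])

# A hypothetical observation expressed as z-scores (made-up values with a
# setosa-like pattern: small petals, wide sepals).
z = np.array([-0.9, 1.0, -1.3, -1.3])

# PC1 = 0.522*z1 - 0.263*z2 + 0.581*z3 + 0.566*z4
score = float(pc1 @ z)
print(round(score, 3))
```

A flower with small petals and wide sepals lands on the negative side of PC1, consistent with the interpretation of PC1 as overall flower size.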

Component Loading

Component Loading

This table shows the component loading values and communalities for each variable.

Component Loading

Loadings are correlations between the original variables and the principal components. Loading values indicate how much the original variables contribute to each principal component. Loadings can be positive or negative, indicating the direction of the relationship between the original variable and the principal component.

In this table, we can see that Sepal length, Petal length, and Petal width have very high positive loadings on PC1, which means these variables are strongly correlated with the first principal component. Sepal width, on the other hand, has a negative loading on PC1, which means it is opposite to PC1. On PC2, Sepal width has a very high positive loading, which means it is strongly correlated with the second principal component.

Communalities

Communalities are the proportion of variance in each original variable that is explained by the retained principal components. In this table, we can see that all variables have very high communality values (more than 0.9), which means that almost all of the variance in these variables can be explained by the first two principal components.

Explained Variance

The row 'Expl. Variance' shows the amount of variance explained by each principal component. This value is the same as the eigenvalue of each principal component, which we discussed earlier.

The rows '% Variance' and '% Cum. Variance' show the proportion of total variance explained by each principal component and the cumulative proportion of variance explained by all principal components up to that point. In this table, we can see that PC1 explains 72.8% of the total variance and PC1 and PC2 together explain 95.8% of the total variance.
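These quantities are connected by simple identities: a loading is an eigenvector entry multiplied by the square root of the eigenvalue, a communality is the row sum of squared loadings over the retained components, and the explained variance of a component is the column sum of its squared loadings, which equals its eigenvalue. A sketch using the PC1 eigenvector and eigenvalues from this article; the PC2 eigenvector entries other than Sepal width (0.926) are not quoted in the text and are assumed here from the standard Iris analysis:

```python
import numpy as np

# Eigenvalues and eigenvectors for PC1 and PC2 (variable order: sepal
# length, sepal width, petal length, petal width). The PC1 column, the
# eigenvalues, and the sepal width entry of PC2 come from the tables in
# this article; the remaining PC2 entries are assumed values taken from
# the standard Iris analysis.
eigvals = np.array([2.911, 0.921])
eigvecs = np.array([
    [ 0.522, 0.372],
    [-0.263, 0.926],
    [ 0.581, 0.021],
    [ 0.566, 0.065],
])

# Loadings: correlation of each variable with each component.
loadings = eigvecs * np.sqrt(eigvals)

# Communality: variance of each variable explained by the retained PCs.
communalities = np.sum(loadings ** 2, axis=1)

# Explained variance per component (equals the eigenvalue).
expl_variance = np.sum(loadings ** 2, axis=0)

print(np.round(loadings, 3))
print(np.round(communalities, 3))
print(np.round(expl_variance, 3))
```

All four communalities come out above 0.9, matching the table, and the per-component explained variance reproduces the eigenvalues 2.911 and 0.921.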

Variable Contribution Table

Variable Contribution Table

This table shows how much each original variable contributes to the variability of the identified principal components (PC1 and PC2 in this case).

The values in the table represent the proportion (in percentage terms) of the total variability of each principal component that is associated with a particular original variable. For example, Sepal length contributes 27.3% of the total variability in PC1 and 13.9% of the total variability in PC2.
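A variable's contribution to a component can be computed as its squared eigenvector entry, expressed as a percentage of the component's total. A quick check against the PC1 column of the eigenvector table:

```python
import numpy as np

# PC1 eigenvector from the eigenvector table (variable order: sepal length,
# sepal width, petal length, petal width).
pc1 = np.array([0.522, -0.263, 0.581, 0.566])

# Contribution of each variable to PC1: squared eigenvector entry as a
# percentage (the column sums to 100%).
contrib = 100.0 * pc1 ** 2 / np.sum(pc1 ** 2)

print(np.round(contrib, 1))
```

The first entry reproduces the 27.3% contribution of Sepal length to PC1 reported above.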

Loading Plot and Biplot

Loading Plot

Loading Plot

The loading plot is a visualization of the component loadings discussed above. It shows how much each original variable contributes to the principal components. On the horizontal axis we have PC1, explaining 72.77% of the total variance, and on the vertical axis PC2, explaining 23.03% of the total variance.

In quadrant I, we have Petal Length, Petal Width, and Sepal Length. This means that these three variables have positive loadings on both PC1 and PC2, indicating they positively correlate with PC1 and PC2.

In quadrant II, we have Sepal Width. This means Sepal Width has negative loading on PC1 and positive loading on PC2, meaning Sepal Width negatively correlates with PC1 and positively with PC2.

Biplot

Biplot

A biplot is an extension of the loading plot, where the observation data is also displayed in a lower-dimensional space. Biplots are used for the visualization of multivariate data, where both original variables (in the form of vectors) and individual observations are displayed.

In this case, besides the loadings for each variable (just as displayed in the loading plot), the spread of observation data is also shown. Each point on the plot represents one observation (in this case, one iris flower), and the relative positions of these points reflect the relationships between these observations in the lower-dimensional space generated by PCA. The biplot can help us understand how observations relate to each other based on the original variables, and how these variables contribute to the principal components.

Setosa:

The iris flowers of the setosa type seem to cluster in quadrants II and III, around the loading plot for Sepal Width. This indicates that the characteristics of Setosa are highly correlated with sepal width. As they are on the negative side of the PC1 axis, it suggests that Setosas tend to have lower values for petal length, petal width, and sepal length compared to other iris types.

Versicolor:

The spread of Versicolor data extends in quadrants I, III, and IV. Although its spread is wider, it's generally closer to petal length, petal width, and sepal length (quadrants I and IV) compared to sepal width (quadrant III). This suggests that the characteristics of Versicolor are more correlated with petal length and width as well as sepal length rather than with sepal width.

Virginica:

The spread of Virginica data extends in quadrants I and IV, and it seems to be more towards the positive side of the x-axis (PC1) compared to Versicolor. This suggests that Virginicas tend to have higher values for petal length, petal width, and sepal length compared to other iris types.

In general, this indicates that there are significant differences between iris types in petal and sepal length and width. The fact that the different types occupy different regions of the plot indicates that they can be distinguished based on these variables. It also shows that PCA has succeeded in reducing the data's dimensions in a way that makes the differences between groups visually apparent.

Component Score Coefficient Table

Component Score Coefficient Table

This Component Score Coefficient Table shows how the scores for each principal component (in this case, PC1 and PC2) are calculated from the original variables.

The values in this table are the coefficients of each original variable in the principal components: they show how much a principal component score is expected to change per unit change in an original variable, holding all other variables constant. For example, for PC1, every 1-unit increase in Sepal length raises the PC1 score by 0.306, while every 1-unit increase in Sepal width lowers it by 0.154. Likewise for PC2, every 1-unit increase in Sepal length raises the PC2 score by 0.388, and so on.

In general, this Component Score Coefficient Table summarizes how the original variables contribute to each principal component and how the principal component scores are calculated from the original variables.
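The coefficients in this table appear to equal each eigenvector entry divided by the square root of the component's eigenvalue, so that the resulting component scores are standardized; this interpretation is an assumption, checked here against the reported numbers:

```python
import math

# PC1 eigenvalue and the sepal length entry of its eigenvector, both taken
# from the tables in this article.
eigenvalue_pc1 = 2.911
eigvec_sepal_length = 0.522

# Assumed relationship: score coefficient = eigenvector entry / sqrt(eigenvalue),
# which yields standardized (variance-1) component scores.
coef = eigvec_sepal_length / math.sqrt(eigenvalue_pc1)
print(round(coef, 3))
```

The result, 0.306, matches the Sepal length coefficient for PC1 reported above.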

Component Score Table

Component Score Table

This Component Score Table contains the scores of each observation for each principal component. In this case, we are looking at the scores for PC1 and PC2.

Principal component scores are the values calculated for each observation on each principal component. They are found by multiplying each variable value by its weight (as given in the component score coefficient table) and summing the results. Each row in the component score table represents one observation (in this case, one iris flower) and shows where that observation sits relative to each principal component. For example, the first observation has a score of -1.323 for PC1 and 0.525 for PC2. These scores are important because they provide a way to visualize multivariate data in a lower-dimensional space; for example, they can be used to create a biplot, as shown earlier.

Considering the position of the principal component scores in a two-dimensional plot helps us understand how observations relate to each other based on the original variables and how these variables contribute to each principal component. The scores can also be used in further analysis, such as clustering or classification.

Table of Observation Component Coordinates

Table of Observation Component Coordinates

This Table of Observation Component Coordinates contains values equivalent to those in the Component Score Table, expressed on a different scale: they are the projections of each observation onto each principal component, in this case PC1 and PC2.

Each row in this table represents one observation (in this case, one iris flower) and shows where that observation is positioned relative to each principal component. For example, the first observation has coordinates -2.257 for PC1 and 0.504 for PC2, meaning it is projected to that point in the space created by the principal components. These coordinates are important because they allow us to visualize multivariate data in a lower-dimensional space, as shown by the biplot. By understanding where each observation lies in this space, we can better understand how observations relate to each other and to each principal component.

Using these coordinates, we can see that some types of iris flowers, as discussed above, correlate more with some variables than with others, and that they can be distinguished based on these variables in the lower-dimensional space.
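The relationship between these coordinates and the standardized component scores in the previous table can be checked numerically. The sketch below assumes that a coordinate is the standardized score rescaled by the square root of the component's eigenvalue:

```python
import math

# Values for the first observation, taken from the tables in this article.
score_pc1 = -1.323       # standardized component score on PC1
eigenvalue_pc1 = 2.911   # eigenvalue of PC1

# Assumed relationship: coordinate = standardized score * sqrt(eigenvalue)
# (equivalently, the projection z @ eigenvector).
coordinate_pc1 = score_pc1 * math.sqrt(eigenvalue_pc1)
print(round(coordinate_pc1, 3))
```

The result, -2.257, matches the PC1 coordinate of the first observation reported above; the same check works for PC2 (0.525 x sqrt(0.921) is approximately 0.504).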

Conclusion

From the results of the principal component analysis on Iris flower data, several important conclusions can be drawn as follows:

  • Principal Components: The analysis shows that 95.8% of the variation in the data can be explained by the first two principal components (PC1 and PC2). PC1 explains 72.8% of the variation and PC2 explains 23%. These components provide a good way to summarize the information in the data, which originally had four variables (sepal length, sepal width, petal length, and petal width), into two new variables without losing much information.
  • Variable Contributions: Petal length and petal width contribute most to PC1, while sepal width contributes most to PC2. Therefore, petal length, petal width, and sepal width are among the most informative variables for distinguishing between types of iris flowers.
  • Observation Component Coordinates: From the observation component coordinates table and the biplot, we can see that setosa lies on the negative side of PC1, while versicolor and virginica tend to have higher PC1 values. This suggests that PC1 and PC2 can be used to effectively distinguish setosa from the other two types of iris.
  • Biplot: From the biplot, we can see that setosa clusters around the Sepal Width loading vector, while versicolor and virginica lie further to the right (positive PC1), closer to the Petal Length, Petal Width, and Sepal Length loading vectors. This provides a better understanding of how observations relate to each other and to each principal component.

Overall, PCA is an effective method for summarizing and understanding multivariate data, and the results show that it can be successfully used to analyze and understand iris flower data.