Exploratory Data Analysis (EDA) is a method of exploring data that uses simple arithmetic and graphical techniques to summarize observational data. Data exploration is an integral part of how we perceive data. If the ultimate goal of the research is not causal inference, no analysis beyond exploration may be needed. When further analysis is needed, EDA helps us study and discover properties of the data that later guide the choice of an appropriate statistical model. Thus, in exploratory data analysis, the nature of the observed data determines the appropriate statistical model (or the refinement of a planned analysis).

The first step in analyzing data is to study its characteristics. There are two important reasons to do this carefully before the actual analysis. The first is to check for errors that may have occurred at any stage, from recording data in the field to entering it on a computer. The second is exploration, so that we can determine the right analysis model.

Introduction

Scientific research can be compared to solving a puzzle. Research should center on the problem, not on the statistical tools used. Curiosity, suspicion, and imagination are the keys to the discovery process. EDA serves several purposes, such as:

  • Maximizing insight into the data
  • Uncovering hidden structure in the data
  • Extracting important variables
  • Detecting outliers and anomalies
  • Testing assumptions
  • Building models
  • Performing optimization

The main contribution of the exploratory approach lies in presenting summary statistics visually. Numerical summaries alone can obscure, hide, or even misrepresent the structure of the data. If a numerical summary is accepted at face value, without visual inspection of the data, it may lead to the wrong choice of model. A model chosen in haste, possibly on wrong assumptions, leads to wrong conclusions. For this reason, preliminary analysis should begin with a visual examination of the data, not a numerical summary.
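As a minimal sketch of this point, the Python snippet below rescales two samples so that their mean and standard deviation agree, then prints a rough text histogram of each. The two samples are hypothetical, and the helper names `standardize`, `summary`, and `text_histogram` are invented for this illustration.

```python
import statistics


def standardize(values):
    """Rescale values to mean 0 and population standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def summary(values):
    """Numerical summary: (mean, population standard deviation), rounded."""
    # Adding 0.0 turns a rounded -0.0 into 0.0 so the output reads cleanly.
    return (round(statistics.fmean(values), 6) + 0.0,
            round(statistics.pstdev(values), 6))


def text_histogram(values, edges):
    """Count values per [lo, hi) bin and return one row of '*' per bin."""
    lines = []
    for lo, hi in zip(edges, edges[1:]):
        count = sum(lo <= v < hi for v in values)
        lines.append(f"[{lo:+.1f}, {hi:+.1f}) {'*' * count}")
    return lines


# Hypothetical data: one single-peaked sample and one two-peaked sample.
unimodal = standardize([1, 3, 4, 4, 5, 5, 5, 5, 6, 6, 7, 9])
bimodal = standardize([1, 1, 1, 2, 2, 2, 8, 8, 8, 9, 9, 9])

edges = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
for name, sample in (("unimodal", unimodal), ("bimodal", bimodal)):
    print(name, "(mean, sd) =", summary(sample))
    print("\n".join(text_histogram(sample, edges)))
```

The two numerical summaries are the same, yet the first sample piles up in the middle bin while the second leaves it empty. A model chosen from the summary alone could not tell the two apart.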

Data Analysis Paradigms

There are three approaches to data analysis:

  1. Classical
  2. Exploratory (EDA)
  3. Bayesian

Thus, EDA is one of three approaches to data analysis. The three are similar in that they all start from a general theory or problem and end with a conclusion. The difference lies in the order and focus of the intermediate steps.

  • Classical analysis, in order:
    • Problem → Data → Model → Analysis → Conclusion
  • EDA, in order:
    • Problem → Data → Analysis → Model → Conclusion
  • Bayesian, in order:
    • Problem → Data → Model → Prior/conditional distribution → Analysis → Conclusion

So, in classical analysis, data collection is followed by imposing a model (normality, linearity, etc.), and the analysis, estimation, and testing that follow focus on the parameters of that model. In EDA, data collection is not followed by imposing a model; it is followed immediately by analysis whose aim is to determine which model is appropriate. Finally, Bayesian estimation considers two things: the data we currently have and prior information about the case under study. Both are used together to draw conclusions or test assumptions about the model parameters.

In practice, data analysis combines the three approaches above (as well as others). The distinctions are drawn here only to emphasize the main differences between them.

EDA is not a set of techniques. It is an approach, an attitude or philosophy about how we analyze a set of data. Is EDA then the same as statistical graphics? The answer is no. EDA uses many graphical techniques, but it is not synonymous with graphical analysis, even though the two are similar and the terms are sometimes used interchangeably. Graphical analysis is limited to a set of tools that are all graphical, each focused on one aspect of the data, whereas EDA covers a wider area. EDA emphasizes a direct approach that lets the data itself reveal its structure and model.

The graphical techniques used in EDA are often very simple. These techniques include:

  • plotting the raw data (histogram, dot plot, data plot, stem-and-leaf plot)
  • plotting simple statistics (box plot, mean plot, standard deviation plot)
  • etc.

Graphical Presentation of Data

The most common data structure is a collection of numbers. The structure is simple, but when the number of observations is very large, a long series of numbers makes it hard to see the overall characteristics of the data. Several techniques summarize the data and represent its characteristics and distribution graphically, among them the histogram, dot plot, stem-and-leaf plot, density trace, box plot, and probability plot.

Histogram

Figures: histogram 1, histogram 2

 

Dotplot

Binning: interval width = 1

Figure: dot plot 1

Binning: interval width = 2

Figure: dot plot 2

Binning: interval width = 10

Figure: dot plot 3
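The effect of the interval width can be sketched in a few lines of Python. This is an illustration only: the sample below is hypothetical (it is not the data behind the figures), the helper `dotplot` is invented here, and the dots are laid out in horizontal rows rather than stacked over an axis as in a drawn dot plot.

```python
from collections import Counter


def dotplot(values, width):
    """Return text dot plot rows, one per bin of the given interval width."""
    # Each value falls in the bin that starts at the largest multiple
    # of `width` not exceeding it.
    counts = Counter((v // width) * width for v in values)
    rows = []
    for start in range(min(counts), max(counts) + width, width):
        # A Counter returns 0 for bins that received no values.
        rows.append(f"{start:>3} | {'.' * counts[start]}")
    return rows


# Hypothetical integer sample.
sample = [3, 4, 4, 5, 7, 7, 8, 12, 13, 13, 14, 14, 15, 21, 22, 24]

for width in (1, 2, 10):
    print(f"Binning: interval width = {width}")
    print("\n".join(dotplot(sample, width)))
    print()
```

A narrow interval shows every individual value but leaves many gaps; a wide interval smooths the picture into a few tall stacks and can hide detail such as the gap between 8 and 12 in this sample.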

 

Stem-and-leaf plot

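MINITAB:
Stem-and-leaf of Nilai Ujian (exam scores)  N = 80
Leaf Unit = 1.0

   2   3 | 58
   5   4 | 389
   8   5 | 169
  19   6 | 00133356778
(24)   7 | 000011122233444455667899
  37   8 | 0000111223334566788889
  15   9 | 000111223335789

Columns, left to right: f, stem | leaf. Here f is the cumulative count taken from the nearer end of the data; the value in parentheses is the number of observations in the row that contains the median.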
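The display above can be reproduced with a short Python sketch. The scores are read back from the leaves shown in the MINITAB output; the function name `stem_and_leaf` and its layout rules are assumptions made for this illustration, not MINITAB's own code, and stems with no leaves are simply skipped.

```python
from collections import defaultdict

# Exam scores recovered from the display above (stem = tens digit).
LEAVES = {
    3: "58",
    4: "389",
    5: "169",
    6: "00133356778",
    7: "000011122233444455667899",
    8: "0000111223334566788889",
    9: "000111223335789",
}
scores = [10 * stem + int(leaf) for stem, leaves in LEAVES.items() for leaf in leaves]


def stem_and_leaf(values):
    """Return (count column, stem, leaves) rows for two-digit integers."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)
    n = len(values)
    rows, before = [], 0
    for stem in sorted(groups):
        size = len(groups[stem])
        after = n - before - size      # observations in later rows
        if after >= n / 2:             # row lies before the median: count from the top
            count = str(before + size)
        elif before >= n / 2:          # row lies after the median: count from the bottom
            count = str(after + size)
        else:                          # row holds the median: show its own size
            count = f"({size})"
        rows.append((count, stem, "".join(map(str, groups[stem]))))
        before += size
    return rows


for count, stem, leaves in stem_and_leaf(scores):
    print(f"{count:>4} {stem:>3} | {leaves}")
```

Unlike a histogram, the stem-and-leaf plot keeps every observation readable, so the raw data can be recovered from the plot, which is exactly what the first lines of the sketch do.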

Box-plot

Figures: box plot; box plot by data group
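The numbers a box plot draws can be computed directly. The sketch below uses the 80 exam scores from the stem-and-leaf display in the previous section. It rests on two assumptions that this article does not specify: quartiles are taken with the default ("exclusive") method of Python's `statistics.quantiles`, and points farther than 1.5 times the interquartile range from the box are flagged as outliers. Other packages use other quartile conventions and may give slightly different values. The function name `box_plot_summary` is invented for this illustration.

```python
import statistics

# Exam scores recovered from the stem-and-leaf display (stem = tens digit).
LEAVES = {
    3: "58",
    4: "389",
    5: "169",
    6: "00133356778",
    7: "000011122233444455667899",
    8: "0000111223334566788889",
    9: "000111223335789",
}
scores = [10 * stem + int(leaf) for stem, leaves in LEAVES.items() for leaf in leaves]


def box_plot_summary(values):
    """Five-number summary, whisker ends, and outliers by the 1.5 * IQR rule."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "whisker_low": inside[0],     # lowest value that is not an outlier
        "whisker_high": inside[-1],   # highest value that is not an outlier
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


summary = box_plot_summary(scores)
for key, value in summary.items():
    print(f"{key:>12}: {value}")
```

The box spans the first to the third quartile with a line at the median, the whiskers reach to the most extreme values inside the fences, and anything beyond is drawn as an individual point.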