Exploring and Visualizing Data

Shan-Hung Wu & DataLab
Fall 2022

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an important and recommended first step of Machine Learning (prior to the training of a machine learning model that are more commonly seen in research papers). EDA performs the exploration and exploitation steps iteratively. In the exploration step, you "explore" the data, usually by visualizing them in different ways, to discover some characteristics of data. Then, in the exploitation step, you use the identified characteristics to figure out the next things to explore. You then repeat the above two steps until you are satisfied with what you have learned from the data. Data visualization plays an important role in EDA. Next, we use the Wine dataset from the UCI machine learning repository as an example dataset and show some common and useful plots.

Visualizing the Important Characteristics of a Dataset

Let's download the Wine dataset using Pandas first:

As we can see, showing data row-by-row with their column names does not help us get the "big picture" and characteristics of data.

NOTE: pd.read_csv() function returns a pandas.DataFrame object. Pandas Dataframe is an useful "two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes".

Pairwise Join Distributions

We can instead see the join distribution of any pair of columns/attributes/variables/features by using the pairplot function offered by Seaborn, which is based on Matplotlib: