# inline plotting instead of popping out
%matplotlib inline
import os
import numpy as np # numpy 1.26.0
import pandas as pd # pandas 2.1.1
import matplotlib.pyplot as plt # matplotlib 3.8.0
import seaborn as sns # seaborn 0.13.0
from sklearn.preprocessing import StandardScaler # scikit-learn 1.3.1
Exploratory Data Analysis (EDA) is an important and recommended first step of machine learning, carried out before training the models that are more commonly discussed in research papers. EDA alternates between an exploration step and an exploitation step. In the exploration step, you "explore" the data, usually by visualizing it in different ways, to discover some of its characteristics. In the exploitation step, you use those characteristics to decide what to explore next. You repeat these two steps until you are satisfied with what you have learned from the data. Data visualization plays an important role in EDA. Next, we use the Wine dataset from the UCI Machine Learning Repository as an example and show some common and useful plots.
df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',
    header=None)
df.columns = [
    'Class label', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash',
    'Magnesium', 'Total phenols', 'Flavanoids', 'Nonflavanoid phenols',
    'Proanthocyanins', 'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
    'Proline'
]
X = df.drop(labels='Class label', axis=1)
y = df['Class label']
df.head()
| | Class label | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.20 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.50 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |
As we can see, showing the data row by row with the column names does not help us see the "big picture" or the characteristics of the data.
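A quick numeric overview is one way to get closer to that big picture before plotting. The sketch below is only an illustration and makes one assumption that is not part of this notebook: instead of downloading the CSV, it loads the copy of the Wine data bundled with scikit-learn through `load_wine`, so the column names are scikit-learn's and the class labels are 0, 1, 2 rather than 1, 2, 3.

```python
from sklearn.datasets import load_wine

# Assumption: use scikit-learn's bundled copy of the Wine data, so no
# network access is needed. The frame has 13 feature columns plus 'target'.
wine = load_wine(as_frame=True).frame

# One row per column: central tendency, spread, and range.
summary = wine.describe().T[['mean', 'std', 'min', 'max']]
print(summary)

# Number of samples in each class, ordered by class label.
class_counts = wine['target'].value_counts().sort_index()
print(class_counts)
```

The summary shows at a glance that the features live on very different scales, which is the kind of characteristic that is hard to spot from the first five rows alone.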
NOTE: The pd.read_csv() function returns a pandas.DataFrame object. A pandas DataFrame is a useful "two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes".
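The properties named in that definition can be seen on a tiny example. The following is a minimal sketch that builds a DataFrame by hand from the first three rows shown above, restricted to three columns. The variable name `small` and the derived column `Alcohol (rounded)` are hypothetical and exist only for illustration.

```python
import pandas as pd

# First three rows of the table above, restricted to three columns.
small = pd.DataFrame({
    'Class label': [1, 1, 1],
    'Alcohol': [14.23, 13.20, 13.16],
    'Magnesium': [127, 100, 101],
})

# Labeled axes: rows have an index, columns have names.
print(small.index.tolist())
print(small.columns.tolist())

# Potentially heterogeneous: each column keeps its own dtype.
print(small.dtypes)

# Size-mutable: a column can be added to the existing object.
# 'Alcohol (rounded)' is a hypothetical derived column.
small['Alcohol (rounded)'] = small['Alcohol'].round(1)
print(small.shape)
```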
We can instead look at the joint distribution of any pair of columns (also called attributes, variables, or features) by using the pairplot function offered by Seaborn, which is built on Matplotlib:
sns.set(style='whitegrid', context='notebook')
# One scatter plot per pair of features, colored by class label
sns.pairplot(
    df,
    vars=df.columns[1:],
    hue='Class label',
    height=2.5,
    palette=sns.color_palette('deep', df['Class label'].unique().size))
plt.tight_layout()
# Create the output directory if it does not exist yet
os.makedirs('./output', exist_ok=True)
plt.savefig('./output/fig-wine-scatter.png', dpi=300)
plt.show()