My EDA Checklist for Every New Dataset

Every time I receive a new dataset, I follow the same sequence of steps before touching a model. This checklist has saved me from countless mistakes — missing values disguised as zeros, target leakage hiding in plain sight, and distributions so skewed that any model trained on raw features would be unreliable.

Step 1 — Understand the Shape and Schema

df.shape          # rows × columns
df.dtypes         # data types
df.head(10)       # first look
df.describe()     # summary statistics for numerics

The first thing I check is whether the row and column counts match what the data's source or documentation described. Discrepancies here often signal a loading error or a join gone wrong.
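One way to make that comparison explicit is sketched below. The toy DataFrame, the column names, and the expected values are all hypothetical stand-ins for whatever the data's documentation actually promises.

```python
import pandas as pd

# Toy frame standing in for a freshly loaded dataset (hypothetical columns).
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [34, 51, 29],
    "signup_date": ["2021-01-05", "2021-02-11", "2021-03-02"],
})

# What the documentation or the sender described (assumed values).
expected_rows = 3
expected_columns = {"customer_id", "age", "signup_date", "churned"}

# Compare what was loaded against what was promised.
row_mismatch = len(df) != expected_rows
missing_columns = sorted(expected_columns - set(df.columns))
unexpected_columns = sorted(set(df.columns) - expected_columns)

print(f"row count mismatch: {row_mismatch}")
print(f"missing columns: {missing_columns}")
print(f"unexpected columns: {unexpected_columns}")
```

A missing column like the hypothetical `churned` above is the kind of discrepancy worth resolving before going any further.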

Step 2 — Audit Missing Values

missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
missing_pct[missing_pct > 0].sort_values(ascending=False)

I categorize columns by missingness: under 5% (safe to impute), 5–30% (impute with care), over 30% (consider dropping or engineering a missingness indicator). The pattern of missingness matters too — is it random, or does it correlate with the target?
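The bucketing and the "does missingness correlate with the target?" question can be sketched roughly as follows. The data and column names are made up, the helper `missingness_bucket` is hypothetical, and how to treat values exactly on the 5% and 30% boundaries is my assumption.

```python
import numpy as np
import pandas as pd

# Toy data; all column names here are hypothetical.
df = pd.DataFrame({
    "income": [52.0, np.nan, 61.0, np.nan, 48.0, 39.0, 75.0, 58.0, 66.0, 43.0],
    "age": [34, 51, 29, 42, 38, 45, 33, 27, 60, 41],
    "referrer": [None, "ad", None, None, "ad", None, None, "web", None, None],
    "target": [0, 1, 0, 1, 0, 0, 0, 0, 0, 0],
})

missing_pct = df.isnull().mean() * 100


def missingness_bucket(pct: float) -> str:
    """Map a missing-value percentage to one of the three checklist buckets."""
    if pct < 5:
        return "under 5%"
    if pct <= 30:
        return "5-30%"
    return "over 30%"


# Only columns that actually have missing values get a bucket.
buckets = missing_pct[missing_pct > 0].map(missingness_bucket)
print(buckets)

# Compare the target rate where a column is missing vs. where it is present.
is_missing = df["income"].isnull()
rate_when_missing = df.loc[is_missing, "target"].mean()
rate_when_present = df.loc[~is_missing, "target"].mean()
print(f"target rate, income missing: {rate_when_missing:.2f}")
print(f"target rate, income present: {rate_when_present:.2f}")
```

A large gap between the two rates suggests the missingness itself carries information, which is one argument for a missingness indicator rather than silent imputation.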

Step 3 — Examine Distributions

For numeric columns, I plot histograms and check skewness. Highly skewed features (|skew| > 1) often benefit from a log or square-root transformation before modeling.
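A minimal sketch of the skewness check, using synthetic data with hypothetical column names. It flags columns whose absolute skewness exceeds 1 and applies `np.log1p`, which assumes the flagged columns contain no negative values.

```python
import numpy as np
import pandas as pd

# Synthetic data: one right-skewed column, one roughly symmetric column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "revenue": rng.lognormal(mean=3.0, sigma=1.0, size=2000),
    "height": rng.normal(loc=170, scale=10, size=2000),
})

# Sample skewness per column; flag those beyond the |skew| > 1 threshold.
skewness = df.skew()
skewed_cols = skewness[skewness.abs() > 1].index.tolist()
print(skewness)
print(f"flagged: {skewed_cols}")

# log1p computes log(1 + x), so zeros are fine but negatives are not.
transformed = df.copy()
transformed[skewed_cols] = np.log1p(transformed[skewed_cols])
print(transformed[skewed_cols].skew())
```

Rechecking skewness after the transformation is worth the extra line, since a log does not fix every skewed column.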

For categorical columns, I check cardinality. A column with 10,000 unique values in a 12,000-row dataset is essentially an ID column and should be dropped or encoded carefully.
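The cardinality check might look like the sketch below. The data is made up, and the 0.8 cutoff for "ID-like" is an assumption chosen only because the 10,000-in-12,000 example works out to roughly 0.83.

```python
import pandas as pd

# Toy data with one ID-like column and one ordinary categorical column.
df = pd.DataFrame({
    "order_id": [f"A{i:04d}" for i in range(1000)],
    "country": ["US", "DE", "FR", "JP"] * 250,
})

# Share of rows that carry a distinct value, per column.
unique_ratio = df.nunique() / len(df)

# Assumed cutoff: treat anything above 0.8 as ID-like.
id_like = unique_ratio[unique_ratio > 0.8].index.tolist()
print(unique_ratio)
print(f"ID-like columns: {id_like}")
```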

Step 4 — Check for Duplicates and Outliers

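df.duplicated().sum()   # number of fully duplicated rows

# IQR-based outlier detection for one numeric column
# (col is a placeholder for the name of the column being inspected)
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
outliers = df[(df[col] < lower) | (df[col] > upper)]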

Outliers are not always errors — sometimes they are the most interesting observations. The question is whether they represent genuine signal or data quality issues.
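To run the same IQR rule over several columns at once, it can be wrapped in a small helper. This is a sketch on made-up data; the function name `iqr_outlier_mask` and the column names are hypothetical.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value falls outside
    [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy data: one price is far from the rest.
df = pd.DataFrame({
    "price": [10, 12, 11, 13, 12, 11, 10, 250],
    "qty": [2, 2, 2, 3, 3, 3, 2, 3],
})

# Count flagged rows per column, then pull the flagged rows for inspection.
outlier_counts = {col: int(iqr_outlier_mask(df[col]).sum()) for col in df.columns}
flagged_rows = df[iqr_outlier_mask(df["price"])]
print(outlier_counts)
print(flagged_rows)
```

Returning a mask rather than dropping rows keeps the decision separate from the detection: the flagged rows can be inspected first, then kept, capped, or removed.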

Step 5 — Explore Relationships with the Target

A correlation heatmap gives a quick overview, but it only captures linear relationships. I supplement it with scatter plots for the top correlated features and box plots for categorical predictors.
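The limitation of linear correlation is easy to demonstrate on synthetic data. In the sketch below, all column names are hypothetical. The target depends on `curved` through a square, so the Pearson coefficient is near zero even though the relationship is strong. The binned means and per-group quartiles are numeric stand-ins for what a scatter plot and a box plot would show.

```python
import numpy as np
import pandas as pd

# Synthetic data: the target is linear in one feature and quadratic in another.
rng = np.random.default_rng(0)
n = 1000
linear = rng.normal(size=n)
curved = rng.uniform(-3, 3, size=n)
df = pd.DataFrame({
    "linear": linear,
    "curved": curved,
    "plan": np.where(linear > 0, "pro", "free"),  # categorical predictor
    "target": 2 * linear + curved**2 + rng.normal(scale=0.5, size=n),
})

# Pearson correlation of each numeric feature with the target.
corr = df[["linear", "curved", "target"]].corr()["target"].drop("target")
print(corr)

# Mean target per bin of `curved`: a U-shape the correlation cannot see.
bins = pd.cut(df["curved"], bins=[-3, -1, 1, 3], include_lowest=True)
binned_means = df.groupby(bins, observed=True)["target"].mean().tolist()
print(binned_means)

# Quartiles of the target per category: what a box plot would display.
plan_summary = df.groupby("plan")["target"].describe()[["25%", "50%", "75%"]]
print(plan_summary)
```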

Final Note

EDA is not a one-time step. As modeling reveals unexpected behavior, I return to the data to investigate. Treat it as an ongoing conversation with the dataset, not a checkbox to tick before the “real” work begins.