Every time I receive a new dataset, I follow the same sequence of steps before touching a model. This checklist has saved me from countless mistakes — missing values disguised as zeros, target leakage hiding in plain sight, and distributions so skewed that any model trained on raw features would be unreliable.
Step 1 — Understand the Shape and Schema
df.shape # rows × columns
df.dtypes # data types
df.head(10) # first look
df.describe() # summary statistics for numerics
The first thing I check is whether the number of rows and columns matches what was described. Discrepancies here often signal a loading error or a join gone wrong.
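To make that comparison repeatable, here is a minimal sketch of a schema check. The helper name `check_schema`, its parameters, and the toy column names are hypothetical; the expected values would come from whatever description accompanied the dataset.

```python
import pandas as pd

def check_schema(df: pd.DataFrame, expected_rows: int, expected_columns: list[str]) -> list[str]:
    """Return human-readable discrepancies; an empty list means the shape matches."""
    problems = []
    if len(df) != expected_rows:
        problems.append(f"row count: expected {expected_rows}, got {len(df)}")
    missing = [c for c in expected_columns if c not in df.columns]
    extra = [c for c in df.columns if c not in expected_columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    return problems

# Toy data standing in for a freshly loaded dataset (illustrative values only).
df = pd.DataFrame({"id": [1, 2, 3], "age": [34, 51, 29]})
problems = check_schema(df, expected_rows=3, expected_columns=["id", "age", "income"])
print(problems)
```

A row count far above the expected one is worth tracing back to the join keys, since a many-to-many merge silently multiplies rows.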
Step 2 — Audit Missing Values
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
missing_pct[missing_pct > 0].sort_values(ascending=False)
I categorize columns by missingness: under 5% (safe to impute), 5–30% (impute with care), over 30% (consider dropping the column or engineering a missingness indicator). The pattern of missingness matters too: is it random, or does it correlate with the target?
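The bucketing and the "does missingness correlate with the target?" question can be sketched as follows. The function name `missingness_bucket`, the toy columns, and the choice to put exactly 5% and exactly 30% in the middle bucket are my assumptions, and comparing target rates by a missingness flag is only one simple way to probe the pattern.

```python
import numpy as np
import pandas as pd

def missingness_bucket(pct: float) -> str:
    """Map a missing-value percentage to the three buckets used in this checklist."""
    if pct == 0:
        return "complete"
    if pct < 5:
        return "under 5%"
    if pct <= 30:
        return "5-30%"
    return "over 30%"

# Toy data (illustrative values only); `target` is a binary label.
df = pd.DataFrame({
    "income": [50.0, np.nan, 62.0, np.nan, 48.0, 55.0, np.nan, 70.0, np.nan, 58.0],
    "age": [34.0, 51.0, 29.0, 42.0, np.nan, 45.0, 60.0, 27.0, 33.0, 49.0],
    "target": [0, 1, 0, 1, 0, 0, 1, 0, 0, 0],
})

missing_pct = df.isnull().sum() * 100 / len(df)
buckets = missing_pct.map(missingness_bucket)

# Compare the mean target between rows where `income` is missing (1) and present (0).
target_rate = (
    df.assign(income_missing=df["income"].isnull().astype(int))
      .groupby("income_missing")["target"]
      .mean()
)
print(buckets)
print(target_rate)
```

A large gap between the two target rates suggests the missingness itself carries information, which is an argument for keeping an indicator column rather than imputing silently.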
Step 3 — Examine Distributions
For numeric columns, I plot histograms and check skewness. Highly skewed features (|skew| > 1) often benefit from a log or square-root transformation before modeling.
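As a rough sketch of that skewness check, the snippet below flags columns with |skew| > 1 and adds a log-transformed copy. The synthetic columns and the `_log` suffix are illustrative assumptions, and `np.log1p` is used here as one common log variant that tolerates zeros; it still requires non-negative values.

```python
import numpy as np
import pandas as pd

# Synthetic data: one right-skewed column and one roughly symmetric column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=1, size=1000),
    "height": rng.normal(loc=170, scale=10, size=1000),
})

skew = df.skew(numeric_only=True)
highly_skewed = skew[skew.abs() > 1].index.tolist()

for col in highly_skewed:
    # log1p is only defined for values greater than -1; restrict to non-negative columns.
    if (df[col] >= 0).all():
        df[f"{col}_log"] = np.log1p(df[col])

print(skew)
print(df.skew(numeric_only=True))
```

Re-checking skewness after the transformation confirms whether it actually helped; for left-skewed or signed data, a different transformation would be needed.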
For categorical columns, I check cardinality. A column with 10,000 unique values in a 12,000-row dataset is essentially an ID column and should be dropped or encoded carefully.
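A cardinality check along these lines might look like the sketch below. The helper `cardinality_report` and the 0.8 unique-ratio cutoff are hypothetical choices of mine (the 10,000-of-12,000 example above corresponds to a ratio of about 0.83), not a fixed rule.

```python
import pandas as pd

def cardinality_report(df: pd.DataFrame, columns: list[str], id_ratio: float = 0.8) -> pd.DataFrame:
    """Count unique values per column and flag columns that look like identifiers."""
    n_unique = df[columns].nunique()
    report = pd.DataFrame({
        "n_unique": n_unique,
        "unique_ratio": n_unique / len(df),
    })
    report["id_like"] = report["unique_ratio"] > id_ratio
    return report.sort_values("unique_ratio", ascending=False)

# Toy data (illustrative values only).
df = pd.DataFrame({
    "user_id": [f"u{i:03d}" for i in range(10)],
    "city": ["Lyon", "Oslo", "Lima", "Lyon", "Oslo", "Lima", "Lyon", "Oslo", "Lima", "Lyon"],
})
report = cardinality_report(df, columns=["user_id", "city"])
print(report)
```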
Step 4 — Check for Duplicates and Outliers
df.duplicated().sum() # number of fully duplicated rows
# IQR-based outlier detection for one numeric column; `col` holds its name
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df[col] < Q1 - 1.5*IQR) | (df[col] > Q3 + 1.5*IQR)]
Outliers are not always errors — sometimes they are the most interesting observations. The question is whether they represent genuine signal or data quality issues.
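One way to keep that question open is to flag outliers instead of dropping them, so they can be inspected later. The sketch below wraps the same IQR rule in a hypothetical helper `iqr_outlier_mask`; the `amount` column and its values are made up for illustration.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy data with one extreme value.
df = pd.DataFrame({"amount": [10, 12, 11, 13, 12, 11, 10, 500]})

# Keep the rows, but record which ones the rule flags.
df["amount_outlier"] = iqr_outlier_mask(df["amount"])
flagged = df[df["amount_outlier"]]
print(flagged)
```

With the flag stored as a column, the flagged rows can be reviewed by hand or compared against the target before deciding whether to cap, remove, or keep them.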
Step 5 — Explore Relationships with the Target
A correlation heatmap gives a quick overview, but it only captures linear relationships. I supplement it with scatter plots for the top correlated features and box plots for categorical predictors.
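To illustrate the limitation, the sketch below compares Pearson correlation with Spearman rank correlation on synthetic data. Spearman is my addition here, not part of the workflow described above; it captures monotonic relationships that are not linear, but it would still miss non-monotonic ones, which is where the plots earn their place. All column names are made up.

```python
import numpy as np
import pandas as pd

# Synthetic data: the target depends on x monotonically but very non-linearly.
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=500)
df = pd.DataFrame({
    "x_monotonic": x,
    "noise": rng.normal(size=500),
    "target": np.exp(2 * x),
})

# Correlation of every feature with the target under both methods.
pearson = df.corr(method="pearson")["target"].drop("target")
spearman = df.corr(method="spearman")["target"].drop("target")
comparison = pd.DataFrame({"pearson": pearson, "spearman": spearman})
print(comparison)
```

A feature whose Spearman value is much higher than its Pearson value is a good candidate for a scatter plot and possibly a transformation.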
Final Note
EDA is not a one-time step. As modeling reveals unexpected behavior, I return to the data to investigate. Treat it as an ongoing conversation with the dataset, not a checkbox to tick before the “real” work begins.