🎯 Situation
A client's CRM export arrived with customer names in seven different formats: DUPONT JEAN; Dupont, Jean; jean dupont; J. DUPONT; Dupont J.; JEAN DUPONT (DUPLICATE); and blank. Same person, seven different records. Power Query handles casing. But deduplication, fuzzy matching, and blank-value handling require logic that the M language handles awkwardly.
⚠️ Challenge
🔋 The 4 most common dirty data problems
- Inconsistent string formats — names, addresses, product codes entered differently by different people
- Duplicate records — same entity with slightly different spellings or IDs
- Missing values — blanks, nulls, 'N/A', 'n/a', '0' all meaning 'no data'
- Type inconsistencies — revenue as text in some rows, number in others; dates as strings
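Before cleaning, it helps to measure how bad each problem is. A minimal sketch, using a hypothetical sample frame (column names and values are illustrative, not from the actual export):

```python
import pandas as pd

# Hypothetical sample mimicking the four problems above
df = pd.DataFrame({
    "customer_name": ["DUPONT JEAN", " jean dupont ", "Dupont, Jean", None],
    "revenue": ["1,200", "1200", "N/A", "950"],
})

# 1. Inconsistent string formats: count raw spellings vs normalized ones
raw = df["customer_name"].dropna().nunique()
norm = (df["customer_name"].dropna()
        .str.strip().str.lower()
        .str.replace(r"[^a-z ]", "", regex=True)
        .nunique())
print(f"{raw} raw spellings collapse to {norm} after normalization")

# 3. Missing-value sentinels: unify 'N/A', 'n/a', '' into a real NA
sentinels = ["N/A", "n/a", ""]
df["revenue"] = df["revenue"].replace(sentinels, pd.NA)

# 4. Type inconsistency: coerce to numeric; invalid entries become NaN
df["revenue"] = pd.to_numeric(df["revenue"].str.replace(",", ""),
                              errors="coerce")
print(df["revenue"].isna().sum(), "rows with no usable revenue")
```

Running a profile like this first tells you which of the four problems actually dominates the file, before you write any cleaning logic.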
🔧 Why Python over Power Query for hard cleaning
- pandas has dedicated string methods: str.strip(), str.lower(), str.replace() — readable and chainable
- drop_duplicates() with subset parameter handles deduplication in one line
- fillna() with different strategies per column (forward fill, zero fill, median)
- pd.to_datetime() with errors='coerce' handles mixed date formats without crashing
- Python is testable — write a unit test for your cleaning logic, run it against sample data before deploying
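The per-column `fillna` strategies and `errors='coerce'` behavior mentioned above can be sketched as follows. Column names and values are hypothetical; `format="mixed"` assumes pandas 2.0 or later:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", None, "South", None],
    "units": [10, None, 7, None],
    "signup": ["2024-01-05", "05/01/2024", "not a date", None],
})

# Different fill strategy per column:
# forward-fill categoricals, zero-fill counts
df["region"] = df["region"].ffill()
df["units"] = df["units"].fillna(0)

# Mixed date formats: errors='coerce' turns unparseable values into NaT
# instead of raising (format="mixed" requires pandas >= 2.0)
df["signup"] = pd.to_datetime(df["signup"], errors="coerce", format="mixed")
```

The equivalent logic in M requires nested `try ... otherwise` expressions per column; in pandas each rule is one readable line.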
🔍 Analysis
A real cleaning script for the CRM export problem:
import pandas as pd
import re

df = pd.read_csv('crm_export.csv')
original_count = len(df)

# Normalize customer names: trim, collapse whitespace, title-case
df['customer_name'] = (df['customer_name']
    .str.strip()
    .str.lower()
    .str.replace(r'\s+', ' ', regex=True)
    .str.title()
)

# Remove duplicate markers such as "(Duplicate)"
df['customer_name'] = df['customer_name'].str.replace(
    r'\s*\(duplicate\)', '', regex=True, flags=re.IGNORECASE)

# Handle blanks
df['customer_name'] = df['customer_name'].replace('', pd.NA)
df = df.dropna(subset=['customer_name'])

# Deduplicate on name + email
df = df.drop_duplicates(subset=['customer_name', 'email'], keep='last')

# Fix revenue column (mixed text/number)
df['revenue'] = pd.to_numeric(
    df['revenue'].astype(str).str.replace('[,$]', '', regex=True),
    errors='coerce').fillna(0)

# Export clean file
df.to_csv('crm_clean.csv', index=False)
print(f"Clean: {len(df)} rows (removed {original_count - len(df)} records)")
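The testability claim above is worth demonstrating. A minimal sketch: wrap the name-normalization steps in a function (`clean_names` is a hypothetical helper, not part of the script above) and assert that known variants collapse to one value:

```python
import pandas as pd

def clean_names(s: pd.Series) -> pd.Series:
    """Hypothetical wrapper around the normalization steps above."""
    return (s.str.strip()
             .str.lower()
             .str.replace(r'\s+', ' ', regex=True)
             .str.title())

def test_clean_names_collapses_variants():
    raw = pd.Series(['  DUPONT JEAN ', 'dupont   jean', 'Dupont Jean'])
    cleaned = clean_names(raw)
    # All three spellings should normalize to the same value
    assert cleaned.nunique() == 1
    assert cleaned.iloc[0] == 'Dupont Jean'
```

Run it with pytest against a sample of the real export before the script ever touches production data. Power Query offers no comparable way to test a transformation in isolation.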
✓️ Best practice
The ETL principle for data cleaning:
1. Never clean data in Power BI — fix it before it arrives. Power Query transformations are applied at every refresh and are hard to test.
2. Python script runs once nightly — cleans the source export and writes a clean file or database table. Power BI connects to the clean output.
3. Log what was cleaned — write a cleaning report: how many rows removed, how many values normalized, how many duplicates dropped. This is your data quality audit trail.
4. Keep raw data alongside clean data — never overwrite the source. If your cleaning logic has a bug, you need to be able to rerun against the original.
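Point 3 can be as simple as a JSON summary written next to the clean file. A minimal sketch; `write_cleaning_report` is a hypothetical helper and the counters are illustrative:

```python
import json
import pandas as pd

def write_cleaning_report(before: pd.DataFrame, after: pd.DataFrame,
                          path: str) -> dict:
    """Hypothetical audit-trail helper: summarize what cleaning removed."""
    report = {
        "rows_in": len(before),
        "rows_out": len(after),
        "rows_removed": len(before) - len(after),
    }
    with open(path, "w") as f:
        json.dump(report, f, indent=2)
    return report
```

Called at the end of the nightly run, this gives you a dated record per refresh: when a stakeholder asks why row counts dropped, the answer is in the report, not in someone's memory.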
💡 Synthesis
Dirty data doesn't stay dirty. It either gets cleaned deliberately before it reaches the report — or it stays dirty and erodes trust in every number the dashboard shows. Python is the right tool for deliberate, documented, testable data cleaning. Power BI is the right tool for what comes after.
👉 Garbage in, garbage out — confidently displayed in a well-designed dashboard.
Clean the data before it reaches Power BI. Python is the right place to do it.