🎯 Situation
A client's CRM export arrived with customer names in seven different formats: DUPONT JEAN; Dupont, Jean; jean dupont; J. DUPONT; Dupont J.; JEAN DUPONT (DUPLICATE); and blank. Same person, seven different records. Power Query handles casing. But deduplication, fuzzy matching, and blank-value handling require logic that the M language handles awkwardly.
⚠️ Challenge
🔋 The 4 most common dirty data problems
- Inconsistent string formats — names, addresses, product codes entered differently by different people
- Duplicate records — same entity with slightly different spellings or IDs
- Missing values — blanks, nulls, 'N/A', 'n/a', '0' all meaning 'no data'
- Type inconsistencies — revenue as text in some rows, number in others; dates as strings
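Before cleaning, it helps to measure how bad each problem is. A minimal sketch, using a hypothetical sample frame (column names and values are illustrative, not from the actual export):

```python
import pandas as pd

# Hypothetical sample mimicking the four problems above
df = pd.DataFrame({
    "customer_name": ["DUPONT JEAN", " jean dupont ", "Dupont, Jean", None],
    "revenue": ["1,200", "1200", "N/A", "950"],
})

# 1. Inconsistent string formats: count raw spellings vs normalized ones
raw = df["customer_name"].dropna().nunique()
norm = (df["customer_name"].dropna()
        .str.strip().str.lower()
        .str.replace(r"[^a-z ]", "", regex=True)
        .nunique())
print(f"{raw} raw spellings collapse to {norm} after normalization")

# 3. Missing-value sentinels: unify 'N/A', 'n/a', '' into a real NA
sentinels = ["N/A", "n/a", ""]
df["revenue"] = df["revenue"].replace(sentinels, pd.NA)

# 4. Type inconsistency: coerce to numeric; invalid entries become NaN
df["revenue"] = pd.to_numeric(df["revenue"].str.replace(",", ""),
                              errors="coerce")
print(df["revenue"].isna().sum(), "rows with no usable revenue")
```

Running a profile like this first tells you which of the four problems actually dominates the file, before you write any cleaning logic.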
🔧 Why Python over Power Query for hard cleaning
- pandas has dedicated string methods: str.strip(), str.lower(), str.replace() — readable and chainable
- drop_duplicates() with subset parameter handles deduplication in one line
- fillna() with different strategies per column (forward fill, zero fill, median)
- pd.to_datetime() with errors='coerce' handles mixed date formats without crashing
- Python is testable — write a unit test for your cleaning logic, run it against sample data before deploying
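The per-column `fillna` strategies and `errors='coerce'` behavior mentioned above can be sketched as follows. Column names and values are hypothetical; `format="mixed"` assumes pandas 2.0 or later:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", None, "South", None],
    "units": [10, None, 7, None],
    "signup": ["2024-01-05", "05/01/2024", "not a date", None],
})

# Different fill strategy per column:
# forward-fill categoricals, zero-fill counts
df["region"] = df["region"].ffill()
df["units"] = df["units"].fillna(0)

# Mixed date formats: errors='coerce' turns unparseable values into NaT
# instead of raising (format="mixed" requires pandas >= 2.0)
df["signup"] = pd.to_datetime(df["signup"], errors="coerce", format="mixed")
```

The equivalent logic in M requires nested `try ... otherwise` expressions per column; in pandas each rule is one readable line.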
🔍 Analysis
A real cleaning script for the CRM export problem:
import pandas as pd
import re

df = pd.read_csv('crm_export.csv')
original_count = len(df)

# Normalize customer names: trim, collapse whitespace, title-case
df['customer_name'] = (df['customer_name']
    .str.strip()
    .str.lower()
    .str.replace(r'\s+', ' ', regex=True)
    .str.title()
)

# Remove duplicate markers such as "(Duplicate)"
df['customer_name'] = df['customer_name'].str.replace(
    r'\s*\(duplicate\)', '', regex=True, flags=re.IGNORECASE)

# Handle blanks
df['customer_name'] = df['customer_name'].replace('', pd.NA)
df = df.dropna(subset=['customer_name'])

# Deduplicate on name + email
df = df.drop_duplicates(subset=['customer_name', 'email'], keep='last')

# Fix revenue column (mixed text/number)
df['revenue'] = pd.to_numeric(
    df['revenue'].astype(str).str.replace('[,$]', '', regex=True),
    errors='coerce').fillna(0)

# Export clean file
df.to_csv('crm_clean.csv', index=False)
print(f"Clean: {len(df)} rows (removed {original_count - len(df)} records)")
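The testability claim above is worth demonstrating. A minimal sketch: wrap the name-normalization steps in a function (`clean_names` is a hypothetical helper, not part of the script above) and assert that known variants collapse to one value:

```python
import pandas as pd

def clean_names(s: pd.Series) -> pd.Series:
    """Hypothetical wrapper around the normalization steps above."""
    return (s.str.strip()
             .str.lower()
             .str.replace(r'\s+', ' ', regex=True)
             .str.title())

def test_clean_names_collapses_variants():
    raw = pd.Series(['  DUPONT JEAN ', 'dupont   jean', 'Dupont Jean'])
    cleaned = clean_names(raw)
    # All three spellings should normalize to the same value
    assert cleaned.nunique() == 1
    assert cleaned.iloc[0] == 'Dupont Jean'
```

Run it with pytest against a sample of the real export before the script ever touches production data. Power Query offers no comparable way to test a transformation in isolation.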
✓️ Best practice
The ETL principle for data cleaning:
1. Never clean data in Power BI — fix it before it arrives. Power Query transformations are applied at every refresh and are hard to test.
2. Python script runs once nightly — cleans the source export and writes a clean file or database table. Power BI connects to the clean output.
3. Log what was cleaned — write a cleaning report: how many rows removed, how many values normalized, how many duplicates dropped. This is your data quality audit trail.
4. Keep raw data alongside clean data — never overwrite the source. If your cleaning logic has a bug, you need to be able to rerun against the original.
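Point 3 can be as simple as a JSON summary written next to the clean file. A minimal sketch; `write_cleaning_report` is a hypothetical helper and the counters are illustrative:

```python
import json
import pandas as pd

def write_cleaning_report(before: pd.DataFrame, after: pd.DataFrame,
                          path: str) -> dict:
    """Hypothetical audit-trail helper: summarize what cleaning removed."""
    report = {
        "rows_in": len(before),
        "rows_out": len(after),
        "rows_removed": len(before) - len(after),
    }
    with open(path, "w") as f:
        json.dump(report, f, indent=2)
    return report
```

Called at the end of the nightly run, this gives you a dated record per refresh: when a stakeholder asks why row counts dropped, the answer is in the report, not in someone's memory.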
💡 Synthesis
Dirty data doesn't stay dirty. It either gets cleaned deliberately before it reaches the report — or it stays dirty and erodes trust in every number the dashboard shows. Python is the right tool for deliberate, documented, testable data cleaning. Power BI is the right tool for what comes after.
👉 Garbage in, garbage out — confidently displayed in a well-designed dashboard.
Clean the data before it reaches Power BI. Python is the right place to do it.