🎯 Situation
A data analyst at a logistics client built a solid Python script over three weeks: it pulled daily shipment data from an API, cleaned it with pandas, and loaded it into Azure SQL. The script worked perfectly. She ran it every morning at 7 AM, manually, from her laptop. When she was on vacation, the data stopped flowing. When her laptop was in for repairs, the Power BI dashboard showed a week of blanks. The script was good. The deployment wasn't.
⚠️ Challenge
The script itself isn't the problem; the deployment is. How do you get a working Python script to run every morning without depending on one person, one laptop, and one routine? The options, from simplest to most robust:
📋 Options 1 & 2: Local scheduling
- Windows Task Scheduler — open Task Scheduler → Create Basic Task → trigger: daily at 7 AM → action: start python.exe with your script's path. Works while the machine is on. Zero cost.
- macOS/Linux cron — run crontab -e and add 0 7 * * * /usr/bin/python3 /path/to/script.py. Same limitation: the machine must be on (and awake) at 7 AM. Command-line setups for both are sketched after this list.
- Right for: personal scripts, development machines, low-stakes data that can tolerate occasional missed runs.
- Not right for: production pipelines where Power BI depends on the data being current every morning.
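For reference, here are minimal command-line setups for both platforms. The paths, task name, and log location are placeholders; adjust them to your machine:

# macOS/Linux: open your crontab with `crontab -e`, then add one line.
# Redirecting output gives you a rudimentary log for free.
0 7 * * * /usr/bin/python3 /path/to/script.py >> /path/to/pipeline.log 2>&1

# Windows: the same daily 7 AM task, created from the command line instead of the GUI
schtasks /Create /TN "DailyPipeline" /TR "python C:\scripts\pipeline.py" /SC DAILY /ST 07:00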
⛅ Options 3–5: Cloud scheduling — production grade
- Azure Functions (Consumption plan) — deploy your Python script as a Function with a Timer trigger (NCRONTAB syntax, six fields with seconds first); it runs in the cloud on schedule regardless of any laptop. Cost: ~$0.20/month for daily runs. Sketched after this list.
- Azure Data Factory (ADF) — pipeline orchestration; a Python step runs via a Custom activity (on Azure Batch) or a Databricks notebook activity. More complex to set up, but dependency management, retry logic, and monitoring come built in. Right for multi-step pipelines.
- GitHub Actions — free for public repos; private repos get 2,000 free minutes/month on the free plan. Schedule a workflow that checks out your repo and runs the script; logs come automatically. A good middle ground between local and full Azure (workflow sketch below).
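Here's what the Azure Functions option looks like in the Python v2 programming model. The function name is illustrative, and run_pipeline stands in for your existing entry point (the wrapper shown under Best Practice below); Timer triggers use six-field NCRONTAB expressions, seconds first:

import azure.functions as func
import logging

app = func.FunctionApp()

# Six-field NCRONTAB, seconds first: "0 0 7 * * *" = 07:00 (UTC by default) daily
@app.schedule(schedule="0 0 7 * * *", arg_name="mytimer", run_on_startup=False)
def daily_pipeline(mytimer: func.TimerRequest) -> None:
    logging.info("Timer fired; running pipeline")
    run_pipeline()  # your extract/transform/load entry point

And the GitHub Actions equivalent, a workflow file at .github/workflows/pipeline.yml. The script name and secret name here are assumptions, and schedule times are UTC:

name: daily-pipeline
on:
  schedule:
    - cron: "0 7 * * *"    # 07:00 UTC daily; adjust for your timezone
  workflow_dispatch:        # lets you trigger a run manually for testing
jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: python script.py
        env:
          SQL_CONNECTION_STRING: ${{ secrets.SQL_CONNECTION_STRING }}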
🔍 Analysis
The three production requirements for any scheduled script:
- Logging — write every run's output to a file or database: start time, end time, rows processed, errors. If something breaks, you need to know what happened and when.
- Error alerting — if the script fails, an email or Teams notification fires immediately. Not the next morning when someone notices the dashboard is blank.
- Idempotency — if the script runs twice (e.g., a retry after a failure), it should produce the correct result, not double-insert data. Use an UPSERT (MERGE in Azure SQL / T-SQL; INSERT ... ON CONFLICT DO UPDATE in PostgreSQL) or a truncate-and-reload pattern; see the sketch after this list.
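A minimal idempotent load in T-SQL, since the destination here is Azure SQL. Table and column names are hypothetical; the pattern is what matters: stage the day's batch, then MERGE it into the target keyed on a natural ID, so a re-run updates rows instead of duplicating them.

-- Stage the day's batch into dbo.shipments_staging first, then:
MERGE dbo.shipments AS target
USING dbo.shipments_staging AS source
    ON target.shipment_id = source.shipment_id
WHEN MATCHED THEN
    UPDATE SET target.status = source.status,
               target.updated_at = source.updated_at
WHEN NOT MATCHED THEN
    INSERT (shipment_id, status, updated_at)
    VALUES (source.shipment_id, source.status, source.updated_at);

Run it twice and you get the same end state: that's the idempotency test.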
✅ Best Practice
The minimal production wrapper for any Python pipeline script:
import logging
import sys

logging.basicConfig(
    filename='pipeline.log',                           # one append-only log per pipeline
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s'     # timestamp every line
)

def run_pipeline():
    logging.info("Pipeline started")
    try:
        # --- your pipeline code here ---
        rows = extract_from_api()
        cleaned = transform(rows)
        load_to_sql(cleaned)
        logging.info(f"Success: {len(cleaned)} rows loaded")
    except Exception as e:
        logging.exception("Pipeline failed")  # logs the message plus the full traceback
        send_alert(str(e))                    # Teams / email notification (sketch below)
        sys.exit(1)                           # non-zero exit so the scheduler records a failure

if __name__ == "__main__":
    run_pipeline()
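send_alert above is a stub. Here's a minimal implementation, assuming a classic Teams incoming-webhook URL (the URL is a placeholder; an SMTP email would slot in the same way):

import requests  # third-party: pip install requests

# Placeholder: paste your channel's incoming-webhook URL here
TEAMS_WEBHOOK_URL = "https://example.webhook.office.com/..."

def send_alert(message: str) -> None:
    # Incoming webhooks accept a simple JSON payload with a "text" field
    requests.post(
        TEAMS_WEBHOOK_URL,
        json={"text": f"Pipeline failed: {message}"},
        timeout=10,
    )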
💡 Summary
Scheduling is the last 10% of building a pipeline — and the part that determines whether it actually works in production. A manually run script is a dependency: it depends on a person, a laptop, and a routine. A scheduled, logged, alerting script is infrastructure.
👉 The script is done when it runs itself.
Logging, alerting, scheduling — that's the last 10% that makes the other 90% reliable.