🎯 Situation

A data analyst at a logistics client built a solid Python script over three weeks: it pulled daily shipment data from an API, cleaned it with pandas, and loaded it into Azure SQL. The script worked perfectly. She ran it every morning at 7 AM, manually, from her laptop. When she was on vacation, the data stopped flowing. When her laptop was in for repairs, the Power BI dashboard showed a week of blanks. The script was good. The deployment wasn't.

👉 A script that runs manually is a prototype. A script that runs on a schedule, logs its output, alerts on failure, and doesn't depend on a specific laptop is a pipeline. The code is the same. The deployment is the difference.

⚠️ Challenge

📋 Option 1 & 2: Local scheduling

  • Windows Task Scheduler — open Task Scheduler → Create Task → Triggers → daily at 7 AM, with an Action that runs python.exe and your script path as the argument. Works only while the machine is on. Zero cost.
  • macOS/Linux cron — add: 0 7 * * * /usr/bin/python3 /path/to/script.py to crontab. Same limitation: requires the machine to be running.
  • Right for: personal scripts, development machines, low-stakes data that can tolerate occasional missed runs.
  • Not right for: production pipelines where Power BI depends on the data being current every morning.
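For the cron route, redirecting output to a log file means even a failed run leaves evidence; the paths below are illustrative:

```cron
# m h dom mon dow — run daily at 07:00, append stdout and stderr to a log
0 7 * * * /usr/bin/python3 /path/to/script.py >> /var/log/pipeline.log 2>&1
```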

⛅ Option 3, 4 & 5: Cloud scheduling — production grade

  • Azure Functions (Consumption plan) — deploy your Python script as a Function with a Timer trigger (six-field NCRONTAB syntax); it runs in the cloud on schedule, no laptop involved. Cost: pennies at most — the Consumption plan's monthly free grant comfortably covers a once-daily run.
  • Azure Data Factory (ADF) — pipeline orchestration with a Python activity. More complex to set up, but includes dependency management, retry logic, and monitoring built in. Right for multi-step pipelines.
  • GitHub Actions — free for public repos; private repos get 2,000 free minutes/month on the Free plan. Schedule a workflow to check out your repo and run the script. Logs are automatic. A good middle ground between local and full Azure.
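The GitHub Actions route can be sketched as a scheduled workflow. The file path, Python version, script name, and requirements file below are assumptions, not prescriptions:

```yaml
# .github/workflows/pipeline.yml — hypothetical names and paths
name: daily-pipeline
on:
  schedule:
    - cron: "0 7 * * *"    # 07:00 daily; note GitHub cron runs in UTC
  workflow_dispatch:        # allows a manual test run from the Actions tab
jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: python script.py   # API keys / SQL creds come from repo secrets
```

A failed step marks the run red in the Actions tab, and GitHub can notify the repo owner by email — logging and basic alerting with no extra code.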

🔍 Analysis

The three production requirements for any scheduled script:

  • Logging — write every run's output to a file or database: start time, end time, rows processed, errors. If something breaks, you need to know what happened and when.
  • Error alerting — if the script fails, an email or Teams notification fires immediately. Not the next morning when someone notices the dashboard is blank.
  • Idempotency — if the script runs twice (e.g., a retry after a failure), it should produce the correct result, not double-insert data. Use UPSERT (INSERT ... ON CONFLICT UPDATE) or truncate-and-reload patterns.
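The idempotency point can be made concrete with an UPSERT. This sketch uses SQLite for self-containment (Azure SQL would use MERGE or a comparable pattern), and the shipments table and its columns are invented for illustration:

```python
import sqlite3

# Hypothetical shipments table; SQLite stands in for Azure SQL here
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shipments (id INTEGER PRIMARY KEY, qty INTEGER)")

def load(rows):
    # UPSERT: a retried run updates existing rows instead of double-inserting
    conn.executemany(
        "INSERT INTO shipments (id, qty) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET qty = excluded.qty",
        rows,
    )
    conn.commit()

load([(1, 10), (2, 5)])   # first run
load([(1, 12), (2, 5)])   # retry with corrected data: updated, not duplicated
print(conn.execute("SELECT COUNT(*) FROM shipments").fetchone()[0])  # → 2
```

Running the load twice leaves two rows, not four — exactly the property a scheduler's automatic retry depends on.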

✓️ Best Practice

The minimal production wrapper for any Python pipeline script:

import logging, sys
from datetime import datetime

logging.basicConfig(
    filename='pipeline.log',
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s'
)

def run_pipeline():
    logging.info("Pipeline started")
    try:
        # --- your pipeline code here ---
        rows = extract_from_api()
        cleaned = transform(rows)
        load_to_sql(cleaned)
        logging.info(f"Success: {len(cleaned)} rows loaded")
    except Exception as e:
        logging.exception(f"Pipeline failed: {e}")  # records the full traceback
        send_alert(str(e))  # Teams / email notification
        sys.exit(1)  # non-zero exit tells the scheduler the run failed

if __name__ == "__main__":
    run_pipeline()
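The send_alert helper in the wrapper above is left undefined. A minimal sketch using a Teams incoming webhook might look like this — the URL is a placeholder, and the plain "text" payload is the simplest of several card formats Teams accepts:

```python
import json
import urllib.request

TEAMS_WEBHOOK_URL = "https://example.webhook.office.com/..."  # placeholder URL

def build_alert(message: str) -> bytes:
    # A minimal JSON card; Teams incoming webhooks accept a plain "text" field
    return json.dumps({"text": f"Pipeline failed: {message}"}).encode("utf-8")

def send_alert(message: str) -> None:
    req = urllib.request.Request(
        TEAMS_WEBHOOK_URL,
        data=build_alert(message),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)  # raises on failure, so a dead webhook is visible
```

For email instead of Teams, the same wrapper shape works with smtplib; the point is that the alert fires inside the except block, at failure time, not when someone notices a blank dashboard.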

💡 Summary

Scheduling is the last 20% of building a pipeline — and the part that determines whether it actually works in production. A manually run script is a dependency: it depends on a person, a laptop, and a routine. A scheduled, logged, alerting script is infrastructure.

👉 The script is done when it runs itself.

Logging, alerting, scheduling — that's the 20% that makes the 80% reliable.