Skip to content

SQL + Python Notebook Workflow Guidebook for Analysts


🎯 Purpose

This guidebook outlines a production-aligned, modular workflow for analysts who combine SQL + Python in notebooks. It emphasizes separation of logic, reproducibility, and efficient transitions between data querying, EDA, and modeling.


🧱 1. Use SQL to Prepare β€” Not Model

Use SQL For Avoid Doing in SQL
Joins + filtering Full-feature normalization
Pre-aggregation Custom feature generation
Cohorting, funnels Modeling or residual logic
Deduplication Outlier transformations

βœ”οΈ SQL should stage structured, clean inputs β€” leave exploration for Python


πŸ”„ 2. Pulling SQL into Python

πŸ”Ή BigQuery Example

from google.cloud import bigquery
import pandas as pd

client = bigquery.Client()
query = """
    SELECT user_id, event_date, spend
    FROM project.dataset.transactions
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
"""
df = client.query(query).to_dataframe()

πŸ”Ή Postgres or Snowflake Example (via SQLAlchemy)

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine("postgresql://user:password@host/db")
df = pd.read_sql("SELECT * FROM users", engine)

πŸ§ͺ 3. Notebook Structure for Analytics

[ Cell 1: Imports ]
[ Cell 2: SQL Query or Data Import ]
[ Cell 3: Initial sanity checks (df.head(), df.info()) ]
[ Cell 4: EDA β€” value_counts, describe(), histograms ]
[ Cell 5: Cleaning β€” nulls, dtypes, mapping ]
[ Cell 6: Feature generation ]
[ Cell 7: Modeling or export ]

βœ”οΈ Use markdown cells to label each phase clearly βœ”οΈ Save query text in markdown cell or .sql file for versioning


🧹 4. SQL View + Python Transform Pattern

Use SQL to stage a view and Python to finish the analysis.

SQL View Example:

-- Create a view in BigQuery or DB
CREATE OR REPLACE VIEW cohort_summary AS
SELECT cohort_month, days_since_signup, COUNT(*) AS retained_users
FROM raw.events
GROUP BY 1, 2;

Python:

df = client.query("SELECT * FROM cohort_summary").to_dataframe()
sns.lineplot(data=df, x="days_since_signup", y="retained_users", hue="cohort_month")

πŸ› οΈ 5. Best Practices

  • Use parameterized queries to dynamically fetch different timeframes
  • Store SQL queries in separate .sql files or markdown cells
  • Version-control final notebooks, but treat exploration as flexible
  • Use .query() or .loc[] with readable filters (df[df["country"] == "US"])

βœ… SQL + Python Workflow Checklist

  • [ ] Data pulled from SQL using tested/staged view or saved query
  • [ ] Notebook includes import, structure, and EDA cells
  • [ ] SQL only handles structural prep (joins, filters, grouping)
  • [ ] Python handles flexible EDA, feature work, plotting
  • [ ] Comments and markdown explain workflow clearly

πŸ’‘ Tip

β€œWrite SQL like a database engineer. Explore like a scientist. Ship notebooks like a product analyst.”