SQL + Python Notebook Workflow Guidebook for Analysts
π― Purpose¶
This guidebook outlines a production-aligned, modular workflow for analysts who combine SQL + Python in notebooks. It emphasizes separation of logic, reproducibility, and efficient transitions between data querying, EDA, and modeling.
π§± 1. Use SQL to Prepare β Not Model¶
Use SQL For | Avoid Doing in SQL |
---|---|
Joins + filtering | Full-feature normalization |
Pre-aggregation | Custom feature generation |
Cohorting, funnels | Modeling or residual logic |
Deduplication | Outlier transformations |
βοΈ SQL should stage structured, clean inputs β leave exploration for Python
π 2. Pulling SQL into Python¶
πΉ BigQuery Example¶
from google.cloud import bigquery
import pandas as pd
client = bigquery.Client()
query = """
SELECT user_id, event_date, spend
FROM project.dataset.transactions
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
"""
df = client.query(query).to_dataframe()
πΉ Postgres or Snowflake Example (via SQLAlchemy)¶
from sqlalchemy import create_engine
import pandas as pd
engine = create_engine("postgresql://user:password@host/db")
df = pd.read_sql("SELECT * FROM users", engine)
π§ͺ 3. Notebook Structure for Analytics¶
[ Cell 1: Imports ]
[ Cell 2: SQL Query or Data Import ]
[ Cell 3: Initial sanity checks (df.head(), df.info()) ]
[ Cell 4: EDA β value_counts, describe(), histograms ]
[ Cell 5: Cleaning β nulls, dtypes, mapping ]
[ Cell 6: Feature generation ]
[ Cell 7: Modeling or export ]
βοΈ Use markdown cells to label each phase clearly
βοΈ Save query text in markdown cell or .sql
file for versioning
π§Ή 4. SQL View + Python Transform Pattern¶
Use SQL to stage a view and Python to finish the analysis.
SQL View Example:¶
-- Create a view in BigQuery or DB
CREATE OR REPLACE VIEW cohort_summary AS
SELECT cohort_month, days_since_signup, COUNT(*) AS retained_users
FROM raw.events
GROUP BY 1, 2;
Python:¶
df = client.query("SELECT * FROM cohort_summary").to_dataframe()
sns.lineplot(data=df, x="days_since_signup", y="retained_users", hue="cohort_month")
π οΈ 5. Best Practices¶
- Use parameterized queries to dynamically fetch different timeframes
- Store SQL queries in separate
.sql
files or markdown cells - Version-control final notebooks, but treat exploration as flexible
- Use
.query()
or.loc[]
with readable filters (df[df["country"] == "US"]
)
β SQL + Python Workflow Checklist¶
- [ ] Data pulled from SQL using tested/staged view or saved query
- [ ] Notebook includes import, structure, and EDA cells
- [ ] SQL only handles structural prep (joins, filters, grouping)
- [ ] Python handles flexible EDA, feature work, plotting
- [ ] Comments and markdown explain workflow clearly
π‘ Tip¶
βWrite SQL like a database engineer. Explore like a scientist. Ship notebooks like a product analyst.β