ETL
What is ETL?
ETL (Extract, Transform, Load) is a data integration process that combines data from multiple sources into a single, consistent data store. It’s the backbone of data warehousing and analytics. The three phases work together to ensure raw, messy data becomes analysis-ready information.
ETL sits at the heart of most data pipelines: it is the process that turns raw records into analysis-ready insight through these three steps. Without it, you're left with scattered, inconsistent data that no one can trust or analyze.
The Three Phases
┌───────────────────┬───────────────────┬───────────────────┐
│      EXTRACT      │     TRANSFORM     │       LOAD        │
│                   │                   │                   │
│     ┌─────┐       │     ┌─────┐       │     ┌─────┐       │
│     │Raw  │   →   │     │Clean│   →   │     │Data │       │
│     │Data │       │     │Data │       │     │Lake │       │
│     └─────┘       │     └─────┘       │     └─────┘       │
│                   │                   │                   │
│  Sources:         │  Operations:      │  Destinations:    │
│  • Web scrapes    │  • Normalize      │  • Data warehouse │
│  • APIs           │  • Dedupe         │  • Database       │
│  • Databases      │  • Validate       │  • Cloud storage  │
│  • Files          │  • Enrich         │  • BI tools       │
└───────────────────┴───────────────────┴───────────────────┘
ETL Implementation Example
import re

import pandas as pd
import requests
from sqlalchemy import create_engine

# Example helpers (the real logic is project-specific)
def normalize_price(value):
    # Strip currency symbols and thousands separators, e.g. "$1,299.00" -> 1299.0
    return float(re.sub(r'[^\d.]', '', str(value)))

def extract_category(url):
    # Assume URLs look like https://example.com/<category>/<product-slug>
    parts = [p for p in url.split('/') if p]
    return parts[-2] if len(parts) >= 2 else 'unknown'

# EXTRACT
def extract_from_scraper(output_file):
    return pd.read_json(output_file, lines=True)

def extract_from_api(endpoint):
    response = requests.get(endpoint, timeout=30)
    response.raise_for_status()
    return pd.json_normalize(response.json())

# TRANSFORM
def transform_data(df):
    # Clean
    df = df.drop_duplicates()
    df = df.dropna(subset=['url', 'price'])
    # Normalize
    df['price'] = df['price'].apply(normalize_price)
    df['date'] = pd.to_datetime(df['date'])
    # Enrich
    df['category'] = df['url'].apply(extract_category)
    return df

# LOAD
def load_to_database(df, table_name, connection_string):
    engine = create_engine(connection_string)
    df.to_sql(table_name, engine, if_exists='append', index=False)

# Pipeline
raw = extract_from_scraper('products.json')
clean = transform_data(raw)
load_to_database(clean, 'products', 'postgresql://...')
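The diagram above also lists validation among the transform operations, which the example skips. Below is a minimal sketch of what that step could look like, assuming the same columns as the pipeline above (url, price, date); the specific rules are illustrative, not part of the original pipeline.

def validate_data(df):
    # Keep rows with a positive price, an http(s) URL, and a parseable date.
    # A real pipeline would usually log or quarantine the rejects rather than drop them silently.
    valid = (
        (df['price'] > 0)
        & df['url'].str.startswith('http')
        & df['date'].notna()
    )
    print(f"Dropping {int((~valid).sum())} invalid rows")
    return df[valid]

# Slots in between transform and load:
# clean = validate_data(transform_data(raw))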
Modern ETL: ELT vs ETL
| Approach | Order | Use Case |
|---|---|---|
| ETL | Extract → Transform → Load | Traditional data warehouses |
| ELT | Extract → Load → Transform | Cloud data lakes (Snowflake, BigQuery) |
Pro tip: For modern cloud analytics, ELT is often better. Load raw data to your warehouse first, then transform using SQL. This preserves the original data for reprocessing and lets you iterate on transformations without re-scraping.
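To make the tip concrete, here is a rough ELT sketch in the same pandas/SQLAlchemy style as the example above: land the raw scrape untouched, then express the transformation as SQL inside the warehouse. The table names (raw_products, products) are placeholders, and the SQL uses plain PostgreSQL syntax rather than any particular warehouse's dialect.

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://...')

# LOAD first: keep the raw records exactly as scraped so they can be reprocessed later
raw = pd.read_json('products.json', lines=True)
raw.to_sql('raw_products', engine, if_exists='append', index=False)

# TRANSFORM inside the warehouse with SQL
with engine.begin() as conn:
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS products AS
        SELECT DISTINCT
            url,
            CAST(regexp_replace(CAST(price AS text), '[^0-9.]', '', 'g') AS numeric) AS price,
            CAST(date AS timestamp) AS date
        FROM raw_products
        WHERE url IS NOT NULL AND price IS NOT NULL
    """))

Because the raw table is never modified, you can re-run or revise the SQL transformation at any time without re-scraping the source.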