ETL
What is ETL?
ETL (Extract, Transform, Load) is a data integration process that combines data from multiple sources into a single, consistent data store. It’s the backbone of data warehousing and analytics. The three phases work together to ensure raw, messy data becomes analysis-ready information.
ETL sits at the heart of most data pipelines: it is the process that turns raw records into analysis-ready insight through these three steps. Without it, you're left with scattered, inconsistent data that no one can trust or analyze.
The Three Phases
┌───────────────────┬───────────────────┬───────────────────┐
│      EXTRACT      │     TRANSFORM     │       LOAD        │
│                   │                   │                   │
│     ┌─────┐       │     ┌─────┐       │     ┌─────┐       │
│     │Raw  │   →   │     │Clean│   →   │     │Data │       │
│     │Data │       │     │Data │       │     │Lake │       │
│     └─────┘       │     └─────┘       │     └─────┘       │
│                   │                   │                   │
│  Sources:         │  Operations:      │  Destinations:    │
│  • Web scrapes    │  • Normalize      │  • Data warehouse │
│  • APIs           │  • Dedupe         │  • Database       │
│  • Databases      │  • Validate       │  • Cloud storage  │
│  • Files          │  • Enrich         │  • BI tools       │
└───────────────────┴───────────────────┴───────────────────┘
ETL Implementation Example
import re

import pandas as pd
import requests
from sqlalchemy import create_engine

# Example helpers (the real logic is project-specific)
def normalize_price(value):
    # Strip currency symbols and thousands separators, e.g. "$1,299.00" -> 1299.0
    return float(re.sub(r'[^\d.]', '', str(value)))

def extract_category(url):
    # Assume URLs look like https://example.com/<category>/<product-slug>
    parts = [p for p in url.split('/') if p]
    return parts[-2] if len(parts) >= 2 else 'unknown'

# EXTRACT
def extract_from_scraper(output_file):
    return pd.read_json(output_file, lines=True)

def extract_from_api(endpoint):
    response = requests.get(endpoint, timeout=30)
    response.raise_for_status()
    return pd.json_normalize(response.json())

# TRANSFORM
def transform_data(df):
    # Clean
    df = df.drop_duplicates()
    df = df.dropna(subset=['url', 'price'])
    # Normalize
    df['price'] = df['price'].apply(normalize_price)
    df['date'] = pd.to_datetime(df['date'])
    # Enrich
    df['category'] = df['url'].apply(extract_category)
    return df

# LOAD
def load_to_database(df, table_name, connection_string):
    engine = create_engine(connection_string)
    df.to_sql(table_name, engine, if_exists='append', index=False)

# Pipeline
raw = extract_from_scraper('products.json')
clean = transform_data(raw)
load_to_database(clean, 'products', 'postgresql://...')
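The diagram above also lists validation among the transform operations, which the example skips. Below is a minimal sketch of what that step could look like, assuming the same columns as the pipeline above (url, price, date); the specific rules are illustrative, not part of the original pipeline.

def validate_data(df):
    # Keep rows with a positive price, an http(s) URL, and a parseable date.
    # A real pipeline would usually log or quarantine the rejects rather than drop them silently.
    valid = (
        (df['price'] > 0)
        & df['url'].str.startswith('http')
        & df['date'].notna()
    )
    print(f"Dropping {int((~valid).sum())} invalid rows")
    return df[valid]

# Slots in between transform and load:
# clean = validate_data(transform_data(raw))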
Modern ETL: ELT vs ETL
| Approach | Order | Use Case |
|---|---|---|
| ETL | Extract → Transform → Load | Traditional data warehouses |
| ELT | Extract → Load → Transform | Cloud data lakes (Snowflake, BigQuery) |
Pro tip: For modern cloud analytics, ELT is often better. Load raw data to your warehouse first, then transform using SQL. This preserves the original data for reprocessing and lets you iterate on transformations without re-scraping.
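To make the tip concrete, here is a rough ELT sketch in the same pandas/SQLAlchemy style as the example above: land the raw scrape untouched, then express the transformation as SQL inside the warehouse. The table names (raw_products, products) are placeholders, and the SQL uses plain PostgreSQL syntax rather than any particular warehouse's dialect.

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://...')

# LOAD first: keep the raw records exactly as scraped so they can be reprocessed later
raw = pd.read_json('products.json', lines=True)
raw.to_sql('raw_products', engine, if_exists='append', index=False)

# TRANSFORM inside the warehouse with SQL
with engine.begin() as conn:
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS products AS
        SELECT DISTINCT
            url,
            CAST(regexp_replace(CAST(price AS text), '[^0-9.]', '', 'g') AS numeric) AS price,
            CAST(date AS timestamp) AS date
        FROM raw_products
        WHERE url IS NOT NULL AND price IS NOT NULL
    """))

Because the raw table is never modified, you can re-run or revise the SQL transformation at any time without re-scraping the source.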