Data Normalization
What is Data Normalization?
Data normalization is the process of transforming inconsistent, messy data from multiple sources into a clean, standardized format. This includes fixing encoding issues, standardizing units, parsing dates, handling currency symbols, and ensuring consistent data types across all records.
Raw scraped data is rarely ready for analysis or machine learning. Garbage in, garbage out: unless you normalize first, every downstream step inherits the mess.
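As a rough illustration of the text cleanup and unit standardization mentioned above, here is a minimal sketch using only the Python standard library. The function names `clean_text` and `normalize_weight` and the `GRAMS_PER_UNIT` table are assumptions made for this example, not part of any particular library. It covers Unicode normalization and whitespace only; repairing text that was decoded with the wrong codec is a separate problem.

```python
import re
import unicodedata


def clean_text(value):
    """Return text in a canonical Unicode form with tidy whitespace."""
    # NFKC folds compatibility characters: the non-breaking space U+00A0
    # becomes a plain space and full-width digits become ASCII digits.
    text = unicodedata.normalize("NFKC", str(value))
    # Collapse runs of whitespace (tabs, newlines, repeated spaces).
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical conversion table (grams per unit), weights only.
GRAMS_PER_UNIT = {"g": 1.0, "kg": 1000.0, "lb": 453.59237, "oz": 28.349523125}


def normalize_weight(value):
    """Convert strings such as "2.5 kg" or "12oz" to grams, or return None."""
    match = re.fullmatch(r"([\d.]+)\s*([a-zA-Z]+)", clean_text(value))
    if not match:
        return None
    unit = match.group(2).lower()
    if unit not in GRAMS_PER_UNIT:
        return None  # unknown unit: report it rather than guess
    try:
        return float(match.group(1)) * GRAMS_PER_UNIT[unit]
    except ValueError:
        return None  # e.g. "1.2.3 kg"
```

Returning `None` for anything unrecognized keeps bad values visible instead of silently converting them.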
Common Normalization Tasks
| Data Type | Problem | Solution |
|---|---|---|
| Prices | “$1,299.99”, “₹94,999”, “1.299,99 €” | Parse to float, normalize currency |
| Dates | “Jan 15, 2024”, “15/01/2024”, “2024-01-15” | ISO 8601 format |
| Phone | “+91 98765 43210”, “09876543210” | E.164 format |
| Names | “JOHN DOE”, “john doe”, “John DoE” | Proper case |
| URLs | Relative, absolute, messy params | Absolute URLs, cleaned |
Normalization Code Examples
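    import re
    from datetime import datetime


    def normalize_price(value):
        """Parse prices such as "$1,299.99" or "1.299,99 €" to a float."""
        if isinstance(value, (int, float)):
            return float(value)
        # Keep only digits and separators; currency symbols are dropped
        cleaned = re.sub(r'[^\d.,]', '', str(value)).strip('.,')
        last_comma = cleaned.rfind(',')
        if last_comma > cleaned.rfind('.') and len(cleaned) - last_comma - 1 in (1, 2):
            # A trailing ",99" is a decimal comma (European style)
            cleaned = cleaned.replace('.', '').replace(',', '.')
        else:
            # Otherwise commas are thousands separators
            cleaned = cleaned.replace(',', '')
        try:
            return float(cleaned)
        except ValueError:
            return None  # unparseable: do not pretend the price is 0.0


    def normalize_date(value):
        """Parse several common date formats and return an ISO 8601 date."""
        formats = [
            '%B %d, %Y',  # January 15, 2024
            '%b %d, %Y',  # Jan 15, 2024
            '%d/%m/%Y',   # 15/01/2024 (day first)
            '%Y-%m-%d',   # 2024-01-15
            '%d-%b-%Y',   # 15-Jan-2024
        ]
        text = str(value).strip()
        for fmt in formats:
            try:
                return datetime.strptime(text, fmt).date().isoformat()
            except ValueError:
                continue
        return None  # no known format matched


    def normalize_phone(value, country='IN'):
        """Format a phone number as E.164; only India ('IN') is handled."""
        digits = re.sub(r'\D', '', str(value))
        if country == 'IN':
            digits = digits.lstrip('0')  # drop trunk prefix: 09876543210
            if len(digits) == 12 and digits.startswith('91'):
                return f"+{digits}"      # country code already present
            if len(digits) == 10:
                return f"+91{digits}"
        return None  # cannot be formatted as E.164 with the rules above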
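The table above also lists names and URLs, which the functions so far do not cover. The following is a minimal sketch of both, assuming that "cleaned" URLs means resolving relative links, dropping the fragment, and removing tracking parameters. The names `normalize_name`, `normalize_url`, and the `TRACKING_PARAMS` set are illustrative choices for this example.

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

# Assumed list of tracking parameters to drop; adjust for your sources.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term",
                   "utm_content", "fbclid", "gclid"}


def normalize_name(value):
    """Collapse whitespace and capitalize each word: "JOHN  doe" -> "John Doe"."""
    return " ".join(word.capitalize() for word in str(value).split())


def normalize_url(value, base_url):
    """Resolve a link against base_url and strip tracking parameters."""
    absolute = urljoin(base_url, str(value).strip())
    parts = urlsplit(absolute)
    query = [(key, val)
             for key, val in parse_qsl(parts.query, keep_blank_values=True)
             if key.lower() not in TRACKING_PARAMS]
    # Lowercase scheme and host, drop the fragment, keep path case as-is.
    # Assumes the URL carries no user:password part in the host section.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, urlencode(query), ""))
```

For example, `normalize_url("/products/42?utm_source=mail&color=red#reviews", "https://Example.com/catalog/")` yields `https://example.com/products/42?color=red`. Simple word capitalization mishandles names such as "McDonald" or "O'Neil", so treat `normalize_name` as a starting point rather than a complete rule.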
Pro tip: Build a normalization library specific to your use case and document every transformation. When business logic changes, you’ll need to update your normalizers, and having them documented saves hours of debugging.
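One way to keep the documentation next to the code is a small registry that maps each field to its normalizer and a one-line description. This is only a sketch: the field names `stock` and `listed_on`, the helpers `to_int` and `to_iso_date`, and the `NORMALIZERS` registry are hypothetical and chosen for illustration.

```python
from datetime import datetime


def to_int(value):
    """Convert values like "1,204" to an integer."""
    return int(str(value).replace(",", "").strip())


def to_iso_date(value):
    """Convert a YYYY/MM/DD string to an ISO 8601 date."""
    return datetime.strptime(str(value).strip(), "%Y/%m/%d").date().isoformat()


# Hypothetical registry: field -> (normalizer, description of the transformation).
NORMALIZERS = {
    "stock": (to_int, "strip thousands separators, cast to int"),
    "listed_on": (to_iso_date, "YYYY/MM/DD -> ISO 8601 date"),
}


def normalize_record(record):
    """Apply registered normalizers; return (clean_record, errors)."""
    clean, errors = dict(record), {}
    for field, (func, _description) in NORMALIZERS.items():
        if field not in record:
            continue
        try:
            clean[field] = func(record[field])
        except (ValueError, TypeError) as exc:
            # Keep the raw value and report the failure instead of guessing.
            errors[field] = str(exc)
    return clean, errors
```

Calling `normalize_record({"sku": "A1", "stock": "1,204", "listed_on": "2024/01/15"})` returns the cleaned record plus an empty error dictionary. When a rule changes, the registry entry and its description are updated in one place.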