Deduplication
What is Deduplication?
Deduplication is the process of identifying and removing duplicate or near-duplicate records from a dataset. In web scraping, this means handling cases where the same product, lead, or data point appears multiple times due to pagination, site mirrors, or data updates.
Why does it matter? Counting the same product 50 times isn't useful data. Clean data means accurate analytics, better ML models, and smarter business decisions.
Types of Duplicates
| Type | Example | Detection Method |
|---|---|---|
| Exact | Same URL, same timestamp | Hash-based |
| Near-exact | Same URL, different timestamp | Fuzzy matching |
| Semantic | Same product, different name | ML embeddings |
| Derived | Same product, different variant | Business rules |
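Exact and near-exact duplicates are handled by the hashing strategies below; semantic duplicates need an embedding comparison, as the table notes. Here is a minimal sketch, assuming the `sentence-transformers` package — the model name and the 0.9 cosine-similarity cutoff are illustrative placeholders to tune for your own catalogue:

```python
from sentence_transformers import SentenceTransformer, util

# Model name and similarity cutoff are illustrative -- tune for your data
model = SentenceTransformer('all-MiniLM-L6-v2')

def is_semantic_duplicate(name_a, name_b, cutoff=0.9):
    # Embed both product names and compare cosine similarity against the cutoff
    emb = model.encode([name_a, name_b], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= cutoff
```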
Deduplication Strategies
1. URL-Based (Primary Key)
```python
seen_urls = set()
results = []

for product in products:
    url = product['canonical_url']
    if url in seen_urls:
        continue  # Skip duplicate
    seen_urls.add(url)
    results.append(product)
```
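The snippet above assumes `canonical_url` has already been normalized. If your scraper emits raw URLs, a small normalization helper can keep equivalent URLs from slipping past the set lookup — the `normalize_url` function and its rules below are an illustrative sketch, not a universal recipe:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    # Illustrative rules only -- tune per site (some sites need query params kept)
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.rstrip('/'), '', ''))

# Both forms map to the same key before the set lookup
normalize_url('HTTPS://Shop.example.com/item/42/?utm_source=mail')
normalize_url('https://shop.example.com/item/42')
```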
2. Content Hashing (Fuzzy Match)
```python
import hashlib
import json

def content_hash(item):
    # Create a hash of the normalized (key-sorted) content
    normalized = json.dumps(item, sort_keys=True)
    return hashlib.md5(normalized.encode()).hexdigest()

# Detect duplicates by content hash
seen_hashes = set()
results = []
for product in products:
    h = content_hash(product)
    if h in seen_hashes:
        continue
    seen_hashes.add(h)
    results.append(product)
```
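Hashing the full record treats a changed timestamp as a brand-new item, which misses the near-exact case from the table above. A common adjustment, sketched here with a hypothetical set of volatile field names and reusing the `json`/`hashlib` imports above, is to drop scrape metadata before hashing:

```python
# Assumed volatile field names -- adjust to your schema
VOLATILE_FIELDS = {'scraped_at', 'timestamp', 'last_seen'}

def stable_content_hash(item):
    # Hash only the fields that identify the record, not the scrape metadata
    stable = {k: v for k, v in item.items() if k not in VOLATILE_FIELDS}
    normalized = json.dumps(stable, sort_keys=True)
    return hashlib.md5(normalized.encode()).hexdigest()
```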
3. Simhash for Large Datasets
```python
# Simhash finds similar documents efficiently
from simhash import Simhash

def dedupe_similar(items, threshold=3):
    seen = []     # Simhash fingerprints kept so far
    results = []  # Items that survived deduplication
    for item in items:
        h = Simhash(item['content'].split())
        # Hamming distance below the threshold => treat as a near-duplicate
        if any(h.distance(s) < threshold for s in seen):
            continue
        seen.append(h)
        results.append(item)
    return results
```
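As a quick sanity check, the function can be exercised on a few hypothetical records; the threshold controls how many differing fingerprint bits still count as "the same" item:

```python
# Hypothetical records -- the second is the first with its words reordered
items = [
    {'content': 'Acme wireless mouse 2.4 GHz black'},
    {'content': 'Acme black wireless mouse 2.4 GHz'},
    {'content': 'Acme mechanical keyboard RGB backlight'},
]
unique = dedupe_similar(items, threshold=3)
# Expect 2: the reordered listing has the same word set, so its fingerprint matches
print(len(unique))
```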
Production Deduplication Pipeline
```python
# Multi-layer deduplication
def dedupe_pipeline(records):
    # Layer 1: Exact URL dedupe
    seen_urls = set()
    deduped = []
    for r in records:
        if r['url'] not in seen_urls:
            seen_urls.add(r['url'])
            deduped.append(r)

    # Layer 2: Normalized content dedupe
    seen_hashes = set()
    final = []
    for r in deduped:
        h = content_hash(r)
        if h not in seen_hashes:
            seen_hashes.add(h)
            final.append(r)

    return final
```
Benchmark tip: For millions of records, use Redis sets for real-time dedupe or Apache Spark for batch processing. Dedupe early in your pipeline to save storage and processing costs downstream.
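For the real-time case, a minimal Redis sketch might look like the following. It assumes the `redis` Python client and a locally running instance, and the key name `seen:urls` is a placeholder; it relies on `SADD` returning 1 only when the member was not already in the set:

```python
import redis

r = redis.Redis(host='localhost', port=6379)  # assumes a local Redis instance

def is_new(url, key='seen:urls'):
    # SADD returns 1 when the member was added (i.e. not seen before), 0 otherwise
    return r.sadd(key, url) == 1

for product in products:
    if is_new(product['canonical_url']):
        results.append(product)
```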