Deduplication
What is Deduplication?
Deduplication is the process of identifying and removing duplicate or near-duplicate records from a dataset. In web scraping, this means handling cases where the same product, lead, or data point appears multiple times due to pagination, site mirrors, or data updates.
Why does it matter? Counting the same product 50 times isn't useful data. Clean data means accurate analytics, better ML models, and smarter business decisions.
Types of Duplicates
| Type | Example | Detection Method |
|---|---|---|
| Exact | Same URL, same timestamp | Hash-based |
| Near-exact | Same URL, different timestamp | Fuzzy matching |
| Semantic | Same product, different name | ML embeddings |
| Derived | Same product, different variant | Business rules |
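Exact and near-exact duplicates are handled by the hashing strategies below; semantic duplicates need an embedding comparison, as the table notes. Here is a minimal sketch, assuming the `sentence-transformers` package — the model name and the 0.9 cosine-similarity cutoff are illustrative placeholders to tune for your own catalogue:

```python
from sentence_transformers import SentenceTransformer, util

# Model name and similarity cutoff are illustrative -- tune for your data
model = SentenceTransformer('all-MiniLM-L6-v2')

def is_semantic_duplicate(name_a, name_b, cutoff=0.9):
    # Embed both product names and compare cosine similarity against the cutoff
    emb = model.encode([name_a, name_b], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= cutoff
```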
Deduplication Strategies
1. URL-Based (Primary Key)
```python
seen_urls = set()
results = []

for product in products:
    url = product['canonical_url']
    if url in seen_urls:
        continue  # Skip duplicate
    seen_urls.add(url)
    results.append(product)
```
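The snippet above assumes `canonical_url` has already been normalized. If your scraper emits raw URLs, a small normalization helper can keep equivalent URLs from slipping past the set lookup — the `normalize_url` function and its rules below are an illustrative sketch, not a universal recipe:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    # Illustrative rules only -- tune per site (some sites need query params kept)
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.rstrip('/'), '', ''))

# Both forms map to the same key before the set lookup
normalize_url('HTTPS://Shop.example.com/item/42/?utm_source=mail')
normalize_url('https://shop.example.com/item/42')
```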
2. Content Hashing (Fuzzy Match)
```python
import hashlib
import json

def content_hash(item):
    # Create a hash of the normalized (key-sorted) content
    normalized = json.dumps(item, sort_keys=True)
    return hashlib.md5(normalized.encode()).hexdigest()

# Detect duplicates by content hash
seen_hashes = set()
results = []
for product in products:
    h = content_hash(product)
    if h in seen_hashes:
        continue
    seen_hashes.add(h)
    results.append(product)
```
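Hashing the full record treats a changed timestamp as a brand-new item, which misses the near-exact case from the table above. A common adjustment, sketched here with a hypothetical set of volatile field names and reusing the `json`/`hashlib` imports above, is to drop scrape metadata before hashing:

```python
# Assumed volatile field names -- adjust to your schema
VOLATILE_FIELDS = {'scraped_at', 'timestamp', 'last_seen'}

def stable_content_hash(item):
    # Hash only the fields that identify the record, not the scrape metadata
    stable = {k: v for k, v in item.items() if k not in VOLATILE_FIELDS}
    normalized = json.dumps(stable, sort_keys=True)
    return hashlib.md5(normalized.encode()).hexdigest()
```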
3. Simhash for Large Datasets
```python
# Simhash finds similar documents efficiently
from simhash import Simhash

def dedupe_similar(items, threshold=3):
    seen = []     # Simhash fingerprints kept so far
    results = []  # Items that survived deduplication
    for item in items:
        h = Simhash(item['content'].split())
        # Hamming distance below the threshold => treat as a near-duplicate
        if any(h.distance(s) < threshold for s in seen):
            continue
        seen.append(h)
        results.append(item)
    return results
```
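As a quick sanity check, the function can be exercised on a few hypothetical records; the threshold controls how many differing fingerprint bits still count as "the same" item:

```python
# Hypothetical records -- the second is the first with its words reordered
items = [
    {'content': 'Acme wireless mouse 2.4 GHz black'},
    {'content': 'Acme black wireless mouse 2.4 GHz'},
    {'content': 'Acme mechanical keyboard RGB backlight'},
]
unique = dedupe_similar(items, threshold=3)
# Expect 2: the reordered listing has the same word set, so its fingerprint matches
print(len(unique))
```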
Production Deduplication Pipeline
```python
# Multi-layer deduplication
def dedupe_pipeline(records):
    # Layer 1: Exact URL dedupe
    seen_urls = set()
    deduped = []
    for r in records:
        if r['url'] not in seen_urls:
            seen_urls.add(r['url'])
            deduped.append(r)

    # Layer 2: Normalized content dedupe
    seen_hashes = set()
    final = []
    for r in deduped:
        h = content_hash(r)
        if h not in seen_hashes:
            seen_hashes.add(h)
            final.append(r)

    return final
```
Benchmark tip: For millions of records, use Redis sets for real-time dedupe or Apache Spark for batch processing. Dedupe early in your pipeline to save storage and processing costs downstream.
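For the real-time case, a minimal Redis sketch might look like the following. It assumes the `redis` Python client and a locally running instance, and the key name `seen:urls` is a placeholder; it relies on `SADD` returning 1 only when the member was not already in the set:

```python
import redis

r = redis.Redis(host='localhost', port=6379)  # assumes a local Redis instance

def is_new(url, key='seen:urls'):
    # SADD returns 1 when the member was added (i.e. not seen before), 0 otherwise
    return r.sadd(key, url) == 1

for product in products:
    if is_new(product['canonical_url']):
        results.append(product)
```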