Deduplication Embedding

GitHub Pages Learning Site

De-duplication is not a one-shot model. It is a retrieval system that keeps learning from its mistakes.

This project teaches a realistic duplicate-question pipeline on Quora Question Pairs: prepare data, build a corpus, generate embeddings, search with HNSW, threshold similarity, evaluate failure modes, mine hard negatives, and retrain.

Dataset: Quora Question Pairs
Default corpus: 40,000 questions
Best run: Contrastive + active learning

Mental Model

The end-to-end pipeline

The repo is intentionally small, but every stage matches a production concern: data quality, semantic retrieval, decision thresholds, offline evaluation, and continuous improvement.

01

Prepare data

Download Quora pairs, normalize text, remove invalid rows, create train and validation splits, and build a unique question corpus.

02

Embed questions

Start with all-MiniLM-L6-v2, then optionally fine-tune for pair classification or retrieval-oriented training.

03

Index locally

Use hnswlib for ANN search, or optionally FAISS for exact inner-product search over normalized embeddings.

04

Search candidates

Embed a query, search top-k neighbors, and recompute cosine scores for readable ranked results.
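A minimal numpy sketch of this step, assuming the corpus embeddings are precomputed and L2-normalized so a dot product equals cosine similarity. The function name is illustrative, not the repo's API:

```python
import numpy as np

def search_candidates(query_vec, corpus_embs, top_k=5):
    """Rank corpus questions against a query by cosine similarity.

    Assumes the query vector and corpus embeddings are L2-normalized,
    so the dot product is the cosine score directly.
    """
    scores = corpus_embs @ query_vec          # cosine scores, shape (n_corpus,)
    top = np.argsort(-scores)[:top_k]         # indices of the best matches
    return [(int(i), float(scores[i])) for i in top]

# Toy corpus: three unit vectors in 2-D.
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.7071, 0.7071]])
query = np.array([1.0, 0.0])
print(search_candidates(query, corpus, top_k=2))
```

In the real pipeline the `argsort` over the full corpus is replaced by the ANN index; the cosine recompute on the returned candidates stays the same.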

05

Decide duplicate or not

Use a similarity threshold to turn retrieval outputs into a binary duplicate decision.
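The decision layer can be as small as one comparison. A sketch, where 0.77 mirrors the best validation cutoff reported later in this page and should be treated as a placeholder until you run your own sweep:

```python
def decide_duplicate(best_score, threshold=0.77):
    """Turn the top retrieval cosine score into a binary duplicate decision.

    threshold=0.77 echoes this repo's reported best validation cutoff;
    in practice it must come from your own threshold sweep.
    """
    return best_score >= threshold

print(decide_duplicate(0.83))  # above the cutoff: duplicate
print(decide_duplicate(0.52))  # retrieved, but not close enough: not a duplicate
```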

06

Measure the right things

Evaluate retrieval with Recall@1, Recall@5, and MRR. Evaluate classification with precision, recall, F1, and threshold sweeps.
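The retrieval metrics are simple to state in code. A self-contained sketch (function names are illustrative):

```python
def recall_at_k(ranked_ids, gold_id, k):
    """1.0 if the gold duplicate appears in the top-k ranked candidates."""
    return float(gold_id in ranked_ids[:k])

def mrr(all_ranked, all_gold):
    """Mean reciprocal rank over queries; a query contributes 0 if its
    gold item is missing from the ranked list entirely."""
    total = 0.0
    for ranked_ids, gold_id in zip(all_ranked, all_gold):
        if gold_id in ranked_ids:
            total += 1.0 / (ranked_ids.index(gold_id) + 1)
    return total / len(all_gold)

# Two toy queries: gold id 7 at rank 1, then at rank 3.
ranked = [[7, 2, 9], [4, 5, 7]]
gold = [7, 7]
print(recall_at_k(ranked[0], 7, 1))   # 1.0
print(mrr(ranked, gold))              # (1/1 + 1/3) / 2 ≈ 0.667
```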

07

Mine better training signal

Export false positives, false negatives, and hard negatives so the encoder improves on the mistakes the current system actually makes.
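Hard-negative mining reduces to "closest neighbors that are not labeled duplicates". A hedged sketch, assuming normalized embeddings and a simplified label format (`duplicate_ids[i]` as a set of true-duplicate corpus indices, which is not necessarily how the repo stores pair labels):

```python
import numpy as np

def mine_hard_negatives(query_embs, corpus_embs, duplicate_ids, top_k=5):
    """For each query, keep the closest corpus items NOT labeled duplicates.

    These near-misses are exactly the confusions the encoder should be
    retrained on. Embeddings are assumed L2-normalized, so dot products
    are cosine similarities.
    """
    scores = query_embs @ corpus_embs.T
    hard = []
    for i, row in enumerate(scores):
        ranked = np.argsort(-row)                        # most similar first
        negs = [int(j) for j in ranked if j not in duplicate_ids[i]]
        hard.append(negs[:top_k])                        # export for retraining
    return hard

corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.7071, 0.7071]])
queries = np.array([[1.0, 0.0]])
print(mine_hard_negatives(queries, corpus, [{0}], top_k=2))  # [[2, 1]]
```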

Constant Improvement

The system improves through feedback, not just bigger models

This repo is designed to teach that de-duplication quality is driven by the quality of the failure analysis loop.

  1. Run the current model: retrieve neighbors and classify pairs.
  2. Collect failures: false positives and false negatives show where the decision boundary is wrong.
  3. Mine hard negatives: nearest non-duplicates expose retrieval confusion the model currently cannot separate.
  4. Augment training data: fold difficult examples back into the next training cycle.
  5. Retrain the encoder: use contrastive, MNRL, or triplet objectives depending on the target behavior.
  6. Rebuild the index: new embeddings require a new ANN index before search improves.
  7. Compare metrics again: if offline metrics do not improve, the new objective is not yet paying off.

What the experiments showed

The strongest result in this project came from contrastive training plus active learning, not from the most specialized retrieval loss. That is a useful lesson: better-shaped supervision beats more complicated training when the feedback loop is aligned with the system’s real errors.

Mermaid view of the active-learning loop

```mermaid
flowchart LR
    A["Validation or live queries"] --> B["Embed with current encoder"]
    B --> C["Retrieve top-k neighbors"]
    C --> D["Apply cosine threshold"]
    D --> E["Compare with labels or reviewer feedback"]
    E --> F["False positives"]
    E --> G["False negatives"]
    C --> H["Mine close non-duplicate neighbors"]
    F --> I["Feedback pool"]
    G --> I
    H --> I
    I --> J["Augmented training file"]
    J --> K["Contrastive or retrieval-oriented retrain"]
    K --> L["Re-embed corpus"]
    L --> M["Rebuild ANN index"]
    M --> N["Re-evaluate Recall@k, MRR, Precision, Recall, F1"]
    N -->|improved| B
    N -->|not improved| I
```

Retriever Comparison

Different retrieval mechanisms change what kinds of duplicates you can find

The repo now benchmarks dense, lexical, and hybrid retrieval on the same validation setup. This matters because duplicate detection quality is limited by candidate generation before thresholding even starts.

| Method | Signal | Backend | Recall@1 | Recall@5 | MRR | What it teaches |
| --- | --- | --- | --- | --- | --- | --- |
| Hybrid dense + char TF-IDF | Semantic + lexical | Score fusion | 0.6213 | 0.8353 | 0.7056 | Best overall candidate generator in this repo. |
| Dense HNSW | Semantic | hnswlib ANN | 0.6046 | 0.8267 | 0.6911 | Fast local ANN baseline with strong retrieval quality. |
| Dense exact | Semantic | Brute-force dot product | 0.6046 | 0.8267 | 0.6911 | Shows the embedding model, not the ANN, is the main bottleneck here. |
| Dense FAISS | Semantic | FAISS IndexFlatIP | 0.6046 | 0.8267 | 0.6911 | Same quality as exact dense search at this scale. |
| TF-IDF char n-gram | Lexical | scikit-learn sparse retrieval | 0.5553 | 0.7736 | 0.6382 | Much better than word TF-IDF for spelling and phrase overlap. |
| TF-IDF word n-gram | Lexical | scikit-learn sparse retrieval | 0.4540 | 0.6679 | 0.5345 | Weakest retriever for paraphrastic duplicates. |

Why HNSW, exact, and FAISS tie here

The corpus is only 40k questions and the index settings are strong enough that the ANN approximation error is negligible.

Why character TF-IDF helps

Near-duplicates often share substrings, morphology, or spelling variants that a semantic encoder and a word tokenizer can both partially miss.
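A toy illustration of that point, using plain Python sets instead of a real TF-IDF vectorizer: a spelling variant shares almost no word tokens with the original, but most character trigrams still overlap.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams, the unit a char TF-IDF vectorizer scores."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Set-overlap ratio: a crude stand-in for a TF-IDF cosine score."""
    return len(a & b) / len(a | b)

q1 = "how do i learn tensorflow"
q2 = "how do i learn tensor flow"   # spelling variant of the same question

# Word tokens treat "tensorflow" and "tensor flow" as different vocabulary,
# while most character trigrams are shared between the two strings.
word_overlap = jaccard(set(q1.split()), set(q2.split()))
char_overlap = jaccard(char_ngrams(q1), char_ngrams(q2))
print(word_overlap, char_overlap)
```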

Why the hybrid wins

Dense retrieval finds paraphrases, while character n-grams recover lexical closeness. Score fusion combines both without training a new model.
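One simple fusion recipe, sketched with numpy: min-max normalize each score list onto [0, 1] so the two signals are comparable, then take a weighted sum. The mixing weight `alpha` is a hypothetical parameter tuned on validation data; the repo's exact fusion rule may differ.

```python
import numpy as np

def minmax(x):
    """Scale scores to [0, 1] so dense and lexical scores are comparable."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def fuse(dense_scores, lexical_scores, alpha=0.5):
    """Weighted sum of normalized dense and char-TF-IDF candidate scores."""
    return alpha * minmax(dense_scores) + (1 - alpha) * minmax(lexical_scores)

dense = [0.92, 0.40, 0.85]     # encoder cosine scores for three candidates
lexical = [0.10, 0.80, 0.75]   # char TF-IDF scores for the same candidates
print(fuse(dense, lexical))    # candidate 2, strong on both signals, ranks first
```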

Contrastive Loss Math

The contrastive objective is shaping geometry, not directly choosing the duplicate threshold

The best training recipe in this repo uses sentence-transformers ContrastiveLoss with cosine distance and margin 0.5. The math explains why it pairs well with hard negatives and active learning.

1. Embed both questions

For a pair (q1, q2), the encoder produces two vectors:

e1 = f(q1)
e2 = f(q2)

The loss then measures how close or far those vectors are in embedding space.

2. Convert similarity into distance

The library uses cosine distance, derived from cosine similarity:

s(q1, q2) = (e1 dot e2) / (||e1|| * ||e2||)
d(q1, q2) = 1 - s(q1, q2)

Duplicates should have small d. Non-duplicates should have larger d.

3. Positive pairs are pulled together

For duplicate labels y = 1, the loss becomes:

L_pos = 0.5 * d^2

This means the closer a duplicate pair already is, the smaller the gradient becomes. The model focuses hardest on positives that are still far apart.

4. Negative pairs use a hinge margin

For non-duplicate labels y = 0, the loss becomes:

L_neg = 0.5 * max(0, m - d)^2

If a negative pair is already farther apart than the margin m, it contributes zero loss. Only hard negatives inside the margin keep pushing gradients.

The full loss used in this repo

L = 0.5 * [ y * d^2 + (1 - y) * max(0, m - d)^2 ]

With m = 0.5, a negative pair stops being penalized once its cosine distance reaches 0.5, which is equivalent to cosine similarity falling to 0.5 or below.
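The formula above is small enough to check numerically. A numpy sketch of the same loss (mirroring the math, not the sentence-transformers implementation itself):

```python
import numpy as np

def contrastive_loss(d, y, margin=0.5):
    """Contrastive loss on cosine distance d for label y (1 = duplicate).

    Implements L = 0.5 * [ y * d^2 + (1 - y) * max(0, m - d)^2 ].
    """
    d = np.asarray(d, dtype=float)
    y = np.asarray(y, dtype=float)
    return 0.5 * (y * d**2 + (1 - y) * np.maximum(0.0, margin - d)**2)

# A close duplicate, a hard negative inside the margin, an easy negative.
print(contrastive_loss(0.1, 1))   # ≈ 0.005: small pull on an already-close pair
print(contrastive_loss(0.2, 0))   # ≈ 0.045: hard negative inside the margin
print(contrastive_loss(0.6, 0))   # 0.0: beyond the margin, no gradient
```

The third case is the hinge behavior described above: once a negative pair's distance exceeds the margin, it stops contributing loss entirely.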

Why the margin is not the serving threshold

The margin is a training rule about embedding geometry. The serving threshold is chosen later by sweeping validation similarities and selecting the cutoff that gives the best precision-recall tradeoff. In this repo, the best operational threshold ended up around 0.77, well above the training margin boundary.

That is normal. Training says "push hard negatives outside the active margin." Evaluation says "given the whole learned space, where should the classifier cut for best F1?"
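A threshold sweep is a simple grid search over candidate cutoffs, scored by F1 on validation pairs. A self-contained sketch (the repo's evaluation script may differ in details):

```python
import numpy as np

def sweep_thresholds(sims, labels, thresholds=None):
    """Pick the cosine-similarity cutoff that maximizes F1 on validation pairs."""
    sims = np.asarray(sims, dtype=float)
    labels = np.asarray(labels, dtype=int)
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)
    best_t, best_f1 = 0.0, -1.0
    for t in thresholds:
        preds = (sims >= t).astype(int)
        tp = int(((preds == 1) & (labels == 1)).sum())
        fp = int(((preds == 1) & (labels == 0)).sum())
        fn = int(((preds == 0) & (labels == 1)).sum())
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1

# Toy validation set: duplicates score high, non-duplicates score low.
print(sweep_thresholds([0.9, 0.8, 0.6, 0.3], [1, 1, 0, 0]))
```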

Experiment Timeline

What changed across iterations

These are measured local runs on the default 30k train / 5k validation / 40k corpus setup. The goal is to show how a de-duplication system changes as you improve supervision.

| Model | Avg Precision | F1 | Recall@1 | Recall@5 | Key Lesson |
| --- | --- | --- | --- | --- | --- |
| Baseline MiniLM | 0.7628 | 0.7352 | 0.6046 | 0.8267 | Strong retrieval baseline with simple thresholding. |
| MNRL fine-tune | 0.7508 | 0.7282 | 0.6116 | 0.8223 | Best dense retriever among the single-loss runs, but weaker duplicate classification. |
| Contrastive fine-tune | 0.8465 | 0.8060 | 0.5997 | 0.8131 | Better pair classification, slightly weaker retrieval. |
| Contrastive + active learning | 0.8647 | 0.8359 | 0.6051 | 0.8250 | Best overall tradeoff in this repo. |
| MNRL + active learning | 0.7601 | 0.7361 | 0.6105 | 0.8250 | Recovered retrieval after feedback mining, but still lagged badly on classification. |

What actually improved

Active learning increased precision and recall together because the new training examples matched the system’s real confusion set.

What MNRL was best at

MNRL delivered the strongest retrieval of the dense training losses, with Recall@1 of 0.6116 and MRR of 0.6956, but it produced too many false positives for final duplicate classification.

What MNRL active learning changed

The MNRL active-learning retrain brought Recall@5 back to 0.8250, but because MNRL trains only on positive pairs in this repo, it could not use the mined negative feedback as directly as contrastive loss.

Why that matters

In de-duplication, retrieval and classification are not the same objective. MNRL helped ranking, while contrastive helped the final duplicate decision much more.

Best retriever from the comparison run

A simple dense plus character-TF-IDF hybrid reached Recall@1 of 0.6213 and Recall@5 of 0.8353, beating pure dense retrieval on the same corpus.

Learning Material

A practical study path for de-duplication systems

Use the codebase as the lab and this sequence as the teaching order.

Lesson 1: Reframe the problem

Duplicate detection is not only a classifier. It is usually retrieval first, then scoring or reranking.

Lesson 2: Clean data beats clever code

Corpus construction, question normalization, and train-validation splits determine whether metrics mean anything.

Lesson 3: Embeddings are the retrieval backbone

Sentence-transformers give a small, practical baseline that runs locally and supports both search and training.

Lesson 4: ANN is an engineering tradeoff

HNSW lets you search large corpora quickly, but retrieval quality depends on both index settings and the embedding geometry.

Lesson 5: Thresholds are part of the model

A threshold sweep is not an afterthought. It is the operational policy layer for duplicate vs non-duplicate decisions.

Lesson 6: Measure retrieval and classification separately

Recall@k and MRR tell you if the right candidate is retrieved. Precision, recall, and F1 tell you if the decision layer behaves well.

Lesson 7: Improvement comes from the failure loop

False positives, false negatives, and hard negatives are not side outputs. They are the next version of the training set.

Lesson 8: Not every new objective wins

The aligned triplet experiment improved training-data alignment but still lost to contrastive plus active learning. That is a real systems lesson.

Hands-On Lab

Run the full loop yourself

Baseline

```bash
python src/prepare_data.py
python src/train_encoder.py --mode embed
python src/build_index.py
python src/evaluate.py
python src/search.py --query "How can I learn Python fast?" --top_k 5
```

Best current workflow

```bash
python src/train_encoder.py \
  --mode finetune \
  --loss_type contrastive \
  --contrastive_margin 0.5 \
  --epochs 1 \
  --train_batch_size 32 \
  --output_model_dir models/quora_miniLM_contrastive \
  --embed_after_training \
  --output_embeddings models/corpus_embeddings_contrastive.npy \
  --embedding_metadata_file models/embedding_metadata_contrastive.json
```

Teach the hard-negative loop

```bash
python src/active_learning.py \
  --model_name_or_path models/quora_miniLM_contrastive \
  --embeddings_file models/corpus_embeddings_contrastive.npy \
  --index_file indices/quora_hnsw_contrastive.index \
  --index_metadata_file indices/quora_hnsw_contrastive_metadata.json \
  --threshold_summary_file models/evaluation_summary_contrastive.json
```

Further Study

Curated learning material

These links help connect the toy project to the real libraries and concepts behind it.