GitHub Pages Learning Site
This project teaches a realistic duplicate-question pipeline on Quora Question Pairs: prepare data, build a corpus, generate embeddings, search with HNSW, threshold similarity, evaluate failure modes, mine hard negatives, and retrain.
Mental Model
The repo is intentionally small, but every stage matches a production concern: data quality, semantic retrieval, decision thresholds, offline evaluation, and continuous improvement.
Prepare data: download Quora pairs, normalize text, remove invalid rows, create train and validation splits, and build a unique question corpus.
Encoder: start with all-MiniLM-L6-v2, then optionally fine-tune for pair classification or retrieval-oriented training.
Index: use hnswlib for ANN search, or optionally FAISS for exact inner-product search over normalized embeddings.
Search: embed a query, search top-k neighbors, and recompute cosine scores for readable ranked results.
Threshold: use a similarity threshold to turn retrieval outputs into a binary duplicate decision.
Evaluate: measure retrieval with Recall@1, Recall@5, and MRR; measure classification with precision, recall, F1, and threshold sweeps.
Failure mining: export false positives, false negatives, and hard negatives so the encoder improves on the mistakes the current system actually makes.
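The stages above can be sketched end to end in a few lines. This is a minimal sketch, assuming NumPy only: random vectors stand in for real MiniLM embeddings, the search is exact inner product (the FAISS path; hnswlib gives the approximate version), and 0.77 is the operating threshold this repo reports from its validation sweep.

```python
import numpy as np

# Random vectors stand in for MiniLM sentence embeddings; in the repo they come
# from encoding the unique-question corpus with all-MiniLM-L6-v2.
rng = np.random.default_rng(0)
corpus_emb = rng.normal(size=(1000, 384))
corpus_emb /= np.linalg.norm(corpus_emb, axis=1, keepdims=True)  # L2-normalize

# A "query" that is a lightly perturbed copy of corpus item 42.
query_emb = corpus_emb[42] + 0.01 * rng.normal(size=384)
query_emb /= np.linalg.norm(query_emb)

# Exact inner-product search; on normalized vectors the inner product already
# IS the cosine score, so the ranking needs no separate rescoring.
scores = corpus_emb @ query_emb
top_k = np.argsort(-scores)[:5]

THRESHOLD = 0.77  # the operating point this repo reports from its sweep
is_duplicate = scores[top_k[0]] >= THRESHOLD
```

The threshold at the end is the decision layer: retrieval produces ranked candidates, and the cutoff turns the best candidate's score into a duplicate / non-duplicate call.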
Constant Improvement
This repo is designed to teach that de-duplication quality is driven by the quality of the failure analysis loop.
What the experiments showed
The strongest result in this project came from contrastive training plus active learning, not from the most specialized retrieval loss. That is a useful lesson: better-shaped supervision beats more complicated training when the feedback loop is aligned with the system’s real errors.
Mermaid view of the active-learning loop
Retriever Comparison
The repo now benchmarks dense, lexical, and hybrid retrieval on the same validation setup. This matters because duplicate detection quality is limited by candidate generation before thresholding even starts.
The corpus is only 40k questions and the index settings are strong enough that the ANN approximation error is negligible.
Near-duplicates often share substrings, morphology, or spelling variants that a semantic encoder and a word-level tokenizer can each partially miss.
Dense retrieval finds paraphrases, while character n-grams recover lexical closeness. Score fusion combines both without training a new model.
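The fusion idea can be sketched without any training. This is a minimal sketch: Jaccard overlap of character n-grams stands in for the repo's character TF-IDF retriever, and `alpha` is an illustrative blend weight, not a tuned value.

```python
import numpy as np

def char_ngrams(text, n=3):
    """Set of character n-grams of a lowercased string."""
    t = text.lower()
    return {t[i:i + n] for i in range(len(t) - n + 1)}

def lexical_sim(a, b, n=3):
    """Jaccard overlap of character n-grams (stand-in for char TF-IDF).
    Robust to morphology and misspellings that word tokenizers miss."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / max(len(ga | gb), 1)

def fuse(dense_scores, lexical_scores, alpha=0.7):
    """Min-max normalize each score list, then blend; no new model needed."""
    def norm(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return alpha * norm(dense_scores) + (1 - alpha) * norm(lexical_scores)
```

Normalizing before blending matters because dense cosine scores and lexical overlap scores live on different scales; min-max is the simplest way to make them comparable.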
Contrastive Loss Math
The best training recipe in this repo uses sentence-transformers ContrastiveLoss with cosine distance and margin 0.5.
The math explains why it pairs well with hard negatives and active learning.
For a pair (q1, q2), the encoder produces two vectors:
e1 = f(q1)
e2 = f(q2)
The loss then measures how close or far those vectors are in embedding space.
The library uses cosine distance, derived from cosine similarity:
s(q1, q2) = (e1 dot e2) / (||e1|| * ||e2||)
d(q1, q2) = 1 - s(q1, q2)
Duplicates should have small d. Non-duplicates should have larger d.
For duplicate labels y = 1, the loss becomes:
L_pos = 0.5 * d^2
This means the closer a duplicate pair already is, the smaller the gradient becomes. The model focuses hardest on positives that are still far apart.
For non-duplicate labels y = 0, the loss becomes:
L_neg = 0.5 * max(0, m - d)^2
If a negative pair is already farther apart than the margin m, it contributes zero loss. Only hard negatives inside the margin keep pushing gradients.
L = 0.5 * [ y * d^2 + (1 - y) * max(0, m - d)^2 ]
With m = 0.5, a negative pair stops being penalized once its cosine distance reaches 0.5, which is equivalent to its cosine similarity falling to 0.5 or below.
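The formulas above can be checked with a small NumPy implementation. This is a sketch of the math, not the library's code; a mean over the batch is added for convenience, while the per-pair term matches L = 0.5 * [y * d^2 + (1 - y) * max(0, m - d)^2] exactly.

```python
import numpy as np

def contrastive_loss(e1, e2, y, margin=0.5):
    """Contrastive loss over cosine distance, per the formulas above.

    e1, e2: (batch, dim) embeddings; y: (batch,) labels, 1 = duplicate.
    Returns the batch mean of the per-pair losses.
    """
    e1 = e1 / np.linalg.norm(e1, axis=1, keepdims=True)
    e2 = e2 / np.linalg.norm(e2, axis=1, keepdims=True)
    d = 1.0 - np.sum(e1 * e2, axis=1)                  # cosine distance
    pos = y * d ** 2                                   # pull duplicates closer
    neg = (1 - y) * np.maximum(0.0, margin - d) ** 2   # push near negatives out
    return float(0.5 * np.mean(pos + neg))
```

Note how both zero cases show up directly: a perfectly aligned duplicate pair contributes nothing, and a negative pair already outside the margin contributes nothing. Only misplaced pairs generate gradient.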
The margin is a training rule about embedding geometry. The serving threshold is chosen later by sweeping validation similarities and selecting the cutoff that gives the best precision-recall tradeoff. In this repo, the best operational threshold ended up around 0.77, well above the training margin boundary.
That is normal. Training says "push hard negatives outside the active margin." Evaluation says "given the whole learned space, where should the classifier cut for best F1?"
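The sweep itself is simple enough to sketch directly. `sweep_threshold` is a hypothetical helper name, not the repo's actual API; it scans a grid of cutoffs over labeled validation similarities and keeps the one with the best F1.

```python
import numpy as np

def sweep_threshold(sims, labels, grid=None):
    """Return the (threshold, F1) pair that maximizes validation F1.

    sims: (n,) similarity scores for labeled candidate pairs;
    labels: (n,) with 1 = duplicate, 0 = non-duplicate.
    """
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)
    best_t, best_f1 = 0.0, -1.0
    for t in grid:
        pred = sims >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1
```

Nothing in this loop refers to the training margin; the cutoff is chosen purely from how the learned space separates validation pairs, which is why 0.77 and 0.5 can coexist without contradiction.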
Experiment Timeline
These are measured local runs on the default 30k train / 5k validation / 40k corpus setup. The goal is to show how a de-duplication system changes as you improve supervision.
Active learning increased precision and recall together because the new training examples matched the system’s real confusion set.
MNRL (MultipleNegativesRankingLoss) delivered the strongest retrieval of the dense training losses, with Recall@1 of 0.6116 and MRR of 0.6956, but it produced too many false positives for final duplicate classification.
The MNRL active-learning retrain brought Recall@5 back to 0.8250, but because MNRL trains only on positive pairs in this repo, it could not use the mined negative feedback as directly as contrastive loss.
In de-duplication, retrieval and classification are not the same objective. MNRL helped ranking, while contrastive helped the final duplicate decision much more.
A simple dense plus character-TF-IDF hybrid reached Recall@1 of 0.6213 and Recall@5 of 0.8353, beating pure dense retrieval on the same corpus.
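The MNRL behavior above follows from its in-batch softmax objective, which can be sketched in NumPy. This is a sketch of the math, not the sentence-transformers implementation; `scale=20.0` is assumed here as a typical similarity-scaling value.

```python
import numpy as np

def mnrl_loss(q_emb, pos_emb, scale=20.0):
    """In-batch MultipleNegativesRankingLoss, sketched in NumPy.

    Row i of q_emb and pos_emb is a (question, known-duplicate) pair; the
    duplicates of all OTHER rows serve as free in-batch negatives. Only
    positive pairs carry labels, which is why mined negatives cannot be fed
    back as directly as with contrastive loss.
    """
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    p = pos_emb / np.linalg.norm(pos_emb, axis=1, keepdims=True)
    logits = scale * (q @ p.T)  # (batch, batch) scaled cosine similarities
    # Cross-entropy with target i for row i: each question must rank its own
    # duplicate above every in-batch negative.
    m = logits.max(axis=1, keepdims=True)
    log_softmax = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    return -np.mean(np.diag(log_softmax))
```

The objective only ever says "rank the true duplicate first"; it never pushes pairs below an absolute distance, which is one way to see why it helped ranking metrics more than the final thresholded decision.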
Learning Material
Use the codebase as the lab and this sequence as the teaching order.
Duplicate detection is not only a classifier. It is usually retrieval first, then scoring or reranking.
Corpus construction, question normalization, and train-validation splits determine whether metrics mean anything.
Sentence-transformers give a small, practical baseline that runs locally and supports both search and training.
HNSW lets you search large corpora quickly, but retrieval quality depends on both index settings and the embedding geometry.
A threshold sweep is not an afterthought. It is the operational policy layer for duplicate vs non-duplicate decisions.
Recall@k and MRR tell you if the right candidate is retrieved. Precision, recall, and F1 tell you if the decision layer behaves well.
False positives, false negatives, and hard negatives are not side outputs. They are the next version of the training set.
The aligned triplet experiment improved the data alignment but still lost to contrastive plus active learning. That is a real systems lesson.
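The failure-mining idea can be sketched as a small post-retrieval pass. `mine_hard_negatives` and its inputs are hypothetical names for illustration, not the repo's actual API; the 0.77 default mirrors the operating threshold reported elsewhere in this doc.

```python
def mine_hard_negatives(sims, neighbor_ids, duplicate_sets, threshold=0.77):
    """Split top-k retrieval hits into errors worth retraining on.

    sims: per-query lists of cosine scores; neighbor_ids: matching corpus ids;
    duplicate_sets: per-query sets of true-duplicate ids.
    Returns (false_positives, hard_negatives) as lists of (query_idx, corpus_id).
    """
    false_positives, hard_negatives = [], []
    for qi, (row_sims, row_ids) in enumerate(zip(sims, neighbor_ids)):
        for score, cid in zip(row_sims, row_ids):
            if cid in duplicate_sets[qi]:
                continue  # true duplicate: not a negative at all
            if score >= threshold:
                false_positives.append((qi, cid))  # wrong "duplicate" call
            else:
                hard_negatives.append((qi, cid))   # retrieved, but below cutoff
    return false_positives, hard_negatives
```

Both output lists are non-duplicates that the current encoder placed near the query, which is exactly the confusion set the next training round should see.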
Hands-On Lab
python src/prepare_data.py
python src/train_encoder.py --mode embed
python src/build_index.py
python src/evaluate.py
python src/search.py --query "How can I learn Python fast?" --top_k 5
python src/train_encoder.py \
--mode finetune \
--loss_type contrastive \
--contrastive_margin 0.5 \
--epochs 1 \
--train_batch_size 32 \
--output_model_dir models/quora_miniLM_contrastive \
--embed_after_training \
--output_embeddings models/corpus_embeddings_contrastive.npy \
--embedding_metadata_file models/embedding_metadata_contrastive.json
python src/active_learning.py \
--model_name_or_path models/quora_miniLM_contrastive \
--embeddings_file models/corpus_embeddings_contrastive.npy \
--index_file indices/quora_hnsw_contrastive.index \
--index_metadata_file indices/quora_hnsw_contrastive_metadata.json \
--threshold_summary_file models/evaluation_summary_contrastive.json
Further Study
These links help connect the toy project to the real libraries and concepts behind it.
Quora Question Pairs: the exact dataset used by this project.
sentence-transformers documentation: model loading, encoding, fine-tuning, evaluators, and loss functions.
Loss references: ContrastiveLoss, MultipleNegativesRankingLoss, and TripletLoss.
The original Hadsell, Chopra, and LeCun paper on learning invariant embeddings with a margin.
hnswlib: the ANN index library used for local semantic retrieval.
FAISS: an optional exact or approximate vector-search backend; this repo now supports a FAISS CPU index too.
Useful background on lexical retrievers and why character n-grams helped.
all-MiniLM-L6-v2: the baseline encoder used to start the pipeline.
Background on precision, recall, F1, ranking metrics, and validation workflows.