ContextoSolver.
Reproducible NLP research framework benchmarking Word2Vec, fastText, GloVe, and SVD+PPMI on word similarity and beam-search navigation. GloVe achieves 99.8% navigation success rate.

ContextoSolver is a reproducible experimental framework that trains static word embeddings on the text8 corpus and evaluates them in two ways: intrinsic word-similarity correlation against human judgments (WordSim-353, SimLex-999), and a navigation task inspired by the word game Contexto, in which beam search moves from a start word to a target word through high cosine-similarity neighbors. Four embedding families are implemented with aligned hyperparameters (50-dimensional vectors, window 5, min count 5): Word2Vec (Gensim skip-gram), fastText (subword-enriched), GloVe (vendored Stanford C reference), and count-based SVD on a harmonic-weighted PPMI co-occurrence matrix. Each model is trained with 5 random seeds to support variance reporting and paired statistical tests.

Key finding: GloVe achieves the strongest navigation metrics (99.8% success rate, 5.0 median steps), while Word2Vec scores highest on SimLex-999 yet fails more often on navigation. This tension between lexical similarity benchmarks and geometric navigability motivates the dual-evaluation design.
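To make the count-based family concrete, here is a minimal sketch of an SVD-on-PPMI pipeline. It is an illustration under stated assumptions, not the project's code: "harmonic-weighted" is assumed to mean that a pair of words at distance d contributes 1/d to the co-occurrence count, the embedding is assumed to be the left singular vectors scaled by the square root of the singular values, and the function names are hypothetical. The dense matrix only suits toy vocabularies; a corpus the size of text8 would need a sparse representation and the min-count filter.

```python
import numpy as np


def harmonic_cooccurrence(tokens, window=5):
    """Symmetric co-occurrence counts; a pair at distance d adds 1/d (assumed weighting)."""
    vocab = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for i, word in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d >= len(tokens):
                break
            a, b = index[word], index[tokens[i + d]]
            counts[a, b] += 1.0 / d
            counts[b, a] += 1.0 / d
    return vocab, counts


def ppmi(counts):
    """Positive pointwise mutual information of a co-occurrence matrix."""
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0  # zero counts give -inf or nan; treat as no association
    return np.maximum(pmi, 0.0)


def svd_embeddings(matrix, dim):
    """Truncated SVD; rows are word vectors (sqrt singular-value scaling is an assumption)."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])


# Toy usage; the real pipeline would use dim=50 and window=5 on text8.
tokens = "the cat sat on the mat the dog sat on the rug".split()
vocab, counts = harmonic_cooccurrence(tokens, window=2)
embeddings = svd_embeddings(ppmi(counts), dim=3)
```

Clipping negative PMI values to zero keeps the matrix focused on word pairs that co-occur more often than chance, which is the usual reason for preferring PPMI over raw PMI.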
Problem
Word embedding models are typically evaluated only on intrinsic word-similarity benchmarks, which may not reflect their practical utility for multi-hop reasoning tasks. There is no standard extrinsic benchmark for geometric navigability in embedding space.
Solution
Designed a beam-search navigation task as a computable analogue to Contexto: move from start to target through cosine-similarity neighbors. Automated paired significance tests (McNemar, Wilcoxon, permutation) formalize model comparisons across both intrinsic and extrinsic dimensions.
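The navigation task can be sketched as follows. This is one plausible reading of the description above, not the project's implementation: the function name `navigate`, the parameters `beam_width`, `k_neighbors`, and `max_steps`, and the choice to rank candidates by cosine similarity to the target are all assumptions made for illustration.

```python
import numpy as np


def navigate(vectors, start, target, beam_width=3, k_neighbors=5, max_steps=10):
    """Beam search from row `start` to row `target`; returns a path of row indices or None."""
    if start == target:
        return [start]
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T                # pairwise cosine similarities
    np.fill_diagonal(sims, -np.inf)     # a word is never its own neighbor
    beam = [[start]]
    visited = {start}
    for _ in range(max_steps):
        candidates = []
        for path in beam:
            # Allowed moves: the k most similar words to the current word.
            for n in np.argsort(-sims[path[-1]])[:k_neighbors]:
                n = int(n)
                if n == target:
                    return path + [n]
                if n not in visited:
                    candidates.append((sims[n, target], path + [n]))
        if not candidates:
            return None                 # dead end: every neighbor already visited
        # Keep the beam_width distinct words closest to the target.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for _, path in candidates:
            if path[-1] not in visited:
                visited.add(path[-1])
                beam.append(path)
            if len(beam) == beam_width:
                break
    return None                         # step budget exhausted


# Toy geometry: six unit vectors spaced along an arc, so each word's nearest
# neighbors are its adjacent points.
angles = 0.3 * np.arange(6)
toy_vectors = np.stack([np.cos(angles), np.sin(angles)], axis=1)
path = navigate(toy_vectors, start=0, target=5, beam_width=1, k_neighbors=2)
```

Under this reading, a trial succeeds when a path is returned within the step budget, and the step count is the path length minus one. Success rate and median steps would then be aggregated over many start and target pairs.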
- 01 Trained 4 embedding families (Word2Vec, fastText, GloVe, SVD+PPMI) on text8 with aligned hyperparameters across 5 random seeds
- 02 Beam-search navigation task: move from start to target through cosine-similarity neighbors in embedding space
- 03 GloVe: 99.8% success rate, 5.0 median steps; Word2Vec: 81.3%. The gap reveals a tension between navigation and intrinsic benchmarks
- 04 Automated paired significance tests (McNemar, Wilcoxon, permutation) across all model pairs and metrics
- 05 Full reproducibility: YAML configs, hashed run IDs, multi-seed variance reporting, pytest suite
- 99.8% GloVe Success Rate
- 4 Embedding Models
- 5 Random Seeds
- 200+ Trial Pairs
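Because every model attempts the same start and target pairs, navigation success can be compared with a paired test. As a hedged illustration of the McNemar comparison named above, the sketch below computes the exact two-sided version from the discordant trials only. The function name and the toy outcome lists are hypothetical, and the project may use a different variant (for example the chi-square approximation).

```python
from math import comb


def mcnemar_exact(success_a, success_b):
    """Exact two-sided McNemar test on paired boolean outcomes.

    Returns (only_a, only_b, p_value), where only_a counts trials that
    model A solved and model B did not, and only_b counts the reverse.
    """
    only_a = sum(1 for a, b in zip(success_a, success_b) if a and not b)
    only_b = sum(1 for a, b in zip(success_a, success_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0  # the models never disagree
    # Under the null hypothesis, each discordant trial favors either model
    # with probability 0.5, so the smaller count follows Binomial(n, 0.5).
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)


# Hypothetical outcomes for two models on the same 12 trial pairs.
model_a = [True] * 12
model_b = [True, True] + [False] * 10
only_a, only_b, p_value = mcnemar_exact(model_a, model_b)
```

Trials that both models solve, or both fail, carry no information about which model is better, which is why only the discordant counts enter the p-value.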
Backend
- Python
- NumPy / SciPy
- Gensim
- scikit-learn
- NLTK
Other
- pandas
- matplotlib
- pytest