Methodology

How the atlas is drawn.

A page on Contextualize is the output of a five-stage pipeline that runs over English Wiktionary. Each stage is independently re-runnable on the same SQLite corpus, which makes the whole system feel less like a build and more like a reading practice — each pass clarifying what the last one couldn't quite see.

① Source · Wiktextract

The raw data is the Wiktextract dump of English Wiktionary — about 1.3 million headwords, each with senses, pronunciations, etymology prose, etymology templates (the structured macros editors use to declare inheritance), and a web of synonyms, antonyms, hypernyms, hyponyms, and derived forms.
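As a rough illustration of what one record carries, here is a minimal sketch that reads Wiktextract-style JSON lines and reports which of the fields above are present. The sample record is invented for illustration, and the helper names (iter_records, summarize) are hypothetical, not the pipeline's own.

```python
import json

# Relationship fields named in the paragraph above, as top-level record keys.
RELATION_KEYS = ("synonyms", "antonyms", "hypernyms", "hyponyms", "derived")

# Illustrative record only; a real dump line carries many more fields.
SAMPLE_LINE = json.dumps({
    "word": "over",
    "senses": [{"glosses": ["Physically above."]}],
    "etymology_text": "From Middle English over, from Old English ofer.",
    "etymology_templates": [
        {"name": "inh", "args": {"1": "en", "2": "enm", "3": "over"}},
    ],
    "synonyms": [{"word": "above"}],
})


def iter_records(lines):
    """Yield one parsed record per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(record):
    """Report which parts of a headword record are populated."""
    return {
        "word": record.get("word"),
        "n_senses": len(record.get("senses", [])),
        "has_templates": bool(record.get("etymology_templates")),
        "relation_kinds": sorted(k for k in RELATION_KEYS if record.get(k)),
    }


summary = summarize(next(iter_records([SAMPLE_LINE])))
print(summary)
```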

Wiktionary is CC-BY-SA — sense glosses are theirs, attributed, and re-licensed under the same terms.

② Transform · one record per entry

Each Wiktextract record is reshaped into a clean WordEntry — pronunciations are deduplicated per region, senses keep both a short gloss and the full definition prose, and relationships (synonyms, antonyms, derived forms) are bucketed by kind. MediaWiki residue — template anchors, raw wikitext — is stripped.
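The reshaping can be sketched roughly as follows. This is a simplified stand-in rather than the actual transform: the WordEntry fields, the rule that the first tag on a pronunciation names its region, and the rule that the short gloss is the text before the first semicolon are all assumptions made for the example. Stripping MediaWiki residue is left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field

RELATION_KINDS = ("synonyms", "antonyms", "derived")


@dataclass
class WordEntry:
    word: str
    pronunciations: dict[str, list[str]] = field(default_factory=dict)
    senses: list[dict[str, str]] = field(default_factory=list)
    relations: dict[str, list[str]] = field(default_factory=dict)


def transform(record: dict) -> WordEntry:
    entry = WordEntry(word=record["word"])

    # Deduplicate pronunciations per region (assumed: first tag is the region).
    for sound in record.get("sounds", []):
        ipa = sound.get("ipa")
        if not ipa:
            continue  # audio-only items carry no transcription
        region = (sound.get("tags") or ["unspecified"])[0]
        bucket = entry.pronunciations.setdefault(region, [])
        if ipa not in bucket:
            bucket.append(ipa)

    # Keep both a short gloss and the full definition prose.
    for sense in record.get("senses", []):
        glosses = sense.get("glosses") or []
        if glosses:
            full = glosses[-1]
            short = full.split(";", 1)[0].strip()  # assumed shortening rule
            entry.senses.append({"short": short, "full": full})

    # Bucket relationships by kind, dropping repeats but keeping order.
    for kind in RELATION_KINDS:
        words: list[str] = []
        for item in record.get(kind, []):
            w = item.get("word")
            if w and w not in words:
                words.append(w)
        if words:
            entry.relations[kind] = words
    return entry


# Illustrative input, not taken from the corpus.
entry = transform({
    "word": "over",
    "sounds": [
        {"ipa": "/ˈoʊvɚ/", "tags": ["US"]},
        {"ipa": "/ˈoʊvɚ/", "tags": ["US"]},
        {"ipa": "/ˈəʊvə/", "tags": ["UK"]},
        {"audio": "example.ogg"},
    ],
    "senses": [{"glosses": ["Physically above; higher than."]}],
    "synonyms": [{"word": "above"}, {"word": "above"}],
    "antonyms": [{"word": "under"}],
})
```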

③ Etymology · templates, then prose

Wiktionary writes inheritance two ways: as structured templates ({{inh|en|enm|over}} → "Middle English over") and as prose ("From Middle English over, from Old English ofer…"). We read both. The template walker handles inh / der / bor / lbor / cal and their variants, normalizes each into an EtymologyStep, and reverses to earliest-ancestor-first order. Where templates are missing, a regex prose parser recovers the same shape with a curated language-name vocabulary (~80 languages).
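A minimal sketch of the two readers, assuming a tiny language table in place of the curated vocabulary and a deliberately naive template regex that ignores nested templates. The EtymologyStep fields and the function names are illustrative, not the project's actual definitions.

```python
import re
from dataclasses import dataclass

# Tiny stand-in for the curated language vocabulary (code -> name).
LANG_NAMES = {
    "enm": "Middle English",
    "ang": "Old English",
    "fro": "Old French",
    "la": "Latin",
}
ETY_TEMPLATES = {"inh", "der", "bor", "lbor", "cal"}


@dataclass(frozen=True)
class EtymologyStep:
    language: str
    term: str
    kind: str  # template name, or "prose" for the fallback parser


# Matches flat templates such as {{inh|en|enm|over}}; nesting is not handled.
TEMPLATE_RE = re.compile(r"\{\{([^{}|]+)\|([^{}]*)\}\}")


def steps_from_templates(wikitext):
    steps = []
    for match in TEMPLATE_RE.finditer(wikitext):
        name = match.group(1).strip()
        if name not in ETY_TEMPLATES:
            continue
        # Keep positional arguments only; named ones (t=, tr=) are skipped.
        args = [a.strip() for a in match.group(2).split("|") if "=" not in a]
        if len(args) < 3:
            continue
        _target, source, term = args[:3]
        steps.append(EtymologyStep(LANG_NAMES.get(source, source), term, name))
    steps.reverse()  # prose order is newest first; we want earliest first
    return steps


# Longest names first so that multi-word names win over any shorter overlap.
_names = sorted(set(LANG_NAMES.values()), key=len, reverse=True)
PROSE_RE = re.compile(
    r"\b[Ff]rom (" + "|".join(map(re.escape, _names)) + r") ([^\s,.;()]+)"
)


def steps_from_prose(prose):
    steps = [
        EtymologyStep(m.group(1), m.group(2), "prose")
        for m in PROSE_RE.finditer(prose)
    ]
    steps.reverse()
    return steps


def etymology_chain(wikitext, prose):
    """Prefer templates; fall back to prose when none are found."""
    return steps_from_templates(wikitext) or steps_from_prose(prose)


chain = etymology_chain(
    "From {{inh|en|enm|over}}, from {{inh|en|ang|ofer}}.",
    "From Middle English over, from Old English ofer.",
)
```

Both readers return the same shape, so later stages do not need to know which one produced a chain.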

④ Compound expansion · recursive grafts

Compounds like overview = over- + view have no inheritance of their own; they reference their constituent morphemes. The enrichment pass walks each compound entry, looks up the head morpheme in the same database, and adopts its chain (with a final synthesized step naming the compound formation). The pass repeats until a full sweep changes nothing, so stacks like bioinspirationalist (bio + inspirational + -ist) fully unfold: Latin īnspīrātus → Late Latin → Old French → Middle English → English inspirational → bioinspirational → bioinspirationalist.
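The grafting loop can be sketched as a fixed-point iteration. Everything here is an assumption made for illustration: chains are plain lists of strings, each compound records its head explicitly, and the intermediate entry bioinspirational is recorded as a compound in its own right. The max_passes guard is only there so that a circular pair of compounds cannot loop forever.

```python
def expand_compounds(chains, compounds, max_passes=50):
    """Graft each compound onto its head's chain until nothing changes.

    chains:    word -> list of step labels, earliest ancestor first
    compounds: word -> (head word, list of constituent morphemes)
    """
    chains = {word: list(steps) for word, steps in chains.items()}
    for _ in range(max_passes):
        changed = False
        for word, (head, parts) in compounds.items():
            head_chain = chains.get(head)
            if not head_chain:
                continue  # head not resolved yet; a later pass may fix that
            # Adopt the head's chain plus one synthesized formation step.
            grafted = head_chain + [f"{word} ({' + '.join(parts)})"]
            if chains.get(word) != grafted:
                chains[word] = grafted
                changed = True
        if not changed:
            break
    return chains


# Illustrative data; the outer compound is listed first on purpose,
# so it only resolves on the second pass.
chains = {"inspirational": ["Latin īnspīrātus", "English inspirational"]}
compounds = {
    "bioinspirationalist": ("bioinspirational", ["bioinspirational", "-ist"]),
    "bioinspirational": ("inspirational", ["bio-", "inspirational"]),
}
expanded = expand_compounds(chains, compounds)
```

Running the function again on its own output leaves every chain unchanged, which is the stopping condition the pass relies on.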

⑤ Vish · cycles in the definitional graph

A word's senses point at the words used in their definitions. Treat each pointer as a directed edge, and the resulting graph contains cycles that close back through the original word. The Vish pass walks every eligible entry (one with both in- and out-edges) and looks for a closed loop of length 7–10.

Two filtering decisions keep the loops semantically tight rather than mechanically literal:

  • Only the first two senses of each entry contribute edges. Beyond that we hit niche secondary readings ("necessary" → a euphemism for toilet) that pull cycles into unrelated fields.
  • The 200 highest-degree lemmas (grammatical metalanguage like plural/participle, administrative descriptors like surname/county) plus a hand-curated list of ~280 scaffolding words (adverbs of degree, participial connectors, generic categorial nouns) are removed from the graph entirely. Cycles can't route through them.
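Taken together, the two filters amount to something like the sketch below. It assumes that degree means in-degree plus out-degree, that definitions are matched to headwords by a naive lowercase tokenizer rather than real lemmatization, and that entries arrive as a plain dict of definitions. The function name and parameters are hypothetical.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z]+")


def build_graph(entries, top_k=200, scaffolding=frozenset(), max_senses=2):
    """entries: word -> list of sense definitions, in dictionary order."""
    # Filter 1: only the first `max_senses` senses contribute edges.
    edges = {}
    for word, senses in entries.items():
        targets = set()
        for definition in senses[:max_senses]:
            for token in WORD_RE.findall(definition.lower()):
                if token != word and token in entries:
                    targets.add(token)
        edges[word] = targets

    # Filter 2: drop the highest-degree hubs and the scaffolding list.
    degree = Counter()
    for word, targets in edges.items():
        degree[word] += len(targets)
        for target in targets:
            degree[target] += 1
    hubs = {word for word, _ in degree.most_common(top_k)}
    removed = hubs | set(scaffolding)

    return {
        word: {t for t in targets if t not in removed}
        for word, targets in edges.items()
        if word not in removed
    }


# Toy entries, invented for the example.
entries = {
    "cup": ["A vessel for drinking.", "A trophy.", "A noise word."],
    "vessel": ["A container such as a cup."],
    "container": ["A vessel or cup."],
    "trophy": ["A cup given as a prize; very shiny."],
    "very": ["To a great degree."],
    "noise": ["A sound."],
}
unfiltered = build_graph(entries, top_k=0)
filtered = build_graph(entries, top_k=1, scaffolding={"very"})
```

In the toy data the third sense of cup never produces an edge to noise, and once the single highest-degree word and the scaffolding word are removed, no remaining edge can pass through either.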

Cycles are found by bounded random walk, because at this graph size (millions of edges) a full breadth-first search for the shortest cycle through every node is too expensive. When a walk succeeds, the cycle it found is rotated and shared with all of its members, so one successful walk supplies a cycle for seven to ten word pages at once.
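A minimal sketch of such a walk, assuming the graph is a dict from each word to its successors. The attempt budget, the rule that a walk never revisits a word, and the function names are illustrative choices, not the pass's actual parameters.

```python
import random


def find_cycle(graph, start, rng, min_len=7, max_len=10, attempts=200):
    """Try random walks from `start`; return a closed loop of words or None."""
    for _ in range(attempts):
        path = [start]
        seen = {start}
        while True:
            successors = sorted(graph.get(path[-1], ()))
            # Close the loop as soon as it is long enough and an edge leads home.
            if len(path) >= min_len and start in successors:
                return path
            if len(path) == max_len:
                break  # walk budget exhausted
            options = [word for word in successors if word not in seen]
            if not options:
                break  # dead end
            step = rng.choice(options)
            path.append(step)
            seen.add(step)
    return None


def rotations(cycle):
    """Give every member the same cycle, rotated so that it comes first."""
    return {word: cycle[i:] + cycle[:i] for i, word in enumerate(cycle)}


# Toy graph: a single ring of eight placeholder words.
ring = {f"w{i}": [f"w{(i + 1) % 8}"] for i in range(8)}
cycle = find_cycle(ring, "w0", random.Random(0))
shared = rotations(cycle)
```

Because each rotation describes the same loop, one successful walk from w0 fills in the ring for all eight words without walking from the other seven.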

Curated entries · setting the bar

Three entries — nostalgia, light, and cup — are hand-written and take precedence over their corpus counterparts at render time. They demonstrate what an entry looks like at full density: editorial etymology prose, a notes-annotated chain, and a hand-traced Vish loop with narration written sentence by sentence. They are the visible bar for the rest of the corpus.

What's missing

About a quarter of headwords carry an etymology chain. The remainder are mostly inflected forms, abbreviations, and proper-noun entries that have no inheritance to extract — but a few percent are legitimate misses where the Wiktionary prose escapes the parser.

About one in twenty entries shows a Vish ring. Coverage scales with how connected the word is to the wider definitional graph; concrete nouns (apple, crow) and highly technical terms often don't close a loop within ten hops. Live cycle detection in the browser is the next major iteration.

The etymon detail pages (the destination when you click a reconstructed ancestor) are placeholders right now — gathering every descendant of a single root into one page is a substantial piece of unfinished work.