Lexical Coverage Results: PaniniFS Engine

This page documents the lexical coverage metrics of the PaniniFS semantic engine on two corpora: the Gutenberg corpus (classic texts) and the Wikipedia corpus.

Metric

Lexical coverage measures the proportion of content words (after removing function words) to which at least one semantic atom can be assigned without adding a new primitive.
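
As a rough illustration of the metric, the sketch below computes coverage for a short token list. The function name `lexical_coverage`, the toy function-word set, the toy lexicon, and the atom labels are all assumptions made for this example; none of them is taken from the PaniniFS engine.

```python
from typing import Iterable, Mapping, Set


def lexical_coverage(tokens: Iterable[str],
                     function_words: Set[str],
                     lexicon: Mapping[str, Set[str]]) -> float:
    """Share of content words that map to at least one known atom."""
    # Content words are what remains after removing function words.
    content = [t.lower() for t in tokens if t.lower() not in function_words]
    if not content:
        return 0.0
    # A word counts as covered if the lexicon assigns it a non-empty atom set.
    covered = sum(1 for t in content if lexicon.get(t))
    return covered / len(content)


# Toy data (hypothetical): the atom labels are placeholders.
FUNCTION_WORDS = {"the", "a", "of", "to"}
LEXICON = {"king": {"AGENT"}, "gives": {"TRANSFER"}, "gold": {"OBJECT"}}

tokens = "the king gives a sack of gold".split()
coverage = lexical_coverage(tokens, FUNCTION_WORDS, LEXICON)
print(f"{coverage:.1%}")  # 75.0% (3 of 4 content words covered)
```

In this toy case "sack" is the only uncovered content word, so it is the kind of entry a later lexicon version would add.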

Global state: expanded Gutenberg corpus (v4.8.16)

62 files · 12 languages · ~5.8M words (as of February 2026)

Language	Code	Coverage	Family	Script
English	`en`	81.4%	IE/Germanic	Latin
German	`de`	81.4%	IE/Germanic	Latin
French	`fr`	79.4%	IE/Romance	Latin
Chinese	`zh`	76.6%	Sino-Tibetan	CJK
Japanese	`ja`	74.1%	Japonic	CJK
Esperanto	`eo`	73.2%	Constructed	Latin
Finnish	`fi`	71.7%	Uralic	Latin
Italian	`it`	71.1%	IE/Romance	Latin
Spanish	`es`	68.7%	IE/Romance	Latin
Russian	`ru`	56.3%	IE/Slavic	Cyrillic
Dutch	`nl`	55.9%	IE/Germanic	Latin
Sanskrit	`sa`	10.7%	IE/Indic	IAST transliteration
Global	—	76.8%	—	—

Note: Scores on the expanded corpus (62 files, including difficult texts such as 14th-century Dante, pre-1918 Russian spelling, and pre-1947 Dutch spelling) are lower than on the original 11-file corpus, where all 7 European languages reached ≥ 90%.

Original Gutenberg corpus: 7 European languages (v4.8.11)

11 files · 7 languages · calibrated classical corpus

Language	Coverage	Status
English	94.4%	🟢
Esperanto	93.2%	🟢
German	91.1%	🟢
Finnish	90.6%	🟢
Spanish	90.1%	🟢
French	90.1%	🟢
Italian	90.1%	🟢
Global	91.2%	🎯

Milestone: 7/7 European languages ≥ 90%, achieved with v4.8.11 (February 21, 2026).

Wikipedia corpus (v4.7, Wikipedia Audit)

973 articles · 14 languages · 2.2M words

34/34 atoms covered across all languages = 100% atom presence
Cross-language cosine similarity (FR↔ZH = 0.904, EN↔FR = 0.93)
The 14 languages are: EN, FR, DE, ES, IT, FI, EO, PT, NL, JA, ZH, HI, SA, AR
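
The page does not say which vectors the cosine similarity is computed on. One plausible reading, shown here only as a sketch, is that each language is represented by a vector of atom frequencies (one dimension per atom) and that two languages are compared by the cosine of the angle between their vectors. The four-dimensional toy vectors below are invented for illustration and are not measured values.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: no signal, no similarity
    return dot / (norm_a * norm_b)


# Hypothetical atom-frequency profiles (the real engine uses 34 atoms).
profile_fr = [3, 1, 0, 2]
profile_zh = [2, 1, 1, 2]
print(round(cosine_similarity(profile_fr, profile_zh), 3))
```

A value close to 1 means the two languages use the atoms in similar proportions, regardless of corpus size.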

Version progression: original EU corpus

Evolution on the original Gutenberg EU corpus (11 files), from v4.8.2 to v4.8.11:

Version	New entries	Global gain	Coverage	Milestone
v4.8.2	base	—	85.1%
v4.8.3	771	+2.3pp	87.4%
v4.8.4	584	+1.4pp	88.8%
v4.8.5	algo fixes	+0.2pp	89.0%
v4.8.6	400	+0.4pp	89.4%
v4.8.7	307	+0.7pp	90.1%	🎯 90% global
v4.8.8	136	+0.4pp	90.5%	FR ≥ 90%
v4.8.9	113	+0.3pp	90.8%
v4.8.10	110	+0.2pp	91.0%
v4.8.11	124	+0.2pp	91.2%	🎯 7/7 EU ≥ 90%
Total	~2,550	+6.1pp	91.2%
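
The totals in the last row can be re-derived from the per-version rows. The check below is only a sketch: it stores gains in tenths of a percentage point to avoid floating-point rounding, and the variable names are illustrative.

```python
# (version, new entries or None for the algorithm-only release, gain in tenths of a pp)
STEPS = [
    ("v4.8.3", 771, 23),
    ("v4.8.4", 584, 14),
    ("v4.8.5", None, 2),
    ("v4.8.6", 400, 4),
    ("v4.8.7", 307, 7),
    ("v4.8.8", 136, 4),
    ("v4.8.9", 113, 3),
    ("v4.8.10", 110, 2),
    ("v4.8.11", 124, 2),
]
BASE_TENTHS = 851  # v4.8.2 baseline: 85.1%

total_entries = sum(n for _, n, _ in STEPS if n is not None)
total_gain = sum(g for _, _, g in STEPS)
final = BASE_TENTHS + total_gain

print(total_entries)    # 2545, reported as ~2,550
print(total_gain / 10)  # 6.1
print(final / 10)       # 91.2
```

The per-version figures are consistent with the reported totals. They also show diminishing returns: v4.8.3 gained 2.3pp from 771 entries, while v4.8.11 gained 0.2pp from 124.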

Multilingual breakthroughs: expanded corpus (v4.8.12 → v4.8.16)

After expansion to 62 files including non-European languages:

Japanese: 18.8% → 74.1% (+55.3pp)

File	Content	Before	After
pg1982	Rashomon (Akutagawa)	18.8%	74.0%
pg31617	Shisei (Tanizaki)	—	71.9%
pg31757	Omedetaki hito (Mushanokoji)	—	78.4%

Techniques: kanji-only tokenization, furigana 《》 stripping, OpenCC kyūjitai → simplified.
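
Two of these steps can be sketched with regular expressions. This assumes that "kanji-only tokenization" means keeping maximal runs of kanji and discarding kana, and that furigana appear as 《…》 annotations directly after the annotated word. The function names are hypothetical, and the OpenCC step is not reproduced here.

```python
import re

# Furigana annotation: everything between 《 and the next 》.
RUBY = re.compile(r"《[^》]*》")
# Runs of CJK Unified Ideographs, plus the iteration mark 々.
KANJI_RUN = re.compile(r"[\u4e00-\u9fff\u3005]+")


def strip_furigana(text: str) -> str:
    """Remove 《…》 reading annotations, keeping the annotated base text."""
    return RUBY.sub("", text)


def kanji_tokens(text: str) -> list:
    """Return maximal kanji runs after furigana removal."""
    return KANJI_RUN.findall(strip_furigana(text))


sample = "羅生門《らしょうもん》の下で雨《あめ》やみを待っていた"
print(strip_furigana(sample))  # 羅生門の下で雨やみを待っていた
print(kanji_tokens(sample))    # ['羅生門', '下', '雨', '待']
```

Stripping the annotations first matters: otherwise the reading would sit between the kanji and the following text and could be mistaken for content.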

Chinese: 33.8% → 73.9% (+40.1pp)

Techniques: OpenCC traditional→simplified, CJK punctuation filter, 471 entries (347 keywords, 64 stop words, 60 proper nouns).
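
A punctuation filter of this kind can be approximated with the standard library by dropping every character whose Unicode general category is a punctuation class. This is a sketch under that assumption; the engine's actual filter may use an explicit character list, and the function name is hypothetical. The traditional→simplified conversion is left to OpenCC and is not shown.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop characters whose Unicode category starts with 'P' (Po, Ps, Pe, Pd, ...)."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )


print(strip_punctuation("你好，世界！「测试」。"))  # 你好世界测试
```

Because the test is category-based, it also removes ASCII punctuation, which may or may not be wanted in mixed-script text.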

Russian: 16.5% → 56.3% (+39.8pp total)

File	Content	Before	After
pg16527	Commercial text	—	64.4%
pg14741	Derzhavin, spiritual odes	21.8%	48.9%
pg30774	Travelers in Muscovy (pre-1918 spelling)	13.6%	41.8%

Techniques: Snowball Russian stemmer, pre-1918 spelling normalizer (word-final ъ dropped, ѣ→е, і→и), 450 keywords, 250 stop words.
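
A minimal sketch of such a normalizer is given below. It covers only the three rules named above and assumes that the ъ rule means dropping the hard sign at the end of a word while keeping it inside words. The name `normalize_prereform_sketch` is hypothetical; this is not the engine's `normalize_prereform_ru()`.

```python
import re

# Letter substitutions from the 1918 reform that are listed on this page.
_LETTER_MAP = str.maketrans({"ѣ": "е", "Ѣ": "Е", "і": "и", "І": "И"})
# Hard sign immediately before a word boundary, i.e. at the end of a word.
_FINAL_HARD_SIGN = re.compile(r"[ъЪ]\b")


def normalize_prereform_sketch(text: str) -> str:
    """Map pre-1918 spellings to modern ones (partial rule set)."""
    return _FINAL_HARD_SIGN.sub("", text.translate(_LETTER_MAP))


print(normalize_prereform_sketch("міръ"))                # мир
print(normalize_prereform_sketch("хлѣбъ и объявленіе"))  # хлеб и объявление
```

The word-internal ъ in "объявленіе" is preserved because it is followed by another letter, so only the obsolete final hard sign is removed.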

Dutch: 28.4% → 55.9% (+27.5pp total)

File	Content	Before	After
pg17525	Buysse, Flemish prose	41.7%	52.5%
pg18066	Columbus, exploration	37.9%	56.8%

Techniques: Snowball Dutch stemmer, 48-pair pre-1947 spelling table (zoo→zo, groote→grote), 350 keywords, 180 stop words.
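
A spelling table like this can be applied as a whole-word lookup before stemming. The sketch below contains only the two pairs quoted above, since the other 46 are not listed on this page; the function name and the capitalization handling are assumptions.

```python
import re

# Only the two pairs quoted on this page; the engine's table has 48.
SPELLING_PAIRS = {"zoo": "zo", "groote": "grote"}
WORD = re.compile(r"\w+")


def modernize_dutch(text: str) -> str:
    """Replace whole words found in the spelling table, keeping initial capitals."""
    def replace(match):
        word = match.group(0)
        modern = SPELLING_PAIRS.get(word.lower())
        if modern is None:
            return word  # not in the table: leave untouched
        return modern.capitalize() if word[0].isupper() else modern

    return WORD.sub(replace, text)


print(modernize_dutch("Zoo groote schepen"))  # Zo grote schepen
```

Matching whole words rather than substrings avoids rewriting longer words that merely contain one of the old spellings.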

Notable per-file results (expanded EU corpus, v4.8.15)

File	Language	Content	Coverage
pg1232	EN	The Prince (Machiavelli)	83.6%
pg2407	DE	Also Sprach Zarathustra	89.1%
pg2000	ES	Don Quijote	86.4%
pg17989	FR	De la Terre à la Lune (Verne)	90.1%
pg1012	IT	Divina Commedia (Dante, 14th c.)	~81%
pg16328	EN	Beowulf (ancient poetry)	81.6%
pg74	EN	Tom Sawyer (Twain)	83.6%
pg5185	EN	Kalevala (English translation)	80.9%

Cross-language spillover effects

The v4.8.14 validation revealed untargeted gains due to kanji/hanzi sharing:

Language	Before v4.8.14	After	Gain
Esperanto	67.3%	73.2%	+5.9pp
Finnish	66.0%	71.7%	+5.7pp
German	77.8%	80.6%	+2.8pp
Chinese	73.9%	76.6%	+2.7pp
French	75.8%	78.4%	+2.6pp

Key insight: Japanese kanji share characters with Chinese hanzi; coverage gained for one language automatically benefits the other, confirming that the semantic atom is writing-system-independent.

Infrastructure and reproducibility

Component	Description
Engine	`seven_layers_engine.py` — 3,320 lines, 14 languages, 34 atoms
Finnish lemmatizer	`voikko` — inflected forms, past participles
Stemmers	Snowball for EN/FR/DE/ES/IT/FI/EO/RU/NL (9 languages)
Normalizer	`text_normalizer.py` — NFC, BCP 47, NFKC CJK, epoch detection
Russian normalization	`normalize_prereform_ru()` — pre-1918 spelling
CJK normalization	OpenCC `t2s` (traditional→simplified)
Dutch normalization	48-pair pre-1947 spelling table
Dolt corpus	3 databases (~215 MB), schema v3, ×877 optimization
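
The Unicode normalization forms named in the Normalizer row are available in Python's standard library. The sketch below assumes that NFC is applied by default and NFKC only to CJK text, which is one reading of the table; the function name and the `cjk` flag are hypothetical and do not describe the interface of `text_normalizer.py`.

```python
import unicodedata


def normalize_text(text: str, cjk: bool = False) -> str:
    """Apply NFC by default, or NFKC (compatibility folding) for CJK text."""
    form = "NFKC" if cjk else "NFC"
    return unicodedata.normalize(form, text)


# NFC composes a base letter and a combining accent into one code point.
print(normalize_text("e\u0301") == "\u00e9")  # True
# NFKC additionally folds fullwidth forms to their ASCII equivalents.
print(normalize_text("ＰＧ１９８２", cjk=True))  # PG1982
```

Normalizing before lexicon lookup keeps visually identical strings from being counted as different words.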