
title: Computational semantic compression
status: draft


Computational semantic compression

Principles, metrics, and use cases (storage, communication, processing, learning, education).

Synthesis (draft)

  • Target minimal loss of meaning at fixed bitrate via Dhātu representations.
  • Evaluate trade-offs: fidelity, decodability, learnability.
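One candidate formalization of the fidelity-versus-bitrate trade-off in the first bullet is the information bottleneck objective (Tishby et al., listed in the references); this is offered here only as a sketch, not as the draft's committed objective:

```latex
\min_{p(z \mid x)} \; I(X; Z) - \beta \, I(Z; Y)
```

where $X$ is the source sentence, $Z$ its Dhātu encoding, $Y$ the meaning to be preserved, and $\beta$ trades encoding rate against retained semantic information.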

Metrics (to refine)

  • Concept coverage, mutual information retained, ambiguity rate, reconstruction accuracy.
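Two of these metrics can be sketched directly, assuming encodings are represented as sets of `TAG:value` strings (a representation choice made here for illustration, not fixed by the draft):

```python
def concept_coverage(gold: set, predicted: set) -> float:
    """Fraction of gold concepts recovered in the predicted encoding."""
    return len(gold & predicted) / len(gold) if gold else 1.0

def ambiguity_rate(decodings: list) -> float:
    """Fraction of surplus readings; 0.0 when the decoding is unique."""
    return (len(set(decodings)) - 1) / len(decodings) if decodings else 0.0

gold = {"ACTION:close", "OBJ:door", "AGENT:addressee"}
pred = {"ACTION:close", "OBJ:door"}
print(concept_coverage(gold, pred))  # 2 of 3 gold concepts recovered
```

Mutual information retained and reconstruction accuracy need a decoder and a similarity scorer, so they belong in the full evaluation protocol below.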

Minimal evaluation protocol

1) Bilingual toy corpus (FR/EN) of 100 sentences covering AAO, spatial relations, tense, negation, and quantification.
2) Gold manual encoding, plus a rule- or LLM-guided automatic attempt.
3) Decode to paraphrases and score similarity and ambiguity.
4) Report mean/median/variance for coverage, ambiguity, length, and reconstruction.
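The reporting step can be sketched with the standard library; the scores below are placeholder floats, not real corpus results:

```python
from statistics import mean, median, pvariance

def report(scores: list) -> dict:
    """Aggregate per-sentence scores into the statistics reported in step 4."""
    return {
        "mean": mean(scores),
        "median": median(scores),
        "variance": pvariance(scores),  # population variance over the corpus
    }

coverage_scores = [1.0, 0.8, 0.9, 1.0]  # hypothetical per-sentence coverage
print(report(coverage_scores))
```

The same aggregation would be run once per metric (coverage, ambiguity, length, reconstruction).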

Micro-cases

  • Simple instruction: "Close the door" → [ACTION:close][OBJ:door][AGENT:addressee]
  • Yes/No question: "Are you hungry?" → [INTERROGATIVE][STATE:hunger][AGENT:addressee]
  • Attachment ambiguity: "I saw the man with the telescope" → two readings (telescope as instrument of seeing vs. modifier of "man"), hence two candidate encodings
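The ambiguous case can be made concrete by listing both candidate encodings; the `INSTR` and `MOD` slots below are hypothetical, not part of the draft's inventory:

```python
# Two candidate encodings for "I saw the man with the telescope".
telescope_readings = [
    # Reading 1: the speaker used the telescope (instrument attachment).
    ["ACTION:see", "AGENT:speaker", "OBJ:man", "INSTR:telescope"],
    # Reading 2: the man had the telescope (noun-modifier attachment).
    ["ACTION:see", "AGENT:speaker", "OBJ:man+MOD:telescope"],
]

# An encoder that surfaces both readings lets the ambiguity rate count them.
print(len(telescope_readings))
```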

Implementation notes

  • Version a small Dhātu inventory (v0.1) in YAML for iteration.
  • Provide a validator with unit tests and toy sentences.
  • Grow coverage by domain (household objects, motion, social interactions).
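A minimal sketch of the validator bullet, with a tiny v0.1 inventory embedded inline for self-containment (in practice it would be loaded from the versioned YAML file); the inventory entries are illustrative, not a proposed Dhātu set:

```python
# Toy stand-in for the YAML-versioned Dhātu inventory (v0.1).
INVENTORY_V01 = {
    "ACTION": {"close", "see"},
    "OBJ": {"door", "man"},
    "AGENT": {"addressee", "speaker"},
    "STATE": {"hunger"},
}

def validate(encoding: list) -> list:
    """Return the slots of an encoding not licensed by the inventory."""
    errors = []
    for slot in encoding:
        tag, _, value = slot.partition(":")
        if tag not in INVENTORY_V01 or value not in INVENTORY_V01[tag]:
            errors.append(slot)
    return errors

print(validate(["ACTION:close", "OBJ:door", "AGENT:addressee"]))  # []
print(validate(["ACTION:fly"]))  # ['ACTION:fly']
```

Unit tests over the toy sentences would then assert that every gold encoding validates cleanly.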

Try it (mini harness)

  • Folder: experiments/dhatu/
  • List toy corpus: python experiments/dhatu/validator.py --list
  • Compute basic metrics: python experiments/dhatu/validator.py --metrics

References

  • Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379–423; 27(4), 623–656. DOI: 10.1002/j.1538-7305.1948.tb01338.x
  • Tishby, N., Pereira, F. C., & Bialek, W. (2000). The Information Bottleneck Method. arXiv:physics/0004057. DOI: 10.48550/arXiv.physics/0004057
  • Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786), 504–507. DOI: 10.1126/science.1127647
  • See also: ../research/references.md