Classification by Compression: Application of Information-Theory Methods for the Identification of Themes of Scientific Texts

I. V. Selivanova, B. Ya. Ryabko, A. E. Guskov

Research output: Contribution to journal, Article, peer-reviewed

Abstract

A method for the automatic classification of scientific texts based on data compression is proposed. The method was implemented and evaluated on data from the arXiv.org archive of scientific texts and from the CyberLeninka scientific electronic library (CyberLeninka.ru). In the experiments, the method correctly identified the themes of scientific texts in 75-95% of cases; its accuracy depends on the quality of the source data.
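The general idea of classifying by compression can be illustrated with a minimal sketch. The abstract does not specify the compressor, the preprocessing, or the decision rule, so everything below is an assumption made for illustration: the sketch uses Python's standard `zlib` compressor, and the names `compressed_size`, `classify`, and the toy `corpora` are hypothetical. A text is assigned to the theme whose reference corpus lets it be compressed most cheaply, that is, the theme for which appending the text adds the fewest compressed bytes.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Return the length in bytes of `data` after zlib compression (level 9)."""
    return len(zlib.compress(data, 9))


def classify(text: str, corpora: Mapping[str, str]) -> str:
    """Pick the theme whose corpus makes `text` cheapest to compress.

    For each theme, the cost is the growth in compressed size when the
    text is appended to that theme's reference corpus. This decision
    rule is an illustrative assumption, not the authors' procedure.
    """
    doc = text.encode("utf-8")
    best_label, best_cost = None, None
    for label, corpus in corpora.items():
        ref = corpus.encode("utf-8")
        cost = compressed_size(ref + b"\n" + doc) - compressed_size(ref)
        if best_cost is None or cost < best_cost:
            best_label, best_cost = label, cost
    return best_label


# Toy reference corpora (hypothetical; real corpora would be much larger).
corpora = {
    "physics": ("the quantum field theory of gauge bosons and the scattering "
                "amplitude of electrons in a magnetic field "),
    "biology": ("the protein folding pathway of enzymes and the gene "
                "expression of cells in a bacterial colony "),
}

label = classify("scattering amplitude of electrons in a quantum field theory",
                 corpora)
print(label)
```

Note that `zlib` only finds repeats within a 32 KB window, so with large reference corpora a compressor with a longer context would be a more natural choice.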

Original language: English
Pages (from-to): 120-126
Number of pages: 7
Journal: Automatic Documentation and Mathematical Linguistics
Volume: 51
Issue number: 3
Publication status: Published - 1 Jun 2017

Keywords

  • classification
  • thematic classification of texts
  • information theory
  • text compression
  • arXiv.org
  • CyberLeninka

OECD FOS+WOS

  • 1.02 COMPUTER AND INFORMATION SCIENCES
