This paper addresses automatic summarization methods that represent a text as a graph, and it attempts a formal description of text transformation in terms of predicate calculus. The proposed method combines a linguistic knowledge base, a graph representation of texts, and machine learning. Text fragments such as words, sentences, and paragraphs are represented as graph nodes, and relations between them, for example rhetorical relations, are represented as edges. Automatically determining the rhetorical relations in a text makes it possible to locate the nucleus and the satellite of each relation. A brief summary is compiled by transforming the original text on the assumption that the nucleus contains the most important part of the statement. The relations between discourse markers in the text define a hierarchy that supports various natural language processing tasks, including automatically compiling a short summary of a large text. The summarization process developed by the authors consists of six main steps: preprocessing, topic modeling, rhetorical analysis and transformation, weight evaluation, sentence selection, and smoothing. Topic modeling is used to discover key terms. First, unigram topic models, which contain only one-word terms, are constructed. These models are then expanded by adding multiword terms. The most significant fragments of the source document are determined during rhetorical analysis using discourse markers. Representing texts as graphs helps to demonstrate the transformations needed to highlight important fragments. The assessment of fragment importance also takes into account keywords, multiword terms, and scientific terms characteristic of scientific and technical texts. A linguistic knowledge base was created to store information about the markers.
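The graph representation and the nucleus assumption described above can be illustrated with a toy sketch. It is not the authors' implementation: the `Relation` class, the `nucleus_fragments` helper, the relation label, and the sample sentences are all hypothetical, and the sketch ignores the weighting, topic modeling, and smoothing steps entirely. It only shows fragments as nodes, rhetorical relations as directed edges, and a selection rule that drops fragments acting as satellites.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """A rhetorical relation (edge) between two text fragments (nodes)."""
    name: str       # relation label, e.g. "Elaboration" (illustrative only)
    nucleus: int    # index of the nucleus fragment
    satellite: int  # index of the satellite fragment


def nucleus_fragments(fragments, relations):
    """Keep every fragment that is not the satellite of any relation."""
    satellites = {r.satellite for r in relations}
    return [f for i, f in enumerate(fragments) if i not in satellites]


# Hypothetical input: three sentences and one detected relation.
fragments = [
    "Graph models are useful for summarization.",
    "For example, nodes can stand for sentences.",
    "Rhetorical relations link the fragments.",
]
relations = [Relation("Elaboration", nucleus=0, satellite=1)]

summary = nucleus_fragments(fragments, relations)
```

Here the second sentence is dropped because it only elaborates on the first, which mirrors the assumption that the nucleus carries the most important part of the statement.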
The final step in forming the summary is smoothing, a text transformation procedure that makes the resulting summary more coherent and consistent. The importance of sentences is determined using discourse markers and connectors. We used Additive Regularization of Topic Models (ARTM) to extract keywords and discover topics. The proposed hybrid of BigARTM and RAKE for building topic models, combined with summary generation based on RST markers, actions, and templates, proved effective in testing and in comparison with other methods, as measured by precision, recall, and F-measure calculated in a way similar to [2, 10].
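As a reminder of how the three evaluation measures relate, the following is a minimal sketch under the assumption that a summary is scored by the overlap between the set of selected sentences and a reference set. The exact procedure of [2, 10] is not given here, so the function name `precision_recall_f`, the set-based comparison, and the sample indices are illustrative assumptions rather than the authors' protocol.

```python
def precision_recall_f(selected, reference):
    """Return (precision, recall, F-measure) for two collections of sentence ids."""
    selected, reference = set(selected), set(reference)
    overlap = len(selected & reference)
    precision = overlap / len(selected) if selected else 0.0
    recall = overlap / len(reference) if reference else 0.0
    denom = precision + recall
    # Balanced F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical example: the system picked sentences 0, 2, 5;
# the reference summary contains sentences 0, 2, 3, 4.
p, r, f = precision_recall_f({0, 2, 5}, {0, 2, 3, 4})
```

With these inputs two of the three selected sentences are correct and two of the four reference sentences are recovered, so precision is 2/3, recall is 1/2, and the F-measure is 4/7.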
|Journal||Journal of Theoretical and Applied Information Technology|
|Status||Published - 29 Feb 2020|