Design and Development of Pipeline of Preprocessing Tools for Kazakh Language Texts

Madina Mansurova, Vladimir B. Barakhnin, Gulmira Madiyeva, Nurgali Kadyrbek, Bekzhan Dossanov

Research output: Publications in books, reports, collections, conference proceedings; article in conference proceedings; research; peer-reviewed

Abstract

The Kazakh language currently belongs to the category of less-resourced languages: few resources that support the analysis of text documents, such as text corpora, electronic dictionaries, morphological analyzers, and thesauri, have been developed and made accessible to a wide range of users. The aim of this work is the design and development of a pipeline of preprocessing tools for a media-corpus of the Kazakh language. The media-corpus is hosted by al-Farabi Kazakh National University and serves linguists as an empirical basis for research on the contemporary written Kazakh language. In developing the pipeline of preprocessing tools for the media-corpus, the lexical and grammatical features of the Kazakh language were analyzed, and on this basis the set of fundamental rules for word inflection in the Kazakh language was determined. In the course of the research, tools for generating and lemmatizing Kazakh word forms were created. The proposed tools can be applied at the morphological analysis stage of automatic text analysis systems and in the creation of thesauri and ontologies. For cases of homonymy, a template method was used, which reduces the level of homonymy.
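To make the idea of rule-based generation and lemmatization of word forms concrete, the following is a minimal sketch and not the authors' implementation. It covers only the nominative plural suffix, selected by the final sound of the stem and by vowel harmony, and it lemmatizes by checking that a candidate stem regenerates the observed form. The function names, the letter classes, and the tiny lexicon are assumptions made for illustration; input is assumed to be lowercase, and loanwords or stems with ambiguous vowels are not handled.

```python
# Illustrative sketch only: a toy plural generator and lemmatizer for Kazakh nouns.
# All names and the lexicon below are hypothetical, not taken from the paper.

BACK_VOWELS = set("аоұы")
FRONT_VOWELS = set("әеөүі")
VOWELS = BACK_VOWELS | FRONT_VOWELS


def is_front(stem: str) -> bool:
    """Return True if the last unambiguous vowel of the stem is a front vowel."""
    for ch in reversed(stem):
        if ch in FRONT_VOWELS:
            return True
        if ch in BACK_VOWELS:
            return False
    return False  # assumption: default to back harmony if no vowel is found


def plural(stem: str) -> str:
    """Generate the plural form of a noun stem (simplified rule set)."""
    last = stem[-1]
    if last in VOWELS or last in "рйу":
        onset = "л"  # after vowels and р, й, у
    elif last in "лмнңжз":
        onset = "д"  # after л, м, н, ң, ж, з
    else:
        onset = "т"  # after voiceless consonants and б, в, г, д
    return stem + onset + ("ер" if is_front(stem) else "ар")


def lemmatize(form: str, lexicon: set[str]) -> set[str]:
    """Return all lexicon stems whose base or plural form equals the input."""
    candidates = set()
    if form in lexicon:
        candidates.add(form)
    if len(form) > 3:
        stem = form[:-3]  # every plural suffix in this sketch has three letters
        if stem in lexicon and plural(stem) == form:
            candidates.add(stem)
    return candidates


if __name__ == "__main__":
    lexicon = {"бала", "кітап", "қыз", "үй", "мектеп", "көл"}
    for stem in sorted(lexicon):
        print(stem, plural(stem))
    print(lemmatize("кітаптар", lexicon))
```

Because `lemmatize` returns a set, a word form with several valid analyses would yield several candidates, which is the point where a disambiguation step such as the template method mentioned in the abstract would apply.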

Original language: English
Title of host publication: Human Language Technology. Challenges for Computer Science and Linguistics - 9th Language and Technology Conference, LTC 2019, Revised Selected Papers
Editors: Zygmunt Vetulani, Patrick Paroubek, Marek Kubis
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 129-142
Number of pages: 14
ISBN (Print): 9783031053276
DOI
Publication status: Published - 2022
Event: 9th Language and Technology Conference, LTC 2019 - Poznań, Poland
Duration: 17 May 2019 to 19 May 2019

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13212 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 9th Language and Technology Conference, LTC 2019
Country/Territory: Poland
City: Poznań
Period: 17.05.2019 to 19.05.2019

OECD FOS+WOS subject areas

  • 1.02 COMPUTER AND INFORMATION SCIENCES
  • 1.01 MATHEMATICS
