Automated detection of non-relevant posts on the Russian imageboard “2ch”: Importance of the choice of word representations

Amir Bakarov, Olga Gureenkova

Research output: Publications in books, reports, collections, conference proceedings › Article in conference proceedings › Research › Peer-reviewed

1 Citation (Scopus)

Abstract

This study considers the problem of automated detection of non-relevant posts on Web forums and discusses an approach that approximates this problem with the task of detecting semantic relatedness between a given post and the opening post of the forum discussion thread. The approximated task can be solved by training a supervised classifier on the composed word embeddings of the two posts. Because success in this task could be quite sensitive to the choice of word representations, we compare the performance of different word embedding models. We train 7 models (Word2Vec, GloVe, Word2Vec-f, Wang2Vec, AdaGram, FastText, Swivel), evaluate the embeddings they produce on a dataset of human judgements, and compare their performance on the task of non-relevant post detection. To make the comparison, we propose a dataset of semantic relatedness with posts from one of the most popular Russian Web forums, the imageboard “2ch”, which has challenging lexical and grammatical features.
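The idea of feeding a classifier with the composed embeddings of two posts can be illustrated with a minimal sketch. The abstract does not say how the embeddings are composed, so the sketch assumes the simplest option: each post is represented by the average of its word vectors, and the two post vectors are concatenated together with their cosine similarity. The toy vectors and the names `TOY_VECTORS`, `post_vector`, and `pair_features` are hypothetical and are not taken from the paper.

```python
from math import sqrt

# Hypothetical toy embeddings; a real setup would load vectors
# from one of the trained models (Word2Vec, FastText, and so on).
TOY_VECTORS = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "pet": [0.9, 0.1, 0.0],
    "tax": [0.0, 0.0, 1.0],
    "law": [0.0, 0.2, 0.8],
}
DIM = 3


def post_vector(tokens, vectors=TOY_VECTORS, dim=DIM):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def pair_features(opening_tokens, reply_tokens):
    """Concatenate both post vectors and append their cosine similarity."""
    op = post_vector(opening_tokens)
    reply = post_vector(reply_tokens)
    return op + reply + [cosine(op, reply)]


opening = ["cat", "pet"]
on_topic = pair_features(opening, ["dog", "pet"])
off_topic = pair_features(opening, ["tax", "law"])
print(len(on_topic), on_topic[-1] > off_topic[-1])
```

A feature vector built this way could be passed to any supervised classifier together with a relevant/non-relevant label. Swapping the embedding model changes only the contents of the vector table, which is why the choice of word representations can be compared in isolation.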

Original language: English
Title of host publication: Analysis of Images, Social Networks and Texts - 6th International Conference, AIST 2017, Revised Selected Papers
Editors: WMP VanDerAalst, DI Ignatov, M Khachay, SO Kuznetsov, Lempitsky, IA Lomazova, N Loukachevitch, A Napoli, A Panchenko, PM Pardalos, AV Savchenko, S Wasserman
Publisher: Springer-Verlag GmbH and Co. KG
Pages: 16-21
Number of pages: 6
ISBN (print): 9783319730127
DOI
Publication status: Published - 1 Jan 2018
Event: 6th International Conference on Analysis of Images, Social Networks and Texts, AIST 2017 - Moscow, Russian Federation
Duration: 27 Jul 2017 - 29 Jul 2017

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10716 LNCS
ISSN (print): 0302-9743
ISSN (electronic): 1611-3349

Conference

Conference: 6th International Conference on Analysis of Images, Social Networks and Texts, AIST 2017
Country: Russian Federation
City: Moscow
Period: 27.07.2017 - 29.07.2017
