Word embeddings for aggression identification

The First Shared Task on Aggression Identification was organised in conjunction with the First Workshop on Trolling, Aggression and Cyberbullying. The idea of the shared task was fairly simple: classify a text into one of three categories, Overtly Aggressive (OAG), Covertly Aggressive (CAG) or Non-aggressive (NAG). This makes it essentially a standard text categorisation task, for which a bag-of-words approach is a good baseline to start with. Neither I nor the task organisers provided such a baseline, so I do not know how accurate it would be.
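For readers unfamiliar with the term, here is a minimal sketch of the bag-of-words representation such a baseline would start from. It is only an illustration: the whitespace tokeniser, the function names and the example sentences are assumptions of mine, not something used in the shared task.

```python
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokeniser (an assumption; real systems
    # would normally use a proper tokeniser).
    return text.lower().split()


def build_vocabulary(texts):
    # Map every distinct token in the training texts to a column index.
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: index for index, tok in enumerate(tokens)}


def bag_of_words(text, vocab):
    # Count how often each vocabulary word occurs in the text.
    # Words outside the vocabulary are ignored.
    vector = [0] * len(vocab)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocab:
            vector[vocab[tok]] = count
    return vector


if __name__ == "__main__":
    training_texts = ["you are stupid", "have a nice day"]  # made-up examples
    vocab = build_vocabulary(training_texts)
    print(bag_of_words("you you are", vocab))
```

The resulting count vectors, paired with the OAG, CAG and NAG labels, could then be passed to any standard classifier.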

My approach for this task was to use word embeddings to calculate features which are used to train machine learning algorithms. My paper:

Constantin Orǎsan (2018) Aggressive Language Identification Using Word Embeddings and Sentiment Features. In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), pp. 113–119, Santa Fe, USA, August 25, http://aclweb.org/anthology/W18-4414

as well as the GitHub repository containing the code, provide more details about the approach. One thing I found interesting, however, is the relatively high performance of the method given how simple it is.
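To give a flavour of what "using word embeddings to calculate features" can mean, here is a minimal sketch that averages the vectors of the words in a text. Averaging is one common choice and is my assumption for this illustration; the paper and the repository describe the features actually used. The two-dimensional `TOY_EMBEDDINGS` table is made up, whereas a real system would load pre-trained vectors.

```python
# Made-up two-dimensional vectors, for illustration only.
TOY_EMBEDDINGS = {
    "good": [1.0, 0.0],
    "bad": [0.0, 1.0],
}


def average_embedding(tokens, embeddings, dim):
    # Collect the vectors of the tokens that have an embedding.
    vectors = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not vectors:
        # No known word: fall back to the zero vector.
        return [0.0] * dim
    # Component-wise mean of the collected vectors.
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


if __name__ == "__main__":
    features = average_embedding(["good", "bad", "unknown"], TOY_EMBEDDINGS, dim=2)
    print(features)
```

The fixed-length vector returned for each text can then serve as the input features for a machine learning algorithm.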

I tried the same method in the WASSA 2018 Implicit Emotion Shared Task (IEST), but its performance was very poor. I thought the task could be modelled in much the same way: given a bunch of words, predict a class. Not only did the method work poorly, but to obtain better results I had to consider only a window of three words on each side of the word (class) to predict. This probably does not make much sense if you are not familiar with IEST. Due to lack of time and the poor results, I did not prepare a paper describing them. However, I hope to get back to this task and investigate word embeddings more.
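The window of three words on each side can be sketched as follows. This is only an illustration under my own assumptions: the `[TARGET]` placeholder, the function name and the example sentence are hypothetical and do not reproduce the IEST data format.

```python
def context_window(tokens, target_index, size=3):
    # Up to `size` tokens to the left of the target position.
    left = tokens[max(0, target_index - size):target_index]
    # Up to `size` tokens to the right of the target position.
    right = tokens[target_index + 1:target_index + 1 + size]
    # The target token itself is excluded, since it is what we want to predict.
    return left + right


if __name__ == "__main__":
    # "[TARGET]" is a hypothetical placeholder for the word to predict.
    tokens = "I was so [TARGET] when my team lost the final".split()
    window = context_window(tokens, tokens.index("[TARGET]"))
    print(window)
```

Only the tokens in this window would then be used to compute the embedding features, instead of the whole text.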
