People's emotions can be gleaned from the text they write by using machine learning techniques to build models that exploit large amounts of self-labeled emotion data from social media. Further, the self-labeled emotion data can be effectively adapted to train emotion classifiers in different target domains where training data are sparse.
Emotions are both prevalent in and essential to most aspects of our lives. They influence our decision-making, affect our social relationships and shape our daily behavior. With the rapid growth of emotion-rich textual content, such as microblog posts, blog posts, and forum discussions, there is a growing need to develop algorithms and techniques for identifying people's emotions expressed in text. Such identification has valuable implications for studies of suicide prevention, employee productivity, people's well-being, customer relationship management, etc. However, emotion identification is quite challenging, partly for the following reasons: i) It is a multi-class classification problem that usually involves at least six basic emotions. Text describing an event or situation that causes an emotion can be devoid of explicit emotion-bearing words, so the distinction between different emotions can be very subtle, which makes it difficult to glean emotions purely from keywords. ii) Manual annotation of emotion data by human experts is very labor-intensive and error-prone. iii) Existing labeled emotion datasets are relatively small and thus fail to provide comprehensive coverage of emotion-triggering events and situations.
This dissertation aims to understand the emotion identification problem and to develop general techniques to tackle the above challenges. First, to address the challenge of fine-grained emotion classification, we investigate a variety of lexical, syntactic, knowledge-based, context-based and class-specific features, and show how much each of these features contributes to the performance of machine learning classifiers. We also propose a method that automatically extracts syntactic patterns to build a rule-based classifier that improves the accuracy of identifying minority emotions. Second, to deal with the challenge of manual annotation, we leverage emotion hashtags to harvest Twitter "big data" and collect millions of self-labeled emotion tweets, the labeling quality of which is further improved by filtering heuristics. We find that the size of the training data plays an important role in the emotion identification task, because a larger training set provides more comprehensive coverage of different emotion-triggering events and situations. Further, unigram and bigram features alone achieve performance that is competitive with the best performance obtained using a combination of n-gram, knowledge-based and syntactic features. Third, to handle the paucity of labeled emotion datasets in many domains, we seek to exploit the abundant self-labeled tweet collection to improve emotion identification in text from other domains, e.g., blog posts and fairy tales. We propose an effective data selection approach that iteratively selects source data that are informative about the target domain, and uses the selected data to enrich the target domain training data. Experimental results show that the proposed method outperforms state-of-the-art domain adaptation techniques on datasets from four different domains: blog, experience, diary and fairy tales.
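As a rough illustration of the hashtag-based self-labeling and the unigram and bigram features described above, the following minimal sketch may help. It is not the dissertation's implementation: the hashtag-to-emotion map, the specific filtering heuristics (dropping retweets and tweets with links, requiring exactly one emotion, requiring the emotion hashtag at the end, and a minimum word count), and the names `EMOTION_HASHTAGS`, `self_label` and `ngram_features` are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical hashtag-to-emotion map (an assumption, not the actual list used).
EMOTION_HASHTAGS = {
    "#happy": "joy",
    "#joy": "joy",
    "#sad": "sadness",
    "#angry": "anger",
    "#scared": "fear",
}

HASHTAG_RE = re.compile(r"#\w+")
URL_RE = re.compile(r"https?://\S+")


def self_label(tweet, min_words=3):
    """Return (text_without_emotion_hashtags, emotion), or None if filtered out.

    All filters below are assumed, illustrative heuristics.
    """
    text = tweet.strip()
    # Assumed heuristic: skip retweets and tweets containing links.
    if text.lower().startswith("rt ") or URL_RE.search(text):
        return None
    tags = [t.lower() for t in HASHTAG_RE.findall(text)]
    emotions = {EMOTION_HASHTAGS[t] for t in tags if t in EMOTION_HASHTAGS}
    # Assumed heuristic: keep only tweets that signal exactly one emotion.
    if len(emotions) != 1:
        return None
    tokens = text.split()
    # Assumed heuristic: the emotion hashtag must be the final token.
    if tokens[-1].lower() not in EMOTION_HASHTAGS:
        return None
    # Remove the emotion hashtags so the label does not leak into the features.
    words = [w for w in tokens if w.lower() not in EMOTION_HASHTAGS]
    # Assumed heuristic: require a minimum amount of remaining content.
    if len(words) < min_words:
        return None
    return " ".join(words), emotions.pop()


def ngram_features(text):
    """Count unigrams and bigrams over lowercased, whitespace-split tokens."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)


if __name__ == "__main__":
    labeled = self_label("Got the job offer today #happy")
    if labeled is not None:
        text, emotion = labeled
        print(emotion, dict(ngram_features(text)))
```

Stripping the emotion hashtag before feature extraction is a deliberate choice in this sketch: otherwise a classifier could learn to read the label directly from the text.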
Finally, we apply the proposed research to analyze cursing, an emotion-rich activity, on Twitter. We explore a set of questions that prior studies have recognized as crucial for understanding cursing in offline communication, including ubiquity, utility, contextual dependencies, and people factors.