Web 2.0 and social media enable people to create, share, and discover information instantly, anywhere and anytime. A great amount of this information is subjective information -- information about people's subjective experiences, ranging from feelings about what is happening in their daily lives to opinions on a wide variety of topics. Subjective information is useful to individuals, businesses, and government agencies in supporting decision making in areas such as product purchase, marketing strategy, and policy making. However, much useful subjective information is buried in ever-growing user-generated data on social media platforms, and it is still difficult to extract high-quality subjective information and make full use of it with current technologies.
Current subjectivity and sentiment analysis research has largely focused on classifying text polarity -- whether the opinion expressed about a specific topic in a given text is positive, negative, or neutral. This narrow definition does not take into account other types of subjective information such as emotion, intent, and preference, which limits how fully such information can be exploited. This dissertation extends the definition and introduces a unified framework for mining and analyzing diverse types of subjective information. We identify four components of a subjective experience: an individual who holds it, a target that elicits it (e.g., a movie or an event), a set of expressions that describe it (e.g., "excellent", "exciting"), and a classification or assessment that characterizes it (e.g., positive vs. negative). Accordingly, this dissertation contributes novel and general techniques for the tasks of identifying and extracting these components.
We first explore the task of extracting sentiment expressions from social media posts. We propose an optimization-based approach that extracts a diverse set of sentiment-bearing expressions, including formal and slang words and phrases, for a given target from an unlabeled corpus. Instead of assigning an overall sentiment to a given text, this method assesses the more fine-grained, target-dependent polarity of each sentiment expression. Unlike pattern-based approaches, which often fail to capture the diversity of sentiment expressions because of the informal language and writing style of social media posts, the proposed approach can identify sentiment phrases of different lengths as well as slang expressions, including abbreviations and spelling variations. Unlike supervised approaches, which require data annotation when applied to a new domain, the proposed approach is unsupervised and thus highly portable to new domains.
We then look into the task of finding opinion targets in product reviews, where the product features (product attributes and components) are usually the targets of opinions. We propose a clustering approach that identifies product features and groups them into aspect categories. Unlike many existing approaches that first extract features and then group them into categories, the proposed approach identifies features and clusters them into aspects simultaneously. In addition, prior work on feature extraction tends to require seed terms and focuses on identifying explicit features, while the proposed approach extracts both explicit and implicit features and does not require seed terms.
Finally, we study the classification and assessment of several types of subjective information (e.g., sentiment, political preference, subjective well-being) in two specific application scenarios. One application is to predict election results by analyzing the sentiments of social media users towards election candidates. Observing that users' differing political preferences and tweeting behavior may have a significant effect on predicting election results, we propose methods to group users based on their political preference and participation in the discussion, and assess their sentiments towards the candidates to predict the results. We examine the predictive power of different user groups in predicting the results of the 2012 U.S. Republican Presidential Primaries. The other application is to understand the relationship between religiosity and subjective well-being (or happiness). We analyze the tweets and networks of more than 250k U.S. Twitter users who self-declared their beliefs. We build classifiers to distinguish believers of different religions using the self-declared data. To understand the effect of religiosity on happiness, we examine the pleasant and unpleasant emotional expressions in users' tweets to estimate their subjective well-being, and investigate the variations in happiness among religious groups.
This dissertation focuses on developing methods that require minimal human supervision or labeling effort (e.g., unsupervised methods, or supervised methods using self-labeled data) and can therefore be easily applied to new domains for many applications. The effectiveness of these methods has been demonstrated through evaluation on real-world datasets of user-generated content from different domains.