Search Results (1 - 18 of 18 Results)

Johnson, Eamon B. Methods in Text Mining for Diagnostic Radiology
Doctor of Philosophy, Case Western Reserve University, 2016, EECS - Computer and Information Sciences
Information extraction from clinical medical text is the computing challenge of bringing structure to the prose produced for communication in medical practice. In diagnostic radiology, prose reports are the primary means of communicating image interpretation to patients and other physicians, yet secondary use of a report requires either costly review by another radiologist or machine interpretation. In this work, we present mechanisms for improving machine interpretation of domain-specific text with large-scale semantic analysis, using a corpus of 726,000 real-world radiology reports as a basis for experimentation. We examine the abstract conceptual problem of detecting incidental findings (uncertain or unexpected results) in imaging study reports. We demonstrate that classifiers incorporating semantic metrics can outperform prior follow-up classification methods in F-measure and can also outperform in-clinic physician classification of incidental findings (0.689 versus 0.648). Further, we propose two semantic metrics, focus and divergence, calculated over the SNOMED-CT ontology graph, for summarizing and projecting discrete report concepts into two-dimensional space, enabling both machine classification and physician interpretation of classifications. Building on the utility of semantic metrics for classification, we present methods for enhancing extraction of semantic information from clinical corpora. First, we construct a zero-knowledge method for imputing the semantic class of unlabeled terms by maximizing a confidence factor computed from pairwise co-occurrence statistics and recall-limiting rules. Experiments on corpora of reduced Mandelbrot information temperature show that our method accurately labels up to 25% of terms not labeled by prior methods. Second, we propose a method for context-sensitive quantification of relative concept salience, and an algorithm that increases both the salience and the diversity of concepts in document summaries in 28% of reports.
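
The focus/divergence idea lends itself to a compact illustration. The sketch below, in Python with networkx, computes two toy graph metrics over a stand-in ontology; the metric definitions, the focus_and_divergence helper, and the concept graph are illustrative assumptions, not the dissertation's actual formulas or SNOMED-CT data.

```python
import itertools
import networkx as nx

# Toy stand-in for a SNOMED-CT-like "is-a" hierarchy (hypothetical concepts).
ontology = nx.Graph([
    ("finding", "mass"), ("finding", "nodule"), ("finding", "fracture"),
    ("mass", "lung mass"), ("nodule", "lung nodule"), ("fracture", "rib fracture"),
])

def focus_and_divergence(concepts):
    """Summarize a set of report concepts as a 2-D point: here focus is the
    inverse mean pairwise graph distance (tightness of the concept set) and
    divergence is the maximum pairwise distance (spread across the ontology)."""
    dists = [nx.shortest_path_length(ontology, a, b)
             for a, b in itertools.combinations(concepts, 2)]
    return 1.0 / (sum(dists) / len(dists)), max(dists)

# Each report becomes an (x, y) point usable by a classifier or a physician.
print(focus_and_divergence(["lung mass", "lung nodule", "rib fracture"]))
```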

Committee:

Gultekin Ozsoyoglu (Committee Chair); Marc Buchner (Committee Member); Adam Perzynski (Committee Member); Andy Podgurski (Committee Member)

Subjects:

Computer Science

Keywords:

text mining; diagnostic radiology; information extraction; clinical text mining

Kayaalp, Naime F. Deciding Polarity of Opinions over Multi-Aspect Customer Reviews
Master of Science (MS), Ohio University, 2014, Industrial and Systems Engineering (Engineering and Technology)
The problem studied in this research relates to text mining. This research provides a solution to the multi-aspect segmentation problem for textual reviews and aims to fill a gap left by existing segmentation models in the literature. To identify aspect-opinion pairs in reviews, a review segmentation approach based on conjunctions and punctuation is proposed. The approach segments each review so that every segment represents a different feature, or aspect, of the reviewed product or service. An existing segmentation model, the multi-aspect segmentation model, is adapted and implemented alongside the proposed model, named the improved multi-aspect segmentation model. The two systems built on these segmentation approaches were evaluated on the task of aspect-opinion extraction from English restaurant reviews. The proposed model achieved an 8.2% improvement over the existing model when used for aspect-opinion identification.
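
A minimal sketch of the conjunction/punctuation segmentation step, assuming a small hand-picked conjunction list; the thesis's actual conjunction inventory and aspect-opinion pairing rules are not reproduced here.

```python
import re

# Illustrative conjunction list; the real model's inventory may differ.
CONJUNCTIONS = r"\b(?:but|and|although|though|however|while)\b"

def segment_review(review):
    """Split a review into candidate aspect segments at punctuation
    and coordinating/contrastive conjunctions."""
    parts = re.split(rf"[,.;!?]|{CONJUNCTIONS}", review.lower())
    return [p.strip() for p in parts if p.strip()]

review = "The pasta was delicious but the service was slow, and the room felt cramped."
for seg in segment_review(review):
    print(seg)
# Each segment should carry one aspect-opinion pair,
# e.g. ("pasta", "delicious"), ("service", "slow"), ("room", "cramped").
```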

Committee:

Gary Weckman (Advisor)

Subjects:

Computer Science; Industrial Engineering; Marketing

Keywords:

Text mining; polarity analysis; feature-opinion extraction; decision making

Xinjian, Qi. Computational Analysis, Visualization and Text Mining of Metabolic Networks
Doctor of Philosophy, Case Western Reserve University, 2013, EECS - Computer and Information Sciences
With recent advances in experimental technologies such as gas chromatography/mass spectrometry, the number of metabolites that can be measured in the biofluids of individuals has markedly increased. Given a set of such measurements, a very common task encountered by biologists is to identify the metabolic mechanisms that lead to changes in the concentrations of given metabolites, and to interpret the metabolic consequences of the observed changes in terms of physiological problems, nutritional deficiencies, or diseases. This thesis presents the steady-state metabolic network dynamics analysis (SMDA) approach in detail. Experimental evaluation of the SMDA tool against a mammalian metabolic network database is also presented. The query output space of the SMDA tool is reduced in two ways: (i) a larger number of observations exponentially reduces the output size, and (ii) exploratory search and browsing of the query output space allow users to locate what they are looking for. SMDA is then applied to gene lethality testing. Compared with other methods used for gene lethality testing, the SMDA algorithm has two advantages: (1) it requires less input, and (2) it makes no optimality assumptions. The algorithm has been tested on the genome-scale reconstructed network of Trypanosoma cruzi, with existing gene lethality testing results taken as ground truth. This thesis also studies the general framework of the visualization tools in the PathCase systems, as well as the distinct features of each tool, namely PathCase-SB, PathCase-MAW editor, PathCase-MAW, PathCase-SMDA, PathCase-RCMN, PathCase-Recon, and PathCase-MQL. Finally, this thesis proposes a number of metabolite/reaction identification techniques for Genome-Scale Reconstructed Metabolic Networks (GSRMNs), which match metabolites/reactions to the corresponding metabolites/reactions of a source model or data source. We employ a variety of computer science techniques, including approximate string matching, similarity score functions, and filtering techniques, all enhanced by underlying knowledge of metabolic biochemistry. The proposed metabolite/reaction identification techniques are evaluated in an empirical study on four pairs of GSRMNs. Our results indicate that they yield significant accuracy gains.
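
The metabolite-matching step can be pictured with plain approximate string matching. This sketch uses Python's difflib as a stand-in similarity score; real GSRMN matching, as the abstract notes, also layers on filtering and biochemistry-based knowledge, and the metabolite names below are invented.

```python
from difflib import SequenceMatcher

def best_match(name, candidates, threshold=0.6):
    """Return the highest-similarity candidate if it clears the threshold."""
    scored = [(SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return (match, round(score, 2)) if score >= threshold else (None, round(score, 2))

# Illustrative name lists standing in for two GSRMNs being reconciled.
source = ["alpha-D-glucose", "L-lactate", "pyruvate"]
target = ["a-D-glucose", "L-lactic acid", "pyruvic acid"]
for m in source:
    print(m, "->", best_match(m, target))
```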

Committee:

Gültekin Özsoyoglu (Advisor); Andy Podgurski (Committee Member); M. Cenk Cavusoglu (Committee Member); Nicola Lai (Committee Member); Z. Meral Özsoyoglu (Committee Member)

Subjects:

Bioinformatics; Computer Science

Keywords:

metabolic networks; computational analysis; visualization; text mining; GSRMN; genome-scale; reconstructed

Gudivada, Ranga Chandra. Discovery and Prioritization of Biological Entities Underlying Complex Disorders by Phenome-Genome Network Integration
PhD, University of Cincinnati, 2007, Engineering : Biomedical Engineering
An important goal of biomedical research is to elucidate the causal and modifier networks of human disease. While integrative functional genomics approaches have shown success in identifying biological modules associated with normal and disease states, a critical bottleneck is representing knowledge capable of encompassing asserted or derivable causality mechanisms. Both single-gene and more complex multifactorial diseases often exhibit several phenotypes, and a variety of approaches suggest that phenotypic similarity between diseases can reflect the shared activities of common biological modules composed of interacting or functionally related genes. Thus, analyzing the overlaps and interrelationships of the clinical manifestations of a series of related diseases may provide a window into the complex biological modules that lead to a disease phenotype. To evaluate our hypothesis, we are developing a systematic and formal approach to extract phenotypic information present in textual form within the Online Mendelian Inheritance in Man (OMIM) and Syndrome DB databases, constructing a disease-clinical phenotypic feature matrix to be used by various clustering procedures for finding similarity between diseases. Our objective is to demonstrate relationships detectable across a range of disease concept types modeled in UMLS, analyzing the detectable clinical overlaps of several Cardiovascular Syndromes (CVS) in OMIM in order to find associations between phenotypic clusters and the functions of the underlying genes and pathways. Most current biomedical knowledge is spread across different databases in different formats, and mining these datasets leads to large and unmanageable results. Semantic Web principles and standards provide an ideal platform to integrate such heterogeneous information, allowing the detection of implicit relations and the formulation of interesting hypotheses. We implemented a page-ranking algorithm over Semantic Web data to prioritize biological entities by their relative contribution and relevance, which can be combined with this clustering approach. In this way, disease-gene, disease-pathway, or disease-process relationships can be prioritized by mining a phenome-genome framework that not only discovers resources but also determines their importance, through queries over higher-order relationships of multi-dimensional data that reflect the feature complexity of diseases.
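
The entity-prioritization step is in the spirit of PageRank over a semantic graph. A hedged sketch follows, with a toy phenome-genome graph whose nodes and edges are invented stand-ins for UMLS/OMIM-derived relations.

```python
import networkx as nx

# Toy phenome-genome graph: diseases -> phenotypes -> genes (illustrative only).
G = nx.DiGraph()
G.add_edges_from([
    ("long QT syndrome", "prolonged QT interval"),
    ("long QT syndrome", "syncope"),
    ("Brugada syndrome", "syncope"),
    ("prolonged QT interval", "KCNQ1"),
    ("syncope", "SCN5A"),
    ("Brugada syndrome", "SCN5A"),
])

# PageRank scores serve as a prioritization of entities by relative relevance.
ranks = nx.pagerank(G, alpha=0.85)
for entity, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{score:.3f}  {entity}")
```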

Committee:

Dr. Bruce Aronow (Advisor)

Keywords:

Semantic Web; RDF; OWL; SPARQL; Ontology; Biomedical Informatics; Bioinformatics; Integrative Bioinformatics; Text Mining; Phenome; Genome; Disease Modularity; Data Integration; Semantic Integration

Ortiz, Agustin. Real Time Presentation
Master of Science, The Ohio State University, 2017, Computer Science and Engineering
This study describes and evaluates a prototype system that seeks to inject relevant components of documents at appropriate points in an ongoing conversation. The purpose of this study is to test whether a document, or a section of a document, is relevant to an audience with respect to the topic being discussed at a given moment in time. The evaluation criteria we study include the relevancy of the selected document with respect to the context of the conversation (future work will include a study of the “naturalness” of the document injection point within the conversation, and of the seamlessness of the user experience inside a presentation environment). A system named Real Time Presentation (RTP) has been prototyped as a web service to retrieve, collect, and index documents. It also provides the mechanism to retrieve and inject documents into a web-based user interface. Essentially, RTP retrieves relevant unstructured documents, examines each document, and determines which sections of the document are relevant for presentation. A second component of the system is a framework for repetitive querying over live conversational data. The novelty of this approach is the combination of repetitive querying with text mining algorithms to determine relevancy for document selection in a presentation setting. A production version of the RTP framework is planned for integration into a platform named School2Biz, which brings schools and businesses together for lecture events.
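
The core relevance loop can be pictured as repeatedly scoring indexed document sections against a rolling conversation window. Below is a sketch under that reading, using TF-IDF cosine similarity; the threshold, windowing policy, and sample sections are assumptions, not RTP's actual parameters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative indexed document sections (stand-ins for RTP's index).
sections = [
    "Quarterly revenue grew 12 percent on strong cloud sales.",
    "The hiring plan adds three engineers to the platform team.",
    "Customer churn fell after the onboarding redesign.",
]
vec = TfidfVectorizer(stop_words="english")
section_matrix = vec.fit_transform(sections)

def best_section(conversation_window, threshold=0.1):
    """Score every section against the latest conversation text and
    surface the best one only if it clears the relevance threshold."""
    q = vec.transform([conversation_window])
    sims = cosine_similarity(q, section_matrix)[0]
    i = sims.argmax()
    return (sections[i], float(sims[i])) if sims[i] >= threshold else (None, float(sims[i]))

# Called repeatedly as the conversation advances.
print(best_section("we were just discussing revenue growth this quarter"))
```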

Committee:

Rajiv Ramnath (Advisor); Neelam Soundarajan (Committee Member)

Subjects:

Computer Science

Keywords:

presentation; inject; documents; querying; text mining; data

Lucas, Nanosh. Soup at the Distinguished Table in Mexico City, 1830-1920
Master of Arts (MA), Bowling Green State University, 2017, Spanish/History (dual)
This thesis uses soup discourse as a vehicle to explore dimensions of class and hierarchies of taste in Mexican cookbooks and newspapers from 1830-1920. It contrasts soups with classic European roots, such as sopa de pan (bread soup), with New World soups, such as sopa de tortilla (tortilla soup) and chilaquiles (toasted tortillas in a soupy sauce made from chiles). I adopt a multi-disciplinary approach, combining quantitative methods in the digital humanities with qualitative techniques in history and literature. To produce this analysis, I draw from Pierre Bourdieu’s work on distinction and social capital, Max Weber’s ideas about modernization and rationalization, and Charles Tilly’s notions of categorical inequality. Results demonstrate that soup plays a part in a complex drama of inclusion and exclusion as people socially construct themselves in print and culinary practice. Elites attempted to define respectable soups by what ingredients they used, and how they prepared, served, and consumed soup. Yet, at the same time, certain soups seemed to defy hierarchical categorization, and that is where this story begins.

Committee:

Amílcar Challú (Committee Co-Chair); Francisco Cabanillas (Committee Co-Chair); Amy Robinson (Committee Member); Timothy Messer-Kruse (Committee Member)

Subjects:

History; Latin American History

Keywords:

gente decente; sopa de pan; chilaquiles; sopa de tortilla; soup; distinction; Mexico; taste; hierarchy; class; middle class; hegemony; cookbooks; bread; broth; caldo; food; cocina; newspapers; Mexican; text mining

Soni, Swapnil. Domain-Specific Document Retrieval Framework for Near Real-time Social Health Data
Master of Science (MS), Wright State University, 2015, Computer Science
With the advent of web search and microblogging, the percentage of Online Health Information Seekers (OHIS) using these services to share and seek health information in real time has increased exponentially. Recently, Twitter has emerged as one of the primary mediums for sharing and seeking the latest information on a variety of topics, including health. Although Twitter is an excellent information source, identifying useful information in the deluge of tweets is a major challenge. Twitter search is limited to keyword-based techniques for retrieving information for a given query, and the results sometimes lack up-to-date (real-time) information. Moreover, Twitter does not utilize semantics to retrieve results. To address these challenges, we developed a system termed Social Health Signals that leverages rich domain knowledge to extract relevant and reliable health information from Twitter in near real time. We use semantics-based techniques to 1) retrieve relevant and reliable health information shared on Twitter in real time, 2) enable question answering, 3) rank results based on relevancy, popularity, and reliability, and 4) enable efficient browsing of the results by semantically grouping them into health categories. In our approach, Twitter documents are searched using several unique features: triple-pattern-based mining, near real-time retrieval, and search over URLs contained in tweets. First, the triple-pattern (subject, predicate, object) mining technique extracts triple patterns from microblog messages related to chronic health conditions; the triple pattern is defined by a user-given natural-language question. Second, to make the system near real-time, the search results are divided into intervals of six hours. Third, in addition to tweets, we use the content of the URLs mentioned in tweets as a data source. Finally, the results are ranked by relevancy and popularity so that, at any particular time, the most relevant information for the question is displayed, instead of basing results solely on temporal relevance. Our evaluation focuses on questions related to diabetes, such as “How to control diabetes?”, comparing our results with Twitter search. For this comparison we selected reliability, relevancy, and real-time behavior as evaluation criteria. We conducted a blind survey to check the relevance of the results, using three questions dealing with diabetes. To evaluate source reliability, we compared the Google domain PageRank of our top 10 results with that of Twitter's top 10 results. For real-time behavior, we compared the timestamps of Twitter search results with those of our system's results.
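
The near-real-time step is concrete enough to sketch: bucket tweets into six-hour intervals and rank the newest bucket by a relevance/popularity blend. The scoring weights and sample tweets below are illustrative assumptions, not the system's actual ranking function.

```python
from datetime import datetime

def interval_key(ts, hours=6):
    """Map a timestamp to the start of its six-hour interval."""
    return ts.replace(hour=(ts.hour // hours) * hours, minute=0, second=0)

# Invented tweets with precomputed relevance scores for illustration.
tweets = [
    {"text": "How to control diabetes with diet", "retweets": 40,
     "time": datetime(2015, 4, 1, 9, 15), "relevance": 0.9},
    {"text": "Diabetes walk this weekend", "retweets": 5,
     "time": datetime(2015, 4, 1, 11, 50), "relevance": 0.3},
]

# Keep only the newest interval, then rank by relevance and popularity.
latest = max(interval_key(t["time"]) for t in tweets)
bucket = [t for t in tweets if interval_key(t["time"]) == latest]
bucket.sort(key=lambda t: 0.7 * t["relevance"] + 0.3 * (t["retweets"] / 100),
            reverse=True)
print([t["text"] for t in bucket])
```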

Committee:

Amit Sheth, Ph.D. (Advisor); Krishnaprasad Thirunarayan, Ph.D. (Committee Member); Tanvi Banerjee, Ph.D. (Committee Member)

Subjects:

Computer Science

Keywords:

Twitter; Data mining; Triple pattern; Real-time; Health; Chronic disease; Social Media analysis; Text mining

Xiong, Hui. Combining Subject Expert Experimental Data with Standard Data in Bayesian Mixture Modeling
Doctor of Philosophy, The Ohio State University, 2011, Industrial and Systems Engineering
Engineers face many quality-related datasets containing free-style text or images. For example, a database could include summaries of complaints filed by customers, descriptions of the causes of rework or maintenance and of the associated actions taken, or a collection of quality inspection images of welded tubes. The goal of this dissertation is to enable engineers to input a database of free-style text or image data and obtain a set of clusters or “topics” with intuitive definitions, along with information about their degree of commonality, that together help prioritize system improvement. The proposed methods generate Pareto charts of ranked clusters or topics, with interpretability improved by input from the analyst or method user. The combination of subject matter expert data with standard data is the novel feature of the methods considered. Prior to the methods proposed here, analysts applied Bayesian mixture models and had limited recourse if the cluster or topic definitions failed to be interpretable or were at odds with the knowledge of subject matter experts. The associated “Subject Matter Expert Refined Topic” (SMERT) model permits ongoing knowledge elicitation and high-level integration of human expert data, addressing two issues: (1) unsupervised topic models often produce results that are uninterpretable or at odds with expert knowledge to the user, and (2) a “Hierarchical Analysis Designed Latency Experiment” (HANDLE) is needed for human experts to interact with the model results. If groupings are missing key elements, so-called “boosting” of these elements is possible. If certain members of a cluster are nonsensical or nonphysical, so-called “zapping” of these elements is possible. We also describe a fast Collapsed Gibbs Sampling (CGS) algorithm for the SMERT method, which offers the capacity to efficiently fit the SMERT model to large datasets, though with approximations in certain cases. We use three case studies to illustrate the proposed methods. The first relates to scrap text reports from a Chinese manufacturer of stone products. The second relates to laser welding of tube joints and images characterizing bead shape. The third relates to consumer text reviews of the Toyota Camry; the reviews cover 10 years and the widely publicized acceleration issue. In all cases, the SMERT models help provide interpretable groupings of records in a way that can facilitate data-driven prioritization of improvement actions.
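
The “boost”/“zap” operations can be pictured as expert pseudo-counts applied to a word-topic count matrix between sampling passes. This is a hedged reconstruction of the idea, not the SMERT model's actual elicitation scheme; the vocabulary, counts, and weights are invented.

```python
import numpy as np

# Toy word-topic count matrix (rows: words, columns: topics), invented data.
vocab = ["scrap", "crack", "polish", "banana"]
word_topic = np.array([[8., 1.], [6., 2.], [1., 9.], [3., 3.]])

def boost(word, topic, weight=10.0):
    """Expert asserts the word belongs to the topic: add pseudo-counts."""
    word_topic[vocab.index(word), topic] += weight

def zap(word, topic):
    """Expert flags a nonsensical member: drive its count toward zero."""
    word_topic[vocab.index(word), topic] = 1e-6

zap("banana", 0)    # "banana" makes no sense in a defect-related topic
boost("crack", 0)   # "crack" strongly characterizes the defect topic

# Renormalized topic-word probabilities after the expert feedback.
print(word_topic / word_topic.sum(axis=0))
```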

Committee:

Theodore Allen, PhD (Advisor); Suvrajeet Sen, PhD (Committee Member); David Woods, PhD (Committee Member)

Subjects:

Computer Science; Engineering; Industrial Engineering; Information Technology

Keywords:

quality engineering; Bayesian mixture model; topic model; unstructured data; freestyle text; collapsed Gibbs sampling; text mining; data mining; human computer interaction; subject matter expert

Cakmak, Ali. Mining Metabolic Networks and Biomedical Literature
Doctor of Philosophy, Case Western Reserve University, 2009, EECS - Computer and Information Sciences

With the advent of high-throughput experimental and genome sequencing technologies, the amounts of produced biological data and the related literature have increased dramatically. A significant portion of the produced biological data has revealed genotypic features of many model organisms. An outstanding problem presently is to map the characterized genotypic features of organisms to their phenotypic properties with the ultimate goal of making high-impact scientific discoveries in areas including diagnosing/curing diseases, engineering genomes, and inventing drugs. To this end, three major challenges concerning the management and analysis of the available data are: (i) high volume (e.g., thousands of genes, millions of publications), (ii) increasing diversity (e.g., genes, pathways, metabolic profiles), and (iii) high complexity (e.g., hierarchical organization of entities, graph structures, text/image data). Hence, efficient and effective biological data analysis and mining tools that can keep up with the increasing biological data production rate are highly desirable.

In this thesis, we study four biological data mining and analysis problems aimed at a better understanding of the underlying biological phenomena. Our contributions address distinct keystones on the path from genotype (e.g., genes and their functionality annotations) to phenotype (e.g., metabolite concentration level changes, physiological conditions). More specifically, at the textual-knowledge level, we investigate automated functionality annotation of individual genomic entities from biomedical articles through text mining. Next, at the annotation (ontology) level, we study how functional annotations of individual genomic entities form templates in the context of their pathways, with applications to pathway mining and categorization. Then, we generalize the problem of discovering frequent pathway functionality templates into a purely computer science problem, namely mining taxonomy-superimposed graph databases, and solve the generalized problem. Finally, at the biological networks level, we study how pathways may work together collaboratively in a biological action scenario to produce the observed phenotypical changes, in terms of metabolite concentration perturbations measured in biofluids. We show that, using only the increases and decreases of a measured set of metabolites with respect to their "normal values", we can perform an effective causality analysis, eliminating a majority of possible biological scenarios as causes for the observations through consistency analysis.
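
The sign-based consistency analysis admits a tiny sketch: each scenario predicts a direction for some metabolites, and any scenario contradicting an observation is eliminated. The scenarios and observations below are invented for illustration.

```python
# Observed directions: +1 = increase, -1 = decrease (invented data).
observations = {"glucose": -1, "lactate": +1}

# Candidate biological scenarios and the directions they predict.
scenarios = {
    "glycolysis up":   {"glucose": -1, "lactate": +1, "pyruvate": +1},
    "glycolysis down": {"glucose": +1, "lactate": -1},
    "unrelated":       {"urea": +1},
}

def consistent(predicted, observed):
    """A scenario survives unless it predicts the opposite direction
    for some observed metabolite; unpredicted metabolites do not count
    against it (consistency narrows the field, it does not confirm)."""
    return all(predicted.get(m, obs) == obs for m, obs in observed.items())

surviving = [s for s, pred in scenarios.items() if consistent(pred, observations)]
print(surviving)  # "glycolysis down" is eliminated as a possible cause
```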

Committee:

Gultekin Ozsoyoglu, PhD (Advisor); Mark Adams, PhD (Committee Member); Mehmet Koyuturk, PhD (Committee Member); Jing Li, PhD (Committee Member); Meral Ozsoyoglu, PhD (Committee Member)

Subjects:

Bioinformatics; Computer Science

Keywords:

pathway inference; text mining; gene ontology; metabolic networks; metabolomics; graph mining; taxonomy-superimposed graph mining; data mining; pathway functionality template

Ghanem, Amer G. Identifying Patterns of Epistemic Organization through Network-Based Analysis of Text Corpora
PhD, University of Cincinnati, 2015, Engineering and Applied Science: Computer Science and Engineering
The growth of online textual content has exploded in recent years, creating truly massive text corpora. As the quantity of text available online increases, professionals from different industries such as marketing and politics are realizing the importance of extracting useful information and insights from this treasure trove of data. It is also clear, however, that doing so requires methods that go beyond those developed for classical data processing or even natural language processing. In particular, there is great need for efficient methods that can make sense of the semantic content of this data and allow new knowledge to be inferred from it. The research in this dissertation describes a new method for identifying latent structures (topics) in texts through the application of community extraction techniques on associative networks of words. Since humans represent knowledge in terms of associations, it is asserted that deriving topics from associative networks represents a more cognitively meaningful approach than using purely statistical patterns. The topic identification method proposed in this thesis is called Topic Extraction through Partitioning of Lexical Associative Networks (TExPLAN). It begins by constructing an associative network of words, where the strength of association between words indicates the frequency of their co-occurrence in documents. Once the word network is constructed, the algorithm proceeds in two stages. In the first stage, the word network is partitioned using a community extraction method to extract disjoint seed topics. The second stage of TExPLAN uses the connectivity of words across the boundaries of seed topics to assign a relevance measure to each word in each topic, thus generating a set of topics in which each topic covers all the words in the vocabulary, as is the case with LDA. The topics extracted by TExPLAN are used to define an epistemic metric space in which epistemic entities such as words, texts, documents, and collections of documents can be embedded and compared. Once the dimensions are defined, the entities are visualized in two-dimensional space using multidimensional scaling. Because of its generality, different types of entities can be analyzed jointly in the epistemic space. For this part of the thesis, we demonstrate the capabilities of the approach by applying it to the DBLP dataset, identifying similar conferences based on their locations in the epistemic space and deriving areas of interest associated with each conference. We are also able to analyze the epistemic diversity of conferences and determine which ones tend to attract more diverse authors and publications. Another part of the analysis focuses on authors and their participation in conferences. We define prominent status and answer questions about authors who have this status. We also look at the different ways an author can become prominent, and tie these to their epistemic diversity. Finally, we look at prominent authors who tend to publish documents that are relatively far from the mainstream of the conferences in which they were published, and identify authors who may potentially become prominent in the future.
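
TExPLAN's first stage can be sketched directly: build a weighted word co-occurrence network and partition it with a community-extraction method. Here greedy modularity maximization stands in for whatever community algorithm the dissertation actually uses, and the documents are toy data.

```python
import itertools
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy corpus: each document is a list of content words.
docs = [
    ["neural", "network", "training", "gradient"],
    ["gradient", "descent", "optimization"],
    ["soup", "recipe", "broth"],
    ["broth", "recipe", "stock"],
]

# Associative network: edge weight = number of documents where words co-occur.
G = nx.Graph()
for doc in docs:
    for a, b in itertools.combinations(set(doc), 2):
        w = G.get_edge_data(a, b, {"weight": 0})["weight"]
        G.add_edge(a, b, weight=w + 1)

# Disjoint seed topics via community extraction.
for topic in greedy_modularity_communities(G, weight="weight"):
    print(sorted(topic))
```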

Committee:

Ali Minai, Ph.D. (Committee Chair); Raj Bhatnagar, Ph.D. (Committee Member); Karen Davis, Ph.D. (Committee Member); Carla Purdy, Ph.D. (Committee Member); James Uber, Ph.D. (Committee Member)

Subjects:

Computer Science

Keywords:

Data Mining; Text Mining; Topic Extraction; Semantic Analysis; Community Extraction; Semantic Spaces

Rogers, Benjamin Charles. Using Genetic Algorithms for Feature Set Selection in Text Mining
Master of Science, Miami University, 2014, Computer Science & Software Engineering
The rationale behind design decisions is often recorded in project documentation. One way to extract this rationale is text mining, i.e., data mining over natural language documents. The performance of a text mining system depends on many factors, including the feature sets used. Exhaustively searching for optimal combinations of feature sets is rarely feasible, often leading researchers to guess which combinations to use. Here, a genetic algorithm is used to find optimal combinations of feature sets for binary rationale, the argumentation subset, the arguments-all subset, decisions, and alternatives. The genetic algorithm uses GATE, WEKA, and a pipeline that automatically passes information from one to the other; this pipeline is also usable in other text mining contexts. The genetic algorithm produced medium-sized feature sets that tended to prefer unigrams and bigrams over 4-grams and 5-grams when compared to random selection.
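
A compact sketch of a GA over binary feature-set masks; the fitness function is a placeholder (in the thesis it would run the GATE/WEKA pipeline and score, e.g., F-measure), and the feature-set names are illustrative.

```python
import random

FEATURE_SETS = ["unigrams", "bigrams", "trigrams", "4-grams", "5-grams", "POS tags"]

def fitness(mask):
    # Placeholder: in practice, train/evaluate a classifier on this mask.
    return sum(mask) * 0.1 + (0.5 if mask[0] else 0.0) - 0.15 * (mask[3] + mask[4])

def evolve(pop_size=20, generations=30, mutation=0.1):
    pop = [[random.randint(0, 1) for _ in FEATURE_SETS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]            # truncation selection
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(FEATURE_SETS))   # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < mutation:
                i = random.randrange(len(child))
                child[i] ^= 1                              # bit-flip mutation
            children.append(child)
        pop = parents + children
    best = max(pop, key=fitness)
    return [f for f, on in zip(FEATURE_SETS, best) if on]

print(evolve())
```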

Committee:

Janet Burge, PhD (Advisor); Dhananjai Rao, PhD (Committee Member); Michael Zmuda, PhD (Committee Member)

Subjects:

Artificial Intelligence; Computer Engineering; Computer Science; Information Science

Keywords:

genetic algorithms; feature set selection; text mining; design rationale; GATE; WEKA; pipeline

Li, Yanjun. High Performance Text Document Clustering
Doctor of Philosophy (PhD), Wright State University, 2007, Computer Science and Engineering PhD

Data mining, also known as knowledge discovery in databases (KDD), is the process of discovering interesting unknown knowledge from large amounts of data. Text mining applies data mining techniques to extract interesting and nontrivial information and knowledge from unstructured text. Text clustering, one of the important techniques of text mining, is the unsupervised classification of similar documents into different groups.

This research focuses on improving the performance of text clustering. We investigated text clustering algorithms in four aspects: document representation, document closeness measurement, high-dimension reduction, and parallelization. We propose a group of high-performance text clustering algorithms that target the unique characteristics of unstructured text databases.

First, two new text clustering algorithms are proposed. Unlike the vector space model, which treats a document as a bag of words, we use a document representation that keeps the sequential relationship between words in the documents. In these two algorithms, the dimension of the database is reduced by considering frequent word (meaning) sequences, and the closeness of two documents is measured by the frequent word (meaning) sequences they share.
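
A sketch of sequence-aware closeness, shrunk to bigrams for brevity (the algorithms above mine frequent word sequences of arbitrary length, which this snippet does not reproduce):

```python
def bigrams(text):
    """Represent a document by its set of word bigrams (order-preserving)."""
    words = text.lower().split()
    return set(zip(words, words[1:]))

def closeness(d1, d2):
    """Jaccard overlap of shared word sequences as a closeness measure."""
    s1, s2 = bigrams(d1), bigrams(d2)
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

# Same bag of words, different sequences -> lower closeness than a
# bag-of-words model would report.
print(closeness("data mining of text documents", "text documents for data mining"))
```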

Second, a text clustering algorithm with feature selection is proposed. This algorithm gradually reduces the high dimension of the database by performing feature selection during clustering. The new feature selection method is based on the well-known chi-square statistic and a new statistic that measures positive and negative term-category dependence.
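
The chi-square selection step maps to a few lines with scikit-learn; note the snippet does not reproduce the thesis's added statistic for signed (positive/negative) term-category dependence, and the documents are toy data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

# Toy documents with two cluster labels (0 = finance, 1 = sports).
docs = ["stock market trading", "market prices fall",
        "soccer match score", "match ends in draw"]
labels = [0, 0, 1, 1]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# chi2 scores term-category dependence; keep the top-ranked terms.
scores, _ = chi2(X, labels)
ranked = sorted(zip(vec.get_feature_names_out(), scores), key=lambda t: -t[1])
print(ranked[:5])
```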

Third, a group of new text clustering algorithms is developed based on the k-means algorithm. Instead of the cosine function, a new function involving global information is proposed to measure the closeness between two documents. This function utilizes the neighbor matrix introduced in [Guha:2000]. The proposed algorithms also adopt a new method for selecting initial centroids and a new heuristic function for selecting the cluster to split.
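
The neighbor-matrix idea from [Guha:2000] can be sketched as shared-neighbor counting: two documents are close if many documents are similar to both. The similarity threshold below is an arbitrary illustration, not the thesis's chosen value.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["apple banana fruit", "banana fruit salad",
        "fruit salad bowl", "car engine oil"]

# Boolean neighbor matrix: doc j is a neighbor of doc i if cosine >= threshold
# (each document counts as its own neighbor here).
sim = cosine_similarity(TfidfVectorizer().fit_transform(docs))
neighbors = (sim >= 0.2).astype(int)

# links[i][j] = number of common neighbors of documents i and j:
# a "global" closeness signal beyond pairwise cosine.
links = neighbors @ neighbors.T
print(links)
```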

Last, a new parallel bisecting k-means algorithm is proposed for message-passing multiprocessor systems. This algorithm, named PBKP, fully exploits the data parallelism of the bisecting k-means algorithm and adopts a prediction step to balance the workloads of multiple processors and achieve high speedup.
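
For reference, a serial sketch of the bisecting k-means loop that PBKP parallelizes; splitting the largest cluster is one simple stand-in for the thesis's heuristic split-selection function.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k):
    """Repeatedly split one cluster with 2-means until k clusters exist."""
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        clusters.sort(key=len)
        idx = clusters.pop()                    # split the largest cluster
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(X[idx])
        clusters += [idx[labels == 0], idx[labels == 1]]
    return clusters

X = np.random.rand(100, 5)                      # stand-in document vectors
print([len(c) for c in bisecting_kmeans(X, 4)])
```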

Comprehensive performance studies were conducted on all the proposed algorithms. To evaluate them, we compared them with existing text clustering algorithms such as k-means, bisecting k-means [Steinbach:2000], and FIHC [Fung:2003]. The experimental results show that our clustering algorithms are scalable and have much better clustering accuracy than the existing algorithms. We tested the parallel PBKP algorithm on a 9-node Linux cluster and analyzed its performance. The experimental results suggest that the speedup of PBKP is linear in the number of processors and data points. Moreover, PBKP scales up better than parallel k-means with respect to the desired number of clusters.

Committee:

Soon Chung (Advisor)

Keywords:

Document Clustering; Text Mining; K-means; Bisecting K-means; Algorithm; Performance Analysis.

Ramakrishnan, Cartic. Extracting, Representing and Mining Semantic Metadata from Text: Facilitating Knowledge Discovery in Biomedicine
Doctor of Philosophy (PhD), Wright State University, 2008, Computer Science and Engineering PhD
The information access paradigm offered by most contemporary text information systems is a search-and-sift paradigm in which users must manually glean and aggregate relevant information from the large number of documents typically returned in response to keyword queries. Expecting users to glean and aggregate information has led to several inadequacies in these systems. Owing to the size of many text databases, search-and-sift is a very tedious process, often requiring repeated keyword searches that refine or generalize query terms. A more serious limitation arises from the lack of automated mechanisms for aggregating content across different documents to discover new knowledge. This dissertation focuses on processing text to assign semantic interpretations to its content (extracting semantic metadata) and on the design of algorithms and heuristics that utilize the extracted semantic metadata to support knowledge discovery operations over text content. The contributions in extracting semantic metadata cover the extraction of compound entities and of complex relationships connecting entities. Extraction results are represented in a standard Semantic Web representation language (RDF) and are manually evaluated for accuracy. The knowledge discovery algorithms presented herein operate on RDF data. To further improve access to text content, applications supporting semantic browsing and semantic search of text are presented.
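
The RDF representation of an extracted complex relationship can be sketched with rdflib. The namespace, terms, and reified relation below are invented for illustration (the example sentence echoes classic literature-based discovery, not this dissertation's data).

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/biomed/")   # hypothetical namespace
g = Graph()
g.bind("ex", EX)

# "Magnesium inhibits the release of norepinephrine" as a reified relation,
# so the relationship itself can carry provenance and be mined later.
rel = EX["rel1"]
g.add((rel, RDF.type, EX.Inhibits))
g.add((rel, EX.agent, EX.Magnesium))
g.add((rel, EX.target, EX.NorepinephrineRelease))
g.add((rel, EX.evidence, Literal("source sentence (illustrative)")))

print(g.serialize(format="turtle"))
```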

Committee:

Amit Sheth, PhD (Advisor); Michael Raymer, PhD (Committee Member); Shaojun Wang, PhD (Committee Member); Guozhu Dong, PhD (Committee Member); Thaddeaus Tarpey, PhD (Committee Member); Vasant Honavar, PhD (Committee Member)

Subjects:

Computer Science

Keywords:

Semantic Web; Text Mining; Knowledge Discovery from Text; Information Extraction

Ogbonna, Antoine I. The Psychology of a Web Search Engine
Master of Computing and Information Systems, Youngstown State University, 2011, Department of Computer Science and Information Systems
Given the complexity of the World Wide Web, including the difficulty of retrieving information, we attempt to construct Yoool, a simple yet efficient hypertextual web search engine prototype that utilizes specialized algorithms, such as WordLink, WordMap, and A-Rank, to minimize hyperlink manipulation while rendering objective, relevant, and satisfying results to the user.

Committee:

Alina Lazar, PhD (Advisor); John R. Sullins, PhD (Committee Member); Stephen P. Klein, PhD (Committee Member)

Subjects:

Computer Science; Mathematics

Keywords:

Search; Yoool; Search engine; Text mining

Xu, Zhe. A Sentiment Analysis Model Integrating Multiple Algorithms and Diverse Features
Master of Science, The Ohio State University, 2010, Computer Science and Engineering
In this thesis, we propose a model for integrating multiple sentiment analysis algorithms, each covering separate features, and show that it can do better than single algorithms that deal with multiple features. The key idea behind this integration model is the selective use of the right algorithm for the right case. We propose two measures to estimate the effectiveness of an algorithm and, using these measures, a two-step process that constructs the model from an understanding of the contextual properties of the algorithms. Our experiments show that our model outperforms existing baselines.
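
The "right algorithm for the right case" integration can be sketched as a router that picks the component algorithm with the highest estimated effectiveness for each input. The component algorithms and effectiveness estimates here are stubs, not the thesis's two proposed measures.

```python
def emoticon_rule(text):
    """Toy sentiment algorithm keyed on emoticons (+1/-1/0)."""
    return 1 if ":)" in text else -1 if ":(" in text else 0

def lexicon_rule(text):
    """Toy lexicon-based sentiment algorithm."""
    pos, neg = {"great", "good"}, {"bad", "awful"}
    words = set(text.lower().split())
    return (len(words & pos) > len(words & neg)) - (len(words & neg) > len(words & pos))

def effectiveness(algorithm, text):
    # Stub estimate: trust the emoticon rule only when an emoticon is present.
    if algorithm is emoticon_rule:
        return 1.0 if (":)" in text or ":(" in text) else 0.0
    return 0.5

def integrated(text):
    """Route the input to the algorithm estimated to be most effective."""
    algo = max((emoticon_rule, lexicon_rule), key=lambda a: effectiveness(a, text))
    return algo(text)

print(integrated("great movie :("))   # emoticon algorithm is selected here
```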

Committee:

Rajiv Ramnath (Advisor); Belkin Mikhail (Committee Member); Fang Hui (Committee Member)

Subjects:

Computer Science

Keywords:

text mining; sentiment analysis; measure; model

Patchala, Jagadeesh. Data Mining Algorithms for Discovering Patterns in Text Collections
PhD, University of Cincinnati, 2016, Engineering and Applied Science: Computer Science and Engineering
Discovering meta-information from collections of text documents is becoming an important research area due to increasing demands for automated analysis of large text collections. This analysis process usually involves first structuring the unstructured text data and then deriving useful patterns from this structured text data. This process involves concepts taken from multiple fields such as data mining, artificial intelligence, statistics, databases, and linguistics. There are several focus areas within this research domain, each pursuing somewhat different objectives. Some of these focus areas include: information retrieval, document clustering, document classification, web mining, information extraction, natural language processing, and concept extraction. In this dissertation we have developed several novel methodologies to discover information embedded in text document collections and have presented solutions to some specific pattern discovery tasks.

Committee:

Raj Bhatnagar, Ph.D. (Committee Chair); Nan Niu, Ph.D. (Committee Member); Yizong Cheng, Ph.D. (Committee Member); Anil Jegga, M.Res. (Committee Member); Ali Minai, Ph.D. (Committee Member); Marepalli Rao, Ph.D. (Committee Member)

Subjects:

Computer Science

Keywords:

Authorship Analysis; Biclustering; 3-clusters; Drug repurposing; Text mining; Data mining

Subramanian, Nandita. Analysis of Rank Distance for Malware Classification
MS, University of Cincinnati, 2016, Engineering and Applied Science: Computer Science
Malicious cyber adversaries may compromise the security of a system by denying access to legitimate users. This is often coupled with immeasurable loss of confidential data, which leads to hefty losses for a corporation, both financially and in trustworthiness. Malware exploits key vulnerabilities in applications, presenting problems such as identity theft and unapproved software installations. Despite the abundance of malware detection and removal techniques in the ever-evolving field of computing, current approaches exhibit low efficiency in detecting malicious software. Currently available techniques enable detection of software embedded with known signatures; these methods are efficient, but most malware writers, aware of signature-based detection, are working to bypass it. Machine learning based systems for malware classification and detection have been tested and proven more efficient than standard signature-based systems. A vital justification for using machine learning techniques is that even unseen malware can be detected, eliminating malware detection failures and providing very high success rates. Our method uses efficient machine learning techniques for classification and detection of portable executable (PE) files of various malware classes commonly found on computers running Windows operating systems. For malicious files, computing the distance between two files should yield an indication of their similarity. Using this as a basis, this thesis analyses the different approaches that can be employed for classifying malicious files using a measure known as rank distance. This distance measure is combined with a feature extraction method based on mutual information, which analyses the opcode n-gram sequences extracted from the PE files and segregates the most relevant opcodes. The most relevant opcodes thus obtained are used as features to identify the class to which a given file belongs. An opcode relevance profile generated from mutual information and the unclassified file are compared and assigned rank distances for every class. Using these ranks, a distance between the two files is obtained. The class with the least distance to the file is concluded to be the class of the file under scrutiny.
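
The rank-distance computation itself is small enough to sketch: rank opcodes by frequency in each file and sum the rank differences (a Spearman-footrule-style measure). The opcode lists and the penalty rank for absent opcodes are illustrative assumptions, not the thesis's exact formulation.

```python
from collections import Counter

def rank_map(opcodes):
    """Rank opcodes by descending frequency (ties broken alphabetically)."""
    freq = Counter(opcodes)
    ordered = sorted(freq, key=lambda op: (-freq[op], op))
    return {op: r for r, op in enumerate(ordered, start=1)}

def rank_distance(a, b):
    """Sum of absolute rank differences over the union of opcodes;
    absent opcodes get a worst-case penalty rank."""
    ra, rb = rank_map(a), rank_map(b)
    universe = set(ra) | set(rb)
    penalty = len(universe) + 1
    return sum(abs(ra.get(op, penalty) - rb.get(op, penalty)) for op in universe)

# Invented opcode sequences standing in for PE-file n-gram extraction output.
file1 = ["mov", "mov", "push", "call", "ret"]
file2 = ["push", "push", "mov", "call", "jmp"]
print(rank_distance(file1, file2))
```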

Committee:

Anca Ralescu, Ph.D. (Committee Chair); Chia Han, Ph.D. (Committee Member); Dan Ralescu, Ph.D. (Committee Member)

Subjects:

Computer Science

Keywords:

Rank Distance; Malware Classification; Mutual Information; Text Mining; Similarity Measures; Windows Malware

Kumar, Vivek. Computational Prediction of Protein-Protein Interactions on the Proteomic Scale Using Bayesian Ensemble of Multiple Feature Databases
Doctor of Philosophy, University of Akron, 2011, Biomedical Engineering

In the post-genomic world, one of the most important and challenging problems is to understand protein-protein interactions (PPIs) on a large scale. They are integral to the underlying mechanisms of most fundamental cellular processes. A number of experimental methods, such as protein affinity chromatography, affinity blotting, and immunoprecipitation, have traditionally helped in detecting PPIs on a small scale. Recently, high-throughput methods have made available an increasing amount of PPI data. However, this data contains a significant amount of erroneous information in the form of false positives and false negatives, and shows little overlap among PPIs pooled from different methods, severely limiting its reliability. Because of such limitations, computational predictions are emerging to narrow down the set of putative PPIs.

In this dissertation, a novel computational PPI predictor was devised to predict PPIs with high accuracy. The PPI predictor integrates a number of proteomic features derived from biological databases. The features chosen for this research were gene expression, gene ontology, MIPS functions, sequence patterns such as motifs and domains, and protein essentiality. While these features have little or no correlation with each other, each bears some relationship to the ability of proteins to interact; therefore, novel feature-specific approaches were devised to characterize that relationship. Text mining and network topology based approaches were also studied. Gold Standard data comprising high-confidence PPIs and non-PPIs was used as evidence of interaction or lack thereof.

The predictive power of the individual features was integrated using Bayesian methods. The average accuracy, based on 10-fold cross-validation, was found to be 0.9396. Since all the features are computed on the proteomic scale, the Bayesian integration yields likelihood values for all possible combinations of proteins in the proteome. This has the added benefit of making it possible to list putative PPIs in decreasing order of confidence, in the form of likelihood values.
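
The Bayesian integration can be sketched naive-Bayes style: multiply per-feature likelihood ratios, learned against the Gold Standard, into the prior odds. All numbers below are invented for illustration, not values from the dissertation.

```python
# Assumed prior odds that a randomly chosen protein pair interacts.
PRIOR_ODDS = 1 / 600

# Per-feature likelihood ratios P(evidence | PPI) / P(evidence | non-PPI),
# which in practice would be estimated from the Gold Standard data.
likelihood_ratios = {
    "coexpressed":       3.2,
    "shared_GO_term":    5.1,
    "shared_MIPS_class": 2.4,
}

def posterior_odds(evidence):
    """Combine independent feature evidence by multiplying likelihood ratios."""
    odds = PRIOR_ODDS
    for feature in evidence:
        odds *= likelihood_ratios[feature]
    return odds

# Ranking pairs by these odds yields the confidence-ordered PPI list.
odds = posterior_odds(["coexpressed", "shared_GO_term"])
print(f"posterior probability = {odds / (1 + odds):.4f}")
```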

Integration of novel PPIs with other relevant biological information using Semantic Web representation was examined to better understand the underlying mechanism of diseases and novel target identification for drug discovery.

Committee:

Dr. Dale H. Mugler (Advisor); Dr. Daniel B. Sheffer (Committee Member); Dr. George C. Giakos (Committee Member); Dr. Amy Milsted (Committee Member); Dr. Daniel L. Ely (Committee Member)

Subjects:

Bioinformatics; Biomedical Engineering; Biostatistics; Computer Science; Molecular Biology

Keywords:

protein-protein interactions; bayesian ensemble; proteome; databases; ontology; text mining; drug discovery; graphs; networks; semantic web; gene expression