Doctor of Philosophy, The Ohio State University, 2022, Public Health
Introduction: The Ohio Department of Health (ODH) collects and maintains records from disease intervention specialist (DIS) investigations for all syphilis cases reported to the state, including exposed partners who tested negative for syphilis. The records contain both structured fields and unstructured free-text notes. We sought to apply natural language processing (NLP) methods to 2019 Ohio DIS syphilis records to (1) determine whether DIS notes contain novel characteristics, behaviors, or patterns that are not yet reported in the syphilis literature, and (2) explore whether NLP methods could be used to identify key topics in the unstructured notes.
Methods: In Aim 1, we described the records and assessed the feasibility of using these data for NLP analyses. We explored two approaches to numerically represent the unstructured text: (1) TF-IDF (term frequency-inverse document frequency), which weights a word by how often it appears in a record, discounted by how many records contain it, and (2) GloVe pretrained word embeddings, which assign numerical vectors to words to encode their semantic meaning. In Aim 2, we performed agglomerative clustering using the structured data and unstructured text (using TF-IDF weights), with cosine similarity as the distance metric, to explore patterns in the data. In Aim 3, we explored whether machine learning models could identify key topics in the unstructured text. To do this, we identified 21 key topics in the notes fields potentially relevant for syphilis transmission and DIS investigations. We manually coded these records to create “gold standard” labels for each topic (0=topic not present, 1=topic present), then trained machine learning models to identify the topics. Specifically, we explored three statistical models (naive Bayes, support vector machine [SVM], and logistic regression) using TF-IDF, and one neural network model (long short-term memory [LSTM] model) using GloVe.
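To make the TF-IDF representation and the cosine comparison concrete, the following is a minimal sketch in plain Python. It is not the dissertation's implementation: the toy notes, the function names `tfidf_vectors` and `cosine_similarity`, the whitespace tokenizer, and the unsmoothed IDF formula log(N/df) are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy notes for illustration; not taken from the DIS records.
notes = [
    "patient reports new partner met online",
    "partner met online through dating app",
    "patient declined interview",
]

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document.

    Assumed weighting: (term count / document length) * log(N / df),
    an unsmoothed textbook variant of TF-IDF.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # An all-zero vector is treated as similar to nothing.
    return dot / (norm_a * norm_b)

vecs = tfidf_vectors(notes)
sim_01 = cosine_similarity(vecs[0], vecs[1])  # notes share "partner met online"
sim_02 = cosine_similarity(vecs[0], vecs[2])  # notes share only "patient"
```

In this toy example the first two notes share three terms and so score as more similar than the first and third, which share one. A clustering step such as the agglomerative approach in Aim 2 would typically work with one minus this similarity as the distance between records.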
Results: The cluster analysis (n=1,996) yielded 7 …
Committee: Abigail Norris Turner (Advisor); Xia Ning (Committee Member); Abigail Shoben (Committee Member); David Kline (Committee Member); William Miller (Committee Member)
Subjects: Computer Science; Epidemiology; Public Health