Files
Prashanth_MSThesis_May7.pdf (1.16 MB)
An Analysis of GPT API for Wrangling Web Scraping Data
Author Info
Umapathy, Prashanth
ORCID® Identifier
http://orcid.org/0000-0002-8246-7522
Permalink:
http://rave.ohiolink.edu/etdc/view?acc_num=osu171353619431029
Abstract Details
Year and Degree
2024, Master of Science, Ohio State University, Computer Science and Engineering.
Abstract
In this thesis, I investigate three methods for extracting product data, such as brand, flavor, strain, units, and THC and CBD levels, from online cannabis product stores, aiming to find the most effective approach. I start with Python's regular expressions (regex), a method that is precise but requires many specific rules. This technique pulls product details from websites using patterns, but it becomes complicated because each distinct data format requires its own rule. I then introduce the GPT LLM API, an AI natural language processing tool that reads product descriptions in raw product website data and extracts the information automatically. The goal is to see whether this AI can do the job as well as or better than manual collection or the rule-based regex approach, potentially streamlining the process by reducing the need for so many specific rules. I also describe a manual method, in which people collect the data by hand. It serves as the benchmark for accuracy and completeness against which the other methods are measured. A significant part of the thesis explains how I clean and organize the data from these methods, which is crucial for making it usable and reliable. I detail the strengths and limitations of the GPT API in this context, clarifying what it can handle and where it might need help. I also document all the procedures and rules used in the study, both for transparency and so that others can replicate or build on this work. Finally, I present two datasets, one extracted and corrected by humans and the other produced by the GPT extraction method, and I report the levels of accuracy obtained with each approach.
Through this thesis, I shed light on the future of data extraction in specialized fields, pointing to a shift toward more intelligent and adaptable methods that can handle the complexities of diverse and unstructured data sources.
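As an illustration of the rule-based approach and the manual benchmark described in the abstract, the following is a minimal Python sketch and not the thesis's actual code. The patterns, the sample listing text, the field names, and the helper names `extract_fields` and `field_accuracy` are all assumptions made for this example; the abstract does not give the real rules or the scoring procedure.

```python
import re

# Hypothetical rules: one pattern per field format, as the abstract describes.
# None of these patterns come from the thesis itself.
PATTERNS = {
    "thc": re.compile(r"\bTHC\b\s*[:\-]?\s*(\d+(?:\.\d+)?)\s*%", re.IGNORECASE),
    "cbd": re.compile(r"\bCBD\b\s*[:\-]?\s*(\d+(?:\.\d+)?)\s*%", re.IGNORECASE),
    "units": re.compile(r"(\d+(?:\.\d+)?)\s*(mg|g|oz|ml)\b", re.IGNORECASE),
}


def extract_fields(text):
    """Return the fields that the hypothetical rules can find in `text`."""
    record = {}
    for name in ("thc", "cbd"):
        match = PATTERNS[name].search(text)
        if match:
            record[name] = float(match.group(1))  # percentage as a number
    match = PATTERNS["units"].search(text)
    if match:
        # Normalize to "<amount> <unit>" with a lowercase unit.
        record["units"] = f"{match.group(1)} {match.group(2).lower()}"
    return record


def field_accuracy(extracted, benchmark):
    """Fraction of benchmark fields whose extracted value matches exactly."""
    if not benchmark:
        return 0.0
    correct = sum(
        1 for key, value in benchmark.items() if extracted.get(key) == value
    )
    return correct / len(benchmark)


# Made-up listing text and a made-up hand-collected record for the same product.
SAMPLE = "Blue Dream Flower 3.5g | THC: 22.5% | CBD 0.1%"
BENCHMARK = {"strain": "Blue Dream", "units": "3.5 g", "thc": 22.5, "cbd": 0.1}

record = extract_fields(SAMPLE)
print(record)
print(field_accuracy(record, BENCHMARK))
```

In this sketch the strain has no rule, so it counts as a miss against the hand-collected record and only three of the four benchmark fields match. That is the trade-off the abstract attributes to regex: each additional field or format needs its own pattern.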
Committee
Jian Chen (Advisor)
Ce Shang (Committee Member)
Pages
53 p.
Subject Headings
Computer Science
Keywords
Data Scraping; Data Extraction; Large Language Models; Feature Extraction
Recommended Citations
Citations
APA Style (7th edition)
Umapathy, P. (2024). An Analysis of GPT API for Wrangling Web Scraping Data [Master's thesis, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu171353619431029
MLA Style (8th edition)
Umapathy, Prashanth. An Analysis of GPT API for Wrangling Web Scraping Data. 2024. Ohio State University, Master's thesis. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu171353619431029.
Chicago Manual of Style (17th edition)
Umapathy, Prashanth. "An Analysis of GPT API for Wrangling Web Scraping Data." Master's thesis, Ohio State University, 2024. http://rave.ohiolink.edu/etdc/view?acc_num=osu171353619431029
Document number: osu171353619431029
Download Count: 186
Copyright Info
© 2024, some rights reserved.
An Analysis of GPT API for Wrangling Web Scraping Data by Prashanth Umapathy is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. Based on a work at etd.ohiolink.edu.
This open access ETD is published by The Ohio State University and OhioLINK.