An Analysis of GPT API for Wrangling Web Scraping Data

Umapathy, Prashanth

2024, Master of Science, Ohio State University, Computer Science and Engineering.
In this thesis, I investigate three methods for extracting product data, such as brand, flavor, strain, units, and THC and CBD levels, from online cannabis product stores, aiming to find the most effective approach. The first method uses Python's regular expression (regex) capabilities, which are precise but require many specific rules: product details are pulled from websites by pattern matching, and each distinct data format needs its own rule, so the rule set quickly becomes complicated. The second method uses the GPT LLM API, a natural language processing tool that reads product descriptions in raw website data and extracts the information automatically. The goal is to see whether it can do the job as well as or better than manual collection or the rule-based regex approach, potentially streamlining the process by reducing the number of specific rules needed. The third method is manual: people collect the data by hand. This serves as a benchmark for the accuracy and completeness of the other methods. A significant part of the thesis explains how I clean and organize the data from these methods, which is crucial for making it usable and reliable. I detail the strengths and limitations of the GPT API in this context, clarifying what it can handle and where it needs help. I also document all the procedures and rules used in the study, for transparency and so that others can replicate or build on this work. Finally, I present two datasets, one extracted and corrected by humans and the other produced by the GPT extraction method, and report the accuracy levels obtained with these approaches.
Through this thesis, I shed light on the future of data extraction in specialized fields and argue for a shift towards more intelligent and adaptable methods that can handle the complexities of diverse and unstructured data sources.
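
The rule-based regex approach and the comparison against a hand-collected benchmark described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's actual rule set: the patterns, the function names `extract_levels` and `field_accuracy`, and the sample product string are all hypothetical, and real store pages would likely need many more format-specific rules.

```python
import re

# Hypothetical patterns for illustration only; each new data format seen on a
# real store page would typically need its own additional rule.
THC_RE = re.compile(r"\bTHC\s*[:\-]?\s*(\d+(?:\.\d+)?)\s*(%|mg)", re.IGNORECASE)
CBD_RE = re.compile(r"\bCBD\s*[:\-]?\s*(\d+(?:\.\d+)?)\s*(%|mg)", re.IGNORECASE)


def extract_levels(text):
    """Return (value, unit) for THC and CBD, or None for a field that is not found."""
    result = {}
    for name, pattern in (("thc", THC_RE), ("cbd", CBD_RE)):
        match = pattern.search(text)
        if match:
            result[name] = (float(match.group(1)), match.group(2).lower())
        else:
            result[name] = None
    return result


def field_accuracy(extracted, reference):
    """Fraction of reference fields whose extracted value matches exactly."""
    if not reference:
        return 0.0
    hits = sum(1 for key, value in reference.items() if extracted.get(key) == value)
    return hits / len(reference)


# Made-up product text and a made-up hand-labeled record used as the benchmark.
sample = "Blue Dream Gummies 10 pack | THC: 18.5% | CBD 0.2%"
manual = {"thc": (18.5, "%"), "cbd": (0.2, "%")}
levels = extract_levels(sample)
score = field_accuracy(levels, manual)
```

The same `field_accuracy` comparison could in principle be applied to records produced by any extraction method, with the hand-collected record serving as the reference.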
Jian Chen (Advisor)
Ce Shang (Committee Member)
53 p.

Recommended Citations

  • Umapathy, P. (2024). An Analysis of GPT API for Wrangling Web Scraping Data [Master's thesis, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu171353619431029

    APA Style (7th edition)

  • Umapathy, Prashanth. An Analysis of GPT API for Wrangling Web Scraping Data. 2024. Ohio State University, Master's thesis. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu171353619431029.

    MLA Style (8th edition)

  • Umapathy, Prashanth. "An Analysis of GPT API for Wrangling Web Scraping Data." Master's thesis, Ohio State University, 2024. http://rave.ohiolink.edu/etdc/view?acc_num=osu171353619431029

    Chicago Manual of Style (17th edition)