An Analysis of GPT API for Wrangling Web Scraping Data

Umapathy, Prashanth

Prashanth_MSThesis_May7.pdf (1.16 MB)

Author Info

Umapathy, Prashanth

ORCID® Identifier

http://orcid.org/0000-0002-8246-7522

Permalink:

http://rave.ohiolink.edu/etdc/view?acc_num=osu171353619431029

Year and Degree

2024, Master of Science, Ohio State University, Computer Science and Engineering.

Abstract

In my thesis, I investigate three methods for extracting product data, such as brand, flavor, strain, units, and THC and CBD levels, from online cannabis product stores, aiming to find the most effective approach. The process starts with Python's regex capabilities, a method that is precise but requires many specific rules. This technique pulls product details from websites using patterns, but it becomes complicated because each distinct data format requires its own rule. After discussing regex, I introduce the GPT LLM API, an AI natural language processing tool that reads product descriptions in raw product website data and extracts the information automatically. The goal is to see whether this AI can do the job as well as or better than the manual method or the rule-based regex approach, potentially streamlining the process by reducing the need for so many specific rules. I then describe the manual method, in which people collect the data by hand. This serves as a benchmark for the accuracy and completeness of the other methods. A significant part of my thesis explains how I clean and organize the data from these methods, which is crucial for making it usable and reliable. I detail the strengths and limitations of the GPT API in this context, clarifying what it can handle and where it might need help. Furthermore, I thoroughly document all the procedures and rules used in the study, which supports transparency and allows others to replicate or build on this work. In the end, I present two datasets, one extracted and corrected by humans and the other produced by the GPT extraction method. As results, I report the levels of accuracy obtained through these approaches.
Through this thesis, I shed light on the future of data extraction in specialized fields, pointing toward a shift to more intelligent and adaptable methods that can handle the complexities of diverse and unstructured data sources.
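
As a rough illustration of the rule-based approach the abstract describes, the sketch below extracts THC and CBD levels with Python's `re` module and scores the output against a hand-collected benchmark. It is not taken from the thesis: the patterns, the function names `extract_levels` and `field_accuracy`, and the sample description are all assumptions made for this example.

```python
import re

# Hypothetical patterns (not from the thesis): one rule per field and format.
THC_PATTERN = re.compile(r"\bTHC\s*[:\-]?\s*(\d+(?:\.\d+)?)\s*(%|mg)", re.IGNORECASE)
CBD_PATTERN = re.compile(r"\bCBD\s*[:\-]?\s*(\d+(?:\.\d+)?)\s*(%|mg)", re.IGNORECASE)


def extract_levels(description: str) -> dict:
    """Return (value, unit) for THC and CBD, or None when no rule matches."""
    result = {}
    for field, pattern in (("thc", THC_PATTERN), ("cbd", CBD_PATTERN)):
        match = pattern.search(description)
        result[field] = (float(match.group(1)), match.group(2)) if match else None
    return result


def field_accuracy(extracted: list, benchmark: list, field: str) -> float:
    """Fraction of records whose extracted field equals the manual benchmark."""
    if not benchmark:
        return 0.0
    hits = sum(1 for e, b in zip(extracted, benchmark) if e.get(field) == b.get(field))
    return hits / len(benchmark)


if __name__ == "__main__":
    print(extract_levels("Blue Dream 3.5g THC: 22.5% CBD 0.1%"))
    # The value precedes the label here, so these rules miss it.
    print(extract_levels("Gummies 10mg THC"))
```

The second call returns no THC value because the number comes before the label, which is the kind of format variation that forces an additional rule in a regex-only pipeline.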

Committee

Jian Chen (Advisor)
Ce Shang (Committee Member)

Pages

53 p.

Subject Headings

Computer Science

Keywords

Data Scraping; Data Extraction; Large Language Models; Feature Extraction

Recommended Citations

APA Style (7th edition)
Umapathy, P. (2024). An Analysis of GPT API for Wrangling Web Scraping Data [Master's thesis, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu171353619431029

MLA Style (8th edition)
Umapathy, Prashanth. An Analysis of GPT API for Wrangling Web Scraping Data. 2024. Ohio State University, Master's thesis. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu171353619431029.

Chicago Manual of Style (17th edition)
Umapathy, Prashanth. "An Analysis of GPT API for Wrangling Web Scraping Data." Master's thesis, Ohio State University, 2024. http://rave.ohiolink.edu/etdc/view?acc_num=osu171353619431029

Document number:

osu171353619431029

Download Count:

186

Copyright Info

An Analysis of GPT API for Wrangling Web Scraping Data by Prashanth Umapathy is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. Based on a work at etd.ohiolink.edu.
This open access ETD is published by The Ohio State University and OhioLINK.
