SEEDEEP: A System for Exploring and Querying Deep Web Data Sources

Wang, Fan

osu1279758181.pdf (3.81 MB)

Author Info

Wang, Fan

Permalink:

http://rave.ohiolink.edu/etdc/view?acc_num=osu1279758181

Year and Degree

2010, Doctor of Philosophy, Ohio State University, Computer Science and Engineering.

Abstract

A popular trend in data dissemination involves online data sources that are hidden behind query forms, forming what is referred to as the deep web. Deep web data is stored in hidden databases and can only be accessed after a user submits a query by filling in an online form. Currently, hundreds of large, complex, and in many cases related and/or overlapping deep web data sources are available, and the number of such data sources is still increasing rapidly every year.

The emergence of the deep web poses many new challenges in data integration and query answering. First, the metadata of the deep web and the data records stored in deep web databases are hidden from the data integration system. Second, multiple deep web data sources may contain redundant data. Furthermore, similar data sources may provide data of differing quality, or even conflicting data. Therefore, data source selection is of great importance for a data integration system. Third, deep web data sources in a domain often have inter-dependencies, i.e., the output from one data source may be the input of another data source. Thus, answering a query over a set of deep web data sources often involves accessing a sequence of inter-dependent data sources in an intelligent order. Fourth, the common way of accessing data in deep web data sources is through standardized input interfaces. On one hand, these interfaces provide a very simple query mechanism. On the other hand, they significantly constrain the types of queries that can be executed automatically. Finally, all deep web data sources are network based, and both the data source servers and the network links are vulnerable to congestion and failures. Therefore, handling fault tolerance is also necessary for a data integration system.
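The inter-dependency challenge described above can be illustrated with a minimal sketch. It is not the planner from the dissertation and it ignores cost entirely: it only finds a feasible access order in which every source is queried after its required inputs have become available. The source names, attribute names, and the `plan_order` function are hypothetical and chosen purely for illustration.

```python
def plan_order(sources, known):
    """Return (order, unreachable) for a set of inter-dependent sources.

    sources: dict mapping a source name to a pair
             (set of required input attributes, set of output attributes).
    known:   attributes already bound by the user's query.
    """
    available = set(known)
    remaining = dict(sources)
    order = []
    progress = True
    while remaining and progress:
        progress = False
        # Visit names in sorted order so the result is deterministic.
        for name in sorted(remaining):
            inputs, outputs = remaining[name]
            if inputs <= available:
                # All required inputs are bound, so this source can be queried.
                order.append(name)
                available |= outputs
                del remaining[name]
                progress = True
                break
    return order, sorted(remaining)


# Hypothetical sources: each one needs the output of the previous one.
example_sources = {
    "gene_lookup": ({"gene_name"}, {"gene_id"}),
    "snp_db": ({"gene_id"}, {"snp_id"}),
    "freq_db": ({"snp_id"}, {"allele_freq"}),
    "protein_db": ({"protein_id"}, {"structure"}),
}
order, unreachable = plan_order(example_sources, {"gene_name"})
```

Here `protein_db` is reported as unreachable because nothing in the example produces a `protein_id`. A real planner would also have to choose among alternative sources, which is where a cost model would come in.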

In our work, we propose SEEDEEP, an automatic system for exploring and querying deep web data sources. The SEEDEEP system is able to integrate deep web data sources in a particular domain and provide search functionality for domain users over structured SQL queries, online aggregation queries, and low selectivity queries. Currently, the SEEDEEP system is composed of five modules: schema mining, query planning, approximate query answering, query optimization, and fault tolerance. The schema mining module automatically mines the metadata of deep web data sources. The query planning module takes a structured query as input and, based on a cost model, generates a query plan over the set of integrated deep web data sources to answer the query. Currently, the query planning module is able to handle Selection-Projection-Join queries, aggregation queries, and nested queries. The approximate query answering module finds approximate answers for online aggregation and low selectivity queries using sampling in an effective and efficient manner. The query optimization module explores the similarity between queries and accelerates the execution of a query by reusing previous query plans and cached query data. Finally, the fault tolerance module deals with data source unavailability and inaccessibility issues.
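As a rough illustration of answering an aggregation query approximately by sampling, the sketch below estimates an average from a simple random sample and reports an approximate 95% margin of error. It is an assumption-laden toy, not the method of the approximate query answering module: it presumes the records are already available as an in-memory list that can be sampled uniformly, which a hidden database behind a query form does not directly allow. The function name `approximate_mean` is hypothetical.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, seed=0):
    """Estimate the mean of `values` from a simple random sample.

    Returns (estimate, half_width), where half_width is an approximate
    95% margin of error based on the normal approximation.
    Requires 2 <= sample_size <= len(values).
    """
    rng = random.Random(seed)  # Seeded so repeated runs give the same sample.
    sample = rng.sample(values, sample_size)  # Sampling without replacement.
    estimate = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, 1.96 * std_error


# Usage: estimate the average of 1,000 values from a sample of 100.
estimate, half_width = approximate_mean(list(range(1000)), 100)
```

The margin of error shrinks roughly with the square root of the sample size, which is the usual trade-off between answer quality and the number of records that must be retrieved.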

Committee

Gagan Agrawal, PhD (Advisor)
Feng Qin, PhD (Committee Member)
P Sadayappan, PhD (Committee Member)

Pages

273 p.

Subject Headings

Computer Science

Keywords

Deep Web; Data Integration; Query Planning; Query Optimization; Data Management; Web Data

APA Style (7th edition)
Wang, F. (2010). SEEDEEP: A System for Exploring and Querying Deep Web Data Sources [Doctoral dissertation, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1279758181
MLA Style (8th edition)
Wang, Fan. SEEDEEP: A System for Exploring and Querying Deep Web Data Sources. 2010. Ohio State University, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1279758181.
Chicago Manual of Style (17th edition)
Wang, Fan. "SEEDEEP: A System for Exploring and Querying Deep Web Data Sources." Doctoral dissertation, Ohio State University, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=osu1279758181

Document number:

osu1279758181

Download Count:

1,255
