Learning Effective and Robust
Semantic Knowledge
For Database Query Optimization


Chun-Nan Hsu

USC/Information Sciences Institute
My research addresses the knowledge acquisition problem of semantic query optimization. A machine learning approach to the problem will be proposed.

What is Semantic Query Optimization?

The semantic query optimization (SQO) (King 81) approach uses semantic rules, such as all Tunisian seaports have railroad access, to reformulate a query into a less expensive but equivalent query. For example, suppose we have a query to find all Tunisian seaports with railroad access and 2,000,000 ft^3 of storage space. From the rule given above, we can reformulate the query so that there is no need to check the railroad access of seaports, which may save some execution time. Many SQO systems have been developed since King's first system. State-of-the-art SQO systems can optimize queries with relational joins, disjunctions, set operations, and aggregates (e.g., max, average, etc.) using relational rules. In particular, the SQO approach may outperform conventional query optimization approaches in optimizing complex queries over heterogeneous multidatabases on the Internet (Hsu & Knoblock 96, PESTO paper).

However, to perform this kind of optimization requires the query processor possess substantial high utility semantic knowledge. The proposed research is to develop a general and efficient approach to learning high utility semantic rules required for SQO.

Desirable Rules for SQO

Since the number of potential rules that can be derived from a database is enormous, an immediate issue is how to constrain learning. High utility semantic rules should match query patterns and yield high cost reduction. Moreover, semantic rules must be robust against real-world databases subject to changes and noise. Otherwise, a learned semantic rule may become invalid and no longer useful after database changes. Therefore, high utility semantic rules must be effective in cost reduction, and robust against database changes.

The notion of robustness is new to machine learning and discovery. It measures the likelihood that a rule remains consistent with a database after data changes including insertions, deletions, and updates. (Hsu & Knoblock 95) defines and discusses this new measure.

Proposed Solution

We have developed a general learning approach and implemented it as a system called BASIL. This learning approach addresses both effectiveness and robustness issues with two components: an inductive learning component, and an operationalization component. The system is triggered to learn a new set of rules when the query processor encounters an expensive query that can not be satisfactorily optimized with existing rules. Given such an expensive query, the inductive learning component inductively forms from data an alternative query, which is the desired reformulation of the given query. Currently, this inductive learning component can form a conjunctive alternative query with relational joins from relational data. It uses knowledge about query execution cost to bias the induction so that the learned alternative query is much cheaper to be evaluated.

From this alternative query, the operationalization component derives operational semantic rules which enable the system to optimize expensive queries similar to the given query. However, they may not be robust and could be overly-specific. We uses a rule pruner to prune those rules into highly robust and widely applicable rules. To guide the pruning for robust rules, a Bayesian estimation approach for the robustness of relational rules against relational databases is developed. This estimation approach can bring to bear different forms of database information (schema, transaction log, etc.) and tolerate data noise. Consequently, our approach will be able to learn effective and robust rules for SQO systems.

Expected Contributions

This research will provide a general solution to the problem of learning for SQO. This research also identifies many important issues of autonomous speedup learning problems in dynamic real-world environments. These issues include learning for both effectiveness and robustness, their interactions, and probabilistic estimation of robustness.

References

Jonathan~J. King. Query Optimization by Semantic Reasoning. PhD thesis, Stanford University, Department of Computer Science, 1981.
chunnan@isi.edu