Learning Effective and Robust
Semantic Knowledge
For Database Query Optimization
Chun-Nan Hsu
USC/Information Sciences Institute
My research addresses the knowledge acquisition problem of
semantic query optimization. A machine learning approach to the problem
will be proposed.
What is Semantic Query Optimization?
The semantic query optimization (SQO) (King 81)
approach uses semantic rules, such as
all Tunisian seaports have railroad access,
to reformulate a query into a less expensive but equivalent
query. For example, suppose
we have a query to find all Tunisian seaports with railroad access
and 2,000,000 ft^3 of storage space. From the rule given above, we
can reformulate the query so that there is no need to check the
railroad access of seaports, which may save some execution time.
Many SQO systems have been developed
since King's first system.
State-of-the-art SQO systems can optimize queries with relational joins,
disjunctions, set operations, and aggregates
(e.g., max, average, etc.) using relational rules.
In particular, the SQO approach may outperform conventional query
optimization approaches in optimizing complex
queries over heterogeneous multidatabases on
the Internet (Hsu & Knoblock 96, PESTO paper).
However, to perform this kind of optimization requires the query
processor possess substantial high utility semantic knowledge.
The proposed research is to develop a general and efficient approach to
learning high utility semantic rules required for SQO.
Desirable Rules for SQO
Since the number of potential rules that can be derived from a database
is enormous, an immediate issue is how to constrain learning.
High utility semantic rules should match query patterns
and yield high cost reduction.
Moreover, semantic rules must be robust against
real-world databases subject to changes and noise.
Otherwise, a learned semantic rule may become invalid and
no longer useful after database changes.
Therefore, high utility semantic rules must be effective in cost
reduction, and robust against database changes.
The notion of robustness is new to machine learning and discovery.
It measures the likelihood that a rule remains consistent with a database
after data changes including insertions, deletions, and updates.
(Hsu & Knoblock 95)
defines and discusses this new measure.
Proposed Solution
We have developed a general learning approach and
implemented it as a system called BASIL.
This learning approach addresses both
effectiveness and robustness issues with two components:
an inductive learning component, and an operationalization component.
The system is triggered to learn a new set of rules
when the query processor encounters an expensive query
that can not be satisfactorily optimized with existing rules.
Given such an expensive query,
the inductive learning component
inductively forms from data an alternative query,
which is the desired reformulation of the given query.
Currently, this inductive learning component can
form a conjunctive alternative query
with relational joins from relational data.
It uses knowledge about query execution cost to
bias the induction so that the learned alternative query
is much cheaper to be evaluated.
From this alternative query,
the operationalization component derives operational semantic rules
which enable the system to optimize
expensive queries similar to the given query.
However, they may not be robust and could be overly-specific.
We uses a rule pruner to prune those rules into
highly robust and widely applicable rules.
To guide the pruning for robust rules,
a Bayesian estimation approach for
the robustness of relational rules against relational databases is
developed. This estimation approach can bring to bear different forms of
database information (schema, transaction log, etc.)
and tolerate data noise. Consequently, our approach will be able to learn
effective and robust rules for SQO systems.
Expected Contributions
This research will provide
a general solution to the problem of learning for SQO.
This research also identifies many important issues of
autonomous speedup learning problems in dynamic real-world environments.
These issues include learning for both effectiveness and robustness,
their interactions, and probabilistic estimation of robustness.
References
Jonathan~J. King.
Query Optimization by Semantic Reasoning.
PhD thesis, Stanford University, Department of Computer Science,
1981.
chunnan@isi.edu