Case Study: Query Processing & Deductive Database

Query processing:

The retrieval of information from a database according to a set of retrieval criteria, the database itself remaining unchanged.

In the context of a specific query language, the technique of translating the retrieval criteria specified using the language into more primitive database-access software, including a selection among different methods to choose the most efficient in the particular circumstances.

We now give an overview of how a DDBMS processes and optimizes a query. We first discuss the communication costs of processing a distributed query; we then discuss a special operation, called a semijoin, that is used in optimizing some types of queries in a DDBMS.

1. Data Transfer Costs of Distributed Query Processing:

In a distributed system, several additional factors further complicate query processing. The first is the cost of transferring data over the network. This data includes intermediate files that are transferred to other sites for further processing, as well as the final result files that may have to be transferred to the site where the query result is needed. Although these costs may not be very high if the sites are connected via a high-performance local area network, they become quite significant in other types of networks. Hence, DDBMS query optimization algorithms consider the goal of reducing the amount of data transfer as an optimization criterion in choosing a distributed query execution strategy.

We illustrate this with two simple example queries. Consider the relations EMPLOYEE and DEPARTMENT.

There are three simple strategies for executing this distributed query:

1. Transfer both the EMPLOYEE and the DEPARTMENT relations to the result site, and perform the join at site 3. In this case a total of 1,000,000 + 3500 = 1,003,500 bytes must be transferred.

2. Transfer the EMPLOYEE relation to site 2, execute the join at site 2, and send the result to site 3. The size of the query result is 40 * 10,000 = 400,000 bytes, so 400,000 + 1,000,000 = 1,400,000 bytes must be transferred.

3. Transfer the DEPARTMENT relation to site 1, execute the join at site 1, and send the result to site 3. In this case 400,000 + 3500 = 403,500 bytes must be transferred. If minimizing the amount of data transfer is our optimization criterion, we should choose strategy 3.

2. Distributed Query Processing Using Semijoin:

The idea behind distributed query processing using the semijoin operation is to reduce the number of tuples in a relation before transferring it to another site. Intuitively, the idea is to send the joining column of one relation R to the site where the other relation S is located; this column is then joined with S. Following that, the join attributes, along with the attributes required in the result, are projected out and shipped back to the original site and joined with R. Hence, only the joining column of R is transferred in one direction, and a subset of S with no extraneous tuples or attributes is transferred in the other direction. If only a small fraction of the tuples in S participate in the join, this can be quite an efficient solution to minimizing data transfer.

3. Query and Update Decomposition:

In a DDBMS with no distribution transparency, the user phrases a query directly in terms of specific fragments. For example, consider another query Q: "Retrieve the names and hours per week for each employee who works on some project controlled by department 5," which is specified on the distributed database where the relations at sites 2 and 3 are shown in the figure and those at site 1 are shown in the figure as in the earlier example. A user who submits such a query must specify whether it references the PROJS5 and WORKS_ON5 relations at site 2 or the PROJECT and WORKS_ON relations at site 1. The user must also maintain consistency of replicated data items when updating a DDBMS with no replication transparency.

On the other hand, a DDBMS that supports full distribution, fragmentation, and replication transparency allows the user to specify a query or update request on the schema of the figure just as though the DBMS were centralized. For updates, the DDBMS is responsible for maintaining consistency among replicated items by using one of the distributed concurrency control algorithms. For queries, a query decomposition module must break up or decompose a query into subqueries that can be executed at the individual sites. In addition, a strategy for combining the results of the subqueries to form the query result must be generated. Whenever the DDBMS determines that an item referenced in the query is replicated, it must choose or materialize a particular replica during query execution.

To determine which replicas include the data items referenced in a query, the DDBMS refers to the fragmentation, replication, and distribution information stored in the DDBMS catalog. For vertical fragmentation, the attribute list for each fragment is kept in the catalog. For horizontal fragmentation, a condition, sometimes called a guard, is kept for each fragment. This is basically a selection condition that specifies which tuples exist in the fragment; it is called a guard because only tuples that satisfy this condition are permitted to be stored in the fragment. For mixed fragments, both the attribute list and the guard condition are kept in the catalog.

Deductive Database:

A deductive database system is a database system which can make deductions (ie: conclude additional facts) based on rules and facts stored in the (deductive) database. Datalog is the language typically used to specify facts, rules and queries in deductive databases. Deductive databases have grown out of the desire to combine logic programming with relational databases to construct systems that support a powerful formalism and are still fast and able to deal with very large datasets. Deductive databases are more expressive than relational databases but less expressive than logic programming systems. Deductive databases have not found widespread adoptions outside academia, but some of their concepts are used in today's relational databases to support the advanced features of more recent SQL standards.

Deductive databases and logic programming:

Deductive databases reuse a large number of concepts from logic programming; rules and facts specified in the deductive database language Datalog look very similar to those in Prolog. However, there are a number of important differences between deductive databases and logic programming:

· Order sensitivity and procedurality: In Prolog program execution depends on the order of rules in the program and on the order of parts of rules; these properties are used by programmers to build effective programs. In database languages (like SQL or Datalog), however, program execution is independent of the order or rules and facts.

· Special predicates: In Prolog programmers can directly influence the procedural evaluation of the program with special predicates such as the cut, this has no correspondence in deductive databases.

· Function symbols: Logic Programming languages allow function symbols to build up complex symbols. This is not allowed in deductive databases.

· Tuple oriented processing: Deductive databases use set oriented processing while logic programming languages concentrate on one tuple at a time.

Bridewin

Monday, November 2, 2009