Case Study: Query Processing & Deductive Database

Query processing:

The retrieval of information from a database according to a set of retrieval criteria, the database itself remaining unchanged.

In the context of a specific query language, the technique of translating the retrieval criteria specified using the language into more primitive database-access software, including a selection among different methods to choose the most efficient in the particular circumstances.

We now give an overview of how a DDBMS processes and optimizes a query. We first discuss the communication costs of processing a distributed query; we then discuss a special operation, called a semijoin, that is used in optimizing some types of queries in a DDBMS.

1. Data Transfer Costs of Distributed Query Processing:

In a distributed system, several additional factors further complicate query processing. The first is the cost of transferring data over the network. This data includes intermediate files that are transferred to other sites for further processing, as well as the final result files that may have to be transferred to the site where the query result is needed. Although these costs may not be very high if the sites are connected via a high-performance local area network, they become quite significant in other types of networks. Hence, DDBMS query optimization algorithms consider the goal of reducing the amount of data transfer as an optimization criterion in choosing a distributed query execution strategy.

We illustrate this with two simple example queries. Consider the relations EMPLOYEE and DEPARTMENT.

There are three simple strategies for executing this distributed query:

1. Transfer both the EMPLOYEE and the DEPARTMENT relations to the result site, and perform the join at site 3. In this case a total of 1,000,000 + 3500 = 1,003,500 bytes must be transferred.

2. Transfer the EMPLOYEE relation to site 2, execute the join at site 2, and send the result to site 3. The size of the query result is 40 * 10,000 = 400,000 bytes, so 400,000 + 1,000,000 = 1,400,000 bytes must be transferred.

3. Transfer the DEPARTMENT relation to site 1, execute the join at site 1, and send the result to site 3. In this case 400,000 + 3500 = 403,500 bytes must be transferred. If minimizing the amount of data transfer is our optimization criterion, we should choose strategy 3.

2. Distributed Query Processing Using Semijoin:

The idea behind distributed query processing using the semijoin operation is to reduce the number of tuples in a relation before transferring it to another site. Intuitively, the idea is to send the joining column of one relation R to the site where the other relation S is located; this column is then joined with S. Following that, the join attributes, along with the attributes required in the result, are projected out and shipped back to the original site and joined with R. Hence, only the joining column of R is transferred in one direction, and a subset of S with no extraneous tuples or attributes is transferred in the other direction. If only a small fraction of the tuples in S participate in the join, this can be quite an efficient solution to minimizing data transfer.

3. Query and Update Decomposition:

In a DDBMS with no distribution transparency, the user phrases a query directly in terms of specific fragments. For example, consider another query Q: "Retrieve the names and hours per week for each employee who works on some project controlled by department 5," which is specified on the distributed database where the relations at sites 2 and 3 are shown in the figure and those at site 1 are shown in the figure as in the earlier example. A user who submits such a query must specify whether it references the PROJS5 and WORKS_ON5 relations at site 2 or the PROJECT and WORKS_ON relations at site 1. The user must also maintain consistency of replicated data items when updating a DDBMS with no replication transparency.

On the other hand, a DDBMS that supports full distribution, fragmentation, and replication transparency allows the user to specify a query or update request on the schema of the figure just as though the DBMS were centralized. For updates, the DDBMS is responsible for maintaining consistency among replicated items by using one of the distributed concurrency control algorithms. For queries, a query decomposition module must break up or decompose a query into subqueries that can be executed at the individual sites. In addition, a strategy for combining the results of the subqueries to form the query result must be generated. Whenever the DDBMS determines that an item referenced in the query is replicated, it must choose or materialize a particular replica during query execution.

To determine which replicas include the data items referenced in a query, the DDBMS refers to the fragmentation, replication, and distribution information stored in the DDBMS catalog. For vertical fragmentation, the attribute list for each fragment is kept in the catalog. For horizontal fragmentation, a condition, sometimes called a guard, is kept for each fragment. This is basically a selection condition that specifies which tuples exist in the fragment; it is called a guard because only tuples that satisfy this condition are permitted to be stored in the fragment. For mixed fragments, both the attribute list and the guard condition are kept in the catalog.

Deductive Database:

A deductive database system is a database system which can make deductions (ie: conclude additional facts) based on rules and facts stored in the (deductive) database. Datalog is the language typically used to specify facts, rules and queries in deductive databases. Deductive databases have grown out of the desire to combine logic programming with relational databases to construct systems that support a powerful formalism and are still fast and able to deal with very large datasets. Deductive databases are more expressive than relational databases but less expressive than logic programming systems. Deductive databases have not found widespread adoptions outside academia, but some of their concepts are used in today's relational databases to support the advanced features of more recent SQL standards.

Deductive databases and logic programming:

Deductive databases reuse a large number of concepts from logic programming; rules and facts specified in the deductive database language Datalog look very similar to those in Prolog. However, there are a number of important differences between deductive databases and logic programming:

· Order sensitivity and procedurality: In Prolog program execution depends on the order of rules in the program and on the order of parts of rules; these properties are used by programmers to build effective programs. In database languages (like SQL or Datalog), however, program execution is independent of the order or rules and facts.

· Special predicates: In Prolog programmers can directly influence the procedural evaluation of the program with special predicates such as the cut, this has no correspondence in deductive databases.

· Function symbols: Logic Programming languages allow function symbols to build up complex symbols. This is not allowed in deductive databases.

· Tuple oriented processing: Deductive databases use set oriented processing while logic programming languages concentrate on one tuple at a time.

OODBMS System

An object database (also object-oriented database) is a database model in which information is represented in the form of objects as used in object-oriented programming.

Object databases are a niche field within the broader DBMS market dominated by relational database management systems (RDBMS). Object databases have been considered since the early 1980s and 1990s but they have made little impact on mainstream commercial data processing, though there is some usage in specialized areas.

When database capabilities are combined with object-oriented (OO) programming language capabilities, the result is an object database management system (ODBMS).

Today’s trend in programming languages is to utilize objects, thereby making OODBMS ideal for OO programmers because they can develop the product, store them as objects, and can replicate or modify existing objects to make new objects within the OODBMS. Information today includes not only data but video, audio, graphs, and photos which are considered complex data types. Relational DBMS aren’t natively capable of supporting these complex data types. By being integrated with the programming language, the programmer can maintain consistency within one environment because both the OODBMS and the programming language will use the same model of representation. Relational DBMS projects using complex data types would have to be divided into two separate tasks: the database model and the application.

As the usage of web-based technology increases with the implementation of Intranets and extranets, companies have a vested interest in OODBMS to display their complex data. Using a DBMS that has been specifically designed to store data as objects gives an advantage to those companies that are geared towards multimedia presentation or organizations that utilize computer-aided design (CAD).

Some object-oriented databases are designed to work well with object-oriented programming languages such as Python, Perl, Java, C#, Visual Basic .NET, C++, Objective-C and Smalltalk; others have their own programming languages. ODBMSs use exactly the same model as object-oriented programming languages.

Advantages:

The main benefit of creating a database with objects as data is speed.
OODBMS are faster than relational DBMS because data isn’t stored in relational rows and columns but as objects.
Objects have a many to many relationship and are accessed by the use of pointers.
Pointers are linked to objects to establish relationships.
Another benefit of OODBMS is that it can be programmed with small procedural differences without affecting the entire system.
This is most helpful for those organizations that have data relationships that aren’t entirely clear or need to change these relations to satisfy the new business requirements.
This ability to change relationships leads to another benefit which is that relational DBMS can’t handle complex data models while OODBMS can.

Disadvantages:

Slower and more difficult to formulate than relational.
Lack of interoperability with a great number of tools/features that are taken for granted in the SQL world, including but not limited to industry standard connectivity, reporting tools, OLAP tools, and backup and recovery standards.
Lack a formal mathematical foundation, unlike the relational model, and this in turn leads to weaknesses in their query support.

Applications:

Object databases based on persistent programming acquired a niche in application areas such as engineering and spatial databases, telecommunications, and scientific areas such as high energy physics and molecular biology. They have made little impact on mainstream commercial data processing, though there is some usage in specialized areas of financial services. It is also worth noting that object databases held the record for the World's largest database (being the first to hold over 1000 terabytes at Stanford Linear Accelerator Center) and the highest ingest rate ever recorded for a commercial database at over one Terabyte per hour.

Another group of object databases focuses on embedded use in devices, packaged software, and real-time systems.

Bridewin

Monday, November 2, 2009