6 Heterogeneous Databases

When more than one WAIS database has to be queried in parallel, whoever sets up the query interface may be lucky: the fields available may coincide exactly across all databases. The more common case, however, is that different databases offer different fields, i.e. the schemas of the databases differ.

In most cases this is due to the different types of documents stored in the databases, e.g. one database holds references to literature while another holds product descriptions.

SFgate 5.1 is suited to handling heterogeneous databases whose documents do not differ too much in type, especially references to literature such as articles, books and reports.

The task with respect to heterogeneous databases is to find a mapping between the attributes (or field names) used within a given query and the attributes (or field names) available in the different databases.

6.1 Mapping of Attributes

To understand how to use the attribute mapping facility of SFgate, one has to know what happens to a given query within SFgate. As pointed out in the previous section, SFgate's task is to take the attributes from a given query and, for each database to be queried, map them onto the most suitable attributes in that database's schema.

A simple (but insufficient) solution is to rename the attributes within the databases to those used in queries. But this works only if all databases include the same attributes (on the semantic level) as those used within the query. So, strictly speaking, this solution does not deal with heterogeneity.

The solution implemented in SFgate 5.1 is based on a predefined set of attributes for use within queries. To do a mapping of attributes there has to be knowledge of how the attributes are related to each other. So we introduced a lattice (see section 6.2 Predefined Attribute Lattice) on these attributes which reflects the specialization relationships between the attributes.

By means of this lattice, the mapping process for attributes can be defined with the four operations equality, specialization, generalization and ignorance, which are tried in the following order:

Equality: If a query condition refers to a lattice attribute that is in the schema of the database to query, no mapping is done. The original query condition is taken to be part of the translated query.
Specialization: If a query condition refers to a lattice attribute which is not in the schema of the database to query, but there are more special attributes within the database schema, the mapping is done by generating one new query condition for each of the more special attributes. This is done by replacing the original attribute from the external level with the more special one of the conceptual level. The new query conditions (if there is more than one) are connected with Boolean OR to form one new query condition as part of the translated query.
Generalization: If a query condition refers to a lattice attribute which is not in the schema of the database to query, and there are no more special attributes within that schema, the attribute is mapped onto the nearest more general attribute which is part of the database schema.
Ignorance: If neither equality nor specialization nor generalization yields a translation of the original query condition, the whole condition is left out of the translated query for the current database.
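
The order of the four operations can be sketched in a few lines of code. The following is only an illustration in Python, not SFgate's actual (Perl) implementation: the child-to-parent dictionary `PARENT` holds just a fragment of a lattice, and the names `descendants` and `map_attribute` as well as the return format are assumptions made for this sketch. For specialization the sketch takes every more special attribute found in the schema; whether SFgate restricts itself to the nearest ones is not stated here.

```python
# Fragment of an attribute lattice, stored as child -> parent
# (an assumed representation, chosen for this sketch only).
PARENT = {
    "keywords": "TOP",
    "content": "keywords",
    "full-text": "content",
    "title": "full-text",
    "book-title": "title",
    "article-title": "title",
    "abstract": "full-text",
    "initiator": "keywords",
    "author-name": "initiator",
    "editor-name": "initiator",
}


def descendants(attr):
    """Return all attributes in PARENT that are more special than attr."""
    found = []
    for child, parent in PARENT.items():
        if parent == attr:
            found.append(child)
            found.extend(descendants(child))
    return found


def map_attribute(attr, schema):
    """Map one query attribute onto the lattice attributes of one database.

    schema is the set of lattice attributes the database offers.
    Returns (operation, attributes); several attributes stand for
    conditions that are connected with Boolean OR.
    """
    # Equality: the attribute itself is part of the schema.
    if attr in schema:
        return "equality", [attr]
    # Specialization: use the more special attributes the schema offers.
    special = sorted(a for a in descendants(attr) if a in schema)
    if special:
        return "specialization", special
    # Generalization: walk upwards to the nearest more general attribute.
    current = PARENT.get(attr)
    while current is not None:
        if current in schema:
            return "generalization", [current]
        current = PARENT.get(current)
    # Ignorance: the condition is dropped for this database.
    return "ignorance", []


schema = {"title", "author-name", "keywords"}
print(map_attribute("title", schema))      # equality
print(map_attribute("initiator", schema))  # specialization
print(map_attribute("abstract", schema))   # generalization
print(map_attribute("date", {"title"}))    # ignorance
```

Because the operations are tried strictly in this order, a condition is only generalized (and thereby made less precise) when the database offers nothing equal or more special.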

6.2 Predefined Attribute Lattice

The set of attributes used within the lattice is mainly taken from the Scientific and Technical Attribute Set (STAS) maintained by CNIDR. STAS defines standard identifiers for referring to searchable and retrievable fields within scientific and technical databases.

KL-ONE defines a diffs operator which allows for inheritance on attributes. Using the diffs construct, a specialization hierarchy has been introduced on a (small) subset of the STAS attributes:

 TOP
 |
 |-keywords
 |  |
 |  |-content
 |  |  |
 |  |  +-full-text
 |  |  |  |
 |  |  |  +-title
 |  |  |  |  |
 |  |  |  |  |-book-title
 |  |  |  |  |-article-title
 |  |  |  |  +-series-title
 |  |  |  |
 |  |  |  |-abstract
 |  |  |  +-subject-descriptor
 |  |  |
 |  |  +-journal-title
 |  |  
 |  |-initiator
 |  |  |
 |  |  |-author-name
 |  |  |-editor-name
 |  |  |-corporate
 |  |  +-conference
 |  |
 |  +-publisher
 |     |
 |     |-publisher-name
 |     +-publisher-address
 |
 |-date
 |  |
 |  |-entry-date
 |  +-publication-date
 |
 +-meta
    |
    |-issn
    |-isbn
    |-crc
    |-volume
    |-number
    +-edition

TOP: The top attribute which subsumes all other attributes.
keywords: Subsumes attributes with direct regard to the document.
content: Information concerning the document content.
full-text: Full document text.
title: Information about the title of a document.
book-title: Title of the book.
article-title: Title of the article.
series-title: Title of the series.
abstract: Abstract of a document.
subject-descriptor: Descriptors of a document.
journal-title: Title of the journal.
initiator: Information concerning the initiator.
author-name: The name of an author person.
editor-name: The name of an editor person.
corporate: The name of a corporation or institution.
conference: The name of a conference.
publisher: A publishing organization.
publisher-name: The name of a publishing organization.
publisher-address: The address of a publishing organization.
date: Date information.
entry-date: The date a record is added to the database.
publication-date: The date when a document has been published.
meta: Additional information.
issn: International Standard Serial Number.
isbn: International Standard Book Number.
crc: Classification according to the ACM Computing Reviews Classification System.
volume: Volume number of a journal.
number: Number of a journal/serials.
edition: Edition of a document.
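
To make the specialization relationships easier to inspect, the hierarchy above can be written down as a nested structure. The following Python sketch is an illustration only: it does not reproduce the format of SFgate's lattice file, and the helper names `path_to`, `subtree` and `flatten` are assumptions made for this example.

```python
# The lattice from the diagram above as nested dictionaries:
# each key is an attribute, each value holds its more special attributes.
LATTICE = {
    "TOP": {
        "keywords": {
            "content": {
                "full-text": {
                    "title": {
                        "book-title": {},
                        "article-title": {},
                        "series-title": {},
                    },
                    "abstract": {},
                    "subject-descriptor": {},
                },
                "journal-title": {},
            },
            "initiator": {
                "author-name": {},
                "editor-name": {},
                "corporate": {},
                "conference": {},
            },
            "publisher": {"publisher-name": {}, "publisher-address": {}},
        },
        "date": {"entry-date": {}, "publication-date": {}},
        "meta": {
            "issn": {}, "isbn": {}, "crc": {},
            "volume": {}, "number": {}, "edition": {},
        },
    }
}


def path_to(attr, tree=LATTICE, trail=()):
    """Return the chain from TOP down to attr, or None if attr is unknown."""
    for name, children in tree.items():
        here = trail + (name,)
        if name == attr:
            return list(here)
        found = path_to(attr, children, here)
        if found:
            return found
    return None


def subtree(attr, tree=LATTICE):
    """Return the dictionary of attributes directly below attr, or None."""
    for name, children in tree.items():
        if name == attr:
            return children
        found = subtree(attr, children)
        if found is not None:
            return found
    return None


def flatten(tree):
    """List every attribute in tree, parents before their children."""
    names = []
    for name, children in tree.items():
        names.append(name)
        names.extend(flatten(children))
    return names


print(path_to("book-title"))      # the more general attributes, TOP first
print(flatten(subtree("title")))  # the attributes more special than title
```

Read from right to left, the result of `path_to` is the order in which more general attributes would be considered for an attribute; the result of `flatten(subtree(...))` lists the candidates for specialization.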

6.3 Attribute Mapping in Practice

What has to be done to use SFgate's heterogeneous database handling facility? First of all, SFgate must be told to use it, so set up a FORM tag named attributes:

<INPUT NAME="attributes" TYPE="hidden" VALUE="1">

The next thing is to select a suitable lattice. You need not use the lattice proposed in the previous section; feel free to build your own. If you do not use the proposed one (i.e. the file `$SFgate/lattice', which is installed in the application directory (see section 3.3.1.10 Directory for Application Files)), you have to announce your lattice to SFgate. To do so, set up a FORM tag named lattice with the filename of your lattice file as value (including the complete path if you don't want to install it in the application directory):

<INPUT NAME="lattice" TYPE="hidden" VALUE="lattice">

Now that you have a lattice you can create the query part in your form. Specify the input fields using the attribute names from your lattice only.

The next step is to do some configuration on the databases to query. In general, field names in WAIS databases are not taken from your lattice, so you have to tell SFgate how the database fields map onto the attributes within the lattice. Furthermore, SFgate needs to know the types of the database fields, since fields from different databases that are mapped onto one lattice attribute need not have the same type.

Let's start with an enumeration of the different possible types:

text
stemming
numeric
soundex
phonix

The types of a field (yes, a database field can have more than one type, e.g. soundex and text for personal names) can easily be derived from the .fmt-file (see section `Building a Format Description' in The freeWAIS-sf Manual) used to create the database.

How is this knowledge about a database passed to SFgate? It is given in an external file, which should reside in the application file directory (see section 3.3.1.10 Directory for Application Files). Instead of the database specification described in section 5.4 Databases, the value of a database FORM tag must contain the name of that file. If the file doesn't reside in the application file directory, it must be specified with the complete path.

Specifying only a file name of a database configuration file within a database FORM tag makes it necessary to configure server, port and name of the WAIS database within the configuration file itself.

Converters are another point. Different databases contain documents of different types, so different converters may be needed for documents coming from different databases. This is configured within the database configuration files, too.

The syntax of database configuration files is taken from Perl. The three examples below form a database configuration file for the WAIS database `bibdb-html' on server `ls6.informatik.uni-dortmund.de'.

Server, port and name of the database are specified as simple Perl variables:

$server = 'local';
$port   = '210';
$name   = 'bibdb-html';

The database fields, their types and their counterparts within the lattice are given as a reference to an anonymous hash, named $attributes. The first part of an entry is the name of a database field followed by a colon and the list of types, separated by commas. The second part is the counterpart within the lattice:

$attributes = {
    'py:numeric'      => 'publication-date',
    'au:text,soundex' => 'author-name',
    'ti:stemming'     => 'title',
    'cc:text'         => 'crc',
    'jt:stemming'     => 'journal-title',
    'vo:numeric'      => 'volume',
    'no:numeric'      => 'number',
    'global:text'     => 'keywords'
    };
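
To illustrate how such entries decompose, the following Python sketch splits each key into the field name and its list of types and groups the fields by lattice attribute. It is an illustration under stated assumptions, not SFgate's own (Perl) parsing code: the entries are transcribed from the example above into a Python dictionary, the helper names `parse_entry` and `fields_by_lattice_attribute` are hypothetical, and the check against the five type names listed earlier is a choice made for this sketch.

```python
# The entries of $attributes from the example, transcribed as a Python dict.
ATTRIBUTES = {
    "py:numeric": "publication-date",
    "au:text,soundex": "author-name",
    "ti:stemming": "title",
    "cc:text": "crc",
    "jt:stemming": "journal-title",
    "vo:numeric": "volume",
    "no:numeric": "number",
    "global:text": "keywords",
}

# The field types enumerated in this section.
KNOWN_TYPES = {"text", "stemming", "numeric", "soundex", "phonix"}


def parse_entry(key):
    """Split 'field:type1,type2' into (field, [type1, type2])."""
    field, _, type_list = key.partition(":")
    types = [t.strip() for t in type_list.split(",") if t.strip()]
    unknown = set(types) - KNOWN_TYPES
    if not field or not types or unknown:
        raise ValueError(f"malformed attribute entry: {key!r}")
    return field, types


def fields_by_lattice_attribute(attributes):
    """Invert the mapping: lattice attribute -> list of (field, types)."""
    result = {}
    for key, lattice_attr in attributes.items():
        result.setdefault(lattice_attr, []).append(parse_entry(key))
    return result


for lattice_attr, fields in fields_by_lattice_attribute(ATTRIBUTES).items():
    print(lattice_attr, fields)
```

The inverted view is the one a mapping step needs: given a lattice attribute from the query, it yields the database field to search and the types that field supports.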

The (optional) mapping of converters is likewise given as a reference to an anonymous hash (named $converter). The first part of an entry is the converter name used within the form, the second part is the name of the converter to call:

$converter = {
    'BIBTEX'          => 'bibtex',
    'PRETTY'          => 'label:',
    'DEFAULT'         => 'label:'
    };

6.4 Further References

If you want to learn more about the background of SFgate's attribute mapping idea, take a look at:

Norbert Fuhr (1996): Object-Oriented and Database Concepts for the Design of Networked Information Retrieval Systems (`http://ls6-www.informatik.uni-dortmund.de/ir/reports/96/Fuhr-96.html')
Norbert Gövert (1996): SFgate and Heterogeneous Databases (`http://ls6-www.informatik.uni-dortmund.de/ir/projects/SFgate/heterogeneous.html')
SFgate and Heterogeneous Databases - Demo: `http://ls6-www.informatik.uni-dortmund.de/ir/projects/SFgate/cs_literature.html'

SFgate 5.111