Why are logical databases the technology of choice for implementing information systems?
Information systems are generally about the management of records. Records can be records of just about anything: a company’s accounts, medical records, criminal records, a census, the archives of a newspaper, or the contents of a museum. Equally, just about anything can serve as a record: the Babylonians used clay tablets to record their business dealings, some medical records are images, and most of the contents of most museums are physical artefacts of various kinds. But contemporary information systems are generally concerned with documents that contain most of their information in the form of text. Physical objects like the contents of a museum are generally represented in information systems by documents called catalogue entries.
So, more specifically, information systems are about the management of records that are documents containing information mostly in textual form. The general technology for processing collections of text records is the text database.
The model of information-seeking behaviour supported by text databases has the following steps:
The user has an information need.
The user formulates the information need as a query consisting of a collection of terms.
The system returns the subset of its collection of documents containing all and only those documents that contain the query terms.
The user then reviews the documents returned, and makes a judgment as to whether each document satisfies the information need or not. The expectation is that many of the documents returned will be irrelevant (limited precision). The expectation is also that some of the documents in the collection that would have satisfied the information need were not returned, because the query did not contain appropriate terms (limited recall).
Precision and recall are measured on a percentage scale. A precision of 0% means that none of the documents retrieved met the information need. A precision of 100% means that all did. A recall of 0% means that none of the relevant documents were retrieved. A recall of 100% means that all were. Returning the entire collection guarantees 100% recall, but gives a very low precision. Text database systems are considered to perform very well if their average precision and average recall are as high as 40%.
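The retrieval model and the two measures can be made concrete with a small sketch. It is an illustration only: the four-document collection, the function names and the user's relevance judgement are all invented here, and real text database systems index and rank documents in far more sophisticated ways.

```python
# Toy collection: document id -> text. Contents are invented for illustration.
documents = {
    "d1": "invoice for garden tools order",
    "d2": "garden party invitation",
    "d3": "order of spades and rakes",
    "d4": "tools order confirmation",
}


def retrieve(query_terms, docs):
    """Return the ids of all and only the documents containing every query term."""
    terms = {t.lower() for t in query_terms}
    return {doc_id for doc_id, text in docs.items()
            if terms <= set(text.lower().split())}


def precision(retrieved, relevant):
    """Percentage of retrieved documents that are relevant.

    Precision is undefined when nothing is retrieved; returning 0.0 in that
    case is a convention chosen for this sketch.
    """
    if not retrieved:
        return 0.0
    return 100.0 * len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Percentage of relevant documents that were retrieved.

    Returning 0.0 when there are no relevant documents is again a convention.
    """
    if not relevant:
        return 0.0
    return 100.0 * len(retrieved & relevant) / len(relevant)


# Assumed information need: documents about purchases of gardening equipment.
# The user's judgement of relevance is supplied by hand.
relevant = {"d1", "d3"}

retrieved = retrieve(["tools", "order"], documents)   # {"d1", "d4"}
print(precision(retrieved, relevant))                 # 50.0: d4 is irrelevant
print(recall(retrieved, relevant))                    # 50.0: d3 was missed

everything = set(documents)
print(recall(everything, relevant))                   # 100.0: nothing is missed
```

In this toy case d4 is returned although it does not meet the need, and d3 is missed because it contains the word "order" but not "tools", which is exactly the limited precision and limited recall described above.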
Computer-based information systems generally make use of technologies such as relational databases. There is a wide variety of such systems, but they are generally characterised by data models based on classes and instances, with relationships among classes. Typically the data model is expressed in a language like UML, one of the varieties of entity-relationship modeling, or object-role modeling. The populations of particular systems are generally managed by systems based more or less on the first-order predicate calculus, such as relational database systems or object-oriented database systems, which we here call logical databases.
In text database terms, a query on a logical database is expected to have 100% precision and 100% recall. A class list is the definitive statement of which students are enrolled in a course. A person may attend lectures, submit assignments and sit an examination, but if they are not on the class list then they are not enrolled and cannot be assigned a grade. Another person may never attend classes, submit no assignments and not sit the examination but, being on the class list, is considered enrolled and will be given a grade, perhaps one signifying ‘no assessment submitted’.
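Because a query on a logical database returns all and only the records satisfying the information need, it is possible to construct much more complex queries. Combining information from two different tables requires 100% precision and 100% recall, since otherwise the errors in each partial result would compound in the combination. So does the reliable use of negation, since the complement of an incomplete or imprecise result is itself unreliable, and so do complex selection conditions.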
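As a hedged illustration of such a query, the sketch below uses Python's standard sqlite3 module with an invented schema (the tables student, enrolment and submission, the course code DB101 and the three students are assumptions made for this example). The query combines two tables and uses negation to find everyone on the class list who has submitted nothing.

```python
import sqlite3

# In-memory database with a hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    );
    CREATE TABLE enrolment (
        student_id INTEGER NOT NULL REFERENCES student(student_id),
        course     TEXT NOT NULL
    );
    CREATE TABLE submission (
        student_id INTEGER NOT NULL REFERENCES student(student_id),
        course     TEXT NOT NULL
    );
    INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    -- Ann and Bob are on the class list; Cy is not.
    INSERT INTO enrolment VALUES (1, 'DB101'), (2, 'DB101');
    -- Ann and Cy handed something in; Bob did not.
    INSERT INTO submission VALUES (1, 'DB101'), (3, 'DB101');
""")


def enrolled_without_submission(connection, course):
    """Names of students on the class list for `course` with no submission."""
    rows = connection.execute(
        """
        SELECT s.name
        FROM student AS s
        JOIN enrolment AS e ON e.student_id = s.student_id
        WHERE e.course = ?
          AND NOT EXISTS (
              SELECT 1 FROM submission AS sub
              WHERE sub.student_id = s.student_id
                AND sub.course = e.course
          )
        ORDER BY s.name
        """,
        (course,),
    ).fetchall()
    return [name for (name,) in rows]


print(enrolled_without_submission(conn, "DB101"))   # ['Bob']
```

Cy, who submitted work without being enrolled, does not appear, and Bob, who is enrolled but submitted nothing, does. The answer can be trusted only because the enrolment table is taken to be complete and exact; if it were merely a 40% approximation, the NOT EXISTS condition would be meaningless.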
The claim here is that logical databases are the preferred technology for managing collections of records using information systems. But all we have established so far is that an information system manages a collection of records. We need to look at these collections in more detail.
Consider a particular kind of collection of documents that are records of the activity of an organisation, namely its incoming and outgoing correspondence. Imagine we have a UML model for this collection, and consider a particular document, namely a letter from a potential customer enquiring about the possible existence of a product that the company does not at present supply. Call this letter Q. We want to compare this with a letter from an established customer placing an order for an existing product. Call this letter P.
We want to look at what the organisation can do with letter Q compared to what it can do with letter P. Letter P can be cross-referenced with other documents associated with the established customer, and with other documents associated with the existing product. Some of the former will be invoices, statements, payments, and so on. Some of the latter will be picking lists, shipping orders, purchase orders and so on. The organisation will have standard queries associated with these documents, for example all orders that have been delivered but not paid for, or all orders for a customer that have not yet been shipped.
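The two standard queries just mentioned might look something like the following sketch, again using Python's sqlite3 module. The schema (customer, customer_order, payment), the column names and the sample rows are all assumptions made for illustration; an organisation's real data model would be considerably richer.

```python
import sqlite3

# Hypothetical order-processing schema, invented for this example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE customer_order (
        order_id     INTEGER PRIMARY KEY,
        customer_id  INTEGER NOT NULL REFERENCES customer(customer_id),
        product      TEXT NOT NULL,
        shipped_on   TEXT,   -- NULL until the order is shipped
        delivered_on TEXT    -- NULL until the order is delivered
    );
    CREATE TABLE payment (
        order_id INTEGER NOT NULL REFERENCES customer_order(order_id),
        paid_on  TEXT NOT NULL
    );
    INSERT INTO customer VALUES (1, 'Acme'), (2, 'Bolt');
    INSERT INTO customer_order VALUES
        (10, 1, 'widget', '2024-01-02', '2024-01-05'),
        (11, 1, 'gasket', NULL,         NULL),
        (12, 2, 'widget', '2024-01-03', '2024-01-06');
    INSERT INTO payment VALUES (10, '2024-01-10');
""")


def delivered_but_unpaid(connection):
    """Ids of all orders that have been delivered but have no payment."""
    rows = connection.execute("""
        SELECT o.order_id
        FROM customer_order AS o
        LEFT JOIN payment AS p ON p.order_id = o.order_id
        WHERE o.delivered_on IS NOT NULL
          AND p.order_id IS NULL
        ORDER BY o.order_id
    """).fetchall()
    return [order_id for (order_id,) in rows]


def unshipped_orders(connection, customer_id):
    """Ids of all orders for one customer that have not yet been shipped."""
    rows = connection.execute(
        """
        SELECT order_id
        FROM customer_order
        WHERE customer_id = ? AND shipped_on IS NULL
        ORDER BY order_id
        """,
        (customer_id,),
    ).fetchall()
    return [order_id for (order_id,) in rows]


print(delivered_but_unpaid(conn))   # [12]
print(unshipped_orders(conn, 1))    # [11]
```

Letter P becomes a row in something like customer_order, linked to an existing customer and an existing product, and so is picked up by these routine queries without anyone having to read it again.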
By contrast, it is not at all clear what to do with letter Q. It might routinely be answered with a polite negative reply. If the prospective customer will potentially place large orders, the letter might be sent to the product development group for a feasibility study. The product may or may not be technically feasible. If technically feasible, there may or may not be the capital available for development, or there may be higher-return uses for that capital. It would be hard to know with what other kinds of documents letter Q would be associated, and hard to see what routine queries might retrieve it.
Letter P fits well into the class/instance/relationship data model, while letter Q does not. The class/instance/relationship data model permits the construction of complex queries, the reliable definition of negation, and so on. Information systems generally exclude documents like letter Q from consideration, concentrating on documents like letter P.
So, the preliminary answer to the question as to why information systems are implemented using logical rather than text databases is that the subset of records considered by information systems consists very largely of those that are usefully modeled using the assumptions underlying logical databases, and so can profit from the much richer querying capability of logical databases.
However, this is hardly a satisfactory explanation since it is circular. Information systems use logical databases because they are about managing the sorts of records that can be well managed by logical databases. We need a deeper understanding of these sorts of records.