Tags: aim, data bases, data management, distinction, document retrieval, file structure, file structures, gap, logical nature, logical structure, magnetic disk, outset, physical organisation, relationships, storage medium, surveys, tree structure,
Four
FILE STRUCTURES
Introduction
This chapter is mainly concerned with the way in which file structures are used in document
retrieval. Most surveys of file structures address themselves to applications in data
management which is reflected in the terminology used to describe the basic concepts. I
shall (on the whole) follow Hsiao and Harary1 whose terminology is perhaps slightly non-
standard but emphasises the logical nature of file structures. A further advantage is that it
enables me to bridge the gap between data management and document retrieval easily. A
few other good references on file structures are Roberts2 , Bertziss3 , Dodd4 , and Climenson5 .
Logical or physical organisation and data independence
There is one important distinction that must be made at the outset when discussing file
structures. And that is the difference between the logical and physical organisation of the
data. On the whole a file structure will specify the logical structure of the data, that is the
relationships that will exist between data items independently of the way in which these
relationships may actually be realised within any computer. It is this logical aspect that we
will concentrate on. The physical organisation is much more concerned with optimising the
use of the storage medium when a particular logical structure is stored on, or in it.
Typically for every unit of physical store there will be a number of units of the logical
structure (probably records) to be stored in it. For example, if we were to store a tree
structure on a magnetic disk, the physical organisation would be concerned with the best
way of packing the nodes of the tree on the disk given the access characteristics of the disk.
The work on data bases has been very much concerned with a concept called data
independence. The aim of this work is to enable programs to be written independently of the
logical structure of the data they would interact with. The independence takes the
following form, should the file structure overnight be changed from an inverted to a serial file
the program should remain unaffected. This independence is achieved by interposing a data
model between the user and the data base. The user sees the data model rather than the
data base, and all his programs communicate with the model. The user therefore has no
interest in the structure of the file.
There is a school of thought that says that says that applications in library automation
and information retrieval should follow this path as well6,7 . And so it should.
Unfortunately, there is still much debate about what a good data model should look like.
Furthermore, operational implementations of some of the more advanced theoretical systems
do not exist yet. So any suggestion that an IR system might be implemented through a data
base package should still seem premature. Also, the scale of the problems in IR is such that
efficient implementation of the application still demands close scrutiny of the file structure
to be used.
Nevertheless, it is worth taking seriously the trend away from user knowledge of file
structures, a trend that has been stimulated considerably by attempts to construct a theory
of data8,9 . There are a number of proposals for dealing with data at an abstract level. The
best known of these by now is the one put forward by Codd8 , which has become known as
the relational model. In it data are described by n-tuples of attribute values. More
formally if the data is described by relations, a relation on a set of domains D1 , . . . , D n can
be represented by a set of ordered n-tuples each of the form (d 1 , . . . , d n) where d i D i . As
50 File structures
it is rather difficult to cope with general relations, various levels (three in fact) of
normalisation have been introduced restricting the kind of relations allowed.
A second approach is the hierarchical approach. It is used in many existing data base
systems. This approach works as one might expect: data is represented in the form of
hierarchies. Although it is more restrictive than the relational approach it often seems to be
the natural way to proceed. It can be argued that in many applications a hierarchic
structure is a good approximation to the natural structure in the data, and that the resulting
loss in precision of representation is worth the gain in efficiency and simplicity of
representation.
The third approach is the network approach associated with the proposals by the Data
Base Task Group of CODASYL. Here data items are linked into a network in which any
given link between two items exists because it satisfies some condition on the attributes of
those items, for example, they share an attribute. It is more general than the hierarchic
approach in the sense that a node can have any number of immediate superiors. It is also
equivalent to the relational approach in descriptive power.
The whole field of data base structures is still very much in a state of flux. The
advantages and disadvantages of each approach are discussed very thoroughly in Date10 ,
who also gives excellent annotated citations to the current literature. There is also a recent
Computing Surveyll which reviews the current state of the art. There have been some very
early proponents of the relational approach in IR, as early as 1967 Maron12 and Levien13
discussed the design and implementation of an IR system via relations, be it binary ones.
Also Prywes and Smith in their review chapter in the Annual Review of Information Science and
Technology more recently recommended the DBTG proposals as ways of implementing IR
systems7 .
Lurking in the background of any discussion of file structures nowadays is always the
question whether data base technology will overtake all. Thus it may be that any
application in the field of library automation and information retrieval will be implemented
through the use of some appropriate data base package. This is certainly a possibility but
not likely to happen in the near future. There are several reasons. One is that data base
systems are general purpose systems whereas automated library and retrieval systems are
special purpose. Normally one pays a price for generality and in this case it is still too great.
Secondly, there now is a considerable investment in providing special purpose systems (for
example, MARC)14 and this is not written off very easily. Nevertheless a trend towards
increasing use of data-base technology exists and is well illustrated by the increased
prominence given to it in the Annual Review of Information Science and Technology.
A language for describing file structures
Like all subjects in computer science the terminology of file structures has evolved higgledy-
piggledy without much concern for consistency, ambiguity, or whether it was possible to
make the kind of distinctions that were important. It was only much later that the need for
a well-defined, unambiguous language to describe file structures became apparent. In
particular, there arose a need to communicate ideas about file structures without getting
bogged down by hardware considerations.
This section will present a formal description of file structures. The framework
described is important for the understanding of any file structure. The terminology is based
on that introduced by Hsiao and Harary (but also see Hsiao15 and Manola and Hsiao 16 ).
Their terminology has been modified and extended by Severance17 , a summary of this can be
found in van Rijsbergen18 . Jonkers19 has formalised a different framework which provides
an interesting contrast to the one described here.
Basic terminology
Given a set of 'attributes' A and a set of 'values' V, then a record R is a subset of the cartesian
product A x V in which each attribute has one and only one value. Thus R is a set of
Introduction 51
ordered pairs of the form (an attribute, its value). For example, the record for a document
which has been processed by an automatic content analysis algorithm would be
R = {(K1 , x 1 ), (K2 , x 2 ), . . . (Km, x m)}
The Ki 's are keywords functioning as attributes and the value x i can be thought of as a
numerical weight. Frequently documents are simply characterised by the absence or
presence of keywords, in which case we write
R = {Kt1 , Kt2 , . . . , Kti }
where Kti is present if x ti = 1 and is absent otherwise.
Records are collected into logical units called files. They enable one to refer to a set of
records by name, the file name. The records within a file are often organised according to
relationships between the records. This logical organisation has become known as a file
structure (or data structure).
It is difficult in describing file structures to keep the logical features separate from the
physical ones. The latter are characteristics forced upon us by the recording media (e.g.
tape, disk). Some features can be defined abstractly (with little gain) but are more easily
understood when illustrated concretely. One such feature is a field. In any implementation
of a record, the attribute values are usually positional, that is the identity of an attribute is
given by the position of its attribute value within the record. Therefore the data within a
record is registered sequentially and has a definite beginning and end. The record is said to
be divided into fields and the nth field carries the nth attribute value. Pictorially we have an
example of a record with associated fields in Figure 4.1.
R =
K1 K2
K3
K4
P1 P2
P3 P4
Figure 4.1. An example of a record with associated fields
The fields are not necessarily constant in length. To find the value of the attribute K4 , we
first find the address of the record R (which is actually the address of the start of the
record) and read the data in the 4th field.
In the same picture I have also shown some fields labelled Pi . They are addresses of
other records, and are commonly called pointers. Now we have extended the definition of
a record to a set of attribute-value pairs and pointers. Each pointer is usually associated
with a particular attribute-value pair. For example, (see Figure 4.2) pointers could be used
to link all records for which the value x 1 (of attribute K1 ) is a, similarly for x 2 equal to b, etc.
52 File structures
R1 a R2 b R3 a R4 b R5 a
c c c
Figure 4.2. A demonstration of the use of pointers to link records
To indicate that a record is the last record pointed to in a list of records we use the null
pointer . The pointer associated with attribute K in record R will be called a K-pointer. An
attribute (keyword) that is used in this way to organise a file is called a key.
The unify the discussion of file structures we need some further concepts. Following
Hsiao and Harary again, we define a list L of records with respect to a keyword K, or more
briefly a K-list as a set of records containing K such that:
(1) the K-pointers are distinct;
(2) each non-null K-pointer in L gives the address of a record within L;
(3) there is a unique record in L not pointed to by any record containing K;
it is called the beginning of the list; and
(4) there is a unique record in L containing the null K-pointer; it is the end
of the list.
(Hsiao and Harary state condition (2) slightly differently so that no two K-lists have a
record in common; this only appears to complicate things.)
From our previous example:
K1 -list : R1 , R2 , R5
K2 -list : R2 , R4
K4 -list : R1 , R2 , R3
Finally, we need the definition of a directory of a file. Let F be a file whose records
contain just m different keywords K1 , K2 , . . . , Km. Let ni be the number of records
containing the keyword Ki , and h i be the number of Ki -lists in F. Furthermore, we denote by
aij the beginning address of the jth Ki -list. Then the directory is the set of sequences
(Ki , n i , h i , ai1 , ai2 , . . . aih i ) i = 1, 2, . . . m
We are now in a position to give a unified treatment of sequential files, inverted files,
index-sequential files and multi-list files.
Sequential files
A sequential file is the most primitive of all file structures. It has no directory and no
linking pointers. The records are generally organised in lexicographic order on the value of
some key. In other words, a particular attribute is chosen whose value will determine the
order of the records. Sometimes when the attribute value is constant for a large number of
records a second key is chosen to give an order when the first key fails to discriminate.
The implementation of this file structure requires the use of a sorting routine.
Introduction 53
Its main advantages are:
(1) it is easy to implement;
(2) it provides fast access to the next record using lexicographic order.
Its disadvantages:
(1) it is difficult to update - inserting a new record may require moving a large
proportion of the file;
(2) random access is extremely slow.
Sometimes a file is considered to be sequentially organised despite the fact that it is not
ordered according to any key. Perhaps the date of acquisition is considered to be the key
value, the newest entries are added to the end of the file and therefore pose no difficulty to
updating.
Inverted files
The importance of this file structure will become more apparent when Boolean Searches are
discussed in the next chapter. For the moment we limit ourselves to describing its structure.
An inverted file is a file structure in which every list contains only one record. Remember
that a list is defined with respect to a keyword K, so every K-list contains only one record.
This implies that the directory will be such that ni = h i for all i, that is, the number of records
containing Ki will equal the number of Ki -lists. So the directory will have an address for
each record containing Ki . For document retrieval this means that given a keyword we can
immediately locate the addresses of all the documents containing that keyword. For the
previous example let us assume that a non-black entry in the field corresponding to an
attribute indicates the presence of a keyword and a black entry its absence. Then the
directory will point to the file in the way shown in Figure 4.3. The definition of an inverted
files does not require that the addresses in the directory are in any order. However, to
facilitate operations such as conjunction ('and') and disjunction ('or') on any two inverted
lists, the addresses are normally kept in record number order. This means that 'and' and 'or'
operations can be performed with one pass through both lists. The penalty we pay is of
course that the inverted file becomes slower to update.
54 File structures
File
Directory a 11 = a 41 a 21 = a
42
K 1 , 3, 3, a11, a 12 , a 13
a b
K 2, 2, 2, a21, a 22
K 3 c c
K 4 , 3,3, a41 a 42 a 43
a12 = a 43 a 22
a b
c
a 13
a
Figure 4.3. An inverted file
Index-sequential files
An index-sequential file is an inverted file in which for every keyword Ki , we have ni = h i =
1 and a11