Anazor: File Organization

A file is a collection or set (ordered or unordered) of data elements stored in a storage media.
A field is the smallest (fixed) indivisible logical unit of a file which holds a part of some data value.
A record is a set of logically related fields. A record size may be fixed or variable.
A file can also be defined as a set of logically related fields or a collection of records.
There are different operations that can be carried out on a file such as: read, write, rewind, open, modify, delete etc.

It is worthy to note that to access a file stored on a magnetic tape, unlike disk files, the entire file has to be searched sequentially before reaching the record.

The access path for a disk are:

1.   Disk Number
2.   Cylinder Number
3.   Track Number
4.   Sector Number

The unit of data transfer between memory and disk is known as BLOCK. Blocking factor is known as the number of records in a block and is denoted by Bfr.
Bfr = B/R where:

B ===>   Disk Block Size
R ===>   File Record Size

Parameters of a Disk are:

1.   Seek Time: This is the time it takes to move the read/write arm to the correct cylinder. Seek time is the largest in cost and average seek time is the same as the time it takes to traverse one - third of the cylinder.

2.   Rotational Latency Time: This is the time the disk unit takes to move or to position the read/write on the beginning of the sector where the file records are stored.

3.   Block Transfer Time: This is the time for the read/write head to pass over a disk block. During the block transfer time, the bytes of data can be transferred between the disk and main memory.

One of the main objective of file organization is to speed up the file retrieval time that is to reduce the I/O time.

There are 3 Basic Categories of File Organization which are:

1.   Sequential Organization
2.   Index Organization
3.   Random Organization

In sequential organization records are written consecutively when the file is created.

Records in a sequential file can be stored in two ways.

A.   Pile File
B.   Sorted File

In pile file, records are placed one after another as they arrive thereby portraying no form of sorting at all. The total time to fetch (read) a record from a pile file requires seek time (s), rotational latency time (r), block transfer time (btt) and number of blocks in the file.

File reorganization is a process whereby all records which are marked to be deleted are deleted and all inserted records are moved to their correct location.

In sorted file, records are placed in ascending or descending values of the primary key.

Sorted Sequential File: In a sorted sequential file, the record is inserted at the end of the file and then moved to the its correct location according to ascending or descending order. Records are stored in the order of the values of the key field. Note that a sequential file usually has an overflow area. This area is to avoid sorting the file after every deletion, insertion and/or modification. The overflow area is not itself sorted it is a pile file with fixed size record.

It is worthy to note that the order of record storage determines the order of retrieval. Each operation in a sequential file regenerates a new version of the file. The intensity or frequency of use of a sequential file is determined by a parameter called HIT RATIO.

Hit ratio can be defined as the ratio of the number of records accessed for responding to a query to the total number of records in the file.

Note that high hit ratio value is desirable. This means that a larger number of records are accessed to respond to query. Also note that interactive transactions have a very low hit ratio.

Advantages of a Sequential File:

1.   It is good for batch transactions
2.   It is simple to implement
3.   It is good for report generation, statistical computation and inventory control.

Disadvantages of a Sequential File:

1.   It is not good for interactive transactions.
2.   It has high overheads in file processing for simple queries.

Index Organization
This type of file organization tries to reduce the access time and may not reduce the storage requirement of a file. An index maps the key space to the record space.

Note that the index for a file may be created on the primary or secondary keys.

There are three types of Index which are:

1.   Primary Index: This a type of index whereby an ordered file of index record are of fixed length. Each index record has two field which are:
    A.   One that holds the primary key of the data file record.
    B.   The other holds pointer to disk block where the data file is stored.

2.   Non Dense Index: This is a type of index whereby the number of entries in the index file is far less than (<<) the no of records in the data file.

3.   Dense Index: This is a type of index whereby the number of entries in the index file is equal to the number of records in the data file.

The index sequential file has two parts which are:

1.   Index Part ===> This part stores pointers to the actual record location on the disk.
2.   Data Part ===> This part holds actual data records and it is made of two distinct areas which are:
    A.   Prime Area: This area holds the record of the file.
    B.   Overflow Area: This area holds the record of the file when the prime area overflows.

Virtual Storage Access Method has three parts which are:

1.   Index Set
2.   Index Sequence Set
3.   Data Blocks

In the index set, the index records are maintained physically in ascending sequence by primary key value. The index records in the index set are non dense which means that there is only one index record for a lower level index block.

Control interval is a set of data records which are physically stored in ascending key values.

Control Area is an ordered set of control intervals and free control intervals for a file.

Distributed free space is a set of free control intervals in a control area. The size of control area and the number of control intervals may be pre defined. At the end of file initialization the unfilled control intervals are set aside as DISTRIBUTED FREE SPACE.

Random access is an access through the index set to the index sequence set to the data.

Note that in indexing, the amount of I/O increases with the size of the index. This problem can be minimized by direct file organization where the address of the desired record can be found directly (no need for indexing or sequential search). Such files are created using some hashing function so they are called hashing organization or hashed files.

Hashing

There are two types of hashing which are:

1.   Static Hashing: This is the type of hashing where by the address space size is predefined and does not grow or shrink with the file.
2.   Dynamic Hashing: This is the type of hashing where the address space size can grow and shrink with the file.

Note that in hashing, key space is a set of primary keys while address space is a set of home addresses.

Address Distribution

With hashing, the address generated is random
No obvious connection between key and home address (This makes it sometimes called RANDOMIZING).

Record distribution in address space can be uniform or random.

Collision is a situation whereby a hashing function generates a home address for more than one record. The solution to the problem of collision is the use of progressive overflow also known as open addressing.

A bucket is a logical unit of storage where more than one records are stored and the set of records is retrieved in one disk access.

Division is the basis of hashing.

Dynamic hashing manages expansion by:

1.   Splitting a bucket when it becomes full.
2.   Distributing records between old and new buckets.

Virtual hashing is a type of hashing that uses multiple hashing functions. These functions are related and the selection of the function depends on the level of the bucket split.

Demerits of Virtual Hashing
1.   It leads to a waste of space.
2.   If two buckets n and n + j are in use then all buckets between n and n + j must be available to the algorithm for inserting records.
3.   The history of the insertion must be saved to access a record because many related hashing functions are used in the search operation.

It is worthy to note that in virtual hashing the position of one bucket is not related to the position of any bucket.

Anazor

Saturday, 11 April 2015

File Organization

No comments:

Post a Comment