Tuesday 21 April 2015

Summary of File and Database Organization

A file is a collection or set (ordered or unordered) of data elements stored on a storage medium.
A field is the smallest indivisible logical unit of a file and holds part of some data value.
A record is a set of logically related fields. A record's size may be fixed or variable.
A file can also be defined as a collection of logically related records.
Different operations can be carried out on a file, such as read, write, rewind, open, modify and delete.

It is worth noting that, unlike a disk file, a file stored on magnetic tape must be searched sequentially from the beginning until the desired record is reached.

The access path to a record on a disk consists of:


1.   Disk Number
2.   Cylinder Number
3.   Track Number
4.   Sector Number

The unit of data transfer between memory and disk is known as a BLOCK. The blocking factor, denoted Bfr, is the number of records that fit in one block:
Bfr = ⌊B/R⌋ (the floor, since a record does not normally span two blocks) where:

B ===>   Disk Block Size
R ===>   File Record Size
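
As a rough illustration, the sketch below (Python, with assumed block and record sizes) computes Bfr and the number of blocks a file of a given size would occupy.

```python
# A minimal sketch (all figures assumed) of the blocking factor and the number
# of blocks needed to store a file, with Bfr = floor(B / R) for fixed-length records.
B = 512            # disk block size in bytes (assumed)
R = 100            # file record size in bytes (assumed)
num_records = 30_000

bfr = B // R                              # records per block -> 5
blocks_needed = -(-num_records // bfr)    # ceiling division  -> 6000
wasted_per_block = B - bfr * R            # unused bytes per block -> 12

print(bfr, blocks_needed, wasted_per_block)
```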

Parameters of a Disk are:

1.   Seek Time:  This is the time it takes to move the read/write arm to the correct cylinder. Seek time is the largest cost component; the average seek time is the time it takes the arm to traverse one-third of the cylinders.

2.   Rotational Latency Time:  This is the time the disk unit takes to rotate the beginning of the sector where the file records are stored under the read/write head.

3.   Block Transfer Time:  This is the time for the read/write head to pass over a disk block. During the block transfer time, the bytes of data are transferred between the disk and main memory.
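
To see how these three parameters combine, here is a small sketch with made-up timing figures comparing the cost of reading one block at a random location with reading many consecutive blocks after a single seek.

```python
# Illustrative figures only (not taken from any particular drive).
s   = 8.0    # average seek time, ms (assumed)
r   = 4.0    # average rotational latency, ms (assumed)
btt = 0.1    # block transfer time, ms (assumed)

one_random_block   = s + r + btt          # seek + latency + one transfer
hundred_sequential = s + r + 100 * btt    # one seek/latency, then 100 transfers

print(one_random_block)     # 12.1 ms
print(hundred_sequential)   # 22.0 ms in total, i.e. about 0.22 ms per block
```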

One of the main objectives of file organization is to speed up file retrieval, that is, to reduce the I/O time.


There are 3 Basic Categories of File Organization which are:

1.   Sequential Organization
2.   Index Organization
3.   Random Organization

In sequential organization records are written consecutively when the file is created.

Records in a sequential file can be stored in two ways.

A.   Pile File
B.   Sorted File

In a pile file, records are placed one after another as they arrive, with no ordering at all. The total time to fetch (read) a record from a pile file depends on the seek time (s), the rotational latency time (r), the block transfer time (btt) and the number of blocks in the file (b).
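
On average about half of the blocks must be scanned before the desired record is found, so a rough expression for the expected fetch time is s + r + (b/2) * btt. The sketch below reuses the illustrative timing figures from the disk-parameter example above.

```python
# Expected cost of fetching one record from a pile file of b blocks (figures assumed).
s, r, btt = 8.0, 4.0, 0.1    # ms, same illustrative values as before
b = 6000                     # number of blocks in the file

expected_fetch = s + r + (b / 2) * btt   # scan half the file on average
print(expected_fetch)                    # 312.0 ms for a single lookup
```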

File reorganization is the process whereby all records marked for deletion are removed and all newly inserted records are moved to their correct locations.

In a sorted file, records are placed in ascending or descending order of the primary key values.


Sorted Sequential File:
In a sorted sequential file, a new record is inserted at the end of the file and then moved to its correct location according to the ascending or descending order. Records are stored in order of the values of the key field. Note that a sequential file usually has an overflow area; this area avoids having to sort the file after every deletion, insertion and/or modification. The overflow area is not itself sorted; it is a pile file with fixed-size records.

It is worth noting that the order of record storage determines the order of retrieval. Each operation on a sequential file generates a new version of the file. The intensity or frequency of use of a sequential file is measured by a parameter called the HIT RATIO.

The hit ratio is defined as the ratio of the number of records accessed in responding to a query to the total number of records in the file.

Note that a high hit ratio is desirable; it means that a large proportion of the records are accessed in responding to a query. Also note that interactive transactions have a very low hit ratio.
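
A small worked example (figures assumed) makes the definition concrete.

```python
# Hit ratio = records accessed to answer a query / total records in the file.
total_records = 10_000

batch_run     = 8_000 / total_records   # e.g. a payroll run touching most records
single_lookup = 1 / total_records       # e.g. one interactive enquiry

print(batch_run)      # 0.8    -> high hit ratio: sequential processing suits it
print(single_lookup)  # 0.0001 -> very low hit ratio: sequential files suit it poorly
```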

Advantages of a Sequential File:

1.   It is good for batch transactions
2.   It is simple to implement
3.   It is good for report generation, statistical computation and inventory control.

Disadvantages of a Sequential File:

1.   It is not good for interactive transactions.
2.   It has high overheads in file processing for simple queries.

Index Organization
This type of file organization aims to reduce the access time; it may not reduce the storage requirement of a file. An index maps the key space to the record space.

Note that the index for a file may be created on the primary or secondary keys.

There are three types of Index which are:

1.   Primary Index: This is an ordered file of fixed-length index records. Each index record has two fields:
    A.   One holds the primary key of a data file record.
    B.   The other holds a pointer to the disk block where that record is stored.

2.   Non-Dense Index:  This is a type of index in which the number of entries in the index file is far less than (<<) the number of records in the data file.

3.   Dense Index:  This is a type of index in which the number of entries in the index file is equal to the number of records in the data file.
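
The sketch below illustrates a non-dense primary index in Python: one index entry per block, holding the anchor (first) key of the block and the block number. The data file, keys and block layout are invented for the example.

```python
import bisect

# Data file sorted on the primary key, grouped into blocks (invented data).
blocks = [
    [(10, "A"), (20, "B"), (30, "C")],
    [(40, "D"), (50, "E"), (60, "F")],
    [(70, "G"), (80, "H"), (90, "I")],
]

# Non-dense primary index: one (anchor key, block number) entry per block.
index = [(blk[0][0], i) for i, blk in enumerate(blocks)]
anchor_keys = [k for k, _ in index]

def fetch(key):
    pos = bisect.bisect_right(anchor_keys, key) - 1   # last anchor key <= search key
    if pos < 0:
        return None
    block_no = index[pos][1]
    for k, value in blocks[block_no]:                 # one block read, then a scan
        if k == key:
            return value
    return None

print(fetch(50))   # 'E'
print(fetch(55))   # None (no such record)
```

A dense index would instead hold one entry per record, so the final scan of the block would not be needed.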

The index sequential file has two parts:

1.   Index Part ===> This part stores pointers to the actual record locations on the disk.
2.   Data Part ===> This part holds the actual data records and is made up of two distinct areas:
    A.   Prime Area: This area holds the records of the file.
    B.   Overflow Area: This area holds records when the prime area overflows.

The Virtual Storage Access Method (VSAM) has three parts which are:

1.   Index Set
2.   Index Sequence Set
3.   Data Blocks

In the index set, the index records are maintained physically in ascending sequence by primary key value. The index records in the index set are non-dense, which means that there is only one index record for each lower-level index block.

A control interval is a set of data records which are physically stored in ascending key order.

A control area is an ordered set of control intervals and free control intervals for a file.

Distributed free space is the set of free control intervals in a control area. The size of a control area and the number of control intervals may be predefined. At the end of file initialization, the unfilled control intervals are set aside as DISTRIBUTED FREE SPACE.

Random access is access through the index set to the index sequence set and then to the data.

Note that in indexing, the amount of I/O increases with the size of the index. This problem can be minimized by direct file organization, where the address of the desired record can be found directly (with no need for an index or a sequential search). Such files are created using a hashing function, so this is called hashing organization and the files are called hashed files.

Hashing


There are two types of hashing which are:


1.   Static Hashing: This is the type of hashing whereby the address space size is predefined and does not grow or shrink with the file.
2.   Dynamic Hashing: This is the type of hashing whereby the address space size can grow and shrink with the file.

Note that in hashing, key space is a set of primary keys while address space is a set of home addresses.

Address Distribution

With hashing, the addresses generated are random; there is no obvious connection between a key and its home address (which is why hashing is sometimes called RANDOMIZING).

The record distribution in the address space can be uniform or random.


A collision is a situation whereby a hashing function generates the same home address for more than one record. One solution to the collision problem is progressive overflow, also known as open addressing.

A bucket is a logical unit of storage in which more than one record can be stored; the whole set of records in a bucket is retrieved in one disk access.

Division (taking the key modulo the size of the address space) is the usual basis of the hashing function.
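
A minimal sketch of static hashing with the division method and progressive overflow; the address space size and the keys are made up for the example.

```python
M = 7                        # size of the address space (assumed)
table = [None] * M           # one record slot per home address

def home_address(key):
    return key % M           # division method

def insert(key):
    addr = home_address(key)
    for i in range(M):                       # probe the following slots on collision
        slot = (addr + i) % M
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("address space is full")

for k in (23, 30, 16):       # all three keys hash to home address 2
    print(k, "->", insert(k))
# 23 -> 2, 30 -> 3, 16 -> 4  (the last two overflow into the next free slots)
```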

Dynamic hashing manages expansion (see the sketch after this list) by:

1.   Splitting a bucket when it becomes full.
2.   Distributing records between old and new buckets.
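
A much-simplified sketch of the split step: when a bucket overflows, a new bucket is created and the records are redistributed using one further bit of each key's hash value. The bucket contents and the bit position are assumptions made only for illustration.

```python
def split_bucket(bucket, level):
    """Redistribute the keys of an overflowing bucket between the old bucket
    and a newly created one, using bit `level` of each key's hash value."""
    old, new = [], []
    for key in bucket:
        (new if (hash(key) >> level) & 1 else old).append(key)
    return old, new

overflowing = [8, 12, 20, 28]          # keys that all hashed to the same bucket so far
old, new = split_bucket(overflowing, level=2)
print(old, new)                        # [8] [12, 20, 28] with CPython's integer hashing
```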

Virtual hashing is a type of hashing that uses multiple hashing functions. These functions are related, and the choice of function depends on the level of the bucket split.


Demerits of Virtual Hashing
1.   It leads to a waste of space.
2.   If two buckets n and n + j are in use, then all buckets between n and n + j must be available to the algorithm for inserting records.
3.   The history of the insertions must be saved in order to access a record, because many related hashing functions are used in the search operation.

It is worth noting that in virtual hashing the position of one bucket is not related to the position of any other bucket.


A database is a collection of related data.

There are implicit properties of a database which are:

1.   A database represents some aspect of the real world sometimes called the mini world or the universe of discourse.

2.   A database is a logically coherent collection of data with some inherent meaning.
3.   A database is designed, built and populated with data for a specific purpose.

Note that a database can be of any size or complexity, and it may be generated and maintained manually or be computerized.

Database Management Systems (DBMS)

This is a collection of programs that enables users to create and maintain a database. A DBMS is a general-purpose software system that facilitates the processes of defining, constructing and manipulating databases for various applications. Defining a database involves specifying the data types, structures and constraints for the data to be stored in the database. Constructing a database is the process of storing the data itself on some storage medium that is controlled by the DBMS. Manipulating a database includes functions such as querying the database to retrieve specific data, updating the database to reflect changes in the mini world and generating reports from the data.
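
The three activities can be illustrated with Python's built-in sqlite3 module; the table and data below are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Defining: specify the data types, structures and constraints.
con.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, grade REAL)")

# Constructing: store the data itself on a storage medium controlled by the DBMS.
con.executemany("INSERT INTO student VALUES (?, ?, ?)",
                [(1, "Ada", 4.0), (2, "Alan", 3.7)])
con.commit()

# Manipulating: query the database and produce a simple report.
for name, grade in con.execute("SELECT name, grade FROM student WHERE grade >= 3.8"):
    print(name, grade)       # Ada 4.0

con.close()
```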

A file is a collection of records that may or may not be ordered at a conceptual level.

Actors on the Scene are those people or personnel whose jobs involve the day-to-day use of a large database.
Workers behind the Scene are those people or personnel whose task is to maintain the database system environment; they are not actively interested in the database content itself.




Categories of Actors on the Scene are:

1.   Database Administrator: In a database environment, the primary resource is the database itself and the secondary resource is the DBMS and related software. The database administrator is responsible for authorizing access to the database, for coordinating and monitoring its use, and for acquiring software and hardware resources as needed.

2.   Database Designers: These people are responsible for identifying the data to be stored in the database and for choosing appropriate structures to represent and store this data. Database designers typically interact with each potential group of users and develop views of the database that meet the data and processing requirements of these groups. The final database design must be capable of supporting the requirements of all user groups.

3.   End Users: These are people whose jobs require access to the database for querying, updating and generating reports.
There are several categories of End Users which are:
    A.   Casual End Users: These are people who occasionally access the database but may need different information each time. They use a sophisticated query language to specify their requests and are typically middle- or high-level managers.
    B.   Naive or Parametric End Users: These are people whose main job function revolves around constantly querying and updating the database using standard types of queries and updates, called canned transactions, that have been carefully programmed and tested. Examples: reservation clerks for airlines and hotels, bank tellers, etc.
    C.   Sophisticated End Users: These are people who familiarize themselves with the facilities of the DBMS so as to implement applications that meet their complex requirements. Examples of sophisticated end users are business analysts, engineers and scientists.
    D.   Stand-Alone End Users: They maintain personal databases by using ready-made program packages that provide easy-to-use menu-based or graphics-based interfaces. For example, the user of a payment-receipt package used in various stores or supermarkets.

System analysts determine the requirements of end users and develop specifications for the canned transactions that meet these requirements.
Application programmers implement these specifications as programs, then test, debug, document and maintain the canned transactions.

Categories of Personnel/Workers behind the Scene are:

1.   DBMS System designers and implementers: These are personnel who design and implement the DBMS modules and interfaces as a software package. The DBMS must interface with other system software such as the operating system and compilers for various programming languages.

2.   Tool Developers: These are workers who design and implement tools, that is, software packages that facilitate database system design and help to improve performance.

3.   Operators and Maintenance Personnel: They are the system administrative personnel who are responsible for the actual running and maintenance of the hardware and software environment of the database system.

There are 3 types of Database Organization which are:

1.  Relational Database Organization
2.  Hierarchical Database Organization
3.  Network Database Organization

Database architecture is typically a client-server system architecture.
The system functionality of the client/server architecture is distributed between two types of modules:

1.   Client Modules
2.   Server Modules

Client Modules: Application programs and user interfaces that access the database typically run in the client module. Hence, the client module handles user interaction and provides user-friendly interfaces such as forms and GUIs (Graphical User Interfaces).

Server Modules:
These modules typically handle data storage, access, search and other functions.

It is worth noting that one fundamental characteristic of the database approach is that it provides some level of data abstraction by hiding the details of data storage that are not needed by most database users.

Client/Server Architecture Concepts

Data Models: A data model is a collection of concepts that can be used to describe the structure of a database; it provides the necessary means to achieve data abstraction. By the structure of a database we mean the data types, relationships and constraints that should hold on the data.

Categories of Data Models:
Data models can be categorized according to the types of concepts they use to describe the database structure:

1.   High Level or Conceptual Data Models: This provides concepts that are close to the way many users perceive data. Conceptual data models use terms such as Entities, Attributes and Relationships.
An entity represents a real world object or concept such as an employee, student or project that is described in the database.
An attribute represents some property of interest that further describes an entity such as the employee's name or student's grades.
A relationship represents an interaction among entities. For Example: a relationship between an employee and a project.
Additional data model concepts such as generalization, specialization and categories may also be used, depending on the designer's approach or interest; data models that include them are referred to as Enhanced Entity-Relationship or object models.

2.   Low-Level or Physical Data Models: These provide concepts that describe the details of how data is stored in the computer. They are meant for computer specialists, not typical end users. Physical data models describe how data is stored by representing information such as record formats, record orderings and access paths.
An access path is a structure that makes the search for particular database records efficient.

3.   Representational or Implementation Data Models: These provide concepts that may be understood by end users but are not too far removed from the way data is organized within the computer. Representational data models hide some details of data storage but can be implemented on a computer system in a direct way. These data models include the Relational Data Model, the Network Data Model and the Hierarchical Data Model. Representational data models represent data by using record structures and are hence sometimes called RECORD-BASED data models.

Database Schema
This is the description of a database which is specified during database design and is not expected to change frequently.
A displayed schema is called a SCHEMA DIAGRAM. Each object in the schema, such as STUDENT or EMPLOYEE, is referred to as a SCHEMA CONSTRUCT.

Note that the data in the database at a particular moment in time is called a database state or snapshot. This is also called the current set of occurrences or instances in the database. The DBMS is partly responsible for ensuring that every state of the database is a valid state, that is, a state that satisfies the structure and constraints specified in the schema. The DBMS stores the description of the schema constructs and constraints, called the META-DATA. The schema is sometimes called the INTENSION, and a database state an EXTENSION.

The three-schema architecture is an architecture for database systems which was proposed to help achieve and visualize the following characteristics of the database approach:

1.   Insulation of programs and data (program-data and program-operation independence)
2.   Support of multiple users views
3.   Use of a catalog to store the database description (schema)

The goal of the three-schema architecture is to separate the user applications from the physical database.

In this architecture, schemas can be defined at the following three levels which are:

1.   The Internal Level: This describes the physical storage structure of the database, the complete details of data storage and the access paths for the database.
 

2.   The Conceptual Level: This describes the structure of the whole database for a community of users. The conceptual schema hides the details of physical storage structures and concentrates on describing entities, data types, relationships, user operations and constraints. A high level data model or implementation data model can be used at this level.
 

3.   The External or View Level: Each external schema describes the part of the database that a particular user group is interested in and hides the rest of the database from the user group. A high level or implementation data model can be used at this level.

It is worthy to note the following points:

A.   The three-schema architecture is a convenient tool for the user to visualize the schema levels in a database system.
B.   Most DBMSs do not separate the three levels completely but support the three-schema architecture to some extent.
C.   The three schemas are only descriptions of data. The only data that actually exists is at the physical level.

Mappings

This is the process of transforming requests and results between the schema levels.

Data Independence:
This is the capacity to change the schema at one level of the database system without having to change the schema at the next higher level.

There are two types of data independence which are:

1.   Logical Independence: This is the capacity to change the conceptual schema without having to change the external schemas or application programs. The conceptual schema may be changed to expand the database (by adding a record type or data item) or to reduce the database (by removing a record type or data item).

2.   Physical Independence: This is the capacity to change the internal schema without having to change the conceptual (or external) schema. Example: creating additional access structures to improve the performance of retrieval or update.

The two types of mapping create an overhead during the compilation or execution of a query, leading to inefficiencies in the DBMS; hence few DBMSs have implemented the full three-schema architecture.



Disk space management is the means of managing the allocation of disk space to file blocks.

The two problems encountered in disk space management are:

1.   Slower disk access time compared to memory access time.
2.   The large number of blocks to be dealt with compared to the space available in main memory.

A good space management mechanism should take the following into consideration:

1.   Disk utilization
2.   Ability to make use of multi-track and multi-sector transfers
3.   Processing speed of allocation and deallocation of blocks
4.   Main memory requirement of a given algorithm

The two types of disk space allocation are:

1.   Contiguous
2.   Non Contiguous

The two types of non-contiguous allocation (contrasted in the sketch after this list) are:

1.   Chaining
2.   Indexing
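
A toy sketch contrasting the two schemes (the block numbers are made up): with chaining, each block records the number of the next block of the file, so the chain must be followed in order; with indexing, a separate index block lists all of the file's blocks and gives direct access to any of them.

```python
# Chaining: each allocated block stores a pointer to the next block of the file.
next_block = {4: 9, 9: 2, 2: None}      # block number -> next block number

def chained_blocks(first):
    block = first
    while block is not None:
        yield block
        block = next_block[block]

print(list(chained_blocks(4)))          # [4, 9, 2] -- blocks reached only in sequence

# Indexing: one index block holds the numbers of all of the file's blocks.
index_block = [4, 9, 2]
print(index_block[1])                   # direct access to the file's 2nd block -> 9
```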
