A file is a collection or set (ordered or unordered) of data elements stored on a storage medium.
A field is the smallest (fixed) indivisible logical unit of a file which holds a part of some data value.
A record is a set of logically related fields. A record size may be fixed or variable.
A file can also be defined as a set of logically related fields or a collection of records.
Different operations can be carried out on a file, such as read, write, rewind, open, modify and delete.
It is worth noting that to access a file stored on magnetic tape, unlike disk files, the file has to be searched sequentially from the beginning before reaching the desired record.
The access path for a disk consists of:
1. Disk Number
2. Cylinder Number
3. Track Number
4. Sector Number
The unit of data transfer between memory and disk is known as a BLOCK. The blocking factor is the number of records in a block and is denoted by Bfr.
Bfr = ⌊B/R⌋ where:
B ===> Disk Block Size
R ===> File Record Size
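The formula above can be sketched directly; since records are normally not split across blocks, Bfr is the floor of B/R (the block and record sizes below are hypothetical examples, not from the text):

```python
def blocking_factor(block_size, record_size):
    """Bfr = floor(B / R): whole records per block (records unspanned)."""
    return block_size // record_size  # integer division = floor for positive sizes

# Hypothetical sizes: 512-byte blocks and 100-byte records -> 5 records per block
print(blocking_factor(512, 100))
```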
Parameters of a Disk are:
1. Seek Time:
This is the time it takes to move the read/write arm to the correct cylinder. Seek time is the largest cost component, and the average seek time is roughly the time it takes the arm to traverse one third of the cylinders.
2. Rotational Latency Time:
This is the time the disk unit takes to rotate until the read/write head is positioned at the beginning of the sector where the file records are stored.
3. Block Transfer Time:
This is the time it takes the read/write head to pass over a disk block. During the block transfer time, the bytes of data are transferred between the disk and main memory.
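Taken together, the three parameters give the expected cost of reading one randomly chosen block: average seek plus average rotational latency plus one block transfer. A small sketch (all timing figures below are hypothetical):

```python
def block_access_time(seek_ms, latency_ms, btt_ms):
    """Expected time (ms) to read one randomly chosen disk block."""
    return seek_ms + latency_ms + btt_ms

# Hypothetical drive: 9 ms average seek, 4.2 ms latency, 0.8 ms per block,
# giving roughly 14 ms per random block read
print(block_access_time(9.0, 4.2, 0.8))
```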
One of the main objectives of file organization is to speed up file retrieval, that is, to reduce the I/O time.
There are 3 Basic Categories of File Organization which are:
1. Sequential Organization
2. Index Organization
3. Random Organization
In sequential organization records are written consecutively when the file is created.
Records in a sequential file can be stored in two ways.
A. Pile File
B. Sorted File
In a pile file, records are placed one after another as they arrive, with no ordering at all. The total time to fetch (read) a record from a pile file depends on the seek time (s), rotational latency time (r), block transfer time (btt) and the number of blocks (b) in the file.
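Because a pile file has no ordering, a successful search reads half of the file's blocks on average, so the expected fetch time is roughly s + r + (b/2)·btt. A sketch with hypothetical figures:

```python
def pile_fetch_time(s, r, btt, b):
    """Average fetch time for a pile file: seek + latency + scan half the blocks."""
    return s + r + (b / 2) * btt

# Hypothetical: 9 ms seek, 4.2 ms latency, 0.8 ms per block, 1000-block file;
# about 500 blocks are scanned on average
print(pile_fetch_time(9.0, 4.2, 0.8, 1000))
```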
File reorganization is a process whereby all records marked for deletion are removed and all inserted records are moved to their correct locations.
In a sorted file, records are placed in ascending or descending order of the values of the primary key.
Sorted Sequential File:
In a sorted sequential file, a record is inserted at the end of the file and then moved to its correct location according to the ascending or descending order. Records are stored in the order of the values of the key field. Note that a sequential file usually has an overflow area. This area avoids having to sort the file after every deletion, insertion and/or modification. The overflow area is not itself sorted; it is a pile file with fixed-size records.
It is worth noting that the order of record storage determines the order of retrieval. Each operation on a sequential file generates a new version of the file. The intensity or frequency of use of a sequential file is measured by a parameter called the HIT RATIO.
The hit ratio can be defined as the ratio of the number of records accessed in responding to a query to the total number of records in the file.
Note that a high hit ratio value is desirable for a sequential file; it means that a large proportion of the records read are actually used in responding to the query. Also note that interactive transactions have a very low hit ratio.
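The definition above can be sketched directly (the record counts are hypothetical):

```python
def hit_ratio(records_accessed, total_records):
    """Fraction of the file's records accessed while answering a query."""
    return records_accessed / total_records

# A batch job touching 9,000 of 10,000 records suits a sequential file,
# while an interactive lookup touching one record has a very low hit ratio.
print(hit_ratio(9000, 10000))  # -> 0.9
print(hit_ratio(1, 10000))     # -> 0.0001
```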
Advantages of a Sequential File:
1. It is good for batch transactions
2. It is simple to implement
3. It is good for report generation, statistical computation and inventory control.
Disadvantages of a Sequential File:
1. It is not good for interactive transactions.
2. It has high overheads in file processing for simple queries.
Index Organization
This type of file organization tries to reduce the access time; it may not reduce the storage requirement of a file. An index maps the key space to the record space.
Note that the index for a file may be created on the primary or secondary keys.
There are three types of Index which are:
1. Primary Index: This is an ordered file of fixed-length index records. Each index record has two fields:
A. One holds the primary key of a data file record.
B. The other holds a pointer to the disk block where that record is stored.
2. Non-Dense Index:
This is a type of index whereby the number of entries in the index file is far less than (<<) the number of records in the data file.
3. Dense Index: This is a type of index whereby the number of entries in the index file is equal to the number of records in the data file.
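The difference can be illustrated with a minimal non-dense index sketch: one index entry per data block rather than one per record (the block contents below are hypothetical):

```python
import bisect

# Sorted data file, three blocks of records (hypothetical keys)
data_blocks = [[2, 5, 9], [12, 15, 20], [23, 28, 31]]
# Non-dense index: one entry per block, holding that block's first key
index = [block[0] for block in data_blocks]     # [2, 12, 23]

def lookup(key):
    """Binary-search the index for the candidate block, then scan that block."""
    i = bisect.bisect_right(index, key) - 1     # last index entry <= key
    return i >= 0 and key in data_blocks[i]

print(lookup(15))  # True  (found in the second block)
print(lookup(16))  # False (would be in the second block, but absent)
```

A dense index would instead keep one (key, pointer) entry for every record in the file.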
The index sequential file has two parts which are:
1. Index Part ===> This part stores pointers to the actual record location on the disk.
2. Data Part ===> This part holds actual data records and it is made of two distinct areas which are:
A. Prime Area: This area holds the record of the file.
B. Overflow Area: This area holds the record of the file when the prime area overflows.
The Virtual Storage Access Method (VSAM) has three parts which are:
1. Index Set
2. Index Sequence Set
3. Data Blocks
In the index set, the index records are maintained physically in ascending sequence by primary key value. The index records in the index set are non-dense, which means that there is only one index record for each lower-level index block.
A control interval is a set of data records which are physically stored in ascending key order.
A control area is an ordered set of control intervals and free control intervals for a file.
Distributed free space is the set of free control intervals in a control area. The size of a control area and the number of control intervals may be predefined. At the end of file initialization, the unfilled control intervals are set aside as DISTRIBUTED FREE SPACE.
Random access is access through the index set to the index sequence set and then to the data.
Note that in indexing, the amount of I/O increases with the size of the index. This problem can be minimized by direct file organization, where the address of the desired record can be found directly (no need for indexing or a sequential search). Such files are created using some hashing function, so this is called hashing organization and the files are called hashed files.
Hashing
There are two types of hashing which are:
1. Static Hashing: This is the type of hashing whereby the address space size is predefined and does not grow or shrink with the file.
2. Dynamic Hashing: This is the type of hashing whereby the address space size can grow and shrink with the file.
Note that in hashing, key space is a set of primary keys while address space is a set of home addresses.
Address Distribution
With hashing, the address generated is random.
There is no obvious connection between a key and its home address (this is why hashing is sometimes called RANDOMIZING).
Record distribution in the address space can be uniform or random.
A collision is a situation whereby a hashing function generates the same home address for more than one record. One solution to the problem of collisions is the use of progressive overflow, also known as open addressing.
A bucket is a logical unit of storage in which more than one record is stored; the set of records is retrieved in one disk access.
Division (taking the remainder of the key divided by the address space size) is the basis of hashing.
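A minimal sketch of division hashing with progressive overflow (open addressing); the table size and keys below are hypothetical:

```python
M = 7                     # hypothetical address space size (number of slots)
table = [None] * M

def insert(key):
    """Home address by division; on collision, probe the following slots."""
    home = key % M        # division method: remainder is the home address
    for step in range(M):
        slot = (home + step) % M
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("address space full")

print(insert(10))  # home address 3, slot free -> stored at slot 3
print(insert(17))  # home address 3 again (collision) -> overflows to slot 4
```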
Dynamic hashing manages expansion by:
1. Splitting a bucket when it becomes full.
2. Distributing records between old and new buckets.
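A minimal sketch of one such split, assuming records are redistributed by the next bit of the hashed key (the bucket contents are hypothetical):

```python
def split(bucket, bit):
    """Redistribute a full bucket by hash bit `bit`: bit 0 stays, bit 1 moves."""
    old = [k for k in bucket if not (k >> bit) & 1]   # keep in the old bucket
    new = [k for k in bucket if (k >> bit) & 1]       # move to the new bucket
    return old, new

# Hypothetical full bucket; bit 1 separates 4,5 (100,101) from 6,7 (110,111)
old, new = split([4, 5, 6, 7], bit=1)
print(old)  # [4, 5]
print(new)  # [6, 7]
```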
Virtual hashing is a type of hashing that uses multiple hashing functions. These functions are related, and the selection of the function depends on the level of the bucket split.
Demerits of Virtual Hashing
1. It leads to a waste of space.
2. If two buckets n and n + j are in use, then all buckets between n and n + j must be available to the algorithm for inserting records.
3. The history of the insertions must be saved in order to access a record, because many related hashing functions are used in the search operation.
It is worth noting that in virtual hashing the position of one bucket is not related to the position of any other bucket.
A database is a collection of related data.
There are implicit properties of a database which are:
1. A database represents some aspect of the real world, sometimes called the mini-world or the universe of discourse.
2. A database is a logically coherent collection of data with some inherent meaning.
3. A database is designed, built and populated with data for a specific purpose.
Note that a database can be of any size or complexity and may be generated manually or computerized.
Database Management Systems (DBMS)
This is a collection of programs that enables users to create and maintain a database. A DBMS is a general-purpose software system that facilitates the processes of defining, constructing and manipulating databases for various applications. Defining a database involves specifying the data types, structures and constraints for the data to be stored in the database. Constructing a database is the process of storing the data itself on some storage medium that is controlled by the DBMS. Manipulating a database includes performing operations such as querying the database to retrieve specific data, updating the database to reflect changes in the mini-world and generating reports from the data.
A file is a collection of records that may or may not be ordered at a conceptual level.
Actors on the scene are those people whose jobs involve the day-to-day use of a large database.
Workers behind the scene are those people whose task is to maintain the database system environment; they are not actively interested in the database contents themselves.
Categories of Actors on the Scene are:
1. Database Administrator:
In a database environment, the primary resource is the database itself and the secondary resource is the DBMS and related software. The database administrator is responsible for authorizing access to the database, for coordinating and monitoring its use, and for acquiring software and hardware resources.
2. Database Designers: These people are responsible for identifying the data to be stored in the database and for choosing appropriate structures to represent and store the data. Database designers typically interact with each potential group of users and develop a view of the database that meets the data and processing requirements of the group. The final database design must be capable of supporting the requirements of all user groups.
3. End Users: These are people whose jobs require access to the database for querying, updating and generating reports.
There are several categories of End Users which are:
A. Casual End Users: They are those who occasionally access the database but may need different information each time. They use a sophisticated query language to specify their requests and are typically middle- or high-level managers.
B. Naive or Parametric End Users: These are people whose main job function revolves around constantly querying and updating the database using standard types of queries and updates, called canned transactions, that have been carefully programmed and tested. Examples: reservation checks in airlines and hotels, bank teller checks, etc.
C. Sophisticated End Users: These are people who familiarize themselves with the facilities of the DBMS in order to implement applications that meet their complex requirements. Examples of sophisticated end users are business analysts, engineers and scientists.
D. Stand-Alone End Users: They maintain personal databases by using ready-made program packages that provide easy-to-use menu or graphical interfaces. For example: the user of a payment receipt package used in various stores or supermarkets.
System analysts determine the requirements of end users and develop specifications for canned transactions that meet those requirements.
Application programmers implement these specifications as programs, then test, debug, document and maintain these canned transactions.
Categories of Personnel/Workers behind the Scene are:
1. DBMS System Designers and Implementers: These are personnel who design and implement the DBMS modules and interfaces as a software package. The DBMS must interface with other system software such as the operating system and compilers for various programming languages.
2. Tool Developers: These are workers who design and implement tools, that is, software packages that facilitate database system design and help improve performance.
3. Operators and Maintenance Personnel: They are the system administration personnel responsible for the actual running and maintenance of the hardware and software environment of the database system.
There are 3 types of Database Organization which are:
1. Relational Database Organization
2. Hierarchical Database Organization
3. Network Database Organization
Database architecture is typically a client-server system architecture.
The system functionality of the client/server system architecture is distributed between two types of modules which are:
1. Client Modules
2. Server Modules
Client Modules: Application programs and user interfaces that access the database typically run in the client module. Hence, the client module handles user interaction and provides user-friendly interfaces such as forms and GUIs (Graphical User Interfaces).
Server Modules: These modules typically handle data storage, access, search and other functions.
It is worth noting that one fundamental characteristic of the database approach is that it provides some level of data abstraction by hiding the details of data storage that are not needed by most database users.
Client/Server Architecture Concepts
Data Models: A data model is a collection of concepts that can be used to describe the structure of a database; it provides the necessary means to achieve abstraction. By the structure of a database one means the data types, relationships and constraints that should hold for the data.
Categories of Data Models:
Data models can be categorized according to the types of concepts they use to describe the database structure:
1. High-Level or Conceptual Data Models:
These provide concepts that are close to the way many users perceive data. Conceptual data models use terms such as entities, attributes and relationships.
An entity represents a real world object or concept such as an employee, student or project that is described in the database.
An attribute represents some property of interest that further describes an entity such as the employee's name or student's grades.
A relationship represents an interaction among entities. For Example: a relationship between an employee and a project.
Additional data model concepts such as generalization, specialization and categories may also be used, depending on the designer's approach or interest; models that include them are referred to as Enhanced Entity Relationship or Object Modelling.
2. Low-Level or Physical Data Models:
These provide concepts that describe the details of how data is stored in the computer. They are meant for computer specialists, not typical end users. Physical data models describe how data is stored by representing information such as record formats, record orderings and access paths.
An access path is a structure that makes the search for a particular database record efficient.
3. Representational or Implementation Data Models:
These provide concepts that may be understood by end users but are not too far from the way the data is organized within the computer. Representational data models hide some details of data storage but can be implemented on a computer system in a direct way. These data models include the relational data model, the network data model and the hierarchical data model. Representational data models represent data by using record structures and are hence sometimes called RECORD-BASED data models.
Database Schema
This is the description of a database which is specified during database design and is not expected to change frequently.
A displayed schema is called a SCHEMA DIAGRAM. Each object in the schema, such as STUDENT or EMPLOYEE, is referred to as a SCHEMA CONSTRUCT.
Note that the data in the database at a particular moment in time is called a database state or snapshot. It is also called the current set of occurrences or instances in the database. The DBMS is partly responsible for ensuring that every state of the database is a valid state, that is, a state that satisfies the structure and constraints specified in the schema. The DBMS stores the description of the schema constructs and constraints, called the META-DATA. The schema is sometimes called the INTENSION, and a database state an EXTENSION.
The three-schema architecture is an architecture for database systems which was proposed to help achieve and visualize the following characteristics of the database approach:
1. Insulation of programs and data (program-data and program-operation independence)
2. Support of multiple users views
3. Use of a catalog to store the database description (schema)
The goal of the three-schema architecture is to separate the user applications from the physical database.
In this architecture, schemas can be defined at the following three levels which are:
1. The Internal Level: This describes the physical storage structure of the database, that is, the details of data storage and the access paths for the database.
2. The Conceptual Level:
This describes the structure of the whole database for a community of
users. The conceptual schema hides the details of physical storage
structures and concentrates on describing entities, data types,
relationships, user operations and constraints. A high level data model
or implementation data model can be used at this level.
3. The External or View Level:
Each external schema describes the part of the database that a
particular user group is interested in and hides the rest of the
database from the user group. A high level or implementation data model
can be used at this level.
It is worthy to note the following points:
A. The three schema architecture is a convenient tool for the user to visualise the schema levels in a database system.
B. Most DBMSs do not separate the three levels completely but support the three-schema architecture to some extent.
C. The three schemas are only descriptions of data; the only data that actually exists is at the physical level.
Mappings
This is the process of transforming requests and results between the schema levels.
Data Independence:
This is the capacity to change the schema at one level of the database
system without having to change the schema at the next higher level.
There are two types of data independence which are:
1. Logical Independence:
This is the capacity to change the conceptual schema without having to change the external schemas or application programs. The conceptual schema may be changed to expand the database (by adding a record type or data items) or to reduce the database (by removing a record type or data items).
2. Physical Independence: This is the capacity to change the internal schema without having to change the conceptual (or external) schemas. Example: creating additional access structures to improve the performance of retrieval or update.
The two types of mapping create an overhead during the compilation or execution of a query or program, leading to inefficiencies in the DBMS; hence few DBMSs have implemented the full three-schema architecture.
Disk space management is the means of managing the allocation of disk space to file blocks.
The two problems encountered in disk space management are:
1. Disk access time is much slower than memory access time.
2. The number of disk blocks to be dealt with is much larger than the number of blocks that can be held in main memory.
A good space management mechanism should take the following into consideration:
1. Disk Utilization
2. Ability to make use of multi-track and multi-sector transfers
3. Processing speed of allocation and deallocation of blocks
4. Main memory requirement for a given algorithm
Two types of disk space allocation are:
1. Contiguous
2. Non Contiguous
Two types of non-contiguous allocation are:
1. Chaining
2. Indexing
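The two non-contiguous schemes can be contrasted with a small sketch: chaining stores, in each block, the number of the next block of the file, while indexing lists all of the file's blocks in a separate index block (the block numbers below are hypothetical):

```python
# Chaining: each allocated block records its successor (None = end of file)
next_block = {4: 9, 9: 2, 2: None}

def chain_blocks(start, links):
    """Follow the chain from the file's first block to its last."""
    blocks, b = [], start
    while b is not None:
        blocks.append(b)
        b = links[b]
    return blocks

# Indexing: one index block lists every block, so any block is one lookup away
index_block = [4, 9, 2]

print(chain_blocks(4, next_block))  # traverses the whole chain: [4, 9, 2]
print(index_block[2])               # direct access to the file's third block
```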