A file is a collection or set (ordered or unordered) of data elements stored on a storage medium.
A field is the smallest (fixed) indivisible logical unit of a file; it holds part of some data value.
A record is a set of logically related fields. A record size may be fixed or variable.
A file can also be defined as a set of logically related fields or a collection of records.
There are different operations that can be carried out on a file, such as read, write, rewind, open, modify and delete.
It is worth noting that to access a file stored on magnetic tape, unlike a disk file, the file has to be searched sequentially from the beginning until the desired record is reached.
The access path to a record on a disk is given by:
1.   Disk Number
2.   Cylinder Number
3.   Track Number
4.   Sector Number
The unit of data transfer between memory and disk is known as a BLOCK. The blocking factor, denoted Bfr, is the number of records that fit in one block.
Bfr = ⌊B / R⌋ (rounded down to a whole number of records), where:
B ===>   Disk Block Size
R ===>   File Record Size
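For example (illustrative figures, not from the text): if B = 512 bytes and R = 100 bytes, then Bfr = ⌊512 / 100⌋ = 5 records per block, leaving 12 bytes of each block unused.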
Parameters of a Disk are:
1.   Seek Time: This is the time it takes to move the read/write arm to the correct cylinder. Seek time is the largest cost component, and the average seek time is usually taken to be the time needed to traverse one-third of the cylinders.
2.   Rotational Latency Time: This is the time the disk unit takes to rotate until the beginning of the sector where the file records are stored is positioned under the read/write head.
3.   Block Transfer Time: This is the time for the read/write head to pass over a disk block. During the block transfer time, the bytes of the block are transferred between the disk and main memory.
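Putting the three parameters together gives a rough estimate: the time to read one randomly located block is approximately s + r + btt, where s is the (average) seek time, r the rotational latency time and btt the block transfer time.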
One of the main objectives of file organization is to speed up file retrieval, that is, to reduce the I/O time.
There are 3 Basic Categories of File Organization which are:
1.   Sequential Organization
2.   Index Organization
3.   Random Organization
 
In sequential organization records are written consecutively when the file is created.
Records in a sequential file can be stored in two ways.
A.   Pile File
B.   Sorted File
In a pile file, records are placed one after another as they arrive, so there is no ordering at all. The total time to fetch (read) a record from a pile file depends on the seek time (s), the rotational latency time (r), the block transfer time (btt) and the number of blocks (b) in the file.
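As a minimal sketch of this cost (illustrative figures and names, assuming the blocks are contiguous so only one seek and one rotational delay are paid, and that on average half the blocks are scanned before the record is found):

def pile_file_fetch_time(s, r, btt, b):
    # Rough expected time to fetch one record from a pile (unordered) file:
    # one seek, one rotational delay, then on average b/2 block transfers.
    return s + r + (b / 2) * btt

# Made-up figures: 12 ms seek, 4 ms latency, 0.5 ms per block, 2000 blocks.
print(pile_file_fetch_time(12, 4, 0.5, 2000))   # about 516 ms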
File reorganization is the process whereby all records that are marked for deletion are physically removed and all inserted records are moved to their correct locations.
In a sorted file, records are placed in ascending or descending order of the values of the primary key.
Sorted Sequential File: In a sorted sequential file, a new record is inserted at the end of the file and then moved to its correct location according to the ascending or descending order of the key field. Records are stored in the order of the values of the key field. Note that a sequential file usually has an overflow area; this area avoids having to sort the file after every deletion, insertion and/or modification. The overflow area is not itself sorted; it is a pile file with fixed-size records.
It is worth noting that the order of record storage determines the order of retrieval. Each operation on a sequential file generates a new version of the file. The intensity or frequency of use of a sequential file is measured by a parameter called the HIT RATIO.
The hit ratio can be defined as the ratio of the number of records accessed in responding to a query to the total number of records in the file.
Note that a high hit ratio is desirable for a sequential file: it means that a large proportion of the records are accessed in responding to a query, so a sequential pass over the file is worthwhile. Also note that interactive transactions have a very low hit ratio.
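For illustration (hypothetical figures): a batch run that reads 8,000 of a file's 10,000 records has a hit ratio of 8,000 / 10,000 = 0.8, whereas an interactive query that retrieves a single record from the same file has a hit ratio of only 1 / 10,000 = 0.0001.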
Advantages of a Sequential File:
1.   It is good for batch transactions
2.   It is simple to implement
3.   It is good for report generation, statistical computation and inventory control.
Disadvantages of a Sequential File:
1.   It is not good for interactive transactions.
2.   It has high overheads in file processing for simple queries.
Index Organization
This type of file organization aims to reduce the access time; it does not necessarily reduce the storage requirement of a file. An index maps the key space to the record space.
Note that the index for a file may be created on the primary or secondary keys.
There are three types of Index which are:
1.   Primary Index: This is an ordered file of fixed-length index records. Each index record has two fields:
    A.   One holds the primary key of a data file record.
    B.   The other holds a pointer to the disk block where that record is stored.
2.   Non-Dense Index: This is a type of index in which the number of entries in the index file is far less than (<<) the number of records in the data file.
3.   Dense Index: This is a type of index in which the number of entries in the index file is equal to the number of records in the data file.
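A minimal sketch of a non-dense primary index (the data, block layout and names are illustrative, not from the text): one index entry is kept per data block, holding that block's first (anchor) key and the block number; a lookup binary-searches the index and then scans only the one candidate block.

import bisect

data_blocks = [                                   # each block holds records sorted by key
    [(5, "A"), (12, "B"), (19, "C")],
    [(23, "D"), (31, "E"), (40, "F")],
    [(44, "G"), (52, "H"), (60, "I")],
]

# One index entry per block: (anchor key of the block, block number)
index = [(block[0][0], i) for i, block in enumerate(data_blocks)]

def lookup(key):
    anchors = [k for k, _ in index]
    pos = bisect.bisect_right(anchors, key) - 1   # last anchor key <= search key
    if pos < 0:
        return None                               # key smaller than every anchor
    for k, value in data_blocks[index[pos][1]]:   # scan the single candidate block
        if k == key:
            return value
    return None

print(lookup(31))   # E
print(lookup(7))    # None (no such key)

A dense index would instead keep one entry per record, so it grows with the data file but every key value appears directly in the index.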
The index sequential file has two parts which are:
1.   Index Part ===> This part stores pointers to the actual record locations on the disk.
2.   Data Part ===> This part holds the actual data records and is made up of two distinct areas:
    A.   Prime Area: This area holds the records of the file.
    B.   Overflow Area: This area holds records when the prime area overflows.
Virtual Storage Access Method has three parts which are:
1.   Index Set
2.   Index Sequence Set
3.   Data Blocks
In the index set, the index records are maintained physically in ascending sequence by primary key value. The index records in the index set are non-dense, which means that there is only one index record for each lower-level index block.
A control interval is a set of data records which are physically stored in ascending order of key values.
Control Area is an ordered set of control intervals and free control intervals for a file.
Distributed free space is the set of free control intervals in a control area. The size of a control area and the number of control intervals may be predefined. At the end of file initialization, the unfilled control intervals are set aside as DISTRIBUTED FREE SPACE.
Random access is access through the index set to the index sequence set and then to the data.
Note that in indexing, the amount of I/O increases with the size of the index. This problem can be minimized by direct file organization, where the address of the desired record can be found directly (there is no need for indexing or a sequential search). Such files are created using some hashing function, so this is called hashing organization and the files are called hashed files.
Hashing
There are two types of hashing which are:
1.   Static Hashing: This is the type of hashing in which the address space size is predefined and does not grow or shrink with the file.
2.   Dynamic Hashing: This is the type of hashing in which the address space size can grow and shrink with the file.
Note that in hashing, the key space is the set of primary keys while the address space is the set of home addresses.
Address Distribution
With hashing, the address generated is random.
There is no obvious connection between the key and the home address (this is why hashing is sometimes called RANDOMIZING).
Record distribution in the address space can be uniform or random.
A collision is a situation whereby a hashing function generates the same home address for more than one record. The solution to the problem of collision is the use of progressive overflow, also known as open addressing.
A bucket is a logical unit of storage in which more than one record can be stored; the whole set of records in a bucket is retrieved in one disk access.
Division (taking the key modulo the address space size) is the basis of hashing.
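A minimal sketch of division-based hashing with progressive overflow (the table size, keys and names are illustrative, not from the text; each home address holds one record for simplicity):

M = 7                                   # fixed (static) address space size
table = [None] * M

def home_address(key):
    return key % M                      # the division method

def insert(key):
    addr = home_address(key)
    for step in range(M):               # progressive overflow: probe successive
        slot = (addr + step) % M        # addresses, wrapping around the table
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("address space full")

def search(key):
    addr = home_address(key)
    for step in range(M):
        slot = (addr + step) % M
        if table[slot] is None:         # an empty slot means the key is absent
            return None
        if table[slot] == key:
            return slot
    return None

for k in (10, 17, 24):                  # 10, 17 and 24 all hash to home address 3
    insert(k)
print(search(24))                       # stored at address 5 after two overflow steps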
Dynamic hashing manages expansion by:
1.   Splitting a bucket when it becomes full.
2.   Distributing records between old and new buckets.
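A simplified illustration of such a split (a generic sketch, not the text's exact algorithm: one extra bit of the hash value decides which records move to the new bucket):

def split_bucket(bucket, level):
    # Records whose bit number `level` of the hash value is 0 stay in the
    # old bucket; the others are redistributed into the new bucket.
    old, new = [], []
    for key in bucket:
        (new if (hash(key) >> level) & 1 else old).append(key)
    return old, new

full_bucket = [12, 7, 20]               # a bucket that has become full
old, new = split_bucket(full_bucket, level=0)
print(old, new)                         # e.g. [12, 20] [7]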
Virtual hashing is a type of hashing that uses multiple hashing functions. These functions are related, and the selection of the function to use depends on the level of the bucket split.
Demerits of Virtual Hashing
1.   It leads to a waste of space.
2.   If two buckets n and n + j are in use, then all buckets between n and n + j must be available to the algorithm for inserting records.
3.   The history of the insertions must be saved in order to access a record, because many related hashing functions are used in the search operation.
It is worth noting that in virtual hashing the position of one bucket is not related to the position of any other bucket.
A database is a collection of related data.
There are implicit properties of a database which are:
1.   A database represents some aspect of the real world sometimes called the mini world or the universe of discourse.
2.   A database is a logically coherent collection of data with some inherent meaning.
3.   A database is designed, built and populated with data for a specific purpose.
Note that a database can be of any size or complexity and may be generated manually or computerized.
Database Management Systems (DBMS)
This is a collection of programs that enables users to create and maintain a database. A DBMS is a general-purpose software system that facilitates the process of defining, constructing and manipulating databases for various applications. Defining a database involves specifying the data types, structures and constraints for the data to be stored in the database. Constructing a database is the process of storing the data itself on some storage medium that is controlled by the DBMS. Manipulating a database includes performing operations or functions such as querying the database to retrieve specific data, updating the database to reflect changes in the mini world and generating reports from the data.
A file is a collection of records that may or may not be ordered at a conceptual level.
Actors on the Scene are those people or personnel whose jobs involve the day-to-day use of a large database.
Workers behind the Scene are those people or personnel whose task is to maintain the database system environment only; they are not actively interested in the database content itself.
Categories of Actors on the Scene are:
1.   Database Administrator: In a database environment, the primary resource is the database itself and the secondary resource is the DBMS and related software. The database administrator is responsible for authorizing access to the database, for coordinating and monitoring its use, and for acquiring software and hardware.
2.   Database Designers: These people are responsible for identifying the data to be stored in the database and for choosing appropriate structures to represent and store the data. Database designers typically interact with each potential group of users and develop a view of the database that meets the data and processing requirements of the group. The final database design must be capable of supporting the requirements of all user groups.
3.   End Users: These are people whose jobs require access to the database for querying, updating and generating reports.
There are several categories of End Users which are:
    A.   Casual End Users: They are those who occasionally access the database but may need different information each time. They use a sophisticated query language to specify their requests and are typically middle- or high-level managers.
    B.   Naive or Parametric End Users: These are people whose main job function revolves around constantly querying and updating the database using standard types of queries and updates, called canned transactions, that have been carefully programmed and tested. Examples: reservation clerks in airlines and hostels, bank tellers etc.
    C.   Sophisticated End Users: These are people who familiarize themselves with the facilities of the DBMS so as to implement applications that meet their complex requirements. Examples of sophisticated end users are business analysts, engineers and scientists.
    D.   Stand-Alone End Users: They maintain personal databases by using ready-made program packages that provide easy-to-use menu or graphical interfaces. For example: the user of a payment receipt package used in various stores or supermarkets.
System analysts determine the requirements of end users and develop specifications for canned transactions that meet these requirements.
Application programmers implement these specifications as programs, then test, debug, document and maintain these canned transactions.
Categories of Personnel/Workers behind the Scene are:
1.   DBMS System Designers and Implementers: These are personnel who design and implement the DBMS modules and interfaces as a software package. The DBMS must interface with other system software such as the operating system and compilers for various programming languages.
2.   Tool Developers: These are workers who design and implement tools, that is, software packages that facilitate database system design and help improve performance.
3.   Operators and Maintenance Personnel: They are the system administration personnel who are responsible for the actual running and maintenance of the hardware and software environment of the database system.
There are 3 types of Database Organization which are:
1.  Relational Database Organization
2.  Hierarchical Database Organization
3.  Network Database Organization
Database architecture is a client - server system architecture.
The system functionality of the client/server system architecture is distributed between two types of modules which are:
1.   Client Modules
2.   Server Modules
Client Modules: These are the application programs and user interfaces that access the database; they typically run on the client machine. The client module handles user interaction and provides a user-friendly interface such as forms and a GUI (Graphical User Interface).
Server Modules: These modules typically handle data storage, access, search and other functions.
It is worth noting that one fundamental characteristic of the database approach is that it provides some level of data abstraction by hiding the details of data storage that are not needed by most database users.
Client/Server Architecture Concepts
Data Models: A data model is a collection of concepts that provides the necessary means to achieve abstraction. By the structure of a database one means the data types, relationships and constraints that should hold for the data.
Categories of Data Models:
Data models can be categorized according to the type of concepts they use to describe the DB structure thus:
1.   High Level or Conceptual Data Models: These provide concepts that are close to the way many users perceive data. Conceptual data models use terms such as entities, attributes and relationships.
An entity represents a real world object or concept such as an employee, student or project that is described in the database.
An attribute represents some property of interest that further describes an entity such as the employee's name or student's grades.
A relationship represents an interaction among entities. For Example: a relationship between an employee and a project.
Other additional data model concepts, such as generalization, specialization and categories, which could be used depending on the designer's approach or interest, are referred to as Enhanced Entity Relationship or Object Modelling concepts.
2.   Low Level or Physical Data Models: These provide concepts that describe the details of how data is stored in the computer. They are meant for computer specialists and not for typical end users. Physical data models describe how data is stored by representing information such as record formats, record orderings and access paths.
An access path is a structure that makes the search for a particular database record efficient.
3.   Representational or Implementation Data Models: These provide concepts that may be understood by end users but are not too far from the way the data is organized within the computer. Representational data models hide some details of data storage but can be implemented on a computer system in a direct way. These data models include the Relational Data Model, the Network Data Model and the Hierarchical Data Model. Representational data models represent data by using record structures and are hence sometimes called RECORD-BASED data models.
Database Schema
This is the description of a database which is specified during database design and is not expected to change frequently.
A displayed schema is called a SCHEMA DIAGRAM. Each object in the schema, such as STUDENT or EMPLOYEE, is referred to as a SCHEMA CONSTRUCT.
Note that the data in the database at a particular moment in time is called a database state or snapshot. This is also called the current set of occurrences or instances in the database. The DBMS is partly responsible for ensuring that every state of the database is a valid state, that is, a state that satisfies the structure and constraints specified in the schema. The DBMS stores the description of the schema constructs and constraints, called the META-DATA. The schema is sometimes called the INTENSION and a database state an EXTENSION.
The three schema architecture is an architecture for database systems which was proposed to help achieve and visualize the following characteristics of the database approach:
1.   Insulation of programs and data (program-data and program-operation independence)
2.   Support of multiple users views
3.   Use of a catalog to store the database description (schema)
The goal of the three schema architecture is to separate the user applications from the physical database.
In this architecture, schemas can be defined at the following three levels which are:
1.   The Internal Level: This describes the physical storage structure of the database, that is, the details of data storage and the access paths for the database.
 
2.   The Conceptual Level: This describes the structure of the whole database for a community of users. The conceptual schema hides the details of physical storage structures and concentrates on describing entities, data types, relationships, user operations and constraints. A high level data model or an implementation data model can be used at this level.
 
3.   The External or View Level: Each external schema describes the part of the database that a particular user group is interested in and hides the rest of the database from that user group. A high level data model or an implementation data model can be used at this level.
It is worth noting the following points:
A.   The three schema architecture is a convenient tool for the user to visualise the schema levels in a database system.
B.   Most DBMSs do not separate the three levels completely but support the three schema architecture to some extent.
C.   The 3 schemas are only descriptions of data. The only data that actually exists is at the physical level.
Mappings
This is the process of transforming requests and results from one schema level to another.
Data Independence: This is the capacity to change the schema at one level of the database system without having to change the schema at the next higher level.
There are two types of data independence which are:
1.   Logical Independence: This is the capacity to change the conceptual schema without having to change the external schemas or application programs. The conceptual schema may be changed to expand the database (by adding a record type or data item) or to reduce the database (by removing a record type or data item).
2.   Physical Independence: This is the capacity to change the internal schema without having to change the conceptual (or external) schema. Example: creating an additional access structure to improve the performance of retrieval or update.
The two types of mapping create an overhead during the compilation or execution of a query, leading to inefficiencies in the DBMS; hence few DBMSs have implemented the full three schema architecture.
Disk space management is the means of managing the allocation of disk space to file blocks.
The two problems encountered in disk space management are:
1.   Disk access time is slow compared to memory access time.
2.   The number of disk blocks to be dealt with is large compared to the number of blocks available in main memory.
A good space management mechanism should take the following into consideration:
1.   Disk Utilization
2.   Ability to make use of multi-track and multi-sector transfers
3.   Processing speed of allocation and deallocation of blocks
4.   Main memory requirement for a given algorithm
Two types of disk space allocation are:
1.   Contiguous
2.   Non-Contiguous
Two types of non contiguous allocation are:
1.   Chaining
2.   Indexing 
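A minimal sketch of the two non-contiguous methods (the block numbers and names are illustrative, using a dictionary as the "disk"): with chaining, each block stores a pointer to the file's next block; with indexing, a separate index block lists all of the file's block numbers so any block can be reached directly.

disk = {}                                   # block number -> (data, next block number)

def write_chained(blocks, data_chunks):
    # Chaining: each block holds a pointer to the next block of the file.
    for i, blk in enumerate(blocks):
        nxt = blocks[i + 1] if i + 1 < len(blocks) else None
        disk[blk] = (data_chunks[i], nxt)

def read_chained(first_block):
    blk, out = first_block, []
    while blk is not None:                  # follow the chain of pointers
        data, blk = disk[blk]
        out.append(data)
    return out

write_chained([9, 4, 7], ["rec1", "rec2", "rec3"])
print(read_chained(9))                      # ['rec1', 'rec2', 'rec3']

# Indexing: an index block lists every block of the file, so the third
# block can be read directly without following the chain.
index_block = [9, 4, 7]
print(disk[index_block[2]][0])              # 'rec3'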


 