Saturday, December 14, 2019

File Organization and its types

The File is a collection of records. Using the primary key, we can access the records. The type and frequency of access can be determined by the type of file organization which was used for a given set of records.

Objectives of file organization are:
  • It contains an optimal selection of records, i.e., records can be selected as fast as possible.
  • To perform insert, delete or update transaction on the records should be quick and easy.
  • The duplicate records cannot be induced as a result of insert, update or delete. 
  • For the minimal cost of storage, records should be stored efficiently.
Types of File Organizations –

Sequential File Organization

In this method, files are stored sequentially. This method is the easiest method for file organization. This method can be implemented in two ways:
1. Pile File Method
2. Sorted File Method.

1. Pile File Method:

  • In this method, records are stored in a sequence one after another. The records will be inserted in the order in which they are inserted into tables.
  • If updating or deleting of any record has to be done, the record will be searched in the memory blocks. When it is found, then it will be marked for deleting, and the new record is inserted.


To insert a new record R2 in the sequence, then it will be placed at the end of the file.


2. Sorted File Method:

  • In this method, the new record is always inserted at the file's end, and then it will sort the sequence in ascending or descending order. Sorting of records is based on any primary key or any other key.
  • In the case of modification of any record, it will update the record and then sort the file, and lastly, the updated record is placed in the right place.

Suppose a new record R2 has to be inserted in the sequence, then it will be inserted at the end of the file, and then it will sort the sequence.


Advantages of sequential file organization

  • It contains a fast and efficient method for the huge amount of data.
  • In this method, files can be easily stored in cheaper storage mechanism like magnetic tapes.
  • It is simple in design. It requires no much effort to store the data. 
  • This method is used when most of the records have to be accessed like grade calculation of a student, generating the salary slip, etc.
  • This method is used for report generation or statistical calculations.
Disadvantages of sequential file organization
  • It will waste time as we cannot jump on a particular record that is required but we have to move sequentially which takes our time.
  • Sorted file method takes more time and space for sorting the records.
Heap file organization
  • In heap file organization, the records are inserted at the file's end. When the records are inserted, it doesn't require the sorting and ordering of records. It works with data blocks. It is the simplest and most basic type of organization. 
  • The heap file is also known as an unordered file. When the data block is full, the new record is stored in some other block. This new data block need not to be the very next data block, but it can select any data block in the memory to store new records.
  • In the file, every record has a unique id, and every page in a file is of the same size. It is the DBMS responsibility to store and manage the new records.

  • If we want to search, update or delete the data in heap file organization, then we need to traverse the data from staring of the file till we get the requested record.
  • If the database is very large then searching, updating or deleting of record will be time-consuming because there is no sorting or ordering of records. In the heap file organization, we need to check all the data until we get the requested record.
Advantages of Heap file organization
  • It is a very good method of file organization for bulk insertion. If there is a large number of data which needs to load into the database at a time, then this method is best suited.
  • In case of a small database, fetching and retrieving of records is faster than the sequential record.

Disadvantages of Heap file organization
  • This method is inefficient for the large database because it takes time to search or modify the record.
  • This method is inefficient for large databases.
Hash File Organization
Hash File Organization uses the computation of hash function on some fields of the records. The hash function's output determines the location of disk block where the records are to be placed.

If a record has to be received using the hash key, then the address is generated, and the whole record is retrieved using that address. If a new record has to be inserted, then the address is generated using the hash key and record is directly inserted. The same process is applied in the case of delete and update.

In this method, there is no effort for searching and sorting the entire file. In this method, each record will be stored randomly in the memory.



B+ Tree File Organization
A B+ tree is a balanced binary search tree that follows a multi-level index format. The leaf nodes of a B+ tree denote actual data pointers. B+ tree ensures that all leaf nodes remain at the same height, thus balanced. Additionally, the leaf nodes are linked using a link list; therefore, a B+ tree can support random access as well as sequential access.

  •  B+ tree file organization is the advanced method of an indexed sequential accessmethod. It uses a tree-like structure to store records in File.
  • It uses the same concept of key-index where the primary key is used to sort the records. For each primary key, the value of the index is generated and mapped with the record.
  • The B+ tree is similar to a binary search tree (BST), but it can have more than two children. In this method, all the records are stored only at the leaf node. Intermediate nodes act as a pointer to the leaf nodes. They do not contain any records.

The above B+ tree shows that:
  • There is one root node of the tree, i.e., 25.
  • There is an intermediary layer with nodes. They do not store the actual record. They have only pointers to the leaf node.
  • The nodes to the left of the root node contain the prior value of the root and nodes o the right contain next value of the root, i.e., 15 and 30 respectively.
  • There is only one leaf node which has only values, i.e., 10, 12, 17, 20, 24, 27 and 29.
  • Searching for any record is easier as all the leaf nodes are balanced.
  • In this method, searching any record can be traversed through the single path and accessed easily.

Disadvantages of B+ tree file organization
  • In this method, searching becomes very easy as all the records are stored only in the leaf nodes and sorted the sequential linked list.
  • Traversing through the tree structure is easier and faster.
  • The size of the B+ tree has no restrictions, so the number of records can increase or decrease and the B+ tree structure can also grow or shrink.
  • It is a balanced tree structure, and any insert/update/delete does not affect the performance of tree.
This method is inefficient for the static method.

Indexed sequential access method (ISAM)

 ISAM method is an advanced sequential file organization.In this method, records are stored in the file using the primary key. An index value is generated for each primary key and mapped with the record. This index contains the address of the record in the file. If any record has to be retrieved based on its index value, then the address of the data block is fetched and the record is retrieved from the memory.


Pros of ISAM:
  • In this method, each record has the address of its data block, searching a record in a huge database is quick and easy.
  • This method supports range retrieval and partial retrieval of records. Since the index is based on the primary key values, we can retrieve the data for the given range of value. In the same way, the partial value can also be easily searched, i.e., the student name starting with 'JA' can be easily searched.
Cons of ISAM
  • This method requires extra space in the disk to store the index value.
  • When the new records are inserted, then these files have to be reconstructed to maintain the sequence.
  • When the record is deleted, then the space used by it needs to be released. Otherwise, the performance of the database will slow down.
Types of Indexes
An index is a table which stores the key values and the corresponding addresses of the records in a file. Given a key value, its address is located in the index and the corresponding record can be accessed using this address.
When an index is used, the index rather than the file is used for locating a record. An index is usually defined on a single field of a file, called an Indexing Field. The index typically stores each value of the index field along with a list of pointers to all disk blocks that contain a record with that field value. The values in the index are ordered and the index file is much smaller than the data file.
    There are several types of indexes. These include primary index, clustering index, secondary index etc.

A primary index is an index specified on the primary key of a data file. Primary key is a field which contains unique values and uniquely identifies each record. By knowing a primary key field’s value we can access its record. The primary key is also known as the ordering key field and is used to physically order the file records on the disk.

A clustering index stores data similar to a phone directory where all people with the same last name are grouped together. A clustering index is specified on a field that does not have a distinct value, for each record. These records are then stored in ascending or descending order according to the data values in this field.

A third type of index, called a secondary index, can be specified on any field other than the primary key of the file. A secondary key is any field other than the primary key that is used to uniquely identify a record in a table.

Cluster file organization

  • When the two or more records are stored in the same file, it is known as clusters. These files will have two or more tables in the same data block, and key attributes which are used to map these tables together are stored only once.
  •  This method reduces the cost of searching for various records in differentfiles.
  • The cluster file organization is used when there is a frequent need for joining the tables with the same condition. These joins will give only a few records from both tables. In the given example, we are retrieving the record for only particular departments. This method can't be used to retrieve the record for the entire department.
In this method, we can directly insert, update or delete any record. Data is sorted based on the key with which searching is done. Cluster key is a type of key with which joining of the table is performed.


Types of Cluster file organization:

Cluster file organization is of two types:

1. Indexed Clusters: In indexed cluster, records are grouped based on the cluster key and stored together. The above EMPLOYEE and DEPARTMENT relationship is an example of an indexed cluster. Here, all the records are grouped based on the cluster key- DEP_ID and all the records are grouped.

2. Hash Clusters: It is similar to the indexed cluster. In hash cluster, instead of storing the records based on the cluster key, we generate the value of the hash key for the cluster key and store the records with the same hash key value.
Advantages of Cluster file organization
  • The cluster file organization is used when there is a frequent request for joining the tables with same joining condition.
  • It provides the efficient result when there is a 1:M mapping between the tables.
Disadvantages of Cluster file organization
  • This method has the low performance for the very large database.
  • If there is any change in joining condition, then this method cannot use. If we change the condition of joining then traversing the file takes a lot of time.
MULTIKEY FILE ORGANISATION

The ability to search on many keys is enabled by building multiple index files (multikey file organisation) “on top of” the data file. The physical database then consists of one or more data files and many index files, and each data file contains either one or several record types. Each index file supports
access by a particular field or group of fields.

Two approaches for providing additional access paths into a file of data records.
1. Multilist file organisation
2. Inverted file organization

1. Multilist File Organisation
 The basic approach to providing the linkage between an index and the file of datarecords is called multilist organisation. A multilist file maintains an index for each secondary key. The index for secondary key contains, instead of a list of primary keys related to that secondary key, only one primary key value related to that secondary key. That record will be linked to other records containing the same secondary key in the data file.
To facilitate searching on the primary key as well as on secondary keys, it is customary to maintain several indexes, one for each key. Using an index in this way reduces the length of the lists and thus the search time.

2. Inverted File Organisation
In inverted file organisation, a linkage is provided between an index and the file of data records. A key’s inverted index contains all of the values that the key presently has in the records of the data file. Each key-value entry in the inverted index points to all of the data records that have the corresponding value. Inverted files represent one extreme of file organisation in which only the index structures are important. The records themselves may be stored in any way (sequentially ordered by primary key, random, linked ordered by primary key etc.).


Both inverted files and multilist files have:
  •          An index for each secondary key.
  •          An index entry for each distinct value of the secondary key.
  •          The index may be tabular or tree-structured.
  •          The entries in an index may or may not be sorted.
  •          The pointers to data records may be direct or indirect.

The indexes differ in that

Thus an inverted index may have variable-length entries whereas a multilist index has fixed-length entries.

0 comments:

Post a Comment

Data Structures with C++



NET/SET/CS PG



Operating Systems



Computer Networks



JAVA



Design and Analysis of Algorithms



Programming in C++

Top