
Page 1: Sajid Khan

Bioinformatics Computing

Department Bioinformatics

Government Post Graduate College,

Mandian, Abbottabad.

Sajid Khan

Page 2: Sajid Khan

Chapter 2. Databases

Page 3: Sajid Khan

Data Life Cycle

The overall process, from data creation to disposal, is normally referred to as the data life cycle.

Key steps in the process include data creation and acquisition, use, modification, repurposing, and, at the end of the cycle, archiving and disposal.

The same process applies to data on a desktop workstation or to a large pharmacogenomic operation with multiple, disparate systems.

Page 4: Sajid Khan

Data Creation and Acquisition

The process of data creation and acquisition is a function of the source and type of data.

FOR EXAMPLE

Data are generated by sequencing machines and microarrays in the molecular biology laboratory, and by clinicians and clinical studies in the clinic or hospital.

Depending on the difficulty in creating the data and the intended use, the creation process may be trivial and inexpensive or extremely complicated and costly.

FOR EXAMPLE

Recruiting test subjects to donate tissue biopsies is generally more expensive and difficult than identifying patients who are willing to provide less invasive (and less painful) tissue samples.

Page 5: Sajid Khan

Major Issues in Data Creation

In addition to cost, the major issues in the data-creation phase of the data life cycle include tool selection, data format, standards, version control, error rate, precision, and accuracy.

These metrics apply equally to clinical and genomic studies.

FOR EXAMPLE

Optical character recognition (OCR), which was once used extensively as a means of acquiring sequence information from print publications, has an error rate of about two characters per hundred, which is generally unacceptable.

Page 6: Sajid Khan

Use

Once clinical and genomic data are captured, they can be put to a variety of immediate uses, from simulation, statistical analysis, and visualization to communications.

Issues at this stage of the data life cycle include intellectual property rights, privacy, and distribution.

FOR EXAMPLE

Unless patients have expressly given permission to have their names used, microarray data should be identified by an ID number through a system that maintains the anonymity of the donor.
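
A minimal sketch of such a de-identification step, assuming a hypothetical assign_study_id helper and an in-memory lookup (a real system would keep this mapping in a secured, access-controlled store):

import secrets

# Hypothetical in-memory mapping from patient identity to an anonymous study ID.
_id_map: dict[str, str] = {}

def assign_study_id(patient_name: str) -> str:
    """Return a stable anonymous ID for a donor; the name is never exposed downstream."""
    if patient_name not in _id_map:
        _id_map[patient_name] = "SUBJ-" + secrets.token_hex(4)
    return _id_map[patient_name]

# Microarray results are then filed under the anonymous ID only.
record = {"study_id": assign_study_id("Jane Doe"), "expression_values": [2.1, 0.7, 5.3]}
print(record["study_id"])  # e.g. SUBJ-a3f91c02 -- no patient name attached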

Page 7: Sajid Khan

Data Modification

Data are rarely used in their raw form, without some amount of formatting or editing. In addition, data are seldom used only for their originally intended purpose, in part because future uses are difficult to predict.

FOR EXAMPLE

Microarray data may not be captured expressly for comparison with clinical pathology data, but they may serve that purpose well.

The data dictionary is one means of modifying data in a controlled way that ensures standards are followed.

A data dictionary can be used to tag all microarray data with time and date information in a standard format so that they can be automatically correlated with clinical findings.
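
A minimal sketch of this kind of dictionary-directed tagging, assuming a hypothetical dictionary entry that fixes ISO 8601 as the standard timestamp format (the entry name and fields are illustrative, not taken from any particular system):

from datetime import datetime

# Hypothetical data-dictionary entry: every microarray record carries an
# acquisition timestamp in ISO 8601 format (YYYY-MM-DDTHH:MM:SS).
DATA_DICTIONARY = {
    "microarray.acquired_at": {"type": "datetime", "format": "%Y-%m-%dT%H:%M:%S"},
}

def normalize_timestamp(raw: str, source_format: str) -> str:
    """Re-express a lab-specific timestamp in the dictionary's standard format."""
    std = DATA_DICTIONARY["microarray.acquired_at"]["format"]
    return datetime.strptime(raw, source_format).strftime(std)

# A scanner that reports "25/02/2016 14:30" is rewritten to the shared standard,
# so the record can later be correlated with clinical findings by date.
print(normalize_timestamp("25/02/2016 14:30", "%d/%m/%Y %H:%M"))  # 2016-02-25T14:30:00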

Page 8: Sajid Khan

Data Dictionary-Directed Data Modification

The time and date header for microarray data can be automatically modified so that it can be easily correlated with clinical findings.

Data that are modified or transformed by the data dictionary are normally stored in a data mart or data warehouse so that the transformed data are readily available for subsequent analysis, without investing time and diverting computational resources by repeatedly reformatting the data.

The relationship between microarray data and clinical data, such as activity at a particular gene locus and overt aggression score, can be more easily computed because the data can be sorted and compared by date of birth.

Page 9: Sajid Khan

Archiving

Archiving, the central focus of the data life cycle, is concerned with making data available for future use.

Unlike a data repository, data mart, or data warehouse, which hold data that are frequently accessed, an archive is a container for data that are infrequently accessed, with the focus more on longevity than on access speed. In the archiving process, which can range from backing up a local database on a CD-ROM or Zip disk to backing up an entire EMR system in a large hospital, data are named, indexed, and filed in a way that facilitates identification later.

While university or government personnel archive the large online public databases, the archiving of locally generated data is a personal or corporate responsibility.

Page 10: Sajid Khan

ARCHIVING ISSUES

Key issues in the bioinformatics archival process range from the scalability of the initial solution to how best to provide for security.

Page 11: Sajid Khan

RAID SYSTEM

Archives vary considerably in configuration and in proximity to the source data.

For example, servers typically employ several independent hard drives configured as a Redundant Array of Independent Disks (RAID) that functions in part as an integrated archival system.

The idea behind a RAID system is to provide real-time backup of data by increasing the odds that data written to a server will survive the crash of any given hard drive in the array. RAID was originally introduced in the late 1980s as a means of turning relatively slow and inexpensive hard disks into fast, large-capacity, more reliable storage systems.

RAID systems derive their speed from reading and writing to multiple disks in parallel. The increased reliability is achieved through mirroring or replicating data across the array and by using error detection and correction schemes.

Page 12: Sajid Khan

RAID-3

Although there are seven levels of RAID, level 3 is most applicable to bioinformatics computing. In RAID-3, one disk is dedicated to storing parity, an extra bit used to determine the accuracy of data transfer, for error detection and correction. If analysis of the parity indicates an error, the faulty disk can be identified and replaced, and the data can be reconstructed by using the remaining disks and the parity disk.
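
A minimal sketch of the parity idea behind RAID-3, using toy byte-sized "disks" held in memory (real controllers stripe data in hardware, so this is purely illustrative):

from functools import reduce

# Three toy data disks, each holding one byte of a stripe.
disks = [0b10110010, 0b01101100, 0b11100001]

# The dedicated parity disk stores the XOR of the data disks.
parity = reduce(lambda a, b: a ^ b, disks)

# If disk 1 fails, its contents are recoverable from the survivors plus parity.
survivors = [d for i, d in enumerate(disks) if i != 1]
reconstructed = reduce(lambda a, b: a ^ b, survivors) ^ parity
assert reconstructed == disks[1]
print(bin(reconstructed))  # 0b1101100 -- the lost byte, rebuilt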

Page 13: Sajid Khan

Repurposing

Major benefits

One of the major benefits of having data readily available in an archive is the ability to repurpose them for a variety of uses. For example, linear sequence data originally captured to discover new genes are commonly repurposed to support the 3D visualization of protein structures.

Major issues

One of the major issues in repurposing data is the ability to efficiently locate data in archives. For example, nucleotide sequence data indexed only by chromosome number would be virtually impossible to locate if the database contains thousands of sequences indexed to each chromosome. Issues in the repurposing phase of the data life cycle also include the sensitivity, specificity, false positives, and false negatives associated with searches, as quantified in the sketch below. The usability of the user interface is another factor, whether free-text natural language, search by example, or simple keyword searching is supported.
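
A minimal sketch of how those search metrics are computed, using illustrative outcome counts for a single archive query (the numbers are made up for the example):

# Hypothetical outcome counts for one archive search.
tp, fp, tn, fn = 90, 20, 870, 10  # true pos., false pos., true neg., false neg.

sensitivity = tp / (tp + fn)  # fraction of relevant records the search retrieved
specificity = tn / (tn + fp)  # fraction of irrelevant records correctly excluded

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
# sensitivity=0.90, specificity=0.98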

Page 14: Sajid Khan

Disposal

The duration of the data life cycle is a function of the perceived value of the data, the effectiveness of the underlying process, and the limitations imposed by the hardware, software, and environmentalinfrastructure.

Eventually, all data die, either because they are intentionally disposed of when their value has decreased to the point that it is less than the cost of maintaining them, or because of accidental loss. Often, data have to be archived for legal reasons, even though the data are of no intrinsic value to the institution or researcher. For example, most official hospital or clinic patient records must be maintained for the life of the patient.

When sequence data from one gene are no longer necessary, they can be discarded from the local data warehouse, leaving room for the next gene's sequence data.

Page 15: Sajid Khan

Managing the Life Cycle

Managing the data life cycle is an engineering exercise that involves a compromise among speed, completeness, longevity, cost, usability, and security.

For example, the media selected for archiving affect not only the cost but also the speed of storage and the longevity of the data. Similarly, using an in-house tape backup facility may be more costly than outsourcing the task to a networked vendor, but the in-house approach is likely to be more secure.

These tradeoffs are reflected in the implementation of the overall data-management process.

Page 16: Sajid Khan

Database Technology

The purpose of a database is to facilitate the management of data, a process that depends on people, processes, and as described here, the enabling technology. Consider that the thousands of base pairs discovered every minute by the sequencing machines in public and private laboratories would be practically impossible to record, archive, and either publish or sell to other researchers without computer databases.

The database hierarchy has many parallels to the hierarchy in the human genome. Data stored in chromosomes, like a data archive, must be unpacked and transferred to a more immediately useful form before the data can be put to use.

Organic Analog of Database Hierarchy.

Page 17: Sajid Khan

Working Memory

Limited working memory in volatile RAM is used for program execution, whereas an expansive disk or other nonvolatile memory serves as a container for data that can't fit in working memory.

Volatility, working memory, and the volume of data that can be handled are key variables in memory systems such as databases.

For example, nucleotide sequences that will be used in pattern-matching operations in the online sequence databases will be formatted according to the same standard, such as the FASTA format.
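
A minimal sketch of reading FASTA-formatted text, the plain-text convention of a '>' header line followed by sequence lines (a small hand-rolled parser, not any particular library's API):

def parse_fasta(text: str) -> dict[str, str]:
    """Return {header: sequence} from FASTA-formatted text."""
    records, header, chunks = {}, None, []
    for line in text.strip().splitlines():
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line.strip())
    if header is not None:
        records[header] = "".join(chunks)
    return records

example = """>seq1 hypothetical nucleotide fragment
ATGCGTACGTTAG
CCATGAAC
"""
print(parse_fasta(example))  # {'seq1 hypothetical nucleotide fragment': 'ATGCGTACGTTAGCCATGAAC'}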

Page 18: Sajid Khan

Database Architecture

One of the greatest challenges in bioinformatics is the complete, seamless integration of databases from a variety of sources.

This is not the case now, primarily because databases such as GenBank and SWISS-PROT were designed with architectures intended primarily to support their own particular functions.

From a structural or architectural perspective, database technology can be considered either centralized or distributed.

In the centralized approach, typified by the data warehouse, data are processed in order to fit into a central database.

In a distributed architecture, data are dispersed geographically, even though they may appear to be in one location because of the database management system software.

CENTRALIZED OR DISTRIBUTED

Page 19: Sajid Khan

Centralized Database Architecture

A centralized architecture, such as a data warehouse, consolidates data from all organizational activity in one location. This can be a formidable task, as it requires cleaning, encoding, and translation of data before they can be included in the central database. For example, once data to be included in a data warehouse have been identified, the data from each application are cleaned and merged with data from other applications. In addition, there are the usual issues of database design, provision for maintenance, security, and periodic modification.

A centralized database, such as a data warehouse, combines data from a variety of databases in one physical location.
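
A minimal sketch of that clean-and-merge step, assuming two hypothetical source applications whose records use different field names and date formats (all names here are illustrative):

from datetime import datetime

# Records exported from two hypothetical source applications.
lab_app = [{"sample": "S-001", "run_date": "25/02/2016", "gene": "BRCA1"}]
clinic_app = [{"patient_sample": "S-001", "visit": "2016-02-26", "dx": "neurofibromatosis"}]

def clean_lab(rec):
    # Translate field names and dates into the warehouse's standard form.
    return {"sample_id": rec["sample"],
            "date": datetime.strptime(rec["run_date"], "%d/%m/%Y").date().isoformat(),
            "gene": rec["gene"]}

def clean_clinic(rec):
    return {"sample_id": rec["patient_sample"], "date": rec["visit"], "diagnosis": rec["dx"]}

# Cleaned records share sample_id and ISO dates, so they can be joined in the warehouse.
warehouse = [clean_lab(r) for r in lab_app] + [clean_clinic(r) for r in clinic_app]
print(warehouse)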

Page 20: Sajid Khan

Distributed database architecture

Distributed database architecture is characterized by physically disparate storage media. One advantage of using a distributed architecture is that it supports the ability to use a variety of hardware and software in a laboratory, allowing a group to use the software that makes their lives easiest, while still allowing a subset of data in each application to be shared throughout the organization.

Distributed databases can be configured to share data through dedicated, one-to-one custom interfaces (left) or by writing to a common interface standard (right).

Custom interfaces incur a work penalty on the order of two times the number of databases that are integrated.

Page 21: Sajid Khan

Storage Area Network Architecture

A SAN is a dedicated network that connects servers and SAN-compatible storage devices. SAN devices can be added as needed, within the bandwidth limitations of the high-speed fiber network.

For example, Network Attached Storage (NAS) is one method of adding storage to a networked system of workstations. To users on the network, the NAS acts like a second hard drive on their workstations. However, a NAS device, like a file server, must be managed and archived separately. A similar approach is to use a Storage Service Provider (SSP), which functions as an Application Service Provider (ASP) with a database as the application.

In addition to SANs, there is a variety of other network-dependent database architectures.

Page 22: Sajid Khan

Database Management Systems

The database management system (DBMS) is the set of software tools that works with a given architecture to create a practical database application.

The DBMS is the interface between the low-level hardware commands and the user, allowing the user to think of data management in abstract, high-level terms using a variety of data models, instead of the bits and bytes on magnetic media. The DBMS also provides views, or high-level abstract models of portions of the conceptual database, that are optimized for particular users. In this way, the DBMS, like the user interface of a typical application, shields the user from the details of the underlying algorithms and data representation schemes.

The DBMS also guards against data loss. For example, a DBMS should support quick recovery from hardware or software failures.

A key issue in working with a DBMS is the use of metadata, or information about the data contained in the database. Views are one application of metadata: a collection of information about the naming, classification, structure, and use of data that reduces inconsistency and ambiguity.
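
A minimal sketch of a view as a metadata-driven abstraction over stored tables, using Python's built-in sqlite3 module and hypothetical table and column names:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expression (sample_id TEXT, gene TEXT, level REAL, donor_name TEXT)")
conn.execute("INSERT INTO expression VALUES ('S-001', 'BRCA1', 2.4, 'Jane Doe')")

# A view exposes only what a particular class of user should see (here, no donor names),
# shielding the user from the underlying storage details.
conn.execute("CREATE VIEW public_expression AS SELECT sample_id, gene, level FROM expression")
print(conn.execute("SELECT * FROM public_expression").fetchall())  # [('S-001', 'BRCA1', 2.4)]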

Page 23: Sajid Khan

Metadata, Information, and Data in Bioinformatics

Metadata labels, simplifies, and provides context for underlying information and data.

For example, one way to think about the application of metadata is to consider the high-level biomedical literature as a means of simplifying and synthesizing the underlying complexity of molecular disease, protein structure, protein alignment, and protein and DNA sequence data.

From this perspective, data are base-pair identifiers derived from observation, experiment, or calculation; information is data in context, such as the relationship of DNA sequences to protein structure; and metadata is a descriptive summary of disease presentations that provides additional context to the underlying information.

Page 24: Sajid Khan

Data Models

The most common data models supported by DBMS products are the flat, network, hierarchical, relational, object-oriented, and deductive data models. Even though long strings of sequencing data lend themselves to a flat-file representation, the relational model is by far the most popular in the commercial database industry and is found in virtually every biotech R&D laboratory. The most common data models in bioinformatics are relational, flat, and object-oriented.

A relational database is typically queried with a Structured Query Language (SQL) statement, for example:

SELECT *
FROM author_subject_table
WHERE subject = 'Neurofibromatosis';
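
A minimal sketch of that query in action, using Python's built-in sqlite3 module and a hypothetical author_subject_table populated with illustrative rows:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE author_subject_table (author TEXT, subject TEXT)")
conn.executemany("INSERT INTO author_subject_table VALUES (?, ?)",
                 [("Smith J", "Neurofibromatosis"), ("Lee K", "Cystic fibrosis")])

rows = conn.execute(
    "SELECT * FROM author_subject_table WHERE subject = 'Neurofibromatosis'").fetchall()
print(rows)  # [('Smith J', 'Neurofibromatosis')]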

Page 25: Sajid Khan

Object-Oriented Data Representation

The object-oriented data model is natural for hiding the complexity of genomic data.

The object-oriented model combines the natural structure of the hierarchical model with the flexibility of the relational model. As such, the major advantage of the object-oriented model is that it can be used to represent complex genomic information, including non-record-oriented data, such as textual sequence data and images, in a way that doesn't compromise flexibility. Although the object-oriented approach holds great promise in bioinformatics, it still lags far behind relational technology in the global database market.
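
A minimal sketch of the idea, using hypothetical classes that bundle record-oriented fields with non-record data such as sequence text and an image behind one interface (the class and field names are illustrative):

from dataclasses import dataclass, field

@dataclass
class GelImage:
    filename: str
    pixels: bytes = b""  # non-record-oriented data carried inside the object

@dataclass
class GeneRecord:
    symbol: str
    sequence: str  # long textual sequence data
    images: list[GelImage] = field(default_factory=list)

    def gc_content(self) -> float:
        """Behavior lives with the data; callers never touch the raw representation."""
        gc = sum(self.sequence.count(base) for base in "GC")
        return gc / len(self.sequence)

rec = GeneRecord("NF1", "ATGGCGCGTTA", [GelImage("nf1_gel.png")])
print(round(rec.gc_content(), 2))  # 0.55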

Page 26: Sajid Khan

Deductive Database

The deductive model is an extension of the relational model with an interface built on the principles of logic programming. The interface is composed of rules, facts, and queries, and it uses the relational database infrastructure to contain the facts. The database is termed deductive because, from the set of rules and the facts, it is possible to derive new facts not contained in the original set of facts. Unlike logic programming languages such as PROLOG, which search for a single answer to a query using a top-down search, deductive databases search bottom-up, starting from the facts, to find all answers to a query.
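
A minimal sketch of bottom-up derivation, assuming a hypothetical regulates relation and a single transitivity-style rule (a toy fixed-point loop, not a real deductive engine):

# Stored facts: (regulator, target) pairs in a hypothetical regulates relation.
facts = {("geneA", "geneB"), ("geneB", "geneC")}

# Rule: influences(X, Z) if regulates(X, Y) and influences(Y, Z),
# where every stored regulates fact also counts as an influences fact.
derived = set(facts)
changed = True
while changed:  # bottom-up: reapply the rule until no new facts appear
    changed = False
    for (x, y) in facts:
        for (y2, z) in set(derived):
            if y == y2 and (x, z) not in derived:
                derived.add((x, z))
                changed = True

print(sorted(derived))  # includes the new fact ('geneA', 'geneC'), absent from the original set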

Page 27: Sajid Khan

Interfaces

Databases don't stand alone, but communicate with devices and users through external and user interfaces, respectively.

Getting data into a database can come about programmatically, as in the creation of a data warehouse or data mart through processing an existing database. More often, the data are derived from external sources, such as user input through keyboard activity, or devices connected to a computer or network.

Databases communicate with equipment and users through a variety of external interfaces.

One such interface is the Common Gateway Interface (CGI). For example, unlike other scripting languages for Web page development, PHP offers excellent connectivity to most of the common databases, including Oracle, Sybase, MySQL, ODBC, and many others.

Page 28: Sajid Khan

Bioinformatics Database Implementation Issues


Page 31: Sajid Khan

Thank You

Contact: gpgcm_bc (Yahoo Group)