Bioinformatics – K.V. Ramesh Anthropology

Bioinformatics is defined as the computational handling and processing of genetic
information. Its goal is to enable the discovery of new biological insight as well
as to create a global perspective from which unifying principles can be perceived.
It emphasises organising data so that information can be accessed and generated,
developing tools to analyse the data, and interpreting the results in a
biologically meaningful manner. Two approaches are used in analysing the data:
data mining and knowledge discovery. These approaches involve generating and
testing hypotheses regarding the function or structure of a gene or protein of
interest by homology searching. Besides deriving meaningful biological
information from the data, bioinformatics also serves the scientific community
with resources such as databases.
Bioinformatics plays an important role in many areas of biological research, such
as genomics, transcriptomics, proteomics, structural biology, genetics, molecular
biology and evolutionary biology. In genomics, it is used particularly for genome
sequencing, mapping, genome annotation, comparison of multiple genomes,
calculation of evolutionary distance and discovery of single nucleotide
polymorphisms (SNPs). In transcriptomics, its applications include the study of
transcribed sequences, both full-length cDNA and expressed sequence tags, and
the analysis of gene expression data. In proteomics, bioinformatics is helpful in
the analysis of protein sequences and protein abundance, and in the determination
of protein structure either empirically or computationally. In molecular biology,
it contributes significantly to the analysis of protein-protein interactions and
molecular pathways and to systemic studies of gene regulation. In genetics,
bioinformatics is useful in the discovery of new molecular genetic markers such
as SNPs and in the use of these and other markers to dissect the genetic basis of
disease and other phenotypes. Bioinformatics is also helpful in studies of
evolution and phylogeny.
Tools used in Bioinformatics: Various tools are used in bioinformatics research,
including the internet; search engines such as Google, Scirus, AltaVista, Lycos,
HotBot, Northern Light and Dogpile; databases such as those of the National
Center for Biotechnology Information (NCBI) and PubMed; sequence analysis tools
such as BLAST, FASTA and multiple sequence alignment (MSA); and visualisation
tools such as RasMol, Jmol and Cn3D.
Computational methods help in analysing the data and formulating hypotheses.
Sequence data is the most abundant type of biological data available electronically.
Pairwise sequence comparison is used in bioinformatics applications to search
sequence databases, build evolutionary trees, identify characteristic features of
protein families, create homology models, compare genomes, explore sequence
determinants of protein structure and connect expression data to genomic
information.
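
Pairwise comparison can be illustrated with the classical dynamic programming
method for local alignment (Smith-Waterman). The sketch below is only an
illustration: the function name and the scoring values (match +2, mismatch -1,
linear gap penalty -2) are assumptions chosen for the example, and database
search tools such as BLAST and FASTA use faster heuristics rather than this full
calculation.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of sequences a and b.

    Smith-Waterman recurrence with a linear gap penalty; only two rows of
    the scoring matrix are kept in memory at any time.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                              # start a new local alignment
                previous[j - 1] + substitution, # align a[i-1] with b[j-1]
                previous[j] + gap,              # gap in b
                current[j - 1] + gap,           # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best
```

For example, local_alignment_score("GGACGTTT", "ACGT") finds the shared
substring ACGT and scores four matches, while two sequences with no letters in
common score zero.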

Sequence data can be used for sequence analysis to determine sequence
characteristics, and for sequence comparison, multiple sequence alignment, motif
discovery and phylogenetic inference. Sequence databases are available on the
internet and are searched mostly for similarity. The tools used for similarity
searching include BLAST and FASTA.
BLAST: It stands for Basic Local Alignment Search Tool. This algorithm is used to
perform sequence similarity searches. A server established at NCBI to support
BLAST is widely used for sequence database searches. An independent set of BLAST
programs, known as WU-BLAST, was developed at Washington University. Like NCBI
BLAST, WU-BLAST performs similarity searches and produces gapped local
alignments. The two implementations use different statistical methods to evaluate
sequence similarity scores. The BLAST algorithm increases the speed of sequence
alignment by searching first for common words, or k-tuples, in the query sequence
and each database sequence, and it restricts the search to words that are
significant. For proteins, the significance of a word match is evaluated with the
BLOSUM62 amino acid substitution matrix. The word length in the BLAST algorithm
is 3 for proteins and 11 for nucleic acids. The newer version, BLAST 2, reports
gapped alignments of query and database sequences. BLAST has a filtering feature
that masks low complexity regions in the query sequence (repeats of sequence
characters), which would otherwise produce artificially high-scoring alignments.
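
The idea behind low complexity filtering can be sketched as follows. This is not
the filter that BLAST itself uses; it is a simplified stand-in that masks any
window whose Shannon entropy falls below a threshold. The function names, the
window length of 12 and the threshold of 1.5 bits are assumptions made for the
illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (in bits) of the letter composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(sequence, window=12, threshold=1.5, mask_char="X"):
    """Replace every low-entropy window of the sequence with mask_char.

    Windows overlap, so residues flanking a repeat may also be masked.
    Sequences shorter than the window are returned unchanged.
    """
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < threshold:
            masked[start:start + window] = mask_char * window
    return "".join(masked)
```

A run of a single repeated residue has an entropy of zero and is always masked,
whereas a window of twelve different residues has an entropy of about 3.6 bits
and is left untouched.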
There are a number of variations of the BLAST programs: Blastp compares an amino
acid query sequence against a protein sequence database; Blastn compares a
nucleotide query sequence against a nucleotide sequence database; Blastx searches
the six-frame translation products of a nucleotide query sequence against a
protein sequence database; Tblastn searches a protein query sequence against a
nucleotide sequence database translated in all six reading frames; and Tblastx
compares the six-frame translations of a nucleotide query sequence against the
six-frame translations of a nucleotide sequence database. MegaBLAST searches for
similar sequences that are 300 to 100,000 bp long; a long word is used for
searching and the gap penalty is calculated from the match and mismatch scores.
RPS-BLAST scans a protein sequence for conserved domains. BLASTcl3 is a network
client that is used to access the BLAST server. Standalone BLAST consists of
executable versions of all the BLAST programs for the Windows, Unix and
Macintosh operating systems. PSI-BLAST (position-specific iterated BLAST) and
PHI-BLAST (pattern-hit initiated BLAST) are used to search for domains in the
query protein sequence and in database sequences. BEAUTY (BLAST enhanced
alignment utility) adds features to BLAST such as summarising the locations of
high-scoring segment pairs (HSPs), PFAM domains and PROSITE patterns. BLAST
searching with a cobbler (consensus) sequence is used to find the majority
residues in a multiple sequence alignment. The BLAST2 program is used to align
very long sequences.
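
The six-frame translation used by Blastx, Tblastn and Tblastx can be sketched in
a few lines. The code below assumes the standard genetic code and an unambiguous
DNA alphabet; the names (CODON_TABLE, six_frame_translations and so on) are
illustrative and are not taken from any BLAST implementation.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    # Codons containing letters other than A, C, G, T are reported as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Translate three forward and three reverse-complement reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames
```

Each of the six protein strings can then be compared with a protein database in
the same way as an ordinary protein query.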
FASTA: This program is used for aligning pairs of protein or DNA sequences. It
searches for matching sequence patterns, or words, called k-tuples; each pattern
consists of k consecutive matching letters. It then attempts to build a local
alignment from the word matches. The program is also used for database searches:
FASTA compares a query protein or DNA sequence with the target sequences in the
database and reports the best-matching sequences together with local alignments
of the matches. To search for similarity, FASTA uses a hashing method in which a
table of the positions of each word of length k (k-tuple) is constructed for each
sequence. The offset of each word is calculated by subtracting its position in
the first sequence from its position in the second; words with the same offset
indicate a region of alignment between the two sequences. The number of
comparisons increases in proportion to the average sequence length. In FASTA, the
k-tuple length is 1 or 2 for protein sequences and 4 to 6 for nucleic acid
sequences. Other versions of FASTA have also been reported. TFASTA compares a
query protein sequence with a six-frame translation of each DNA sequence in the
database; FASTS/TFASTS compare a set of short peptide fragments against a protein
sequence database or a DNA sequence database translated in all six reading
frames; FASTX and FASTY translate a query DNA sequence in all three forward
reading frames and compare the three frames with a protein sequence database; and
TFASTX and TFASTY compare a query protein sequence with a DNA sequence database,
translating each DNA sequence in all six possible reading frames.
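
The hashing step described for FASTA can be sketched as below: a table of word
positions is built for each sequence, and the offset of every shared word is
counted. This covers only the first stage of the method (finding the offset with
the most word matches), not the later rescoring and alignment stages, and the
function names are assumptions made for the illustration.

```python
from collections import Counter, defaultdict


def word_positions(sequence, k):
    """Map each word of length k to the list of positions where it starts."""
    table = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        table[sequence[i:i + k]].append(i)
    return table


def offset_counts(first, second, k=2):
    """Count shared words per offset (position in second minus position in first)."""
    first_table = word_positions(first, k)
    second_table = word_positions(second, k)
    offsets = Counter()
    for word, first_positions in first_table.items():
        for p1 in first_positions:
            for p2 in second_table.get(word, ()):
                offsets[p2 - p1] += 1
    return offsets


def best_offset(first, second, k=2):
    """Return (offset, number of word matches) for the best offset, or None."""
    offsets = offset_counts(first, second, k)
    return offsets.most_common(1)[0] if offsets else None
```

With first = "ACGT", second = "GGACGTTT" and k = 2, the words AC, CG and GT all
occur at an offset of 2, which marks the diagonal on which the two sequences
align.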
Multiple Sequence Alignment (MSA): It is an alignment of three or more sequences
that aims to place sequence positions related by function and evolution in the
same column of the alignment, allowing for mismatches and gaps (deletions or
insertions). In MSA, both global and local alignments are used. In global
alignment, the dynamic programming algorithm can be used to align three
sequences; beyond this number, only a small number of relatively short sequences
can be analysed. The methods used include progressive methods (ClustalW,
ClustalX, MAFFT, MAVID, MSA, MULTIPIPMAKER, POA, PRALINE, T-COFFEE), which start
by aligning the most alike sequences and then build up the alignment by adding
more sequences; iterative methods (DIALIGN, PRRP, SAGA), which initially align
groups of sequences and then revise the alignment to achieve a more reasonable
result; methods that align the sequences on the basis of conserved patterns found
in the same order in the sequences; statistical methods that generate
probabilistic models of the sequences; and graph-based methods.
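
The first step of a progressive method, choosing the most alike pair of sequences
to align first, can be sketched as follows. As a cheap stand-in for the pairwise
alignment scores that real programs compute, this example measures similarity as
the Jaccard index of shared words of length k; that choice, the function names
and the default k = 3 are all assumptions made for the illustration.

```python
from itertools import combinations


def kmer_set(sequence, k=3):
    """Return the set of distinct words of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of the k-mer sets of two sequences (0.0 to 1.0)."""
    words_a, words_b = kmer_set(a, k), kmer_set(b, k)
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def most_similar_pair(sequences, k=3):
    """Return the names of the two most similar sequences.

    `sequences` maps a name to a sequence string and must hold at least two
    entries. Ties are resolved in favour of the pair encountered first.
    """
    return max(
        combinations(sequences, 2),
        key=lambda pair: similarity(sequences[pair[0]], sequences[pair[1]], k),
    )
```

A progressive aligner would align the pair returned here and then add the
remaining sequences one at a time, in order of decreasing similarity.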
Local MSA methods align the most similar regions in the sequences. The approaches
include profile analysis, which identifies a highly conserved portion of the
alignment and produces a scoring matrix called a profile. A profile includes
scores for amino acid substitutions and gaps in each column of the conserved
region. In block analysis, blocks (aligned regions with substitutions but without
gaps) are searched for and used in sequence alignments. Pattern searching and
statistical methods scan for a localised region of sequence similarity in a set
of sequences.
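
A much simplified profile can be built from an ungapped block of aligned
sequences, as sketched below. The scores are log-odds values computed from column
counts with a pseudocount and a uniform background frequency; both of these are
assumptions made for the illustration, and gap scores, which a full profile also
contains, are left out. The function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Return one {residue: log-odds score} dictionary per alignment column.

    `alignment` is a list of equal-length, ungapped sequences.
    """
    background = 1.0 / len(alphabet)
    denominator = len(alignment) + pseudocount * len(alphabet)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({
            residue: math.log2(
                ((counts[residue] + pseudocount) / denominator) / background
            )
            for residue in alphabet
        })
    return profile


def score_segment(profile, segment):
    """Sum the column scores of a segment as long as the profile."""
    return sum(column[residue] for column, residue in zip(profile, segment))


def best_match(profile, sequence):
    """Return (start, segment) of the highest-scoring window in the sequence.

    The sequence must be at least as long as the profile.
    """
    width = len(profile)
    start = max(
        range(len(sequence) - width + 1),
        key=lambda i: score_segment(profile, sequence[i:i + width]),
    )
    return start, sequence[start:start + width]
```

Residues that are frequent in a column receive positive scores and residues that
never occur there receive negative scores, so sliding the profile along a new
sequence highlights the region that best resembles the conserved block.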
Structure visualisers: Protein structure data is stored as collections of x, y, z
coordinates. The connectivity between atoms in proteins has to be taken into
account, and for the visualisation to be effective a virtual 3D environment needs
to be created. A protein structure visualisation program needs to be able to
display user-selected subsets of atoms with correct connectivity, draw standard
cartoon representations of proteins such as ribbons and cylinders, and recolour
subsets of a molecule according to a specified parameter.
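
The first two requirements, reading coordinates and working out connectivity, can
be sketched as below. The parser reads the fixed-column ATOM and HETATM records
of the PDB format. The bonding rule is an assumption made for the illustration:
two atoms are treated as bonded when they lie within 1.9 angstroms of each other,
whereas real viewers use element-specific radii and the CONECT records of the
file. The class and function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Atom:
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float


def parse_atoms(pdb_lines):
    """Extract atoms from the ATOM and HETATM records of a PDB file."""
    atoms = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            atoms.append(Atom(
                name=line[12:16].strip(),         # columns 13-16
                residue=line[17:20].strip(),      # columns 18-20
                chain=line[21],                   # column 22
                residue_number=int(line[22:26]),  # columns 23-26
                x=float(line[30:38]),             # columns 31-38
                y=float(line[38:46]),             # columns 39-46
                z=float(line[46:54]),             # columns 47-54
            ))
    return atoms


def infer_bonds(atoms, max_distance=1.9):
    """Return index pairs of atoms closer than max_distance (assumed cutoff)."""
    bonds = []
    for i, first in enumerate(atoms):
        for j in range(i + 1, len(atoms)):
            second = atoms[j]
            distance = math.dist(
                (first.x, first.y, first.z), (second.x, second.y, second.z)
            )
            if distance <= max_distance:
                bonds.append((i, j))
    return bonds
```

A subset of atoms can then be selected with an ordinary list comprehension, for
example all atoms whose chain is "A", before the bonds are drawn.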
RasMol: It is a structure visualisation tool available for a wide range of
operating systems. It reads molecular structure files in the standard PDB format.
It comes in three display depths: 8, 16 and 32 bit. The molecule can be rotated
in the window. The file menu has commands for opening a molecular structure file,
and the display menu has commands for setting the molecular display style to
formats including ball and stick, cartoons and space fill. The colour menu allows
the colours of the molecule to be changed, the options menu allows changes to the
display style, and the export menu writes the displayed image in common
electronic image formats such as GIF, PostScript and PPM. The command line allows
the creation of custom combinations of colours and structure display formats.
Cn3D: It allows protein structure files in NCBI ASN.1 format to be viewed. It
opens two windows: a colour structure viewer, in which a molecule can be rotated,
coloured according to different properties and rendered in different display
formats; and a sequence viewer, which shows the sequences and alignments
corresponding to the displayed protein and allows graphics to be added to the
sequence display to highlight the location of secondary structure features.
Biological Databases: A biological database is a large, organised body of
persistent data, usually associated with computerised software designed to update,
query and retrieve components of the data stored within the system. These
databases help to gain insight into biological phenomena, from the structure of
biomolecules and their interactions to the whole metabolism of an organism, and
to understand the evolution of species. Databases are classified as primary
databases (DDBJ, EMBL, GenBank), protein sequence databases (SWISS-PROT, Protein
Information Resource), protein family databases (Pfam, PROSITE), protein
structure databases (PDB, SCOP), protein-protein interaction databases (BioGRID,
STRING), pathway databases (KEGG) and microarray databases (ArrayExpress, Gene
Expression Omnibus).
DDBJ (DNA Data Bank of Japan): It is run by the National Institute of Genetics,
Japan. It is the only nucleotide sequence database in Asia. It works in
collaboration with EMBL and GenBank. It collects experimentally determined
sequence data mainly from Japanese researchers but also accepts data from others.
The database is a collection of entries, the entry being the unit of data. Each
entry includes a nucleotide sequence together with information on the submitters,
references, source organism and biological nature of the sequence, such as gene
function and other properties.
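
The contents of an entry can be pictured as a simple record type. The sketch
below is not the DDBJ flat-file format; the class name, the field names and the
helper function are hypothetical and only mirror the items listed above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SequenceEntry:
    """Illustrative record holding the parts of a nucleotide database entry."""
    accession: str                      # identifier of the entry (hypothetical field)
    sequence: str                       # the nucleotide sequence itself
    source_organism: str = ""
    submitters: List[str] = field(default_factory=list)
    references: List[str] = field(default_factory=list)
    # Biological nature of the sequence, e.g. {"gene function": "..."}.
    features: Dict[str, str] = field(default_factory=dict)

    @property
    def length(self):
        """Number of nucleotides in the entry."""
        return len(self.sequence)


def find_by_organism(entries, organism):
    """Return the entries whose source organism matches, ignoring case."""
    wanted = organism.lower()
    return [entry for entry in entries if entry.source_organism.lower() == wanted]
```

Treating the entry as the unit of data means that a search, such as the lookup by
source organism shown here, returns whole entries rather than bare sequences.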
EMBL (European Molecular Biology Laboratory): This database is maintained by the
European Bioinformatics Institute, Cambridge, U.K. It is Europe's primary
nucleotide sequence database. The data consists of DNA and RNA sequences drawn
from individual researchers, genome sequencing projects and patent applications.
As of 30th August 2012, it contained 252,106,363 sequence entries comprising
450,481,663,919 nucleotides.
GenBank: It is the NIH genetic sequence database. As of April 2011, it held
126,551,501,141 bases in 135,440,924 sequence records. Entrez Nucleotide is used
for sequence identification and annotation. It is divided into core nucleotide
(the main collection), dbEST (expressed sequence tags) and dbGSS (genome survey
sequences). BLAST programs can be used to align query sequences to GenBank
sequences.
SWISS-PROT: It is a curated protein sequence database established in 1986 and
maintained by the Department of Medical Biochemistry of the University of
Geneva and the European Bioinformatics Institute (EBI). Its characteristic
features include high-level annotation, minimal redundancy and integration with
other databases.

PIR (Protein Information Resource): It was established in 1984 by the National
Biomedical Research Foundation (NBRF). It helps researchers to identify and
interpret protein sequence information and provides tools for this purpose.
Pfam: It is a database of protein families. It has two components, Pfam-A and
Pfam-B. Pfam-A contains high-quality, manually curated protein families that
cover a large proportion of sequences, while Pfam-B is a collection of
lower-quality families that is useful for identifying functional regions. Pfam
also generates clans, which are groupings of related protein families.
PROSITE: This database is a collection of protein families and domains. It
provides patterns and profiles for more than a thousand protein families and
domains.
PDB (Protein Data Bank): This database holds experimentally determined
structures of proteins, nucleic acids and complex assemblies. It was established
in 1971 at Brookhaven National Laboratory. In 1998 it came under the umbrella of
the Research Collaboratory for Structural Bioinformatics (RCSB). It contains
information about coordinates, deposited structures and the method of structure
determination. As of 2nd October 2012, 85 thousand structure records were
available.
SCOP (The Structural Classification of Proteins): It organises and classifies
proteins based on their evolutionary and structural relationships. It is organised
into four hierarchical levels: family, superfamily, fold and class.
BioGRID (The Biological General Repository for Interaction Datasets): It
is a database with collections of genetic and protein interaction data from model
organisms and humans. As of 1st October 2012, it held 557,934 interactions drawn
from 34,996 publications in the primary literature.
STRING (Functional Protein Association Network): It is a database of known
and predicted protein interactions. The interactions include direct (physical) and
indirect (functional) associations. The data is drawn from genomic context,
high-throughput experiments, co-expression and previous knowledge. The database
attempts to integrate interaction data from these sources for a large number of
organisms and transfers information between organisms. It holds 5,214,234
proteins from 1,133 organisms.
KEGG (Kyoto Encyclopedia of Genes and Genomes): It was initiated in 1995. This
database integrates genomic, chemical and systemic functional information. It
provides a reference knowledge base for linking genomes to life through the
process of PATHWAY mapping, in order to infer systemic behaviours of the cell or
the organism. It also links genomes to the environment via BRITE mapping. KEGG
BRITE, an ontology database, showcases functional hierarchies of various
biological objects, including molecules, cells, organisms, diseases and drugs,
and the relationships among them. KEGG PATHWAY presently has a combined map of
about 120 existing pathway maps. Smaller pathway modules are stored in KEGG
MODULE. KEGG DRUG contains information about all approved drugs in the US and
Japan, and KEGG DISEASE provides links to disease genes, pathways, drugs and
diagnostic markers.

Microarray databases: A microarray experiment is the hybridisation of a nucleic
acid sample (target) to a very large set of oligonucleotide probes attached to a
solid support, in order to determine sequence, to detect variations in gene
sequence or expression, or to map genes. Various organisations have created
microarray databases: the Gene Expression Omnibus of the National Center for
Biotechnology Information, US; ArrayExpress of the European Bioinformatics
Institute; and the Center for Information Biology Gene Expression database of the
National Institute of Genetics, Japan. The purpose of these databases is to store
the minimum information about microarray experiments that allows researchers to
repeat the experiments.
Applications of Bioinformatics (BI):
• is useful for gene and protein prediction, evolutionary distance calculation,
active site identification, construction of novel mutations and characterisation
of alleles of diseases.
• is applied in the analysis of the organisation of genes and genomes.
• may be helpful in identification of regulatory elements in genes.
• focuses on the development of algorithms to assess relationships among
members of large data sets.
• allows the study of evolutionary events like gene duplication, lateral gene
transfer.
• is useful in conservation of genomes of endangered species and biodiversity
management.
• facilitates matching of data generated by mass spectrometers to the protein
sequence databases.
• is helpful in the automation, processing, quantification and analysis of large
amounts of data from biomedical images.
• provides first-hand information, in the form of databases, on chemical
structure, reaction kinetics, synthetic methods and toxic chemical substances.
• can be used to study the invention and implementation of structures and
algorithms to improve communication, understanding and management of medical
information.
• may be used to develop models for simulating intracellular molecular
processes to predict dynamic behaviour of living cells.
• provides tools for predicting the function of genes and the 3D structure of
proteins, selecting suitable ligands and identifying protein motifs and
domains.
• is useful for maintaining sequence data and enabling its analysis by making
tools available.
Challenges: The key challenges for bioinformatics involve managing the current
flood of raw data and aggregate information, the diversity of sources and
formats, the variable conditions and descriptions of experimentation, the quality
of experimental evidence for data, data processing, and the evolving knowledge
arising from the study of the genome and its manifestations.