Introduction to New and Emerging areas in Human Genetics

  • 1.1 Introduction
  • 1.2 Proteomics
  • 1.3 Bioinformatics
  • 1.4 Pharmacogenomics
  • 1.5 Stem Cell Research

1.1 INTRODUCTION

Like any discipline, human genetics is growing by leaps and bounds as science and technology advance, in the service of human welfare and a better quality of life.

Sequencing of genomes has led to the accumulation of a wealth of sequence data, yet the functions of a large number of genes remain unknown. Deciphering the functions of genes will require a clearer understanding of post-transcriptional and post-translational modifications. Since defective proteins are the ultimate causes of most diseases, it is important to consider the proteome (the set of all proteins) in relation to the genome; this study constitutes the subject matter of proteomics.

Advances in sequencing technologies, the consequent sequencing of the genomes of various organisms including humans, and the availability of large-scale public-domain genomic and related databases have necessitated the development of the discipline of bioinformatics. Bioinformatics has been useful for maintaining and analysing sequences and for bringing them within the reach of the scientific community and the public in the form of databases. The development of various algorithms has facilitated the comparison of sequences and genomes for predicting genes and their functions, translation of nucleotide sequences into amino acid sequences, prediction of protein structure by homology modelling, single nucleotide polymorphism discovery, identification of susceptibility biomarkers (or genetic markers) of diseases and identification of drug targets for the better management of diseases.

There is a difference of about 0.1% between the DNA sequences of any two individuals. This variation often results in variation in drug response. The search, at the genome level, for characteristic genetic variants responsible for variation in response to drugs has given birth to pharmacogenomics. Inadequate knowledge of the genotype characteristics that contribute to adverse reactions to various drugs also increases patient-care costs. Genotype-guided management of diseases holds great promise.

Decreasing birth rates, improved nutrition and the control of infectious diseases have lengthened the human lifespan and, at the same time, increased the prevalence of diseases such as diabetes, cardiovascular disease and cancer. With advances in medical technology and pharmaceutical research, many of these diseases can be managed clinically. Eventually, however, patients with these diseases sometimes face end-stage organ dysfunction requiring organ transplantation. Lack of awareness about organ donation and pending clearances for cadaver organ transplantation have fostered stem cell research efforts. Somatic cell nuclear transfer is one of the tools of stem cell research; it helps in developing autologous cells, free from immune rejection, for therapeutic purposes. Developments in stem cell research have helped to shape therapies for the management of some disorders. Stem cell research has raised the hopes of suffering patients, although the translation of its results into patient care is still in its infancy.

Let us discuss these areas of human genetics in detail.

1.2 PROTEOMICS

The term ‘proteome’ was coined by Marc Wilkins in 1994. It is the linguistic equivalent of ‘genome’, and proteomics deals with the large-scale analysis of the complete, or at least the major, set of proteins. The proteome can be defined in terms of the sequence, structure, abundance, localisation, modification, interaction and biochemical function of its components.

Importance of proteomics

  • 1) Genes are instruction carriers, while proteins are the functional molecules of the cell, so a true understanding of proteins can come only from studying them directly.
  • 2) Unlike the genome, whose content with few exceptions remains the same irrespective of cell type or environmental conditions, the proteome is dynamic: its content varies under different conditions owing to the regulation of transcription, RNA processing, protein synthesis and protein modification. Study of the proteome can provide a glimpse of the cell in action.
  • 3) A good understanding of the structure and function of a protein may provide clues for introducing mutations in order to better understand its function.
  • 4) The transcriptome may not give a true picture of the proteome, because not all mRNAs in the cell are translated, and rates of protein synthesis and protein turnover may vary among transcripts.
  • 5) Differences in the stability of mRNAs and in the efficiency of translation can affect the generation of new proteins. Some transcripts may give rise to multiple proteins. For instance, 22 different forms of alpha-1-antitrypsin were observed in plasma. The individual functions of these proteins can be studied only at the protein level.
  • 6) In certain body fluids such as serum, cerebrospinal fluid and urine, where nucleic acids are not represented, only proteins provide information about determinants of disease progression. Where nucleic acids are degraded or cross-linked in fixed biological specimens, protein may be the only source material for further study. In many diseases, proteins are the drug targets.
  • 7) Proteomics attempts to bridge the gap between our understanding of genome sequence and cellular behaviour.

A brief account of proteins follows:

Proteins: The term ‘protein’ is derived from the Greek word ‘proteios’, meaning ‘of the first order’. J.J. Berzelius coined the term in 1838 to describe a class of macromolecules abundant in living organisms. Proteins constitute about 50% of the dry weight of the cell. They are made up of amino acids; 20 were traditionally recognised, and two more, selenocysteine and pyrrolysine, have since been added to the list. Proteins contain only L-amino acids, except in marine microorganisms, in which both D- and L-amino acids have been found. D and L refer to the two mirror-image forms of an amino acid, which rotate plane-polarized light in opposite directions. Amino acids form peptide bonds between the carboxyl group of one amino acid and the amino group of another. The amino acids so joined are called amino acid residues. Depending on the number of amino acids joined by peptide bonds, the products are called di-, tri-, tetrapeptides and so on. A protein may have one or more polypeptides (strings of amino acids), folded mostly into either a globular or a fibrous form.

Proteins are synthesized by the translation of mRNA into polypeptides on ribosomes. After translation, they may undergo some 400 types of reversible and irreversible chemical modification, such as glycosylation and phosphorylation, which are collectively called post-translational modifications. At any given time in a cell, the level of a protein depends on the rate of transcription of its gene, the efficiency of translation of the mRNA into protein and the rate of degradation of the protein. Various agents such as oxidants, radiation and chemicals modify proteins in ways that lead to their degradation. Phosphorylation accompanied by conjugation with ubiquitin, and the action of lysosomal enzymes, also bring about the degradation of proteins.

Structure: The structure of protein can be explained at four levels.

  1. Primary: This is the linear sequence of amino acids, which are joined by covalent peptide bonds or linkages. This structure determines the function of the protein, and the composition of amino acids is responsible for its physical and chemical properties.
  2. Secondary: This is formed by twisting of the polypeptide chain, leading to a spatial arrangement of the protein. Secondary structure depends on the pattern of hydrogen bonds between the amide and carbonyl groups of the backbone. The α helix and the β sheet are the two main known types of secondary structure. The α helix is a rigid arrangement of a polypeptide chain in which amino acid side chains extend outward from the central axis. It is stabilized by extensive hydrogen bonding. In β sheets, hydrogen bonds are formed between neighbouring segments of polypeptide chains. The arrangement of polypeptide segments in β sheets is either parallel (same direction) or anti-parallel (opposite direction).
  3. Tertiary: This is the three-dimensional structure, which provides the stability of the protein. In it, hydrophobic side chains are held in the interior and hydrophilic groups lie on the surface of the protein.
  4. Quaternary: This is the spatial arrangement of subunits (polypeptide chains). The subunits are held together by noncovalent bonds such as hydrogen bonds, hydrophobic interactions and ionic bonds. Depending on the number of polypeptides, the protein is known as a monomer, dimer, trimer or tetramer. If the subunits are identical the prefix homo- is used (for example, homodimer); if they are different, the prefix hetero- is used.

Subfields of Proteomics

1) Sequence and Structural proteomics: Protein sequences allow the design of probes or primers that can be used to isolate the cDNA or genomic sequence. It is the protein sequence that acts as a bridge between the activity of a protein and the genetic basis of a particular phenotype. The increasing deposition of protein sequences and the consequent development of statistical techniques are facilitating the comparison of proteins. The three primary sequence databases, GenBank, EMBL (European Molecular Biology Laboratory) and DDBJ (DNA Data Bank of Japan), provide protein sequences translated from DNA sequences, whereas SWISS-PROT is a dedicated protein sequence data bank.

Similar sequences may give rise to similar structures. This idea has given birth to a new branch of proteomics known as structural proteomics, which has paved the way for the storage, presentation and comparison of structures, the inference of evolutionary relationships and the prediction of theoretical protein models. Such models are a boon for drug discovery research when crystallographic protein structures are not available.

2) Expression proteomics: It is concerned with the analysis of protein abundance, the separation of protein mixtures, the identification of individual components and their systematic quantitative analysis. This sub-branch emphasises differences between alternative states, such as health and disease, and the characterisation of post-translational modifications. The key tools used in such investigations include 2D gel electrophoresis, mass spectrometry, multidimensional chromatography and protein chips.

3) Interaction proteomics: It deals with the genetic and physical interactions among proteins, as well as interactions between proteins and nucleic acids or small molecules. The study of protein interactions provides insight not only into the function of individual proteins but also into how proteins function in pathways, networks and complexes. It seeks to create a proteome linkage map based on binary interactions between individual proteins and on higher-order interactions determined by the systematic analysis of protein complexes. Interactions between proteins and nucleic acids bear on processes such as gene regulation, while interactions of proteins with small molecules may shed light on how enzymes interact with substrates and receptors with their ligands, and may also play an important role in the drug development process. The key approaches used in studies of this kind are the yeast two-hybrid system, mass spectrometry, biochemical assays and X-ray crystallography.

4) Functional Proteomics: This lays emphasis on testing protein functions on a large scale such as testing expressed proteins for different enzymatic activities.

Tools used in Proteomics

  • 1) Databases: Protein, expressed sequence tag and complete genome sequence databases provide information on all the expressed proteins in organisms.
  • 2) Mass spectrometry: It provides information on molecular mass measurement (>100 kDa) and sequence analysis of proteins.
  • 3) Software tools: These tools determine the sequence of a protein with the aid of specialised algorithms and provide an automated survey of large amounts of mass spectrometry data for protein sequence matches.
  • 4) Protein separation technologies: They resolve complex protein mixtures into individual proteins and permit comparison of differences in protein levels between two samples. The key technologies include two-dimensional gel electrophoresis, SDS-polyacrylamide gel electrophoresis, high performance liquid chromatography, capillary electrophoresis, and affinity and ion exchange chromatography.
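
The matching of mass spectrometry data against protein sequences mentioned under tools 2 and 3 can be illustrated with a minimal Python sketch. It is not the method of any particular software package: it assumes unmodified linear peptides, uses standard monoisotopic residue masses, and the names `peptide_mass` and `match_mass`, the candidate peptides and the tolerance of 0.01 Da are all hypothetical choices made for illustration.

```python
# Monoisotopic residue masses in daltons (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water accounts for the free amino and carboxyl termini


def peptide_mass(sequence):
    """Return the monoisotopic mass of an unmodified linear peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def match_mass(observed, candidates, tolerance=0.01):
    """Return the candidate peptides whose mass lies within `tolerance` Da of `observed`."""
    return [p for p in candidates if abs(peptide_mass(p) - observed) <= tolerance]


# Hypothetical candidate peptides standing in for a sequence database.
candidates = ["GAS", "PVT", "GG", "KR"]
print(match_mass(315.18, candidates))
```

Real search software also has to handle post-translational modifications, charge states and fragment ions, which this sketch deliberately ignores.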

Applications

  • 1) Identification and cataloguing of proteins.
  • 2) Identification of the proteins present in a sample in relation to its differentiation, developmental state or disease state, or after exposure to a drug, chemical or physical stimulus.
  • 3) Determination of how proteins interact with each other in living systems and characterisation of proteins in more complex networks.
  • 4) Mapping of the post-translational modifications of proteins.

Challenges: No single technological approach is suitable for every application. The challenges before the proteomics community are to integrate and automate these approaches, to use better materials, and to advance instrument design and methodology so as to improve sensitivity, resolution and repeatability, in order to provide a comprehensive analysis of complex biological systems.

1.3 BIOINFORMATICS

Bioinformatics is defined as the computational handling and processing of genetic information. Its goal is to enable the discovery of new biological insights and to create a global perspective from which unifying principles can be perceived. It emphasises organising data so that information can be accessed and generated, developing tools to analyse the data, and interpreting the data in a biologically meaningful manner. In analysing data, two approaches, data mining and knowledge discovery, are taken into consideration. These approaches involve generating and testing hypotheses about the function or structure of a gene or protein of interest by homology searching. Besides deriving meaningful biological information from data, bioinformatics also serves the scientific community with resources such as databases.

Bioinformatics plays an important role in many areas of biological research, such as genomics, transcriptomics, proteomics, structural biology, genetics, molecular biology and evolutionary biology. In genomics, bioinformatics is used particularly in genome sequencing, mapping, genome annotation, comparison of multiple genomes, calculation of evolutionary distance and single nucleotide polymorphism discovery. Applications in transcriptomics include the study of transcribed sequences, both full-length cDNAs and expressed sequence tags, and the analysis of gene expression data. In proteomics, bioinformatics is helpful in the analysis of protein sequences and protein abundance and in the determination of protein structure, either empirically or computationally. In molecular biology, bioinformatics contributes significantly to the analysis of protein-protein interactions and molecular pathways and to systemic studies of gene regulation. In genetics, bioinformatics is useful in the discovery of new molecular genetic markers such as SNPs and in the use of these and other markers to dissect the genetic basis of disease and other phenotypes. Bioinformatics is also helpful in studies of evolution and phylogeny.

Tools used in Bioinformatics: Various tools are used in bioinformatics research: the internet; search engines such as Google, Scirus, AltaVista, Lycos, HotBot, Northern Light and Dogpile; resources such as the National Center for Biotechnology Information (NCBI) and PubMed; sequence analysis tools such as BLAST, FASTA and multiple sequence alignment (MSA); and visualisation tools such as RasMol, Jmol and Cn3D.

Computational methods help in analysing data and formulating hypotheses. Sequence data are the most abundant type of biological data available electronically. Pairwise sequence comparison is used in bioinformatics applications to search sequence databases, build evolutionary trees, identify characteristic features of protein families, create homology models, compare genomes, explore sequence determinants of protein structure and connect expression data to genomic information.

Sequence data can be analysed to determine sequence characteristics and used for sequence comparison, multiple sequence alignment, motif discovery and phylogenetic inference. Sequence databases are available on the internet and are searched mostly for similarity. The tools used for similarity searches include BLAST and FASTA.

BLAST: It stands for Basic Local Alignment Search Tool. This algorithm is used to perform sequence similarity searches. A server was established at NCBI to support BLAST, and it is widely used for sequence database searches. An independent set of BLAST programs, known as WU-BLAST, was developed at Washington University. Like NCBI BLAST, it performs similarity searches and produces gapped local alignments. BLAST uses statistical methods to evaluate sequence similarity scores. The BLAST algorithm increases the speed of sequence alignment by searching first for common words, or k-tuples, in the query sequence and each database sequence. It searches only for words that are significant. For proteins, the significance of word matches is evaluated with the BLOSUM62 amino acid substitution matrix. The word length in the BLAST algorithm is 3 for proteins and 11 for nucleic acids. The latest version of BLAST is BLAST 2, which reports gapped alignments of query and database sequences. BLAST has a filtering feature that masks low-complexity regions in the query sequence (runs of repeated sequence characters), which would otherwise produce artificially high-scoring alignments.
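
The word-matching step described above can be sketched in Python as follows. This is a simplified illustration, not the NCBI implementation: it finds only identical words of length 3, whereas BLAST also scores similar words with a substitution matrix and then extends the hits into alignments. The function names and the two protein fragments are hypothetical.

```python
def words(sequence, k):
    """Map each word (k-tuple) to the list of positions where it starts."""
    table = {}
    for i in range(len(sequence) - k + 1):
        table.setdefault(sequence[i:i + k], []).append(i)
    return table


def seed_hits(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where an identical word occurs in both."""
    query_words = words(query, k)
    hits = []
    for j in range(len(subject) - k + 1):
        for i in query_words.get(subject[j:j + k], []):
            hits.append((i, j))
    return sorted(hits)


# Made-up protein fragments used only for illustration.
query = "MKVLAAGIT"
subject = "GGKVLAAPIT"
print(seed_hits(query, subject))
```

Indexing the query once and scanning each database sequence against that index is what makes the word search fast compared with aligning every pair of positions.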

There are a number of variations of the BLAST program:

  • Blastp compares an amino acid query sequence against a protein sequence database.
  • Blastn compares a nucleotide query sequence against a nucleotide sequence database.
  • Blastx searches the six-frame translation products of a nucleotide query sequence against a protein database.
  • Tblastn searches a protein query sequence against a translated nucleotide sequence database.
  • Tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
  • MegaBLAST searches for similar sequences that are 300 to 100,000 bp long. A long word is used for searching, and the gap penalty is calculated from the match and mismatch scores.
  • RPS-BLAST scans for conserved domains in a protein sequence.
  • BLASTcl3 is a network client that is used to access the BLAST server.
  • Standalone BLAST programs are executable versions of all the BLAST programs for the Windows, Unix and Macintosh operating systems.
  • PSI-BLAST (position-specific iterated BLAST) and PHI-BLAST (pattern-hit initiated BLAST) are used to search for domains in the query protein sequence and in database sequences.
  • BEAUTY (BLAST enhanced alignment utility) adds features to BLAST such as summaries of the locations of HSPs, PFAM domains and Prosite patterns.
  • BLAST searching with a cobbler (consensus) sequence is used to find the majority residues in a multiple sequence alignment.
  • The BLAST2 program is used to align very long sequences.
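
The six-frame translation used by Blastx, Tblastn and Tblastx can be illustrated with a short Python sketch. It assumes the standard genetic code and unambiguous DNA, and it is only an illustration of the idea, not the code used inside BLAST; the function names and frame labels (`+1` to `-3`) are hypothetical.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, '*' marks a stop codon.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of `dna`; 'X' marks unknown codons."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return the three forward and three reverse-strand translations of `dna`."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```

Translating in all six frames lets a protein-level comparison be made even when the coding strand and reading frame of the nucleotide sequence are unknown.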

FASTA: This program is used for aligning pairs of protein or DNA sequences. It searches for matching sequence patterns, or words, called k-tuples; each pattern contains k consecutive matching letters. It then attempts to build a local alignment from the word matches. The program is used for database searches: FASTA compares the query protein or DNA sequence with the target sequences in the database and gives the best-matching sequences and local alignments of the matched sequences. To search for similarity, FASTA uses a hashing method in which a table of the positions of each word of length k, or k-tuple, is constructed for each sequence. The relative position (offset) of each word is calculated by subtracting its position in the first sequence from its position in the second, and words with the same offset indicate a region of alignment between the two sequences. The number of comparisons increases linearly with the average sequence length. The k-tuple length in the FASTA program is 1 or 2 for protein sequences and 4 to 6 for nucleic acid sequences. Other versions of FASTA have also been reported. Among them, TFASTA compares a query protein sequence with a six-frame translation of the DNA sequences of the database; FASTS/TFASTS compare a set of short peptide fragments against a protein sequence database or against a DNA sequence database translated in all six reading frames; FASTX and FASTY translate a query DNA sequence in all three forward reading frames and compare all three frames with a protein sequence database; and TFASTX and TFASTY compare a query protein sequence with a DNA sequence database, translating each DNA sequence in all six possible reading frames.
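
The hashing step described above for FASTA can be sketched in Python. This is a minimal illustration under simplifying assumptions (identical k-tuples only, no scoring or gapped alignment afterwards), not the FASTA program itself; the function names and the two short sequences are chosen only for illustration.

```python
from collections import Counter


def ktuple_positions(sequence, k):
    """Build the lookup table: each k-tuple mapped to its start positions."""
    table = {}
    for pos in range(len(sequence) - k + 1):
        table.setdefault(sequence[pos:pos + k], []).append(pos)
    return table


def offset_counts(first, second, k=2):
    """Count shared k-tuples per offset (position in second minus position in first)."""
    first_table = ktuple_positions(first, k)
    counts = Counter()
    for pos2 in range(len(second) - k + 1):
        for pos1 in first_table.get(second[pos2:pos2 + k], []):
            counts[pos2 - pos1] += 1
    return counts


# Short illustrative protein sequences.
first = "HARFYAAQIVL"
second = "VDMAAQIA"
counts = offset_counts(first, second)
best_offset, shared_words = counts.most_common(1)[0]
print(best_offset, shared_words)
```

An offset shared by many words marks a diagonal on which the two sequences can be aligned without gaps, which is the region FASTA goes on to score and extend.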

Multiple Sequence Alignment (MSA): It is an alignment of three or more sequences. It aims to place sequence positions related by function and evolution in the same column of the alignment, allowing for mismatches and gaps (deletions or insertions). In MSA, both global and local alignments are used. For global alignment, the dynamic programming algorithm can be used to align three sequences; beyond that number, only a small number of relatively short sequences can be analysed. The methods used include progressive methods (ClustalW, ClustalX, MAFFT, MAVID, MSA, MultiPipMaker, POA, PRALINE, T-COFFEE), which start by aligning the most alike sequences and then build up the alignment by adding more sequences; iterative methods (DIALIGN, PRRP, SAGA), which initially align groups of sequences and then revise the alignment to achieve a more reasonable result; methods that align sequences on the basis of conserved patterns found in the same order in the sequences; statistical methods that generate probabilistic models of the sequences; and graph-based methods.
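
As an illustration of the dynamic programming idea, the following Python sketch computes a global alignment score for two sequences and then picks the most alike pair, which is where a progressive method would start. It is a pairwise simplification, not a multiple alignment program, and the scoring values (match +1, mismatch -1, gap -2), function names and sequences are assumptions made for this example.

```python
from itertools import combinations


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of `a` and `b` by dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # aligning a prefix of `a` against nothing
        score[i][0] = i * gap
    for j in range(1, cols):          # aligning a prefix of `b` against nothing
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]


def most_similar_pair(sequences):
    """Return (score, a, b) for the highest-scoring pair of sequences."""
    return max(
        (global_alignment_score(a, b), a, b) for a, b in combinations(sequences, 2)
    )


sequences = ["GATTACA", "GATTTACA", "CCTGA"]
print(most_similar_pair(sequences))
```

The table for two sequences has one cell per pair of positions; for three sequences it becomes a cube, which is why exact dynamic programming quickly becomes impractical as more sequences are added.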

Local MSA methods align the most similar regions in sequences. The approaches include profile analysis, which identifies a highly conserved portion of the alignment and produces a scoring matrix called a profile. A profile includes scores for amino acid substitutions and gaps in each column of the conserved region. In block analysis, blocks (conserved regions with substitutions but without gaps) are searched for and used in sequence alignments. Pattern searching and statistical methods scan for a localised region of sequence similarity in a set of sequences.
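
The idea of a profile can be sketched in Python by recording, for each column of an ungapped block, how often each residue occurs. This is a deliberately simplified stand-in: real profiles hold substitution-based scores and gap penalties rather than raw frequencies, and the function names and the four-sequence block are hypothetical.

```python
from collections import Counter


def build_profile(block):
    """Return one dict of residue frequencies per column of an ungapped block."""
    length = len(block[0])
    if any(len(seq) != length for seq in block):
        raise ValueError("all sequences in a block must have the same length")
    profile = []
    for column in zip(*block):
        counts = Counter(column)
        profile.append({res: n / len(block) for res, n in counts.items()})
    return profile


def score_segment(profile, segment):
    """Sum the column frequencies of the residues in `segment` (higher means a better fit)."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, segment))


# Made-up block of four aligned sequence segments.
block = ["ACDE", "ACDF", "ACEE", "GCDE"]
profile = build_profile(block)
print(score_segment(profile, "ACDE"), score_segment(profile, "GCEF"))
```

A segment that follows the majority residue in every column scores highest, which is how a profile can be slid along a new sequence to find the region that best fits the conserved pattern.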

Structure visualisers: Protein structure data are stored as collections of x, y, z coordinates. The connectivity between atoms in proteins has to be taken into account, and for the visualisation to be effective a virtual 3D environment needs to be created. A protein structure visualisation program needs to be able to display user-selected subsets of atoms with correct connectivity, draw standard cartoon representations of proteins such as ribbons and cylinders, and recolour subsets of a molecule according to a specified parameter.

RasMol: It is a structure visualisation tool available for a wide range of operating systems. It reads molecular structure files in the standard PDB format. It comes in three display depths: 8, 16 and 32 bit. The molecule can be rotated in the window. It has File menu commands for opening a molecular structure file and Display menu commands for setting the molecular display style, with formats including ball and stick, cartoons and space fill. The Colour menu allows the colours of the molecule to be changed, the Options menu allows changes to the display style, and the Export menu allows the displayed images to be written in common electronic image formats such as GIF, PostScript and PPM. The help command allows users to create their own combinations of colours and structure display formats.

Cn3D: It allows viewing of protein structure files in NCBI ASN.1 format. It opens two windows: a colour structure viewer, in which a molecule can be rotated, coloured according to different properties and rendered in different display formats; and a sequence viewer, which allows you to view sequences and alignments corresponding to the displayed protein and to add graphics to the sequence display to highlight the location of secondary structure features.

Biological Databases: A biological database is a large, organised body of persistent data, usually associated with computerised software designed to update, query and retrieve components of the data stored within the system. These databases help in gaining insight into biological phenomena, from the structure of biomolecules and their interactions to the whole metabolism of an organism, and in understanding the evolution of species. Databases are classified as primary databases (DDBJ, EMBL, GenBank), protein sequence databases (SWISS-PROT, Protein Information Resource), protein family and domain databases (Pfam, PROSITE), protein structure databases (PDB, SCOP), protein-protein interaction databases (BioGRID, STRING), pathway databases (KEGG) and microarray databases (ArrayExpress, Gene Expression Omnibus).

DDBJ (DNA Data Bank of Japan): It is run by the National Institute of Genetics, Japan. It is the only nucleotide sequence database in Asia. It works in collaboration with EMBL and GenBank. It collects experimentally determined sequence data mainly from Japanese researchers but also accepts data from others. The database is a collection of “entries”, the entry being the unit of data. Each entry includes a nucleotide sequence and information on the submitters, references, source organism and biological nature of the sequence, such as gene function and other properties.

EMBL (European Molecular Biology Laboratory): This database is maintained by the European Bioinformatics Institute, Cambridge, U.K. It is Europe's primary nucleotide sequence database. The data consist of DNA and RNA sequences drawn from individual researchers, genome sequencing projects and patent applications. As of 30 August 2012, it contained 252,106,363 sequence entries comprising 450,481,663,919 nucleotides.

GenBank: It is the NIH genetic sequence database. As of April 2011, it held 126,551,501,141 bases in 135,440,924 sequence records. Entrez Nucleotide is used for sequence identification and annotation. It is divided into CoreNucleotide (the main collection), dbEST (expressed sequence tags) and dbGSS (genome survey sequences). The BLAST program can be used to align query sequences to GenBank sequences.

SWISS-PROT: It is a curated protein sequence database established in 1986 and maintained by the Department of Medical Biochemistry of the University of Geneva and the European Bioinformatics Institute (EBI). The characteristic features of SWISS-PROT include high-level annotation, minimal redundancy and integration with other databases.

PIR (Protein Information Resource): It was established in 1984 by the National Biomedical Research Foundation (NBRF) and helps researchers in the identification and interpretation of protein sequence information and provides tools.

Pfam: It is a database of protein families. It has two components, Pfam-A and Pfam-B. Pfam-A covers a large proportion of sequences with high-quality, manually curated protein families, while Pfam-B is a collection of lower-quality families that is useful for identifying functional regions. Pfam also generates clans, which are groupings of related protein families.

PROSITE: This database is a collection of protein families and domains. It holds patterns and profiles for more than a thousand protein families and domains.

PDB (Protein Data Bank): This database holds experimentally determined structures of proteins, nucleic acids and complex assemblies. It was established in 1971 at Brookhaven National Laboratory. In 1998 it came under the umbrella of the Research Collaboratory for Structural Bioinformatics. It contains information about coordinates, deposited structures and the method of structure determination. As of 2 October 2012, about 85 thousand structure records were available.

SCOP (Structural Classification of Proteins): It organises and classifies proteins on the basis of their evolutionary and structural relationships. It is organised into four hierarchical levels: family, superfamily, fold and class.

BioGRID (The Biological General Repository for Interaction Datasets): It is a database of genetic and protein interaction data from model organisms and humans. As of 1 October 2012, it held 557,934 interactions curated from 34,996 publications in the primary literature.

STRING (Functional Protein Association Network): It is a database of known and predicted protein interactions. The interactions include direct (physical) and indirect (functional) associations. The data are drawn from genomic context, high-throughput experiments, co-expression and previous knowledge. The database attempts to integrate interaction data from these sources for a large number of organisms and transfers information between organisms. It holds 5,214,234 proteins from 1,133 organisms.

KEGG: It was initiated in 1995. This database integrates genomic, chemical and systemic functional information. It provides a reference knowledge base for linking genomes to life through the process of PATHWAY mapping, in order to infer systemic behaviours of the cell or the organism. It also links genomes to the environment via BRITE mapping. KEGG BRITE, an ontology database, presents functional hierarchies of various biological objects, including molecules, cells, organisms, diseases and drugs, as well as the relationships among them. KEGG PATHWAY presently has a combined map of about 120 existing pathway maps. Smaller pathway modules are stored in KEGG MODULE. KEGG DRUG contains information about all approved drugs in the US and Japan, and KEGG DISEASE provides links to disease genes, pathways, drugs and diagnostic markers.

Microarray databases: A microarray experiment involves the hybridization of a nucleic acid sample (target) to a very large set of oligonucleotide probes attached to a solid support, in order to determine a sequence, to detect variations in a gene sequence or in expression, or to map genes. Various organisations have created microarray databases: the National Center for Biotechnology Information, US (Gene Expression Omnibus); the European Bioinformatics Institute (ArrayExpress); and the National Institute of Genetics, Japan (Center for Information Biology Gene Expression database). The purpose of these databases is to store the minimum information about microarray experiments and to allow researchers to repeat the experiments.

Applications of Bioinformatics (BI):

  • is useful for gene and protein prediction, evolutionary distance calculation, active site identification, construction of novel mutations and characterisation of alleles of diseases.
  • is applied in the analysis of the organisation of genes and genomes.
  • may be helpful in identification of regulatory elements in genes.
  • focuses on the development of algorithms to assess relationships among members of large data sets.
  • allows the study of evolutionary events such as gene duplication and lateral gene transfer.
  • is useful in conservation of genomes of endangered species and biodiversity management.
  • facilitates matching of data generated by mass spectrometers to the protein sequence databases.
  • is helpful in the automation, processing, quantification and analysis of large amount of data from biomedical images.
  • provides first-hand information, in the form of databases, on chemical structures, reaction kinetics, synthetic methods and toxic chemical substances.
  • can be used to devise and implement structures and algorithms that improve the communication, understanding and management of medical information.
  • may be used to develop models for simulating intracellular molecular processes to predict dynamic behaviour of living cells.
  • tools are used for prediction of function of genes and 3D structure prediction of proteins, selection of suitable ligands, identification of protein motifs and domains.
  • is useful to maintain the sequence data and allow analysis of data by making availability of tools.

Challenges: The key challenges to bioinformatics involve managing the current flood of raw data and aggregate information, the diversity of sources and formats, variable conditions and descriptions of experimentation, the quality of the experimental evidence behind the data, data processing and the evolving knowledge arising from the study of the genome and its manifestations.

1.4 PHARMACOGENOMICS

Pharmacogenomics may be defined as the genome-wide analysis of the genetic determinants of drug efficacy and toxicity. This branch of science began as pharmacogenetics, a name coined by Vogel in 1959. The main goals of pharmacogenomics are to evaluate the role of genetic variants in drug metabolism and effect, to develop innovative ways of minimizing harmful drug effects and to optimise care for individual patients. Various tools such as gene mapping, gene sequencing, statistical genetics and gene expression analysis are used to derive this information.

Various factors influence drug response, including age, genetics, sex, race/ethnicity, disease state, absorption, distribution, metabolism, excretion, body weight, height, receptor sensitivity, organ dysfunction, accompanying medications, smoking, diet, alcohol, stress, pollution, socio-economic status, drug adherence, physiological changes such as pregnancy and lactation, unmeasured nucleotide or structural variation, complex methylation/epigenetic mechanisms and gene-environment interactions. Medication response rates of around 30 to 60% have been observed in the treatment of many diseases such as depression, schizophrenia and rheumatoid arthritis. It has been reported that genetic factors can account for 20 to 95% of the variation in drug response. Patients are classified into poor, normal and ultra-rapid metabolizers depending on how quickly they metabolize the drug. When a standard dose is given to a poor metabolizer, the drug is metabolized slowly, resulting in an increased risk of toxicity; for an ultra-rapid metabolizer, the standard dose may be ineffective.

Genetic variations related to drug response can be classified into three types, namely pharmacokinetic, pharmacodynamic and idiosyncratic, based on their mechanism of action. Pharmacokinetic variations are associated with drug transporters and metabolizing enzymes and lead to alterations in the uptake, distribution and elimination of drugs. Pharmacodynamic variations occur in the drug target or in a component of the target pathway, leading to altered drug efficacy. Pharmacodynamic targets include receptors, ion channels, enzymes, transducer and regulatory proteins and immune molecules. The third type, idiosyncratic variations, is involved in unintended actions of a drug outside its therapeutic indication.

Inter-individual variations in drug response have been reported based on genetic variations in drug metabolizing enzymes, receptors, transporters and pathways. The drug metabolizing enzymes are a heterogeneous group of proteins involved in the metabolism of drugs. They fall into two groups, namely oxidative and conjugative drug metabolizing enzymes. The oxidative enzymes include cytochrome P450 and flavin monooxygenase; these catalyze the introduction of an oxygen atom into substrate molecules, leading to hydroxylation or demethylation. The conjugative enzymes catalyze the coupling of endogenous small molecules to xenobiotics, which results in the formation of soluble compounds that are more readily excreted. This group includes UDP glycosyltransferases (UGTs), glutathione transferases, sulfotransferases and N-acetyltransferases.

Classical examples of variation in drug response: Though many examples are available, only a few are given below.

Warfarin: It is an anticoagulant used to prevent stroke and venous thromboembolism. Its use is limited by a narrow therapeutic window, variability in dose-response, interactions with drugs and diet and the risk of serious bleeding. The dose is prescribed based on the anticoagulation response measured by a laboratory assay, the international normalised ratio (INR). Clinical factors account for 17-21% of the variation in warfarin dosing, and polymorphisms in genes such as cytochrome P450 (CYP) 2C9 and vitamin K epoxide reductase complex subunit 1 (VKORC1) are responsible for 30-35%. Over-anticoagulation and a risk of bleeding were observed in carriers of one or more variant alleles of CYP2C9. An increased risk of adverse cardiac events was observed in those who possess the variant VKORC1. A 42.6% benefit of warfarin treatment was observed after a genotype-guided drug regimen.

Clopidogrel: Clopidogrel is a standard medication in cardiac patients undergoing percutaneous coronary intervention (PCI). It is a prodrug that requires metabolic activation, in a reaction catalyzed by the cytochrome P450 enzyme CYP2C19, into its active metabolite. It has been reported that around 25% of patients experience a subtherapeutic antiplatelet response. Compared with carriers of the wild type allele of CYP2C19, variant allele carriers showed a lower capacity to metabolize clopidogrel into its active metabolite and to inhibit platelet activation, and a higher risk of adverse cardiovascular events. Paraoxonase 1 (PON1) has been reported as another mediator of the platelet effect of clopidogrel; it was found to drive the conversion of the drug into the active metabolite. A polymorphism in PON1 (PON1 Q192R) was observed to affect the platelet response, clopidogrel pharmacokinetics and the risk of thrombosis. Limited platelet inhibition and decreased plasma levels of both active PON1 and clopidogrel metabolites were observed in individuals homozygous for PON1 QQ192.

Flucloxacillin: It is an antibiotic used for the treatment of staphylococcal infections. Its use has been associated with cholestatic hepatitis in approximately 8.5 cases per 100,000 patients. A single nucleotide polymorphism marking HLA-B*5701 has shown a strong association with this hepatic injury.

Abacavir: It is a nucleoside analogue used to treat patients with HIV type 1 infection. Within six weeks of treatment, approximately 5% of patients develop a multisystem hypersensitivity reaction with symptoms of fever, rash and gastrointestinal discomfort, which subsides within 72 hours of discontinuation of the drug. Approximately 74% of carriers of the HLA-B*5701 haplotype showed hypersensitivity when given abacavir.

Ribavirin: Chronic hepatitis C is a liver disease caused by hepatitis C virus infection. Patients with this infection are treated with pegylated interferon and ribavirin. Approximately 50% of patients, depending on ethnic origin, respond positively to this treatment and become virus free. Pharmacogenomic studies have shown that the CC genotype at the SNP rs12979860, 3 kb upstream of the IL28B gene, is associated with response to pegylated interferon and ribavirin in patients with chronic genotype 1 infection and with natural clearance of the virus, and that the presence of the G allele at rs8099917 is associated with non-response.

Approaches of Pharmacogenomics

Candidate gene approach: The candidate gene approach uses experimentally derived a priori knowledge about a disease or a drug, drawn from both public and proprietary databases, to identify candidate genes whose expression may impact drug action or disease pathogenesis. In this approach, genes are identified based on metabolic pathways, molecular targets, biological response pathways and/or disease risk. The genes are then ranked by the perceived likelihood of their involvement in drug response. Although this approach has been tested in unrelated subjects and in family studies, population based studies are most commonly employed as they can detect relative risks as low as 1.5. The main focus of this approach is to find whether the case (non-responder) and the control (responder) differ in genetic variation that is assumed to be functional and involved in the observed phenotypic variability. One successful example of this approach is the identification of individual response to the drug 6-mercaptopurine. The drawbacks of this approach are that it fails to consider the potential contribution of other genes, particularly those whose function is yet to be understood; that information on the functionality of genetic variability is not always available or may be unreliable; and that it relies on variability at specific points in the whole gene sequence.
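
The comparison between cases (non-responders) and controls (responders) is often summarised as an odds ratio. The following is a minimal Python sketch, assuming a simple 2x2 table of variant carriers and non-carriers in each group. The function name, the counts and the choice of Woolf's logit method for the 95% confidence interval are illustrative assumptions and are not taken from any particular study.

```python
import math

def odds_ratio(case_carriers, case_noncarriers,
               control_carriers, control_noncarriers):
    """Return (odds ratio, lower 95% limit, upper 95% limit) for a 2x2 table.

    The confidence interval uses Woolf's logit method, which assumes
    that all four cell counts are non-zero.
    """
    a, b = case_carriers, case_noncarriers
    c, d = control_carriers, control_noncarriers
    estimate = (a * d) / (b * c)
    # Standard error of log(odds ratio)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - 1.96 * se)
    upper = math.exp(math.log(estimate) + 1.96 * se)
    return estimate, lower, upper

# Hypothetical counts: 200 non-responders (cases), 200 responders (controls)
estimate, lower, upper = odds_ratio(60, 140, 40, 160)
print(f"OR = {estimate:.2f}, 95% CI = {lower:.2f} to {upper:.2f}")
```

An interval that excludes 1 suggests that carrier status differs between non-responders and responders; with the modest effect sizes typical of such studies, large samples are needed before the interval becomes narrow.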

Genome wide approach: This approach does not require prior knowledge of the target gene. It attempts to identify associations between genetic variants and a given disorder by detecting marker SNPs and analysing differences between case and control groups. This approach has been successful in mapping rare, highly penetrant diseases in family pedigrees and in identifying genes for monogenic traits. Its strategies are based on linkage disequilibrium relationships and the structuring of haplotype blocks in the genome. It requires thousands of single nucleotide polymorphisms (SNPs) as genetic markers, and it has been estimated that a minimum of 300,000 to 500,000 evenly spaced SNPs is needed to find a marker within the range of disequilibrium. One successful example of the genome wide approach is the identification of a genetic variant in the drug transporter gene SLCO1B1 responsible for statin induced myopathy. The success of the genome wide approach depends on study design, sample size, quality control of genotyping, collection bias, individual sample data, ethnicity details and the ability of high throughput technologies to produce large volumes of data. Validating the data in an independent cohort in an unbiased manner is a requisite. The reported odds ratios are relatively small, and the practical utility of the information generated by this approach remains controversial. Whole genome studies involve the use of microarrays, which are out of reach of routine clinical practice. The complexity of the analysis and the requirement of additional guarantees in diagnosis make this approach far from applicable. The challenging issues in this approach are the matching of control groups for factors such as underlying disease and ancestry, contributions of genetic variants not detected by current platforms, analysis of gene-gene and gene-environment interactions in determining phenotype, affordable sequencing, storage of sequence data, information management and methods of genome analysis.
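
As a rough illustration of what 300,000 to 500,000 evenly spaced SNPs means in terms of marker density, the sketch below divides an assumed genome length by the number of markers. The genome length of about 3.2 billion base pairs and the function name are assumptions introduced for this example.

```python
# Assumed approximate length of the haploid human genome, in base pairs
GENOME_BP = 3_200_000_000

def mean_spacing_bp(n_snps, genome_bp=GENOME_BP):
    """Average distance between neighbouring markers if they are evenly spaced."""
    return genome_bp / n_snps

spacing_dense = mean_spacing_bp(500_000)   # upper end of the estimate
spacing_sparse = mean_spacing_bp(300_000)  # lower end of the estimate

print(f"500,000 SNPs: one marker every {spacing_dense:,.0f} bp")
print(f"300,000 SNPs: one marker every {spacing_sparse:,.0f} bp")
```

Under this assumption the markers lie roughly 6 to 11 kb apart, which gives a sense of the scale over which linkage disequilibrium must hold for a marker to tag a nearby causal variant.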

Applications of Pharmacogenomics

  • 1) Pharmacogenomic studies are helpful in developing therapeutic agents suitable for genetically identifiable human sub group populations.
  • 2) Pharmacogenomics research can decrease the time and number of subjects needed for clinical trials.
  • 3) Pharmacogenomics may facilitate the identification of biomarkers to optimize drug selection, dose and treatment duration and avert adverse drug reactions.
  • 4) Pharmacogenomics may be helpful in reducing national health care bills in developing countries by taking genomic variations between populations into consideration.
  • 5) Study of the genotypes of populations with little admixture may be helpful in predicting drug responses without testing each individual.
  • 6) Pharmacogenomics may help to improve our understanding of the mechanisms underlying variability in human physiology and its response to drug therapy with a final goal of improving therapy.
  • 7) Information generated from pharmacogenomics studies may help health professionals and patients to make informed decisions about treatment options.
  • 8) Generation of data on new and existing drugs will help in effective utilisation of scarce resources.
  • 9) Pharmacogenomics may be helpful in reducing the costs associated with inappropriate drug treatments or hospitalisations due to serious adverse reactions.
  • 10) Pharmacogenomics testing may produce collateral information that is medically beneficial; for example, pharmacogenomic information on a polymorphism in a dopamine receptor may help in smoking cessation.
  • 11) Long term applications of pharmacogenomics include reducing the burden of disease, improving the economic efficiency of the health care system and reducing some disparities in health care access and health outcomes.
  • 12) Pharmacogenomics may be helpful in offering alternatives to the traditional drug development.

Challenges: The challenges of pharmacogenomics include establishing clinical utility in order to support the value of genotyping; the unaffordability of the technology for wider application outside the research and development setting; unequal treatment or health disparities arising from social factors, and the consequent ethical and legal issues connected to pharmacogenomics testing; measurement challenges such as the presence of multiple pathways involved in drug effects, multiple polymorphisms, gene-environment interactions, the length of time between testing and clinical outcomes and multiple determinants of clinical outcomes; the time and cost involved; developing technology to find specific SNPs; finding patients who fit the criteria and possible users of the drugs; defining cut-offs within adverse drug response distributions; the lack of reproducibility of some gene-drug pairs; and the questionable utility of the findings in a large population.

1.5 STEM CELL RESEARCH

Stem cells are undifferentiated precursor cells that are characterised by self renewal and differentiation. The most common definition is based on the properties of hematopoietic stem cells, such as multipotency, asymmetric division, quiescence, life-long self renewal, niche dependence and long term repopulation ability upon in vivo transplantation. Not all stem cells share these properties: embryonic stem cells do not self renew in vivo beyond the blastocyst stage; muscle stem cells are not multipotent; and mesenchymal stem cells do not transplant robustly. Depending on context and organism, stem cell properties such as monopotency, transient proliferation, lack of a niche and inability to transplant in vivo are acceptable. Some stem cells, like mesenchymal stem cells, have the potential to generate cells not only of their own lineage but also of other lineages. This phenomenon is known as plasticity or transdifferentiation. Homeostasis is the central mechanism in mammals through which stem cells maintain and preserve organ and tissue integrity by self renewal and multilineage differentiation.

Stem cell populations are heterogeneous. Although stem cells express specific protein markers, accumulating evidence suggests that these markers are transient and dynamic. Stem cells express a vast range of genes at the mRNA level. Stem cells undergo asymmetric mitotic division and produce two daughter cells with different fates: one retains the stem cell properties, while the other differentiates. This process is governed by the stem cell niche (environment), and detachment from this niche results in differentiation. Stem cells may also follow stochastic differentiation, wherein they combine asymmetric divisions with symmetric ones that form either two stem cells (symmetric renewal) or two differentiated cells (symmetric differentiation). Stem cells in the olfactory epithelium and muscle follow this kind of cell division.
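
The effect of these division modes on the size of the stem cell pool can be illustrated with a small simulation. This is a toy sketch rather than a biological model: it assumes that every stem cell divides once per round and that each division is independently a symmetric renewal, a symmetric differentiation or an asymmetric division with fixed probabilities. The function and parameter names are hypothetical.

```python
import random

def simulate_pool(n_stem, n_rounds, p_renew, p_diff, seed=0):
    """Simulate rounds of stem cell division.

    p_renew: probability of symmetric renewal (two stem cells)
    p_diff:  probability of symmetric differentiation (two differentiated cells)
    Remaining probability: asymmetric division (one stem, one differentiated).
    Returns (stem cell count, cumulative differentiated cell count).
    """
    rng = random.Random(seed)
    stem, differentiated = n_stem, 0
    for _ in range(n_rounds):
        next_stem = 0
        for _ in range(stem):
            r = rng.random()
            if r < p_renew:                 # symmetric renewal
                next_stem += 2
            elif r < p_renew + p_diff:      # symmetric differentiation
                differentiated += 2
            else:                           # asymmetric division
                next_stem += 1
                differentiated += 1
        stem = next_stem
    return stem, differentiated

# Purely asymmetric division keeps the pool exactly constant.
print(simulate_pool(100, 10, p_renew=0.0, p_diff=0.0))
# Balanced stochastic divisions keep the pool constant only on average.
print(simulate_pool(100, 10, p_renew=0.25, p_diff=0.25))
```

When the two symmetric modes are equally likely, each division yields one stem cell on average, so the pool is maintained at the population level even though individual divisions differ; any imbalance between the two probabilities makes the pool grow or shrink over successive rounds.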

There is no consensus in the literature on a uniform classification of stem cells. Based on their origin, stem cells have been classified as embryonic or adult/non-embryonic/somatic (postnatal to adult)/extra-embryonic stem cells (adult, cancer and induced pluripotent), or as hematopoietic and non-hematopoietic (mesenchymal stem cells). In terms of developmental potential, stem cells are categorized as totipotent, pluripotent, multipotent and unipotent. Totipotent stem cells are able to differentiate into all types of embryonic tissue as well as the trophoblast. These cells are found in the zygote and after its first cell divisions. Pluripotent stem cells are able to differentiate into cells of the three germ layers (ectoderm, mesoderm and endoderm) but not the trophectoderm lineage. Embryonic stem cells belong to this category. Multipotent stem cells, like hematopoietic stem cells, can produce a limited range of differentiated cell lineages. Unipotent stem cells, such as muscle progenitors, generate only one specific cell type.

Types of stem cells

Embryonic stem cells

  • 1) These cells are primitive cells and can self renew and differentiate into all cells from all three germ layers, namely ectoderm, endoderm and mesoderm.
  • 2) Derived from the inner cell mass of the blastocyst at approximately five days of development using an immunosurgical technique.
  • 3) Sources are embryos created via in vitro fertilisation/somatic cell nuclear transfer and fetuses obtained through elective abortion.
  • 4) These cells do not conform to several stem cell requirements, such as niche dependence and the capacity to undergo asymmetrical cell division.
  • 5) Mouse, but not human, embryonic stem cells require leukemia inhibitory factor for their propagation.
  • 6) These cells are grown on feeder layers.
  • 7) These cells can be maintained in undifferentiated state for at least 80 passages.
  • 8) They can form embryoid bodies, the cell aggregations containing all three embryonic germ layers.
  • 9) These cells have the potential of forming teratomas in vivo.
  • 10) These cells are considered the optimal stem cell source for regenerative medicine applications in view of their potential to form any tissue in the body.
  • 11) Application of these cells is limited by ethical, political, biological and regulatory hurdles.

Epiblast-stem cells (EpiSC)

  • 1) Cells are derived from the pre-gastrula embryo.
  • 2) Like mouse embryonic stem cells, they express markers such as Oct4, Nanog and Ssea-1, and form teratomas.
  • 3) These cells do not require feeders or leukemia inhibitory factor (Lif) for culture but can be expanded in the presence of Activin/nodal signaling.
  • 4) These cells are unable to contribute to somatic cells and the germ line following injection into blastocyst or following morula aggregation.

XEN Stem cells

  • 1) These cells are isolated from extra embryonic endoderm.
  • 2) Express markers like Sox7, Hnf4, Gata4 and Foxa2 but lack expression of Oct4 and Nanog.
  • 3) In a chimera assay, these cells contributed to the parietal endoderm and to the parietal yolk sac at later stages during embryo development.

XEN P Stem cells

  • 1) These cells are isolated from rat blastocysts and express the genes Oct4, Gata6 and Ssea-1.
  • 2) These cells require leukemia inhibitory factor (Lif) for ex vivo maintenance.
  • 3) Upon morula aggregation or injection into the blastocyst, these cells contribute to primitive/visceral and parietal extra embryonic endodermal lineages but not the embryo proper.
  • 4) They form tumors when injected postnatally.

Hematopoietic stem cells (HSCs)

  • 1) These are the best characterised stem cells.
  • 2) Since the size of the total pool of HSCs remains roughly the same in the absence of injury, about half of all HSC divisions must, at the population level, be self-renewing.
  • 3) In the steady state, HSCs redistribute via the bloodstream among distinct anatomical locations and therefore are likely to be found in all tissues of the body.
  • 4) These cells are multipotent and can differentiate into all myeloid and lymphoid blood lineages.
  • 5) Through cell fusion, HSCs may contribute to other tissues.
  • 6) Sources of these cells are bone marrow, peripheral blood and umbilical cord blood.
  • 7) These stem cell transplantations are performed using HLA matched siblings, parents or donors.
  • 8) These cells represent less than 0.05% of the total bone marrow.
  • 9) They are enriched using a complement of cell surface antigens.

Mesenchymal Stem cells

  • 1) These are adherent cells with fibroblast like morphology and are capable of self replication through many passages.
  • 2) These cells are multipotent and are capable of differentiating into multiple types of tissue like bone, cartilage, muscle, neuron, cardiomyocyte and hepatocyte.
  • 3) These cells have proangiogenic and immunomodulatory effects.
  • 4) These cells have the potential utility for treating a variety of diseases and disorders like graft versus host disease, organ transplantation, cardiovascular disease, brain and spinal cord injury, lung, liver and kidney diseases and skeletal injuries.
  • 5) They have conserved long telomere lengths and do not form teratomas in vivo.
  • 6) These cells are isolated from a number of tissues like bone marrow, adipose tissue, umbilical cord blood, placenta, amniotic fluid, amniotic membrane, gingiva, circulating blood, synovium, trabecular bone, dermis, dental pulp and lung, and can be expanded in vitro on a clinical scale.
  • 7) These cells participate in maintaining essential environment to support the hematopoietic stem cells in the bone marrow.

Pancreatic stem cells (PSC)

  • 1) They provide an alternative renewable source of surrogate β cells.
  • 2) They have the potential for lineage-restricted differentiation and are capable of developing into a pancreatic phenotype.
  • 3) The adult pancreatic stem or progenitor cells are found in duct cells, exocrine tissue, nestin positive islet-derived progenitor cells, neurogenin-3-positive cells, pancreas-derived multipotent precursors and mature β cells.
  • 4) In the present state of knowledge, the expansion potential of PSC is limited.
  • 5) Pdx-1, nestin and Ngn-3 markers have been shown to be expressed by these cells.

Cancer stem cells (CSCs)

  • 1) These are a small population of cancer cells that have the ability of unlimited growth, self renewal and differentiation into more specialised cancer cell types.
  • 2) CSCs have been identified and isolated in various hematological as well as solid malignancies.
  • 3) CSCs may form new tumour tissue when transferred into immunodeficient animal models and have been shown to survive and regenerate tumour tissue even after a large percentage of the tissue has been destroyed by chemotherapy.
  • 4) Standard pathways for self-renewal of normal stem cells, such as Wnt, Notch and Hedgehog signaling, are also present in CSCs and have an important role in their function.
  • 5) These cells produce higher levels of proangiogenic factors than their differentiated counterparts, and exhibit more potent proangiogenic capability.

Applications

  • 1) Stem cells provide an opportunity to study the growth and differentiation of cells into tissues.
  • 2) Stem cells can be used to produce large amounts of one cell type.
  • 3) These cells can be used to test new drugs for effectiveness and chemicals for toxicity.
  • 4) The damaging side effects of medical treatments might be repaired with stem cell treatment.
  • 5) Stem cells created from a patient's own cells by somatic cell nuclear transfer would avoid the tissue rejection problems that could be encountered in other stem cell therapeutic approaches.
  • 6) In view of their migratory properties, these cells can be used to target organs, e.g. tumours.
  • 7) These cells can be used as vehicles to carry therapeutic molecules, which they secrete spontaneously.
  • 8) Stem cell therapies, in future, may circumvent the traditional use of chemicals as therapeutic drugs.
  • 9) Stem cell transplantations are used to treat or greatly ameliorate a variety of genetic diseases ranging from inherent defects of hematopoietic cell production (thalassaemia) or function to metabolic diseases (lysosomal storage diseases) mostly affecting solid organs.
  • 10) In acute myeloid leukemia and high grade lymphoma, hematopoietic cell therapy is used as adjuvant therapy.
  • 11) Stem cell therapies are promising in the management of a variety of disease conditions like cardiovascular diseases, neurological diseases (Parkinson's disease, amyotrophic lateral sclerosis, Huntington's disease or Alzheimer's disease), Duchenne muscular dystrophy, diabetes, eye diseases and bone diseases.
  • 12) Mesenchymal stem cells have been shown to be helpful in rapid engraftment of allogeneic bone marrow transplantations. These cells were found to inhibit T cell growth, reduce graft versus host disease and be effective in a large number of steroid resistant patients.
  • 13) Transplantation of mesenchymal stem cells has been associated with early wound healing, delayed progression of human multiple system atrophy and beneficial effects in patients with hemorrhagic cystitis, pneumomediastinum and perforated colon.
  • 14) In experimental studies entire organs were generated from stem cells, and this raises the hope of using stem cells/progenitors for tissue engineering applications to generate organs for therapeutic purposes.

Challenges: Overcoming the immunological barriers, understanding tissue restrictive signals, improving methodologies for stem cell isolation, in vitro propagation and transplantation under either allogeneic or xenogeneic conditions, awakening resident stem cells, lineage tracing and long term real time follow-up of single cells, including real time assessment of gene and protein expression, may go a long way towards better utilisation of stem cell therapies for improving the quality of life of people afflicted with various disease conditions.

Sample Questions

  • 1) What is bioinformatics? Mention in brief the important sequence analysis tools.
  • 2) What are the different subfields of proteomics?
  • 3) Give a brief note on genome wide approach.
  • 4) What are the different types of stem cells?
  • 5) What are the tools employed in proteomic studies?
  • 6) What is Pharmacogenomics? Discuss with examples.