Protein Databases

    Proteins, often described as the essential building blocks of the human body, are a vital part of all living organisms. Researchers study the structure and function of proteins to learn about the metabolic pathways of cells. Wondering what the study of proteins is called? It is known as proteomics, a term derived from the word proteome, which refers to the entire set of proteins expressed by a cell, tissue, or organism. Marc Wilkins coined the term proteome in 1994 as a blend of ‘protein’ and ‘genome’.

    It might interest you to learn that the composition and nature of the proteins synthesized by the human body vary with time. A single protein can undergo noticeable modifications under conditions of stress. Describing proteomics as a complex and challenging subject is therefore fair.

    Experimenting with and studying protein structures is difficult without referring to protein databases. These databases hold information about protein structures in the form of three-dimensional atomic coordinates, angles, and unit cell dimensions. Besides providing an immensely useful experimental dataset, this information supports work such as X-ray crystallography and structure-based drug design.

    Speaking of protein databases, the PDB deserves a mention. The PDB, or ‘Protein Data Bank’, was founded in 1971 and serves as the central archive for experimentally determined protein structures. It is managed by an international consortium known as the Worldwide Protein Data Bank (wwPDB).
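
    To make the idea of stored coordinates and unit cell dimensions concrete, here is a minimal Python sketch that reads the fixed-column CRYST1 and ATOM records of the legacy PDB text format. The class and function names (UnitCell, Atom, parse_pdb_lines) and the sample lines are illustrative assumptions with made-up values, not part of any official toolkit or real entry; production work would normally rely on an established parser.

```python
import math
from dataclasses import dataclass


@dataclass
class UnitCell:
    a: float      # cell edge lengths in angstroms
    b: float
    c: float
    alpha: float  # cell angles in degrees
    beta: float
    gamma: float

    def volume(self) -> float:
        """Cell volume in cubic angstroms (general triclinic formula)."""
        ca, cb, cg = (math.cos(math.radians(v))
                      for v in (self.alpha, self.beta, self.gamma))
        return self.a * self.b * self.c * math.sqrt(
            1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg
        )


@dataclass
class Atom:
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float


def parse_pdb_lines(lines):
    """Return (unit_cell, atoms) from CRYST1 and ATOM records.

    Hypothetical helper: other record types (HETATM, ANISOU, ...) are ignored.
    """
    cell, atoms = None, []
    for line in lines:
        if line.startswith("CRYST1"):
            cell = UnitCell(
                a=float(line[6:15]), b=float(line[15:24]), c=float(line[24:33]),
                alpha=float(line[33:40]), beta=float(line[40:47]),
                gamma=float(line[47:54]),
            )
        elif line.startswith("ATOM"):
            atoms.append(Atom(
                name=line[12:16].strip(),
                residue=line[17:20].strip(),
                chain=line[21],
                residue_number=int(line[22:26]),
                x=float(line[30:38]), y=float(line[38:46]), z=float(line[46:54]),
            ))
    return cell, atoms


# Made-up sample records laid out in the PDB fixed-column style.
SAMPLE = [
    "CRYST1   52.000   58.600   61.900  90.00  90.00  90.00 P 21 21 21    8",
    "ATOM      1  N   MET A   1      27.340  24.430   2.614",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842",
]

cell, atoms = parse_pdb_lines(SAMPLE)
n, ca = atoms
bond_length = math.dist((n.x, n.y, n.z), (ca.x, ca.y, ca.z))
print(cell.volume(), bond_length)
```

    Slicing by column rather than splitting on whitespace matters here, because adjacent fields in this format can run together without a separating space.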

    Traditional Methods of Studying Proteins:

    • ELISA – The ‘Enzyme-Linked Immunosorbent Assay’ is a long-established technique used to detect sample proteins and measure them accurately.
    • MSIA – Do you know which method is often described as a ‘gold standard’ for quantitative proteomics? First brought into use by Randall Nelson, the ‘Mass Spectrometric Immunoassay’ is still practiced and combines mass spectrometry with the traditional immunoassay technique.
    • SISCAPA – A very popular method of studying proteins today is SISCAPA, or ‘Stable Isotope Standards and Capture by Anti-Peptide Antibodies.’

    Examples of popular protein databases (structure):

    • OCA – This is a browser and database for information on protein structure and function.
    • ModBase – This database stores three-dimensional protein structure models calculated by comparative modeling.
    • ProtCID – ProtCID, the Protein Common Interface Database, stores interfaces observed in experimentally determined structures of homologous proteins.

    A Library of Protein Family Cores

    We have taken structural alignments of protein families and computed average core structures for each family. The core structures can be divided into residues with low spatial variation and those with high spatial variation. Amino acids with low spatial variance occupy the same relative position in all family members. This library is useful for building models, threading, and exploratory analysis. It is also a useful mechanism for summarizing variability in NMR structures.
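
    The split into low-variance and high-variance positions can be sketched as follows. This is only an illustration under stated assumptions: the structures are taken to be already superposed and aligned position by position, each position is reduced to a single (x, y, z) point, and the function name average_core and the variance threshold are hypothetical choices, not the method used to build the library.

```python
from statistics import fmean


def average_core(structures, threshold=1.0):
    """Average aligned structures and split positions by spatial variance.

    structures: list of family members, each a list of (x, y, z) tuples,
                one tuple per aligned position (assumed pre-superposed).
    threshold:  hypothetical cut-off in squared angstroms.
    Returns (mean_coords, variances, low_positions, high_positions).
    """
    n_positions = len(structures[0])
    if any(len(s) != n_positions for s in structures):
        raise ValueError("all structures must cover the same aligned positions")

    mean_coords, variances = [], []
    for i in range(n_positions):
        points = [s[i] for s in structures]
        # Average position of this residue across the family.
        centre = tuple(fmean([p[k] for p in points]) for k in range(3))
        # Spatial variance: mean squared distance from the average position.
        variance = fmean(
            [sum((p[k] - centre[k]) ** 2 for k in range(3)) for p in points]
        )
        mean_coords.append(centre)
        variances.append(variance)

    low = [i for i, v in enumerate(variances) if v <= threshold]
    high = [i for i, v in enumerate(variances) if v > threshold]
    return mean_coords, variances, low, high


# Three made-up family members with three aligned positions each.
family = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (13.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.9, 0.0, 0.0), (16.0, 0.0, 0.0)],
]

mean_coords, variances, low, high = average_core(family)
print(low, high)
```

    In this toy family the first two positions stay close to their average and would form the core, while the third drifts by several angstroms and falls into the high-variance group.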

    Amino Acids Sequence Database (PRF/SEQDB)
    This database consists of amino acid sequences of peptides and proteins, including sequences predicted from genes. You can also search literature in which the sequence is presented. Sequences not included in EMBL, GenBank, and SwissProt are also found in PRF/SEQDB since it is constructed based on all amino acid sequences of peptides and proteins reported in the literature.

    Analysis of protein and protein-DNA interactions.

    Cytokine Family Database
    The Cytokine Family Database (dbCFC) is a collection of EST (Expressed Sequence Tag) records of cytokines deposited in the NCBI GenBank. It provides information that assigns EST records to cytokine family members, along with related data in other databases.

    MHCPEP is a database comprising over 13000 peptide sequences known to bind MHC molecules. Entries were compiled from published reports and direct submissions of experimental data. Each entry contains the peptide sequence, its MHC specificity, and, when available, experimental method, observed activity, binding affinity, source protein, anchor positions, and publication references.

    OWL is a non-redundant composite of 4 publicly-available primary sources: SWISS-PROT, PIR (1-3), GenBank (translation), and NRL-3D. SWISS-PROT is the highest priority source, all others being compared against it to eliminate identical and trivially-different sequences. The strict redundancy criteria render OWL relatively “small” and efficient in similarity searches.
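
    The priority-based merge described for OWL can be sketched roughly as below. The text does not spell out what counts as "trivially different", so this sketch assumes a stand-in rule (same length, at most one mismatching residue); the function names, identifiers, and sequences are all made up for illustration and do not reproduce OWL's actual procedure.

```python
def is_redundant(candidate, kept, max_mismatches=1):
    """Assumed redundancy rule: equal length and very few mismatches."""
    if len(candidate) != len(kept):
        return False
    mismatches = sum(1 for x, y in zip(candidate, kept) if x != y)
    return mismatches <= max_mismatches


def merge_non_redundant(sources, max_mismatches=1):
    """Merge sources listed from highest to lowest priority.

    sources: list of (source_name, [(sequence_id, sequence), ...]).
    A sequence is kept only if nothing already kept is redundant with it,
    so entries from higher-priority sources always win.
    """
    kept = []  # (source_name, sequence_id, sequence)
    for source_name, entries in sources:
        for sequence_id, sequence in entries:
            if not any(is_redundant(sequence, k[2], max_mismatches) for k in kept):
                kept.append((source_name, sequence_id, sequence))
    return kept


# Made-up identifiers and sequences, in priority order.
sources = [
    ("SWISS-PROT", [("P1", "MKTAYIAKQR")]),
    ("PIR", [("A1", "MKTAYIAKQR"),    # identical to P1, dropped
             ("A2", "MKTAYIAKQK"),    # one mismatch against P1, dropped
             ("A3", "GSHMLEDPVQ")]),  # new sequence, kept
    ("GenBank", [("G1", "GSHMLEDPVQ"),      # identical to A3, dropped
                 ("G2", "MSTNPKPQRKTK")]),  # new sequence, kept
]

composite = merge_non_redundant(sources)
for source_name, sequence_id, sequence in composite:
    print(source_name, sequence_id, sequence)
```

    The pairwise comparison is quadratic in the number of sequences, which is fine for a sketch but would need hashing or indexing at database scale.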

    PDB – The Protein Data Bank
    An international repository for the processing and distribution of 3-D macromolecular structure data primarily determined experimentally by X-ray crystallography and NMR.

    PIR – Protein Information Resource
    The Protein Information Resource (PIR), in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), maintains the PIR-International Protein Sequence Database, a comprehensive, annotated, and non-redundant protein sequence database in which entries are classified into family groups and alignments of each group are available.

    The PIR-NREF is a Non-redundant REFerence protein database designed to provide a timely and comprehensive collection of all protein sequence data, keeping pace with the genome sequencing projects and containing source attribution and minimal redundancy.

    The Protein Mutant Database (PMD) covers natural and artificial mutants, including random and site-directed ones, for all proteins except members of the globin and immunoglobulin families. The PMD is based on literature, not on proteins. Each entry in the database corresponds to one article, which may describe one, several, or many protein mutants.

    PROTONET: Automatic hierarchical classification of proteins
    ProtoNet provides a global classification of proteins into hierarchical clusters.

    PROW, Protein Reviews on the Web
    Protein Reviews On the Web is an online resource that features PROW Guides, authoritative short, structured reviews on proteins and protein families. The Guides provide approximately 20 standardized categories of information (abstract, biochemical function, ligands, references, etc.) for each protein. PROW Guides are assembled via KBTool, an underlying organizational tool that stores various information types.

    The Restriction Enzyme Database collects information about restriction enzymes and related proteins. It contains references, recognition and cleavage sites, isoschizomers, commercial availability, methylation sensitivity, and crystal and sequence data. DNA methyltransferases, homing endonucleases, nicking enzymes, specificity subunits, and control proteins are also included. Most recently, putative DNA methyltransferases and restriction enzymes are also listed.

    SPTR is a comprehensive protein sequence database that combines the high quality of annotation in SwissProt with the completeness of the weekly updated translation of protein-coding sequences from the EMBL nucleotide database (TrEMBL).

    SWISS-PROT is a curated protein sequence database that strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy, and a high level of integration with other databases.

    The WD-repeat Family of Proteins
    A library of WD-repeat-containing proteins has been constructed in which the repeats appear as multi-aligned sets. WD-repeat-containing proteins contain 4 or more copies of the WD-repeat (tryptophan-aspartate repeat), a sequence motif approximately 31 amino acids long that encodes a structural repeat.

    The Transcription Factor Database.