Searching the Database
In the future, all of the information held in the Annotation Alignment database will be searchable.
Search fields will include:
- feature annotations (GFF format)
- pattern searches on the alignment
- original data file (FASTA format)
- metadata describing the taxa included in the database
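As an illustration of what a pattern search on the alignment might involve, here is a minimal Python sketch. It is not the database's implementation. The function names (`read_fasta`, `search_alignment`), the sequence names and the example pattern are all hypothetical. It assumes the alignment is stored as gapped FASTA records and that a pattern should match the residues of a sequence regardless of gaps, with hits reported as 1-based alignment columns.

```python
import re


def read_fasta(text):
    """Parse FASTA text into a dict mapping record name to sequence string."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the record name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def search_alignment(records, pattern, gap_chars="-."):
    """Find regex matches in each ungapped sequence.

    Returns (name, first_column, last_column) tuples, where the columns
    are 1-based positions in the gapped alignment.
    """
    regex = re.compile(pattern)
    hits = []
    for name, aligned in records.items():
        # Alignment column of every non-gap residue, in sequence order.
        cols = [i for i, ch in enumerate(aligned, start=1) if ch not in gap_chars]
        ungapped = "".join(ch for ch in aligned if ch not in gap_chars)
        for m in regex.finditer(ungapped):
            if m.end() > m.start():  # skip empty matches
                hits.append((name, cols[m.start()], cols[m.end() - 1]))
    return hits


# Hypothetical two-sequence alignment used only for illustration.
alignment_text = """>virusA
AT--GCCA
>virusB
ATGGGC-A
"""
hits = search_alignment(read_fasta(alignment_text), "GC+")
```

Reporting hits as alignment columns rather than sequence positions lets matches from different taxa be compared directly.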
GenomeMine
Metadata about viral genomes will be stored in the
GenomeMine database.
At present the GenomeMine database holds information on complete flavivirus
genomes, extracted or calculated from GenBank genome annotations (view GenomeMine report for flaviviruses).
Data Curation
In the future, we intend to curate more information about these viruses including details of host specificity, geographic distribution and degree of pathogenicity.
Sequence Ontology
Annotated features will conform to the Sequence Ontology used by the GFF3 convention wherever possible.
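For readers unfamiliar with GFF3, the sketch below shows how one feature line could be read in Python. A GFF3 feature line has nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes), and the type column is where a Sequence Ontology term such as `CDS` appears. The class name, function name and example record are hypothetical, and the parser handles only single feature lines. It does not handle directives, comments or multi-valued attributes.

```python
from typing import Dict, NamedTuple
from urllib.parse import unquote


class Gff3Feature(NamedTuple):
    seqid: str
    source: str
    type: str        # Sequence Ontology term or accession
    start: int       # 1-based, inclusive
    end: int         # 1-based, inclusive
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> Gff3Feature:
    """Parse a single GFF3 feature line into a Gff3Feature."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    attrs: Dict[str, str] = {}
    if cols[8] != ".":
        for pair in cols[8].split(";"):
            if pair:
                key, _, value = pair.partition("=")
                # GFF3 percent-encodes reserved characters in attributes.
                attrs[unquote(key)] = unquote(value)
    return Gff3Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                       cols[5], cols[6], cols[7], attrs)


# Hypothetical record used only for illustration.
example = "seq1\texample\tCDS\t97\t474\t.\t+\t0\tID=cds1;Name=capsid%20protein"
feature = parse_gff3_line(example)
```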
Life Science Identifier (LSID) Compliant
Datasets will be LSID compliant so that they can be easily exchanged and harvested by other projects.
Related Software for Alignment
The resource we intend to build will accept, in the first instance, a
finished alignment and a GFF-formatted file of annotations.
Below are listed a variety of software programs for generating, editing, and
manipulating alignments and links to the GFF file format definition.
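Combining a finished alignment with GFF annotation raises a practical question. GFF coordinates refer to positions in the ungapped sequence, whereas an alignment contains gap characters. The Python sketch below shows one way the two coordinate systems could be reconciled. It is an assumption about how such a mapping might work, not a description of the planned resource, and the function names and example sequence are hypothetical.

```python
def sequence_to_column_map(aligned_seq, gap_chars="-."):
    """Return a list whose entry i is the 1-based alignment column
    holding residue i+1 of the ungapped sequence."""
    return [col for col, ch in enumerate(aligned_seq, start=1)
            if ch not in gap_chars]


def feature_to_columns(aligned_seq, start, end):
    """Convert a 1-based, inclusive sequence interval (as used in GFF)
    into the 1-based alignment columns of its first and last residues."""
    columns = sequence_to_column_map(aligned_seq)
    if not (1 <= start <= end <= len(columns)):
        raise ValueError("feature coordinates fall outside the sequence")
    return columns[start - 1], columns[end - 1]


# Hypothetical aligned sequence: residues A, T, G, C, C, A sit in
# alignment columns 1, 2, 5, 6, 7 and 9.
aligned = "AT--GCC-A"
span = feature_to_columns(aligned, 3, 5)
```

Here a feature covering residues 3 to 5 of the sequence maps to alignment columns 5 to 7.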
Alignment Software
- clustalW : Clustal W is a general purpose multiple alignment program for DNA or proteins. The sensitivity of the commonly used progressive multiple sequence alignment method has been greatly improved for the alignment of divergent protein sequences. It is designed to be run interactively, or to assign options via the command line.
- Strap : An editor for multiple sequence alignments of proteins. The computer program STRAP supports the analysis of hundreds of proteins and integrates amino acid sequence, secondary structure, 3D-structure and genomic- and mRNA-sequence and residue annotation.
- PileUp : PileUp is a multiple alignment program from the GCG program suite. PileUp does progressive pairwise comparisons on every possible pair of sequences to find the best alignment, then repeats the process until the alignment is complete.
- t-coffee : T-Coffee is a multiple sequence alignment package. Given a set of sequences (Proteins or DNA), T-Coffee generates a multiple sequence alignment. Version 2.00 and higher can mix sequences and structures.
- aln : Global alignment of a pair of DNA or protein sequences, or groups of sequences. Supports spliced alignment when one of the sequences is a genomic DNA sequence.
- swg : Local alignment of a pair of DNA or protein sequences by Smith-Waterman-Gotoh algorithm. Currently spliced alignment is not supported. Profile version is very slow.
- prrn : Global multiple alignment of a set of protein or DNA sequences by doubly nested iterative refinement method.
- SAM : The SAM program is a collection of flexible software tools for creating, refining, and using linear hidden Markov models for biological sequence analysis. It can take an input sequence, search a database (by default, NCBI's nr non-redundant protein database) and return a multiple sequence alignment of the similar sequences. To use it: /usr/localapps/msa/sam/bin/target99 -seed inputsequencefile -out output
- HMMER : HMMER also uses hidden Markov models for sensitive database searching. HMMER is available as part of the GCG program suite. A parallel version is available on Biowulf.
- SAGA : SAGA uses genetic algorithms for multiple sequence alignment.
- POA : POA uses partial-order graphs to create multiple sequence alignments.
- MAFFT : MAFFT includes two progressive alignment programs and two iterative improvement programs.
- PIMA : pima performs a multi-sequence alignment of a set of (presumably related) sequences using an extension of our covering pattern construction algorithm (Smith and Smith 1990, 1992).
- MAP Multiple Sequence Alignment : The MAP program computes a multiple global alignment of sequences using an iterative pairwise method. The underlying algorithm for aligning two sequences computes a best overlapping alignment between two sequences without penalizing terminal gaps. In addition, long internal gaps in short sequences are not heavily penalized. So MAP is good at producing an alignment where there are long terminal or internal gaps in some sequences. The MAP program is designed in a space-efficient manner, so long sequences can be aligned.
- SIM : SIM is a program which finds a user-defined number of best non-intersecting alignments between two protein sequences or within a sequence. Once the alignment is computed, you can view it using LALNVIEW, a graphical viewer program for pairwise alignments.
- DIALIGN : a software program for multiple alignment developed by Burkhard Morgenstern et al. While standard alignment methods rely on comparing single residues and imposing gap penalties, DIALIGN constructs pairwise and multiple alignments by comparing whole segments of the sequences. No gap penalty is used. This approach is especially efficient where sequences are not globally related but share only local similarities, as is the case with genomic DNA and with many protein families.
- AMPS : AMPS is a suite of programs for multiple sequence alignment. The programs include options to incorporate non-sequence information such as secondary structures. AMPS also implements flexible pattern matching and database scanning options. AMPS includes functions for running randomisations to estimate the significance of sequence similarities.
- MASIA 2.0 : MASIA (Multiple Aligned Sequences Investigation and Analysis) is a program with a GUI to search for consistent patterns in multiple aligned sequences. Predictions of secondary structures and inside/outside properties of residues at each position in a set of aligned sequences are based on generalized rules for globular proteins, which are derived from observations of known 3D-structures of proteins. The secondary and tertiary structure of a protein is related to chemical characteristics of the individual amino acid residues, but a clear picture of the secondary structure may not be apparent for one protein sequence alone. Comparing many aligned, related sequences can reveal patterns of sequence conservation that indicate the location of residues essential for the function, folding or solubility of the protein.
- AMAS : AMAS is a program to analyse multiple alignments of protein sequences. It allows the identification of functional residues by comparison of sub-groups of sequences arranged on a tree.
- Multalin : Multalin creates a multiple sequence alignment from a group of related sequences using progressive pairwise alignments.
- ToPLign: Toolbox for Protein Alignment : Computing, analysis and visualization of pairwise, multiple, threading, and parametric alignments.
- AllAll : Calculate Phylogenetic Trees, Alignments, dSplits, Probabilistic ancestral sequence, {Kabat-Wu, probability, maximum likelihood} variation index, prediction of Surface/Interior/Active site, prediction of parse regions.
- PredictProtein : An automatic service for protein database searches and the prediction of aspects of protein structure. Database searches: generation of multiple sequence alignments (MaxHom), detection of functional motifs (PROSITE), detection of composition-bias (SEG), detection of protein domains (PRODOM), fold recognition by prediction-based threading (TOPITS).
- GeneBee : GeneBee Multiple alignment: pairwise motifs to multiple motifs to 'supermotifs' to construction of multiple alignment.
- Match-Box : The Match-Box software proposes protein sequence multiple alignment tools based on strict statistical criteria. The method circumvents the gap penalty requirement: in the Match-Box method, gaps are the result of the alignment and not a governing parameter of the matching procedure. A reliability score is provided below each aligned position. The Match-Box program is particularly suitable for finding and aligning conserved structural motifs, in particular in the protein core.
- Protein Structure Prediction : This is a hidden Markov model (HMM) protein structure prediction server. The server has used UCSC's SAM-T98 method to create a library of HMMs, one per PDB structure (about 2500 HMMs total). You can search this database of HMMs with a protein sequence. Available functions: Compare Sequence Against Protein Model Library, Protein Query Against A Database, Tune Up a Multiple Alignment, Compare Two Alignments, Build SAM-T98 Alignment, Generate Weights for a Multiple Alignment, Build SAM-T98 HMM.
- The Gibbs Motif Sampler : The Gibbs Motif Sampler will allow you to identify motifs, conserved regions, in DNA or protein sequences. The goal is to take a given set of amino acid or nucleotide sequences and determine common motif elements within them. It is possible that more than one characteristic motif type exists in which case multiple gene regulation sites are probable. One approach known as site sampling assumes that each sequence contains exactly one motif element for each motif type. The alternative Bernoulli motif sampler assumes that each sequence can contain zero or more motif elements of each motif type.
- The MEME System : MEME is a tool for discovering motifs in a group of related DNA or protein sequences. A motif is a sequence pattern that occurs repeatedly in a group of related protein or DNA sequences. MEME represents motifs as position-dependent letter-probability matrices which describe the probability of each possible letter at each position in the pattern. Individual MEME motifs do not contain gaps. Patterns with variable-length gaps are split by MEME into two or more separate motifs. MEME takes as input a group of DNA or protein sequences (the training set) and outputs as many motifs as requested. MEME uses statistical modeling techniques to automatically choose the best width and description for each motif.
- Meta-MEME : Meta-MEME is a software toolkit for building and using motif-based hidden Markov models of DNA and proteins. The input to Meta-MEME is a set of similar protein sequences, as well as a set of motif models discovered by MEME. Meta-MEME combines these models into a single, motif-based hidden Markov model and uses this model to search a sequence database for homologs.
- DbClustal : The server will Blast-search the latest Swissprot+SPtrEMBL database with your sequence, extract anchors between your query and the top Blast hits with Ballast, and build a multiple alignment of the top hits and your query. Results are returned as an MSF file. This method was published by Thompson J.D. et al. in Nucleic Acids Research (Vol. 28, pp. 2919-2926).
- MASS : MASS is an efficient method for multiple alignment of protein structures and detection of structural motifs. Exploiting the secondary structure representation aids in filtering out noisy results and in making the method highly efficient and robust. MASS disregards the sequence order of the secondary structure elements. Thus, it can find non-sequential and even non-topological structural motifs. An important novel feature of MASS is subset alignment detection: It does not require that all the input molecules be aligned. Rather, MASS is capable of detecting structural motifs shared only by a subset of the molecules.
- Bioedit : BioEdit is a mouse-driven, easy-to-use sequence alignment editor and sequence analysis program.
- AltAVisT : Alternative Alignment Visualization Tool - Version 1.0. AltAVisT is a WWW-based software program that compares two alternative multiple alignments of a given sequence set. Regions where both alignments coincide are color-coded to visualize the local agreement between the two alignments and to identify those regions of the alignments that can be considered to be most reliable.
- MUSCLE : MUSCLE is public domain multiple alignment software for protein and nucleotide sequences. MUSCLE stands for multiple sequence comparison by log-expectation.
- MAST : MAST is a tool for searching biological sequence databases for sequences that contain one or more of a group of known motifs.
- CINEMA : CINEMA is a Colour INteractive Editor for Multiple Alignments. The program allows visualisation and manipulation of both protein and DNA sequences. In Version 2.0, the functionality has been extended to include split-screen editing, and applet customisation through the use of 'pluglets'. This file describes the range of functions of the current CINEMA applet and how they can be employed to align your sequences.
- Joy : Joy is an analysis and formatting program for multiple protein sequence alignments or single protein structures. It was developed to display three-dimensional (3D) structural information in a sequence alignment and help to understand the conservation of amino acids in their specific local environments.
- bioperl : A collection of Perl modules for processing data for the life sciences; a project made up of biologists, bioinformaticians and computer scientists; an open source toolkit of building blocks for life sciences applications; a collaborative online community supported by the Open Bioinformatics Foundation (O|B|F), http://www.open-bio.org/
- JESAM : The JESAM tools consist of many parts, but the two main CORBA components are: first, a sequence alignment engine with a database, and second, a clustering process with persistent storage. A production environment using JESAM also relies on many scripts to gather new sequences, run updates, check that processes are running, start after reboot etc. Web access to results requires some HTML pages, a web server and access to Interoperable Object References (IORs) with which to contact CORBA objects, either directly or through a naming service.
- ALION : A multiple sequence alignment tool
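Several of the programs listed above build on dynamic-programming pairwise alignment. As a rough illustration of the idea behind local alignment, the Python sketch below computes a Smith-Waterman score. It uses a simple linear gap penalty rather than the affine (Gotoh) gaps used by swg, returns only the score without a traceback, and its scoring values are arbitrary assumptions. It is not the code of any listed tool.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic-programming matrix, so memory grows with len(b).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally with a match or mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: that is what makes the
            # alignment local rather than global.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two sequences that share only the local segment "ACG".
score = smith_waterman_score("TTACGTT", "GGACGGG")
```

With the default scores, the shared three-residue segment gives a score of 6, while two sequences with no residues in common score 0.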
Alignment Databases
- PASS2 : a semi-automated database of Protein Alignments Organised as Structural Superfamilies
- Histone Database : Primary reference: The Histone Database: a comprehensive resource for histones and histone fold-containing proteins
- HOMSTRAD : a collection of protein families, clustered on the basis of sequence and structural similarity. The database is unique in that the protein family sequence alignments have been specially annotated using the program, JOY, to highlight a wide range of structural features. Such data are useful for identifying key structurally conserved residues within the families.
- EMBL-ALIGN : The EBI accepts submissions of nucleotide sequence alignment data (from phylogenetic and population analysis etc.) via Webin-Align, the EBI's WWW-based submission tool. As an additional service to the scientific community, amino acid alignments are also accepted.
- SUPFAM : This database consists of colonies of potentially related homologous protein domains, with and without three-dimensional structural information, forming superfamilies.
- ALIGN : ALIGN is a compendium of sequence alignments: it is a companion resource to PRINTS. For each fingerprint, there is a corresponding alignment in NBRF format, the root name of which is the relevant PRINTS code. The database codes used in the alignments reflect the primary source from which they came: the codes may change between database releases, so alignments derived from early versions of OWL will include original rather than current codes.
- HSSP : Structural homology can be inferred from the level of sequence similarity. The threshold of sequence similarity sufficient for structural homology depends strongly on the length of the alignment. Here, we first quantify the relation between sequence similarity, structure similarity and alignment length by an exhaustive survey of alignments between proteins of known structure and report a homology threshold curve as a function of alignment length. We then produce a database of homology-derived secondary structure of proteins (HSSP) by aligning to each protein of known structure all sequences deemed homologous on the basis of the threshold curve. For each known protein structure, the derived database contains the aligned sequences, secondary structure, sequence variability and sequence profile.
References
Gritsun, T. S., Tuplin, A. K., and Gould, E. A. (2005). Origin, evolution and function of flavivirus RNA in untranslated and coding regions: implications for virus transmission. In Flaviviridae: pathogenesis, molecular biology and genetics, M. Kalitzky, and P. Borowski eds. (Norwich, UK, Horizon Scientific Press), pp. In press.