To enable the study of orphan genes in a comparative genomic context, we have performed a comprehensive analysis of 330 complete bacterial genomes. The OrphanMine contains the results of all-against-all blastp searches of 972526 predicted genes. The database can be explored using the PHP web interface, which provides the ability to search for 'real' orphans based on a number of factors (similarity scores, sequence length, GC content and percent low complexity sequence). It also provides a method for ranking predicted proteins according to how likely they are to be 'real'. The OrphanMine gives the microbial community a means of investigating sets of taxonomically restricted genes and thereby gaining deeper insight into the properties and functions of orphan genes.
2. I can't view the 'Help' pages. Why?
The OrphanMine 'Help' pages are specific to the section of the database you are currently browsing. Pop-up windows make it possible to view help relevant to the page you are viewing whilst remaining at that page. However, some browsers block pop-up windows. By changing this setting in your web browser you should be able to view the 'Help' pages. Alternatively, all the help pages are concatenated into one page here.
3. How do I query OrphanMine using the OrphanMine SQL interface?
The OrphanMine SQL page accepts SQL SELECT queries in the same way as if you were interacting with a MySQL database on a command line. By typing 'show tables' and clicking on the submit button, you will be able to view the tables making up the database, for example 'genome3'. If you want to know which columns are present in a particular table, type 'describe tablename', e.g. 'describe genome3'. It is possible to download the output from your query as a tab delimited text file. This can then be loaded into Excel if required.
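As an alternative to Excel, the tab delimited download can be read with a short script. The following is a minimal Python sketch, assuming the file starts with a header row of column names; the sample text and its column values are hypothetical and only stand in for a real download.

```python
import csv
import io


def read_query_output(handle):
    """Parse tab delimited query output into a list of dicts keyed by column name.

    Assumption: the first line of the file holds the column names.
    """
    return list(csv.DictReader(handle, delimiter="\t"))


# Hypothetical sample standing in for a downloaded result file.
sample = "Orf_name\tLength\tgc\nNC_000907orf0001\t312\t38.5\n"
rows = read_query_output(io.StringIO(sample))

for row in rows:
    # All values are read as strings; convert them as needed.
    print(row["Orf_name"], int(row["Length"]), float(row["gc"]))
```

To read a real download, pass an open file instead, for example open("results.txt", newline="") where results.txt is whatever name you saved the file under.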
The tables that contain the majority of data that is likely to be of use to the casual user are shown below:
Genome3: Contains meta-data describing each genome (data set independent).
Dataset3: Contains data set dependent genomic data such as orphan number.
Orf3: Contains meta-data describing each predicted protein in all bacterial genomes contained in the database.
Orphan3: Contains data linking predicted proteins with orphan data sets.
Example queries include:
SELECT Species, Division, Domain, Taxonomy, NC_number, Genome_size, Orfs, Orphans, truncate (Percent_orphans,2), Small, Large, truncate (Mean_size,2), True_paralogues, truncate (Percent_lc,2),gc_content, uniprot
FROM genome3, dataset3, join_dataset3
WHERE genome3.Genome_id = 1 and
join_dataset3.Dataset_number = 1 and
join_dataset3.dataset_id = dataset3.dataset_id and
genome3.Genome_id = dataset3.Genome_id
The above query obtains all the genomic meta-data for Haemophilus influenzae when looking at data set 1.
SELECT Orf_name, True_para_orphan, NC_number, Length, Description, Orf_id, truncate (Low_complexity,2),gc, uniprot
FROM orphan3, genome3, orf3
WHERE orf3.Genome_id = 2 and
orf3.Genome_id = genome3.Genome_id and
orf3.orf = orphan3.orf_name and
orphan3.Dataset_number = 3
This query will obtain all the meta-data describing the predicted orphans in Mycoplasma genitalium when using data set 3. The final example, shown below, obtains all meta-data for the given predicted protein.
SELECT * from orf3 where orf = 'NC_000907orf0001'
The database schema is shown below:
4. How do I create my own ad hoc dataset of predicted genes?
By using the filters provided on the customise page (www.genomics.ceh.ac.uk/orphan_mine/customise.php), you can limit the genes that are contained within your dataset. By default the page searches all predicted proteins in all bacterial proteomes: if no e-value is selected (i.e. you have entered no number in the e-value text box), your search will be run on all predicted proteins. If you want to search a restricted dataset, enter an e-value cut-off; the search will then involve only those predicted proteins that have no matches at the given threshold. You can further limit your dataset by entering parameters for GC content, % low complexity, length (actual and percentile) and best match genome (determined by e-value score on the BLAST report).
The results of the search are shown as the number of predicted proteins and as bars down the right hand side; these bars indicate the percentage of genes in the genome that fit your search requirements. By clicking on 'View Predicted Proteins', you can view all the predicted proteins fitting your criteria.
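The filtering described above can be illustrated with a minimal Python sketch. It is not the code behind the customise page: the PredictedProtein record, the function names and the parameter names are hypothetical, and it assumes that a "match at the given threshold" means a best BLAST e-value less than or equal to the cut-off.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PredictedProtein:
    name: str
    length: int                   # length of the predicted protein
    gc_content: float             # percent GC of the coding sequence
    low_complexity: float         # percent low complexity sequence
    best_evalue: Optional[float]  # smallest BLAST e-value, None if no hit


def passes_filters(protein, evalue_cutoff=None, min_length=None,
                   max_gc=None, max_low_complexity=None):
    """Return True if the protein satisfies every filter that was supplied."""
    # No cut-off entered: every predicted protein is searched.
    if evalue_cutoff is not None and protein.best_evalue is not None:
        # Assumption: a match at the threshold is an e-value <= the cut-off.
        if protein.best_evalue <= evalue_cutoff:
            return False
    if min_length is not None and protein.length < min_length:
        return False
    if max_gc is not None and protein.gc_content > max_gc:
        return False
    if (max_low_complexity is not None
            and protein.low_complexity > max_low_complexity):
        return False
    return True


def percent_matching(proteins, **filters):
    """Percentage of a genome's predicted proteins that pass the filters."""
    if not proteins:
        return 0.0
    kept = sum(1 for p in proteins if passes_filters(p, **filters))
    return 100.0 * kept / len(proteins)
```

The value returned by percent_matching corresponds to what the bars on the results page represent: the share of a genome's genes that fit the search.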
5. How do I know what the 4 different pre-computed datasets represent?
By clicking on the name of the dataset, a pop-up window will appear (if your browser is configured to accept pop-ups). This window will explain how the selected orphan dataset was generated.
6. How do I create and obtain a list of lineage specific genes?
Use the tools provided on the TRG page (www.genomics.ceh.ac.uk/orphan_mine/restriction_v2.php). To do so, please follow the instructions below:
a) Select a reference genome. This is the genome that you will be searching for TRGs.
b) Select the genomes in which you would like to find genes that are shared with the reference genome. To do this, simply check the relevant boxes in the 'Genes shared by' column. (optional, see (c) below)
c) Select the genomes in which you do not want the genes to be found. To do this, check the relevant boxes in the 'Genes not present in' column. (optional, see (b) above)
d) If there are any genomes that you wish to remove from the analysis, make sure that neither box is selected.
e) Press the 'Find Restricted Genes' button. Please be patient whilst waiting for the results.
Examples
(i) You want to find the genes in Mycoplasma pneumoniae that are restricted to the sequenced Mycoplasma genomes. Firstly, select Mycoplasma pneumoniae as your reference genome from the drop-down menu. Scroll down the page until you reach the Mycoplasmas (in the Firmicutes section), check the box in the left column ('Genes shared by') for each of them and deselect their right hand boxes. Ensure the right hand boxes ('Genes not present in') are selected for all other genomes. Press the 'Find Restricted Genes' button. After a short delay a page should be loaded displaying the TRGs found in Mycoplasma pneumoniae down to the Orphan level.
(ii) You want to find the genes in Mycoplasma pneumoniae that are found in the other Mycoplasmas, regardless of whether they are found elsewhere or not. Firstly, press the 'Remove All' button from the top of the 'Genes not present in' column. This will prevent the analysis restricting your search. Select your chosen reference genome from the drop down menu, in this case Mycoplasma pneumoniae, and then select the left hand box for all Mycoplasmas. Press the 'Find Restricted Genes' button. This will list all the genes that are shared with other members of the Mycoplasmas.
(iii) You want to find the genes in Mycoplasma pneumoniae that are not found in Mycoplasma genitalium, regardless of whether they are found elsewhere. Firstly, press the 'Remove All' button from the top of the 'Genes not present in' column. Select your chosen reference genome from the drop down menu, in this case Mycoplasma pneumoniae, and then select the right hand box for Mycoplasma genitalium. Press the 'Find Restricted Genes' button. This will list all the genes that are found in Mycoplasma pneumoniae but not in Mycoplasma genitalium.
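The exclusion step used in example (iii) can be sketched in a few lines of Python. This is an illustration only, not the code behind the TRG page: the function name and the hits_by_gene structure (each reference gene mapped to the set of genomes containing a similar sequence) are hypothetical, and the 'Genes shared by' column is deliberately left out because its exact rule is not spelled out here.

```python
def genes_absent_from(hits_by_gene, excluded_genomes):
    """Return reference genes with no similar sequence in any excluded genome.

    hits_by_gene: dict mapping each reference gene name to the set of
                  genomes in which a similar sequence was found.
    excluded_genomes: genomes ticked in the 'Genes not present in' column.
    """
    excluded = set(excluded_genomes)
    return sorted(
        gene
        for gene, genomes in hits_by_gene.items()
        if not (set(genomes) & excluded)
    )


# Hypothetical hit table for three reference genes.
hits = {
    "geneA": {"Mycoplasma genitalium", "Escherichia coli"},
    "geneB": set(),                      # no hits anywhere: an orphan
    "geneC": {"Escherichia coli"},
}

print(genes_absent_from(hits, ["Mycoplasma genitalium"]))
```

With an empty exclusion list every reference gene is returned, which mirrors pressing 'Remove All' so that the search is not restricted.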
7. What method is used for ranking the predicted proteins in my dataset?
Five criteria were chosen for use in scoring predicted proteins according to how likely they are to be a real gene. These criteria are:
(i) Length
(ii) Low Complexity
(iii) Neighbourhood Distribution
(iv) Average Amino Acid Cost
(v) Difference in GC content of sequence and genome
For each genome and for each criterion, the distribution of non-orphans was generated and percentiles for that distribution calculated. Each predicted protein was given a score from 0-100 depending on where it fell within the distribution. Longer proteins and those with higher values of neighbourhood distribution score more highly. Neighbourhood distribution is calculated by determining how many genomes have sequences sharing similarity to the 5 genes flanking the subject gene in both directions. A higher value of neighbourhood distribution means that the genes neighbouring the subject gene are found in many other genomes. In the case of percent low complexity and average amino acid cost, the lower the value, the higher the score. GC composition was calculated by subtracting the GC content of the sequence from the GC content of the genome. The sequences scoring highly in this criterion were those closest to the mean value. The scores from the criteria were summed and divided by the number of criteria used. Finally, the value was divided by 100 to give a score between 0 and 1, where 0 is the worst possible candidate for a real gene and 1 is a perfect candidate.
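The arithmetic of the scoring can be illustrated with a minimal Python sketch. It is not the OrphanMine implementation: the function names are hypothetical, and the percentile is assumed here to be the percentage of non-orphan values less than or equal to the protein's value, which may differ from the definition actually used.

```python
from bisect import bisect_right


def percentile_score(value, reference, higher_is_better=True):
    """Score 0-100 from the position of value in the non-orphan distribution.

    Assumption: percentile = percentage of reference values <= value.
    """
    ordered = sorted(reference)
    pct = 100.0 * bisect_right(ordered, value) / len(ordered)
    return pct if higher_is_better else 100.0 - pct


def gc_closeness_score(diff, reference_diffs):
    """Score the GC difference so that values nearest the mean score highest.

    Assumption: closeness is ranked by absolute deviation from the mean
    of the non-orphan GC differences.
    """
    mean = sum(reference_diffs) / len(reference_diffs)
    deviations = [abs(d - mean) for d in reference_diffs]
    return percentile_score(abs(diff - mean), deviations,
                            higher_is_better=False)


def combined_score(scores):
    """Average the per-criterion scores and rescale from 0-100 to 0-1."""
    return sum(scores) / len(scores) / 100.0


# Hypothetical non-orphan distributions for one genome.
length_score = percentile_score(300, [100, 200, 300, 400])
low_complexity_score = percentile_score(10, [10, 20, 30, 40],
                                        higher_is_better=False)
print(combined_score([length_score, low_complexity_score]))
```

Because combined_score divides by the number of scores it is given, the same function covers ranking by any combination of the criteria.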
OrphanMine allows you to rank the predicted proteins in your dataset according to any combination of the 5 criteria. Once ranked you can download the ranked protein list in GFF3 format or tab delimited format for further analysis. The idea of ranking the proteins is to provide users with a method of prioritising genes for experimental characterisation.
8. If I have any problems associated with OrphanMine, whom should I contact?