Introduction

Msatfinder examines sequence files (generally, small genomes) in GenBank, FASTA, EMBL and Swissprot formats (plain ASCII can also be read), and determines the number, type and position of microsatellite repeats. The software is designed to run on Unix or Linux computers. It has been reported as working on Mac OS X 10.3.9 and 10.4.7, Gentoo Linux (x86 and ppc) and Debian Linux (x86).

Installation

The msatfinder script
You may make a file containing a list of all the input files that you'd like msatfinder to run on, or supply their names on the command line (e.g. with a glob). If you use a list, then run msatfinder as:

    ./msatfinder -l name_of_list_file

To run on the sample files provided you could provide their names or glob the file suffixes. For example, to search the GenBank file provided:

    ./msatfinder *.gbk

Msatfinder can in fact be placed anywhere you like, for example in /usr/local/bin. As long as you have a configuration file in the same directory as your data then msatfinder will work. For example:

    cd /home/user/msat_data/bacteria
    /usr/local/bin/msatfinder *.gbk

Msatfinder should run without any need to configure it. If you'd like to change any of the parameters, please see the configuration section, below. There are a variety of options that can be used each time msatfinder is run, and these are described in the searching for microsatellites section.

Dependencies

There are two types of dependency required by msatfinder: Perl modules, and external programs.

Perl modules

All of the following must be installed for all the scripts to work properly:

More information on installing Perl modules can be found at CPAN. However, many of them will be installed as standard with Perl 5.8.3 (if so, man or perldoc should provide information).

External programs

The following external applications are used by msatfinder:
We use Gentoo Linux and BioLinux, as the former has packages available for all these dependencies and the latter has them already installed.

Dependency search priorities

Msatfinder looks for (external program) dependencies as follows:
This system allows you to specify alternative versions of dependencies, which will have priority over the ones on the user's PATH, if required. This may be useful if the user has more than one version of EMBOSS and running a specific version is required.

Using msatfinder

Overview

Msatfinder was designed to find perfect repeats (e.g. A(13) would be detected, but AAAAAATAAAAAA would not) in annotated (e.g. GenBank, EMBL, Swissprot) or unannotated (FASTA, raw) format files, but is also capable of finding interrupted microsatellites. It can be used to examine both protein and nucleic acid sequences. If given an annotated file it will extract information about each microsatellite and the sequence it is found in. In addition, for nucleic acid sequences it will determine whether it is possible to create PCR primers containing each repeat region found (EMBOSS/primer3 are required for this feature to work). The various features and output files may be controlled by editing the configuration file (msatfinder.rc). Notes are provided in the file on how to edit it, and more detail is given in this manual.

Input and output files

Input may be many separate sequence files, or multiple sequences in a file. To convert file formats, readseq may be useful.

Input files

Input file types are automatically detected. If the file format cannot be determined, a warning will be given. Allowed file types are:
Input sequences may be amino acid or nucleic acid. Please note that if you use ‘ASCII’ format (i.e. raw sequence in a text file) then msatfinder will treat each new line as a new sequence, and blank lines are likely to cause it to fail.

You may name your input files as you please, but the naming of the sequences within input files is particularly important for the use of msatfinder. As it is possible to use a file containing many sequences as the input to msatfinder, the script uses BioPerl to extract a unique name for each sequence. This is then used to name output files, such as the microsatellite FASTA files. It's therefore possible to have (for example) an input file called “MACO.gbk”, while the FASTA files will have names such as “NC_004117.122038.CGA.6.fasta”. In the case of FASTA files, the unique identifier extracted is the first entry following the “>”. If you have no unique identifier in your files then msatfinder will attempt to generate one, but it is best if you can supply sequence files with each sequence already labeled.

Output files

When run, msatfinder creates some directories to store the output files; this is convenient as there are often large numbers of output files. If you install a version of msatfinder on your own computer then these directory names can be changed by editing the msatfinder.rc file. An HTML file (results.html) is created in the directory in which msatfinder is run; open this in a browser to view all of msatfinder's output. The results.html file contains links to the contents of the following seven subdirectories:
The following files will be found in the directory called “Repeats/”:
Column headers for “repeats” and “sequence” files

The headers in these tables depend on the file format provided (e.g. Swissprot protein files don't have GC content). The headers are not found in the exact order shown below. These column headings are found in both the repeats and sequence files in the Repeats/ directory.
These additional column headings are found in the sequence file.
These additional column headings are found in the repeats file.
File names

Some of the output files produced by msatfinder have names such as “NC_000871.33400.AT.6.fasta”. These files are the FASTA files of microsatellite and flank sequence, primer files, MINE files and feature tables. The various sections are as follows.
Configuration

All of the parameters that can be customised by the user are found in the msatfinder.rc file. A brief description of how to set each of these is given in the file, but here we give some supplementary information. Variables in the COMMON and FINDER sections of the configuration file control the behaviour of msatfinder. The default values will be acceptable in most cases.
Setting thresholds and motif types

It is possible to search for motif types of any length. However, configuring this may not be very intuitive. The default entries in msatfinder.rc dictate that msatfinder searches for motifs of length 1 to 6 (e.g. A to ATTCCG). This, along with the thresholds for each motif length, is encoded by the line:
    motif_threshold = "1,12|2,8|3,5|4,5|5,5|6,5"

In this case, a motif of length 1 (e.g. A) must be repeated 12 times before being reported. A motif of length 2 (e.g. AC) must be repeated 8 times, and so on. If the user wishes to search for longer motifs, for example motifs 20 elements in length, then they simply need to add a pipe symbol (‘|’), followed by the motif length required and the corresponding threshold, e.g. 2. The line in msatfinder.rc would then read:
    motif_threshold = "1,12|2,8|3,5|4,5|5,5|6,5|20,2"

Likewise, if the user does not wish to search for a certain type of motif, e.g. those of length 1, then they simply delete the '1,12' and the unneeded pipe symbol (pipes should not be at the beginning or end of the string). Doing so would reduce the time needed for searching.

Searching for microsatellites

Running msatfinder is described in the README and the installation section.

N.B. You must have an msatfinder.rc config file present in the same directory as your data for msatfinder to run (see the configuration section). If one is not present, the program will stop and give an error message suggesting that you place a copy of msatfinder.rc in the data directory. You may still use the help option if you don't have an msatfinder.rc file present (see below).

Msatfinder has a lot of output files, and users may sometimes want to back these up or suppress them. Running msatfinder with the --help or -h option will print out a list of the options available. The full list of options is:
Microsatellite searching engines

There are three “engines” currently implemented for microsatellite searching. These all operate in slightly different ways, and if you don't find the default useful then the others may prove effective. They are as follows.
The program is reasonably fast. For example, searching a large, microsatellite-rich bacterial genome such as Xylella fastidiosa (NC_002488, 2.68 Mb) with default thresholds takes under two minutes. The exact time will vary depending on your computing power.

Interrupted microsatellites

“Interrupted” microsatellites include several possible features. One is that a microsatellite tract consists of two motifs of the same class (e.g. dinucleotides) adjacent to each other (example 1). Another is that a long microsatellite tract may have one or more point mutations in it, making it appear to be several shorter tracts (example 2). Sometimes this latter category may include a pseudo-frameshift (example 3), thus appearing to be two different motifs.

Examples
Once microsatellites have been detected by whichever engine has been selected, the complete list of msats is scanned to determine the relative positions of each microsatellite within each genome. Microsatellites are joined together when they meet two criteria:
Each “cluster” of microsatellites thus found is combined into a single interrupted microsatellite, using the usual nomenclature. The motif type will be altered to contain all the motif types in the order they are found. So, example 1 (above) would become ac-tg.16, and example 2 would be a-a.21. In the latter case, the “-” is kept in to mark that this is an interrupted microsatellite.

Help for using msatfinder on-line

The on-line version of msatfinder is available for those who don't want to do a local install. It offers the same features as the downloadable version, but due to server limitations can only accept sequences up to 70Mb. Explanations of each of the on-line options are shown below.

Motif selection

Allows a selection of the motif types, mono (e.g. (A)12) to hexa (e.g. (ATAACA)20), for which one wishes to search. By default all the types are selected, but if you are not interested in mononucleotides, for example, then switching them off will save time and result in a smaller results file to download.

Threshold selection

For each motif type, there is a minimum number of repeat units that msatfinder will look for. For example, the default setting is 12 units for mononucleotides and 5 for everything else, so that an (AT)5 (i.e. ATATATATAT) will be detected but an (AT)4 will not. You can set these as low as 3, but be prepared for a long wait and a lot of output. The default values are recommended for most files, but if you find nothing of use you may wish to lower the thresholds a little to see if any smaller microsatellites are present.

The interface states that there is a minimum of three repeat units, but there are some combinations of thresholds that may cause msatfinder to fail. One example is if you have a threshold for monos that is lower than the number of boxes ticked under “choose the microsatellite motifs to search for”.
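The way a set of thresholds filters perfect repeats can be illustrated with a short Python sketch. It reads a threshold string in the same form as the motif_threshold entry in msatfinder.rc and reports every perfect repeat that meets the minimum for its motif length. The function names and the regular-expression approach are assumptions made for this illustration; this is not one of msatfinder's own engines, and it makes no attempt to reproduce their behaviour in edge cases.

```python
import re


def parse_motif_threshold(spec):
    """Turn a string such as "1,12|2,8|3,5" into {motif_length: min_repeats}."""
    thresholds = {}
    for entry in spec.split("|"):
        length, minimum = entry.split(",")
        thresholds[int(length)] = int(minimum)
    return thresholds


def find_perfect_repeats(sequence, thresholds):
    """Return (start, motif, repeat_count) for each perfect repeat found.

    start is a 0-based offset into the sequence. Motifs that are themselves
    built from a shorter unit (e.g. "AA" or "ATAT") are skipped, so a run of
    A's is only ever reported as a mononucleotide repeat.
    """
    hits = []
    for length, minimum in sorted(thresholds.items()):
        # One motif of the given length, followed by at least minimum-1 copies.
        pattern = re.compile(r"(.{%d})\1{%d,}" % (length, minimum - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            if motif in (motif + motif)[1:-1]:
                continue  # motif is a repeat of a shorter unit
            hits.append((match.start(), motif, len(match.group(0)) // length))
    return hits


# The on-line defaults: 12 units for mononucleotides, 5 for everything else.
online_defaults = parse_motif_threshold("1,12|2,5|3,5|4,5|5,5|6,5")
print(find_perfect_repeats("GG" + "AT" * 5 + "CC", online_defaults))
print(find_perfect_repeats("GG" + "AT" * 4 + "CC", online_defaults))
```

The first call reports the (AT)5 tract and the second reports nothing, matching the behaviour described above. Lowering the values produces more hits, and some combinations of low thresholds and many enabled motif lengths are the ones that cause msatfinder to fail.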
The reason some threshold combinations fail is that the software may have trouble discriminating between some types of microsatellites. For example:

    ccccctccccctccccctccccct

could be (ccccct)4 but could also be 4x (c)5, as the computer sees it. To prevent this from happening, msatfinder will not run in cases where such confusion could occur. If you find that it fails when you are looking for very small microsatellites, try re-running with the larger microsatellites turned off, e.g. look for monos and dis only. As well as being set manually, the thresholds can also be calculated using Markov chain analysis or the Poly program.

Engines for finding perfect repeats

Engines are available for finding perfect and imperfect repeats, as well as low-complexity regions. Three engines, which were all created by the msatfinder team, fall into the first class:
Advanced options

The default settings are probably best, but you may like to experiment with these. These options fall into four groups. The first option is the search engine to be used, if the user is searching for imperfect repeats or low-complexity regions. These engines are all based on standard repeat-finding software. They are:
Download options

After running your analysis, you will be able to view the results on-line and/or download them as a compressed archive in tar.gz (Unix) or zip (Windows) formats. Select your preferred format before running the analysis.

Find interrupted msats

If this box is checked (it's unchecked by default) then msatfinder will process the microsatellites found to determine if any could be joined into larger microsatellites, according to certain rules. Typically, about 10% of microsatellites found could be so joined.

Randomise sequence

If this box is checked (it's unchecked by default) then each input sequence will be randomised. The algorithm moves each individual letter in the sequence, in turn, to a random position. All further analyses are carried out on this randomised sequence. The randomised sequence is stored in Fasta/sequenceID.random.fasta. Where multiple randomisations are carried out, they are done serially, not in parallel.

Thresholds by Markov chain analysis

The repeat thresholds mark the minimum number of motif repeats required before a microsatellite is reported and are used to eliminate repeats which might be observed by chance. Any microsatellite found to have fewer repeats than this threshold is discarded. The threshold values can be provided manually or can be calculated from the sequence using Markov chain analysis. By default msatfinder employs a fifth-order Markov chain which uses the observed frequencies of codon pairs present in the sequence and provides an accurate measure of expected thresholds, particularly for long (>1Mb) sequences. For shorter sequences, where we expect to observe fewer instances of each permutation of paired codons, a lower-order Markov chain should be employed to retain accuracy. In a Markov chain analysis, a file containing all non-standard letters in the sequence is generated and stored as Counts/sequenceID.nonstandard.count.

N.B. Markov chain analysis can only be carried out on nucleotide sequences.
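As a rough illustration of the counting that an order-k Markov chain relies on, the Python sketch below tallies overlapping words of length k+1 (six letters, i.e. a codon pair, for the default fifth-order chain) and separately counts any non-standard letters. The function names, the choice to skip words containing non-standard letters, and the probability helper are all assumptions made for this example. It is not msatfinder's own code, and it does not show how expected thresholds are derived from the counts.

```python
from collections import Counter

STANDARD = set("ACGT")


def count_words(sequence, order=5):
    """Count overlapping words of length order+1, plus non-standard letters.

    Words containing anything other than A, C, G or T are left out of the
    word counts (an assumption of this sketch).
    """
    k = order + 1
    seq = sequence.upper()
    nonstandard = Counter(ch for ch in seq if ch not in STANDARD)
    words = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if STANDARD.issuperset(word):
            words[word] += 1
    return words, nonstandard


def transition_probability(words, context, base):
    """Estimate P(base | context) from the word counts.

    context must have length equal to the chain order; returns 0.0 if the
    context was never observed.
    """
    total = sum(words[context + b] for b in "ACGT")
    return words[context + base] / total if total else 0.0


# A first-order chain on a toy sequence; real use would need a long sequence.
words, odd_letters = count_words("ACGTACGTACGTN", order=1)
print(words)
print(odd_letters)
print(transition_probability(words, "A", "C"))
```

With short sequences most six-letter words are seen only a handful of times or not at all, which is why a lower-order chain is suggested above for them.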
There are three options available to the user:
Thresholds by Poly

The third option for selecting microsatellite thresholds, along with manual choice and Markov chains, is to use the well-known program Poly, written by Jeff Bizzaro. The original publication is at http://www.biomedcentral.com/1471-2105/4/22. To speed up the calculations the algorithm has been ported from Python to C. The user has two options:
Upload file

Choose a file from your local system to run msatfinder on. Files under 7Mb will be analysed and the results returned immediately to the user. Larger files require that the user submit their email address. Once the analysis is complete, a link to the results will be emailed to the user. Results are stored for 36 hours on the server before being deleted. If you have very large sequences, or many of them, we recommend a local installation and are happy to offer assistance, or run your sequence for you. See file formats for allowable file types.

Paste sequence

See file formats for input file formats. If you paste anything in here, it will be used instead of any uploaded file data, so make sure that this box is cleared if you'd like to upload a file. Once you've submitted your sequence, click "search" and wait for a few moments (an animated picture will be shown whilst msatfinder runs). A brief summary will be displayed and a link to the downloadable file and the viewable results will be shown. These links will become inactive after a couple of hours, so please download them immediately if you'd like to keep the results. If your submitted file is larger than 7Mb, you must provide your email address. A link to the results of the analysis will then be sent to you. No record of the email address is kept past the point that the email is sent out.

Output

The output will be viewable on-line for 36 hours, and can be downloaded as a tar or zip file within that time, so we recommend that you bookmark the link to your output. The various files that are included in the download directory are described below. The most important is "results.html", which includes links to all the other output files so that you may view them in a browser. Your input sequence and a configuration file will also be saved for future reference, in case of any problem with the results.
The output files produced by the on-line version of msatfinder are identical to those produced by the local version. A detailed description of these is available in the output file section. Please note that MINE files will not be produced by default; they must be enabled under “advanced options”.

What if it didn't work?

Occasionally, msatfinder will fail to run on some input files. There are various reasons that this might happen, which are described below and also mentioned in the bugs section.
If you encounter a problem that does not seem to fit into these categories, please contact us and we will endeavour to fix the problem or analyse your data for you as soon as possible.

Other information

Bug reporting/known bugs

Though we are using this software successfully, there's a small chance of bugs turning up somewhere. Should you find any, please contact us (details below) and we will squash them. If you're using msatfinder, there are a few things that you ought to bear in mind.
We welcome suggestions for new features or other improvements.

File conversion/handling methods

For converting files one could use Seqret (from EMBOSS), or use readseq.

Coding style

The braces in this code are written thus:
This is simply because it's easier to read this way. However, if you disagree, you may wish to try this.