aaGapsToDNA.pl [-t geneticCodeTbl] alignedAASeq dnaSeq
When doing multiple sequence alignment, you may want to adjust the alignment using the amino acid sequences and then insert the corresponding gaps into the DNA sequences. This script takes the names of two FASTA files as input: the first file (alignedAASeq) contains aligned amino acid sequences (gaps are indicated by '-'), and the second file (dnaSeq) contains the corresponding DNA sequences WITHOUT any gaps. The resulting aligned DNA sequences are printed to STDOUT, so you can capture the output with '>':
aaGapsToDNA.pl alignedAASeq.fasta dnaSeq.fasta > alignedDNA.fasta
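The gap-threading idea described above can be sketched in a few lines of Python. This is only an illustration, not the script's actual Perl code: the function name insert_gaps is hypothetical, and the sketch assumes that every non-gap residue corresponds to exactly one codon and does not check the translation against a genetic code table.

```python
def insert_gaps(aligned_aa: str, dna: str, gap: str = "-") -> str:
    """Thread an ungapped DNA sequence through an aligned amino acid sequence.

    Assumption: each non-gap residue consumes exactly one codon (3 nt) of dna.
    """
    pieces = []
    pos = 0  # index of the next unused nucleotide in dna
    for residue in aligned_aa:
        if residue == gap:
            pieces.append(gap * 3)          # one amino acid gap = three DNA gaps
        else:
            pieces.append(dna[pos:pos + 3])  # copy the codon for this residue
            pos += 3
    return "".join(pieces)


print(insert_gaps("M-KL", "ATGAAACTG"))  # ATG---AAACTG
```

A real tool would also need to decide what to do with leftover nucleotides (for example a trailing stop codon) and with residues that do not match their codon.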
If you need another translation table, create a file (say mtCodeTbl.txt), and give the name of this file as the argument to option -t:
aaGapsToDNA.pl -t mtCodeTbl.txt alignedAASeq.fasta dnaSeq.fasta > alignedDNA.fasta
The format of the translation table file follows the NCBI genetic code tables. For example, you can put the following in a file for the Invertebrate Mitochondrial Code:
   AAs = FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSSSSVVVVAAAADDEEGGGG
Starts = ---M----------------------------MMMM---------------M------------
 Base1 = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
 Base2 = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
 Base3 = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
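To make the layout of this table concrete: the five rows are read column by column, so the i-th characters of Base1, Base2 and Base3 form a codon and the i-th character of AAs is its amino acid. The Python sketch below shows one way to turn the table into a codon lookup. It is an illustration only; parse_code_table is a hypothetical name, and how aaGapsToDNA.pl itself parses the file is not shown here.

```python
TABLE = """\
AAs = FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSSSSVVVVAAAADDEEGGGG
Starts = ---M----------------------------MMMM---------------M------------
Base1 = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
Base2 = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
Base3 = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
"""


def parse_code_table(text: str) -> dict:
    """Build a codon -> amino acid dictionary from an NCBI-style table."""
    rows = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            rows[key.strip()] = value.strip()
    # Column i of Base1/Base2/Base3 spells the codon for column i of AAs.
    codons = zip(rows["Base1"], rows["Base2"], rows["Base3"])
    return {"".join(codon): aa for codon, aa in zip(codons, rows["AAs"])}


code = parse_code_table(TABLE)
print(code["TGA"], code["AGA"], code["ATA"])  # W S M
```

In this table TGA codes for W, AGA for S and ATA for M, which is where the Invertebrate Mitochondrial Code differs from the standard code.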
fasta2shortName.pl input.fasta
fastaConcat.pl seq1.fasta seq2.fasta seq3.fasta ...
This script takes several FASTA files as input and makes a concatenated sequence for each sample. Each file should contain the same number of samples in the same order. You must specify at least two filenames.
seq1.fasta:
>seq1
AAAAAA
>seq2
TTT
>seq3
GGGG
seq2.fasta:
>seq1
TT
>seq2
GG
>seq3
CC
Then
fastaConcat.pl seq1.fasta seq2.fasta > out.fasta
creates out.fasta:
>seq1
AAAAAATT
>seq2
TTTGG
>seq3
GGGGCC
Requires Bioperl
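The concatenation step can be sketched in Python as follows. This is not the Bioperl-based implementation; read_fasta and concat_alignments are hypothetical helpers, and the sketch assumes, as the description requires, that all inputs list the same samples in the same order (names are taken from the first input).

```python
def read_fasta(text: str) -> list:
    """Parse FASTA text into a list of (name, sequence) tuples, keeping order."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            records.append([line[1:].strip(), ""])
        elif line and records:
            records[-1][1] += line
    return [(name, seq) for name, seq in records]


def concat_alignments(*alignments: list) -> list:
    """Join the i-th sequence of every input, using names from the first input."""
    first = alignments[0]
    if any(len(a) != len(first) for a in alignments):
        raise ValueError("all inputs must contain the same number of samples")
    return [
        (first[i][0], "".join(a[i][1] for a in alignments))
        for i in range(len(first))
    ]


seq1 = read_fasta(">seq1\nAAAAAA\n>seq2\nTTT\n>seq3\nGGGG\n")
seq2 = read_fasta(">seq1\nTT\n>seq2\nGG\n>seq3\nCC\n")
for name, seq in concat_alignments(seq1, seq2):
    print(f">{name}\n{seq}")
```

Run on the two example files above, this prints the same three records as out.fasta.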
fastaMissingChar.pl [-m char] [fastaFileName1 ...]
Prints the names of sequences that contain characters other than ATGC-. If -m is specified, the ambiguous characters are replaced with the specified character; e.g. -m '?' replaces each ambiguous character with '?'. If multiple files are given, the sequences in all files are merged. If no argument is given, STDIN is used as the input.
Example:
in.fasta:
>seq1
ATGCNGGXX
>seq2
AAAA--AA
Then the following command
fastaMissingChar.pl in.fasta
prints the names of the sequences with nonstandard characters, along with the nonstandard characters:
seq1 NXX
With -m, the command
fastaMissingChar.pl -m '?' in.fasta > clean.fasta
creates the following clean.fasta:
>seq1
ATGC?GG??
>seq2
AAAA--AA
Requires Bioperl.
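The two modes of this script (report, and mask with -m) can be illustrated with a short Python sketch. The function names are hypothetical, and the sketch assumes upper-case sequences; whether the actual script treats lower-case bases as standard is not stated here.

```python
import re

# Anything that is not A, T, G, C or the gap character '-'.
NONSTANDARD = re.compile(r"[^ATGC-]")


def report_nonstandard(records: list) -> list:
    """Return (name, nonstandard characters) for sequences that contain any."""
    return [
        (name, "".join(NONSTANDARD.findall(seq)))
        for name, seq in records
        if NONSTANDARD.search(seq)
    ]


def mask_nonstandard(records: list, char: str) -> list:
    """Replace every nonstandard character with char (like the -m option)."""
    return [(name, NONSTANDARD.sub(lambda m: char, seq)) for name, seq in records]


records = [("seq1", "ATGCNGGXX"), ("seq2", "AAAA--AA")]
print(report_nonstandard(records))     # [('seq1', 'NXX')]
print(mask_nonstandard(records, "?"))  # [('seq1', 'ATGC?GG??'), ('seq2', 'AAAA--AA')]
```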
fastaSortByName.pl [-r] [-g] [fastaFileName1 ...]
Sorts FASTA sequences alphabetically by name. If multiple files are given, the sequences in all files are merged before sorting. If no argument is given, STDIN is used as the input.
-r option will sort in reverse order.
-g option will remove all gap characters ('-') from the sequence data.
Requires Bioperl
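As a rough illustration of what the -r and -g options do, here is a minimal Python sketch. sort_by_name is a hypothetical function, and it assumes a plain case-sensitive string comparison; the exact collation used by the script may differ.

```python
def sort_by_name(records: list, reverse: bool = False, remove_gaps: bool = False) -> list:
    """Sort (name, sequence) tuples by name; optionally strip '-' from sequences."""
    if remove_gaps:
        records = [(name, seq.replace("-", "")) for name, seq in records]
    return sorted(records, key=lambda rec: rec[0], reverse=reverse)


records = [("tiger", "TT-T"), ("cat", "A-AA"), ("lion", "GGG")]
print(sort_by_name(records))
# [('cat', 'A-AA'), ('lion', 'GGG'), ('tiger', 'TT-T')]
print(sort_by_name(records, reverse=True, remove_gaps=True))
# [('tiger', 'TTT'), ('lion', 'GGG'), ('cat', 'AAA')]
```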
selectSeqs.pl [-hv] -f seqNamesFile [fastaFile]
or
selectSeqs.pl [-hv] -m 'pattern' [fastaFile]
With the first method (-f), the file seqNamesFile contains the list of names of the sequences you want to select, one name per line. Comments can be added to this file with "#".
With the second method (-m), you give a regular expression, and the sequences whose names match the pattern are selected.
To exclude the selected sequences instead, add the -v option (with either -f or -m; the combination of -v with -f has not been tested extensively).
If name of input file (fastaFile) is not given, STDIN is used.
Example:
input.fasta:
>cat
AAA
>tiger
TTT
>lion
GGG
>panther
CCC
Select the sequences whose names start with 't' or 'l':
selectSeqs.pl -m '^[tl]' input.fasta > selected.fasta
Then selected.fasta contains:
>tiger
TTT
>lion
GGG
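The pattern-based selection and its inversion with -v can be sketched in Python as below. select_by_pattern is a hypothetical name, and Python's re module is used in place of Perl regular expressions, so less common pattern syntax may behave differently.

```python
import re


def select_by_pattern(records: list, pattern: str, invert: bool = False) -> list:
    """Keep records whose name matches pattern; with invert=True keep the rest."""
    regex = re.compile(pattern)
    return [
        (name, seq)
        for name, seq in records
        if bool(regex.search(name)) != invert
    ]


records = [("cat", "AAA"), ("tiger", "TTT"), ("lion", "GGG"), ("panther", "CCC")]
print(select_by_pattern(records, r"^[tl]"))
# [('tiger', 'TTT'), ('lion', 'GGG')]
print(select_by_pattern(records, r"^[tl]", invert=True))
# [('cat', 'AAA'), ('panther', 'CCC')]
```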
seqOrient.pl [-r refSeqNumber] inputfile.fasta
-r integer: the specified sequence is used as the reference
This program reads the sequence file, which may contain sequences in the opposite orientation (reverse complement), and outputs a FASTA file in which all sequences are in the same orientation. By default, it uses the 1st sequence as the reference. It makes the reverse complement of each sequence with revseq of EMBOSS and checks whether the complement aligns better with the reference sequence. If so, the complement is used. The FASTA file with the corrected orientation is printed to STDOUT. For the pairwise alignment, matcher of EMBOSS is used. The scores of the alignments are printed to STDERR.
Example:
inputfile.fasta:
>seq1
ATGCGAAGTCTTGTG
>seq2
CACTAGACTCAT
>seq3
ATGCTAGTG
>seq4
CTCAAGACTTCGCAT
Using the orientation of the third sequence (seq3) as the reference, it will make sure that all sequences are in the same orientation:
seqOrient.pl -r 3 inputfile.fasta > oriented.fasta
Then the output file (oriented.fasta) contains:
>seq3
ATGCTAGTG
>seq1
ATGCGAAGTCTTGTG
>seq2
ATGAGTCTAGTG
>seq4
ATGCGAAGTCTTGAG
Additionally, it will print the following messages on the screen (STDERR):
seq3 - seq1: score reg=21, comp=11
seq3 - seq2: score reg=20, comp=30, complement of seq2 is used
seq3 - seq4: score reg=11, comp=21, complement of seq4 is used
This says that the alignment score between seq3 (the reference) and the uncomplemented seq1 ("reg"ular) is 21, which is better than the score ("comp"lemented = 11) of the alignment between seq3 and the reverse-complemented seq1, so there is no need to reverse-complement seq1. For seq2 and seq4, however, the reverse complements have higher scores, so the reverse complements are used.
Requires: Bioperl, EMBOSS
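The decision rule in the example can be sketched in Python as follows. This does not call EMBOSS: the alignment scores are passed in as plain numbers (the values from the STDERR messages in the example), and both function names are hypothetical. Keeping the original orientation on a tie is an assumption, since the tie-breaking behaviour is not described.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case, unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def choose_orientation(seq: str, reg_score: float, comp_score: float) -> str:
    """Use the reverse complement only if it aligns better with the reference."""
    return reverse_complement(seq) if comp_score > reg_score else seq


# seq1: reg=21, comp=11, so the original orientation is kept.
print(choose_orientation("ATGCGAAGTCTTGTG", 21, 11))  # ATGCGAAGTCTTGTG
# seq2: reg=20, comp=30, so the reverse complement is used.
print(choose_orientation("CACTAGACTCAT", 20, 30))     # ATGAGTCTAGTG
```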
This script takes an aligned DNA or amino acid sequence file in FASTA format (as an argument or from STDIN) and calculates the number of singletons observed in the data. When DNA sequences are given, it assumes that the first nucleotide of each sequence in the file corresponds to the 1st position of a codon, and the number of singletons is calculated for each of the three codon positions. The three positions are meaningless when amino acid sequences are given (just use the total number for AA).
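A per-codon-position count of this kind could look like the Python sketch below. The definition of a singleton used here is an assumption, since the text does not spell it out: a column is counted when it is variable and at least one character state occurs in exactly one sequence. Gaps are treated like any other character, which may not match the script. The function name is hypothetical.

```python
from collections import Counter


def singletons_by_codon_position(seqs: list) -> list:
    """Count singleton columns at codon positions 1, 2 and 3.

    Assumed definition: a column is a singleton site if it has more than one
    state and at least one state is present in exactly one sequence.
    """
    counts = [0, 0, 0]
    for i, column in enumerate(zip(*seqs)):
        states = Counter(column)
        if len(states) > 1 and 1 in states.values():
            counts[i % 3] += 1  # column 0 is assumed to be codon position 1
    return counts


aligned = ["ATGAAA", "ATGCAG", "ATGCAG", "ATGCAG"]
print(singletons_by_codon_position(aligned))  # [1, 0, 1]
```

For amino acid input, only the sum of the three numbers would be meaningful.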
uniqHaplo.pl [-a] input.fasta
This program reads the sequence file (input.fasta) and extracts unique haplotypes. By default, the program assumes that it is a DNA sequence file, but with option -a the input file can contain amino acid sequences. With DNA as the input, the FASTA file may contain sequences in the opposite orientation (reverse complement).
It identifies identical alleles by going through all pairwise comparisons. When the shorter of the two sequences is identical to a substring of the longer, they are considered to be the same allele. Gaps ('-') are removed before the comparison. The longest sequence of each allele is printed to STDOUT in FASTA format. Information about which alleles are identical and the differences in their lengths is printed to STDERR. When sequences in the opposite orientation are included (in the case of DNA sequences), it makes the reverse complement of the sequences with revseq of EMBOSS before the comparison is made. Requires Bioperl and EMBOSS.
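The substring rule for collapsing alleles can be sketched in Python as below. This is a simplified illustration under stated assumptions: unique_haplotypes is a hypothetical name, reverse complements are not considered, and a short sequence that is a substring of several distinct longer sequences is simply absorbed by the first one encountered.

```python
def unique_haplotypes(records: list) -> list:
    """Keep the longest representative of each allele.

    Two sequences count as the same allele when, after removing gaps,
    the shorter one is a substring of the longer one.
    """
    cleaned = [(name, seq.replace("-", "")) for name, seq in records]
    # Longest first, so every kept sequence is the longest of its allele.
    cleaned.sort(key=lambda rec: len(rec[1]), reverse=True)
    kept = []
    for name, seq in cleaned:
        if not any(seq in representative for _, representative in kept):
            kept.append((name, seq))
    return kept


records = [("a", "ATG-CA"), ("b", "ATGCATT"), ("c", "GGG")]
print(unique_haplotypes(records))  # [('b', 'ATGCATT'), ('c', 'GGG')]
```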
fasta2mega.pl
fasta2nexus.pl
fasta2paml.pl
fasta2phylip.pl
fasta2sites.pl
phylip2fasta.pl
phylip2paml.pl
Example: convert FASTA to PAML format
fasta2paml.pl seq.fasta > seq.paml