Sequence Alignment for Phylogenetic Analysis: Difference between revisions

Latest revision as of 13:16, 18 April 2019

Locate Sequences and Generate FASTA File

The easiest way to find sequences is to start with a seed sequence, then run BLAST searches restricted to RefSeq and the species of interest.
To find a seed sequence, start with NCBI Gene, find the first RefSeq mRNA (its accession should start with NM), then click on it and find the corresponding protein (its accession should start with NP).
Paste the protein sequence into your FASTA file (see next section) and name it accordingly.
Paste that sequence or its NP accession into NCBI Protein BLAST.
Set the parameters to:
- Database: Reference Proteins (refseq_protein)
- Organism: Start with mouse (Mus musculus) or human (Homo sapiens). Depending on your goal, consider adding zebrafish (Danio rerio), Drosophila melanogaster, chicken (Gallus gallus) and Caenorhabditis elegans.
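
The search settings above can also be written down as a small parameter set, which makes it easier to record exactly what was searched. The following is a minimal sketch, assuming the parameter names of the NCBI BLAST URL API (CMD, PROGRAM, DATABASE, QUERY, ENTREZ_QUERY). The accession NP_000000.0 is a hypothetical placeholder, and the helper names are made up for illustration. The code only builds the URL and does not submit a search.

```python
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def organism_filter(species):
    """Join species names into an Entrez-style organism filter."""
    return " OR ".join(f'"{name}"[Organism]' for name in species)


def blastp_params(query, species):
    """Collect the settings from the list above (hypothetical helper)."""
    return {
        "CMD": "Put",                  # submit a new search
        "PROGRAM": "blastp",           # protein BLAST
        "DATABASE": "refseq_protein",  # Reference Proteins
        "QUERY": query,                # sequence or NP accession
        "ENTREZ_QUERY": organism_filter(species),
    }


# NP_000000.0 is a placeholder, not a real accession.
params = blastp_params("NP_000000.0", ["Mus musculus", "Homo sapiens"])
print(BLAST_URL + "?" + urlencode(params))
```

Keeping the species list in one place makes it simple to add zebrafish, chicken or other organisms later without retyping the filter.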

Generating a FASTA File

FASTA format is described here and here. Each sequence must start with a >SEQUENCENAME line, followed by a return and then the sequence (in this case, the protein sequence). An example of a FASTA file would be:

>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

Save the sequences in a plain-text editor such as Notepad, Notepad++ or Sublime Text (not Word) as a <FILENAME>.fasta file.
Sequence names cannot have spaces. Generally it's better to use a name such as mm_Gdf15-NM_004864.4, where mm indicates mouse, Gdf15 is the gene name and NM indicates a RefSeq mRNA. If there are multiple mRNAs for the gene, name them
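
The naming convention and the FASTA rules above can be checked before the file is uploaded for alignment. The following is a minimal sketch. The function names sequence_name and parse_fasta are hypothetical helpers, not part of any library, and the short sequences are truncated from the example above purely for illustration.

```python
def sequence_name(species_code, gene, accession):
    """Build a name such as mm_Gdf15-NM_004864.4."""
    return f"{species_code}_{gene}-{accession}"


def parse_fasta(text):
    """Parse FASTA text into {name: sequence}, rejecting names with spaces."""
    records = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            name = line[1:]
            if not name or any(ch.isspace() for ch in name):
                raise ValueError(f"invalid sequence name: {name!r}")
            if name in records:
                raise ValueError(f"duplicate sequence name: {name!r}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first > line")
        else:
            records[name].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {n: "".join(parts) for n, parts in records.items()}


example = ">SEQUENCE_1\nMTEITAAMVK\nELRESTGAGM\n>SEQUENCE_2\nSATVSEINSE\n"
records = parse_fasta(example)
print(sequence_name("mm", "Gdf15", "NM_004864.4"))
print({name: len(seq) for name, seq in records.items()})
```

A name that contains a space raises a ValueError, so the problem is caught before it reaches the alignment step.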

Create Multiple Sequence Alignment using CLUSTAL Omega

CLUSTAL Omega is available at https://www.ebi.ac.uk/Tools/msa/clustalo/
Select the NEXUS output format to import into MrBayes, or the PHYLIP format to import into PhyloBayes.
Generate phylogenetic trees with PhyloBayes or MrBayes (see Using Mr Bayes to For Phlyogenetic Analysis).
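
Before moving on to tree building, it can help to confirm that every aligned sequence has the same length. The following is a minimal sketch, assuming the alignment has already been read into a dictionary of name to aligned sequence. The writer produces a simple sequential PHYLIP-style layout (a header line with the number of taxa and sites, then one "name sequence" line per taxon). Whether a given program accepts this exact layout is an assumption that should be checked against its manual. The names and residues are made up for illustration.

```python
def to_phylip(alignment):
    """Return sequential PHYLIP-style text for {name: aligned_sequence}."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    n_sites = lengths.pop()
    lines = [f"{len(alignment)} {n_sites}"]  # header: taxa, sites
    for name, seq in alignment.items():
        lines.append(f"{name}  {seq}")
    return "\n".join(lines) + "\n"


# Made-up two-sequence alignment; "-" marks a gap.
alignment = {"seq_a": "MPG-AL", "seq_b": "MPGQEL"}
print(to_phylip(alignment), end="")
```

Selecting PHYLIP directly on the CLUSTAL Omega page avoids this conversion. The sketch is mainly useful as a length check.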

PhyloBayes Analysis

Mark in your notes the software version used.
The PhyloBayes manual can be found here.
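
One way to keep the version note next to the exact command that was run is to generate both together. The following is a minimal sketch, assuming the pb command form from the PhyloBayes manual (pb -d <alignment> <options> <chain name>). The -cat -gtr model options, the file name alignment.phy, the chain name chain1 and the version string 4.1 are illustrative assumptions, not recommendations from this page.

```python
import shlex
from datetime import date


def pb_command(alignment_path, chain_name):
    """Build a PhyloBayes command line (model options are an assumption)."""
    return ["pb", "-d", alignment_path, "-cat", "-gtr", chain_name]


def notes_entry(version, command, run_date):
    """Format one line for the lab notes: date, version and command."""
    return f"{run_date.isoformat()} PhyloBayes {version}: {shlex.join(command)}"


command = pb_command("alignment.phy", "chain1")
print(notes_entry("4.1", command, date(2019, 4, 18)))
```

The version string is supplied by hand here; it should be whatever version is actually installed.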

@@ Line 1: / Line 1: @@
 == Locate Sequences and Generate FASTA File ==
+* The easiest way to find sequences is to start with a seed sequence, then run BLAST searches restricted to RefSeq and the species of interest.
+* To find a seed sequence, start with NCBI Gene, find the first RefSeq mRNA (its accession should start with NM), then click on it and find the corresponding protein (its accession should start with NP).
+* Paste the protein sequence into your FASTA file (see next section) and name it accordingly.
+* Paste that sequence or its NP accession into [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome NCBI Protein BLAST].
+* Set the parameters to:
+** Database: Reference Proteins (refseq_protein)
+** Organism: Start with mouse (''Mus musculus'') or human (''Homo sapiens''). Depending on your goal, consider adding zebrafish (''Danio rerio''), ''Drosophila melanogaster'', chicken (''Gallus gallus'') and ''Caenorhabditis elegans''.
+=== Generating a FASTA File===
+* FASTA format is described [https://zhanglab.ccmb.med.umich.edu/FASTA/ here] and [https://en.wikipedia.org/wiki/FASTA_format here]. Each sequence must start with a >SEQUENCENAME line, followed by a return and then the sequence (in this case, the protein sequence). An example of a FASTA file would be:
+<code>
+>SEQUENCE_1
+MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
+LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
+IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
+MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
+>SEQUENCE_2
+SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
+ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
+</code>
+* Save the sequences in a plain-text editor such as Notepad, [https://notepad-plus-plus.org/ Notepad++] or [https://www.sublimetext.com/ Sublime Text] (not Word) as a <FILENAME>.fasta file.
+* Sequence names cannot have spaces. Generally it's better to use a name such as '''mm_Gdf15-NM_004864.4''', where mm indicates mouse, Gdf15 is the gene name and NM indicates a [https://www.ncbi.nlm.nih.gov/refseq/ RefSeq mRNA]. If there are multiple mRNAs for the gene, name them
 == Create Multiple Sequence Alignment using CLUSTAL Omega ==
 * CLUSTAL Omega is available at https://www.ebi.ac.uk/Tools/msa/clustalo/
-* Select output format NEXUS to import into Mr Bayes
+* Select the NEXUS output format to import into MrBayes, or the PHYLIP format to import into PhyloBayes
-* Generate phlogenetic trees with Mr. Bayes [[Using Mr Bayes to For Phlyogenetic Analysis]]
+* Generate phylogenetic trees with [http://megasun.bch.umontreal.ca/People/lartillot/www/download.html PhyloBayes] or MrBayes [[Using Mr Bayes to For Phlyogenetic Analysis]].
+=== PhyloBayes Analysis ===
+* Mark in your notes the software version used.
+* The PhyloBayes manual can be found [http://megasun.bch.umontreal.ca/People/lartillot/www/phylobayes4.1.pdf here].