Difference between revisions of "Sequence Alignment for Phylogenetic Analysis"

From Bridges Lab Protocols
(Wrote initial page)
 
(Added details about BLAST search)
 
(2 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
== Locate Sequences and Generate FASTA File ==
 
== Locate Sequences and Generate FASTA File ==
 +
* The easiest way to find sequences is to start with a seed sequence, then run BLAST searches restricted to RefSeq and the species of interest.
 +
* To find a seed sequence, start with NCBI Gene and find the first RefSeq mRNA (its accession should start with NM_), then click on it and find the protein (its accession should start with NP_).
 +
* Paste that protein sequence into your FASTA file (see next section) and name it accordingly.
 +
* Paste that sequence or its NP_ accession into [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome NCBI Protein BLAST].
 +
* Set the parameters to:
 +
** Database: Reference Proteins (refseq_protein)
 +
** Organism: Start with mouse (''Mus musculus'') or human (''Homo sapiens''). Depending on your goal, consider adding zebrafish (''Danio rerio''), ''Drosophila melanogaster'', chicken (''Gallus gallus'') and ''Caenorhabditis elegans''.
  
 +
=== Generating a FASTA File ===
 +
* FASTA format is described [https://zhanglab.ccmb.med.umich.edu/FASTA/ here] and [https://en.wikipedia.org/wiki/FASTA_format here]. Each sequence must start with a header line of the form >SEQUENCENAME, followed by a line break and then the sequence (in this case, the protein sequence). An example of a FASTA file would be:
 +
 +
<code>
 +
>SEQUENCE_1
 +
 +
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
 +
 +
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
 +
 +
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
 +
 +
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
 +
 +
>SEQUENCE_2
 +
 +
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
 +
 +
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
 +
</code>
 +
 +
* Save the sequences in a plain-text editor such as Notepad, [https://notepad-plus-plus.org/ Notepad++] or [https://www.sublimetext.com/ Sublime Text] (not Word) as a <FILENAME>.fasta file.
 +
* Sequence names cannot contain spaces.  Generally it's better to use a name such as '''mm_Gdf15-NM_004864.4''', where mm indicates mouse, Gdf15 is the gene name and NM indicates a [https://www.ncbi.nlm.nih.gov/refseq/ RefSeq mRNA].  If there are multiple mRNAs for the gene, name them
  
 
== Create Multiple Sequence Alignment using CLUSTAL Omega ==
 
== Create Multiple Sequence Alignment using CLUSTAL Omega ==
  
 
* CLUSTAL Omega is available at https://www.ebi.ac.uk/Tools/msa/clustalo/
 
* CLUSTAL Omega is available at https://www.ebi.ac.uk/Tools/msa/clustalo/
* Select output format NEXUS to import into Mr Bayes
+
* Select output format NEXUS to import into Mr Bayes or PHYLIP format to import into PhyloBayes
* Generate phylogenetic trees with Mr. Bayes [[Using Mr Bayes to For Phlyogenetic Analysis]]
+
* Generate phylogenetic trees with [http://megasun.bch.umontreal.ca/People/lartillot/www/download.html PhyloBayes] or Mr. Bayes [[Using Mr Bayes to For Phlyogenetic Analysis]]
 +
 
 +
=== PhyloBayes Analysis ===
 +
 
 +
* Mark in your notes the software version used.
 +
* The PhyloBayes manual can be found [http://megasun.bch.umontreal.ca/People/lartillot/www/phylobayes4.1.pdf here].

Latest revision as of 13:16, 18 April 2019

Locate Sequences and Generate FASTA File

  • The easiest way to find sequences is to start with a seed sequence, then run BLAST searches restricted to RefSeq and the species of interest.
  • To find a seed sequence, start with NCBI Gene and find the first RefSeq mRNA (its accession should start with NM_), then click on it and find the protein (its accession should start with NP_).
  • Paste that protein sequence into your FASTA file (see next section) and name it accordingly.
  • Paste that sequence or its NP_ accession into NCBI Protein BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome).
  • Set the parameters to:
    • Database: Reference Proteins (refseq_protein)
    • Organism: Start with mouse (Mus musculus) or human (Homo sapiens). Depending on your goal, consider adding zebrafish (Danio rerio), Drosophila melanogaster, chicken (Gallus gallus) and Caenorhabditis elegans.
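
If you already know the NP_ accession, the copy-and-paste step can also be scripted. The following is a minimal sketch, assuming the NCBI E-utilities efetch endpoint is used to download the protein record as FASTA text. The function names and the accession NP_000000.0 are placeholders for illustration, not part of this protocol.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities efetch endpoint (returns records from Entrez databases).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_protein_url(accession):
    """Build an efetch URL that requests one protein record as FASTA text."""
    params = {
        "db": "protein",      # Entrez protein database
        "id": accession,      # e.g. a RefSeq NP_ accession
        "rettype": "fasta",   # ask for FASTA format
        "retmode": "text",    # plain text rather than XML
    }
    return EFETCH_URL + "?" + urlencode(params)


def fetch_protein_fasta(accession):
    """Download the FASTA record for an accession (needs network access)."""
    with urlopen(efetch_protein_url(accession)) as response:
        return response.read().decode("utf-8")


# NP_000000.0 is a placeholder; substitute the accession found via NCBI Gene.
example_url = efetch_protein_url("NP_000000.0")
```

The header line returned by NCBI is long and contains spaces, so it would still need to be renamed as described in the next section.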

Generating a FASTA File

  • FASTA format is described at https://zhanglab.ccmb.med.umich.edu/FASTA/ and https://en.wikipedia.org/wiki/FASTA_format. Each sequence must start with a header line of the form >SEQUENCENAME, followed by a line break and then the sequence (in this case, the protein sequence). An example of a FASTA file would be:

>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

  • Save the sequences in a plain-text editor such as Notepad, Notepad++ or Sublime Text (not Word) as a <FILENAME>.fasta file.
  • Sequence names cannot contain spaces. Generally it's better to use a name such as mm_Gdf15-NM_004864.4, where mm indicates mouse, Gdf15 is the gene name and NM indicates a RefSeq mRNA. If there are multiple mRNAs for the gene, name them
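
The naming and formatting rules above can be checked before the file is saved. The sketch below is one possible way to do that in Python; the helper names make_name and format_record, the 60-character line width and the set of accepted residue letters are assumptions chosen for illustration.

```python
import re

# Standard amino acid letters plus common ambiguity codes, stop (*) and gap (-).
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYBXZJUO*-")


def make_name(species, gene, accession):
    """Combine species prefix, gene name and accession, e.g. mm_Gdf15-NM_004864.4."""
    return "{}_{}-{}".format(species, gene, accession)


def format_record(name, sequence, width=60):
    """Return one FASTA record as text, refusing names that contain whitespace."""
    if not name or re.search(r"\s", name):
        raise ValueError("sequence name must be non-empty and contain no spaces")
    # Drop line breaks and spaces picked up while copying, then normalise case.
    residues = "".join(sequence.split()).upper()
    unexpected = set(residues) - VALID_RESIDUES
    if unexpected:
        raise ValueError("unexpected characters: " + "".join(sorted(unexpected)))
    # Wrap the sequence into fixed-width lines with no blank lines in between.
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


name = make_name("mm", "Gdf15", "NM_004864.4")
record = format_record(name, "MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG")
```

Writing the concatenated records to <FILENAME>.fasta with an ordinary text-mode file handle gives the same result as saving from a plain-text editor.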

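Create Multiple Sequence Alignment using CLUSTAL Omega

  • CLUSTAL Omega is available at https://www.ebi.ac.uk/Tools/msa/clustalo/
  • Select output format NEXUS to import into Mr Bayes or PHYLIP format to import into PhyloBayes
  • Generate phylogenetic trees with PhyloBayes (http://megasun.bch.umontreal.ca/People/lartillot/www/download.html) or Mr. Bayes (see Using Mr Bayes to For Phlyogenetic Analysis)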

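CLUSTAL Omega can write PHYLIP output directly, but if an alignment is only available as aligned FASTA it may help to see what the PHYLIP layout looks like. The following is a minimal sketch of a sequential, relaxed PHYLIP writer; the function name to_phylip and the two-space name padding are assumptions, and the PhyloBayes manual should be checked for the exact format it accepts.

```python
def to_phylip(alignment):
    """Convert {name: aligned_sequence} into sequential PHYLIP text.

    The first line holds the number of taxa and the number of aligned sites;
    each following line holds a name, padding, and the full aligned sequence.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length; align them first")
    n_sites = lengths.pop()
    # Pad every name to the longest name plus two spaces so columns line up.
    pad = max(len(name) for name in alignment) + 2
    lines = ["{} {}".format(len(alignment), n_sites)]
    for name, seq in alignment.items():
        lines.append(name.ljust(pad) + seq)
    return "\n".join(lines) + "\n"


# Toy alignment with made-up names, purely to show the layout.
phylip_text = to_phylip({"seq_a": "MT-EI", "seq_b": "MTAEI"})
```

The length check matters because unaligned sequences of different lengths cannot be expressed in PHYLIP format at all.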
PhyloBayes Analysis

  • Record the software version used in your notes.
  • The PhyloBayes manual can be found at http://megasun.bch.umontreal.ca/People/lartillot/www/phylobayes4.1.pdf.