Difference between revisions of "RNA-seq Pipeline for Known Transcripts"

From Bridges Lab Protocols
Jump to: navigation, search
(Pasted in Instructions from Richard McEachin)
 
(added section for aligning reads with bowtie)
 
(7 intermediate revisions by the same user not shown)
Line 1: Line 1:
RNA-Seq pipeline – known transcripts
+
==Reference Pages==
1. Run FASTQ to assess quality of reads from sequencer and:
+
* http://seqanswers.com/wiki/How-to/RNASeq_analysis
a. Filter for quality, if applicable
+
b. Trim, if applicable
+
c. http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
+
  
2. Run bowtie-build to generate Burroughs Wheeler transformed reference genome (.ebwt format). 
+
==Sequence Quality and Trimming==
a. http://bowtie-bio.sourceforge.net/index.shtml (bowtie, tophat, and cufflinks are here).
+
# Run FASTQC to assess quality of reads from sequencer and:
b. [Optional input and parameter settings are in square brackets.]  
+
# FASTQC available at http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
c. <Required parameters are in greater than/less than brackets.
+
## Open run_fastqc on a windows machine. Individually open each sequence file and allow it to analyseSave this report.
d. This BW transformed reference genome can be created once then used repeatedly in the future.  
+
## Check this report to decide if sequences need to be trimmed or discarded. See http://bioinfo.cipf.es/courses/mda11/lib/exe/fetch.php?media=ngs_qc_tutorial_mda_val_2011.pdf for sample results and an explanation of the quality report
e. $ is the command prompt. 
+
  
$ bowtie-build [-f specifies reference genome is in fasta format] <path to input reference genome (e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19.fa)> <base name for reference genome output .ebwt files (e.g hg19)>
+
===Filter Sequences Using FastX-Toolkit===
 +
# If samples are barcoded use Fastx barcode splitter (see http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_barcode_splitter_usage for more details):
 +
<pre>
 +
/usr/local/bin/fastx_barcode_splitter.pl --bcfile FILE --prefix PREFIX [--suffix SUFFIX] [--bol|--eol] [--mismatches N] [--exact] [--partial N] [--help] [--quiet] [--debug]
 +
</pre>
 +
* This requires a barcode file in the format where BC# is the barcode number and the nucleotide names are the barcodes:
 +
<pre>
 +
#This line is a comment (starts with a 'number' sign)
 +
BC1 GATCT
 +
BC2 ATCGT
 +
BC3 GTGAT
 +
BC4 TGTCT
 +
</pre>
 +
* This file is the FILE for the --bcfile option
 +
* Example command where s_2_100.txt is the original file, mybarcodes.txt is the barcode file, 2 mismatches are allowed (default is 1).  This will generate files /tmp/bla_BC#.txt:
 +
<pre>
 +
cat s_2_100.txt | /usr/local/bin/fastx_barcode_splitter.pl --bcfile mybarcodes.txt --bol --mismatches 2 --prefix /tmp/bla_ --suffix ".txt"
 +
</pre>
 +
# Filter for quality, if applicable
 +
# Trim, if applicable using Fast-x. The following keeps 100% of the reads with a quality of 25 or greater:
 +
<pre>
 +
fastq_quality_filter -v -q 25 -p 100  -i  control-reads.txt -o control-reads-quality25.txt
 +
</pre>
  
3. Run tophat to align reads to the reference genome. I’ve included a pseudo command line as well as a “real” command line.
+
==Align Reads==
 +
This can be done with either TopHat or Bowtie, so choose one of the following. The reference genomes are located in the following locations:
 +
<pre>
 +
/database/davebrid/RNAseq/reference-genomes/hg19
 +
/database/davebrid/RNAseq/reference-genomes/mm9
 +
</pre>
 +
These reference alignments are pre-built UCSC genomes and downloaded from ftp://ftp.cbcb.umd.edu/pub/data/bowtie_indexes/
 +
 
 +
===Align Reads to Reference Genome with Bowtie===
 +
Run bowtie to align reads to reference genomes.  The following generates a sam formatted alignment using the best quality flag for reads aligned to hg19
 +
<pre>
 +
bowtie --sam --best /database/davebrid/RNAseq/reference-genomes/hg19/hg19 control-reads-quality25.txt control-aligned-quality25.sam
 +
</pre>
 +
 
 +
===Align Reads to Reference Genome with Tophat===
 +
Run tophat to align reads to the reference genome. I’ve included a pseudo command line as well as a “real” command line.
 +
<pre>
 
$ tophat [-p #processors -o ./output_directory] <./reference genome in both .ebwt and fasta formats (e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19)> <reads file to be aligned (e.g. s_1_1_sequence.fastq)>
 
$ tophat [-p #processors -o ./output_directory] <./reference genome in both .ebwt and fasta formats (e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19)> <reads file to be aligned (e.g. s_1_1_sequence.fastq)>
 
$ tophat -p 5 -o ./HG19/tophat_out_hg19_001_trimmed /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19 ./HG19/Rich_trim/A_1_16_85.fastq
 
$ tophat -p 5 -o ./HG19/tophat_out_hg19_001_trimmed /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19 ./HG19/Rich_trim/A_1_16_85.fastq
 
+
</pre>
4. Run cuffcompare to create .gtf format reference genome from a generic reference genome. Note that cuffcompare adds the tss_id and p_id columns that you will need in cuffdiff. This .gtf reference can be created once then used repeatedly in the future.
+
==Use Cuffcompare to Generate .gtf Reference==
 
+
Run cuffcompare to create .gtf format reference genome from a generic reference genome. Note that cuffcompare adds the tss_id and p_id columns that you will need in cuffdiff. This .gtf reference can be created once then used repeatedly in the future.
 +
<pre>
 
$ cuffcompare [-o ./output_directory] < input file twice (e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19.gtf /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19.gtf )>
 
$ cuffcompare [-o ./output_directory] < input file twice (e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19.gtf /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19.gtf )>
  
 
$ cuffcompare -o ./cuffcompare_out /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19_genes.gtf /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19_genes.gtf
 
$ cuffcompare -o ./cuffcompare_out /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19_genes.gtf /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19_genes.gtf
 +
</pre>
  
 
+
==Use Cuffdiff to Identify Differentially Expressed Transcripts==
5. Run cuffdiff to identify differentially abundant transcripts.
+
Run cuffdiff to identify differentially abundant transcripts.
 
+
<pre>
 
$ cuffdiff  [-p #processors -o ./output_directory –L label1,label2,etc. –T (for time series data) –N (use upper quantile normalization –compatible_hits_norm (use reference hits in normalization) –b (use reference transcripts to reduce bias, include path to file e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19.fa) –u (improve multi-read weighting) ] <transcripts.gtf (produced by cuffcompare) sample_A_accepted_hits1.bam, sample_A_accepted_hits2.bam,etc (all produced by tophat) sample_B_accepted_hits1.bam,sample_B_accepted_hits2.bam, etc>  
 
$ cuffdiff  [-p #processors -o ./output_directory –L label1,label2,etc. –T (for time series data) –N (use upper quantile normalization –compatible_hits_norm (use reference hits in normalization) –b (use reference transcripts to reduce bias, include path to file e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19.fa) –u (improve multi-read weighting) ] <transcripts.gtf (produced by cuffcompare) sample_A_accepted_hits1.bam, sample_A_accepted_hits2.bam,etc (all produced by tophat) sample_B_accepted_hits1.bam,sample_B_accepted_hits2.bam, etc>  
  
 
$ cuffdiff -o ./HG19/Cuffdiff_out_options_b_u_N_compatible/ -p 14 -L Control,PUF_kd --no-update-check -b /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19.fa -u -N --compatible-hits-norm /ccmb/CoreBA/BioinfCore/Projects/Goldstrohm_McEachin/HG19/cuffcompare_out.combined.gtf /ccmb/CoreBA/BioinfCore/Projects/Goldstrohm_McEachin/HG19/tophat_out_hg19_001_trimmed/accepted_hits.bam /ccmb/CoreBA/BioinfCore/Projects/Goldstrohm_McEachin/HG19/tophat_out_hg19_002_trimmed/accepted_hits.bam
 
$ cuffdiff -o ./HG19/Cuffdiff_out_options_b_u_N_compatible/ -p 14 -L Control,PUF_kd --no-update-check -b /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19.fa -u -N --compatible-hits-norm /ccmb/CoreBA/BioinfCore/Projects/Goldstrohm_McEachin/HG19/cuffcompare_out.combined.gtf /ccmb/CoreBA/BioinfCore/Projects/Goldstrohm_McEachin/HG19/tophat_out_hg19_001_trimmed/accepted_hits.bam /ccmb/CoreBA/BioinfCore/Projects/Goldstrohm_McEachin/HG19/tophat_out_hg19_002_trimmed/accepted_hits.bam
 +
</pre>
 +
[[Category: Bioinformatics]]
 +
[[Category: Transcription]]
 +
[[Category: RNA-seq]]
 +
[[Category: Next Generation Sequencing]]

Latest revision as of 16:08, 8 November 2011

Reference Pages

Sequence Quality and Trimming

  1. Run FASTQC to assess quality of reads from sequencer and:
  2. FASTQC available at http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
    1. Open run_fastqc on a windows machine. Individually open each sequence file and allow it to analyse. Save this report.
    2. Check this report to decide if sequences need to be trimmed or discarded. See http://bioinfo.cipf.es/courses/mda11/lib/exe/fetch.php?media=ngs_qc_tutorial_mda_val_2011.pdf for sample results and an explanation of the quality report

Filter Sequences Using FastX-Toolkit

  1. If samples are barcoded use Fastx barcode splitter (see http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_barcode_splitter_usage for more details):
/usr/local/bin/fastx_barcode_splitter.pl --bcfile FILE --prefix PREFIX [--suffix SUFFIX] [--bol|--eol] [--mismatches N] [--exact] [--partial N] [--help] [--quiet] [--debug]
  • This requires a barcode file in the format where BC# is the barcode number and the nucleotide names are the barcodes:
#This line is a comment (starts with a 'number' sign)
BC1 GATCT
BC2 ATCGT
BC3 GTGAT
BC4 TGTCT
  • This file is the FILE for the --bcfile option
  • Example command where s_2_100.txt is the original file, mybarcodes.txt is the barcode file, 2 mismatches are allowed (default is 1). This will generate files /tmp/bla_BC#.txt:
cat s_2_100.txt | /usr/local/bin/fastx_barcode_splitter.pl --bcfile mybarcodes.txt --bol --mismatches 2 --prefix /tmp/bla_ --suffix ".txt"
  1. Filter for quality, if applicable
  2. Trim, if applicable using Fast-x. The following keeps 100% of the reads with a quality of 25 or greater:
fastq_quality_filter -v -q 25 -p 100  -i  control-reads.txt -o control-reads-quality25.txt

Align Reads

This can be done with either TopHat or Bowtie, so choose one of the following. The reference genomes are located in the following locations:

/database/davebrid/RNAseq/reference-genomes/hg19
/database/davebrid/RNAseq/reference-genomes/mm9

These reference alignments are pre-built UCSC genomes and downloaded from ftp://ftp.cbcb.umd.edu/pub/data/bowtie_indexes/

Align Reads to Reference Genome with Bowtie

Run bowtie to align reads to reference genomes. The following generates a sam formatted alignment using the best quality flag for reads aligned to hg19

bowtie --sam --best /database/davebrid/RNAseq/reference-genomes/hg19/hg19 control-reads-quality25.txt control-aligned-quality25.sam

Align Reads to Reference Genome with Tophat

Run tophat to align reads to the reference genome. I’ve included a pseudo command line as well as a “real” command line.

$ tophat [-p #processors -o ./output_directory] <./reference genome in both .ebwt and fasta formats (e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19)> <reads file to be aligned (e.g. s_1_1_sequence.fastq)>
$ tophat -p 5 -o ./HG19/tophat_out_hg19_001_trimmed /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19 ./HG19/Rich_trim/A_1_16_85.fastq

Use Cuffcompare to Generate .gtf Reference

Run cuffcompare to create .gtf format reference genome from a generic reference genome. Note that cuffcompare adds the tss_id and p_id columns that you will need in cuffdiff. This .gtf reference can be created once then used repeatedly in the future.

$ cuffcompare [-o ./output_directory] < input file twice (e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19.gtf /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19.gtf )>

$ cuffcompare -o ./cuffcompare_out /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19_genes.gtf /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19_genes.gtf

Use Cuffdiff to Identify Differentially Expressed Transcripts

Run cuffdiff to identify differentially abundant transcripts.

$ cuffdiff  [-p #processors -o ./output_directory –L label1,label2,etc. –T (for time series data) –N (use upper quantile normalization –compatible_hits_norm (use reference hits in normalization) –b (use reference transcripts to reduce bias, include path to file e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19.fa) –u (improve multi-read weighting) ] <transcripts.gtf (produced by cuffcompare) sample_A_accepted_hits1.bam, sample_A_accepted_hits2.bam,etc (all produced by tophat) sample_B_accepted_hits1.bam,sample_B_accepted_hits2.bam, etc> 

$ cuffdiff -o ./HG19/Cuffdiff_out_options_b_u_N_compatible/ -p 14 -L Control,PUF_kd --no-update-check -b /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19.fa -u -N --compatible-hits-norm /ccmb/CoreBA/BioinfCore/Projects/Goldstrohm_McEachin/HG19/cuffcompare_out.combined.gtf /ccmb/CoreBA/BioinfCore/Projects/Goldstrohm_McEachin/HG19/tophat_out_hg19_001_trimmed/accepted_hits.bam /ccmb/CoreBA/BioinfCore/Projects/Goldstrohm_McEachin/HG19/tophat_out_hg19_002_trimmed/accepted_hits.bam