Difference between revisions of "RNA-seq Pipeline for Known Transcripts"
From Bridges Lab Protocols
Davebridges (Talk | contribs) (added instructions for barcode splitting) |
Davebridges (Talk | contribs) m (removed linebreak) |
||
Line 21: | Line 21: | ||
* Example command where s_2_100.txt is the original file, mybarcodes.txt is the barcode file, 2 mismatches are allowed (default is 1). This will generate files /tmp/bla_BC#.txt: | * Example command where s_2_100.txt is the original file, mybarcodes.txt is the barcode file, 2 mismatches are allowed (default is 1). This will generate files /tmp/bla_BC#.txt: | ||
<pre> | <pre> | ||
− | cat s_2_100.txt | /usr/local/bin/fastx_barcode_splitter.pl --bcfile mybarcodes.txt --bol --mismatches 2 | + | cat s_2_100.txt | /usr/local/bin/fastx_barcode_splitter.pl --bcfile mybarcodes.txt --bol --mismatches 2 --prefix /tmp/bla_ --suffix ".txt" |
− | + | ||
</pre> | </pre> | ||
# Filter for quality, if applicable | # Filter for quality, if applicable |
Revision as of 13:31, 21 October 2011
Contents
Sequence Quality and Trimming
- Run FASTQC to assess quality of reads from sequencer and:
- FASTQC available at http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
- Open run_fastqc on a windows machine. Individually open each sequence file and allow it to analyse. Save this report.
- Check this report to decide if sequences need to be trimmed or discarded. See http://bioinfo.cipf.es/courses/mda11/lib/exe/fetch.php?media=ngs_qc_tutorial_mda_val_2011.pdf for sample results and an explanation of the quality report
Filter Sequences Using FastX-Toolkit
- If samples are barcoded use Fastx barcode splitter (see http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_barcode_splitter_usage for more details):
/usr/local/bin/fastx_barcode_splitter.pl --bcfile FILE --prefix PREFIX [--suffix SUFFIX] [--bol|--eol] [--mismatches N] [--exact] [--partial N] [--help] [--quiet] [--debug]
- This requires a barcode file in the format where BC# is the barcode number and the nucleotide names are the barcodes:
#This line is a comment (starts with a 'number' sign) BC1 GATCT BC2 ATCGT BC3 GTGAT BC4 TGTCT
- This file is the FILE for the --bcfile option
- Example command where s_2_100.txt is the original file, mybarcodes.txt is the barcode file, 2 mismatches are allowed (default is 1). This will generate files /tmp/bla_BC#.txt:
cat s_2_100.txt | /usr/local/bin/fastx_barcode_splitter.pl --bcfile mybarcodes.txt --bol --mismatches 2 --prefix /tmp/bla_ --suffix ".txt"
- Filter for quality, if applicable
- Trim, if applicable
Generate a Reference Genome
- Run bowtie-build to generate Burroughs Wheeler transformed reference genome (.ebwt format).
- http://bowtie-bio.sourceforge.net/index.shtml (bowtie, tophat, and cufflinks are here).
- [Optional input and parameter settings are in square brackets.]
- <Required parameters are in greater than/less than brackets.>
- This BW transformed reference genome can be created once then used repeatedly in the future.
- $ is the command prompt.
$ bowtie-build [-f specifies reference genome is in fasta format] <path to input reference genome (e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19.fa)> <base name for reference genome output .ebwt files (e.g hg19)>
Align Reads to Reference Genome with Tophat
Run tophat to align reads to the reference genome. I’ve included a pseudo command line as well as a “real” command line.
$ tophat [-p #processors -o ./output_directory] <./reference genome in both .ebwt and fasta formats (e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19)> <reads file to be aligned (e.g. s_1_1_sequence.fastq)> $ tophat -p 5 -o ./HG19/tophat_out_hg19_001_trimmed /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19 ./HG19/Rich_trim/A_1_16_85.fastq
Use Cuffcompare to Generate .gtf Reference
Run cuffcompare to create .gtf format reference genome from a generic reference genome. Note that cuffcompare adds the tss_id and p_id columns that you will need in cuffdiff. This .gtf reference can be created once then used repeatedly in the future.
$ cuffcompare [-o ./output_directory] < input file twice (e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19.gtf /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19.gtf )> $ cuffcompare -o ./cuffcompare_out /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19_genes.gtf /ccmb/CoreBA/BioinfCore/Common/DATA/CufflinksData_hg19/hg19_genes.gtf
Use Cuffdiff to Identify Differentially Expressed Transcripts
Run cuffdiff to identify differentially abundant transcripts.
$ cuffdiff [-p #processors -o ./output_directory –L label1,label2,etc. –T (for time series data) –N (use upper quantile normalization –compatible_hits_norm (use reference hits in normalization) –b (use reference transcripts to reduce bias, include path to file e.g. /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19.fa) –u (improve multi-read weighting) ] <transcripts.gtf (produced by cuffcompare) sample_A_accepted_hits1.bam, sample_A_accepted_hits2.bam,etc (all produced by tophat) sample_B_accepted_hits1.bam,sample_B_accepted_hits2.bam, etc> $ cuffdiff -o ./HG19/Cuffdiff_out_options_b_u_N_compatible/ -p 14 -L Control,PUF_kd --no-update-check -b /ccmb/CoreBA/BioinfCore/Common/DATA/BowtieData/H_Sapiens/hg19.fa -u -N --compatible-hits-norm /ccmb/CoreBA/BioinfCore/Projects/Goldstrohm_McEachin/HG19/cuffcompare_out.combined.gtf /ccmb/CoreBA/BioinfCore/Projects/Goldstrohm_McEachin/HG19/tophat_out_hg19_001_trimmed/accepted_hits.bam /ccmb/CoreBA/BioinfCore/Projects/Goldstrohm_McEachin/HG19/tophat_out_hg19_002_trimmed/accepted_hits.bam