CSE5599-BMI7830: Laboratory Assignment 4

Release date 11/19/2015 – Due date 12/5/2015

Learning goals

Learn more about the use of Galaxy.
Learn the nuances of aligning raw RNA-seq data and of assessing and improving read quality.
Learn the workflows for processing RNA-seq data.
Learn about the tools used in these workflows -- FastQC, FASTQ Quality Trimmer, Cufflinks, and TopHat.
Learn about visualizing with IGV or Trackster.
Learn about comparing RNA-seq data.

The Actual Assignment

1. Download data

Download the sequence read files for a paired-end sequencing dataset from adrenal and brain tissues. The reads were sampled so that they map to a specific region of chr19 (positions 3.0-3.5 million). Please visit https://osu.box.com/rna-seq-data and collect the files adrenal_1.fastq, adrenal_2.fastq, brain_1.fastq, and brain_2.fastq.

2. Upload data to Galaxy

· Upload the sequence read files listed above to Galaxy (https://usegalaxy.org).

· Make sure to set the data type to “fastqsanger”.

3. Quality Control

· Run FastQC (under the NGS: QC and manipulation folder) to perform quality control checks on the raw sequence data for both datasets (adrenal and brain).

· How many sequence reads exist in each dataset?

· What is the sequence length?

· Present (as a figure) and discuss the per base sequence quality result.
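
The read count, sequence length, and per base quality reported by FastQC can be cross-checked by hand. The following is a minimal sketch, not how FastQC itself is implemented. It assumes four-line FASTQ records with Sanger (Phred+33) quality encoding, as implied by the fastqsanger data type. The function name fastq_stats and the two toy reads are made up for illustration.

```python
import io
from collections import Counter

def fastq_stats(handle):
    """Return (read count, Counter of read lengths, mean Phred score per position)."""
    n_reads = 0
    lengths = Counter()
    qual_sums = []    # sum of quality scores at each read position
    qual_counts = []  # number of reads that cover each position
    while True:
        header = handle.readline()
        if not header.strip():
            break  # end of file (or trailing blank line)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        n_reads += 1
        lengths[len(seq)] += 1
        for i, ch in enumerate(qual):
            if i == len(qual_sums):
                qual_sums.append(0)
                qual_counts.append(0)
            qual_sums[i] += ord(ch) - 33  # Phred+33 offset (assumed encoding)
            qual_counts[i] += 1
    mean_quality = [s / c for s, c in zip(qual_sums, qual_counts)]
    return n_reads, lengths, mean_quality

# Toy input: hypothetical reads, not taken from the assignment data.
toy = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACG\n+\nI5#\n")
n_reads, lengths, mean_quality = fastq_stats(toy)
print(n_reads, dict(lengths), mean_quality)
```

To run it on a downloaded file, pass an open file handle, for example open("adrenal_1.fastq"), in place of the toy input.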

4. Quality Trimming

· Based on the per base sequence quality results, and assuming that bases with a quality score of 20 or below are unusable, do you think quality trimming is needed?

· If trimming is needed, trim the reads using FASTQ Quality Trimmer (under the NGS: QC and manipulation folder).

· Discuss the change after trimming is performed.
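
To make the idea of quality trimming concrete, the sketch below removes bases from the 3' end of a read while their quality score is 20 or below. This is a simplified illustration under the same Phred+33 assumption; it is not the algorithm used by the Galaxy tool, whose window size and other settings are configured in its form. The function name trim_3prime and the example read are hypothetical.

```python
def trim_3prime(seq, qual, threshold=20, offset=33):
    """Drop bases from the 3' end while their Phred score is <= threshold."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset <= threshold:
        end -= 1
    return seq[:end], qual[:end]

# Hypothetical read: 'I' encodes Q40, '5' encodes Q20, '#' encodes Q2.
trimmed = trim_3prime("ACGTACGT", "IIIII5##")
print(trimmed)

# A low-quality base in the middle of a read is left alone by this sketch.
untouched = trim_3prime("ACGT", "I#II")
print(untouched)
```

Because only the 3' end is examined, reads whose quality drops and then recovers are not shortened. A window-based trimmer may behave differently on such reads.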

5. Alignment

· Align the sequence reads (R1 and R2) to the reference genome build hg19 using TopHat (under the NGS: RNA-Seq folder) for both the adrenal and brain datasets.

· Download the alignment results (BAM and BAI files) and visualize them in IGV or Trackster.

· Browse to chr19:3,280,000-3,300,000 and show the alignments in this region (as a figure).

· Report the following from the TopHat align_summary output:

o The total number of reads

o Overall alignment rate

o Number or percentage of multiple alignments

o Number or percentage of discordant alignments
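
For intuition about where such numbers come from, the sketch below tallies alignment records from SAM-format text using the standard FLAG bits and the optional NH tag. It is an illustration only: it counts records rather than read pairs, it treats "paired but not flagged as a proper pair" as a rough stand-in for discordant alignments, and its tallies need not match the definitions used in align_summary. The function name tally_alignments and the toy records are hypothetical.

```python
def tally_alignments(sam_lines):
    """Count mapped primary records, multi-mapped records, and non-proper pairs."""
    counts = {"mapped": 0, "multi": 0, "not_proper_pair": 0}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if (flag & 0x4) or (flag & 0x100):
            continue  # skip unmapped reads and secondary alignments
        counts["mapped"] += 1
        # NH:i:<n> is the number of reported alignments for the read;
        # assume 1 when the tag is absent.
        tags = fields[11:]
        nh = next((int(t[5:]) for t in tags if t.startswith("NH:i:")), 1)
        if nh > 1:
            counts["multi"] += 1
        # Paired (0x1), mate mapped (0x8 unset), not a proper pair (0x2 unset).
        if (flag & 0x1) and not (flag & 0x8) and not (flag & 0x2):
            counts["not_proper_pair"] += 1
    return counts

# Toy records (hypothetical, not from the assignment data).
toy_sam = [
    "@HD\tVN:1.0",
    "r1\t99\tchr19\t3280100\t50\t4M\t=\t3280200\t104\tACGT\tIIII\tNH:i:1",
    "r2\t65\tchr19\t3290000\t3\t4M\tchr1\t500\t0\tACGT\tIIII\tNH:i:3",
    "r3\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
counts = tally_alignments(toy_sam)
print(counts)
```

The values to report for this assignment should still be taken from align_summary itself.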

6. Quantification

· Estimate the transcript abundances using Cufflinks (under the NGS: RNA-Seq folder) for both datasets.

· Use the hg19 reference annotation file Genomes_UCSC_hg19_chr19_gene_annotation.gtf, which you can download from https://osu.app.box.com/rna-seq-data

· For either the adrenal OR the brain data, pick a gene that has a non-zero FPKM value and more than one transcript. Report the FPKM value of each transcript of that gene, and compare these values with the gene-level FPKM value of the same gene. (Hint: use the transcript_expression and gene_expression tables.)
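
Finding a suitable gene by eye can be tedious, so the sketch below groups the rows of a transcript expression table by gene and keeps genes with more than one transcript and at least one non-zero FPKM. It assumes a tab-separated table with columns named tracking_id, gene_id, and FPKM; check these against the header of your downloaded file. The function name fpkm_by_gene and the toy table are hypothetical.

```python
import csv
import io
from collections import defaultdict

def fpkm_by_gene(transcript_table):
    """Map gene_id -> {transcript id: FPKM} from a tab-separated table."""
    per_gene = defaultdict(dict)
    for row in csv.DictReader(transcript_table, delimiter="\t"):
        per_gene[row["gene_id"]][row["tracking_id"]] = float(row["FPKM"])
    return per_gene

# Toy table (hypothetical identifiers and values).
toy = io.StringIO(
    "tracking_id\tgene_id\tFPKM\n"
    "TX_A1\tGENE_A\t12.5\n"
    "TX_A2\tGENE_A\t0.0\n"
    "TX_B1\tGENE_B\t3.0\n"
)
table = fpkm_by_gene(toy)

# Keep genes with more than one transcript and a non-zero FPKM somewhere.
candidates = {
    gene: transcripts
    for gene, transcripts in table.items()
    if len(transcripts) > 1 and any(v > 0 for v in transcripts.values())
}
print(candidates)
```

The transcript values listed for a candidate gene can then be set beside that gene's row in the gene_expression table for the comparison.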