BMI7830-CSE5599 – Laboratory Assignment 1

Release date 09/17/15 – Due date 10/1/15

Learning goals:

1.    Get familiar with R/RStudio and Bioconductor
2.    Learn the workflow for analyzing gene expression microarray data (Affymetrix microarrays specifically, though other microarray types follow a similar workflow)
3.    Learn to use public data from the NCBI Gene Expression Omnibus (GEO)
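
As a starting point for the first goal, the sketch below checks which packages are missing and installs them. The package list is an assumption about what this workflow typically needs (it is not prescribed by the assignment), and the helper names `missing_packages` and `install_missing` are hypothetical. `BiocManager::install()` is the current Bioconductor installer and handles both Bioconductor and CRAN packages.

```r
# Assumed package set for this kind of analysis; adjust to what your script uses.
needed <- c("GEOquery", "affy", "gcrma", "limma", "hgu133plus2.db", "ggplot2")

# Return the subset of `pkgs` that cannot be loaded on this machine.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install whatever is missing (needs an internet connection when it runs).
install_missing <- function(pkgs = needed) {
  todo <- missing_packages(pkgs)
  if (length(todo) == 0) return(invisible(character(0)))
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    utils::install.packages("BiocManager")
  }
  BiocManager::install(todo)
  invisible(todo)
}
```

Calling `install_missing()` once at the top of a session is enough; it does nothing when every package is already present.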

Overview:

Compare gene expression profiles between two types of non-small cell lung cancer (NSCLC): adenocarcinoma and squamous cell carcinoma. Identify differentially expressed genes; visualize them in a volcano plot; search for genes of interest; save the output; and carry out GO and gene set enrichment analysis on the highly differentially expressed genes to identify biological processes and pathways.

Steps:

All appropriate steps must be carried out through an R script/program.

1.    Download the raw data (compressed .CEL files) for GSE10245 to a data directory of your choice and decompress them (using R).
2.    Generate a sample annotation file (using Excel or a text editor) and save it to the same data directory.
3.    Carry out GC-RMA normalization on the .CEL files (using Bioconductor packages).
4.    Visualization and QC:

      a.    Visualize boxplots of the data before and after normalization (using Bioconductor/R packages)
      b.    Visualize the distribution (histogram) of the data before and after normalization (using Bioconductor/R packages)
      c.    Visualize the images of the .CEL files (using Bioconductor/R packages)

5.    Carry out a comparative analysis between the two types of NSCLC using the limma package (Bioconductor)
6.    Map the probeset IDs to gene symbols and save the output (probeset IDs, log fold changes, p-values, gene symbols, etc.) to a tab-delimited text file (using R)
7.    Generate a volcano plot for the comparative analysis output (using the ggplot2 package)
8.    Label the probes associated with EGFR, TP53 and SOX2 on the volcano plot
9.    Take the probesets with |log2FC| > 2 and carry out enrichment analysis using ToppGene/ToppFun or DAVID.
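
The download, normalization, comparison and annotation steps above can be sketched roughly as follows. This is a minimal outline under stated assumptions, not the template script: the helper names are hypothetical, the group labels "AC" and "SCC" stand for whatever labels you put in your own sample annotation file, and the annotation package `hgu133plus2.db` assumes the arrays are Affymetrix HG-U133 Plus 2.0 (check the platform listed on the GEO record). QC plots and the volcano plot are not covered here.

```r
# Package functions are called as pkg::fun, so defining these helpers loads
# nothing; the packages must be installed before the helpers are run.

# Download the raw archive for a GEO series and decompress the .CEL files.
fetch_cel_files <- function(gse = "GSE10245", data_dir = "data") {
  dir.create(data_dir, showWarnings = FALSE, recursive = TRUE)
  GEOquery::getGEOSuppFiles(gse, baseDir = data_dir)   # writes to data_dir/<gse>/
  gse_dir <- file.path(data_dir, gse)
  tar_file <- list.files(gse_dir, pattern = "_RAW\\.tar$", full.names = TRUE)
  utils::untar(tar_file, exdir = gse_dir)
  gz_files <- list.files(gse_dir, pattern = "\\.cel\\.gz$",
                         full.names = TRUE, ignore.case = TRUE)
  for (f in gz_files) GEOquery::gunzip(f, overwrite = TRUE)
  gse_dir
}

# Read the .CEL files and apply GC-RMA; keep the raw data for the QC plots.
normalize_cel <- function(cel_dir) {
  raw <- affy::ReadAffy(celfile.path = cel_dir)
  list(raw = raw, eset = gcrma::gcrma(raw))
}

# Design matrix with one column per tumor type and no intercept.
make_design <- function(group) {
  group <- factor(group)
  design <- stats::model.matrix(~ 0 + group)
  colnames(design) <- levels(group)
  design
}

# Moderated t-test with limma; positive logFC means higher in the first group
# of the contrast (here the assumed label "AC").
compare_groups <- function(eset, group, contrast = "AC - SCC") {
  design <- make_design(group)
  fit <- limma::lmFit(eset, design)
  cm <- limma::makeContrasts(contrasts = contrast, levels = design)
  fit2 <- limma::eBayes(limma::contrasts.fit(fit, cm))
  limma::topTable(fit2, number = Inf, sort.by = "P")
}

# Add gene symbols (assumes the HG-U133 Plus 2.0 platform) and save as
# tab-delimited text.
annotate_and_save <- function(tab, out_file) {
  tab$ProbeID <- rownames(tab)
  tab$Symbol <- AnnotationDbi::mapIds(hgu133plus2.db::hgu133plus2.db,
                                      keys = tab$ProbeID, column = "SYMBOL",
                                      keytype = "PROBEID", multiVals = "first")
  utils::write.table(tab, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
  tab
}
```

The `group` vector passed to `compare_groups()` must be in the same order as the samples in the normalized expression set, so it is worth checking the sample names against your annotation file before fitting the model.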

Templates:
Many of these steps appear in THIS R script, which examines differential expression in a different dataset, from patients with breast cancer. Please note that the very last step (#9) is not in the script and will need to be completed by the class. The breast cancer study has four subtypes. The annotation file for that study can be found HERE.

What to Submit?

Your code, clearly commented to indicate which code relates to each of the above steps (points will be taken off for code without clear comments). – 10pt

The items below should be in one Word file:

Figures showing the boxplots before and after normalization. – 5pt
Report the number of probesets with |logFC| > 2, 3, 4, respectively. – 5pt
A table of the top 10 up-regulated probesets in adenocarcinoma with the p-values, logFC, gene symbols and other output parameters (you can use the output from the head function). The table should have a clear caption. – 10pt
A table of the top 10 down-regulated probesets in adenocarcinoma with the p-values, logFC, gene symbols and other output parameters (you can use the output from the head function). The table should have a clear caption. – 10pt
Volcano plot with probesets with |logFC|>3 and p<0.05 highlighted. – 10pt
Volcano plot with the probesets for the three genes listed in Step 8 labeled. You can use one plot for all genes or a separate plot for each gene. Just make sure your figures have clear captions. – 15pt
Figures showing the histograms before and after normalization. – 5pt
For the enrichment analysis, please report:

A table of the top 5 (or all, whichever is smaller) most enriched GO Biological Process terms (with the term name, GO number, p-value, and the genes in your list annotated with that term) – 10pt
A table of the top 3 (or all, whichever is smaller) most enriched Pathways (with the term name, GO number, p-value, and the genes in your list annotated with that term) – 10pt
A table of other significantly enriched “things” that you think might be of interest for lung cancer research. List at least three (with the term name, GO number, p-value, and the genes in your list annotated with that term). Explain why you think they might be interesting to lung cancer research. – 10pt
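
Several of the report items above can be derived from the comparison table with a few small helpers. The sketch below assumes a data frame `tab` with columns `logFC`, `P.Value` (the names used by limma's `topTable()`) and a gene symbol column called `Symbol` (a hypothetical name; use whatever you called it). It also assumes "top" means ranked by logFC, which is one reasonable reading of the assignment, and that positive logFC means higher expression in adenocarcinoma. All function names are illustrative.

```r
# Number of probesets with |logFC| above each cutoff.
count_by_threshold <- function(logFC, cutoffs = c(2, 3, 4)) {
  counts <- vapply(cutoffs, function(k) sum(abs(logFC) > k), integer(1))
  stats::setNames(counts, paste0("|logFC|>", cutoffs))
}

# Top n up- and down-regulated probesets, ranked by logFC (assumption).
top_up   <- function(tab, n = 10) utils::head(tab[order(-tab$logFC), ], n)
top_down <- function(tab, n = 10) utils::head(tab[order(tab$logFC), ], n)

# Unique gene symbols for the probesets passing the fold-change cutoff,
# ready to paste into ToppGene/ToppFun or DAVID.
genes_for_enrichment <- function(tab, fc_cut = 2) {
  symbols <- tab$Symbol[abs(tab$logFC) > fc_cut]
  unique(symbols[!is.na(symbols)])
}

# Volcano plot: highlight probesets passing both cutoffs and label the
# probesets of selected genes. Requires the ggplot2 package.
volcano_plot <- function(tab, fc_cut = 3, p_cut = 0.05,
                         genes = c("EGFR", "TP53", "SOX2")) {
  tab$highlight <- abs(tab$logFC) > fc_cut & tab$P.Value < p_cut
  labelled <- tab[tab$Symbol %in% genes, ]
  ggplot2::ggplot(tab, ggplot2::aes(x = logFC, y = -log10(P.Value))) +
    ggplot2::geom_point(ggplot2::aes(colour = highlight), size = 1) +
    ggplot2::geom_text(data = labelled, ggplot2::aes(label = Symbol),
                       vjust = -0.8) +
    ggplot2::labs(x = "log2 fold change", y = "-log10(p-value)")
}
```

A gene can map to more than one probeset, so the labeled plot may show the same symbol several times; that is expected rather than a bug.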

Submission
You need to submit two files: the code file (.R file) and the Word file containing the answers to the above questions. Please do NOT submit any other files. Use CARMEN to submit the files, uploading them to the class repository.