BMI7830-CSE5599 – Laboratory Assignment 1
Release date 09/17/15 – Due date 10/1/15
Learning goals:
1. Get familiar with R/RStudio and Bioconductor
2. Learn the workflow for analyzing gene expression microarrays (Affymetrix microarrays specifically, though other microarray types follow a similar workflow)
3. Learn to use public data from the NCBI Gene Expression Omnibus (GEO)
Overview:
Compare gene expression profiles between two types of non-small cell lung cancer (NSCLC): adenocarcinoma versus squamous cell carcinoma. Identify differentially expressed genes; visualize them in a volcano plot; search for genes of interest; save the output; and carry out GO and gene set enrichment analysis on the highly differentially expressed genes to identify biological processes and pathways.
Steps:
All appropriate steps must be carried out through an R script/program.
1. Download the raw data (compressed .CEL files) for GSE10245 to a data directory of your choice and decompress them (using R).
2. Generate a sample annotation file (using Excel or a text editor) and save it to the same data directory.
3. Carry out GC-RMA normalization on the .CEL files (using Bioconductor packages).
4. Visualization and QC:
   - Visualize boxplots of the data before and after normalization (using Bioconductor/R packages)
   - Visualize the distribution (histogram) of the data before and after normalization (using Bioconductor/R packages)
   - Visualize the images of the .CEL files (using Bioconductor/R packages)
5. Carry out a comparative analysis between the two types of NSCLC using the limma package (using Bioconductor/R packages)
6. Map the probeset IDs to gene symbols and save the output (probeset IDs, log fold changes, p-values, gene symbols, etc.) to a tab-delimited text file (using R)
7. Generate a volcano plot for the comparative analysis output (using the ggplot2 package)
8. Label the probes associated with EGFR, TP53, and SOX2 on the volcano plot
9. Take the probesets with |log2FC| > 2 and carry out enrichment analysis using ToppGene/ToppFun or DAVID.
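The steps from download through the limma comparison could be sketched roughly as follows. This is only an illustrative outline, not the course template. It assumes that the Bioconductor packages GEOquery, affy, gcrma, and limma are installed (along with the chip-specific CDF and probe packages that gcrma needs), that the GEO supplementary archive is named GSE10245_RAW.tar, and that the sample annotation file is tab-delimited with two hypothetical columns, filename (the decompressed .CEL file name) and type (AC or SCC). The function name run_nsclc_pipeline is also made up for illustration.

```r
# Illustrative sketch only; column names, file names and the function name
# are assumptions, not part of the assignment or the course template.
run_nsclc_pipeline <- function(data_dir, annotation_file) {
  # Download the raw archive for GSE10245 into data_dir and unpack it
  GEOquery::getGEOSuppFiles("GSE10245", makeDirectory = FALSE,
                            baseDir = data_dir)
  utils::untar(file.path(data_dir, "GSE10245_RAW.tar"), exdir = data_dir)

  # Decompress every .CEL.gz file in place
  gz_files <- list.files(data_dir, pattern = "\\.CEL\\.gz$",
                         ignore.case = TRUE, full.names = TRUE)
  for (f in gz_files) {
    GEOquery::gunzip(f, overwrite = TRUE)
  }

  # Read the hand-made sample annotation (assumed columns: filename, type)
  ann <- utils::read.delim(annotation_file, stringsAsFactors = FALSE)

  # Read the .CEL files and apply GC-RMA normalization
  raw  <- affy::ReadAffy(filenames = ann$filename, celfile.path = data_dir)
  eset <- gcrma::gcrma(raw)

  # Two-group comparison with limma: adenocarcinoma (AC) versus
  # squamous cell carcinoma (SCC)
  type   <- factor(ann$type, levels = c("SCC", "AC"))
  design <- stats::model.matrix(~ 0 + type)
  colnames(design) <- levels(type)

  fit      <- limma::lmFit(eset, design)
  contrast <- limma::makeContrasts(AC - SCC, levels = design)
  fit2     <- limma::eBayes(limma::contrasts.fit(fit, contrast))

  # Full results table: one row per probeset, with logFC and p-values
  limma::topTable(fit2, coef = 1, number = Inf, adjust.method = "BH")
}
```

A call might look like `res <- run_nsclc_pipeline("data", "data/annotation.txt")`, where both paths are placeholders. Because the contrast is written as AC - SCC, a positive logFC in this sketch means higher expression in adenocarcinoma.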
Templates:
Many of the steps are implemented in THIS R script, which performs a similar differential expression analysis on data from breast cancer patients. Please note that the very last step (#9) is not in the script and must be completed by the class. The breast cancer study has four subtypes. The annotation file for that study can be found HERE.
What to Submit?
- Your code, clearly commented to indicate which code relates to each of the above steps (points will be taken off for code without clear comments). – 10pt
The items below should be in one Word file:
- Figures showing the boxplots before and after normalization. – 5pt
- Report the number of probesets with |logFC| > 2, 3, 4, respectively. – 5pt
- A table of the top 10 up-regulated probesets in adenocarcinoma with the p-values, logFC, gene symbols, and other output parameters (you can use the output from the head function). The table should have a clear caption. – 10pt
- A table of the top 10 down-regulated probesets in adenocarcinoma with the p-values, logFC, gene symbols, and other output parameters (you can use the output from the head function). The table should have a clear caption. – 10pt
- A volcano plot with the probesets having |logFC| > 3 and p < 0.05 highlighted. – 10pt
- A volcano plot with the probesets for the three genes listed in Step 8 (EGFR, TP53, SOX2) labeled. You can use one plot for all three genes or a separate plot for each gene. Just make sure your figures have clear captions. – 15pt
- Figures showing the histograms before and after normalization. – 5pt
- For the enrichment analysis, please report:
  - A table of the top 5 (or all, whichever is smaller) most enriched GO Biological Process terms (with the term name, GO number, p-values, and the genes in your list having that term). – 10pt
  - A table of other significantly enriched “things” that you think might be of interest to lung cancer research; list at least three, with p-values (with the term name, GO number, p-values, and the genes in your list having that term). Explain why you think they might be interesting to lung cancer research. – 10pt
  - A table of the top 3 (or all, whichever is smaller) most enriched pathways (with the term name, GO number, p-values, and the genes in your list having that term). – 10pt
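The probeset counts, top-10 tables, and volcano highlighting requested above can all be derived from a limma-style results table with a few lines of base R. The sketch below uses a simulated data frame as a stand-in for the real results; the values and the probeset names are made up, and the column names logFC and P.Value are assumed to match those of a limma results table. Ranking the top 10 by logFC is one possible reading of "top"; ranking by p-value within each direction is another.

```r
# Simulated stand-in for a limma results table (made-up values,
# for illustration only; replace with the real analysis output).
set.seed(1)
n <- 1000
res <- data.frame(
  probeset = sprintf("probe_%04d", seq_len(n)),
  logFC    = rnorm(n, mean = 0, sd = 1.5),
  P.Value  = runif(n),
  stringsAsFactors = FALSE
)

# Number of probesets exceeding each absolute log fold-change cutoff
cutoffs <- c(2, 3, 4)
counts  <- sapply(cutoffs, function(k) sum(abs(res$logFC) > k))
names(counts) <- paste0("|logFC| > ", cutoffs)
print(counts)

# Top 10 up- and down-regulated probesets, ranked here by logFC
# (an assumption; the assignment does not fix the ranking criterion)
top_up   <- head(res[order(res$logFC, decreasing = TRUE), ], 10)
top_down <- head(res[order(res$logFC), ], 10)

# Columns needed for a volcano plot: the y-axis value and a flag for
# probesets with |logFC| > 3 and p < 0.05
res$neg_log10_p <- -log10(res$P.Value)
res$highlight   <- abs(res$logFC) > 3 & res$P.Value < 0.05
```

The highlight column can then be mapped to point color in whichever plotting package is used for the volcano plot.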
Submission
You need to submit two files: the code file (.R file) and the Word file containing the answers to the above questions. Please do NOT submit any other files. Use CARMEN to submit the files and upload them to the class repository.