BMI7830 – Laboratory Assignment 2

Release date 10/13/2015 – Due date 10/22/2015

Learning goals:

Get familiar with GenePattern
Learn to build pipelines
Learn how to deploy pipelines on various data
Learn to annotate

Overview:

This lab is very similar to Lab 1 in terms of biological context and findings. The goals are prosaic: we want you to learn how to assemble GenePattern pipelines, given GenePattern's large user community and ecosystem and its distinct style of bioinformatics.

Steps:

Use GenePattern (online or installed) to accomplish the following steps:
Visit the GenePattern site - http://genepattern.broadinstitute.org. You will need to register.
Read the tutorial on GenePattern - http://www.broadinstitute.org/cancer/software/genepattern/tutorial/gp_tutorial.html
Read the tutorial on Gene Expression Analysis - http://www.broadinstitute.org/cancer/software/genepattern/desc/expression.
Click Differential Expression Analysis (http://genepattern.broadinstitute.org/gp/pages/protocols/DiffExp.html)
Find all relevant modules at http://www.broadinstitute.org/cancer/software/genepattern/modules?taskType=Gene+List+Selection
Once again we use the results of the paper Golub T.R., Slonim D.K., et al. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science 286, 531-537 (1999).
ComparativeMarkerSelection analysis is used to find marker genes, that is, the genes in the dataset that are most closely correlated with the two phenotypes (ALL and AML). Note that AML is acute myeloid leukemia and ALL is acute lymphoblastic leukemia; both are referred to as phenotypes in the bio-/life sciences literature.
Employ a pipeline as listed below:

Choose Open module with example data in the GenePattern window.
Select parameters. Please note the following about pre-processing.
PreprocessDataset can preprocess the data in one or more of the ways listed in its pre-processing options, applied in this order:
Set threshold and ceiling values. Any value lower than the threshold is reset to the threshold, and any value higher than the ceiling is reset to the ceiling.
Convert each expression value to the log base 2 of the value.
Remove a gene (row) if a given number of its sample values are less than a given threshold.
Remove genes (rows) that do not have a minimum fold change or expression variation.
Discretize or normalize the data.
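
To make the five pre-processing operations concrete, here is a minimal Python sketch that applies them in the listed order to a small genes-by-samples table. It is not GenePattern's implementation. The function name, parameter names, and default values are illustrative assumptions, and row-wise z-scoring is assumed as the normalization in the last step. Because the filters run after the log transform here, the fold change max/min is tested as a difference on the log2 scale.

```python
import math
from statistics import mean, pstdev


def preprocess_expression(rows, floor=20.0, ceiling=20000.0,
                          low_value=100.0, low_count=3, min_fold=3.0):
    """Apply the five listed steps, in order, to a list of gene rows.

    All names and default values are hypothetical, chosen for illustration.
    Returns (kept_indices, processed_rows).
    """
    kept, processed = [], []
    for index, row in enumerate(rows):
        # Step 1: reset values below the threshold or above the ceiling.
        clipped = [min(max(value, floor), ceiling) for value in row]
        # Step 2: convert each value to log base 2.
        logged = [math.log2(value) for value in clipped]
        # Step 3: drop the gene if at least `low_count` samples are below
        # `low_value` (compared on the log2 scale).
        n_low = sum(value < math.log2(low_value) for value in logged)
        if n_low >= low_count:
            continue
        # Step 4: drop the gene if max/min is below `min_fold`;
        # on the log2 scale the ratio becomes a difference.
        if max(logged) - min(logged) < math.log2(min_fold):
            continue
        # Step 5: normalize the row to mean 0 and standard deviation 1.
        mu, sd = mean(logged), pstdev(logged)
        processed.append([(value - mu) / sd if sd > 0 else 0.0
                          for value in logged])
        kept.append(index)
    return kept, processed


if __name__ == "__main__":
    toy = [
        [10, 100, 1000, 30000],   # varies strongly across samples
        [50, 50, 50, 50],         # every sample is low
        [200, 210, 220, 230],     # almost no variation
    ]
    kept, table = preprocess_expression(toy)
    print("kept gene rows:", kept)
```

Changing one parameter at a time in a sketch like this (for example `min_fold`) is a quick way to see which setting removes the most genes before you repeat the experiment in GenePattern.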

Choose HeatMapVisualizer to visualize the data. We will again discuss the parameters in class. The heat map changes as you change the parameters in the steps above. The idea is to identify the parameters that alter the visualization the most.
Use ComparativeMarkerSelection to select the genes. Give some thought to the selection of parameters.
Use ComparativeMarkerSelectionViewer. Use custom filters based on false-discovery-rate (FDR) to filter results.
Complete Gene Set Enrichment Analysis (GSEA) on the selected genes.
Run the above steps without creating a pipeline (practice run).
Now create a pipeline called Name_Lab2_CompExperession_Pipeline and store it.
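
The marker selection and FDR filtering steps above can be illustrated with a short Python sketch. It scores one gene with a two-sample t statistic, estimates a p-value by shuffling the class labels, and converts a list of p-values into Benjamini-Hochberg FDR estimates. This is a sketch under stated assumptions, not the ComparativeMarkerSelection code: the choice of a Welch t statistic, the number of permutations, the fixed random seed, and all function names are assumptions made for illustration.

```python
import random
from statistics import mean, stdev


def welch_t(a, b):
    """Two-sample t statistic that allows unequal variances."""
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def permutation_p(values, labels, n_perm=1000, rng=None):
    """Two-sided p-value for one gene, obtained by shuffling class labels.

    `labels` holds 0 or 1 per sample (for example 0 = ALL, 1 = AML).
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable

    def score(current_labels):
        a = [v for v, lab in zip(values, current_labels) if lab == 0]
        b = [v for v, lab in zip(values, current_labels) if lab == 1]
        return abs(welch_t(a, b))

    observed = score(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if score(shuffled) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_markers(gene_names, pvals, fdr_cutoff=0.05):
    """Keep the genes whose adjusted p-value is at or below the cutoff."""
    fdr = benjamini_hochberg(pvals)
    return [g for g, q in zip(gene_names, fdr) if q <= fdr_cutoff]
```

Note that the cutoff in `select_markers` plays the same role as the custom FDR filter in the viewer: tightening it shortens the gene list that goes forward to GSEA.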

Compare these results with those obtained on the same data using the R/Bioconductor programs.
Consider other datasets: the TNBC dataset from the prototypical script for Lab 1 and the test data on NSCLC. Apply the saved pipeline to these datasets.
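
One simple way to report a comparison between two runs (for example, the GenePattern marker list and the R/Bioconductor marker list for the same data) is to look at the overlap of the selected genes. The Python sketch below is only one possible approach; the function name and the gene identifiers in the usage example are hypothetical placeholders, not results.

```python
def compare_marker_lists(list_a, list_b):
    """Summarize the overlap between two lists of selected gene identifiers."""
    a, b = set(list_a), set(list_b)
    shared = a & b
    union = a | b
    return {
        "shared": sorted(shared),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        # Jaccard index: size of the intersection over size of the union.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


if __name__ == "__main__":
    # Placeholder identifiers, not real results.
    genepattern_markers = ["GENE_A", "GENE_B", "GENE_C"]
    bioconductor_markers = ["GENE_B", "GENE_C", "GENE_D"]
    summary = compare_marker_lists(genepattern_markers, bioconductor_markers)
    print(summary["shared"], summary["jaccard"])
```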

What to Submit?

A report that discusses your explorations with GenePattern for all three datasets.

Submission
You need to submit a Word file that contains the report described above. Please do NOT submit any other files. Please submit using Carmen by midnight of the due date.