BMI7830 – Laboratory Assignment 2
Release date 10/13/2015 – Due date 10/22/2015
Learning goals:
- Get familiar with GenePattern
- Learn to build pipelines
- Learn how to deploy pipelines on various data
- Learn to annotate
Overview:
This lab is very similar to Lab 1 in terms of biological context and findings.
The goals are prosaic: we want you to learn how to assemble GenePattern
pipelines, given its large user community and ecosystem and its distinct
style of bioinformatics.
Steps:
- Use GenePattern (online or installed) to accomplish the following steps:
- Visit the site GenePattern - http://genepattern.broadinstitute.org. You will need to register.
- Read the tutorial on GenePattern - http://www.broadinstitute.org/cancer/software/genepattern/tutorial/gp_tutorial.html
- Read the tutorial on Gene Expression Analysis - http://www.broadinstitute.org/cancer/software/genepattern/desc/expression.
- Click Differential Expression Analysis (http://genepattern.broadinstitute.org/gp/pages/protocols/DiffExp.html)
- Find all relevant modules at http://www.broadinstitute.org/cancer/software/genepattern/modules?taskType=Gene+List+Selection
- Once again we use the results of the paper Golub T.R., Slonim D.K., et al. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science 286, 531-537 (1999).
- ComparativeMarkerSelection analysis is used to find marker genes, that is, the genes in the dataset that are most closely correlated with the two phenotypes (ALL and AML). ALL is acute lymphoblastic leukemia and AML is acute myeloid leukemia; such classes are referred to as phenotypes in the bio-/life sciences literature.
- The pipeline should consist of the steps listed below:
- Choose Open module with example data in the GenePattern window.
- Select parameters. If you look at the pre-processing options, you will note the following.
- PreprocessDataset can preprocess the data in one or more of these ways (in this order):
- Set threshold and ceiling values. Any value lower than the threshold is reset to the threshold, and any value higher than the ceiling is reset to the ceiling.
- Convert each expression value to the log base 2 of the value.
- Remove genes (rows) if a given number of their sample values are less than a given threshold.
- Remove genes (rows) that do not have a minimum fold change or expression variation.
- Discretize or normalize the data.
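The pre-processing operations listed above can be illustrated with a minimal Python sketch. This is not GenePattern's code: the function names, the default-free signatures, and the exact keep/remove rules (for example, keeping a gene when at least `min_count` values reach the threshold, or requiring both a minimum fold change and a minimum absolute difference) are assumptions made for illustration. Check the PreprocessDataset documentation for the module's actual behavior.

```python
import math
import statistics


def clip_row(row, floor, ceiling):
    """Reset values below `floor` to `floor` and values above `ceiling` to `ceiling`."""
    return [min(max(value, floor), ceiling) for value in row]


def log2_row(row):
    """Convert each value to log base 2 (assumes all values are positive)."""
    return [math.log2(value) for value in row]


def enough_values_above(row, threshold, min_count):
    """Assumed rule: keep the gene if at least `min_count` samples reach `threshold`."""
    return sum(value >= threshold for value in row) >= min_count


def varies_enough(row, min_fold, min_delta):
    """Assumed rule: keep the gene if max/min and max-min both reach their minimums.

    Assumes positive values, e.g. a row already clipped to a positive floor.
    """
    low, high = min(row), max(row)
    return high / low >= min_fold and high - low >= min_delta


def zscore_row(row):
    """Normalize one gene to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(row)
    spread = statistics.pstdev(row)
    if spread == 0:
        return [0.0 for _ in row]
    return [(value - mean) / spread for value in row]


if __name__ == "__main__":
    # Hypothetical expression values for one gene across four samples.
    gene = [10, 50, 400, 20000]
    clipped = clip_row(gene, floor=20, ceiling=16000)
    print(clipped)
    print(varies_enough(clipped, min_fold=3, min_delta=100))
    print(zscore_row(log2_row(clipped)))
```

Changing one parameter at a time in a sketch like this (for instance the floor or the minimum fold change) shows how many genes survive filtering, which is the same effect you will see in the heat map.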
- Choose HeatMapVisualizer to visualize the data. We will again discuss the parameters in class. The heat map changes as you change the parameters in the steps above. The idea is to identify the parameters that alter the visualization the most.
- Use ComparativeMarkerSelection to select the genes. Give some thought to the selection of parameters.
- Use ComparativeMarkerSelectionViewer. Use custom filters based on false-discovery-rate (FDR) to filter results.
- Complete Gene Set Enrichment Analysis (GSEA) on the selected genes
- Run the above steps without creating a pipeline (practice run).
- Now create a pipeline called Name_Lab2_CompExpression_Pipeline and store it.
- Compare the results obtained with the same data but with the R/Bioconductor programs.
- Consider other datasets: the TNBC data set from the prototypical script for Lab 1 and the test data on NSCLC. Apply the saved pipeline to these datasets.
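To make the marker-selection and FDR-filtering steps above more concrete, here is a minimal Python sketch. It is an illustration only and not what GenePattern runs internally: the signal-to-noise score follows the definition in Golub et al. (difference of class means divided by the sum of class standard deviations), the Benjamini-Hochberg procedure is used as one common way to compute an FDR, and all function names, the use of the sample standard deviation, and the example numbers are assumptions.

```python
import statistics


def signal_to_noise(class_a, class_b):
    """Score one gene: (mean_a - mean_b) / (sd_a + sd_b).

    Uses the sample standard deviation (an assumption of this sketch).
    Each class needs at least two samples.
    """
    spread = statistics.stdev(class_a) + statistics.stdev(class_b)
    if spread == 0:
        raise ValueError("both classes have zero spread; score is undefined")
    return (statistics.mean(class_a) - statistics.mean(class_b)) / spread


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    count = len(pvalues)
    order = sorted(range(count), key=lambda i: pvalues[i])
    adjusted = [0.0] * count
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is monotone.
    for rank in range(count, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * count / rank)
        adjusted[index] = running_min
    return adjusted


def select_by_fdr(names, pvalues, max_fdr):
    """Keep the gene names whose adjusted p-value is at most `max_fdr`."""
    adjusted = benjamini_hochberg(pvalues)
    return [name for name, q in zip(names, adjusted) if q <= max_fdr]


if __name__ == "__main__":
    # Hypothetical expression values for one gene in two classes.
    print(signal_to_noise([1, 2, 3], [4, 5, 6]))
    # Hypothetical per-gene p-values.
    genes = ["g1", "g2", "g3", "g4"]
    pvals = [0.01, 0.04, 0.03, 0.005]
    print(select_by_fdr(genes, pvals, max_fdr=0.03))
```

When you compare GenePattern with R/Bioconductor, note which test statistic and which FDR method each tool reports, since differences there can explain differences in the selected gene lists.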
What to Submit?
- A report that discusses your explorations with GenePattern for all three datasets.
Submission
You need to submit a Word file which contains the answers to the above questions. Please do NOT submit any other files. Please submit using Carmen by midnight of the due date.