**You must be able to discriminate between different types of
attributes (e.g. continuous vs. discrete; ordinal vs. categorical) and be able
to provide examples for each case.**

Ordinal (e.g. hurricane strength, rankings, ratings); Categorical (e.g. sex (M/F), eye color).

Continuous (e.g. sensor readings such as temperature, pressure, and so on); Discrete (e.g. any ordinal/categorical attribute; one can also capture age in discrete intervals, e.g. childhood, adolescence, adulthood, senior).

**You must be able to explain and understand the tradeoffs
between different types of dataset representations.**

For example, a __distance matrix__ is a lossy representation, while the __data matrix__ is
lossless (one can always recover the distance matrix from the data matrix, but not
the other way around).
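The one-way recoverability above can be seen in a short sketch (the 4x2 data matrix here is hypothetical, and Euclidean distance is assumed):

```python
import numpy as np

# Hypothetical data matrix: rows are instances, columns are attributes.
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0],
              [0.0, 1.0]])

def distance_matrix(X):
    """Derive the (lossy) pairwise Euclidean distance matrix from the data matrix."""
    diff = X[:, None, :] - X[None, :, :]        # shape (n, n, d)
    return np.sqrt((diff ** 2).sum(axis=-1))    # shape (n, n)

D = distance_matrix(X)
```

Going from `X` to `D` is a one-line computation; going from `D` back to `X` is impossible in general (any rotation or translation of the points yields the same `D`).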

Given no knowledge about the problem at hand, the representation
you pick should always be lossless – i.e. a __data matrix__ or a __transactional
representation.__

Given the knowledge that you will repeatedly require
pairwise distances (e.g. for a clustering task), and further given that the dimensionality
of the dataset is high, the __distance matrix__ is a good one to
pick: it has lower representation cost, and distances will not have
to be recomputed each time.

Given the knowledge that you may require distances but also
the raw points, or if you are told that the number of instances/entities/rows is
very high (say a billion entries), then you pick a __data matrix__ for its
lower representation cost: a distance matrix grows quadratically with the number
of rows, so its representation cost would be very high.

Understand the difference between a transactional representation and a data-matrix representation – both are lossless, but one has a denser feel to it (transactional), while the other facilitates column-wise selection operations. Again, context determines which representation you pick.
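A small sketch of the two lossless forms side by side (the market-basket data is hypothetical):

```python
# Transactional representation: each row lists only the items present.
transactions = [
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs", "bread"},
]

# Equivalent data-matrix representation: one binary column per item.
items = sorted({i for t in transactions for i in t})   # fixed column order
data_matrix = [[int(i in t) for i in items] for t in transactions]

# The data matrix makes column-wise selection easy,
# e.g. find all rows (transactions) that contain "milk":
milk_col = items.index("milk")
milk_rows = [r for r, row in enumerate(data_matrix) if row[milk_col]]
```

Both forms carry the same information; the transactional form is compact when rows contain few items, while the matrix form supports per-column operations directly.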

**You must be able to understand different sources of poor Data
Quality and different techniques one can adapt to address Data Quality problems.**

For instance, for missing-value problems, mean imputation is a standard solution, but it has weaknesses (it does not account for the correlation structure in the data). EM (expectation maximization) is a de-facto standard used to address such data quality issues.
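A minimal sketch of mean imputation (the data and the NaN positions are hypothetical):

```python
import numpy as np

# Hypothetical data with a missing entry (NaN) in the second column.
X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0]])

def mean_impute(X):
    """Replace each NaN with its column mean.

    Simple and common, but note the weakness from the text: it ignores
    the correlation structure between columns."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)     # per-column means ignoring NaNs
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

X_filled = mean_impute(X)
```

EM-based imputation would instead alternate between estimating the model parameters (e.g. mean and covariance) and re-filling the missing entries, which does capture cross-column correlation.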

For outlier detection, you must be able to provide examples
where outliers may pose a significant problem. A standard definition of outliers
is due to **Hawkins** (**Hawkins**, 1980), who defines an **outlier**
as an observation that deviates so much from other observations as to arouse
suspicion that it was generated by a different mechanism. How to detect them? Distance-based approaches are simple and
scalable; statistical and computational-geometry-based approaches are better
founded but expensive. We will spend a
lecture on this later in the class, but this is good enough for now.

Why is duplicate data a problem? It can lead to an over-representation of certain entities, which can bias the modeling exercise. How can we detect duplicates? We will talk about hashing strategies for this purpose later in the class.
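The basic idea behind hash-based duplicate detection can be sketched as follows (this handles exact duplicates only; the near-duplicate hashing strategies mentioned above are covered later, and the rows here are hypothetical):

```python
def find_duplicates(rows):
    """Flag exact duplicate rows by hashing each row into a dict.

    Each row is turned into a hashable tuple key; seeing a key twice
    means the row is a duplicate of an earlier one. One pass, O(n) time."""
    seen = {}       # row key -> index of first occurrence
    dupes = []      # (first_index, duplicate_index) pairs
    for i, row in enumerate(rows):
        key = tuple(row)
        if key in seen:
            dupes.append((seen[key], i))
        else:
            seen[key] = i
    return dupes

rows = [[1, 2], [3, 4], [1, 2]]
pairs = find_duplicates(rows)
```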

**You must be able to explain why normalization is important
and you must be able to deploy/demonstrate the ability to normalize data using
one of the three normalization strategies described in the slides.**

When does normalization play a role? Typically in data-fusion and data-preprocessing exercises. Why is it important? If you want to combine data sources, you need to make sure the playing field is level. You need to be able to perform min-max normalization, z-score normalization, and decimal scaling (the last is rarely used, but you should certainly be able to demonstrate how to do it).
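The three strategies on a single hypothetical attribute (the values are made up for illustration):

```python
import numpy as np

x = np.array([120.0, 250.0, 380.0, 500.0])   # hypothetical attribute values

# 1. Min-max normalization: linearly rescale to [0, 1].
minmax = (x - x.min()) / (x.max() - x.min())

# 2. Z-score normalization: zero mean, unit standard deviation.
zscore = (x - x.mean()) / x.std()

# 3. Decimal scaling: divide by 10^j, where j is the smallest integer
#    such that every scaled magnitude is below 1.
j = int(np.ceil(np.log10(np.abs(x).max() + 1)))
decimal = x / (10 ** j)
```

Min-max is sensitive to outliers (a single extreme value compresses everything else), which is one reason z-score normalization is often preferred when combining sources.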

**You must be able to explain the roles of aggregation,
sampling, dimensionality reduction and discretization. What do these strategies
have in common? How do they differ? What are the challenges in each one?
Demonstrate use and deployment.**

They are all strategies that focus on reducing the representation cost of the dataset; each reduces it in a different way.

Aggregation/Summary Statistics – Tries to group related data and represent each group with a single summary. Commonly used in data warehouses, for instance during roll-up operations. Useful for data understanding and for analysis at different scales (e.g. region-based vs. county-based vs. city-based), and can lead to more stable models.
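A roll-up from city-level to county-level records, sketched on hypothetical sales data:

```python
# Hypothetical city-level sales records: (county, city, amount).
sales = [
    ("CountyA", "City1", 100),
    ("CountyA", "City2", 150),
    ("CountyB", "City3", 200),
]

# Roll up to county level: one summary value per county instead of one row
# per city. Three rows become two -- the representation cost shrinks.
county_totals = {}
for county, _city, amount in sales:
    county_totals[county] = county_totals.get(county, 0) + amount
```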

Sampling – Reduces the number of rows one has to process. The key is to accurately model the measure of interest in the sample. Caveats to think about: the cost of generating the sample; how well the measure is represented in the sample vs. the population; how to sample unbalanced data. There are different sampling methodologies – random, stratified, and so on.
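Random vs. stratified sampling on a hypothetical unbalanced dataset (90 instances of class "a", 10 of class "b"); stratified sampling guarantees the rare class appears in the sample:

```python
import random

# Hypothetical unbalanced labeled data: (label, value) pairs.
data = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]

random.seed(0)
simple = random.sample(data, 10)   # simple random sample: may miss class "b"

def stratified_sample(data, frac):
    """Sample the same fraction within each class (stratum),
    keeping at least one instance per stratum."""
    strata = {}
    for label, x in data:
        strata.setdefault(label, []).append((label, x))
    out = []
    for label, rows in strata.items():
        k = max(1, round(frac * len(rows)))
        out.extend(random.sample(rows, k))
    return out

strat = stratified_sample(data, 0.1)
```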

Dimensionality Reduction/Feature Selection – The dual of sampling in some sense, in that it focuses on reducing the columns/dimensionality of the dataset. Standard approaches include PCA. What are the key assumptions of PCA? (ans: the underlying Gaussian assumption.) What are the strengths and weaknesses of PCA? (ans: Strengths – a simple, well-founded methodology; reasonably fast to compute; intuition based on retaining as much of the variance in the system as possible. Weaknesses – interpretability of the new dimensions; the Gaussian assumption.)
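A minimal PCA sketch via the eigendecomposition of the covariance matrix (the 2-D data is synthetic; real use would typically call a library implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated 2-D data (200 rows).
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])

# PCA: center the data, eigendecompose the covariance matrix,
# and project onto the top-k directions of maximal variance.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)        # returned in ascending order
order = np.argsort(eigvals)[::-1]             # sort descending by variance
components = eigvecs[:, order]
explained = eigvals[order] / eigvals.sum()    # fraction of variance retained

Z = Xc @ components[:, :1]                    # reduce 2 columns to 1
```

The strengths and weaknesses from the text show up directly: `explained` quantifies the retained variance, while the columns of `components` are linear mixtures of the original attributes and hence hard to interpret.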

Discretization – Focuses on reducing the representation cost of the dataset by quantizing each continuous attribute into discrete intervals. If 10 intervals suffice to discretize an attribute, you need only 4 bits to represent each value, as opposed to 32 bits for a standard float. See the slides for the different types of methods. Given a dataset, students will need to demonstrate how to discretize it using equal-width, equal-frequency, or entropy-based methods. We will work additional examples in the next lecture.
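Equal-width vs. equal-frequency discretization on a small hypothetical attribute (entropy-based discretization needs class labels and is left to the worked examples):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0])  # hypothetical values
k = 4                                                       # number of intervals

# Equal-width: split [min, max] into k intervals of equal length.
width_edges = np.linspace(x.min(), x.max(), k + 1)
equal_width = np.clip(np.digitize(x, width_edges[1:-1]), 0, k - 1)

# Equal-frequency: place cut points at quantiles so each bin
# receives (roughly) the same number of points.
freq_edges = np.quantile(x, np.linspace(0, 1, k + 1))
equal_freq = np.clip(np.digitize(x, freq_edges[1:-1]), 0, k - 1)
```

On this skewed attribute, equal-width crowds the small values into one bin, while equal-frequency balances the bins, which illustrates why the choice of method matters.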

**We revisited distance/similarity/density-based metrics and
introduced the notion of the Mahalanobis metric. You must
demonstrate the capability to solve/compute these metrics on datasets. See the
slides for examples worked in class and for intuition.**
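The Mahalanobis distance of a point $x$ to the mean $\mu$ is $\sqrt{(x-\mu)^\top S^{-1} (x-\mu)}$, where $S$ is the sample covariance matrix; a minimal computation on a hypothetical 2-D sample:

```python
import numpy as np

# Hypothetical 2-D sample.
X = np.array([[2.0, 2.0], [2.0, 5.0], [6.0, 5.0], [7.0, 3.0], [4.0, 7.0],
              [6.0, 4.0], [5.0, 3.0], [4.0, 6.0], [2.0, 5.0], [1.0, 3.0]])

mu = X.mean(axis=0)
S = np.cov(X, rowvar=False)       # sample covariance matrix
S_inv = np.linalg.inv(S)

def mahalanobis(x, mu, S_inv):
    """Distance of point x from mean mu, scaled by the inverse covariance,
    so that directions of high variance count for less."""
    d = x - mu
    return float(np.sqrt(d @ S_inv @ d))

d0 = mahalanobis(X[0], mu, S_inv)
```

Unlike plain Euclidean distance, this metric accounts for the spread and correlation of the data, which is why it pairs naturally with the outlier-detection discussion above.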