You must be able to discriminate between different types of attributes (e.g. continuous vs. discrete; ordinal vs. categorical) and be able to provide examples for each case.

Ordinal (e.g. hurricane strength, rankings, ratings); Categorical (e.g. sex (M/F), eye color).

Continuous (e.g. sensor readings such as temperature, pressure, and so on); Discrete (e.g. any ordinal/categorical attribute; Age can also sometimes be captured in discrete intervals – e.g. childhood, adolescence, adulthood, senior).

You must be able to explain and understand the tradeoffs between different types of dataset representations.

For example, a distance matrix is a lossy representation, while the data matrix is lossless (one can always recover the distance matrix from the data matrix, but not the other way around).
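As a sketch of this asymmetry (the three toy points below are made up), the distance matrix is a one-way computation from the data matrix – the original coordinates cannot be recovered from it:

```python
import math

# Hypothetical 3-instance, 2-feature data matrix (rows = instances).
X = [[0.0, 0.0],
     [3.0, 4.0],
     [6.0, 8.0]]

def distance_matrix(X):
    """Compute the full pairwise Euclidean distance matrix from a data matrix."""
    n = len(X)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
            D[i][j] = D[j][i] = d
    return D

D = distance_matrix(X)
print(D[0][1])  # 5.0 -- but X cannot be reconstructed from D alone
```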

Given no knowledge about the problem at hand, you should always pick a lossless representation – i.e. a data matrix or a transactional representation.

Given the knowledge that you will repeatedly require pairwise distances (e.g. for a clustering task), and further that the dimensionality of the dataset is high, the distance matrix is a good representation to pick: it has a lower representation cost, and distances will not have to be recomputed each time.

Given the knowledge that you may require distances but also the raw points, or if you are told that the number of instances/entities/rows is very high (say a billion entries), then you pick a data matrix for its lower representation cost; a distance matrix, which grows quadratically with the number of instances, would require a very high representation cost.

Understand the difference between a transactional representation and a data-matrix representation. Both are lossless, but the transactional form is more compact (it stores only the items present in each transaction), while the data matrix may facilitate column-wise selection operations. Again, context defines which representation you pick.
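A small sketch of the same dataset in both lossless forms (the market-basket items below are made up) shows the tradeoff – the transactional form lists only what is present, while the matrix form makes column selection direct:

```python
# Transactional: each row lists only the items present -- compact when sparse.
transactions = [{"milk", "bread"}, {"bread"}, {"milk", "eggs"}]

# Data matrix: one binary column per item -- easy column-wise selection.
items = sorted(set().union(*transactions))          # ['bread', 'eggs', 'milk']
matrix = [[int(it in t) for it in items] for t in transactions]

# Column-wise selection is direct on the matrix...
milk_col = [row[items.index("milk")] for row in matrix]
# ...and reconstructing the transactions shows both forms are lossless.
recovered = [{it for it, flag in zip(items, row) if flag} for row in matrix]
print(milk_col, recovered == transactions)  # [1, 0, 1] True
```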

You must be able to understand different sources of poor Data Quality and the different techniques one can adopt to address Data Quality problems.

For instance, for missing-value problems, mean imputation is a standard solution, but it has weaknesses (it does not account for the correlation structure in the data). EM (expectation maximization), coupled with a model of the data, is a de-facto standard used to address such data quality issues.
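A minimal sketch of mean imputation (the two-column dataset is made up; `None` marks missing entries) – note how each column is filled independently, which is exactly the correlation-blindness noted above:

```python
# Mean imputation: fill each missing entry with its column's mean.
# This ignores correlations between columns -- the weakness noted above.
data = [[1.0, 2.0],
        [None, 4.0],
        [3.0, None]]

def mean_impute(rows):
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        vals = [r[j] for r in rows if r[j] is not None]
        means.append(sum(vals) / len(vals))
    return [[r[j] if r[j] is not None else means[j] for j in range(n_cols)]
            for r in rows]

filled = mean_impute(data)
print(filled)  # [[1.0, 2.0], [2.0, 4.0], [3.0, 3.0]]
```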

For outlier detection, you must be able to provide examples where outliers may pose a significant problem. A standard definition of outliers is given by Hawkins (Hawkins, 1980), who defines an outlier as an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism. How to detect them? Distance-based approaches are simple and scalable. Statistical and computational-geometry-based approaches are better founded but expensive. We will spend a lecture on this later in the class, but this is good enough for now.
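One simple distance-based scheme is to score each point by the distance to its k-th nearest neighbour; large scores suggest outliers. A sketch on a made-up dataset (the choice of k = 2 is arbitrary here):

```python
import math

# Four clustered points and one obvious outlier (toy data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]

def knn_score(points, k):
    """Score each point by the distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

scores = knn_score(points, k=2)
outlier = max(range(len(points)), key=lambda i: scores[i])
print(points[outlier])  # (10, 10)
```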

Why is duplicate data a problem? It can lead to over-representation of certain entities, which in turn biases the modeling exercise. How can we detect duplicates? We will talk about hashing strategies for this purpose later in the class.
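For exact duplicates, a single hash-based pass already suffices (the records below are made up); near-duplicates need the fuzzier hashing schemes covered later in the class:

```python
# Exact-duplicate detection via hashing: one pass, O(n) expected time.
# (Near-duplicate detection needs schemes like minhashing, covered later.)
records = [("alice", 30), ("bob", 25), ("alice", 30)]

seen = set()
duplicates = []
for rec in records:
    if rec in seen:          # tuples hash directly
        duplicates.append(rec)
    else:
        seen.add(rec)
print(duplicates)  # [('alice', 30)]
```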

You must be able to explain why normalization is important and you must be able to deploy/demonstrate the ability to normalize data using one of the three normalization strategies described in the slides.

When does normalization play a role? Typically in data-fusion and data-preprocessing exercises. Why is it important? If you want to combine data sources, you need to make sure the playing field is level. You need to be able to perform min-max normalization, z-score normalization, and decimal scaling (the last is rarely used, but you should certainly be able to demonstrate how to do it).
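All three strategies on one toy attribute (the values are made up; formulas follow the standard definitions):

```python
import math

vals = [200.0, 300.0, 400.0, 600.0, 1000.0]

# 1. Min-max normalization: rescale to [0, 1].
lo, hi = min(vals), max(vals)
minmax = [(v - lo) / (hi - lo) for v in vals]

# 2. Z-score normalization: subtract the mean, divide by the std deviation.
mu = sum(vals) / len(vals)
sigma = math.sqrt(sum((v - mu) ** 2 for v in vals) / len(vals))
zscore = [(v - mu) / sigma for v in vals]

# 3. Decimal scaling: divide by 10^j, where j is the smallest integer
#    such that every scaled magnitude is below 1.
j = len(str(int(max(abs(v) for v in vals))))
decimal = [v / (10 ** j) for v in vals]

print(minmax)   # [0.0, 0.125, 0.25, 0.5, 1.0]
print(decimal)  # [0.02, 0.03, 0.04, 0.06, 0.1]
```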

You must be able to explain the roles of aggregation, sampling, dimensionality reduction, and discretization. What do these strategies have in common? How do they differ? What are the challenges in each one? Demonstrate use and deployment.

They are all strategies that focus on reducing the representation cost of the dataset; each reduces it in a different way.

Aggregation/Summary Statistics – Tries to group related data and represent it using a single summary. Commonly used in data warehouses, for instance during roll-up operations. Useful for data understanding and for analysis at different scales (e.g. region-based vs. county-based vs. city-based), and can lead to stable models.
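A roll-up in miniature (the region/city sales figures are made up) – many city rows collapse into one summary per region:

```python
# Aggregation sketch: roll city-level sales up to the region level.
sales = [("east", "nyc", 10), ("east", "boston", 5), ("west", "sf", 8)]

rollup = {}
for region, city, amount in sales:
    rollup[region] = rollup.get(region, 0) + amount

print(rollup)  # {'east': 15, 'west': 8}
```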

Sampling – Reduces the number of rows one has to process. The key is to accurately capture the measure of interest in the sample. Caveats to think about: the cost of generating the sample, and how well the measure is represented in the sample vs. the population. How do we sample unbalanced data? Different types of sampling methodologies – random vs. stratified and so on.
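A sketch of random vs. stratified sampling on a made-up unbalanced two-class population (95/5 split; the 10% sampling fraction is arbitrary) – stratification guarantees the minority class appears in the sample, which a simple random sample may miss:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
population = ([("majority", i) for i in range(95)] +
              [("minority", i) for i in range(5)])

# Simple random sample: the minority class may be missed entirely.
simple = random.sample(population, 10)

# Stratified sample: preserve each class's share explicitly.
def stratified_sample(pop, frac):
    by_class = {}
    for label, x in pop:
        by_class.setdefault(label, []).append((label, x))
    out = []
    for label, items in by_class.items():
        k = max(1, round(frac * len(items)))  # at least one per stratum
        out.extend(random.sample(items, k))
    return out

strat = stratified_sample(population, 0.1)
labels = {label for label, _ in strat}
print(sorted(labels))  # both classes present: ['majority', 'minority']
```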

Dimensionality Reduction/Feature Selection – The dual of sampling in some sense, in that it focuses on reducing the columns/dimensionality of the dataset. Standard approaches include the use of PCA. What are the key assumptions of PCA? (Ans: an underlying Gaussian assumption.) What are the strengths and weaknesses of PCA? (Ans: Strengths – a simple, well-founded methodology; reasonably fast methods to compute; intuition based on retaining as much of the variance in the system as possible. Weaknesses – interpretability of the new dimensions; the Gaussian assumption.)
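A PCA sketch via eigendecomposition of the covariance matrix (the correlated 2-D data is synthetic, and the 0.1 noise level is arbitrary) – projecting onto the top eigenvector keeps nearly all the variance here:

```python
import numpy as np

# Synthetic 2-D points lying close to the line y = x.
rng = np.random.default_rng(0)
t = rng.normal(size=(100, 1))
X = np.hstack([t, t + 0.1 * rng.normal(size=(100, 1))])

Xc = X - X.mean(axis=0)                 # center the data
cov = (Xc.T @ Xc) / (len(Xc) - 1)       # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

# Keep the top component; variance retained is its eigenvalue's share.
top = eigvecs[:, -1]
retained = float(eigvals[-1] / eigvals.sum())
Z = Xc @ top                            # 1-D projection of each point
print(round(retained, 3))               # most of the variance survives
```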

Discretization – Focuses on reducing the representation cost of the dataset by quantizing each continuous attribute into discrete intervals. If 10 intervals are required to discretize an attribute, you need only 4 bits to represent it, as opposed to 32 bits for a standard float. See slides for different types of methods. Given a dataset, students will need to demonstrate how to discretize it using equal-width, equal-frequency, or entropy-based methods. We will work additional examples in the next lecture.
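Equal-width and equal-frequency binning in miniature (the attribute values are made up, with a deliberate outlier; k = 3 bins is arbitrary) – note how the outlier squeezes almost everything into one equal-width bin, while equal-frequency bins stay balanced:

```python
vals = [1, 2, 3, 4, 5, 6, 7, 8, 100]

def equal_width_bins(vals, k):
    """Assign each value a bin index 0..k-1 using k equal-width intervals."""
    lo, hi = min(vals), max(vals)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in vals]

def equal_freq_bins(vals, k):
    """Assign bin indices so each bin holds roughly the same number of values."""
    order = sorted(range(len(vals)), key=lambda i: vals[i])
    bins = [0] * len(vals)
    per_bin = len(vals) / k
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), k - 1)
    return bins

print(equal_width_bins(vals, 3))  # [0, 0, 0, 0, 0, 0, 0, 0, 2]
print(equal_freq_bins(vals, 3))   # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```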

We revisited distance/similarity/density-based metrics and introduced the notion of the Mahalanobis metric. You must demonstrate the capability to solve/compute these metrics on datasets. See slides for examples worked in class and the intuition behind them.
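As a computation sketch (the five correlated 2-D points are made up), the Mahalanobis distance d(x) = sqrt((x − μ)ᵀ S⁻¹ (x − μ)) uses the sample covariance S to account for correlation – a point lying along the data's trend scores far lower than one off it, even at similar Euclidean distance from the mean:

```python
import numpy as np

# Toy dataset whose two features are strongly (but not perfectly) correlated.
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 5.0], [4.0, 6.0], [5.0, 8.0]])

mu = X.mean(axis=0)
S = np.cov(X, rowvar=False)   # 2x2 sample covariance
S_inv = np.linalg.inv(S)

def mahalanobis(x):
    d = x - mu
    return float(np.sqrt(d @ S_inv @ d))

# Two probes at similar Euclidean distance from the mean: the one along
# the correlation direction scores much lower.
on_trend = mahalanobis(np.array([5.0, 8.0]))
off_trend = mahalanobis(np.array([5.0, 2.0]))
print(on_trend < off_trend)  # True
```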