Data Mining Research at The Ohio State University

Areas of Research: Data Mining, Parallel and Distributed Computing, and Systems

Broad Categorization of Projects:

Novel Data Mining and Data Preprocessing Algorithms:

Structure Mining: The development of novel algorithms for frequent pattern mining, particularly for structured data (e.g. graphs, 3D structures, and XML data).
Anomaly Detection: The design and development of novel anomaly detection algorithms.
Data Preprocessing: The development of novel data preprocessing strategies that address issues such as how to use sampling effectively in data mining, how to handle missing data, and how to discretize continuous attributes effectively.
We plan to design and evaluate a framework for mining cause-and-effect patterns, drawing inspiration from work in temporal logic. If successful, this framework could have a significant impact in bioinformatics, security, and other areas.
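
As an illustration of the discretization problem mentioned under Data Preprocessing, the following minimal sketch shows equal-frequency binning, one common baseline for turning a continuous attribute into a small number of ordinal bins. It is not the group's own method; the function names (equal_frequency_cut_points, discretize) and the sample values are hypothetical and chosen only for this example.

```python
from bisect import bisect_right


def equal_frequency_cut_points(values, n_bins):
    """Return up to n_bins - 1 cut points that split the values into
    bins holding roughly the same number of observations.

    Illustrative baseline only (hypothetical helper, not from the source).
    """
    if n_bins < 2:
        raise ValueError("n_bins must be at least 2")
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("values must not be empty")
    # Take the value found at each k/n_bins fraction of the sorted data.
    cuts = {ordered[(k * n) // n_bins] for k in range(1, n_bins)}
    # Duplicate cut points (from heavily tied data) collapse into one.
    return sorted(cuts)


def discretize(values, cut_points):
    """Map each value to the index of the bin it falls into.

    A value equal to a cut point is placed in the bin to its right.
    """
    return [bisect_right(cut_points, v) for v in values]


if __name__ == "__main__":
    ages = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up sample attribute
    cuts = equal_frequency_cut_points(ages, 3)
    print(cuts)
    print(discretize(ages, cuts))
```

Equal-frequency binning is only one option; equal-width binning or supervised, class-aware methods may suit a given mining task better.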

Bioinformatics:

Structure analysis, in the context of problems such as protein structure analysis and drug motif discovery.
Shape modeling and mining, in the context of eye disease detection.
Modeling temporal, mixed-attribute datasets, in the context of identifying hepatotoxicity patterns as a function of drug intake.
We plan to apply graph mining techniques to protein-protein interaction graphs and to evaluate probabilistic and deterministic models for rational design problems such as protein crystallization.

Scientific Data Analysis:

Discovery and Visual Exploration of Scientific Data: Developing novel techniques that will enable us to visualize and mine data produced by molecular dynamics simulations. We have examined the use of spatial frequent structure analysis techniques, feature mining and classification algorithms in this context.
We plan to extend this work to track evolving patterns. We also plan to develop a similar set of techniques (as done for molecular dynamics) for mining Computational Fluid Dynamics datasets.

Scalable Data Mining Algorithms:

Development of parallel, distributed and incremental algorithms for various mining tasks including several of the ones listed above.
Development of high-performance mining algorithms on modern processors.
We plan to extend this work to new domains and to evaluate the use of such techniques in the context of adaptive mining of data streams. This work has applications in areas such as intrusion detection and mining large-scale simulation data in-vivo.
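
To make the data stream setting concrete, the sketch below shows reservoir sampling (Algorithm R), a standard way to keep a fixed-size uniform random sample of a stream whose length is not known in advance. It is offered only as a generic illustration of sampling over streams, not as one of the algorithms developed in these projects; the function name reservoir_sample and its parameters are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from an iterable of
    unknown length, using a single pass and O(k) memory (Algorithm R).

    Hypothetical helper for illustration only.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot uniformly from 0..i (inclusive); the new item
            # replaces an existing one with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # A generator stands in for an unbounded event stream.
    events = (f"event-{n}" for n in range(10_000))
    print(reservoir_sample(events, 5, rng=random.Random(0)))
```

Because the sample is maintained incrementally, a mining algorithm can be rerun on the reservoir at any point while the stream continues to arrive.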

Parallel and Distributed Systems:

Identify systems solutions for next-generation data analysis centers that run adaptive parallel mining algorithms operating on dynamic data. Specific issues that we plan to examine include storage services, particularly support for disk-based sampling and data placement strategies; caching services that target re-use of data and previously mined information; and scheduling services that target job admission and scheduling on a tightly coupled parallel cluster.
We plan to validate the proposed framework using the applications listed above.

Funding Acknowledgements:

Our research is currently supported by several grants from the NSF (CAREER, ITR/NGS, NGS, ACR/Software), a grant from the DOE (ECPI), and a grant from Pfizer Inc. Other sources of support include an NIH center-scale grant and a research infrastructure grant from the NSF. Past research has been supported by the NSF (DBI) and Ameritech/SBC.