This page gives an overview of, and links to, recent research papers that describe some of my lab's research. The commentary for some papers includes links to follow-on work so that the reader can see the trajectories of the different research lines.
My current research in automatic speech recognition is informed both by my graduate work in pronunciation modeling and by the discriminative modeling techniques for language-related tasks that I worked on while at Bell Labs. The overall goal of my lab's research is to find meaningful ways to integrate acoustic, phonetic, lexical, and other linguistic insights into the speech recognition process through a combination of statistical modeling and data/error analysis.
My goal is to train students to be flexible, independent thinkers who can apply statistical techniques to a range of language-related problems. While the papers below primarily describe speech recognition research as a coherent focus, my students, colleagues, and I have also been engaged in other research activities in natural language processing and spoken dialogue systems. Papers in these areas can be found in my online publication list; the reader may also wish to consult the chronological listing of papers.
Since I also get quite a few requests for information on joining the lab, I include a section on this topic.
The Speech and Language Technologies Laboratory is a group of dynamic researchers who are interested in mixing aspects of machine learning with speech and language processing.
If you are not an OSU student, but want to apply: see my note on the application process to OSU.
If you are a current OSU student: see the "once you are at OSU" section of my note.
Over the last few years, my lab has engaged in a series of studies to build automatic speech recognition systems using direct discriminative models that can combine correlated evidence of linguistic events. This work is the latest step in this line of research: it provides a discriminative framework for modeling longer trajectories in speech through segmental models. The innovation in this particular paper is the first one-pass discriminative segmental model for word recognition (building on our previous work in phone recognition). We show that the monophone-based model improves recognition over discriminatively trained monophone-based HMM and frame-based CRF models on the Wall Street Journal read-speech task, and starts to approach triphone-based performance. It thus serves as a good intermediate point in building systems that can start to compete with state-of-the-art systems; a toy sketch of the segmental decoding idea appears below.
Recognized as a Spotlight Poster at ASRU 2011 (voted as a top 3 poster in its session by the attendees).
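To make the segmental-model idea concrete, here is a minimal sketch of semi-Markov (segmental) Viterbi decoding, in which whole variable-length segments are scored by a linear model. This is a toy illustration, not the paper's actual system: the labels, the feature function, the maximum duration, and the weights are all hypothetical stand-ins.

```python
import numpy as np

LABELS = ["a", "b"]   # hypothetical segment labels (e.g., monophones)
MAX_DUR = 4           # longest segment considered, in frames

def segment_features(frames, start, end, label):
    """Hypothetical segment-level features: mean detector activation
    for this label over the span, plus the segment duration."""
    span = frames[start:end]                       # (duration, n_labels)
    mean_act = span[:, LABELS.index(label)].mean()
    return np.array([mean_act, float(end - start)])

def segmental_viterbi(frames, weights):
    """Best segmentation/labeling under a linear score on whole segments."""
    T = len(frames)
    best = np.full(T + 1, -np.inf)   # best score over segmentations of frames[:t]
    best[0] = 0.0
    back = [None] * (T + 1)          # (segment start, label) backpointers
    for t in range(1, T + 1):
        for dur in range(1, min(MAX_DUR, t) + 1):
            s = t - dur
            for lab in LABELS:
                score = best[s] + weights @ segment_features(frames, s, t, lab)
                if score > best[t]:
                    best[t], back[t] = score, (s, lab)
    segs, t = [], T                  # trace back the winning segments
    while t > 0:
        s, lab = back[t]
        segs.append((s, t, lab))
        t = s
    return list(reversed(segs))

# Toy run: six frames of per-label "detector" scores.
frames = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
                   [0.2, 0.8], [0.1, 0.9], [0.2, 0.8]])
print(segmental_viterbi(frames, np.array([1.0, 0.1])))
```

The key contrast with frame-based decoding is that the feature function sees an entire candidate segment at once, so segment-level properties such as duration can enter the score directly.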
Segmental modeling can be thought of as a type of linguistic structural modeling (integrating linguistic structure over time). Another linguistically inspired modeling approach that we have experimented with, in conjunction with partners at the Toyota Technological Institute at Chicago, explicitly models articulator trajectories over time through a factored model -- unlike phone-based systems, this paradigm allows models of asynchrony that can account for different types of pronunciation variation commonly seen in continuous speech. In this paper, we use factorized conditional random fields to learn patterns of asynchrony, which can then be used to produce articulatory feature transcriptions that are expensive to obtain manually. Our experiments show that the transcriptions can better account for pronunciation variations observed by linguists in the Switchboard corpus. In subsequent papers, we were able to use this framework for acoustic-based keyword spotting, showing improvement over an HMM-based baseline. A toy illustration of the bounded-asynchrony idea appears below.
Best Student Paper Award, Interspeech 2012
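The asynchrony idea can be pictured with a toy example: each articulatory stream advances through its own sequence of targets, and a joint state is allowed only when the streams are within a bounded number of positions of each other. The streams and the bound below are hypothetical illustrations, not the paper's actual factor structure.

```python
from itertools import product

# Hypothetical articulatory target sequences for one word.
lips   = ["closed", "open", "rounded"]
tongue = ["high", "mid", "low"]
MAX_ASYNC = 1   # streams may drift apart by at most one position

# Enumerate the joint states a factored model would permit.
joint_states = [(i, j)
                for i, j in product(range(len(lips)), range(len(tongue)))
                if abs(i - j) <= MAX_ASYNC]
for i, j in joint_states:
    print(f"lips={lips[i]:8s} tongue={tongue[j]}")
```

A fully synchronous (phone-like) model would allow only the diagonal states (i == j); relaxing that constraint is what lets the model represent pronunciation variants where one articulator leads or lags another.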
This paper takes a slightly different approach to articulatory modeling than the Prabhavalkar work described above: starting from a previous dynamic Bayesian network (DBN) approach, it efficiently derives, and discriminatively trains, a weighted finite-state transducer (WFST) representation of the articulatory feature-based model of pronunciation. We use the conditional independence assumptions imposed by the DBN to efficiently convert it into a sequence of WFSTs (factor FSTs) which, when composed, yield the same model as the DBN. We then introduce a linear model of the arc weights of the factor FSTs and discriminatively learn its weights using the averaged perceptron algorithm. We demonstrate the approach on a lexical access task in which we recognize a word given its surface realization. This work subsequently led to discriminative training approaches for factorized WFSTs that can be used even in standard WFST-based ASR systems.
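As a rough illustration of the training procedure, here is a generic averaged perceptron in Python. The decoding and feature functions are hypothetical stubs standing in for inference over the factor FSTs, and the toy data is invented; only the update-and-average structure reflects the algorithm named above.

```python
import numpy as np

def averaged_perceptron(examples, feats, decode, n_feats, epochs=5):
    """examples: (input, gold_output) pairs; feats(x, y) -> feature vector;
    decode(x, w) -> highest-scoring output under weights w."""
    w, w_sum, n = np.zeros(n_feats), np.zeros(n_feats), 0
    for _ in range(epochs):
        for x, y_gold in examples:
            y_hat = decode(x, w)
            if y_hat != y_gold:
                # Reward the gold structure's features, penalize the prediction's.
                w += feats(x, y_gold) - feats(x, y_hat)
            w_sum += w
            n += 1
    return w_sum / n   # averaging the weight vectors improves generalization

# Toy stand-ins: "decoding" just picks the better of two labels.
def feats(x, y):
    return np.array(x) if y == 1 else -np.array(x)

def decode(x, w):
    return 1 if w @ np.array(x) >= 0 else 0

data = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
print(averaged_perceptron(data, feats, decode, n_feats=2))
```

In the WFST setting, decode would be a shortest-path computation over the composed factor FSTs, with the arc weights given by the current linear model.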
One line of research we have followed is to use some of the discriminative techniques we have developed in speech recognition in concert with speech separation techniques inspired by (and often developed in collaboration with) my colleague DeLiang Wang. The paper highlighted here was an outgrowth of this work, in which my student Billy Hartmann and I asked whether it was possible to apply speech separation directly to noisy speech data, masking out noise without any reconstruction of the masked components before ASR. Previously it was assumed that zero-energy "holes" would cause problems for spectrally masked speech that was not reconstructed, or whose missing components were not marginalized out in the probability estimation; the baseline for these latter techniques was usually just recognition of the unmodified (noisy) speech. In this paper we show that one can use masked speech data directly in recognition, and argue that this should be the "simple" baseline against which other techniques are measured.
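A minimal sketch of the idea follows, with a made-up spectrogram and a simple threshold standing in for a learned mask estimator: masked time-frequency cells are floored rather than reconstructed, and the result is passed straight to the recognizer's feature extraction.

```python
import numpy as np

def apply_mask_directly(noisy_spec, mask, floor=1e-3):
    """Keep speech-dominant cells; floor the rest instead of reconstructing."""
    return np.where(mask, noisy_spec, floor)

# Toy 4-frame, 3-channel magnitude spectrogram.
noisy = np.array([[1.0, 0.2, 0.9],
                  [0.8, 0.3, 1.1],
                  [0.2, 1.0, 0.1],
                  [0.3, 0.9, 0.2]])
mask = noisy > 0.5                        # stand-in for an estimated binary mask
print(apply_mask_directly(noisy, mask))   # feed directly to the ASR front end
```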
My group has also been active in NLP research, particularly in the domain of electronic health records (EHRs), in collaboration with Albert Lai in Biomedical Informatics. This paper describes the culmination of several pieces of work, in which we extract medical events from multiple clinical notes in an EHR, develop a timeline for each note, and then align the events across notes to create an overall summary timeline of the medical history.
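As a toy illustration of the alignment step (with invented events and a naive string-normalization heuristic, not the paper's actual method), events that normalize to the same name and date can be merged across notes into one summary timeline:

```python
from collections import defaultdict

# Hypothetical per-note timelines of (date, event) pairs.
note1 = [("2010-03-01", "chest x-ray"), ("2010-03-04", "started warfarin")]
note2 = [("2010-03-01", "Chest X-Ray"), ("2010-03-10", "follow-up visit")]

merged = defaultdict(set)
for note_id, note in enumerate([note1, note2]):
    for date, event in note:
        merged[(date, event.lower())].add(note_id)   # naive normalization

for (date, event), sources in sorted(merged.items()):
    print(date, event, "-- from notes", sorted(sources))
```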
I have also been active in developing review articles to help explain several current topics to wider audiences. This invited paper gives a broad overview of Conditional Random Fields and their use in various processing tasks.
This paper details a model that can selectively pay attention to some phonological information and ignore other information, using a discriminative model known as the conditional random field (CRF). While CRFs had been used in a few studies prior to this work, the contribution of this paper was to examine their utility as feature combiners, combining posterior estimates of phone classes and phonological feature classes to improve TIMIT phone recognition. We have continued this line of research since this paper, moving towards the first CRF-based word recognition experiments.
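The "feature combiner" idea can be sketched as follows: per-frame posterior estimates from separate phone and phonological-feature detectors (random stand-ins here) are concatenated into one observation vector per frame for a linear-chain CRF. The CRF itself is omitted, and the dimensions are assumed TIMIT-like sizes, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_phones, n_phono = 5, 48, 44   # assumed TIMIT-like sizes

# Stand-ins for posteriors from two separately trained detectors.
phone_post = rng.dirichlet(np.ones(n_phones), size=n_frames)
phono_post = rng.dirichlet(np.ones(n_phono), size=n_frames)

# One combined observation vector per frame for a linear-chain CRF.
crf_inputs = np.concatenate([phone_post, phono_post], axis=1)
print(crf_inputs.shape)   # (5, 92): both evidence streams per frame
```

Because the CRF is trained discriminatively, it can learn to weight the two correlated evidence streams rather than treating them as independent observations.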
This paper provides insights into the lexical-phonetic aspects of underspecification: we show that creating pronunciation dictionaries for HMM-based ASR systems that "back off" to the manner of articulation in some unstressed syllables yields a speech recognizer with performance comparable to (or, in one instance, better than) a system with a standard pronunciation dictionary, particularly for speech masked by additive noise. This provides evidence for a hypothesis proposed by Briscoe in 1989, which claimed that manner of articulation for unstressed syllables could be enough for lexical access. However, in more recent studies we have found that this claim also applies to place of articulation (contrary to Briscoe's hypothesis). This line of work continues to look for ways to underspecify pronunciation lexica to account for the phonetic variation in spontaneous speech.
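A toy sketch of the backoff, with a simplified phone set and manner map (both assumptions for illustration, not the actual dictionary): phones in unstressed syllables are replaced by their manner class, while stressed syllables keep their full phone identities.

```python
# Simplified manner-of-articulation classes (illustrative, not complete).
MANNER = {"b": "stop", "t": "stop", "n": "nasal",
          "ah": "vowel", "ax": "vowel"}

def backoff_pron(syllables):
    """syllables: list of (phones, stressed) pairs; unstressed syllables
    back off to manner-of-articulation classes."""
    out = []
    for phones, stressed in syllables:
        out.extend(phones if stressed else [MANNER[p] for p in phones])
    return out

# "button" as a stressed (b ah) plus unstressed (t ax n) syllable (illustrative).
print(backoff_pron([(["b", "ah"], True), (["t", "ax", "n"], False)]))
# -> ['b', 'ah', 'stop', 'vowel', 'nasal']
```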
Much of the work above is devoted to methods of modeling the acoustic-phonetic variation inherent in speech in order to build better speech recognition models. However, a slightly different way of thinking about variation is to consider the variation in the patterns of errors made by a speech recognizer, which arise from many factors (for example, errors due to inherent speech variation, errors caused by poor acoustic/lexical models, or search errors). This paper focuses on methods to predict the errors made by speech recognition systems even when we only have a text transcript (i.e., no audio); the proposed framework is flexible enough to allow for different prediction models to characterize system performance. This technology has allowed us and others to train discriminative language models that directly optimize system error rate (rather than data likelihood) using large amounts of textual data.
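One way to picture text-only error prediction is a confusion model that corrupts a reference transcript into plausible recognizer output; the word confusions and probabilities below are invented for illustration and are not the paper's model. Pairs of reference and simulated hypotheses like these could then serve as training data for a discriminative language model.

```python
import random

CONFUSIONS = {"their": ["there", "they're"], "to": ["two", "too"]}
P_SUB, P_DEL = 0.3, 0.05   # made-up substitution/deletion rates

def simulate_errors(words, rng):
    """Corrupt a word sequence the way a recognizer might."""
    hyp = []
    for w in words:
        if rng.random() < P_DEL:
            continue                               # simulated deletion
        if w in CONFUSIONS and rng.random() < P_SUB:
            hyp.append(rng.choice(CONFUSIONS[w]))  # simulated substitution
        else:
            hyp.append(w)
    return hyp

rng = random.Random(1)
print(simulate_errors("they went to their house".split(), rng))
```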
An initial study into how the matching of parent and child vowel spaces might be accomplished.
Abstract: As a child acquires language, he or she: perceives acoustic information in his or her surrounding environment; identifies portions of the ambient acoustic information as language-related; and associates that language-related information with his or her perception of his or her own language-related acoustic productions. The present work models the third task. We use a semisupervised alignment algorithm based on manifold learning. We discuss the concepts behind this approach, and the application of the algorithm to this task. We present experimental evidence indicating the usefulness of manifold alignment in learning speaker normalization.
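To give a flavor of semisupervised manifold alignment, here is a generic joint-Laplacian sketch, not the paper's exact algorithm: two point sets are embedded jointly so that points with known correspondences land near each other. The data, neighborhood size, and correspondence weight are toy assumptions.

```python
import numpy as np

def knn_adjacency(X, k=3):
    """Symmetric k-nearest-neighbor adjacency matrix."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    W = np.zeros_like(d)
    for i in range(len(X)):
        for j in np.argsort(d[i])[1:k + 1]:   # skip self at index 0
            W[i, j] = W[j, i] = 1.0
    return W

def align(X, Y, pairs, mu=10.0, dim=2):
    """Joint Laplacian eigenmap: embed X and Y together, tying known pairs."""
    nx, ny = len(X), len(Y)
    W = np.zeros((nx + ny, nx + ny))
    W[:nx, :nx] = knn_adjacency(X)
    W[nx:, nx:] = knn_adjacency(Y)
    for i, j in pairs:                        # known cross-set correspondences
        W[i, nx + j] = W[nx + j, i] = mu
    L = np.diag(W.sum(1)) - W                 # joint graph Laplacian
    vals, vecs = np.linalg.eigh(L)
    emb = vecs[:, 1:dim + 1]                  # skip the trivial eigenvector
    return emb[:nx], emb[nx:]

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))                  # e.g., child vowel features (toy)
Y = X @ rng.normal(size=(4, 4)) * 0.5         # warped "adult" version (toy)
ex, ey = align(X, Y, pairs=[(0, 0), (5, 5)])
print(np.linalg.norm(ex[0] - ey[0]))          # tied pair ends up nearby
```

The semisupervised element is the small set of known correspondences: only a few tied points are needed to pull the two manifolds into a shared space, which is the sense in which the approach models speaker normalization.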