(Updated on Dec 9th, 2017)

There are two complementary channels for people to acquire knowledge and solve problems: (1) turning to machines, such as search engines like Google and intelligent assistants like Siri; (2) soliciting help from human experts directly. Both channels are undergoing significant transformations. On the one hand, owing to the recent availability of large-scale knowledge bases (KBs) or knowledge graphs like Freebase, as well as the development of natural language understanding techniques, for many questions such as “What are the symptoms of breast cancer,” users can now get direct and precise answers from machines, in contrast to a long list of relevant documents from traditional search engines. On the other hand, the traditional way of in-person, one-to-one enquiring is turning into a more powerful way of seeking help from a network of human experts online. Examples include online community platforms such as Quora, Stack Overflow, and HealthBoards, various crowdsourcing platforms (e.g., Amazon MTurk, CrowdFlower), and customer service networks in millions of companies. My research work has been advancing both channels, i.e., machine-intelligent and human-collaborative systems, through mining disparate data sources (including structured KBs, semi-structured tables, and unstructured texts), network analysis, and human behavior understanding. My work is closely connected to healthcare, education, social science, business, cybersecurity, etc.

(1) Machine Intelligent Question Answering

The recent flourishing of large-scale knowledge bases, such as Freebase, Google’s Knowledge Graph, and Microsoft’s Satori, has greatly promoted the development of machine-aided knowledge discovery and question answering systems. Our studies in this line include:

Structured Knowledge Based Question Answering [VLDB'14, SIGMOD'14 (Demo), SIGKDD'15]

Querying complex graph databases such as knowledge graphs is a challenging task for non-professional users. Because of their complex schemas and the varied ways in which information is described, it is very hard for users to formulate a query that can be properly processed by existing systems. We proposed that a user-friendly graph query engine [VLDB'14] must support various kinds of transformations, such as synonym, abbreviation, and ontology, between a query and its match. Furthermore, the derived candidate matches must be ranked in a principled manner. We designed a novel framework enabling schema-less and structure-less graph querying (SLQ), where a user does not need to describe queries precisely as required by most databases. The query engine is built on a set of transformation functions that automatically map keywords and linkages from a query to their matches in a graph. It automatically learns an effective ranking model, without assuming manually labeled training examples, and can efficiently return top-ranked matches using graph sketch and belief propagation. Furthermore, we studied how to exploit relevance feedback from users to provide improved results subsequently [SIGKDD'15]. Our recent study [EMNLP’16] presented the first framework to construct KB-based QA datasets with rich question characteristics (e.g., structure complexity, function, and paraphrasing), which can greatly help benchmark different QA systems.
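To illustrate the idea of transformation functions that map a query keyword to candidate matches in a graph, here is a minimal sketch. It is not the SLQ implementation: the lookup tables, the function names (e.g., match_keyword), and the hand-set weights are all hypothetical, whereas the actual system learns its ranking model automatically and also matches linkages, not only keywords.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical lookup tables, for illustration only.
SYNONYMS: Dict[str, Set[str]] = {"teacher": {"instructor", "professor"}}
ABBREVIATIONS: Dict[str, str] = {
    "uiuc": "university of illinois at urbana-champaign",
}


def exact(keyword: str, label: str) -> bool:
    return keyword == label


def synonym(keyword: str, label: str) -> bool:
    return label in SYNONYMS.get(keyword, set())


def abbreviation(keyword: str, label: str) -> bool:
    return ABBREVIATIONS.get(keyword) == label


# (name, transformation function, weight). The weights are hand-set
# assumptions; they stand in for a learned ranking model.
TRANSFORMATIONS: List[Tuple[str, Callable[[str, str], bool], float]] = [
    ("exact", exact, 1.0),
    ("synonym", synonym, 0.8),
    ("abbreviation", abbreviation, 0.7),
]


def match_keyword(keyword: str, node_labels: List[str]) -> List[Tuple[str, str, float]]:
    """Return (label, transformation, score) for matching nodes, best first."""
    keyword = keyword.lower()
    scored = []
    for label in node_labels:
        normalized = label.lower()
        # Keep the highest-weighted transformation that links keyword and label.
        best = max(
            ((weight, name) for name, fn, weight in TRANSFORMATIONS
             if fn(keyword, normalized)),
            default=None,
        )
        if best is not None:
            scored.append((label, best[1], best[0]))
    return sorted(scored, key=lambda item: -item[2])
```

For example, match_keyword("teacher", ["Professor", "Teacher", "Student"]) ranks the exact match "Teacher" ahead of the synonym match "Professor" and drops "Student", which no transformation reaches.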

Unstructured Text Based Question Answering [WWW’15, EMNLP’17]

It is well known that although knowledge bases contain a wealth of information, they are far from complete and may not always contain the information required to answer users’ questions. We developed a new QA system, QuASE [WWW’15], which mines answer candidates from large-scale web texts such as news articles and Wikipedia pages, and employs knowledge bases to obtain semantic features that further boost QA performance. Our recent work [EMNLP’17] studied one critical problem in most existing QA systems: for any given question, they always return an answer from a set of candidates, leading to a high false positive rate. This can greatly hurt user experience, especially when it is hard for users to judge answer correctness.

Semi-Structured Table Based Question Answering [WWW’16]

Beyond texts on the Web, we observed that HTML tables are also pervasive and cover a large variety of topics. Compared with texts, tables are more structured and provide concept-related grouping, but they use more flexible schemas than KBs. In [WWW’16], we studied how to identify table cells from millions of tables to answer user questions, in contrast to QA based on texts and KBs. [WWW’15] and [WWW’16] showed that texts or tables can address questions that cannot be answered, or cannot easily be answered, by existing KBs. Our ongoing projects investigate how to employ the three complementary sources (i.e., KBs, texts, and tables) for query resolution.

(2) Human Collaborative Problem Solving [SIGKDD'14, TKDE'15, SDM’16]

Expertise Mining, Expert Behavior Analysis, Collaboration Optimization

One specific but pervasive type of human-aided problem solving system is the collaborative network. Collaborative networks abound in real life, where experts cooperate with each other to complete specific tasks. For instance, in service businesses, a service provider often maintains an expert network where agents collaboratively solve problems reported by customers. In a classic collaborative network, upon receiving a problem, an expert first tries to solve it; if they fail, the expert routes the problem to another expert. This procedure continues until the problem arrives at an expert who can solve it. Against this background, in order to further improve problem solving efficiency, we have particularly studied: (a) How does an expert decide whom to route a problem to? Do they always route a problem to the expert most likely to solve it, or do their routing behaviors involve much randomness? [SIGKDD'14] (b) How can we identify an expert whose "fine-grained" expertise is most relevant to a problem, so that the problem can be dispatched to them? [TKDE'15] For example, given a problem on “Java multithreaded programming”, an expert who has that particular experience is more desirable than experts who have the broadest knowledge about “Java” but little experience with “multithreaded programming”. We contributed a two-step framework to summarize users' fine-grained knowledge by analyzing their Web surfing data. We further proposed novel distributed representation models in [SDM’16] to simultaneously capture the two natural aspects of human expertise: Specialization (what area an expert is good at) and Proficiency Level (to what degree), which cannot be done by traditional topic model based methods.
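The routing procedure in a classic collaborative network, as described above, can be sketched as a simple loop. This is only an illustration under stated assumptions: the function route_problem, the skills and next_hop dictionaries, and the max_hops cap are hypothetical, and each expert's routing choice is fixed in advance here, whereas real routing behavior is exactly what [SIGKDD'14] analyzes.

```python
from typing import Dict, List, Set


def route_problem(
    topic: str,
    start: str,
    skills: Dict[str, Set[str]],
    next_hop: Dict[str, str],
    max_hops: int = 10,
) -> List[str]:
    """Return the list of experts a problem visits until one can solve it.

    skills maps each expert to the topics they can solve (assumed known).
    next_hop maps each expert to the expert they route unsolved problems to.
    """
    path = [start]
    current = start
    # An expert first tries to solve the problem; on failure they route it on.
    while topic not in skills[current]:
        if len(path) - 1 >= max_hops or current not in next_hop:
            raise LookupError(f"no expert on the route could solve {topic!r}")
        current = next_hop[current]
        path.append(current)
    return path
```

With skills = {"alice": {"java"}, "bob": {"networking"}, "carol": {"java", "multithreading"}} and next_hop = {"alice": "bob", "bob": "carol"}, a "multithreading" problem that starts at alice takes two hops before carol solves it. Dispatching it to carol directly, based on her fine-grained expertise, would avoid both hops.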

(3) Text Mining and (Domain-Specific) Knowledge Discovery [SIGKDD'13, TKDE'14, ICDM’16, WSDM’17]

In the big data era, both machine- and human-aided systems avail themselves of large-scale, diversified texts. These rich textual resources provide abundant knowledge that can (1) directly respond to users’ information requests in most cases; (2) assist human experts in making decisions and solving problems; (3) serve as a significant information source to complete and update KBs. Along this line, we have developed novel methods to mine three types of textual data, i.e., product reviews, tweets, and QA forums (especially healthcare-related and programming-related): (a) Fake Review Generation and Detection. Online reviews have been widely adopted in many applications. Since they can either promote or harm the reputation of a product or a service, buying and selling fake reviews has become a profitable business and a big threat. In [SIGKDD'13], we brought to attention an automated, low-cost process for generating fake reviews, which could easily be employed by malicious attackers in practice. To the best of our knowledge, this is the first work to expose the potential risk of machine-generated deceptive reviews, and we aimed to stimulate debate on and defense against this deception scheme. (b) Public Sentiment Variation on Twitter. Sentiment analysis on Twitter data has provided an economical and effective way to expose public opinion in a timely manner. While there had been extensive studies on sentiment analysis of individual tweets, we performed further analysis to mine useful insights behind significant variations of public sentiment [TKDE'14]. Detecting possible reasons behind such sentiment variations can provide important decision-making information. For example, if negative sentiment towards President Barack Obama increases significantly, the White House Administration Office may want to know why people changed their opinions and react accordingly to reverse this trend. (c) Mining healthcare- and programming-related QA forums.
We have been investigating how to build medical intelligent assistants to help self-diagnosis by mining healthcare-related forums [ICDM’16, WSDM’17]. In [WSDM’17], we aim to identify trustworthy answers from non-experts by capturing their semantic meanings. We developed a medical Android in [ICDM’16] which not only receives questions from users, but also asks follow-up questions to obtain the most relevant information, in order to predict the most likely cause of a health condition. Besides healthcare-related forums, one of my ongoing projects focuses on mining structured knowledge from programming discussion forums like Stack Overflow and MOOCs, to facilitate automated QA and knowledge exploration. Outcomes from this research project can greatly help computer science education and speed up software development.