What's New
Aug 6-14, Berlin for ACL, presenting my 3rd TACL paper on text simplification
Dec 10-16, Osaka, Japan for COLING, organizing the 2nd Workshop on Noisy User-generated Text
Teaching
I designed and taught a new course: Social Media and Text Analytics.
[Summary] Social media provides a massive amount of valuable information and shows us how language is actually used by lots of people. This course covers several important machine learning algorithms and the core natural language processing techniques for obtaining and processing Twitter data.
[Schedule]
Students
All of my past advisees have published one or more papers with me:
Quanze Chen (undergraduate UPenn → PhD University of Washington)
Bin Fu (undergraduate Tsinghua → PhD CMU → Google NYC)
Mingkun Gao (master UPenn → PhD UIUC)
Ray Lei (undergraduate UPenn → master UPenn)
Maria Pershina (PhD NYU | I served on her PhD thesis committee)
Siyu Qiu (master UPenn → Hulu)
Research Highlights
Joint Word-Sentence Models
I build probabilistic graphical models to extract semantic or structured knowledge from large volumes of data. I designed the first successful models to extract paraphrases from Twitter that can scale up to billions of sentences. These web-scale paraphrases enable natural language systems to handle errors (e.g. “everytime” ↔ “every time”), lexical variations (e.g. “oscar nom’d doc” ↔ “Oscar-nominated documentary”), rare words (e.g. “NetsBulls series” ↔ “Nets and Bulls games”), and language shifts (e.g. “is bananas” ↔ “is great”) [BUCC2013] [SemEval2015]. However, such lexically divergent paraphrases are difficult to capture with conventional similarity-based approaches. I invented the multi-instance learning paraphrase (MultiP) model [TACL2014], which jointly infers latent word-sentence relations and relaxes the reliance on human annotation. It is a conditional random field model with latent variables [ACL2014][ACL2013], and the current state-of-the-art, outperforming deep learning and latent space methods.
Statistical Natural Language Generation (NLG) Framework
Many text-to-text generation problems can be thought of as sentential paraphrasing or monolingual machine translation. These problems face an exponential search space larger than that of bilingual translation, but a much smaller optimal solution space due to specific task requirements. I advocate for a statistical text-to-text framework, building on top of statistical machine translation (SMT) technology. My recent work uncovered multiple serious problems in text simplification research between 2010 and 2014 [TACL2015], and set a new state-of-the-art by designing novel objective functions for optimizing syntax-based SMT and overgenerating with large-scale paraphrases [TACL2016]. I am also very interested in paraphrases of different language styles (e.g. historic ↔ modern [COLING2012], erroneous ↔ well-edited [BUCC2013], feminine ↔ masculine [AAAI2016]).
Publications
-
A Minimally Supervised Method for Recognizing and Normalizing Time Expressions in Twitter
Jeniya Tabassum, Alan Ritter, Wei Xu
In EMNLP 2016
-
Optimizing Statistical Machine Translation for Text Simplification [data & code - expected in August]
Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, Chris Callison-Burch
In TACL 2016 (expected presentation at ACL 2016)
-
Discovering User Attribute Stylistic Differences via Paraphrasing [bib] [data]
Daniel Preoţiuc-Pietro, Wei Xu, Lyle Ungar
Proceedings of AAAI 2016
-
Problems in Current Text Simplification Research: New Data Can Help [bib][slides] [data]
Wei Xu, Chris Callison-Burch, Courtney Napoles
In TACL 2015 (talk at EMNLP 2015)
-
Shared Tasks of the 2015 Workshop on Noisy User-generated Text: Twitter Lexical Normalization and Named Entity Recognition [bib]
Timothy Baldwin, Marie-Catherine de Marneffe, Bo Han, Young-Bum Kim, Alan Ritter, Wei Xu
Proceedings of ACL 2015 Workshop on Noisy User-generated Text (WNUT)
-
Cost Optimization for Crowdsourcing Translation [bib]
Mingkun Gao, Wei Xu, Chris Callison-Burch
Proceedings of NAACL 2015
-
SemEval-2015 Task 1: Paraphrase and Semantic Similarity in Twitter (PIT) [bib] [data & code - email me]
Wei Xu, Chris Callison-Burch, William B. Dolan
Proceedings of SemEval 2015
-
Extracting Lexically Divergent Paraphrases from Twitter [bib][code][data - email me]
Wei Xu, Alan Ritter, Chris Callison-Burch, William B. Dolan, Yangfeng Ji
In TACL 2014 (talk at NAACL 2015)
-
Poetry of the Crowd: A Human Computation Algorithm to Convert Prose into Rhyming Verse [bib]
Quanze Chen, Chenyang Lei, Wei Xu, Ellie Pavlick, Chris Callison-Burch
Proceedings of HCOMP 2014
-
Infusion of Labeled Data into Distant Supervision for Relation Extraction [bib]
Maria Pershina, Bonan Min, Wei Xu, Ralph Grishman
Proceedings of ACL 2014
-
Data-driven Approaches for Paraphrasing Across Language Variations [bib]
Wei Xu
PhD Thesis
-
Filling Knowledge Base Gaps for Distant Supervision of Relation Extraction [data] [bib]
Wei Xu, Raphael Hoffmann, Le Zhao, Ralph Grishman
Proceedings of ACL 2013
-
[data] [bib]
Wei Xu, Alan Ritter, Ralph Grishman
Proceedings of ACL 2013 Workshop on Building and Using Comparable Corpora (BUCC)
-
[data] [bib]
Wei Xu, Ralph Grishman, Adam Meyers, Alan Ritter
Proceedings of NAACL 2013 Workshop on Language Analysis in Social Media (LASM)
-
Paraphrasing for Style [data & code][bib]
Wei Xu, Alan Ritter, Bill Dolan, Ralph Grishman, Colin Cherry
Proceedings of COLING 2012
-
Exploiting Syntactic and Distributional Information for Spelling Correction with Web-Scale N-gram Models [bib]
Wei Xu, Joel Tetreault, Martin Chodorow, Ralph Grishman, Le Zhao
Proceedings of EMNLP 2011
-
Passage Retrieval for Information Extraction using Distant Supervision
Wei Xu, Ralph Grishman, Le Zhao
Proceedings of IJCNLP 2011
-
New York University 2011 System for KBP Slot Filling
Ang Sun, Ralph Grishman, Wei Xu, Bonan Min
Proceedings of TAC 2011
-
Who, What, When, Where, Why? Comparing Multiple Approaches to the Cross-Lingual 5W Task
Kristen Parton, Kathleen R. McKeown, Bob Coyne, Mona T. Diab, Ralph Grishman, Dilek Hakkani-Tür, Mary Harper, Heng Ji, Wei Yun Ma, Adam Meyers, Sara Stolbach, Ang Sun, Gokhan Tur, Wei Xu, Sibel Yaman
Proceedings of ACL-IJCNLP 2009
-
A Parse-and-Trim Approach with Information Significance for Chinese Sentence Compression
Wei Xu, Ralph Grishman
Proceedings of ACL-IJCNLP Workshop on Language Generation and Summarisation 2009
-
Transducing Logical Relations from Automatic and Manual Annotation
Adam Meyers, Michiko Kosaka, Heng Ji, Nianwen Xue, Mary Harper, Ang Sun, Wei Xu, Shasha Liao
Proceedings of ACL-IJCNLP Workshop on Linguistic Annotation 2009
-
Automatic Recognition of Logical Relations for English, Chinese and Japanese in the GLARF Framework
Adam Meyers, Michiko Kosaka, Nianwen Xue, Heng Ji, Ang Sun, Shasha Liao, Wei Xu
Proceedings of NAACL-HLT Workshop on Semantic Evaluations 2009
-
Using Non-Local Features to Improve Named Entity Recognition Recall
Xinnian Mao, Wei Xu, Yuan Dong, Haila Wang
Proceedings of PACLIC 2007
-
Domain Extension of Chinese Named Entity Recognition
Wei Xu, Bin Fu, Liu Liu, Chunfa Yuan, Wenjie Li
Frontiers of Content Computing 2007
-
Extractive Summarization using Inter- and Intra- Event Relevance
Wenjie Li, Wei Xu, Mingli Wu, Chunfa Yuan, Qin Lu
Proceedings of COLING-ACL 2006
-
Deriving Event Relevance from the Ontology Constructed with Formal Concept Analysis
Wei Xu, Wenjie Li, Mingli Wu, Wei Li, Chunfa Yuan
Proceedings of CICLing 2006
-
Building Document Graphs for Multiple News Articles Summarization: An Event-Based Approach
Wei Xu, Wenjie Li, Mingli Wu, Wei Li, Chunfa Yuan, Kam-Fai Wong
Proceedings of ICCPOL 2006
-
The Hong Kong Polytechnic University at ACE2005
Wenjie Li, Wei Li, Mingli Wu, Wei Xu
Proceedings of ACE Evaluation Workshop 2005
Professional Service
Area Chair:
EMNLP (2016)
Publicity Chair:
NAACL (2016)
Session Chair:
EMNLP (2015), NAACL (2015), AAAI (2015), ACL (2014)
Organizer:
-
ACL 2015 Workshop on Noisy User-generated Text (W-NUT)
-
SemEval 2015 shared-task: Paraphrases and Semantic Similarity in Twitter (PIT)
-
2016 Mid-Atlantic Student Colloquium on Speech, Language and Learning
Program Committee:
ACL (2015, 2014, 2013), NAACL (2015), EMNLP (2015, 2014), COLING (2014)
WWW (2016, 2015), AAAI (2016, 2015, 2012), KDD (2015)
WWW Workshop on #Microposts (2016)
ACL Workshop on Social Factors in Natural Language Processing (2016)
EACL Workshop on Language Analysis in Social Media (2014)
Journal Reviewer:
Transactions of the Association for Computational Linguistics (TACL)
Invited Talks
-
Multiple-instance Learning from Unlimited Text
May 2016, University of Edinburgh, Edinburgh, United Kingdom
Apr 2016, Ohio State University, Columbus, OH
Apr 2016, University of North Carolina, Chapel Hill, NC
Mar 2016, Arizona State University, Tempe, AZ
Mar 2016, Vanderbilt University, Nashville, TN
Mar 2016, Imperial College London, London, United Kingdom
Mar 2016, University of Waterloo, Waterloo, ON, Canada (CS Seminar)
Feb 2016, Indiana University, Bloomington, IN (Computer Science Colloquium Series)
Feb 2016, Washington University, St. Louis, MO (Computer Science & Engineering Colloquia Series)
Feb 2016, Simon Fraser University, Vancouver, BC, Canada
Feb 2016, University of Alberta, Edmonton, AB, Canada (Special Lecture)
Feb 2016, Yale University, New Haven, CT (CS Talk)
Oct 2015, University of Maryland, College Park, MD (CLIP Colloquium)
Oct 2015, Ohio State University, Columbus, OH (Clippers Seminar)
-
Large-scale Paraphrase Acquisition from Twitter
May 2015, DARPA DEFT PI Meeting, Boulder, CO
-
Learning and Generating Paraphrases from Twitter and Beyond [poster]
Apr 2015, Carnegie Mellon University, Pittsburgh, PA
Apr 2015, Columbia University, New York, NY (NLP Talk)
Feb 2015, Johns Hopkins University, Baltimore, MD (CLIP Colloquium)
-
Paraphrases in Twitter [slides]
Feb 2015, Twitter.com, San Francisco, CA
-
Modeling Lexically Divergent Paraphrases in Twitter (and Shakespeare!) [poster]
Mar 2015, The City University of New York, New York, NY (NLP Seminar)
Feb 2015, IBM Research - Almaden, San Jose, CA
Feb 2015, UC Berkeley, Berkeley, CA
Feb 2015, UT Austin, Austin, TX (Forum for Artificial Intelligence)
Dec 2014, Yahoo! Research, New York, NY
Nov 2014, Carnegie Mellon University, Pittsburgh, PA (CL+NLP Lunch Seminar)
Aug 2014, Microsoft Research, Redmond, WA (Visiting Speaker Series)
-
Incremental Information Extraction
Apr 2012, Stanford Research Institute, Palo Alto, CA
May 2011, IARPA's KDD PI Meeting, San Diego, CA
-
Information Extraction Research
Jan 2011, University of Washington, Seattle, WA
-
Event-based Summarization
Nov 2009, Thomson Reuters, Eagan, MN
Mar 2007, France Telecom, Beijing, China
Collaborators
I am a big believer in collaboration and have been happy to work and co-author with:
Colin Cherry (National Research Council Canada)
Martin Chodorow (CUNY)
Bill Dolan (Microsoft Research)
Yangfeng Ji (Gatech)
Raphael Hoffmann (U of Washington → AI2 Incubator)
Wenjie Li (Hong Kong Polytechnic University)
Adam Meyers (NYU)
Courtney Napoles (JHU)
Daniel Preoţiuc-Pietro (UPenn)
Alan Ritter (U of Washington → Ohio State U)
Joel Tetreault (ETS → Yahoo! Research)
Lyle Ungar (UPenn)
Luke Zettlemoyer (U of Washington)
Le Zhao (CMU → Google)
and many others ...
The members of my thesis committee are:
Ernest Davis (NYU)
Bill Dolan (MSR)
Satoshi Sekine (NYU/Rakuten)
Luke Zettlemoyer (U of Washington)
Places I interned at and visited when I was a PhD student:
2012-2013, University of Washington, Seattle, WA
Summer 2011, Microsoft Research, Redmond, WA
Summer 2010, Amazon.com, Seattle, WA
Spring/Fall 2010, Educational Testing Service, Princeton, NJ
Miscellaneous
When I have spare time, I enjoy the arts, traveling, snowboarding, rock climbing, sailing and windsurfing.
I also made a list of the best dressed NLP researchers (2015) and (2014).