My research interests are in distributed systems, in particular fault
tolerance, scalability, and performance optimization.
My CV can be found here.
wPerf [paper, source code]:
this work builds a tool to identify waiting events that are limiting the maximal throughput
of a multi-threaded application. To achieve this goal, wPerf first computes how a waiting event
can affect threads directly waiting for this event; then wPerf builds a wait-for graph to compute whether
such impact can indirectly reach other threads. By combining these two techniques, wPerf essentially
tries to identify events with large impacts on all threads.
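The indirect-impact step can be illustrated with a small reachability sketch over a wait-for graph. This is only an illustration of the idea described above, not wPerf's actual algorithm; the function name `affected_threads`, the edge format, and the thread names are all hypothetical.

```python
from collections import defaultdict, deque

def affected_threads(wait_edges, source):
    """Return every thread that waits on `source`, directly or through a chain of waits.

    wait_edges is a list of (waiter, waitee) pairs: an assumed, simplified
    representation of a wait-for graph.
    """
    # Reverse index: for each waitee, who is waiting on it?
    waiters = defaultdict(set)
    for waiter, waitee in wait_edges:
        waiters[waitee].add(waiter)

    seen = set()
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for w in waiters[current]:
            if w not in seen and w != source:
                seen.add(w)
                queue.append(w)
    return seen

# Hypothetical example: worker2 waits on worker1, which waits on disk.
edges = [("worker1", "disk"), ("worker2", "worker1"), ("logger", "net")]
print(sorted(affected_threads(edges, "disk")))  # ['worker1', 'worker2']
```

Here a delay in `disk` reaches `worker2` only indirectly, which is the kind of impact a direct-wait analysis alone would miss.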
SafeTimer [paper, source code]:
this work enhances existing timeout detection protocols to tolerate long delays
in the OS and the application. At the heartbeat receiver, SafeTimer checks whether there
are any pending heartbeats before reporting a failure; at the heartbeat sender, SafeTimer
blocks the sender if it cannot send out heartbeats in time. We have proved that SafeTimer
can prevent false failure reports despite arbitrary delays in the OS and the application.
This property allows existing protocols to relax their timing assumptions and use a shorter
timeout interval for faster failure detection.
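The receiver-side check can be sketched as follows. This is a simplified, user-level illustration under stated assumptions, not SafeTimer's implementation: the class `HeartbeatReceiver`, its methods, and the use of explicit timestamps are hypothetical, and the sender-side blocking is not shown.

```python
from collections import deque

class HeartbeatReceiver:
    """Toy failure detector that drains pending heartbeats before reporting a failure."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = 0.0
        # Heartbeats that have arrived but have not been processed yet,
        # e.g. because the receiving application was delayed (assumed model).
        self.pending = deque()

    def enqueue(self, arrived_at):
        self.pending.append(arrived_at)

    def sender_failed(self, now):
        # Check for pending heartbeats first, so that a slow receiver
        # does not mistake its own delay for a sender failure.
        while self.pending:
            self.last_seen = max(self.last_seen, self.pending.popleft())
        return now - self.last_seen > self.timeout

receiver = HeartbeatReceiver(timeout=1.0)
receiver.enqueue(2.5)                    # arrived while the receiver was stalled
print(receiver.sender_failed(now=3.0))   # False: the pending heartbeat is recent
print(receiver.sender_failed(now=4.0))   # True: nothing has arrived since 2.5
```

A detector that compared `now` only against already-processed heartbeats would have reported a failure at time 3.0 in this example.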
Hadoop metadata benchmark [paper, source code]:
this work builds benchmarks to test the HDFS NameNode and the YARN ResourceManager by running real experiments in a small testbed,
collecting the traces, and extrapolating these traces to a larger scale.
ThriftyPaxos [paper, source code]:
standard Paxos needs 2f+1 replicas to tolerate f failures. To reduce cost, ThriftyPaxos activates f+1
replicas first and activates backup ones when active ones fail. To ensure system availability
when copying data to the newly activated replica, ThriftyPaxos logically separates agreement
and execution and exploits a unique property of each: agreement only needs to decide what
the next request is, which allows a blank agreement node to join the protocol instantly;
execution requires only one node to reply when processing a client's request, but requires f+1
nodes to reply when taking a snapshot, which means that with fewer than f+1 replicas we can still
process clients' requests and only need to delay the time-insensitive snapshot task.
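The replica arithmetic above can be written out as a short sketch. The function names `replica_counts` and `allowed_tasks` are hypothetical and only restate the thresholds given in the description; they are not part of ThriftyPaxos itself.

```python
def replica_counts(f):
    """Replica budget for a given f: 2f+1 in total, f+1 active at first, the rest as backups."""
    total = 2 * f + 1
    active = f + 1
    return {"total": total, "active": active, "backup": total - active}

def allowed_tasks(f, live_execution_replicas):
    """Which execution tasks can proceed with the given number of live execution replicas."""
    return {
        # A client request needs a reply from only one execution node.
        "client_requests": live_execution_replicas >= 1,
        # A snapshot needs replies from f+1 execution nodes.
        "snapshot": live_execution_replicas >= f + 1,
    }

print(replica_counts(1))    # {'total': 3, 'active': 2, 'backup': 1}
print(allowed_tasks(1, 1))  # {'client_requests': True, 'snapshot': False}
```

With f = 1 and one live execution replica, client requests still proceed while the snapshot waits until a second replica has been brought up to date.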
[ASPLOS '20] FlatStore: an Efficient Log-Structured Key-Value Storage Engine for Persistent Memory.
Youmin Chen, Youyou Lu, Fan Yang, Qing Wang, Yang Wang, Jiwu Shu.
Accepted by the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems.
[OSDI '14] Salt: Combining ACID and BASE in a Distributed Database.
Chao Xie, Chunzhi Su, Manos Kapritsos, Yang Wang, Navid Yaghmazadeh, Lorenzo Alvisi, and Prince Mahajan.
In the 11th USENIX Symposium on Operating Systems Design and Implementation, Broomfield, CO, Oct 2014.
[NSDI '13] Robustness in the Salus scalable block store.
Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevetha Kirubanandam, Lorenzo Alvisi, and Mike Dahlin.
In the 10th USENIX Symposium on Networked Systems Design and Implementation, Lombard, IL, Apr 2013.
[SOSP '09] UpRight Cluster Services.
Allen Clement, Manos Kapritsos, Sangmin Lee, Yang Wang, Lorenzo Alvisi, Mike Dahlin, and Taylor Riche.
In the 22nd ACM Symposium on Operating Systems Principles, Big Sky, MT, Oct 2009.
2019: USENIX ATC ERC, SOSP Student Grant Committee