My research interests are in distributed systems, in particular fault
tolerance, scalability, and performance optimization.
My CV can be found here.
wPerf:
this work builds a tool to identify the waiting events that limit the maximum throughput
of a multi-threaded application. To achieve this goal, wPerf first computes how a waiting event
affects the threads that wait for it directly; it then builds a wait-for graph to compute whether
that impact can indirectly reach other threads. By combining these two techniques, wPerf
tries to identify the events with a large impact on all threads.
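
The wait-for graph idea can be illustrated with a small sketch. This is a toy model, not wPerf's actual algorithm: the `(waiter, waitee, seconds)` event format and both function names are assumptions made for illustration. It records how long each thread waits directly for another, then follows the edges backwards to find every thread that a given thread can delay directly or indirectly.

```python
from collections import defaultdict


def build_wait_for_graph(wait_events):
    """Aggregate direct waiting time per (waiter, waitee) pair.

    wait_events is an iterable of (waiter, waitee, seconds) tuples;
    this input format is hypothetical.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for waiter, waitee, seconds in wait_events:
        graph[waiter][waitee] += seconds
    return graph


def threads_affected_by(graph, target):
    """Return every thread that waits for `target` directly or via a chain of waits."""
    # Reverse the edges so we can walk from a waitee back to its waiters.
    reverse = defaultdict(set)
    for waiter, waitees in graph.items():
        for waitee in waitees:
            reverse[waitee].add(waiter)

    affected, stack = set(), [target]
    while stack:
        node = stack.pop()
        for waiter in reverse[node]:
            if waiter != target and waiter not in affected:
                affected.add(waiter)
                stack.append(waiter)
    return affected


# A waits for B, B waits for C, and D waits for C.
events = [("A", "B", 2.0), ("B", "C", 1.5), ("D", "C", 0.5), ("A", "B", 1.0)]
graph = build_wait_for_graph(events)
affected_by_c = threads_affected_by(graph, "C")  # includes A, which never waits for C directly
```

In this example thread A never waits for C directly, yet a delay in C can still reach A through B, which is the kind of indirect impact the graph is meant to expose.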
SafeTimer [paper, source code]:
this work enhances existing timeout detection protocols to tolerate long delays
in the OS and the application. At the heartbeat receiver, SafeTimer checks whether there
are any pending heartbeats before reporting a failure; at the heartbeat sender, SafeTimer
blocks the sender if it cannot send out heartbeats in time. We have proved that SafeTimer
can prevent false failure reports despite arbitrary delays in the OS and the application.
This property allows existing protocols to relax their timing assumptions and use a shorter
timeout interval for faster failure detection.
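
The two checks described above can be sketched at user level as follows. This is a simplified illustration under stated assumptions, not SafeTimer's implementation: the class names, the explicit `now` timestamps, and the in-memory `pending` queue are all hypothetical stand-ins for the mechanisms the real system uses.

```python
from collections import deque


class HeartbeatReceiver:
    """Toy receiver that drains pending heartbeats before reporting a failure."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heartbeat = 0.0
        self.pending = deque()  # heartbeats that arrived but were not yet processed

    def enqueue(self, arrived_at):
        # Called when a heartbeat arrives, even if the receiver is too slow to process it.
        self.pending.append(arrived_at)

    def sender_failed(self, now):
        # Process every heartbeat that is already waiting before deciding.
        while self.pending:
            self.last_heartbeat = max(self.last_heartbeat, self.pending.popleft())
        return now - self.last_heartbeat > self.timeout


class HeartbeatSender:
    """Toy sender that must stop working if its heartbeat is overdue."""

    def __init__(self, interval):
        self.interval = interval
        self.last_sent = 0.0

    def send(self, now):
        self.last_sent = now

    def must_block(self, now):
        # If the heartbeat could not be sent in time, the sender blocks itself.
        return now - self.last_sent > self.interval
```

For instance, if a heartbeat arrives at time 0.9 but the receiver is delayed until time 1.5 with a timeout of 1.0, a check that ignores the pending queue would report a failure, while `sender_failed(1.5)` here drains the queue first and does not.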
ThriftyPaxos [paper, source code]:
standard Paxos needs 2f+1 replicas to tolerate f failures. To reduce cost, ThriftyPaxos activates f+1
replicas first and activates backup ones when active ones fail. To ensure system availability
when copying data to the newly activated replica, ThriftyPaxos logically separates agreement
and execution and exploits a unique property of each: agreement only needs to decide what
the next request is, which allows a blank agreement node to join the protocol instantly;
execution requires only one node to reply when processing a client's request, but requires
f+1 nodes to reply when taking a snapshot, which means that when we have fewer than f+1
replicas, we can still process clients' requests and only need to delay the
time-insensitive snapshot task.
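
The replica counts in this description reduce to simple arithmetic, sketched below. The function names are hypothetical and the code only restates the thresholds given above (2f+1 replicas in total, f+1 active, one reply to serve a request, f+1 replies to take a snapshot); it does not model the protocol itself.

```python
def total_replicas(f):
    """Replicas standard Paxos needs to tolerate f failures."""
    return 2 * f + 1


def initially_active(f):
    """Replicas activated up front, as described above; the rest serve as backups."""
    return f + 1


def can_process_requests(live_execution_nodes):
    """Execution needs only one node to reply to a client's request."""
    return live_execution_nodes >= 1


def can_take_snapshot(live_execution_nodes, f):
    """A snapshot needs replies from f+1 execution nodes."""
    return live_execution_nodes >= f + 1


f = 2
backups = total_replicas(f) - initially_active(f)  # 5 - 3 = 2 backup replicas

# One active replica has failed and its replacement is still copying data:
live = initially_active(f) - 1
serving = can_process_requests(live)      # requests are still served
snapshotting = can_take_snapshot(live, f)  # the snapshot is delayed
```

With f = 2, losing one of the three active replicas leaves two live execution nodes, which is enough to keep answering clients but not enough to take a snapshot until the backup has caught up.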
"wPerf: Generic Off-CPU Analysis to Identify Critical Waiting Events", Fang Zhou, Yifan Gan, Sixiang Ma, and Yang Wang. To appear in OSDI 2018.
"Evaluating Scalability Bottlenecks by Workload Extrapolation", Rong Shi, Yifan Gan, and Yang Wang. To appear in MASCOTS 2018.