"Mixer: efficiently understanding and retrieving visual content at web-scale"  

An Qin, Mengbai Xiao, Yongwei Wu, Xinjie Huang, and Xiaodong Zhang 

Proceedings of the VLDB Endowment, Vol. 14, No. 12, pp. 2906-2917, August 2021.  


Visual contents, including images and videos, are dominant on the Internet today. The conventional search engine is mainly designed for textual documents, which must be extended to process and manage increasingly high volumes of visual data objects. In this paper, we present Mixer, an effective system to identify and analyze visual contents and to extract their features for data retrievals, aiming at addressing two critical issues: (1) efficiently and timely understanding visual contents, (2) retrieving them at high precision and recall rates without impairing the performance. In Mixer, the visual objects are categorized into different classes, each of which has representative visual features. Subsystems for model production and model execution are developed. Two retrieval layers are designed and implemented for images and videos, respectively. In this way, we are able to perform aggregation retrievals of the two types in efficient ways. The experiments with Baidu’s production workloads and systems show that Mixer halves the model production time and raises the feature production throughput by 9.14x. Mixer also achieves the precision and recall of video retrievals at 95% and 97%, respectively. Mixer has been in its daily operations, which makes the search engine highly scalable for visual contents at a low cost. Having observed productivity improvement of upper-level applications in the search engine, we believe our system framework would generally benefit other data processing applications.