
计算机系薛广涛课题组荣获USENIX FAST最佳论文奖

发布时间:2023-03-02

近日,计算机体系结构顶会USENIX FAST 2023在美国圣克拉拉举办,聚焦计算机存储领域的国际最高水准研究。其中,计算机系网络与服务计算研究所薛广涛教授和李明禄教授共同指导的博士生鲁瑞明荣获本届最佳论文奖,这也是国内首次获此殊荣。

获奖的论文题目为Perseus:一种适用于云存储系统的缓慢故障检测框架。上海交通大学和阿里云共同提出并实现了一种适用于大规模数据中心的慢盘检测框架,利用性能监测指标进行非侵入式的、细粒度的缓慢故障检测。系统被部署在阿里云2万台机器上,准确检测到数百块慢盘,可将节点p9999长尾延迟平均降低33-64%,在保障性能稳定性的同时,极大减少了性能的抖动,为客户提供了可预期的平滑服务质量保障。因其对缓慢故障检测的杰出创新研究和巨大应用价值,该研究也被评委推荐发表于USENIX会刊《;login:》。

研究背景

近年来,随着高速设备的不断发展,一种新兴的故障模型——缓慢故障(Fail-Slow Failure)逐渐引起了研究者的注意。缓慢故障设备处于全速工作状态和停止故障状态之间的一种中间状态,虽然仍在运行但性能大幅低于预期。

针对缓慢故障的识别,现有相关工作均是粗粒度的、侵入式的检测。“粗粒度”是指其以节点为单位进行检测,而无法定位具体的缓慢设备。“侵入式”是指其需要在软件栈上进行源码修改,并且要求用户使用特定的软件版本。然而,大型云服务提供商不该触碰到用户所使用设施的代码,而且也无法要求用户使用特定软件版本。因此,亟需一种细粒度的、非侵入式的、准确且广泛适用于各种云服务产品的缓慢故障检测框架。

研究成果

基于以上挑战,该研究结合传统机器学习技术,提出针对存储设备的、适用于大型云存储系统的缓慢故障检测框架。如下图所示,整体框架包括异常值检测、拟合回归模型、生成慢事件、量化缓慢程度等共四个步骤,最终将存储设备的缓慢程度用一套打分规则进行量化,方便驻场工程师优先针对缓慢程度最高的设备进行下线维修和人工检查。该框架无需做任何参数和设计调整即可广泛适用于阿里云的各项业务线。目前,该研究已成功落地阿里云生产环境,并在一年多的部署里成功检测出300余块缓慢故障设备,在大幅降低节点长尾延迟的同时,持续保障云服务的平稳运行。


缓慢故障检测框架工作流程

USENIX Conference on File and Storage Technologies (FAST)

FAST会议创办于2002年,是由美国高等计算系统协会(USENIX)和美国计算机学会操作系统专业组织(ACM SIGOPS)联合组织的聚焦存储领域的顶级国际会议,代表了计算机存储领域的国际最高水平。本届会议一共收录28篇文章,从中评选出2篇最佳论文奖。自创办二十多年以来,FAST推动了如RAID、闪存文件系统、非易失内存技术和分布式存储等多项存储相关技术的发展。

论文链接:https://www.usenix.org/conference/fast23/presentation/lu

文稿丨鲁瑞明

Computer Science Professor Guangtao Xue’s Research Group Wins USENIX FAST Best Paper Award

Recently, USENIX FAST 2023, a top conference in computer systems and architecture, was held in Santa Clara, USA, showcasing cutting-edge international research on storage systems. Ruiming Lu, a doctoral student in the Network Computing Center of the Department of Computer Science, advised by Professor Guangtao Xue and Professor Minglu Li, won a Best Paper Award at this year's conference, the first time researchers in China have received this honor.

The award-winning paper is titled "Perseus: A Fail-Slow Detection Framework for Cloud Storage Systems". Shanghai Jiao Tong University and Alibaba Cloud jointly proposed and implemented Perseus, a fail-slow detection framework for cloud storage systems. Perseus uses performance monitoring metrics for non-intrusive, fine-grained detection of fail-slow drives. The framework has been deployed on nearly 20,000 machines in Alibaba Cloud and has accurately detected hundreds of fail-slow drives. It reduces node-level p9999 tail latency by 33-64% on average, which keeps performance stable, greatly reduces jitter, and gives customers predictable, smooth quality of service. In recognition of its innovation and practical value in fail-slow detection, the paper was also recommended by the FAST Program Committee for publication in the USENIX magazine ";login:".
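To make the latency figure concrete, the following minimal sketch shows how a p9999 tail latency and its relative reduction could be computed. It reads p9999 as the 99.99th percentile and uses the nearest-rank definition; the function names and the request counts and latencies are hypothetical illustration values, not measurements from the paper.

```python
def tail_latency(latencies_ms, num=9999, den=10000):
    """Nearest-rank percentile num/den (default 99.99%), using integer math."""
    ordered = sorted(latencies_ms)
    rank = max(1, -(-len(ordered) * num // den))  # ceil(n * num / den)
    return ordered[rank - 1]


def reduction_percent(before, after):
    """Relative reduction of a metric, in percent."""
    return 100.0 * (before - after) / before


# Hypothetical node: 100,000 requests, 20 of them delayed by a slow drive.
before = [1.0] * 99_980 + [50.0] * 20
# Same hypothetical node after the slow drive is handled: milder stragglers.
after = [1.0] * 99_980 + [20.0] * 20

p9999_before = tail_latency(before)
p9999_after = tail_latency(after)
print(p9999_before, p9999_after, reduction_percent(p9999_before, p9999_after))
```

Because only 0.02% of requests are slow in this example, the median and even the 99th percentile stay at the fast value, while p9999 exposes the stragglers. That is why tail percentiles are the natural metric for judging the effect of removing slow drives.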

Background

Fail-slow failures, a previously unnoticed failure model, have gradually caught the attention of researchers in recent years. Unlike a traditional fail-stop failure, a fail-slow failure leaves a device in an intermediate state between full-speed operation and complete failure: the component still functions, but with much lower performance than expected.

Existing work on fail-slow failure detection is mostly coarse-grained and intrusive. Coarse-grained means that detection works only at the node level, so locating the actual culprit device still takes nontrivial manual effort. Intrusive means that detection requires source code access or software modification and ties users to specific software versions, whereas large service providers such as cloud vendors do not touch tenants' code and cannot dictate which versions tenants run. Therefore, there is an urgent need for a fine-grained, non-intrusive, accurate, and general fail-slow detection framework that works across cloud service products.

Achievements

To tackle these challenges, this research combines classic machine learning techniques into a fail-slow detection framework for storage devices that suits large-scale cloud storage systems. As shown in the figure below, the framework has four steps: outlier detection, building regression models, formulating fail-slow events, and evaluating risk. The slowness of each drive is ultimately quantified by a set of scoring rules, so that on-site engineers can prioritize the slowest devices for removal from service, repair, and manual inspection. The framework applies across Alibaba Cloud's service lines without any parameter or design adjustments. It has been deployed in Alibaba Cloud's production environment and has detected over 300 fail-slow devices in more than a year of deployment, significantly reducing node tail latency while keeping cloud services running smoothly.

High-level workflow of the fail-slow detection framework
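The four steps of the workflow can be illustrated with a minimal Python sketch. It is not the Perseus implementation. The choice to model latency as a function of throughput, the median-absolute-deviation outlier filter, the linear least-squares fit, the 3-sigma upper bound, the minimum event length, and the scoring rule are all simplifying assumptions made here for illustration, as are the function names and the synthetic data.

```python
from statistics import median, pstdev


def drop_outliers(samples, k=3.0):
    """Step 1 (assumed rule): drop samples whose latency is far from the median."""
    lats = [lat for _, lat in samples]
    med = median(lats)
    mad = median([abs(lat - med) for lat in lats]) or 1e-9
    return [(t, lat) for t, lat in samples if abs(lat - med) <= k * 1.4826 * mad]


def fit_line(points):
    """Step 2 (assumed model): least-squares fit latency = a * throughput + b."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    a = sum((x - mx) * (y - my) for x, y in points) / sxx
    return a, my - a * mx


def slow_events(ratios, min_len=3):
    """Step 3: group runs of at least min_len consecutive slow samples."""
    events, start = [], None
    for i, r in enumerate(list(ratios) + [0.0]):  # sentinel closes a trailing run
        if r > 1.0 and start is None:
            start = i
        elif r <= 1.0 and start is not None:
            if i - start >= min_len:
                events.append((start, i))  # half-open index range [start, i)
            start = None
    return events


def rank_drives(drives, k=3.0):
    """Run all four steps for one node; return (drive, score, events), slowest first."""
    pooled = [s for samples in drives.values() for s in samples]
    clean = drop_outliers(pooled, k)
    a, b = fit_line(clean)
    sigma = pstdev([lat - (a * t + b) for t, lat in clean])
    report = []
    for name, samples in drives.items():
        # Slowdown ratio: observed latency over the fitted upper bound.
        ratios = [lat / (a * t + b + k * sigma) for t, lat in samples]
        events = slow_events(ratios)
        # Step 4 (assumed rule): sum the excess slowdown over all slow events.
        score = sum(r - 1.0 for s, e in events for r in ratios[s:e])
        report.append((name, score, events))
    return sorted(report, key=lambda item: item[1], reverse=True)


def make_node():
    """Synthetic node: four drives; drive3 is 3x slower for samples 5..11."""
    drives = {}
    for d in range(4):
        samples = []
        for i in range(20):
            tput = 100 + 10 * i  # MB/s
            noise = ((i * 7 + d * 3) % 5 - 2) * 0.05
            lat = 1.0 + 0.01 * tput + noise  # ms
            if d == 3 and 5 <= i < 12:
                lat *= 3.0
            samples.append((tput, lat))
        drives[f"drive{d}"] = samples
    return drives


if __name__ == "__main__":
    for name, score, events in rank_drives(make_node()):
        print(f"{name}: score={score:.2f} events={events}")
```

The sketch relies only on metrics that can be collected from outside the software stack (throughput and latency per drive), which mirrors the non-intrusive goal, and it compares each drive against its peers on the same node, which is what makes the result fine-grained enough to name a specific device.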

USENIX Conference on File and Storage Technologies (FAST)

The FAST conference was founded in 2002 and has since been a top international conference in the storage field, organized jointly by the USENIX Association and ACM SIGOPS. It represents the highest level of international achievement in computer storage. This year's conference accepted 28 papers, two of which received Best Paper Awards. Over the past two decades, FAST has driven the development of many storage technologies, such as RAID, flash file systems, non-volatile memory technologies, and distributed storage.

Paper link: https://www.usenix.org/conference/fast23/presentation/lu

Edited by Ruiming Lu
