2024, Vol. 56, No. 05, pp. 71-79
A Cluster System Failure Prediction Approach Based on Deep Learning
Email: zhang_han@zzu.edu.cn
DOI: 10.13705/j.issn.1671-6841.2023021
Published:   2023-12-16
Publication Date:   2023-12-16
Online:   2023-12-16
Abstract:

In cluster system failure prediction, long time-series prediction suffers from problems such as vanishing or exploding gradients, which lead to the loss of key feature information and reduce the accuracy of the failure prediction model. To address this, a deep-learning-based failure prediction method for cluster systems was proposed. The method adopted a bidirectional gated recurrent unit (BiGRU) to capture local temporal features and employed a Transformer to improve global feature extraction. Bidirectional information transfer in the BiGRU layer captured the dynamic changes of temporal features in cluster system logs, and thereby the potential causality and local temporal features of cluster events. The Transformer layer then processed the time series output by the BiGRU layer in parallel to obtain global temporal dependencies, and was followed by a fully connected neural network layer that produced the prediction results. The effectiveness of the method was validated on a public dataset constructed from real logs generated by the Blue Gene/L system. The results showed that the proposed method outperformed the comparison methods, with a best accuracy and F1 value of 91.69% and 92.74%, respectively.
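
The layer ordering described in the abstract (BiGRU, then Transformer, then a fully connected layer) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the class name, layer sizes, number of heads and layers, the mean pooling over time steps, the absence of an explicit positional encoding, and the two-class output are all assumptions chosen only to make the example self-contained.

```python
import torch
from torch import nn


class BiGRUTransformerClassifier(nn.Module):
    """Hypothetical sketch: BiGRU -> Transformer encoder -> fully connected head."""

    def __init__(self, n_features=16, hidden=32, n_heads=4, n_layers=2, n_classes=2):
        super().__init__()
        # Bidirectional GRU reads the log-feature sequence in both directions
        # to capture local temporal features.
        self.bigru = nn.GRU(n_features, hidden, batch_first=True, bidirectional=True)
        d_model = 2 * hidden  # forward and backward states are concatenated
        layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=n_heads,
            dim_feedforward=4 * d_model,
            batch_first=True,
        )
        # Self-attention processes all time steps in parallel to model
        # global temporal dependencies.
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Fully connected layer maps the pooled representation to class logits.
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):
        # x: (batch, seq_len, n_features)
        local_feats, _ = self.bigru(x)             # (batch, seq_len, 2 * hidden)
        global_feats = self.encoder(local_feats)   # (batch, seq_len, 2 * hidden)
        pooled = global_feats.mean(dim=1)          # assumed pooling: mean over time
        return self.head(pooled)                   # (batch, n_classes)


# Usage with random data standing in for encoded log windows (assumed shapes).
model = BiGRUTransformerClassifier()
model.eval()
with torch.no_grad():
    logits = model(torch.randn(8, 50, 16))  # 8 windows, 50 time steps, 16 features
print(logits.shape)
```

Feeding the BiGRU output directly into the encoder relies on the recurrent states to carry order information; whether the published model adds a positional encoding or uses a different pooling step is not stated in the abstract.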

Basic Information:

DOI:10.13705/j.issn.1671-6841.2023021

China Classification Code:TP277;TP18

Citation Information:

[1] JI Lixia, ZHANG Qingkai, ZHOU Hongxin, et al. A Cluster System Failure Prediction Approach Based on Deep Learning[J]. Journal of Zhengzhou University (Natural Science Edition), 2024, 56(05): 71-79. DOI: 10.13705/j.issn.1671-6841.2023021.

Fund Information:

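National Natural Science Foundation of China (52179144); Henan Province Major Science and Technology Project (201300210500); Zhengzhou Major Science and Technology Innovation Project (2020CXZX0053)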

