Fault Tolerant Suffix Trees

2021-12-14IftikharAhmadSyedZulfiqarAliShahAmbreenShahnazSadeeqJanSalmaNoorWajeehaKhalilFazalQudusKhanandMuhammadIftikharKhan

Computers Materials&Continua 2021年1期

Iftikhar Ahmad,Syed Zulfiqar Ali Shah,Ambreen Shahnaz,Sadeeq Jan,Salma Noor,Wajeeha Khalil,Fazal Qudus Khan and Muhammad Iftikhar Khan

1Department of Computer Science and Information Technology,University of Engineering and Technology,Peshawar,25120,Pakistan

2Department of Computer Science,Shaheed Benazir Bhutto Women University,Peshawar,25000,Pakistan 3Department of Information Technology,FCIT,King Abdulaziz University,Jeddah,Saudi Arabia

4Department of Electrical Engineering,University of Engineering and Technology,Peshawar,25120,Pakistan

Abstract:Classical algorithms and data structures assume that the underlying memory is reliable,and the data remain safe during or after processing.However,the assumption is perilous as several studies have shown that large and inexpensive memories are vulnerable to bit flips.Thus,the correctness of output of a classical algorithm can be threatened by a few memory faults.Fault tolerant data structures and resilient algorithms are developed to tolerate a limited number of faults and provide a correct output based on the uncorrupted part of the data.Suffix tree is one of the important data structures that has widespread applications including substring search,super string problem and data compression.The fault tolerant version of the suffix tree presented in the literature uses complex techniques of encodable and decodable error-correcting codes,blocked data structures and fault-resistant tries.In this work,we use the natural approach of data replication to develop a fault tolerant suffix tree based on the faulty memory random access machine model.The proposed data structure stores copies of the indices to sustain memory faults injected by an adversary.We develop a resilient version of the Ukkonen’s algorithm for constructing the fault tolerant suffix tree and derive an upper bound on the number of corrupt suffixes.

Keywords:Resilient data structures;fault tolerant data structures;suffix tree

1 Introduction

Computer memory plays an important role in all Turing-based computational platforms.Modern systems are using memories at multiple levels which consequently increases the need of its correctness[1].However,memories,at any level,are fragile and can be error-prone,so some sort of healing mechanism is necessary.A fault is an underlying cause of an error,such as a stuck-at bit or high energy particle strike.Faults can be active(causing errors)or dormant(not causing errors)[1].An error is an incorrect portion of state resulting from an active fault,such as an incorrect value in memory.Faults are predicted to become more frequent in future systems that contain many times more DRAM and SRAM than found in current systems.

Memory faults can compromise the correctness of results.During the execution of an algorithm or during the lifetime of a data structure,it is assumed that the data stored,and data structures built in memory locations are correct and the processing is taking place exactly according to the directions of the algorithm.When memory faults occur,data items stored on memory locations get altered and algorithm is compelled to take wrong steps.Consequently,the results are quite different from the expected results.This can lead to financial loss or even result in more disastrous situations such as in case of avionics applications[2].The errors can be contained by using the error detection and correction circuitry.But this is an expensive solution in terms of price and performance as additional computational overhead is also added.Another solution can be the use of high performance and reliable memory like memory registers.However,it is too expensive to be used for large applications.Hence these faults need to be taken care of at application level instead of hardware level.Therefore,some recovery or tolerance mechanism is required to guard against these memory faults and ensure the reliability of processing.For this purpose,resilient algorithms or fault-tolerant data structures are designed.A resilient algorithm is the one which can compute a correct output based on uncorrupted values[3].During the last fifty years several resilient algorithms and fault-tolerant data structures have been presented by the algorithmic community[2].

Suffix tree is an important data structure used for performing substring queries on a given string.Several classical algorithms are available for creating suffix tree for a given string.These algorithms fail when a few memory faults occur.Therefore,in this work,we design a fault tolerant suffix tree based on the natural technique of data replication.We use the Faulty-Memory random access machine model introduced by Finocchi and Italiano[3]to construct a Fault-Tolerant Suffix Tree(FTST).We useas duplication function,where σ is an upper bound on memory faults.Instead of duplication actual characters of the string,only the starting index of each edge is duplicatedtimes.The proposed FTST can toleratefaults on each edge of the suffix tree.

Rest of the paper is organized as follows.Section 2 provides a brief overview of the state of the art.Formal problem statement is presented in Section 3.Section 4 presents the proposed Fault Tolerant Suffix Tree(FTST)model,and analysis of the model is performed in the same section.Section 4 concludes the work and provides directions for future research.

2 Literature Review

A lot of work has been done in the field of fault-tolerant data structures and resilient algorithms.Largescale applications process massive data which requires large amount of low-cost memory.Hence faulttolerance is an important consideration in large systems,safety critical systems and financial institutions’systems etc.Computing with unreliable information is a new and interesting area of research.Investigations and explorations have been carried out for coping with the problem of computing with unreliable information in a variety of different settings.For example,liar model[4],fault-tolerant persistent memory programming[5],fault tolerant main memory file systems[6],parallel models of computation with faulty memories[7],and others[8–11].In the following,we summarize key works in the area of fault tolerant algorithms and systems.

Zhang[5]presented Pangolin,which is a fault tolerant persistent object library for NVMM.The proposed library uses a variety of techniques such as parity,checksums and micro-buffering to protect and preserve the objects from errors introduced due to media failure or due to software bugs.Xu et al.[6]proposed a novel non-volatile main memory(NVMM)based file system called NOVA-Fortis.The proposed file system is capable of performing well even in the presence of faulty storage(errors introduced due to corruption)or software bugs.Wang et al.[11]considered the problem of massive health data generation,transmission and collection using various devices,and proposed a interest matching mechanism for efficient,fault tolerant data collection.Jia[12]criticised the stability of traditional systems in the era of big data processing.The author proposed a fault tolerant model using spark application framework for big data clustering in distributed environment.

Finocchi[3]introduced a faulty-memory random access machine model,in which the storage locations may suffer from storage faults.In this model an adversary can alter up to σ storage locations throughout the execution of an algorithm.This model is used to produce correct output based on uncorrupted values[3,8].Aumann[9]proposed fault-tolerant Stack,Linked List and General Tree.In their work they have used reconstruction technique.When the faults are detected,the data structure is reconstructed.Finocchi et al.[10]presented resilient search trees to obtain optimal time and space bounds while tolerating up tomemory faults,wherenis the size of search tree.

Weidendorfer et al.[13]proposed LAIK(Lightweight Application-Integration data distribution for parallel worKers)to support fault tolerant features in parallel programming.LAIK has access to data and has the ability to perform various operations such as freeing up the node before failure and assisting in replication for roll back schemes.The authors presented an example by integrating LAIK with other application.

Wang et al.[14]considered the high computational cost and I/O costs incurred during the archiving phase of large graph computation systems and proposed to dynamically adjust checkpoint interval.The resultant procedure not only achieves the required key property of fault tolerance but significantly reduces the high I/O costs as well.

Traditionally,wireless sensor network nodes have low energy storage capacity,resulting in the failure of routing and communication protocols.This can result in power outage in some sensor networks resulting in loss of communication among nodes.Lu[15]addressed the problem by proposing a new algorithm using the structured directional de Brujin graph to improve the fault tolerance of communication protocols in wireless sensor networks.The intuition behind the proposed algorithm is to deploy nodes with high energy to act as super nodes responsible for formation of redundant routing tables.

3 Problem Statement

We have a stringScontainingnnumber of characters including$symbol as its last character.StringSis constituted over a set of alphabets of sizec.A patternPis any other string of lengthm.Pis to be checked for substring ofS.The suffix tree data structure can be used for substring queries.The suffix tree built forScontainsnnumber of suffixes or leaf nodes andcnumber of root-originating edges.The total number of edges of suffix tree are represented bye.

We are using faulty-memory random access machine model for constructing the Fault Tolerant Suffix Tree.In this model resilient variables are used to achieve resilience capability.A resilient variable duplicates values for safety purpose.This duplication is based on a function of number of corruptions,such as 2δ or δ2,where δ is an upper bound on the number of corruptions that can be introduced by an adversary.The duplication functiondf,if not selected cautiously,can decrease the performance of resilient data structures or algorithms in terms of space and running time.Therefore,we have carefully selecteddffor ourFTSTmodel as.

Fact 1:The minimum number of edgesein a suffix tree is given bye≥n.

Fact 2:The maximum number of edgesein a suffix tree is given bye≤2*n-c.

4 Proposed Fault Tolerant Suffix Tree Algorithms

In this section,we present the proposed fault tolerant suffix tree construction and traversing algorithms.The construction algorithm(Algorithm 1)is used to construct a fault tolerant suffix tree for stringSof lengthn.Algorithm 2 and Algorithm 3 are used to find a patternPin stringS.

4.1 Construction of Fault Tolerant Suffix Tree

Algorithm 1 describes the steps required for the construction of a fault tolerant suffix tree.Algorithm 1 receives stringSas input whereScontains $ symbol as its terminating character.nrepresents the size of stringSandcthe size of its alphabets.These two valuesnandcare used for calculating the possible number of edgeseof theFTSTby using Fact 1.By using Theorem 2,maximum number of corruptions,maxare calculated beyond which all edges can be corrupted by adversary.Actual number of corruptions δ are then randomly selected between 1 andmax.The number of actual corruptions δ defines the value of our duplication functiondf.We are aware that data replication can be very costly in terms of running time and space,therefore,we have carefully selected thedfas.rrepresents the duplicate copies plus the original value which will be used to decide the size of array which stores start index of every node ofFTST.

FTSTiis the Fault Tolerant Suffix Tree ofSwhich contains characters ofSfrom 1 toithcharacter.FTST1is the Fault Tolerant Suffix Tree which contains only the first character ofS.Then by using all the five heuristics of Ukkonen’s algorithm,the wholeFTSTis constructed.Each child node stores thestartandendindex which links it to its parent node.Thisstart-endpair shows the length of the edge which connects this child to its parent.Eachstartindex is a resilient variable which stores the valuernumber of times to provide the resilience capability.The majority ofstartindex values decide its correctness or otherwise.

4.2 Searching for Pattern P in Fault Tolerant Suffix Tree

Algorithm 2 and Algorithm 3 are proposed to search for pattern P in the fault tolerant suffix tree constructed by Algorithm 1.The algorithms and the required explanation is provided in the text below.

Algorithm 2(Traverse Node)receives three parameters,a nodeNofFTST,a patternPand an integer valuexwhich is used as index of patternP.Pis another string of lengthmand is globally accessible in all algorithms.Pis to be searched withinS.Traverse Nodealgorithm traverses its edge and stores the returned value in result.A 1 shows success while-1 shows failure of search.Variablexis incremented by the length of edge.The character ofPatxis used as index for identifying among children nodes ofN.Nrecursively calls its valid child node and proceeds.

Traverse Nodealgorithm callsTraverse Edge(Algorithm 3)to explore the edge which linksNto its parent node.During this exploration,corresponding characters of the patternPare compared with characters stored on this edge.If 1 is received fromTraverse Edgealgorithm,then it shows that patternPis successfully found in stringS.If 0 is received fromTraverse Edgealgorithm,then it means that comparison is successful on this edge but is incomplete and the remaining part needs to be explored further.To explore the deeper edge,linking this childNto its sub-child,Traverse Nodealgorithm recursively calls itself.Any value other than 0 or 1 indicates that patternPis not a substring ofS.

Algorithm 3(Traverse Edge)receives four parameters,a patternPto be searched,an integerxwhich represents the index of patternP,an array of integersstart[]representing the starting index of node and integer variableendrepresenting end index of node.Integer variablekis used as index of stringSandxofP.If corresponding characters are not equal,-1 is returned indicating failure of searching.Every node ofFTSTstores a pair of indices start and end,which stores the start and end index of the edge linking this node to its parent node,respectively.In order to achieve resilience capability,theFTSTstores the start index valuertimes in an array,start[],of sizer.tstores the start index value.treceives its value by callingCheck Indexalgorithm which finds a value in majority.

After the while loop,1 is returned to indicate that comparison has remained successful till the end of patternP,hencePis a substring ofS.If comparison between the corresponding characters remained successful but the end of patternPhas not reached so far,then it means that further match must be searched in the sub-child of this node.To signal this phenomenon,0 is returned.

4.3 Analysis of the Proposed Model

5 Conclusion

In this work,we designed a Fault Tolerant Suffix Tree(FTST)data structure which is used for applying substring queries.FTST is very useful specially in situations where memory faults can occur frequently,such as large-scale data processing systems which require low price high capacity memory or systems used at high altitudes and are susceptible to cosmic radiation.We used the natural concept of data replication for the construction of FTST.We also proposed resilient algorithm for sub-string query in FTST and derived an upper bound on the number of maximum corruptions that FTST can sustain.

The proposed FTSTconstruction and sub-string query algorithms can be used in a variety of applications where fault tolerance is a key design requirement.The proposed technique can also be used to design other fault tolerant data structures such as AVL trees,Red-Black trees and Splay trees etc.

Funding Statement:The authors received no specific funding for this study.

Conflicts of Interest:The authors declare that they have no conflicts of interest to report regarding the present study.

Computers Materials&Continua

2021年1期