Efficient and Effective 4D Trajectory Data Cleansing

2020-06-01，，，*

Transactions of Nanjing University of Aeronautics and Astronautics 2020年2期

，，，*

1.National Key Laboratory of CNS/ATM，Beihang University，Beijing 100191，P.R.China；

2.School of Electronic and Information Engineering，Beihang University，Beijing 100191，P.R.China

Abstract: As the rapid development of aviation industry and newly emerging crowd-sourcing projects such as Flightradar24 and FlightAware，large amount of air traffic data，particularly four-dimension（4D）trajectory data，have become available for the public.In order to guarantee the accuracy and reliability of results，data cleansing is the first step in analyzing 4D trajectory data，including error identification and mitigation.Data cleansing techniques for the 4D trajectory data are investigated.Back propagation（BP）neural network algorithm is applied to repair errors.Newton interpolation method is used to obtain even-spaced trajectory samples over a uniform distribution of each flight's 4D trajectory data.Furthermore，a new method is proposed to compress data while maintaining the intrinsic characteristics of the trajectories.Density-based spatial clustering of applications with noise（DBSCAN）is applied to identify remaining outliers of sample points.Experiments are performed on a data set of one-day 4D trajectory data over Europe.The results show that the proposed method can achieve more efficient and effective results than the existing approaches.The work contributes to the first step of data preprocessing and lays foundation for further downstream 4D trajectory analysis.

Key words:4D trajectories；data cleansing；outlier detection；repair

0 Introduction

The aviation industry has been developed rapidly in recent years，and the air transportation system is facing with huge challenges for both management and technologies.The emergence and dissemination of data science provide powerful tools to analyse and manage the air transportation system.Air traffic data，especially four-dimension（4D）trajectory data，are used for various analysis tasks，such as aircraft conflict detection and resolution［1］，airspace congestion management［2］，and air navigation route optimization［3］.The 4D trajectories refer to a series of points composed of the coordinates of longitude，latitude，altitude and the corresponding timestamp of aircraft in the air.With the emergence of the 4D trajectory data，airborne calculations can be used to predict the aircraft sequence at the intersection of the busy or congested airspace.It can facilitate the decision-making process of the air traffic controllers.

As the driving force of scientific and technological innovation，“data”account for a rising proportion of assets，and increasingly become another major factor of production after“land”and“capital”.Data cleansing is the prerequisite for the 4D trajectory data analysis.For obtaining more accurate and reliable results，outliers and unreliable data must be cleansed beforehand.Otherwise，it will prolong the data processing time and efforts，and mislead the final result.In the field of air transportation，research focusing on data cleansing is still rare［4-6］，and the percentage of abnormal data is not high.However，they are widely distributed［7］，almost throughout every trajectory，thereby affecting the reliability of results negatively. The common abnormal data includes data timestamp error［4，7］and missing data，such as longitude and latitude，identification information and duplicated data［4-5，7］.In addition to these anomalies，there are also some logical errors that are hard to be identified intuitively，such as data jumping［5，7-8］，which indicates too large distance，too long time gap，or too fast speed，etc.Therefore，there is an urgent requirement for 4D trajectory data cleansing method.

Research on data cleansing mainly focuses on improvement and management of the data quality.Although there are significant achievements，the data quality control and data cleansing methods need to be improved continually［9］.Basic principles of data cleansing techniques include identifying determinants which affect the data quality，define the cleansing requirements，and establish the cleansing model［10-13］.

Data cleansing is the very first step in the data preprocessing for trajectory data analysis and applications.However，most papers did not provide sufficient justifications for the data cleansing or data preprocessing before trajectory analysis［5，8］.To check the integrity of ADS-B data，Andrisani et al.［5］applied a suite of Kalman filters to smooth noise，identified and suppressed erroneous data，coasted between data dropouts，and provided the current best state estimates.Experiments were performed on the simulated ADS-B data signals and demonstrated that the approach was promising to data integrity check.The 4D trajectory data cleansing was applied by Patroumpas et al.［8］，who developed one-pass heuristics to eliminate inherent noise.They provided reliable trajectory representations and presented various bounds for trajectory error detection.Since it handled trajectories online and discarded errors directly，it was inapplicable with data offline.To manage ADS-B data collection efficiently and integrate with other flight related data，Garcia et al.［14］devised AIRPORTS DL with Data Lake architecture to reconstruct gate-to-gate trajectories and to derive parameters，such as the predictability or the fuel consumption.It discarded useless messages or aligned field values to satisfy the AIRPORTS data model，and determined trajectories when different flights use the same call sign as well.However，it required extra information，such as the history trajectory data to detect these data.The traffic library for the Python programming language，introduced by Olive et al.［15］，presented how to access different sources of data，leverage processing methods to clean，filter，clip or resample trajectories，and compare trajectory clustering methods on a sample dataset of trajectories above Switzerland.The paper handled missing data，slicing，querying or resampling with Pandas library，but it is not usually sufficient for a high requirement of data cleansing.

Back propagation（BP）neural network algorithm is widely used in various fields，such as aviation industry［16］，manufacturing industry［17］，engineering［18］，medical industry［19］，geology［20］etc.Lin et al.［21］established a sensor error correction model which combined particle swarm optimization（PSO）with the BP neural network algorithm to reduce nonlinear characteristics and improve test accuracy of the system. The BP neural network has three or more than three layers，including input layer，hidden layer，and output layer.The upper and lower layers are connected completely.There is no connection between each neuron in the same layer.To construct a BP neuron network，first random parameters of each layer are set；and the output based oninput data and the initial parameters are calculated.Subsequently，following the direction of reducing the loss between the output and actual target，the connection from the output layer weights to the middle layer-by-layer is amended.Finally，the procedure returns to the input layer［22］.

In this paper，a rich set of data cleansing methods are presented to deal with 4D trajectory data.The BP neural network is proposed to repair the errors detected by average speed between two adjacent points in the trajectories.Newton interpolation is used to filter out inconsistent data and to fix the frequency of the points in 4D trajectories.To reduce experimental complexity of further analysis while to maintain the characteristics of the trajectories，the unit cube sampling method is used to cut down data size.In addition，the density-based spatial clustering of applications with noise（DBSCAN）is used to identify outliers of trajectories.Experiments on oneday 4D trajectory dataset in European area are conducted.Results show that the proposed techniques can better cleanse the 4D trajectory data.The goal of this study is to clean data for the preparation of further data analysis.A framework for cleansing 4D trajectories is presented in the following section.

1 Methodology

As the recorded 4D trajectory data can be erroneous，the common abnormal data includes data timestamp error，missing data，duplicated data，and off-couse data［4-5，7-8］.Data cleansing is the very first step before further data analysis.Typical statistical outlier detection techniques are first used to detect the errors. Here，it is focused on the flight speed element.The data with high speed beyond the maximum flight speed will be fixed by applying BP neural network，or directly discarded if data are unnecessary.The 4D trajectory data of one flight may be recorded in 8 min，12 min，or even longer.The sampling frequency is not fixed，in other words，sampling data is missing to some extent.In order to repair the inconsistent data，Newton interpolation is applied according to fixed time or fixed distance.Moreover，similar records are discarded，which are unimportant and occupy storage based on unit cube method.In addition to the aforementioned methods，clustering method to detect outliers of trajectories is also used.

1.1 4D trajectory model

The 4D trajectory data elements obtained in this paper include the information on aircraft number，time，latitude，longitude，altitude，and speed.A 4D trajectory is a sequence of 4D points［23-24］.There are three dimensions for space information and one dimension for time.The 4D pointPis

whereLis the latitude of the pointP，Hthe longitude of pointP，Athe altitude of pointPandTthe time at pointP.Then，a 4D trajectoryTris defined by

A 4D trajectory datasetDis a collection of 4D trajectories，which is defined as follows

1.2 BP neural network

The BP neural network is introduced to repair errors in this section.Firstly，the average speed between two adjacent pointsP1andP2is applied to detect the errors with the following formulations

whereR=6 371.0 km.L p1，L p2are the latitudes ofP1andP2；H p1，H p2the longitudes ofP1andP2；A1，A2the altitudes ofP1andP2；C12the radian betweenP1andP2；d12the great circle distance betweenP1andP2；D12the actual distance betweenP1andP2；andv12the average speed between pointsP1andP2.

Since the speed of the current commercial plane is less than the speed of sound，ifv12is larger than 1 200 km/h，P2is taken as an error point.Then，we apply BP neural network to repairP2based on the correct points.Fig.1 is the structure of BP neural network.

Fig.1 Neural network structure

The BP neural network includes input layer，hidden layer，and output layer.Given a network，there areNnodes andLlayers.The activation function defined as the sigmoid function is

The error of mean square function is used to describe the loss between the true value and the output value

wherexis the input value，ythe true value，andythe output value.

The input value of theith neuron at thelth layer isand the output value of theith neuron at thelth layer is，the forward propagation is

whereis the layer connection value from theith neuron of thelth to thejth neuron of the（l+1）th layer，the offset of theith neuron of thelth layer，andf(⋅)the activate function.

To update the weight and offset，the gradient descent functions of the error function are defined as

The parameterRLearnis updated by a decay function.

wheredis the decay factor to update the learning rate.Noted that if the learning rate is too large，the model would be over-training.Otherwise，if it is too small，the optimization speed would be too slow.

The BP neural network with two hidden layers can implement any nonlinear mapping without limiting the number of hidden nodes.Therefore，fourlayer neural network with two hidden layers is chosen in the proposed model in this paper.

The process of 4D trajectory data training based on the BP neural network is summarized as follows:

Step 1Take one flight of 4D trajectory data，and detect and discard errors.Set the remained data as training dataset.Set nodes at each layer of the BP neural network.

Step 2Set the activation function as sigmoid function.Define the value function MSE（y，y-）representing the square sum of the output error.

Step 3Set the learning rate.Lay down a feedback regulation.

Step 4Choose time as input parameter.Set longitude，latitude and altitude as output separately.Feed data to train.

Step 5Input errors of time detected in Step 1 to the trained neural network model.Collect the output which is the repaired data.

Step 6Select the next BP neural network data，back to Step 1.Stop until all the trajectories are repaired.

1.3 Newton Interpolation with sliding window

Newton interpolation is one of the most popular interpolation methods.Data based on Newton interpolation with sliding window can reach high level accuracy，and computational efficiency can be gained as well［25］.

Assume thatx0，x1，x2，x3are a set of independent variables that are not equal to each other，andf（x）is a dependent variable.In this paper，the 4thorder Newton interpolation is applied to fix the frequency of points in trajectories.

wheref[x i，xj]，f[x i，xj，xk] andf[x i，xj，xk，xl]are the 2nd-order difference quotient，the 3rd-order difference quotient，and the 4th-order difference quotient，respectively. By following Eqs.（18）—（21），the 4th-order Newton interpolationN3（x）is obtained.

1.4 Unit cube method

Some of the 4D trajectory data is redundant.In order to save storage and not to destroy the data integrity，the process of sampling based on distance is applied.As shown in Fig.2，a 4D trajectory includes a series of points in 3D spaces throughout the flight.There are some points that are redundant and the characteristics of the trajectory are not changed if they are removed.Therefore，the 3D space is cut with cubes in the same volume.The points are all put into their corresponding cubes.Then，select one point in each cube to represent points in it.

In Fig.2，blue and red points are the whole data.The first cube is set at the first point，and then the subsequent cubes are set one by one according to the trend of the data.Finally，all the points are filled into the respective cubes.A data point from each cube is selected，for example，the red are the points selected.The final data are the remained red points.

Fig.2 One demo of unit cube method

In this paper，we set the latitude to 0.5°，the longitude to 0.5°，and the altitude to 152.4 m as the cubic size.The size of the cube is set according to the requirement of data accuracy.In other words，if more precise data is required，the unit cube is supposed to be set smaller，while the size of the remained data would be larger.

1.5 Trajectory clustering

To filter noises of the trajectories or the outliers of trajectories，we apply the clustering method DBSCAN［26］，which views clusters as points in the high density areas.In detail，DBSCAN clusters the points together（inεNeighborhood with the at mostMinPtssamples）.There are two parameterspesandMinPtsfor the clusters.Theε-neighborhood contains the points with the distance less thanεfor a given pointxiin the datasetD，i.e.，N ε(xj)={xi∈D|dis(x，xj)≤ε}，pesrepresents the maximum distance between two samples for them which can be considered as in the same neighborhood.If the number in the neighborhoodN ε(xj)is more thanMinPts，then a new cluster is generated.With the iteration of DBSCAN，the points are added to the clusters until all points in the dataset are labelled.Only spatial information of the trajectories is used in this section.When the trajectory is separated from other combined trajectories or consists of several sporadic points，it will be identified as noises.

The width of a victor way is 8 nm（14.8 km）according to provisions of the International Civil Aviation Organization（ICAO），and longitude and latitude are used to cluster and 0.1°is about 11 km.In our experiments，the parameterpesis set to 0.1 and the parameterMinPtsto 20.

2 Results

2.1 Datasets and experimental setup

Despite their high value in aircraft surveillance，positional data streams are not error-free，particularly ADS-B messages relayed from aircraft.Spurious coordinates indicate impossible positions across a flight.Satellite transmission problems may lead to delayed or missing messages.There may be also glitches in altitude values［8］.Moreover，when aircraft is above the sea area data is lacking.As we experimentally verified errors may concern up to 0.9%，so data cleansing is a necessary step before any further processing of aircraft trajectories.

To evaluate the performance of 4D trajectory data cleansing techniques，a 4D trajectory dataset in European area is used as case studies.The dataset includes one-day 4D trajectories on 1 January，2018.There are 5 905 137 records of 18 286 aircraft in total.We extract 4D trajectories of the aircraft 471F86 for the experimental purpose.The 4D trajectories of the flight DLH3EJ in the whole month of January，2018 are also included to verify the performance of the clustering algorithm.

The experiments are performed on a laptop equipped with four-core i7-6300U 2.50 GHz，and 16 GB DRAM.The methods are all coded with Python 3.6.7.

2.2 Selection of parameters for BP neural network

In this section，a set of experiments are conducted to select proper parameters to build the BP neural network and the proposed training model.The trajectory data of aircraft 471F86 from Wroclaw（WRO）and Dortmund（DTM）are used.There are 575 records，and 16 errors are detected based on calculating the average speed between two adjacent records.BP neural network is applied with the data after cutting down the errors for model training.

Firstly，the performance of the BP network method is compared with different parameters of learning raterland decay raterd.Fig.3 shows the performance of the BP neural network with different values of learning rate and decay rate.Fig.3（a）is the result with the learning rate set to 0.2 and the decay rate set to 0.25，which is under-trained.Fig.3（b）is the results with the learning rate set to 0.9 and the decay rate set to 0.025，which is overtrained.Fig.3（c）is the results with the learning rate set to 0.5 and the decay rate set to 0.005.In Fig.3，if the learning rate is set too large and decay is small，the model would be over-trained. Otherwise，the learning rate is too small and decay is large，the final model would be under-trained and the training process would be slow.Therefore，we set the learning rate to 0.5 and the decay rate to 0.005 in the final training.

Fig.3 Performance of BP neural network with different learning rates and decay rates

The number of nodes at each hidden layer influences the training quality and speed.Eight scenarios are tested with different combinations（nHL1，nHL2）of the number of nodes at each hidden layer，which are listed in Table 1，wherenHL1is the number of nodes in the first hidden layer，andnHL2the number of nodes in the second hidden layer.In each experiment，the learning rate is set to 0.5 and the decay rate to 0.005.Moreover，the training accuracy is set to 0.000 1 and the training process will stop if the results reach the training accuracy or the training steps are up to 3 000.The training time is also reported in Table 1.Fig.4 shows the performance of the eight scenarios.In Table 1，settingnHL1to 10 andnHL2to 5 cost the shortest time to train the model.However，the performance of the combination（10，5）shown in Fig.4（b）is not as good as that of the combination（5，2）shown in Fig.4（g）.Therefore，in our final experiments，we set five nodes at the first hidden layer and two nodes at the second hidden layer.

Table 1 Performance of hidden layers 1,2 with different combinations

2.3 Results on real 4D trajectories

The proposed BP neural network model，Newton interpolation method and unit cube method are applied on real 4D trajectories.Fig.5 shows the results of the trajectories of the aircraft 471F86.The 471F86 aircraft finished eight flights between Wroclaw（WRO）and Dortmund（DTM），Wroclaw

2.4 Results of comparison

Some data cleansing methods are illustrated and compared with Kalman filter.As the existing papers presented，most of data cleansing methods dealing with the trajectory data just drop the error when it is detected.This kind of process is the simplest but cannot assure the integrity of the original trajectory data. In addition，the Kalman filtering method is used in trajectory data cleansing.This kind of method can smooth glitches but some obvious glitches do still exist，which means that it is not（WRO）and Eindhoven（EIN），Wroclaw（WRO）and Luton（LTN），Wroclaw and（WRO）Birmingham（BHX）on January 1st，2018.There are 4 161 records in total.

Fig.5（a）shows the raw 4D trajectory data.We can see that some obvious wrong points，which drop or rise sharply as well as are far away from the major trajectory.Fig.5（b）shows the errors.The red points are the errors detected by calculating the average speed between two records.There are 50 red points which means that we detect 50 errors.Fig.5（c）shows the result after being repaired by BP neural network model.The input layer is time and the output layer is set as longitude，latitude，and altitude separately to repair errors.Fig.5（d）shows the results of interpolation data.The red points are the added points.There are 4 161 records before applying Newton interpolation method and 5 149 records after applying the method.In Fig.5（e），the green points are the final reduced points by the unit cube method.Only 900 points are selected，which shows that the unit cube method cuts down the data size vastly but keeps the trajectories.Fig.5（f）shows the trajectories after reducing records which is the finnal results of 4D trajectories of aircraft 471F86 after applying the set of data cleansing method.In Fig.5（f），the unit cube method can eliminate the glitches which cannot be detected by average speed.Comparing Fig.5（f）with the row trajectory data shown in Fig.5（a），the off-course data and glitchs are well processed，which shows the effectiveness on our data cleansing method.effective enough.Although，the proposed method is relatively complex to some extent when comparing with the Kalman filtering method，the results show that our method can remove the errors utmostly，which demonstrates its effectiveness.

Here，the Kalman filter is implemented to compare with the proposed method.Fig.6 shows the results after applying Kalman filtering.In Fig.6，some errors are elimilated but some obvious glitchs still exist.Compared Fig.6 with Fig.5（f），the proposed method is more effective than the Kalman filter.

Fig.4 Performance of the number of nodes at hidden layers with different combinations

2.5 Clustering result analyses

The performance of DBSCAN is reported to identify the outliers of trajectories.Fig.7 shows the results of clustering the trajectories of OD pair London（LHR）-Brussels（BRU）and OD pair London（LHR）-Dusseldorf（DUS）in one day.Two additional flights of Zurich（ZRH）-Brussels（BRU）and Brussels（BRU）-Copenhagen（CPH）are also included in the test data.In Fig.7，except trajectories from LHR to BRU and LHR to DUS are clustered together，which are shown in blue.The trajectories of ZRH-BRU and BRU-CPH and noise points are clustered into other categories and displayed in different colors.

Fig.5 Data cleansing techniques on 4D trajectories of aircraft 471F86

Fig.6 Kalman filter on 4D trajectories of aircraft 471F86

Fig.7 Clusters of trajectories of LHR-BRU and LHRDUS

The DBSCAN algorithm is also applied to identify the trajectory noises of DLH3EJ flight，which flies from Oslo Gardermoen（OSL）to Frankfurt Int'l（FRA）in a month.Fig.8 shows the results of DBSCAN based on flight DLH3EJ.The red points are identified as noises.In Fig.8，one can obtain the similar conclusions in Fig.7.Therefore，the DBSCAN algorithm can be used to detect outliers of trajectories.

Fig.8 Clusters of trajectories of flight DLH3EJ

2.6 Data cleansing methods on one-day trajectories

The data cleansing methods are applied to oneday 4D trajectories in the European area.There are 5 905 137 records of raw data in total.Since the data size is large，the three-dimension visualization is not a good option.The data are shown in two-dimensions with longitude and latitude.

Fig.9 shows the raw one-day 4D trajectory data.When aircraft fly above the area of sea，most of the data are lacked.If the records of one flight are less than 20，the flight is dropped out directly.

Fig.9 One-day records in Europe

Fig.10 shows one-day trajectories after applying the data cleansing methods.Fig.10（a）shows the data after repairing the errors of each flight based on BP neural network.Fig.10（b）shows the results of filling data based on Newton interpolation.Fig.10（c）shows the results of reducing data based on unit cube.In Fig.10（c），the off-course points are well trimmed and the points above the sea area are filled，which shows the effectiveness of the data cleansing method proposed in this paper.

There are 5 905 137 records of raw 4D trajectory data in total.After dropping out the duplicates，5 012 518 points are left.There are 4 989 510 points after rounding the unreliable data detected by the average speed.In our experiment，6 286 errors are detected and repaired based on the BP neural network method.There are 7 894 063 records existing after applying Newton interpolation and 1 699 110 left after cutting down data based on unit cube.

Fig.10 Data cleansing techniques on one-day trajectories

3 Conclusions

A rich set of data cleansing techniques are presented for the 4D trajectory data.The errors are detected by the average speed，and the BP neural network is applied to deal with the errors.By using the Newton interpolation method，the frequencies of the points in the 4D trajectory data are fixed.To reduce the computational complexity while maintaining data characteristics，the unit cube sampling method is used to cut down the size of the 4D trajectory data significantly.Compared with the Kalman filtering method，the proposed data cleansing method can obtain a better result.The DBSCAN method is applied to identify outliers of trajectories.Experimental results verify the efficiency of the proposed data cleansing techniques.

Data cleansing，especially for the 4D trajectory data，is a new issue.The proposed data cleansing methods will be tested for the data with a longer period，and other methods for addressing error identification problems are also to be investigated in further research.

Transactions of Nanjing University of Aeronautics and Astronautics

2020年2期