Multi-Object Tracking Based on Segmentation and Collision Avoidance

2018-06-15

Meng Zhao, Junhui Wang, Maoyong Cao, Peirui Bai, Hongyan Gu and Mingtao Pei (1. College of Electrical Engineering and Automation, Shandong University of Science and Technology, Qingdao 266590, China; 2. College of Electronic, Communication and Physics, Shandong University of Science and Technology, Qingdao 266590, China; 3. Beijing Lab of Intelligent Information Technology, Beijing Institute of Technology, Beijing 100081, China)

Multi-object tracking (MOT) is essential for many applications including video surveillance, action recognition and human-computer interaction. Although tracking-by-detection methods have proven promising, they still have shortcomings. One of the most obvious is that most of the information available in image sequences is simply ignored. Unlike most previous approaches, which depend only on detector responses, Milan et al.[1] further exploit low-level superpixel information to handle tracking in crowded scenarios; this can recover the trajectories of largely occluded targets, since partial image evidence often persists even when detections are missing. They formulate tracking as an energy minimization problem. The energy includes two types of unary potentials, based on detection and superpixel nodes respectively, as well as pairwise potentials for spatio-temporal superpixel relations and high-order cliques formed by each superpixel with every detection node.

However, the relations between objects are not employed. It is known that two different objects cannot occupy the same 3D space simultaneously, so we construct pairwise potentials that enforce collision avoidance between objects in crowded scenes. Furthermore, in Ref.[1], a linear SVM is trained online for each sequence to separate foreground from background, and positive and negative superpixel samples are collected by clustering using only LAB color features; the position feature is not used. Since the closer a superpixel is to an object, the more likely it belongs to that object, we propose to cluster superpixels jointly in LAB color space and the x, y image-plane space, which yields better clustering results.

In summary, our main contributions are twofold: collision avoidance between objects is incorporated as a pairwise potential, which improves multi-object tracking performance; and superpixels are clustered based on both color similarity and position proximity, which improves clustering and tracking performance.

1 Related Work

Multi-object tracking is an important task in the field of computer vision, and a wide range of solutions have been proposed in recent years. In this section, we categorize and discuss some closely related work in detail.

Tracking-by-detection methods usually consist of two steps: object detection and data association. A pre-trained object detector is applied to locate the positions of objects in each frame, and data association techniques are then employed to connect them into consistent trajectories. Most recent approaches formulate data association as a network flow problem. Pirsiavash et al.[2] obtained a global solution with a greedy shortest-path procedure on a flow network based on dynamic programming. Zhang et al.[3] found a globally optimal solution with the push-relabel algorithm, using the same graph as Ref.[2]. Berclaz et al.[4] took advantage of a linear programming framework to solve the flow problem efficiently with the k-shortest paths algorithm. Butt et al.[5] formulated a new min-cost flow problem that incorporates a constant-velocity motion model. Wang et al.[6] trained a structured SVM to combine a set of features for multi-target tracking. Bewley et al.[7] proposed to explore temporal reinforcement within association chains to learn the affinity between multiple targets. Loïc et al.[8] used sparse representations to improve multi-frame data association for robust near-online multi-object tracking. Yu et al.[9] formulated online multi-object tracking as decision making in Markov decision processes (MDPs) to realize robust tracking. Jaeyong et al.[10] proposed a two-stage data association method to handle targets that drift due to occlusions and abrupt motion changes. Yang et al.[11] formulated data association as a min-cost multi-commodity flow problem for multi-person tracking. Unlike most previous work, where detection and association are considered separately, a target identity-aware network[12] was proposed to perform detection and data association simultaneously, with the solution optimized efficiently through Lagrangian relaxation.
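For intuition, the association step above can be sketched with a toy greedy nearest-neighbour matcher. This is a deliberate simplification, not any of the cited methods: the global network-flow formulations discussed above optimise over whole sequences, whereas this sketch matches a single pair of frames, and the gating threshold is an illustrative assumption.

```python
import math

def greedy_associate(tracks, detections, max_dist=50.0):
    """Greedy frame-to-frame data association by nearest centre distance.

    tracks, detections: lists of (x, y) box centres.
    max_dist: gating threshold; pairs farther apart are never matched.
    Returns a list of (track_idx, det_idx) matches, closest pairs first.
    """
    # Enumerate all candidate pairs, cheapest distance first.
    pairs = sorted(
        (math.dist(t, d), i, j)
        for i, t in enumerate(tracks)
        for j, d in enumerate(detections)
    )
    used_t, used_d, matches = set(), set(), []
    for dist, i, j in pairs:
        # Greedily commit a pair if both endpoints are still free.
        if dist <= max_dist and i not in used_t and j not in used_d:
            used_t.add(i)
            used_d.add(j)
            matches.append((i, j))
    return matches
```

Unmatched detections would start new trajectories and unmatched tracks would be marked as missed; the global methods cited above avoid the myopic mistakes such greedy matching can make.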

Although tracking-by-detection approaches have become increasingly popular, they depend heavily on the performance of the object detector. To improve tracking performance, different approaches have focused on designing better data association techniques[13-14] or on improving the detector itself[1,15-16].

Video segmentation[17-18] is the task of assigning the same label to all pixels that belong to the same semantic object throughout an entire video sequence, and is an important part of general scene understanding. Brox and Malik[17] exploited long-term motion information to assign labels to objects for object-level segmentation, which naturally handles occlusions. A reduced graph based on superpixels was used in Ref.[18] for image and video segmentation, reducing runtime and memory consumption significantly. Recently, segmentation-based methods have also been applied to tracking multiple targets. Bibby and Reid[19] derived a probabilistic framework for real-time visual tracking of multiple targets. Level sets[20] have become popular for many segmentation and tracking tasks due to their flexibility and computational efficiency in distinguishing foreground from background. A similar idea, specifically for pedestrian tracking, was proposed in Ref.[21], which used a contour tracker between sparse HOG detections to bridge the gaps; however, the contour representation is prone to fail when a target becomes occluded. Milan et al.[1] proposed a multi-label conditional random field (CRF) to handle tracking in crowded scenarios, exploiting superpixel information and detector responses jointly.

Fig.1 CRF model with collision avoidance between detections for two consecutive frames (the edges εDS are the highlight of our method)

2 Problem Formulation

2.1 Motivation

In this paper, we follow the tracking framework of Ref.[1], which formulates the multi-target tracking problem as a multi-label conditional random field (Fig.1). A graph is defined as G = (V, E), where V = vS ∪ vD is the set of nodes and E = εSS ∪ εST ∪ εDS is the set of edges between nodes; vS are superpixel nodes and vD are detection nodes. εSS and εST represent the spatial and temporal relations between superpixels respectively, while εDS represents the connection of superpixels with every detection node. Each node can take a label from the label set L = {1, …, N, φ}, where N is the number of trajectory hypotheses; each label in L is either a unique target ID or the background. The aim is to find the labeling V* that minimizes the Gibbs energy: V* = arg min E(V). The complete CRF energy is defined as[1]

E(V) = Σ_{v∈V} φ(v) + Σ_{(u,v)∈E} ψ(u, v)  (1)

where the unary potentials φ are defined on the superpixel and detection nodes, and the pairwise potentials ψ are defined over the edge set E.
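For intuition, evaluating one labeling under such a Gibbs energy takes only a few lines. The dictionary-based containers below are illustrative assumptions for a toy graph; the potentials in the actual model are far richer (they include the high-order superpixel-detection cliques of Ref.[1]):

```python
def gibbs_energy(labels, unary, pairwise, edges):
    """Evaluate the Gibbs energy E(V) of one labeling of a toy CRF.

    labels:   dict node -> assigned label (target ID or background).
    unary:    dict node -> {label: cost}, the per-node potentials.
    pairwise: function (label_u, label_v) -> cost for one edge.
    edges:    iterable of (u, v) node pairs from the edge set E.
    """
    # Sum of unary potentials over all nodes.
    energy = sum(unary[v][labels[v]] for v in labels)
    # Plus the pairwise potentials over all edges.
    energy += sum(pairwise(labels[u], labels[v]) for u, v in edges)
    return energy
```

Minimizing this energy over all labelings gives V*; in practice Ref.[1] uses approximate CRF inference rather than exhaustive search.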

However, this method does not take the relations between targets into account. As is well known, two different targets cannot share the same position in 3D space. Thus, motivated by Refs.[22-23], we construct a pairwise potential that incorporates spatial constraints between detections to avoid collisions, and the spatial edges εDS are added to the graph, which turns out to be a practical way to exploit such information (Fig.1). The pairwise potentials for collision avoidance are added to the pairwise potentials ψ in Eq. (1). Besides, we propose a new distance measure, based on color similarity and position proximity, to cluster superpixels.

2.2 Pairwise potentials for collision avoidance

Collision avoidance means that two targets cannot occupy the same space simultaneously. Therefore a continuous penalty term[22-23] is utilized to enforce collision avoidance together with a unique data association.

ψ(X_i^t, X_j^t) = s_g^2 / ‖X_i^t − X_j^t‖^2  (2)

where X_i^t and X_j^t denote the positions of targets i and j at frame t, and s_g is a scale constant; the penalty grows rapidly as two targets approach each other.

This formulation takes into account the pairwise distances between all targets in all frames, so the actual overlap of targets is checked at all times. In addition, if two targets would collide, the continuous penalty separates them as much as possible, ensuring that one detection can be explained by at most one target.
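As a hedged sketch of the inverse-squared-distance penalty of Eq. (2), the single-frame term can be computed as below; the default scale constant and the 2D ground-plane positions are illustrative assumptions, not values from the paper:

```python
import math

def collision_penalty(positions, s_g=1.0):
    """Continuous collision-avoidance penalty for one frame.

    positions: list of (x, y) ground-plane target positions.
    s_g: scale constant controlling how strongly proximity is punished.
    Returns the summed pairwise penalty s_g**2 / dist**2 over all pairs.
    """
    total = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            dx = positions[i][0] - positions[j][0]
            dy = positions[i][1] - positions[j][1]
            d2 = dx * dx + dy * dy
            # The penalty blows up as two targets approach each other;
            # the epsilon only guards against division by zero.
            total += s_g ** 2 / max(d2, 1e-9)
    return total
```

Summing this term over all frames and adding it to the pairwise potentials ψ in Eq. (1) pushes the minimizer toward trajectories in which no two targets overlap.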

2.3 Superpixel clustering in Lab-XY space

Superpixels have proved very useful in computer vision applications: they capture the redundancy of an image, support local feature computation, and greatly reduce the complexity of subsequent image processing tasks. A superpixel groups perceptually similar pixels (in color, texture, transparency, etc.) into a visually meaningful entity.

In Ref.[1], a linear SVM is trained online for each sequence to separate foreground from background, and only Lab color features are used to obtain the positive and negative training samples (superpixels) by clustering. Intuitively, the closer a superpixel is to an object, the more likely it belongs to that object. Hence, position information is incorporated, and superpixels are clustered jointly in Lab color space and the x, y image-plane space, based on color similarity and spatial proximity:

D_s = d_lab + λ·d_xy,  d_lab = ‖(l, a, b)_i − (l, a, b)_j‖,  d_xy = ‖(x, y)_i − (x, y)_j‖  (3)

where D_s is the combined distance over the Lab color channels and the x, y image coordinates. The weight λ is introduced to balance the color and position information.
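The combined distance of Eq. (3) and a nearest-center assignment step can be sketched as follows. The 5-tuple layout and the default λ are illustrative assumptions; the paper does not fix a specific λ here:

```python
import math

def lab_xy_distance(p, q, lam=0.5):
    """Combined Lab-color and image-plane distance between two superpixels.

    p, q: 5-tuples (l, a, b, x, y) of mean color and centroid position.
    lam:  weight balancing position against color (the lambda of Eq. (3)).
    """
    # Euclidean distance in Lab color space.
    d_lab = math.sqrt(sum((p[i] - q[i]) ** 2 for i in range(3)))
    # Euclidean distance in the x, y image plane.
    d_xy = math.sqrt((p[3] - q[3]) ** 2 + (p[4] - q[4]) ** 2)
    return d_lab + lam * d_xy

def assign_clusters(superpixels, centers, lam=0.5):
    """Assign each superpixel to its nearest center under the combined distance."""
    return [min(range(len(centers)),
                key=lambda k: lab_xy_distance(s, centers[k], lam))
            for s in superpixels]
```

Iterating this assignment with a center-update step would give a k-means style clustering in the 5D Lab-XY space; the assignments then yield the positive and negative training samples for the online SVM.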

3 Experiment

3.1 Datasets

Experiments are conducted to evaluate the performance of our tracker on several public video sequences, including PETS09-S2L1[24], TUD-Campus[25] and TUD-Stadtmitte[26]. PETS09-S2L1 is a widely used sequence showing up to 8 walking pedestrians, partly in unusual motion patterns. TUD-Campus includes only 71 frames of side-view pedestrians, while TUD-Stadtmitte is a real-world video filmed with low-angle shots and frequent occlusions, making it quite challenging. We report quantitative comparisons with state-of-the-art methods as well as visualized results of the proposed approach.

3.2 Metrics

Conducting a fair quantitative comparison of MOT approaches is challenging for many reasons. On one hand, tracking-by-detection methods rely heavily on the detector used; on the other hand, the definition of a tracker output may be ambiguous. Evaluating MOT algorithms is therefore not only crucial for measuring performance but also difficult, with no single established protocol. Here, we follow the widely accepted CLEAR MOT metrics[27], and also adopt further popular metrics according to Ref.[28], including:

① MOTA (↑) (abbreviated as TA): the multi-object tracking accuracy combines three major error types, false positives (FP), missed targets (FN) and identity switches (ID), into one number, such that a score of 100 percent corresponds to no errors.

② MOTP (↑) (abbreviated as TP): the multi-object tracking precision is a measure of localization error, evaluating the alignment of the tracker output with the ground truth. This metric averages the bounding box overlap over all tracked targets, where 100 percent again reflects a perfect result.

③ Rcll (↑): correctly matched detections / total detections in the ground truth.

④ Prcn (↑): correctly matched detections / total detections in the tracking result.

⑤ MT (↑) (mostly tracked): the ratio of mostly tracked trajectories, i.e. ground-truth trajectories that are successfully tracked for at least 80% of their length.

⑥ ML (↓) (mostly lost): the ratio of mostly lost ground-truth trajectories, i.e. those successfully tracked for less than 20% of their length; it is computed on entire trajectories.

⑦ ID (↓) (ID switch): the number of times a tracked trajectory changes its matched ID.

⑧ FM (↓) (fragmentation): the number of track fragmentations, counting how many times each track is interrupted.

Items marked ↑ are better when higher; items marked ↓ are better when lower.
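As a hedged sketch (the event counts below are hypothetical inputs, not results from the paper), MOTA, recall and precision follow directly from the per-sequence counts accumulated during evaluation:

```python
def clear_mot_summary(gt, fp, fn, ids, matched, out):
    """CLEAR MOT style summary scores from accumulated event counts.

    gt:      total ground-truth detections over the sequence.
    out:     total detections produced by the tracker.
    matched: correctly matched detections.
    fp, fn, ids: false positives, missed targets, identity switches.
    Returns (MOTA, Rcll, Prcn) as fractions; MOTA can be negative
    when the combined error count exceeds the ground-truth count.
    """
    mota = 1.0 - (fp + fn + ids) / gt
    rcll = matched / gt
    prcn = matched / out
    return mota, rcll, prcn
```

MOTP is computed separately, as the average bounding-box overlap of the matched pairs, and so needs the per-match overlaps rather than these counts.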

3.3 Results

The proposed method is compared with the state-of-the-art tracker of Ref.[1], which does not employ collision avoidance between detections; the comparison results are presented in Tab. 1. Fig. 2 shows qualitative results of our method on the three sequences. As expected, the tracking performance is improved significantly by adding the collision avoidance constraint.

Tab.1 Comparison of our approach to SegTrack[1]

Fig.2 Qualitative results of our method on three publicly available sequences

4 Conclusions

In this paper, a multi-object tracker based on segmentation and collision avoidance is proposed. Compared with previous state-of-the-art approaches, one of our main contributions is that the pairwise potential is constructed by incorporating spatial constraints between objects to enforce collision avoidance in crowded scenes. Moreover, a new distance measure based on color similarity and position proximity is presented to cluster superpixels in the 5D Lab-XY space. The experimental results show significant improvements on public benchmarks.

The proposed method requires the intrinsic and extrinsic camera parameters, which can be inferred by structure-from-motion methods. However, unavoidable errors in the inferred camera parameters affect the tracking performance: the larger the error, the worse the performance. Jointly estimating precise camera parameters and improving multi-object tracking is therefore a direction for future research.

References

[1] Milan A, Leal-Taixé L, Schindler K, et al. Joint tracking and segmentation of multiple targets[C]∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, USA, 2015:5397-5406.

[2] Pirsiavash H, Ramanan D, Fowlkes C. Globally-optimal greedy algorithms for tracking a variable number of objects[C]∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Colorado, USA, 2011:1201-1208.

[3] Zhang L, Li Y, Nevatia R. Global data association for multi-object tracking using network flows[C]∥IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, Alaska, USA, 2008:1-8.

[4] Berclaz J, Fleuret F, Turetken E. Multiple object tracking using k-shortest paths optimization[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011,33(9):1806-1819.

[5] Butt A, Collins R. Multi-target tracking by Lagrangian relaxation to min-cost network flow[C]∥IEEE Conference on Computer Vision and Pattern Recognition, Portland, Oregon,USA, 2013:1846-1853.

[6] Wang S, Charless C F. Learning optimal parameters for multi-target tracking with contextual interactions[J]. International Journal of Computer Vision, 2017,122(3):484-501.

[7] Alex B, Lionel O, Fabio R, et al. Alextrac: affinity learning by exploring temporal reinforcement within association chains[C]∥IEEE International Conference on Robotics and Automation, Stockholm, Sweden, 2016:2212-2218.

[8] Loïc F B, Romaric A, Yoann D, et al. Improving multi-frame data association with sparse representations for robust near-online multi-object tracking[C]∥ European Conference on Computer Vision, Amsterdam, The Netherlands, 2016:774-790.

[9] Yu Xiang, Alexandre Alahi, Silvio Savarese. Learning to track: online multi-object tracking by decision making[C]∥Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 2015: 4705-4713.

[10] Jaeyong J, Daehun K, Bonhwa K, et al. Online multi-person tracking with two-stage data association and online appearance model learning[J]. IET Computer Vision, 2017,11(1):87-95.

[11] Yang M, Wu Y, Jia Y D. A hybrid data association framework for robust online multi-object tracking[J]. IEEE Transactions on Image Processing, 2017,26(12):5667-5679.

[12] Dehghan A, Tian Y, Torr P H S, et al. Target identity-aware network flow for online multiple target tracking[C]∥Computer Vision and Pattern Recognition, Boston, USA, 2015:1146-1154.

[13] Zamir A R, Dehghan A, Shah M. Gmcp-tracker: global multi-object tracking using generalized minimum clique graphs[C]∥ European Conference on Computer Vision,Firenze,Italy, 2012:343-356.

[14] Dehghan A, Modiri A S, Shah M. Gmmcp tracker: Globally optimal generalized maximum multi clique problem for multiple object tracking[C]∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, USA, 2015:4091-4099.

[15] Shu G, Dehghan A, Oreifej O, et al. Part-based multiple-person tracking with partial occlusion handling[C]∥Computer Vision and Pattern Recognition, Rhode Island, USA, 2012:1815-1821.

[16] Shu G, Dehghan A, Shah M. Improving an object detector and extracting regions using superpixels[C]∥ Computer Vision and Pattern Recognition, Oregon,USA, 2013:3721-3727.

[17] Brox T, Malik J. Object segmentation by long term analysis of point trajectories[C]∥ European Conference on Computer Vision, Crete, Greece, 2010:282-295.

[18] Galasso F, Keuper M, Brox T, et al. Spectral graph reduction for efficient image and streaming video segmentation[C]∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Ohio,USA, 2014:49-56.

[19] Bibby C, Reid I. Real-time tracking of multiple occluding objects using level sets[C]∥IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 2010: 1307-1314.

[20] Horbert E, Rematas K, Leibe B. Level-set person segmentation and tracking with multi-region appearance models and top-down shape information[C]∥International Conference on Computer Vision, Barcelona, Spain, 2011: 1871-1878.

[21] Mitzel D, Horbert E, Ess A, et al. Multi-person tracking with sparse detection and continuous segmentation[C]∥European Conference on Computer Vision, Crete, Greece, 2010:397-410.

[22] Andriyenko A, Schindler K. Multi-target tracking by continuous energy minimization[C]∥ Computer Vision and Pattern Recognition, Colorado, USA, 2011: 1265-1272.

[23] Milan A, Roth S, Schindler K. Continuous energy minimization for multitarget tracking[J]. Pattern Analysis and Machine Intelligence, 2014,36(1): 58-72.

[24] Ferryman J, Shahrokni A. PETS2009: dataset and challenge[C]∥ IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, Miami, Florida,USA, 2009:1-6.

[25] Andriluka M, Roth S, Schiele B. People-tracking-by-detection and people detection-by-tracking[C]∥ Computer Vision and Pattern Recognition, Alaska, USA, 2008:1-8.

[26] Andriluka M, Roth S, Schiele B. Monocular 3D pose estimation and tracking by detection[C]∥ Computer Vision and Pattern Recognition, San Francisco, CA, USA, 2010: 623-630.

[27] Bernardin K, Stiefelhagen R. Evaluating multiple object tracking performance: the CLEAR MOT metrics[J]. Journal on Image and Video Processing,2008, 2008(1):246309.

[28] Li Y, Huang C, Nevatia R. Learning to associate: Hybrid-boosted multi-target tracker for crowded scene[C]∥Computer Vision and Pattern Recognition, Miami, Florida, USA, 2009:2953-2960.