
General and robust voxel feature learning with Transformer for 3D object detection

2022-04-18

LI Yang, GE Hongwei

(1. Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Wuxi 214122, China; 2. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China)

Abstract: Self-attention networks and Transformer have dominated machine translation and natural language processing, and have shown great potential in vision tasks such as image classification and object detection. Inspired by the great progress of Transformer, we propose a novel general and robust voxel feature encoder for 3D object detection based on the traditional Transformer. We first investigate the permutation invariance of self-attention over sequence data and apply it to point cloud processing. Then we construct a voxel feature layer based on self-attention to adaptively learn local and robust context of a voxel according to the spatial relationship and context information exchange between all points within the voxel. Lastly, we construct a general voxel feature learning framework with the voxel feature layer as its core for 3D object detection. The voxel feature with Transformer (VFT) can be easily plugged into any other voxel-based 3D object detection framework and serves as the backbone for voxel feature extraction. Experimental results on the KITTI dataset demonstrate that our method achieves state-of-the-art performance on 3D object detection.

Key words: 3D object detection; self-attention networks; voxel feature with Transformer (VFT); point cloud; encoder-decoder

0 Introduction

3D object detection plays an important role in many emerging industries such as autonomous driving and robotics. LiDAR sensors are widely used in these fields owing to the accurate 3D structure and depth information they acquire directly. However, the point cloud is unordered and irregular, and thus essentially different from a 2D image, so many approaches have been proposed for the challenging task of point cloud feature learning for 3D object detection.

In order to make full use of the powerful feature learning capabilities of mature convolution neural networks, recent research has converted the irregular point cloud into regular and discrete representations. These methods can be roughly classified into two types: projection-based methods and voxel-based methods. Among the projection-based methods, multi-view 3D (MV3D)[1] encodes point clouds with a compact bird's-eye view and a front view, generates 3D proposals by applying 2D convolution neural networks on the bird's-eye view feature map, and fuses the features of the different views to produce the final 3D bounding boxes. Aggregate view object detection (AVOD)[2] first uses a 2D convolution network and a feature pyramid network (FPN) to obtain feature maps, then fuses multi-view features, and generates 3D proposals by crop and resize operations. Oriented 3D object detection from PIXel-wise neural network predictions (PIXOR)[3] represents the point cloud by a dense and compact bird's-eye view, and high-level point cloud features are extracted by a 2D convolution network. However, geometric information is collapsed during the projection.

For the voxel-based methods, the point cloud is converted into regular 3D voxels uniformly distributed in space, and different modules are designed to extract voxel features; 2D or 3D convolution neural networks then learn high-level feature representations from the input voxel features. VoxelNet[4] first subdivides the 3D space into equally spaced voxels and uses its voxel feature extractor (VFE), a fully connected network (FCN) followed by max-pooling, to learn voxel features. The FCN consists of a linear layer, a batch normalization (BN) layer and a rectified linear unit (ReLU) layer. Sparsely embedded convolutional detection (Second)[5] utilizes sparse convolution to reduce the amount of computation and improve inference performance. PointPillar[6] organizes the point cloud into vertical columns (pillars) and uses PointNet[7] to learn the feature of each column; the encoded column features are further processed by 2D convolutions to improve detection speed. Structure aware single-stage 3D detection (SA-SSD)[8] effectively utilizes object structure information to improve object localization accuracy: the point cloud is voxelized and an auxiliary network uses the learned multi-layer features to perform a supervised point cloud segmentation task, which makes the multi-layer features effectively represent the object structure information. However, a common problem of the VFE and PointNet-based methods is that the FCN only produces point-wise feature representations, and the voxel feature is obtained by a max-pooling operation without considering the spatial relationship and semantic information between points. Besides, due to the sparseness of the point cloud, there is not enough point information in a voxel, and the detection performance degrades when objects are far away or small.

PointNet[7] is a pioneering end-to-end deep neural network which uses permutation-invariant multi-layer perceptrons and max-pooling to learn global features of the point cloud. PointNet++[9] utilizes ball query grouping and hierarchical PointNet to learn local features of point clouds at different scales. PointRCNN[10] is a two-stage detection framework which uses PointNet++ as its backbone for point feature learning and generates 3D bounding boxes in a bottom-up scheme in the first stage, and then refines the 3D bounding boxes in canonical coordinates in the second stage. The 3D single stage object detector (3DSSD)[11] combines two sampling strategies based on distance and on features, and successfully removes the time-consuming feature upsampling layer; experiments show that it outperforms PointRCNN[10] while maintaining real-time performance. These PointNet-based methods directly process the raw point cloud without information loss; however, a large number of points leads to high computation and memory consumption. Besides, the point-wise features from multi-layer perceptrons cannot aggregate a global context in which the spatial relationships and context information exchange between all the points are fully utilized, and the max-pooling operation simply aggregates features in a hardcoded way without adaptively considering the different contribution of each point. In this work, we propose to solve the problem of ignoring the spatial relationship and context between points in voxel-based and PointNet-based methods by exploiting the advantages Transformer has shown in other fields.

Since Transformer[12] was proposed, it has been the dominant framework in the natural language processing (NLP) field. It is an encoder-decoder architecture which consists of input embedding, position embedding and self-attention. Input embedding vectorizes the input and outputs distinguishable features. Position embedding encodes the position information of the features. As the core component of Transformer, the self-attention network is naturally suitable for processing point cloud data because it is invariant to the permutation of the points in a sequence. Self-attention is able to learn global context, and outputs for each input an attention feature which is related to all input features. The biggest benefit of Transformer is parallelization, which boosts the training speed. In the NLP field, Bahdanau et al.[13] were the first to combine a neural machine translation method with an attention mechanism. Self-attention[14] was proposed to visualize and interpret sentence embeddings. Based on these methods, Transformer was proposed for machine translation, relying solely on self-attention. After that, the bidirectional transformer[15] was proposed, which is one of the most powerful models in the NLP field. Recently, inspired by the success of Transformer, Transformer-based methods have achieved state-of-the-art performance on computer vision tasks. Hu et al.[16] and Ramachandran et al.[17] apply self-attention within local image patches. Zhao et al.[18] develop a family of vector self-attention operators. ViT[19] was proposed for the image classification task, in which the image is split into patches, and the results show that Transformer outperforms a traditional convolution neural network. Carion et al.[20] propose an end-to-end detection framework that combines the feature extraction ability of a convolution neural network (CNN) with the power of Transformer.

In this work, we propose a novel approach to deep learning on point clouds, voxel feature with Transformer (VFT), for 3D object detection. It eliminates the defect of ignoring the spatial relationship and context between points in PointNet-based methods. The core idea of VFT is to use the permutation invariance of self-attention networks, which is naturally suitable for point cloud processing. We design an adaptive and robust voxel feature layer, the core of VFT, for voxel feature learning. The voxel feature layer is permutation invariant to the unordered and irregular points in a voxel, and exploits the ability of self-attention to adaptively extract global context based on the spatial relationship and context information exchange between points within a voxel. Based on the VFT, we construct a general voxel feature learning framework for 3D object detection. Note that the VFT can be easily adapted to any other voxel-based detection network and serves as a general backbone for voxel feature extraction. Specifically, we construct two detection networks, one based on Second[5] and the other based on PointPillar[6], both of which achieve better detection performance. We conduct experiments and ablation studies to evaluate the effectiveness of the proposed VFT.

1 Deep learning based on VFT

We start by briefly introducing the three main parts of Transformer: input embedding, position embedding and self-attention. Then we introduce our proposed feature embedding that combines the input embedding and position embedding. Next, we introduce the voxel feature layer, the core of voxel feature learning with Transformer. Lastly, we present the overall architecture of our framework for 3D object detection.

1.1 Three main parts of Transformer

Transformer is an encoder-decoder architecture, each of which consists of three main modules: input embedding, position embedding and self-attention. Input embedding converts the input point into a vector space and represents each input data with a discriminative feature vector. Position embedding encodes the position information of each input data. Input embedding and position embedding are added and sent to self-attention, which is expressed as

y = φ(x) + θ,    (1)

where x ∈ R^d is a single input datum; φ is the input embedding; θ is the position embedding function; and y ∈ R^e is the embedded feature.

Self-attention first computes three vectors for the inputs, query, key and value, through the learnable matrices Q, K and V, which are expressed as

query = yQ, key = yK, value = yV,
query, key, value ∈ R^k, Q, K, V ∈ R^{e×k}.    (2)

Then, the attention layer computes the scalar product between the query vector of each feature and the key vectors of all features, and uses it as the attention weight between the features, namely

A = (A_1, …, A_L), A_j = query · key_j, j ∈ {1, …, L},    (3)

where L is the sequence length of the input data.

Finally, the attention weights are normalized and the attention feature F_att is aggregated as the weighted sum of all value vectors, that is

Â_j = exp(A_j) / Σ_{l=1}^{L} exp(A_l),    (4)

F_att = Σ_{j=1}^{L} Â_j · value_j.    (5)

After the encoding, all the input features have been encoded into intermediate attention features, each of which is related to all input features. Similar to the encoding procedure, the decoder takes the intermediate attention feature, the output embedding and the position embedding as input, and outputs a sequence of features, each of which contains global context. In our work, based on VFT, we want a single global feature that represents the structure and context of each voxel, so we implement a simplified decoder. We discard most parts of the decoder of the traditional Transformer, such as the output embedding and position embedding, and only take a start decoding flag EOS_BEGIN and the intermediate attention feature as input, outputting a single global context rather than a sequence. The details of our encoder and decoder are explained in section 1.3.
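To make the computation of Eqs.(2)-(5) concrete, the following minimal PyTorch sketch implements this self-attention step for the L points of one voxel; the class name SimpleSelfAttention is ours and the softmax normalization follows the traditional Transformer, not any released code of this paper.

```python
import torch
import torch.nn as nn

class SimpleSelfAttention(nn.Module):
    """Minimal self-attention over the L embedded points of one voxel."""
    def __init__(self, embed_dim: int, attn_dim: int):
        super().__init__()
        # learnable matrices Q, K, V of Eq.(2)
        self.q = nn.Linear(embed_dim, attn_dim, bias=False)
        self.k = nn.Linear(embed_dim, attn_dim, bias=False)
        self.v = nn.Linear(embed_dim, attn_dim, bias=False)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (L, e) embedded features of the L points
        query, key, value = self.q(y), self.k(y), self.v(y)   # Eq.(2)
        weights = query @ key.transpose(-2, -1)               # Eq.(3): scalar products
        weights = torch.softmax(weights, dim=-1)              # Eq.(4): normalized attention weights
        return weights @ value                                # Eq.(5): weighted sum of value vectors
```

Permuting the rows of y permutes the output rows in the same way, so any symmetric aggregation of the outputs is invariant to the point ordering, which is the property exploited for point clouds.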

1.2 Feature embedding

Input embedding and position embedding are important parts of self-attention, making it possible to learn local spatial relationships within the input points. Input embedding transforms the input points into a vector space in which points with more similar spatial relationships and semantics lie closer together. Position embedding encodes the position information and is added to the input embedding. Traditional position embeddings for sequential input data are designed manually, such as sine and cosine functions. When processing point clouds, the 3D coordinates are natural candidates for position embedding. Observing that the input embedding and position embedding are performed separately in the traditional Transformer, we go beyond this by introducing a feature embedding which encodes the input data and position information jointly as

f = φ(x_i, p_i − p_voxel, p_i − p_cluster),    (6)

where x_i and p_i are the point feature and the 3D coordinates of point i, respectively; p_voxel is the 3D coordinate of the voxel center; p_cluster is the mean 3D coordinate of the points within the voxel; and the encoding function φ(·) consists of a linear layer and a batch normalization (BN) layer.

As for the output embedding of the decoder, since our decoder aims to output a single global context rather than a sequence, we simply discard it. Instead, we use a single start decoding flag EOS_BEGIN. Since the 3D point cloud lies in a continuous metric space, it is not appropriate to set the flag to a fixed vector or a vector of all ones or zeros. We represent it by a learnable parameter which is implemented by a linear layer. The flag aims to aggregate the global context from the attention feature output by the encoder.
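A minimal sketch of the joint feature embedding of Eq.(6) and the learnable EOS_BEGIN flag, assuming a PyTorch implementation; the module name, tensor shapes and the plain nn.Parameter realization of the flag are our assumptions (the paper realizes the flag with a linear layer).

```python
import torch
import torch.nn as nn

class FeatureEmbedding(nn.Module):
    """Eq.(6): encode point features and relative positions jointly with a linear layer + BN."""
    def __init__(self, in_dim: int, embed_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, embed_dim)
        self.bn = nn.BatchNorm1d(embed_dim)

    def forward(self, x, p, p_voxel, p_cluster):
        # x: (N, T, C) point features; p: (N, T, 3) point coordinates
        # p_voxel, p_cluster: (N, 1, 3) voxel center / mean of the points in each voxel
        f = torch.cat([x, p - p_voxel, p - p_cluster], dim=-1)
        f = self.linear(f)
        return self.bn(f.transpose(1, 2)).transpose(1, 2)  # BN over the feature dimension

# the start decoding flag as a learnable parameter, broadcast to every voxel at run time
eos_begin = nn.Parameter(torch.randn(1, 1, 64))
```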

1.3 Voxel feature layer

The voxel feature layer is composed of an encoder, decoder and residual connection, as shown in Fig.1.

Fig.1 Voxel feature layer

1.3.1 Encoder

As shown in encoder part of Fig.1, the encoder consists of a feature embedding, a self-attention and two linear layers. The feature embedding first encodes the input points and position information jointly and outputs discriminative feature vector for each input point. Then self-attention takes the feature vector as input and promotes information exchange between these discriminative feature vectors, producing intermediate attention feature vectors for each input point as its output.

The first linear layer further conducts feature learning on the attention feature by transforming it to higher level feature space. The second linear layer reduces the dimension of feature and connects with the attention feature in a residual way. Then the connected feature will be sent to the second self-attention of decoder to play the role of key and value.

1.3.2 Decoder

The core task of the proposed decoder is to aggregate the attention feature from the encoder and output a global context which represents the whole structure and semantic information of the voxel. The traditional decoder of Transformer takes the output embedding, position embedding and mask information as inputs and finally outputs a sequence in the NLP field; we discard these inputs. We implement a simplified decoder which takes the start decoding flag EOS_BEGIN and the attention feature from the encoder as inputs. As shown in the decoder part of Fig.1, the decoder is composed of two self-attention networks and two linear layers. Firstly, the EOS_BEGIN flag, which is a learnable parameter, is sent to the first self-attention. Then the attention feature from the encoder and the output of the first self-attention are sent into the second self-attention to aggregate the global context, in which the attention feature plays the role of key and value simultaneously, and the output of the first self-attention plays the role of query. The two linear layers have the same function as in the encoder. Lastly, we augment the feature embedding with the voxel-wise global context and send it to the subsequent stacked voxel feature layer to further learn a descriptive voxel feature, as sketched below.
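The following sketch assembles one voxel feature layer as in Fig.1, assuming PyTorch; nn.MultiheadAttention with a single head stands in for the self-attention blocks described above, and the hidden width of the two linear layers is an illustrative choice.

```python
import torch
import torch.nn as nn

class VoxelFeatureLayer(nn.Module):
    """Encoder-decoder voxel feature layer of Fig.1 (sketch)."""
    def __init__(self, dim: int):
        super().__init__()
        self.enc_attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.enc_fc = nn.Sequential(nn.Linear(dim, 2 * dim), nn.Linear(2 * dim, dim))
        self.eos_begin = nn.Parameter(torch.randn(1, 1, dim))   # learnable start decoding flag
        self.dec_attn1 = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.dec_attn2 = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.dec_fc = nn.Sequential(nn.Linear(dim, 2 * dim), nn.Linear(2 * dim, dim))

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (N, T, dim) embedded features of the T points in each of N voxels
        a, _ = self.enc_attn(f, f, f)                 # encoder self-attention
        enc = self.enc_fc(a) + a                      # two linear layers + residual connection
        flag = self.eos_begin.expand(f.size(0), -1, -1)
        q, _ = self.dec_attn1(flag, flag, flag)       # first decoder self-attention on EOS_BEGIN
        ctx, _ = self.dec_attn2(q, enc, enc)          # encoder output serves as key and value
        ctx = self.dec_fc(ctx) + ctx                  # (N, 1, dim) voxel-wise global context
        # augment each point's embedding with the global context for the next stacked layer
        return torch.cat([f, ctx.expand(-1, f.size(1), -1)], dim=-1)
```

A stacked layer would take the doubled feature width as its input dimension, or project it back to dim first.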

1.4 Overall architecture

1.4.1 Architecture

We construct the complete 3D object detection framework based on the voxel feature layer. Note that the voxel feature layer is the core voxel feature aggregation operator in the whole network and can be adapted to any other voxel-based network as the voxel feature backbone. The whole process is divided into four steps: preprocessing, voxel feature learning, convolution networks and detection heads, as outlined in the skeleton below. Firstly, in the preprocessing, the point cloud is represented by spatially uniformly distributed voxels or vertical columns (pillars). Secondly, voxel feature learning uses the stacked voxel feature layers, which have an encoding-decoding structure, to adaptively learn the voxel feature based on the spatial relationship and context information between points in the voxels or pillars. Thirdly, 2D or 3D convolution is used, depending on the point cloud representation, to expand the receptive field and extract high-level feature representations. Lastly, the detection head performs the classification and regression tasks.
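The four steps can be summarized by the skeleton below; every sub-module passed to the constructor is a placeholder for the components described above (voxelizer, stacked voxel feature layers, convolution middle layers and detection head), not a concrete implementation.

```python
import torch.nn as nn

class VFTDetector(nn.Module):
    """High-level sketch of the detection pipeline in Fig.2."""
    def __init__(self, voxelizer, voxel_feature_layers, middle_conv, detection_head):
        super().__init__()
        self.voxelizer = voxelizer                        # step 1: voxels or pillars
        self.vfl = nn.ModuleList(voxel_feature_layers)    # step 2: stacked voxel feature layers (VFT)
        self.middle_conv = middle_conv                    # step 3: 2D/3D convolution backbone
        self.head = detection_head                        # step 4: classification and regression

    def forward(self, points):
        feats, coords = self.voxelizer(points)
        for layer in self.vfl:
            feats = layer(feats)
        spatial_features = self.middle_conv(feats, coords)
        return self.head(spatial_features)
```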

1.4.2 Loss

In the detection head, we adopt the same loss functions and settings as in PointPillar[6]. The overall loss is composed of the object classification loss, the localization residual regression loss and the direction classification loss. For the object classification loss, the focal loss[21] is adopted as

L_cls = −α_t(1 − p_t)^λ log p_t,    (7)

where p_t is the class prediction of an anchor. We use the same settings of α_t = 0.25 and λ = 2 as in the original paper[21].
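A direct transcription of Eq.(7) in PyTorch, assuming the probability p_t of the true class has already been gathered per anchor; the clamp is only for numerical safety.

```python
import torch

def focal_loss(p_t: torch.Tensor, alpha_t: float = 0.25, lam: float = 2.0) -> torch.Tensor:
    """Eq.(7): per-anchor focal loss with alpha_t = 0.25 and lambda = 2."""
    return -alpha_t * (1.0 - p_t) ** lam * torch.log(p_t.clamp(min=1e-6))
```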

The smooth-L1 loss is utilized for the localization residual regression loss as

L_loc = Σ_{Δv ∈ (Δx, Δy, Δz, Δw, Δl, Δh, Δθ)} smooth_L1(Δv),    (8)

where Δv is the residual between the anchor prediction and the ground truth. smooth_L1 is expressed as

smooth_L1(x) = 0.5x², if |x| < 1; |x| − 0.5, otherwise.    (9)

The cross-entropy loss is used for the direction classification. The overall training loss is

L = β_1 L_cls + β_2 L_loc + β_3 L_dir,    (10)

where β_1, β_2 and β_3 are set to 1.0, 2.0 and 0.2, respectively, the same values as in PointPillars[6].
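The localization and overall losses of Eqs.(8)-(10) can be transcribed as follows; the smooth-L1 branch threshold of 1 is the common convention assumed here.

```python
import torch

def smooth_l1(delta: torch.Tensor) -> torch.Tensor:
    """Eq.(9): element-wise smooth-L1 applied to the localization residuals Δv."""
    abs_d = delta.abs()
    return torch.where(abs_d < 1.0, 0.5 * abs_d ** 2, abs_d - 0.5)

def total_loss(l_cls, l_loc, l_dir, beta1: float = 1.0, beta2: float = 2.0, beta3: float = 0.2):
    """Eq.(10): weighted sum of the three losses with the PointPillars weights."""
    return beta1 * l_cls + beta2 * l_loc + beta3 * l_dir
```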

2 Experiments and analysis

In this section, we first present the experimental setup of our framework and compare our method with other state-of-the-art methods on the public KITTI benchmark[22]. Then, we give a comprehensive analysis of the proposed voxel feature learning with Transformer in ablation studies.

2.1 Experiments

2.1.1 Dataset and metric

All experiments are conducted on the competitive KITTI benchmark for 3D object detection and BEV object detection. The KITTI dataset consists of 7 481 training samples and 7 518 testing samples, classified into three categories: car, pedestrian and cyclist. For each class, experiments are conducted at three difficulty levels: easy, moderate and hard. The difficulty level is determined by the object size, occlusion and truncation. We adopt the approach in MV3D[1] to split the 7 481 training samples into a training set of 3 712 samples and a validation set of 3 769 samples. All experimental results are evaluated by the average precision (AP) over 11 recall positions and the three-class mean average precision (mAP) for 3D and BEV object detection, computed as sketched below.
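For reference, the 11-recall-point AP used for the results reported here can be computed as in the generic sketch that follows (an illustration, not the official KITTI evaluation code).

```python
import numpy as np

def ap_11_point(recall: np.ndarray, precision: np.ndarray) -> float:
    """Average, over recall thresholds 0.0, 0.1, ..., 1.0, of the maximum
    precision attained at any recall >= the threshold."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recall >= r
        ap += float(precision[mask].max()) if mask.any() else 0.0
    return ap / 11.0
```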

2.1.2 Settings

As shown in the preprocessing step and the convolution middle layers in Fig.2, different styles of voxelization and convolution middle layers can be used. Based on that, we implement two detection networks, a Voxel-style_VFT version and a Pillar-style_VFT version; their settings are summarized below. Specifically, for the Voxel-style_VFT version, the point cloud is first voxelized with a voxel size of (0.05 m, 0.05 m, 0.1 m) along the X, Y and Z axes, respectively. Then we use two stacked voxel feature layers for voxel feature learning. In the convolution middle layers, we use three encode_blocks instead of the one encode_block in Second[5] to extract middle features, each of which consists of a 3D sparse convolution and two submanifold convolutions. Matching for the car, pedestrian and cyclist classes uses positive thresholds of 0.6, 0.35 and 0.35, and negative thresholds of 0.45, 0.2 and 0.2, respectively. For the Pillar-style_VFT version, the point cloud is first voxelized with a voxel size of (0.16 m, 0.16 m, 4 m). Then we use one voxel feature layer for voxel feature learning. The convolution middle layers are the same as the convolution layers in PointPillar[6]. Matching for the car, pedestrian and cyclist classes uses positive thresholds of 0.6, 0.5 and 0.5, and negative thresholds of 0.45, 0.35 and 0.35, respectively. The two detection networks are both trained with the AdamW optimizer with the same weight decay and betas but different initial learning rates: 0.0018 for the Voxel-style_VFT version and 0.001 for the Pillar-style_VFT version. The weight decay is 0.01 and the betas are 0.95 and 0.99.
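The settings of the two variants described above can be summarized as follows; the values are taken from the text, while the dictionary keys are ours and do not correspond to any particular configuration file.

```python
# Voxel-style_VFT
VOXEL_STYLE_VFT = {
    "voxel_size_xyz_m": (0.05, 0.05, 0.1),
    "stacked_voxel_feature_layers": 2,
    "match_pos_threshold": {"car": 0.6, "pedestrian": 0.35, "cyclist": 0.35},
    "match_neg_threshold": {"car": 0.45, "pedestrian": 0.2, "cyclist": 0.2},
    "optimizer": {"type": "AdamW", "lr": 0.0018, "weight_decay": 0.01, "betas": (0.95, 0.99)},
}

# Pillar-style_VFT
PILLAR_STYLE_VFT = {
    "voxel_size_xyz_m": (0.16, 0.16, 4.0),
    "stacked_voxel_feature_layers": 1,
    "match_pos_threshold": {"car": 0.6, "pedestrian": 0.5, "cyclist": 0.5},
    "match_neg_threshold": {"car": 0.45, "pedestrian": 0.35, "cyclist": 0.35},
    "optimizer": {"type": "AdamW", "lr": 0.001, "weight_decay": 0.01, "betas": (0.95, 0.99)},
}
```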

Fig.2 Overall architecture of general detection framework

2.1.3 Data augmentation

2.2 Results on KITTI dataset

To evaluate the effectiveness of our proposed voxel feature learning with Transformer, we train the two networks, Voxel-style_VFT and Pillar-style_VFT, on the KITTI training set and evaluate them on the validation set. Because many public methods do not provide pedestrian and cyclist detection results on the validation set, we only compare car detection performance with other methods.

Table 1 shows the overall performance of Pillar-style_VFT and Voxel-style_VFT on the most important 3D object detection benchmark, the car class. Pillar-style_VFT and Voxel-style_VFT outperform previous mainstream state-of-the-art methods by a large margin; for example, compared with the voxel-based pioneer VoxelNet, Voxel-style_VFT increases the AP by 6.5%, 12.1% and 12.6%, and Pillar-style_VFT increases the AP by 3.2%, 10.6% and 6.6%. AVOD-FPN is a multi-modal method which uses both the RGB image and LiDAR, and F-PointNet[23] uses a fine-tuned 2D detector, while our networks are single-modal and trained from scratch.

Table 1 3D detection performance: AP for 3D boxes in KITTI validation set

Table 2 Bird’s view detection performance: AP for BEV boxes in KITTI validation

Table 2 shows the BEV detection performance of our networks. Our results are comparable to the state-of-the-art results, and the performance is better than those of VoxelNet and Second. Note that, as shown in Fig.3, Voxel-style_VFT outperforms the state-of-the-art methods by a significant margin at the hard difficulty level for both 3D detection and BEV detection, which shows that our method is robust and effective when objects are far away or occluded.

Fig.3 Performance comparison for car detection: (a) 3D detection performance comparison

2.3 Ablation studies

In this section, we conduct extensive ablation experiments to investigate the effectiveness of our proposed method.

2.3.1 Different voxel feature learning strategies

To prove the effect of our voxel feature learning with Transformer, we first re-implement the Second-baseline and PointPillar-baseline, and then replace their voxel feature extractors with our proposed VFT. Table 3 shows that the performance increases significantly when the PointNet-based voxel feature extractor, which consists of MLPs and max-pooling, is replaced with VFT. Specifically, Second+VFT outperforms the Second-baseline by 2.36%, 2.24% and 1.82% in 3D mAP at the easy, moderate and hard difficulty levels, respectively. PointPillar+VFT outperforms the PointPillar-baseline by 1.53%, 2.29% and 1.25% in 3D mAP at the easy, moderate and hard difficulty levels, respectively. The results prove that our proposed VFT is effective and robust, extracting global context according to the spatial relationship and context information exchange between all points within a voxel even when detecting hard difficulty level objects.

Table 3 Performance comparison of different voxel feature learning strategies

To further demonstrate the effectiveness of VFT, we visualized some detection results, as shown in Fig.4. Fig.4(a) represents the RGB image of the scene, and ground truth 2D and 3D bounding boxes are shown in Fig.4(b). Fig.4(c) represents the nine ground truth 3D bounding boxes in point cloud. Furthermore, Fig.4(d) shows the PointPillar-baseline detection results and Fig.4(e) shows the PointPillar+VFT detection results.

It is worth noting that the PointPillar-baseline fails to detect a car at a distance and a non-car object is misjudged as a car. Instead, our PointPillar+VFT can successfully detect all the ground truth objects, which shows the effectiveness of VFT in improving the detection performance of hard difficulty level objects.

Fig.4 Visual display of PointPillar-baseline and PointPillar+VFT detection results

2.3.2 Comparison of inference time and convergence speed

Fig.5 shows the comparison of inference time and 3D car moderate-level detection performance. The PointPillar-baseline has the fastest inference speed but lower detection performance. Our PointPillar+VFT offers a good compromise between inference speed and detection performance, with near real-time inference. The Second-baseline has the second-highest detection performance and real-time inference speed. Our Second+VFT achieves the top detection performance but has a slight loss in inference speed. In general, our VFT achieves better detection performance at the cost of longer inference time due to the larger parameter space of VFT.

Fig.5 Comparison of inference time and 3D car detection performance

Fig.6 PointPillar-baseline vs PointPillar+VFT methods for 3D moderate mAP

Fig.6 shows the performance curve of PointPillar-baseline and PointPillar+VFT. It can be seen that our VFT with larger parameter space enhances the detection results and achieves comparable convergence speed simultaneously.

2.3.3 Different feature embedding strategies

We validate the effectiveness of our feature embedding by using only the absolute coordinates as position information. As shown in Table 4, the performance drops slightly when the feature embedding contains only the absolute coordinate position information.

Table 4 Comparison of performances of different feature embedding strategies for three classes

This shows that the absolute coordinates of objects may vary under rigid transformations, whereas the relative positions are more robust to such transformations, so that more stable and robust voxel features can be learned by our VFT. Fig.7 shows the performance of the different feature embedding strategies. The absolute+relative feature embedding has comparable convergence speed and higher detection performance.

Fig.7 Visual display of performance of different feature embedding for 3D moderate mAP

2.3.4 Different number of points in voxel

In Table 5, we investigate the setting of the number of points (T) in a voxel. The best performance is achieved when T is set to 5. When the number of points is larger, the detection performance drops significantly, which suggests that the model is influenced by the extra noise points, lowering its performance. The visual results are shown in Fig.8.

Table 5 Comparison of performances of different number of points in voxel for 3D mAP

Fig.8 Visual display of performance of different number of points in voxel for 3D moderate mAP

3 Conclusions

Transformer has shown huge power in the NLP field and has made progress in 2D computer vision tasks. Inspired by this success, we investigate the permutation invariance required by 3D point clouds and adapt Transformer to point cloud processing, eliminating the defect of ignoring the spatial relationship and context between points in PointNet-based methods. We first propose a feature embedding to encode the point features and position information jointly. Then we design a voxel feature layer based on self-attention, the core part of Transformer, to extract global and robust context of a voxel according to the spatial relationship and context information exchange between all points within the voxel. Based on the voxel feature layer, we construct a novel general and robust voxel feature learning with Transformer framework for 3D object detection. Specifically, the VFT can be plugged into other voxel-based networks and serves as the backbone for voxel feature extraction. The two versions of the detection network we implemented achieve better performance than their baselines.