Learning discriminative representation with global and fine‐grained features for cross‐view gait recognition
Jing Xiao|Huan Yang|Kun Xie|Jia Zhu|Ji Zhang
1 School of Computer Science, South China Normal University, Guangzhou, Guangdong, China
2 Key Laboratory of Intelligent Education Technology and Application of Zhejiang Province, Zhejiang Normal University, Jinhua, Zhejiang, China
3 School of Sciences, University of Southern Queensland, Toowoomba, Qld, Australia
Abstract In this study, we examine the cross-view gait recognition problem. Many existing methods establish global feature representations based on the whole human body shape, but they ignore important details of different parts of the human body. In the latest literature, positioning partial regions to learn fine-grained features has been verified to be effective for human identification, yet these methods only consider coarse fine-grained features and ignore the relationship between neighboring regions. Taking the above insights together, we propose a novel model called GaitGP, which learns both important details through fine-grained features and the relationship between neighboring regions through global features. Our GaitGP model mainly consists of the following two aspects. First, we propose a Channel-Attention Feature Extractor (CAFE) to extract the global features, which aggregates channel-level attention to enhance the spatial information in a novel convolutional component. Second, we present the Global and Partial Feature Combiner (GPFC) to learn different fine-grained features and combine them with the global features extracted by the CAFE to obtain the relevant information between neighboring regions. Experimental results on the CASIA gait recognition dataset B (CASIA-B), the OU-ISIR multi-view large population dataset (OU-MVLP) and the OU-ISIR gait database (OULP) show that our method is superior to the state-of-the-art cross-view gait recognition methods.
1|INTRODUCTION
Gait recognition is a promising video-based biometric identification technology that identifies individuals by their walking patterns. Compared to other biometric technologies, such as face, fingerprint and iris recognition, gait recognition has the advantages of being contactless, working at long distances and requiring no explicit cooperation from subjects. Therefore, gait recognition has a potentially wide range of applications in video surveillance. As its accuracy increases, gait recognition technology will become another effective tool for crime prevention, forensic identification and social security. To improve recognition accuracy, we need to overcome various external factors, including walking speed, bag-carrying, coat-wearing and camera viewpoint, that cause dramatic changes in gait appearance. As shown in Figure 1, the appearance of a walking subject changes observably across viewing directions, which may make inter-class similarity greater than intra-class similarity and brings challenges to gait recognition.
FIGURE 1 From left to right are silhouettes of all views in the CASIA gait recognition dataset B (CASIA-B), which possess evidently different shapes and moving patterns during walking
There are several attempts in the literature to solve the cross-view gait recognition problem. A common strategy is to extract global features by treating the whole human image as a unit. It is worth mentioning that many methods [1-4] use attention mechanisms to improve model performance, and our method is no exception. However, due to the diversity of gait walking conditions in the cross-view situation, some important details are often lost in the global features. Another learning strategy considers that different parts of the human body possess evidently different shapes and moving patterns during walking [5-10]; these methods aim to learn fine-grained features from specific regions. Unfortunately, they only consider coarse fine-grained features and ignore the relationship between neighboring regions. To solve the problems in the above two strategies, we propose a novel model called GaitGP, which learns both important details through fine-grained features and the relationship between neighboring regions through global feature representation.
Our novel model GaitGP consists of the following two components. The first component is a Channel-Attention Feature Extractor (CAFE), which is a novel application of convolution and can extract global features with a channel attention mechanism. The other is the Global and Partial Feature Combiner (GPFC), which learns fine-grained features in specific regions of images. Moreover, the GPFC combines the global features extracted by the CAFE with the fine-grained features to obtain relevant information between neighboring regions.
In the CAFE, we jointly learn attention selection and feature representation to extract global features with a channel attention mechanism. From the experimental data, we find that channel attention does enhance the performance of the model compared to the original global features extracted from the image. Therefore, an effective channel attention method, called Channel-level Spatial Pooling (CSP), is introduced to select the channel attention information and optimize the global features. Additionally, in order to improve the compatibility between channel attention selection and global features, our novel convolutional layer adopts a partitionable stacking design, which will be discussed specifically in Section 3.2.
In the GPFC, we divide the global feature map extracted from the CAFE into several sub-branches. To combine the global feature and the fine-grained features, the first sub-branch contains only one whole partition to preserve the global information. In the remaining sub-branches, we divide the global feature maps into different numbers of stripes as part regions to learn local feature representations independently [5]. More details will be discussed in Section 3.3.
In brief, we summarise our contributions as follows:
· We propose a novel model called GaitGP, which learns both important details through fine-grained features and the relationship between neighboring regions through global feature representation.
· We propose a CAFE for the optimization of global feature representation.
· We propose a GPFC for combining the global and fine-grained features.
· For gait recognition accuracy, we combine the above components and conduct extensive experiments and ablation studies on the widely used gait datasets CASIA-B [11], the OU-ISIR multi-view large population dataset (OU-MVLP) [12] and the OU-ISIR gait database (OULP) [13]. Compared to several state-of-the-art methods, GaitGP shows superiority.
2|RELATED WORK
2.1|Cross‐view gait recognition
To adapt to the cross-view situation in gait recognition, one of the most typical approaches treats the whole human body shape as a unit to extract features and can be divided into two categories: model-based [14-17] and appearance-based [18-25]. Model-based methods try to reconstruct human 3D body and motion models to identify individuals. Wolf et al. [14] modelled the dynamic characteristics of the gait sequence to express an overall understanding of it. The gait silhouettes under different views can be mapped onto a common template by a 3D model, but such models are difficult to train because of the complexity of the network architecture.
Many appearance-based methods perform gait recognition with a more lightweight (easy to train) network architecture. Inspired by the great achievements in face recognition and action recognition, some researchers leverage generative methods to reconstruct the gait template in all views. The generative adversarial network (GAN) [26] is used to generate invariant side-view gait images to adapt to appearance changes caused by different clothing. Yu et al. [22] proposed a unified cross-view gait recognition model based on a generative framework to learn view-invariant features. A multi-loss strategy is used in GaitGAN [27] to optimize the network to increase the inter-class distance and reduce the intra-class distance. All these methods compress the gait silhouettes from different views into a uniform template for gait recognition. However, it is believed that these methods retain unnecessary sequential constraints on periodic gait [3] and ignore some important details of different parts of the human body.
To learn more detailed information for enhancing feature representation, many advanced methods in the Re-Identification (Re-ID) task [5-9,28-30] have proved that locating important body parts in images to represent local identity information is an effective way to improve the accuracy and robustness of recognition. One of the most commonly used strategies is to split the feature map into strips and merge them into column vectors. Wang et al. [31] designed a Multiple Granularity Network with multiple branches, which uniformly partitions the images into several stripes and varies the number of parts in different local branches to obtain local feature representations with multiple granularities. Fu et al. [8] proposed a simple and effective horizontal pyramid matching method to fully exploit various partial information of a given person. In the task of gait recognition, many of the latest articles have applied fine-grained feature strategies. Chao et al. [3] used Horizontal Pyramid Mapping to map the set-level feature into a more discriminative space for robust feature representation. Zhang et al. [32] employed the idea of part-based unified segmentation to extract local gait features. However, these methods only consider coarse fine-grained attentional features and ignore the relationship between neighboring regions.
To learn both the complement of important details through fine-grained features and the relationship between neighboring regions, the proposed model GaitGP combines attention information to learn a global feature representation and aggregates it with the fine-grained features to make the feature representation more robust.
2.2|Deep learning on attention
One common learning strategy is long short-term memory (LSTM)-based [32,33]. These methods employ LSTM to compute temporal attention scores that pay more attention to discriminative frames, thus improving the overall performance. However, these methods are considered to retain unnecessary sequence constraints on the periodic gait. Additionally, some new approaches in the Re-ID task combine local attention-based representations of the image to improve performance [1,30,34-38]. Li et al. [34] proposed a Spatial Transformer Network (STN) with spatial constraints [36] to locate deformable pedestrian features. Zhao et al. [39] built a hard attention model with the STN to search for components, given pre-defined spatial constraints. Li et al. [1] presented the Harmonious Attention convolutional neural network (CNN) for joint learning of different levels of visual attention along with simultaneous optimization of feature representation.
Inspired by the successful application of visual attention, some methods directly operate on random sequences to obtain attention information, thus avoiding unnecessary sequence constraints on the periodic gait. GaitSet [3] presented Set Pooling, applying an attention mechanism [1,40,41] to improve its performance. GaitPart [4] applied a channel-wise attention mechanism [6,42,43] to re-weight micro-motion features, aiming to overcome the limitation of the global feature. The above methods show that attention information is beneficial to the performance of gait recognition. Therefore, in our method, we propose the CSP to learn channel-wise attention to enhance the global feature.
3|PROPOSED METHOD
In this section, we first summarize the overall network architecture of the GaitGP model. This is followed by a detailed description of the two components of the model, that is, the CAFE and the GPFC.
3.1|Overall framework
The overall framework of the proposed method is shown in Figure 2. Given a dataset of n people with identities y_i, i ∈ {1, 2, …, n}, we assume that the sequence of each identity is X_i, and the s silhouettes drawn from each X_i are expressed as X_i = {x_i,1, x_i,2, …, x_i,s}. We first use the CAFE to jointly perform attention selection and feature representation. Then, the global features are extracted through a channel attention mechanism, which is formulated as follows:

χ_τ = φ(X_i),

where χ_τ denotes the output feature map of the CAFE and φ denotes the function of attention selection, which is implemented by the CSP in the CAFE. The details of the CSP will be introduced in Section 3.2.
Then, the GPFC divides χ_τ into t sub-branches. Each sub-branch is horizontally split into p = 2^γ, γ = 1, 2, …, partitions. Finally, the GPFC combines all the fine-grained features and the global feature χ_τ to learn the relationship between the neighboring regions. The GPFC is formulated as follows:

ν_δ = δ(χ_τ),

where ν_δ denotes the column vector down-sampled by δ, and δ denotes a Multi-Granularity Mapping (MGM) module. More details will be introduced in Section 3.3.
Finally, we choose the Separate Triplet Loss [3,4] to train the proposed model.
3.2|Channel-Attention Feature Extractor
The CAFE learns global features with channel-level attention to enhance representation. There are two components in the CAFE: the CSP, which aims to learn the attention information, and the Partitionable Convolution layer (PConv), which is used to extract the global feature and integrate the attention information from the CSP. Next, the CSP is described in detail first, followed by the exact structure of the PConv.
3.2.1|Channel-level Spatial Pooling
To enhance the expressiveness of global features, the CSP learns a channel-wise attention map to refine them. As shown in Figure 2, each block of the Partial Branch contains a CSP. Assuming that f_b ∈ R^{c×s×h×w} is the input feature map of the CSP, b represents the block in the CAFE, c is the number of channels, s is the length of the gait sequence and (h, w) is the size of each feature map.
FIGURE 2 The framework of GaitGP. The Channel-Attention Feature Extractor consists of CSPs and Blocks. CSP represents the Channel-level Spatial Pooling, and the Blocks are composed of two convolutional units (PConvs). In the Global Branch, the PConv is mainly utilized to extract global features, while in the Partial Branch, the PConv is used to collect channel-level attention information. The Global and Partial Feature Combiner is used to gather the global and fine-grained features. MGM represents the Multi-Granularity Mapping. Note that the MGMs are independent, each of which has a different scale. The dimension of the final feature is 256. FC, Fully Connected
Since the length of the input gait sequence may vary, many previous works [3,4,10] successfully utilize pooling to aggregate the information of the elements in a sequence. Therefore, as shown in Figure 3, we first use Spatial Pooling to aggregate the information of gait elements to represent the gait motion pattern. A natural choice for the Spatial Pooling is to apply the statistical max function [3] over the sequence within each channel-level partition. We pre-divide f_b into τ channel-level partitions to aggregate the information of gait elements, which is formulated as follows:

f_score = ConvNet(Concat(SP(f_b^1), SP(f_b^2), …, SP(f_b^τ))),

where Concat represents the concatenation on the dimension of the channel.
Finally, f_score is merged into the features collected by the statistical functions, formulated as follows:

f_weight = SP(f_b) ⊕ f_score,

where f_weight ∈ R^{c×h×w} is the final output of the CSP and ⊕ is a channel-wise fusion operation. f_weight contains the frame-level global information.
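To make the data flow concrete, the following is a minimal PyTorch sketch of the CSP described above. The module layout, the use of element-wise addition for the fusion ⊕ and all class and variable names are our own illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CSP(nn.Module):
    """Sketch of Channel-level Spatial Pooling: max-pool the sequence within
    each of tau channel partitions, score the result with a small ConvNet,
    then fuse scores and pooled features (fusion assumed additive here)."""

    def __init__(self, channels: int, tau: int = 4):
        super().__init__()
        self.tau = tau
        # ConvNet: 1x1 convolution + ReLU producing channel-level attention scores.
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_b: torch.Tensor) -> torch.Tensor:
        # f_b: (c, s, h, w) -- channels, sequence length, height, width.
        parts = torch.chunk(f_b, self.tau, dim=0)         # tau channel partitions
        pooled = [p.max(dim=1).values for p in parts]     # statistical max over s
        f_sp = torch.cat(pooled, dim=0)                   # (c, h, w)
        f_score = self.conv(f_sp.unsqueeze(0)).squeeze(0) # channel attention scores
        return f_sp + f_score                             # f_weight: (c, h, w)
```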
3.2.2|Partitionable convolution layer
The PConv is a basic unit of the blocks in the CAFE. To improve the compatibility between attention information and the global feature, the PConv is designed to be partitionable. As shown in Figure 2, we design the CAFE as a multi-branched structure. In the Global Branch, the PConv is mainly utilised to extract global features, while in the Partial Branch, the PConv is used to collect channel-level attention information. The global features extracted in different blocks of the Global Branch are added to the Partial Branch. In order to adapt to the various-level fusion of global features and channel-level attention information, we pre-define that each block has τ channel-wise regions. In the initialization block (Block1), the PConv (τ = 1) is equivalent to a regular convolutional layer. In the remaining blocks (Block2 and Block3), the PConvs are divided: the input global features are divided into τ channel regions for convolution, and then vertically spliced together as the final output.
Supposing the output of the Global Branch is S_global ∈ R^{c×h×w} and the output of the Partial Branch is S_part ∈ R^{c×h×w}, we connect the two feature maps, represented as follows:

S_part = PConv_p(S_part) ⊕ PConv_g(S_global),

where PConv_p and PConv_g represent the convolutional layers in the Partial Branch and in the Global Branch, respectively, and ⊕ denotes the concatenate operation.

S_cafe = Concat(S_global, S_part), S_cafe ∈ R^{2c×h×w},

where S_cafe is the final feature map of the CAFE and Concat represents the function Concatenate. Note that 2c means that the dimension of the channel becomes twice after the operation Concat.
Different layers have different receptive fields, and each block contains two PConv layers, as shown in Figure 4(b). The exact structure and parameters of each PConv are shown in Table 1. As shown in Figure 4(a), taking the PConv in Block3 as an example, the input feature map is horizontally divided into τ = 4 partitions, which are operated on independently. Then, the obtained channel-level feature vectors are spliced vertically as the final output.
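A minimal sketch of the PConv idea follows, under the assumption that the input channels are split into τ groups that are convolved independently and spliced back together; the names are illustrative rather than the authors' code.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partitionable convolution sketch: tau independent convolutions over
    channel partitions; tau = 1 reduces to a regular convolutional layer."""

    def __init__(self, in_ch: int, out_ch: int, kernel: int = 3, tau: int = 1):
        super().__init__()
        assert in_ch % tau == 0 and out_ch % tau == 0
        self.tau = tau
        self.convs = nn.ModuleList(
            nn.Conv2d(in_ch // tau, out_ch // tau, kernel,
                      padding=kernel // 2, bias=False)
            for _ in range(tau)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, in_ch, h, w); convolve each channel partition independently.
        parts = torch.chunk(x, self.tau, dim=1)
        out = [conv(p) for conv, p in zip(self.convs, parts)]
        return torch.cat(out, dim=1)  # splice the partition outputs together
```

Under this reading, a PConv with τ > 1 is functionally close to a grouped convolution, which matches the observation that the initialization block (τ = 1) is equivalent to a regular convolutional layer.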
3.3|Global and Partial Feature Combiner
In the literature, splitting the feature map into strips is commonly used in the person Re-ID task [5,8,31]. Horizontal Pyramid Pooling (HPP) [8] learns different fine-grained features at four scales and thus helps the deep network focus on features of different sizes to gather both partial and global information. We improve the HPP to obtain relevant information between neighboring regions. The most obvious modification is that we divide the pipeline after the CAFE into five independent sub-branches. Each sub-branch has a similar architecture with a different scale.
Specifically, the GPFC has multiple scales ρ. The feature map S_cafe extracted by the CAFE is split into five independent sub-branches, expressed as {Part_1, Part_2, …, Part_5}. Each sub-branch uses an MGM module with a different scale, as shown in Figure 2. The MGM splits each Part_t into ρ = 2^{m-1} strips on the height dimension, where m ∈ M = {1, 2, 3, 4, 5}. The upper sub-branch Part_1 contains only one whole partition (preserving the global feature), which is used to supplement the relevant information between neighboring regions of the other sub-branches. For the remaining four sub-branches, S_cafe is split at different ρ scales, that is, horizontally divided into ρ stripes to learn different fine-grained features independently.
Moreover, the structure of the MGM module is shown in Figure 5. On scale ρ, the Separate Max Pooling (SMP) is applied to downsample S_cafe into 3-D strip features of equal size. Then the Separate Conv1dNet (SC) module, which consists of a 1-D convolutional layer with a kernel size of ρ, is leveraged to reduce the dimension, producing v_t. The specific parameters of each MGM component are shown in Table 2. The MGM is formulated as follows:
FIGURE 3 The structure of the CSP. Take τ = 4 as an example. The SP module applies an improved statistical max function to gather the most discriminative features. The ConvNet is a convolutional layer with a rectified linear unit (ReLU) activation function, which obtains the channel-level attention scores. CSP, Channel-level Spatial Pooling
FIGURE 4 (a) The illustration of the PConv in Block3; the dimension of the input feature map is expressed as c × h × w. (b) Block3 is a deep-layer block and consists of two PConvs. PConv, Partitionable Convolution
TABLE 1 The exact structure of the CAFE and the specific parameters of the PConv. In_D, Out_D and Kernel represent the input dimension, output dimension and kernel size of the PConv, respectively. In particular, τ indicates the pre-defined partition in the PConv. Feature denotes the output feature maps of each block
MGM_t = Concat(v_t^1, v_t^2, …, v_t^ρ), v_t^j = SC(SMP(Part_t^j)),

where MGM_t is the ρ-granularity extracted feature, Part_t is horizontally divided into ρ strips, v_t is the aggregated output vector and Concat represents the concatenate operation. The SMP is implemented by 1-D Max Pooling with a kernel size of ρ.
FIGURE 5 The structure of the Multi-Granularity Mapping (MGM). Taking scale ρ = 16 as an example, the Separate Max Pooling (SMP) is applied to downsample and the Separate Conv1dNet (SC) is leveraged to reduce the dimension
Finally, we apply the Separate Fully Connected (FC) layer to obtain the final features of the GaitGP, denoted as f_c.
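The following is a minimal sketch of one MGM branch under our reading of the pipeline: split the feature map into ρ horizontal strips, max-pool each strip spatially (SMP) and reduce the dimension with an independent 1-D convolution per strip (SC). Kernel sizes and names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MGM(nn.Module):
    """Sketch of a Multi-Granularity Mapping branch with rho horizontal strips."""

    def __init__(self, in_dim: int, out_dim: int = 256, rho: int = 16):
        super().__init__()
        self.rho = rho
        # Separate Conv1dNet (SC): one independent 1-D conv per strip.
        self.sc = nn.ModuleList(
            nn.Conv1d(in_dim, out_dim, kernel_size=1, bias=False)
            for _ in range(rho)
        )

    def forward(self, s_cafe: torch.Tensor) -> torch.Tensor:
        # s_cafe: (N, c, h, w); h is assumed divisible by rho.
        n, c, h, w = s_cafe.shape
        strips = s_cafe.view(n, c, self.rho, h // self.rho, w)
        # Separate Max Pooling: max over the spatial extent of every strip.
        pooled = strips.max(dim=4).values.max(dim=3).values   # (N, c, rho)
        # Strip-wise dimension reduction, then concatenate the column vectors.
        cols = [self.sc[j](pooled[:, :, j:j + 1]) for j in range(self.rho)]
        return torch.cat(cols, dim=2)                         # (N, out_dim, rho)
```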
In the testing phase, to obtain discriminating ability, we splice all the features, each reduced to 256 dimensions, into the final feature map, combining the global and fine-grained information to improve the comprehensiveness of the learnt features.
3.4|Implementation details
3.4.1|Loss function
As shown in Figure 2, we add a Separate Triplet Loss function to supervise learning, which applies the Separate Batch All (BA+) triplet loss [44] to train the network and uses the corresponding column feature vectors of the different samples to calculate the loss. The triplet loss is defined as follows:

L_triplet = [m + D(N_γ, N_p) − D(N_γ, N_n)]_+,

where N_γ is a random anchor sample, N_p is a positive sample with the same identity as N_γ, N_n is a negative sample with a different identity from N_γ, D(·, ·) is the Euclidean distance between the corresponding feature vectors and m is the margin of the triplet loss. The operation [ϑ]_+ is equal to max(ϑ, 0).
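A minimal sketch of the Batch All (BA+) triplet loss over one column of feature vectors is given below; it follows the standard BA+ formulation with Euclidean distances rather than the authors' exact code.

```python
import torch

def ba_triplet_loss(feats: torch.Tensor, labels: torch.Tensor,
                    margin: float = 0.2) -> torch.Tensor:
    """feats: (N, d) column feature vectors; labels: (N,) identity ids."""
    dist = torch.cdist(feats, feats)                    # pairwise Euclidean
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # (N, N) identity mask
    d_ap = dist.unsqueeze(2)                            # d(anchor, positive)
    d_an = dist.unsqueeze(1)                            # d(anchor, negative)
    loss = (margin + d_ap - d_an).clamp(min=0)          # [m + d_ap - d_an]+
    # Keep all (anchor, positive, negative) triplets; drop anchor == positive.
    valid = same.unsqueeze(2) & (~same).unsqueeze(1)
    valid &= ~torch.eye(len(feats), dtype=torch.bool).unsqueeze(2)
    return loss[valid].mean()
```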
3.4.2|Training
The input of the network is a series of silhouettes. We randomly select frames from the entire gait sequence, which can be regarded as a temporal data augmentation method. We sample a batch of size n × s from the training set, where n represents the number of people with different IDs, and s represents the number of different sequences used by each person with the same ID in the batch. The sampling strategies in [3,4] are applied, and the Separate Batch All (BA+) triplet loss [44] is used to calculate the loss.
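As a sketch, the (n, s) sampling described above could look as follows; seqs_by_id and the frame count are hypothetical stand-ins used only to illustrate the strategy.

```python
import random

def sample_batch(seqs_by_id: dict, n: int = 8, s: int = 16, frames: int = 30):
    """Pick n identities, s sequences each, and `frames` random frame indices
    per sequence; seqs_by_id maps identity -> list of sequence lengths."""
    batch = []
    for pid in random.sample(list(seqs_by_id), n):
        for seq_len in random.choices(seqs_by_id[pid], k=s):
            # Random frame selection acts as temporal data augmentation.
            idx = sorted(random.choices(range(seq_len), k=frames))
            batch.append((pid, idx))
    return batch
```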
TABLE 2 Comparison of the settings for the MGM in the five sub-branches. "Sub" refers to the name of the sub-branch. "P" refers to the number of partitions on the feature maps. "Map Size" refers to the size of the output feature maps from each branch. "Dim" refers to the dimensionality and number of features of the output representations. "Feature" denotes the symbols for the output feature representation
3.4.3|Testing
Each gait sequence is matched using the spatio-temporal features extracted from it. The average Euclidean distance between the feature column vectors of the probe and the gallery is used as the matching metric.
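In code, the matching step could be sketched as below, assuming probe and gallery features of shape (samples, columns, dim) and rank-1 assignment by the Euclidean distance averaged over the feature columns; this is our illustration of the metric, not the authors' evaluation script.

```python
import torch

def rank1_match(probe: torch.Tensor, gallery: torch.Tensor) -> torch.Tensor:
    """probe: (P, parts, d); gallery: (G, parts, d); returns (P,) gallery ids."""
    diff = probe[:, None] - gallery[None]     # (P, G, parts, d)
    dist = diff.norm(dim=-1).mean(dim=-1)     # mean Euclidean distance, (P, G)
    return dist.argmin(dim=1)                 # closest gallery entry per probe
```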
4|EXPERIMENTS
In this section, we first describe the three datasets, CASIA-B, OU-MVLP and OULP, used to evaluate our model GaitGP, followed by a comparison of the performance of GaitGP with the state-of-the-art methods, and ending with an ablation study on CASIA-B to verify the effectiveness of each component in GaitGP.
4.1|Datasets and training details
4.1.1|CASIA-B
CASIA-B [11] is a widely used gait dataset containing 124 subjects, each of which includes 11 views. Under each view, there are 10 sequences covering three gait conditions. The normal condition (NM) includes six sequences: the first four sequences NM#01-04 form the gallery, and the remaining two sequences NM#05-06 are used as probes. In addition to the normal-condition sequences, there are two further conditions: wearing a coat (CL#01-02) and carrying a bag (BG#01-02). The dataset enables researchers to simultaneously study cross-view and cross-wearing issues; in other words, each subject contains 11 × (6 + 2 + 2) = 110 sequences. There are various experimental schemes [45] based on CASIA-B to verify the feasibility and effectiveness of a proposed method. For fairness, this study strictly follows the popular protocol [6]. Besides, there are three training settings configured according to the different training scales in the training stage [3], that is, small-scale training (ST), medium-scale training (MT) and large-scale training (LT). The 124 subjects are divided into two groups: 24, 63 and 74 subjects, respectively, are put into the training set, and the remaining subjects are reserved for testing. During the test, the first four NM sequences (NM#01-04) are regarded as the gallery, and the remaining sequences are divided into three probe subsets according to walking condition: the NM subset (NM#05-06), the BG subset (BG#01-02) and the CL subset (CL#01-02).
4.1.2|OU-MVLP
OU-MVLP [12] is a newly released public gait database with the largest view variation, consisting of 10,307 subjects, each containing 14 views (0°-90° and 180°-270°, sampled at 15° intervals). We use the first 5153 subjects for training and the remaining 5154 for testing. Each subject has two sequences. In the testing stage, sequence #01 is classified as the gallery set, and the other sequence #00 is classified as the probe set. According to [12], four typical viewing angles (0°, 30°, 60°, 90°) are evaluated. In addition to these four typical views, we conduct experiments with all the views [3,4,32,46]. The dataset provides us with stable comparison results.
4.1.3|OULP
OULP [13] is a large dataset with only four view angles (55°, 65°, 75°, 85°). There are 4007 subjects (2135 males and 1872 females) with ages ranging from 1 to 94 years, and each subject contains two sequences, one in the gallery and the other as a probe sample. Compared with CASIA-B, OULP has smaller view differences and no variation in walking conditions. However, the large number of subjects enables us to compare different gait recognition approaches with statistical significance. Our experimental setting is the same as in [20]; since not all samples of each subject cover all four view angles, a total of 3714 subjects (according to the file of the first view angle) are used in the subsequent experiments. We use 1857 subjects as the training set and the rest as the test set. Note that the original silhouettes have already been cropped and aligned, so we directly use the given silhouettes to construct the gait templates.
4.1.4|Training details
During the experiments, the length s of the input gait sequences is set to 30, the same as in [3,4]. We use the method mentioned in [12] to crop and align all input sequences and adjust their size to 64 × 64. The Adam optimizer [47] is adopted to perform gradient optimization, and the learning rate is set to 1e-4. In addition, the momentum is set to 0.9 and the margin of the Separate Triplet Loss is set to 0.2, the same as in [44]. On CASIA-B, we set the batch size to (8, 16) and the number of training iterations to 90K. On OU-MVLP, because it contains far more sequences than CASIA-B, we set the number of partitions of the PConv layers in Block2 and Block3 to 2, 2, 4 and 4, the batch size to (32, 8), the number of iterations to 250K, and the learning rate to 1e-5.
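Putting the reported hyper-parameters together, a hedged training-loop skeleton for CASIA-B might look as follows; the toy encoder and random tensors stand in for the GaitGP network and sampled silhouette clips, and ba_triplet_loss is the sketch from Section 3.4.1. None of this is the authors' code.

```python
import torch
from torch import nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(30 * 64 * 64, 256))  # stand-in

opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)  # reported settings
for step in range(90_000):                             # 90K iterations
    # One batch: n = 8 identities x s = 16 sequences of 30 aligned 64x64 frames.
    clips = torch.randn(8 * 16, 30, 64, 64)            # dummy silhouettes
    labels = torch.arange(8).repeat_interleave(16)     # identity ids
    loss = ba_triplet_loss(encoder(clips), labels, margin=0.2)
    opt.zero_grad()
    loss.backward()
    opt.step()
```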
TABLE 3 Averaged rank-1 accuracies on CASIA-B under all view angles and different conditions, excluding the identical-view cases, in three experimental settings with different training sample sizes (ST, MT, LT)
4.2|Comparison with the state-of-the-art methods
4.2.1|CASIA-B
As shown in Table 3, we compare our method with the latest gait recognition methods, mainly including CNN-LB [20], GaitNet [46], GaitSet [3], MGAN [25] and ACL [33]. To make a systematic and comprehensive comparison with these advanced methods, all conditions (NM, BG, CL) are included, and further experiments and comparative analyses are carried out with different training sample sizes (ST, MT, LT). The proposed method achieves the best recognition accuracy at almost all angles.
(1) As shown in Table 3, CNN-LB [20] is a GEI-based method, while the others are all based on silhouette sequences, and the latter all perform better than the former. This shows that video-based methods have great potential in extracting more fine-grained and discriminative information from images.
(2) We compare with GaitNet [46] and MGAN [25], which have the same structural purpose but different architectural compositions. In GaitNet [46], an Auto-Encoder is introduced to obtain more discriminative features, and a multi-layer LSTM is applied for spatio-temporal modelling. MGAN uses a generative adversarial network to map different clothing conditions to the same template from the front and side perspectives. In our model, we introduce the CSP to extract local attention features through channel-level division as the spatio-temporal attention features of the subject.
(3) Compared with GaitSet [3], our structure uses a partitionable convolution unit called the PConv, which obtains channel-level features fused with spatial attention. The MGM of GaitGP also has a similar structure to that of GaitSet, but the MGM pays more attention to fine-grained local segmentation, using independent operations to enlarge more representative features and reduce the similarity between different subjects. The experiments reveal the advantages of the PConv and the MGM: GaitGP obtains better results under various walking conditions on CASIA-B.
4.2.2|OU-MVLP
To prove the effectiveness of our method, we conduct two large-scale experiments on OU-MVLP. (1) We use the same evaluation setting as [12], where 5153 subjects are used for training and 5154 for testing. The silhouettes of the four typical views (0°, 30°, 60°, 90°) are evaluated for cross-view recognition, as shown in Table 4. (2) We list the results with the gallery including all 14 views, averaged over the gallery views (excluding the identical-view cases). We set the dimension of the global feature and the local feature to 512 and reduce the dimensionality to 256 through the MGM, as shown in Table 5.
4.2.3|OULP
To prove the broad applicability of our method, we also perform experiments on OULP. The results are shown in Table 6. We compare our method with CNN-LB [20], GEINet [46] and MGAN [25]. These methods calculate cross-view accuracy, that is, the average accuracy of each view angle excluding the identical view. Our GaitGP performs better than these methods.
TABLE 4 OU-MVLP results excluding the identical-view cases under four typical views (0°, 30°, 60°, 90°)
TABLE 5 OU-MVLP results excluding the identical-view cases under all views
4.3|Ablation study
To further verify the effectiveness of each component in our proposed network GaitGP, we consider the two components of the CAFE, namely the PConv and the CSP, and the MGM module in the GPFC pipeline. We perform the ablation study of these components on the CASIA-B dataset. The experimental results and analysis are as follows.
TABLE 6 OULP cross-view average accuracies (%) for all pairs of the four view angles
4.3.1|Effectiveness of PConv
As introduced in Table 1, we present the parameter settings of the PConv. To evaluate its robustness, we design four groups of experiments. The blocks in the CAFE are composed of two PConvs. In Exp. 1-1, we set τ = 1 in Block1 to retain its original state, and the parameter in the remaining two blocks is also set to 1. The difference between Exp. 1-2 and Exp. 1-1 is that Block1 remains unchanged, but the parameter τ in Block2 and Block3 is set to 2 and 4, respectively. Exp. 1-3 is based on Exp. 1-2, but the latter two blocks are both set to 4. Similarly, the parameters τ of Block2 and Block3 in Exp. 1-4 are set to 4 and 8. All the results of these controlled experiments are shown in Table 7.
Comparing Exp. 1-1 and Exp. 1-2, on the one hand, we find that the blocks in the original state (τ = 1) are not effective, which shows the advantage of partitionable extraction. On the other hand, the features of Exp. 1-4 are too dispersed, leading to poor performance in the superficial layers, probably because too much subdivision destroys the silhouette information between the edges of adjacent regions and increases the proportion of noisy covariates. Finally, by comparing Exp. 1-2, Exp. 1-3 and Exp. 1-4, we observe that the average rank-1 accuracy first rises and then falls on the NM and BG subsets, while it continues to rise on the CL subset. We believe the reason for this phenomenon is that the different receptive fields of the top neurons can adapt to different walking conditions.
4.3.2|Effectiveness of CSP
The traditional spatial feature mapping [3,4] uses Max(·) or Avg(·) to aggregate spatial information, but using them alone cannot realize the mapping adaptively. In this paper, we introduce the CSP to achieve spatial feature mapping. Figure 3 shows its internal structure and describes the components used inside. Inspired by the idea in [1] of slicing the feature map at the channel level, we design a new statistical function SP and use a ConvNet to weight the local features to enhance attention. To verify the effectiveness of the CSP, we design comparative experiments implementing methods with different spatial feature mapping strategies on the CASIA-B dataset. Note that the channel-level slice parameters refer to the parameters τ in the previous ablation experiment.
The results are shown in Table 8. Exp. 2-1 uses the traditional statistical function SP under the NM and BG conditions. In Exp. 2-2, we set the parameter τ of SP1, SP2 and SP3 to slice into 1, 2 and 4 partitions in the different blocks, which performs better. In Exp. 2-3, the addition of the ConvNet layer aims to enhance the attention and make the aggregation of spatial information more effective, reaching accuracy rates of 96.2% and 90.3%. Besides, in the CL setting, when the parameter τ of SP1, SP2 and SP3 is set to 1, 4 and 8, as shown in Exp. 2-4, the highest accuracy rate is 69.2%. This may indicate that finer-grained feature extraction is better for silhouettes with clothing variations.
TABLE 7 The ablation experiment performed on CASIA-B using the LT setting. Results are the averaged rank-1 accuracies over all 11 views, excluding the identical-view cases. Comparison of different parameter settings of the PConv
TABLE 8 The ablation experiment performed on CASIA-B using the LT setting. Results are rank-1 accuracies over all 11 views, excluding the identical-view cases. Comparison of the CSP with different settings for different blocks
4.3.3|Effectiveness of the MGM
We duplicate five branches of the intermediate feature maps obtained by the backbone network, named Part1, Part2, Part3, Part4 and Part5; the corresponding configurations are shown in Table 2. From the experimental results, we find that setting the horizontal stripes as ρ = 2^{m-1}, m ∈ {1, 2, 3, 4, 5}, the same as in [3], performs well, which shows that segmentation at different granularities can better capture details that are easily ignored in recognition.
In our experiments, we explore the influence of the multi-branch architecture from two aspects, as shown in Table 9. On the one hand, the structure with only the one-partition branch Part1 (considered as the global representation) is compared with the structure integrating only the four independent multi-granularity branches. The integrated strategy achieves better performance than any single participating network, which shows that, compared with the global network, the collaborative learning of branches produces more discriminative feature representations. On the other hand, we combine the two structures and compare the result with the above two experiments. The effect of combining the global and local features is higher than using either of them alone. We believe that the mutual influence between the four independent multi-granularity branches supplements their blind spots during the learning process.
TABLE 9 The ablation experiment performed on CASIA-B using the LT setting. Results are the averaged rank-1 accuracies over all 11 views, excluding the identical-view cases. Accuracy (%) of using different branches in the MGM
TABLE 10 Efficiency comparison on CASIA-B using the LT setting
4.3.4|Efficiency of GaitGP
As discussed in [48], the efficiency of pair-wise similarity learning methods [49] is limited. In contrast, since each sample only needs to be computed once [3], our network takes 1.36 min to complete the test on four NVIDIA 1080Ti GPUs. Table 10 lists the efficiency comparison on CASIA-B.
5|CONCLUSION
This paper proposes a new network architecture, designing the PConv to extract global and partial features and combine the advantages of both. We also propose the CSP to learn spatial attention and feature representations, improving the performance of gait recognition. In addition, through the multi-granularity horizontal segmentation pipeline, the MGM, different multi-granularity branches are integrated to obtain the final gait representation. Experimental results on three public datasets verify the effectiveness and efficiency of our method.
ACKNOWLEDGEMENTS
This work was partially supported by the Natural Science Foundation of Guangdong Province (No. 2018A030313318) and the Key-Area Research and Development Program of Guangdong Province (No. 2019B111101001).
ORCID
Jing Xiao https://orcid.org/0000-0002-5242-7909