
Hand segmentation from a single depth image based on histogram threshold selection and shallow CNN

2018-11-02

XU Zhengze, ZHANG Wenjun

(1. Shanghai Film Academy, Shanghai University, Shanghai 200072, China; 2. School of Communication, East China Normal University, Shanghai 200241, China)

Abstract: Real-time hand gesture recognition technology significantly improves the user's experience in virtual reality/augmented reality (VR/AR) applications, and relies on the identification of the orientation of the hand in captured images or videos. A new three-stage pipeline approach for fast and accurate segmentation of the hand from a single depth image is proposed. Firstly, a depth frame is segmented into several regions by a histogram-based threshold selection algorithm and by tracing the exterior boundaries of objects after thresholding. Secondly, each segmentation proposal is evaluated by a three-layer shallow convolutional neural network (CNN) to determine whether or not it is associated with the hand. Finally, all hand components are merged as the hand segmentation result. Compared with algorithms based on the random decision forest (RDF), the experimental results demonstrate that the approach achieves better performance, with high accuracy (88.34% mean intersection over union, mIoU) and a shorter processing time (68 ms).

Key words: hand segmentation; histogram threshold selection; convolutional neural network (CNN); depth map

Recently, virtual reality/augmented reality (VR/AR) applications have become very popular in the realm of digital media technology. VR films, immersive films and AR games based on the next generation of interaction technology require a more natural interaction method than that offered by traditional devices such as the mouse and the keyboard. For instance, gesture recognition [1-3] has been introduced to significantly improve the user's experience in VR/AR applications. Most real-time hand tracking algorithms rely on the identification of the hand orientation in captured images or videos. As such, segmentation of the hand is an important preprocessing step to extract this feature from various kinds of obscuring backgrounds. Hand segmentation has been studied for many years. In the early days, researchers focused on the application of RGB (red, green, blue) color images and skin color-based segmentation models [4-6] to locate hands in an image. Oikonomidis, et al. [7] trained a skin-color representation in HSV (hue, saturation, value) space offline, and then segmented hands online by skin-color detection and thresholding. Tzionas, et al. [8] implemented segmentation using a Gaussian mixture model (GMM). However, such models have rarely been deployed in real applications because of their limitations in processing objects with a color similar to skin, and because the output varies with skin color and illumination conditions.

Due to the popularity of consumer depth sensors, substantial progress has been made toward fast and accurate segmentation of hands, and numerous methods have been proposed. Liang, et al. [9] assumed that the hand was the nearest object to the camera and that the palm region could be approximated as a circle based on the morphology of the hand. Qin, et al. [10] first detected a rough hand region and separated it into two parts, and then set a perpendicular line in the middle of the two center points as the cut line. Furthermore, clustering methods have also been investigated. A hierarchical clustering procedure has been used to obtain a rough hand RoI (region of interest), which was then modeled by a mixture of two GMMs [11]. The success of the random decision forest (RDF) approach for the segmentation of body parts also presented new research opportunities. Tompson, et al. [1] and Sharp, et al. [12] proposed similar methods [13] based on RDF to classify different pixels in a single depth image. Extraction of the RoI from backgrounds using RDF is becoming more popular and accurate. However, it is difficult to precisely and consistently separate a hand from cluttered backgrounds for the following reasons. (1) Segmentation by the RDF method suffers from a "boundary pixel" issue, since its classification results for pixels near the classification boundary are not reliable, which has also been confirmed in Ref. [12]. (2) Small holes and flaws are caused by the low resolution of a consumer depth camera, which imposes additional challenges on processing the depth information. (3) Hand orientations, postures and cluttered backgrounds exhibit large variations.

This paper focuses on the problem of how to achieve fast and accurate segmentation of hands from cluttered backgrounds, which is essential to enable robust hand tracking. A hand segmentation method is proposed for hand pose estimation using only a single depth map. The method is a three-stage pipeline; Fig. 1 shows an overview of the approach. A single depth image captured by a RealSense [14] RGB-D camera is the input. In Step 1, the depth image is segmented into several regions by thresholds, which are defined by the MINIMUM thresholding algorithm based on Ref. [15]: in the histogram, the threshold t is chosen as the value of x at which y is minimized, in the valley between maxima of y. Different objects in each region are then further separated by tracing their exterior boundaries. In Step 2, each subdivided component proposal is evaluated by a shallow convolutional neural network (CNN), which is trained as a binary classification function to predict whether or not it is a portion of the hand. In Step 3, all hand components labeled by the CNN are merged as the hand segmentation result. To the best of our knowledge, this is the first work to leverage both a shallow CNN and a histogram-based thresholding algorithm, a technique normally used in gray image binarization. It is shown that histogram-based thresholding algorithms such as MINIMUM, MEAN and MEDIAN are effective ways to separate foreground objects from the background in depth images. To train the neural network to make good predictions, a set of real data from the HandNet database [16], containing more than 200 000 single depth images with large variations in background scenes, hand poses and orientations from 10 participants (6 male, 4 female), was used. Compared with the segmentation results obtained by RDF-based approaches in Refs. [1, 12], the method efficiently achieves higher accuracy.

Fig.1 Overview of the approach

1 Methods of three-stage pipeline

1.1 Histogram-based thresholding algorithm

The histogram of a gray-level image represents the probability density function of the intensity values of the image, while a depth histogram represents the probability density function of depth values. A threshold is usually used in image binarization to convert a gray-level image into a binary one. Similar to gray-level image binarization based on a histogram of intensity values, the depth histogram thresholding algorithm extracts human objects such as hands from a depth image based on a histogram of depth values. The two binary levels may represent objects in the foreground and background. Pixels whose value exceeds a critical value, called the global threshold because it is used across the whole image, are assigned to one category, and the rest are assigned to the other. Many algorithms have been proposed for automatic histogram threshold selection. They are simple to understand and implement, and are computationally fast once the histogram has been obtained.

In a depth histogram, X is defined as a vector of ascending sorted depth values (X = [0, 1, 2, ···, 255]), and Y as a vector of the corresponding accumulation of each depth value (Y = [Y0, Y1, Y2, ···, Y255]). If an object is too close to or too far from the camera, progressively less depth data is acquired. For this reason, only depth values between 25 and 230 are used, to make sure that the depth data collected by the RealSense camera is highly accurate and reliable; the others are removed (see Fig. 1 Step 1). An element Yi is considered a peak if its value is greater than those of its two neighbors, as shown in Eq. (1):

Yi > Yi−1 and Yi > Yi+1.    (1)
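As a minimal sketch of this stage, the following Python snippet builds the depth histogram over the reliable range and applies the peak test of Eq. (1). It assumes 8-bit depth values stored in a NumPy array; the function name is illustrative.

```python
import numpy as np

def depth_histogram_peaks(depth, lo=25, hi=230):
    """Build the depth histogram Y over X = [0..255] and list its peaks.

    Depth values outside [lo, hi] are discarded first, mirroring the
    reliability cut described above."""
    valid = depth[(depth >= lo) & (depth <= hi)]
    hist, _ = np.histogram(valid, bins=256, range=(0, 256))
    # Eq. (1): Y_i is a peak when it exceeds both of its neighbors.
    peaks = [i for i in range(1, 255)
             if hist[i] > hist[i - 1] and hist[i] > hist[i + 1]]
    return hist, peaks
```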

Fig.2 Histogram-based threshold selection and regions of interest segmentation

The peak numbers and the distribution of the depth values at this stage, shown in Fig. 2 Step 1, provide only noisy information for identifying the regions of interest. Therefore, using local peaks directly to segment an image can be ineffective with depth data. In practice, before searching for thresholds, the algorithm usually requires the y values of the histogram, which represent the probability density for each particular distance to the depth camera, to be smoothed.

A chief motivation of the next step is the fact that depth values, which represent distances between the nearest and farthest space in a scene, should be fully utilized. An independent three-dimensional object is continuous not only along the x-axis and y-axis of the image plane, but also along the z-axis, which provides an edge for separating it from other objects using discontinuities in the depth information. In other words, different objects commonly occupy detached positions on the depth histogram. Therefore, the problem is to estimate a threshold value T such that image elements with an intensity value greater than T contain background points, and pixels with an intensity value lower than T contain hand points. In the experiment, the x point selected in the valley between two maxima of y, similar to the MINIMUM algorithm [15], separates the depth image into foreground and background with minimum error.

After 100 iterations of the smoothing operation (see Fig. 2 Step 2), the majority of the depth histograms, about nine out of ten, can be made bimodal. However, there is still a small chance that a histogram is not bimodal. A multi-threshold searching algorithm is therefore used to pick the lowest point between every two peaks. As shown in Fig. 2 Step 3, two thresholds at different x positions are selected, and they divide the whole image into three parts (see Fig. 2 Step 4). Finally, because the depth information is provided by low-resolution consumer depth sensors such as Kinect and RealSense, there are some small holes and flaws in the segmented contour (see Fig. 3; the small regions in the green circle are missing). These holes are incorrectly taken as empty pixels, which would cause separation failure when directly tracing exterior boundaries. Before each region is further subdivided, the binary image should therefore be dilated with a three-pixel-radius circle as the structuring element to fill these holes.
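A sketch of the smoothing, multi-threshold selection and hole-filling dilation might look as follows, using NumPy and SciPy. The 3-tap mean filter is an assumption, since the paper does not specify the smoothing kernel; function names are illustrative.

```python
import numpy as np
from scipy import ndimage

def minimum_thresholds(hist, iterations=100):
    """Iteratively smooth the histogram (100 passes, as in the text),
    then take the lowest bin between every pair of surviving peaks
    as a threshold (multi-threshold MINIMUM selection)."""
    h = hist.astype(float)
    kernel = np.ones(3) / 3.0          # assumed 3-tap mean filter
    for _ in range(iterations):
        h = np.convolve(h, kernel, mode="same")
    peaks = [i for i in range(1, len(h) - 1)
             if h[i] > h[i - 1] and h[i] > h[i + 1]]
    # Valley (minimum) between each pair of adjacent peaks.
    return [a + int(np.argmin(h[a:b + 1])) for a, b in zip(peaks, peaks[1:])]

def fill_small_holes(mask):
    """Dilate a binary region with a 3-pixel-radius disk to close
    the sensor holes shown in Fig. 3."""
    yy, xx = np.mgrid[-3:4, -3:4]
    disk = (xx ** 2 + yy ** 2) <= 3 ** 2
    return ndimage.binary_dilation(mask, structure=disk)
```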

Fig.3 Holes and flaws caused by depth noise

After filling these small holes, further subdivision of different objects in two-dimensional space can be carried out by the pixel group algorithm. In contrast to separation through depth information, pixel grouping utilizes data in the image plane to separate objects from each other. Once every object has been effectively and efficiently subdivided by the histogram-based thresholding algorithm and by tracing exterior boundaries after image dilation, each of them is sent to a shallow CNN model to classify whether it belongs to the hand or not.
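Connected-component labeling is one standard way to realize this pixel grouping; the paper traces exterior boundaries, so the following minimal scipy.ndimage sketch is a functional stand-in rather than the exact routine.

```python
from scipy import ndimage

def split_regions(mask):
    """Separate one thresholded depth slice into individual objects
    via connected-component labeling."""
    labels, count = ndimage.label(mask)
    return [labels == k for k in range(1, count + 1)]
```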

1.2 Binary classification by shallow CNN

After the aforementioned steps, regions that contain hands and other objects are generated separately. The problem then becomes the selection of the hand segmentation proposal that best matches the ground truth hand region. By predicting whether a region contains a portion of the hand or not, the hand segmentation problem is formulated as a binary classification problem. A classification model needs to be trained to predict a binary result for each proposal. To reduce computation and to facilitate fast and precise prediction, a shallow CNN architecture is deployed as the model.

1.2.1 CNN architecture

The CNN deployed in the second stage of the pipeline is similar to LeNet-5 from Ref. [17]. Fig. 4 illustrates the CNN architecture, which takes the regions separated in the preceding steps as inputs. In the example shown in Fig. 4, one single frame of a depth image is converted into five parts. Every part is then fed into a three-layer CNN architecture composed of two convolutional layers (C1, C2) and one fully connected layer (FC). Each convolutional layer with rectified linear units (ReLU) is followed by a max pooling layer for dimensionality reduction.

All the outputs of the second convolutional layer C2 are then flattened and sent to the FC layer to compute 256 outputs. Finally, two numbers are output to produce a binary label for each image, with classification by a softmax activation function: [0 1] means an image contains a part of the hand, while [1 0] means that there is no sign of the hand.

The forward propagation function used to build the above model is: Conv2D → ReLU → MaxPool → Conv2D → ReLU → MaxPool → Flatten → FC → Softmax.
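A minimal PyTorch sketch of this architecture follows. The paper does not name its framework, and the input size (64×64 here), channel counts and 5×5 kernels are illustrative assumptions; only the layer sequence and the 256-unit FC output are fixed by the text.

```python
import torch
import torch.nn as nn

class ShallowHandCNN(nn.Module):
    """Conv-ReLU-MaxPool x2, then FC(256) and a 2-way output, as in
    Fig. 4. Channel counts, kernel sizes and the 64x64 input size
    are illustrative assumptions."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2),   # C1
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 64 -> 32
            nn.Conv2d(16, 32, kernel_size=5, padding=2),  # C2
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 32 -> 16
        )
        self.fc = nn.Linear(32 * 16 * 16, 256)            # FC: 256 outputs
        self.out = nn.Linear(256, 2)                      # [1 0] / [0 1] logits

    def forward(self, x):
        x = torch.flatten(self.features(x), 1)
        x = torch.relu(self.fc(x))
        return self.out(x)  # softmax is applied inside the loss function
```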

Fig.4 CNN architecture

1.2.2 Training

The network is trained in the standard way with mini-batch gradient descent and the Adam optimization algorithm [18] to compute and minimize the softmax cross-entropy loss between logits and labels. The hyperparameters of the Adam algorithm are set to β1 = 0.9, β2 = 0.999 and ε = 10−8. The weights of all layers are randomly initialized by the Xavier initialization method [19] with a zero-mean normal distribution, while biases are initialized to zero.

During training, a decaying learning rate policy is introduced to gradually lower the learning rate as the mini-batch iterations accumulate over the training steps. This policy applies an exponential decay function to a provided initial learning rate, which is set to 0.005 in this case. It requires a global step, a decay step and a decay rate to compute the decayed learning rate. The decay rate is set to 0.98, so over 100 epochs of the training data the learning rate decreases from 0.005 to 0.000 676 631 7. Fig. 5 shows the curve for this training experiment.
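The schedule reduces to a one-line exponential decay. Assuming one decay step per epoch (an assumption consistent with the numbers quoted above), the rate has decayed 99 times by the 100th epoch, and 0.005 × 0.98^99 ≈ 0.000 676 631 7, matching the value in the text.

```python
def decayed_learning_rate(step, initial_lr=0.005, decay_rate=0.98,
                          decay_steps=1):
    """Exponential decay: lr = initial_lr * decay_rate ** (step / decay_steps)."""
    return initial_lr * decay_rate ** (step / decay_steps)

# One decay step per epoch (assumed): by the 100th epoch the rate has
# decayed 99 times, 0.005 * 0.98**99 = 0.000 676 631 7.
print(decayed_learning_rate(99))  # -> 0.000676631...
```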

Fig.5 Plot curve of the decay learning rate during 100 epochs of training

Training the CNN model is the most expensive and time-consuming part of the process, because many candidate parameters must be attempted and the softmax cross-entropy loss must be computed a rapidly growing number of times as the iterations over all the training data increase. To keep the training time within a practical limit, a GPU-based training machine with a modern 4-core CPU and a single Nvidia GeForce GTX 1070 GPU is employed, resulting in a significant increase in speed during training and testing. At the high end of the experiments, training this three-layer CNN model for 100 epochs on 640 000 depth images at 320×240 resolution took about 35 h on one GTX 1070 GPU, which is considerably faster than CPU solutions, with a potential speed-up by a factor of 30∼50. The resulting model has roughly 1 M parameters, mostly from the FC layer.

1.2.3 Testing

The learned model is tested on a dataset of 2 772 depth image frames of various kinds of hand gestures, which were randomly selected and removed from the training dataset. These 2 772 frames are divided into 8 956 region images using the divide-image procedure (Step 1) of the pipeline as previously described. The depth data then pass through the forward propagation of Fig. 4 to a softmax activation function with class count C = 2. Evaluation with the pre-trained CNN model takes on average 32.25 ms on a modern 4-core desktop CPU to make a prediction for each depth test frame, while it takes only 2.2 ms on an Nvidia GTX 1070 GPU with highly parallel processing.

1.3 Merging of hand parts

A single frame of the depth image captured by the RGB-D camera has been divided into two or more images; every image predicted by the CNN model to contain a portion of the hand can then be merged into a complete object using the pseudo-code shown in Fig. 6. To give readers an intuitive understanding of the whole pipeline, a complete processing run of the approach is given in the next section.

Fig.6 Pseudo-code 1: algorithm of Step C, the merge-hand procedure
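A minimal sketch of this merge step, expressed as a logical OR over boolean masks (Fig. 6 gives the authoritative pseudo-code; the names here are illustrative):

```python
import numpy as np

def merge_hand_parts(parts, predictions):
    """Step C: OR together every subdivided region that the CNN
    labeled as hand. `parts` are boolean masks and `predictions`
    the binary CNN labels."""
    hand = np.zeros_like(parts[0], dtype=bool)
    for mask, is_hand in zip(parts, predictions):
        if is_hand:
            hand |= mask
    return hand
```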

2 Experiment and results

2.1 Experiment setup

2.1.1 Dataset collection

Collecting large volumes of data, such as hundreds of labeled depth images, is very expensive and excessively time consuming, so real hand depth data from the existing HandNet database [16], which provides depth maps and ground truth segment maps for the hand in each image, are used in the experiment. This dataset is randomly split into 202 197, 2 772 and 100 images for training, validation and testing, respectively.

2.1.2 Ground-truth labeling

Since the algorithm first splits the single depth image into several images, which separates the objects from each other in three-dimensional space, it is impossible to directly use the ground truth segment map of the original depth data provided by HandNet. Creating labeled datasets for training via manual annotation is labor-intensive. Instead, an automatic algorithm represented by Eq. (2) was developed to label these subdivided images and prepare the datasets. It is important to point out that a positive hand label only means that the object in the depth image belongs to a part of the hand; it does not mean that the image contains a complete hand. The ratio threshold for defining positive or negative is set to 0.5. This may cause a slight reduction in the segmentation accuracy of the training data, because a small number of subdivided hand regions have an intersection ratio of less than 0.5 and will be mislabeled as negative. To avoid this issue, a higher quality dataset with manual annotation labels is recommended for future work.

ratio = size(subdivided_img ∩ segmap_gt) / size(segmap_gt),    (2)

where subdivided_img is a subdivided region image and segmap_gt is the ground truth segment map.
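A direct transcription of Eq. (2) into NumPy might read as follows, with the 0.5 threshold stated above; the masks are assumed to be boolean arrays.

```python
import numpy as np

def label_proposal(subdivided_img, segmap_gt, thresh=0.5):
    """Eq. (2): label a region positive when it covers more than
    half of the ground-truth hand segment map."""
    ratio = np.logical_and(subdivided_img, segmap_gt).sum() / segmap_gt.sum()
    return 1 if ratio > thresh else 0
```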

2.1.3 Procedures pipeline

The method integrates the three procedures described above into a pipeline that performs the hand segmentation operation, as shown in Fig. 7.

Fig.7 Pseudo-code 2: complete algorithm for segmenting the hand from the background
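For illustration, the whole pipeline could be composed from the earlier sketches roughly as follows. All helper names come from those illustrative snippets, and `model` is assumed to wrap resizing, the CNN forward pass and an argmax; Fig. 7 remains the authoritative description.

```python
def segment_hand(depth, model):
    """End-to-end sketch of the Fig. 7 pipeline using the helper
    functions sketched earlier (all names are illustrative)."""
    hist, _ = depth_histogram_peaks(depth)            # Step A: histogram
    thresholds = minimum_thresholds(hist)             # valley thresholds
    bounds = [25] + sorted(thresholds) + [231]
    parts = []
    for lo, hi in zip(bounds, bounds[1:]):            # slice by depth range
        mask = fill_small_holes((depth >= lo) & (depth < hi))
        parts.extend(split_regions(mask))             # boundary-tracing stand-in
    preds = [model(p) for p in parts]                 # Step B: CNN labels
    return merge_hand_parts(parts, preds)             # Step C: merge
```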

2.2 Results

Denote a proposal hand area from the three-step pipeline as Ip, and its corresponding ground-truth segmentation map as Igt. The similarity between a proposal and its corresponding ground truth can be quantified by pixel intersection over union (IoU) [20], which is adopted to measure the performance of hand segmentation at the pixel level. Proposals with IoU lower than 0.5 are treated as negative examples. The remaining proposals are labeled positive, since an IoU over 0.5 indicates a large overlap with the corresponding ground truth. The success rate of the segmentation method at the IoU threshold of 0.5 is calculated as the ratio of positive examples among all testing examples.
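Pixel IoU and the success rate reduce to a few lines of NumPy; a sketch under the definitions above, assuming boolean masks:

```python
import numpy as np

def pixel_iou(proposal, ground_truth):
    """Pixel-level IoU between a proposal mask I_p and its
    ground-truth map I_gt."""
    inter = np.logical_and(proposal, ground_truth).sum()
    union = np.logical_or(proposal, ground_truth).sum()
    return inter / union if union else 0.0

def success_rate(ious, thresh=0.5):
    """Share of test frames whose IoU meets the threshold."""
    return sum(iou >= thresh for iou in ious) / len(ious)
```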

By calculating the IoU for each frame, the success rate curve shown in Fig. 8 is obtained for different IoU thresholds. The plot shows that the pipeline can generate high quality proposals that have high overlap with the ground truth segmentation map. The success rates for IoU thresholds of 0.5, 0.6, 0.7, 0.8 and 0.9 are approximately 98.01%, 95.35%, 91.34%, 84.60% and 63.89%, respectively.

Fig.8 Success rate of the segmentation method at different IoU thresholds compared with RDF

For further quantitative evaluation, precision, recall (equivalent to accuracy in this scenario), F1 score and mean IoU (mIoU) are adopted in the experiment. However, in the application of hand pose estimation and gesture recognition, the precision of hand segmentation is more valuable than recall. If the segmentation algorithm cannot detect a hand in the image from the camera, the recognition process can simply drop the depth frame and process the next frame of the continuous video sequence. The performance of recognition will not be adversely impacted even if an image containing a hand is mislabeled as an empty image.

Since RDF is a popular algorithm that has achieved promising segmentation results in various scenarios [1, 13, 21-22], the pipeline method is compared with an RDF-based hand segmentation approach followed by a post-processing procedure consisting of median filtering and blob detection [1]. The mIoU is the average of the IoU over the hand class only. The CNN model is fine-tuned on the HandNet dataset for 100 epochs, and on average the mIoU of the proposed method is approximately 88.34%.

Table 1 Comparison of the method with RDF in mIoU and precision/recall/F1 score (%)

Table 1 shows that the approach achieves better performance in both the precision/recall/F1 score measures and the mean IoU metric. Furthermore, since the method is a three-step pipeline consisting of separation, classification and merging, it does not suffer from the "boundary pixel" issue of the RDF method. Example results of the proposed and ground truth segmentation maps are shown in Fig. 9.

Fig.9 Example segmentation results

In addition, the execution time of the three-step pipeline is recorded. Step 1, the divide-image procedure, processed 200 examples in 699.78 ms, an average cost of 3.49 ms on a modern four-core CPU without multithread optimization. In Step 2, the CNNPredict module had an average cost of 32.25 ms per depth image preprocessed by the divide-image procedure on a common 4-core desktop CPU with the trained CNN model. This time was sharply cut to 2.2 ms on a GTX 1070 GPU, approximately 15∼20 times faster than the CPU. In Step 3, merging all parts of the predicted hand into a complete object costs about 2.15 ms.

The current pure CPU implementation of the entire three-step pipeline runs at 37.89 ms per frame. The GPU-based approach requires 7.84 ms to complete the hand segmentation, fast enough to allow follow-up methods to perform fully articulated hand tracking at 25 Hz. The proposed method, together with the use of GPUs for highly parallel processing, makes real-time hand tracking more feasible. It achieves high accuracy (88.34% mIoU) with a forward-propagation performance suitable for real-time applications (68 ms on an Nvidia GTX 1070).

3 Conclusions and future works

This paper proposes a novel approach for hand segmentation in depth images. The experimental results demonstrate that the approach produces more accurate hand segmentation than the existing baseline RDF method for various poses, orientations and sensing distances, within a short processing time.

In the future, the method is planned to be evaluated by training and testing on all possible combinations of existing databases, including HandSeg, Freiburg and NYU Hand, where similar conclusions are expected.