A Dual Discriminator Method for Generalized Zero-Shot Learning

2024-05-25 Tianshu Wei and Jinjie Huang

Computers, Materials & Continua, 2024, Issue 4

Tianshu Wei and Jinjie Huang,2,⋆

1School of Computer Science and Technology, Harbin University of Science and Technology, Harbin, 150006, China

2School of Automation, Harbin University of Science and Technology, Harbin, 150006, China

ABSTRACT Zero-shot learning enables the recognition of new class samples by transferring models learned from semantic features and existing sample features to classes that have never been seen before. Feature consistency across modalities and domain shift are two of the critical issues in zero-shot learning. To address both issues, this paper proposes a new model structure. Traditional approaches map semantic features and visual features into the same feature space; building on this, the proposed model adds a dual discriminator. The dual discriminator further enhances the consistency between semantic and visual features. It also aligns unseen class semantic features with training set samples, providing a portion of information about the unseen classes. In addition, a new feature fusion method is proposed. This method is equivalent to adding perturbation to the seen class features, which reduces the degree to which the classification results are biased towards the seen classes. The feature fusion method also provides part of the information of the unseen classes, improving classification accuracy in generalized zero-shot learning and reducing domain bias. The proposed method is validated and compared with other methods on four datasets, and the experimental results show that it achieves promising results.

KEYWORDS Generalized zero-shot learning; modality consistent; discriminator; domain shift problem; feature fusion

1 Introduction

Traditional image classification methods need a large number of annotated images for model training. For new things for which training images cannot be collected at scale, these methods cannot classify the new things directly. Zero-shot learning addresses this problem: it learns from existing samples and then infers the categories of new things. Zero-shot learning recognizes new things using linguistic descriptions of them, and we refer to these linguistic descriptions as semantic features in this paper.

Two types of features are needed in zero-shot learning: sample features (visual features) and the semantic features mentioned above. These two types of features belong to different feature spaces, and aligning them is very important. Aligning semantic and visual features is usually done by mapping them to the same feature space [1–5]. We refer to these methods as embedding methods [6,7]. However, these methods sometimes only consider information from the seen classes, which can decrease the accuracy when classifying unseen class samples.

To address the misclassification of unseen classes, some researchers add information about unseen classes to their models; generative models are commonly used for this purpose [8–11]. Although generative models can achieve good classification results, they must first be trained and then used to obtain pseudo-samples of the unseen classes, after which a classifier is trained using the pseudo-samples. This makes the process more complicated than other methods. Incorporating unseen class semantic features into the loss function [12] or adding a calibration term to the classification [13,14] is another technique to increase the classification accuracy of unseen class samples. In addition, some literature has noted that the similarity between features also leads to a decrease in zero-shot classification accuracy. Zhang et al. [15] proposed imposing orthogonality constraints between semantic features to differentiate the semantic features of different classes. This approach increased the differences between categories and alleviated the domain shift problem.

We similarly add information about unseen classes to the model. Unlike the methods mentioned above, our model uses a new feature alignment method. In addition to the traditional mapping approach, we use a dual discriminator approach to align the semantic and visual features. Instead of increasing the distance between the visual and semantic features of different categories, we increase the consistency between the hidden space visual features and the semantic features of all classes. This approach not only aligns features but also provides information about unseen classes. A new feature fusion approach is also used for classifier training to alleviate the bias problem. Our contributions are as follows:

(1) We propose a new model structure for solving the alignment problem of different modal features and the domain shift problem.

(2) To better align semantic and visual features, this paper proposes a dual discriminator module, which can also provide information about the unseen classes.

(3) We propose a new feature fusion method by which the seen class features are perturbed to reduce the degree to which the classification results in the model are biased toward the seen classes and provide information on the unseen classes.

(4) Our method was validated on four different datasets. The experimental results demonstrate that the proposed model obtains promising results, especially on the aPY dataset, where it outperforms the Spherical Zero-Shot Learning (SZSL) method [24] by 5.1%.

2 Related Works

2.1 Zero-Shot Learning

Semantic features and visual features belong to different feature spaces with different dimensions, so the usual choice is to map both into the same feature space. Figs. 1a and 1b show two mapping methods: from semantic space to visual space and from visual space to semantic space. Liu et al. [6] proposed a Low-Rank Semantic Autoencoder (LSA) to enhance the zero-shot learning capability. Before classification, they used a mapping matrix to map semantic features to the visual space. Tang et al. [4] mapped visual features to the semantic space and realized feature alignment and classification by calculating the mutual information between semantic features and visual features. In addition to the two mapping methods in Figs. 1a and 1b, some works use a common feature space. Hyperbolic spaces can maintain a hierarchy of features, and Liu et al. [16] proposed mapping the visual features and the semantic features into hyperbolic space. Li et al. [17] used direct sum decomposition to decompose the semantic features into subspaces and embedded semantic features and visual features into a common space. Another approach maps semantic features to the visual space while also projecting visual features to the semantic space; this reduces the domain shift problem and allows better alignment of both features [5,18,19]. The methods mentioned above only consider the information of the seen classes when training the models and ignore the information provided by the unseen class semantic features. The resulting compression of the unseen class information leads to the misclassification of unseen class samples. Especially in generalized zero-shot learning, neglecting the unseen class information can cause most samples to be biased towards the seen classes.

Figure 1: Embedding methods

2.2 Domain Shift Problem

Since the unseen class samples only appear in the test set and their distribution differs from that of the seen class samples, the model is biased when classifying the unseen class samples; this phenomenon is the domain shift problem. Especially when the test set also contains seen class categories, the unseen class samples are more likely to be misclassified as one of the seen classes. Adding information about unseen classes to the model has been proposed to address this problem. Some researchers proposed generative models to generate unseen class samples [8–10,20]. These methods use pseudo-samples instead of real samples for training the classifier. Huynh et al. [12] proposed adding a term about the unseen class information to the loss function so that the information about the unseen classes is not compressed too much. In addition to these two methods, Jiang et al. [21] used class similarity as the coefficients in the loss function to improve the classification accuracy. To make semantic features more distinguishable, some researchers have imposed constraints on the semantic features of all classes so that the semantic features of different classes can be told apart. In this way, all the features can be better categorized when mapped to the same feature space, which alleviates the domain shift problem. Wang et al. [22] proposed adding orthogonal constraints to the class prototypes of all classes. Zhang et al. [15] proposed bi-orthogonal constraints on the latent semantic features and used a discriminator to reduce the modality differences. Zhang et al. [23] proposed corrected attributes for both seen and unseen class semantic features; the corrected attributes can be discriminative in zero-shot learning and alleviate the domain shift problem. Shen et al. [24] used a spherical embedding space to classify the unseen class samples; this method used different radii and spherical alignments on angles to alleviate the prediction bias.

In [15], the authors proposed using an adversarial network to distinguish the semantic features from the visual features. Our method also uses a discriminator for the semantic and visual features. However, our method imposes no orthogonality restriction on the semantic features and employs a dual discriminator approach to align the features of different modalities. This dual discriminator can provide part of the information about the unseen classes. To alleviate the problem that most unseen class samples are classified into seen classes, we propose a feature fusion method that reduces the seen class information and increases the unseen class information to some extent.

3 A Dual Discriminator Method for Generalized Zero-Shot Learning

3.1 Definition of Problem

The training set can be denoted by T={Xt,At,Yt}. We use U={Xu,Au,Yu} to represent the unseen classes. X represents the visual features, A represents the semantic features, and Y represents the labels. We use the subscripts t and u to represent seen classes and unseen classes. In conventional zero-shot learning (CZSL), unseen class samples are classified into the unseen classes only. In generalized zero-shot learning (GZSL), test samples are classified into all classes (both seen and unseen classes).
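The difference between the two label spaces can be illustrated with a minimal Python sketch. The class names, the scores, and the helper `predict` are hypothetical and only serve to show how restricting or widening the candidate set changes the prediction; they are not taken from the paper.

```python
def predict(scores, candidates):
    """Return the candidate class with the highest compatibility score."""
    return max(candidates, key=lambda c: scores[c])


# Hypothetical label sets and made-up scores for one unseen-class test sample.
seen_classes = ["horse", "dog"]
unseen_classes = ["zebra"]
scores = {"horse": 0.9, "dog": 0.1, "zebra": 0.7}

# CZSL: the label space contains the unseen classes only.
czsl_prediction = predict(scores, unseen_classes)

# GZSL: the label space contains both seen and unseen classes.
gzsl_prediction = predict(scores, seen_classes + unseen_classes)
```

With these made-up scores, the same sample is assigned to `zebra` under CZSL but to the seen class `horse` under GZSL, which is the kind of bias toward seen classes that GZSL methods try to reduce.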

3.2 The Architecture of the Proposed Method

The proposed method is shown in Fig. 2. We only consider GZSL in this paper. The visual features Xt are encoded to get the hidden space features Zt1 and Zt2, with Zt1=Zt2. The hidden space features are aligned with the seen class semantic features At and the unseen class semantic features Au through two discriminators. The features in the hidden space are decoded to get new visual features, and the new visual features are fused with the original visual features to form the input features f1 and f2 of the classifier. We use lowercase letters to represent a single feature. Each part of the model is described in detail below.

Semantic features and visual features belong to different feature spaces; mapping these two features to the same feature space and maintaining their consistency is an essential issue in zero-shot learning. Inspired by [25], we use the latent space visual features to make the features of different modalities consistent.

In [15], the authors used a discriminator to discriminate between the features of different modalities. In contrast, we use two discriminators to enhance the consistency of the two modality features. We take one of the discriminators as an example; its structure is shown in Fig. 3. In generative adversarial networks [26], a discriminator distinguishes whether a sample is generated or real, which makes the generated samples more similar to the real samples. In this paper, we regard the hidden space visual features obtained from the encoder as generated samples and the semantic features as real samples, so that the discriminator makes the hidden space visual features more similar to the semantic features and thus more consistent with them. In addition, to reduce the domain shift problem and increase the information of the unseen classes, a second discriminator is used for the semantic features Au and the hidden space visual features.

Figure 2: The proposed method

The other discriminator has the same structure as that shown in Fig. 3. Inspired by Wasserstein Generative Adversarial Nets (WGAN) [26], we write the loss function of the discriminator in the following form:

Figure 3: The structure of the discriminator

Here, λ1 and λ2 are coefficients. D1 and D2 represent the two discriminators, where D1 denotes the discriminator associated with Zt1 and At, and D2 denotes the discriminator associated with Zt2 and Au. The subscript P represents the distribution of the data. In this paper, the interpolated sample is calculated slightly differently from [26]: for D1, it is computed as δ∗Zt1+(1−δ)∗At with δ∼U(0,1). These two discriminators align semantic and visual features and add the information of unseen classes. The encoder in Fig. 2 can be seen as the generator, and mean(•) represents the mean value. The loss function is shown in Eq. (2):
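The interpolation between a hidden space visual feature and a seen class semantic feature described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `interpolate`, the toy vectors, and the choice to draw a single δ per pair of vectors are hypothetical, and the sketch does not reproduce the discriminator loss itself.

```python
import random


def interpolate(z, a, rng=random):
    """Return (delta * z + (1 - delta) * a, delta) with delta drawn from U(0, 1).

    z: hidden space visual feature (list of floats)
    a: semantic feature of the same dimension (list of floats)
    Assumption: one delta is drawn per pair of vectors.
    """
    if len(z) != len(a):
        raise ValueError("z and a must have the same dimension")
    delta = rng.random()  # uniform draw from [0, 1)
    mixed = [delta * zi + (1.0 - delta) * ai for zi, ai in zip(z, a)]
    return mixed, delta


# Toy vectors, not real features.
z_t1 = [0.2, 0.8, 0.5]
a_t = [1.0, 0.0, 0.5]
x_hat, delta = interpolate(z_t1, a_t, random.Random(0))
```

Each component of `x_hat` lies between the corresponding components of `z_t1` and `a_t`, so the interpolated sample sits on the line segment joining the two features.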

The hidden visual features zt1 are passed through the decoder to get the new visual features, which need to be consistent with the original visual features; this relationship can be written as:

Similarly, the hidden space features zt2 are passed through the decoder to get new visual features, and the loss function concerning the original visual features is written as:

Here, Δx is a residual computed from the original visual feature xt. We first compute Δx and then evaluate Eq. (4). We use Δx instead of xt because zt2 contains a portion of the information of the unseen classes, and we want to reduce the compression of the knowledge of the unseen classes after the decoder. In the latent space, we also want the different modality features to be consistent with each other.

If only the features Xt are employed as input features to the classifier, the results will be biased towards the seen classes. Therefore, feature fusion is applied before the features are input into the classifier, as shown in Eqs. (7) and (8):

Here, μ1 and μ2 are coefficients. Feature fusion is equivalent to adding perturbations to the original visual features, which can compress the information about the seen classes and provide information about the unseen classes. Cross-entropy is used as the loss function of the classifier, computed between the true label yi and the predicted label.
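As an illustration of the classifier loss mentioned above, the following minimal sketch computes the cross-entropy for a single sample. It assumes that the classifier outputs raw scores (logits) that are turned into probabilities with a softmax; this is a common setup but is an assumption here, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, true_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[true_index])


# With four equally scored classes, the loss equals log(4).
loss_uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], 2)

# A confident correct prediction gives a smaller loss than a confident wrong one.
loss_correct = cross_entropy([5.0, 0.0, 0.0], 0)
loss_wrong = cross_entropy([5.0, 0.0, 0.0], 1)
```

In training, this per-sample loss would be averaged over a batch of fused features and their labels.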

The total loss function is:

where β is a coefficient. The proposed model is trained by alternating optimization: the discriminators are first trained with Eq. (1), and then the other networks in the model are trained with Eq. (10).
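The alternating schedule can be sketched as a simple loop. This is a hedged outline only: the function and argument names are hypothetical, the two callbacks stand in for the actual parameter updates, and the sketch assumes one discriminator update followed by one update of the remaining networks per round, which the text does not specify.

```python
def alternate_training(num_rounds, update_discriminators, update_other_networks):
    """Alternate between the two update steps for `num_rounds` rounds."""
    for _ in range(num_rounds):
        # Step 1: update the two discriminators (loss of Eq. (1) in the text).
        update_discriminators()
        # Step 2: update the remaining networks (total loss of Eq. (10) in the text).
        update_other_networks()


# Stand-in callbacks that only record the order of the updates.
update_log = []
alternate_training(
    num_rounds=2,
    update_discriminators=lambda: update_log.append("discriminators"),
    update_other_networks=lambda: update_log.append("other networks"),
)
```

In a real implementation, each callback would run an optimizer step on its own set of parameters while the other set is held fixed.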

4 Experiments

We validate our model on four datasets: Animals with Attributes 1 (AWA1) [27], Animals with Attributes 2 (AWA2) [28], Attribute Pascal and Yahoo (aPY) [29], and Caltech-UCSD Birds-200-2011 (CUB) [30]. The details of these four datasets are shown in Table 1.

Table 1: The details of the four datasets

In the proposed model, we use the RMSProp method to optimize the discriminator modules and the Adam method to optimize the other parts of the model. The learning rate is 0.001 for the AWA1 and AWA2 datasets and 0.006 for the CUB and aPY datasets. The output of the first layer in the encoder contains 512 units, and the output of the first layer in the decoder contains 256 units. The output dimensions of the fully connected layers in the discriminator are 1024 and 256. We set μ1=0.5 and μ2=1 in our model. The visual features and semantic features are taken from [28]. The dimension of the visual features is 2048. The complexity of the model is as follows: the FLOPs for AWA1, AWA2, CUB, and aPY are 4.86 M, 4.86 M, 6.77 M, and 4.68 M, and the sizes in bytes are 2.44 M, 2.44 M, 3.39 M, and 2.35 M.
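For convenience, the settings reported above can be collected in a small configuration object. The values are those stated in the text; the dictionary layout, the key names, and the helper `learning_rate_for` are hypothetical and only one possible way to organize them.

```python
# Values are taken from the text; key names are illustrative only.
HYPERPARAMS = {
    "discriminator_optimizer": "RMSProp",
    "other_optimizer": "Adam",
    "learning_rate": {"AWA1": 0.001, "AWA2": 0.001, "CUB": 0.006, "aPY": 0.006},
    "encoder_first_layer_units": 512,
    "decoder_first_layer_units": 256,
    "discriminator_fc_units": (1024, 256),
    "mu1": 0.5,
    "mu2": 1.0,
    "visual_feature_dim": 2048,
}


def learning_rate_for(dataset):
    """Look up the learning rate reported for a dataset name."""
    try:
        return HYPERPARAMS["learning_rate"][dataset]
    except KeyError:
        raise ValueError(f"no learning rate reported for dataset {dataset!r}")


lr_cub = learning_rate_for("CUB")
```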

4.1 Results of GZSL

The proposed method is compared with other methods in the GZSL setting. The evaluation protocol is taken from [28]. We use C to denote the average per-class top-1 accuracy and H to denote the harmonic mean. The subscripts s and u denote the seen classes and the unseen classes. C is the mean over classes of the fraction of correctly classified samples in each class, and the harmonic mean is H = 2 × Cs × Cu / (Cs + Cu).
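The two evaluation metrics used in this section, the average per-class top-1 accuracy and the harmonic mean of the seen and unseen accuracies, can be sketched in Python as follows. The function names and the toy labels are hypothetical; the definitions follow the standard GZSL protocol.

```python
from collections import defaultdict


def per_class_top1(y_true, y_pred):
    """Average of the per-class accuracies, so every class counts equally."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    if not total:
        raise ValueError("no samples given")
    return sum(correct[c] / total[c] for c in total) / len(total)


def harmonic_mean(c_s, c_u):
    """H = 2 * Cs * Cu / (Cs + Cu); defined as 0 when both accuracies are 0."""
    if c_s + c_u == 0:
        return 0.0
    return 2.0 * c_s * c_u / (c_s + c_u)


# Toy labels: two seen classes and two unseen classes.
c_s = per_class_top1(["dog", "dog", "cat", "cat"], ["dog", "dog", "cat", "dog"])
c_u = per_class_top1(["zebra", "zebra", "zebra", "whale"], ["zebra", "dog", "dog", "whale"])
h = harmonic_mean(c_s, c_u)
```

Because the harmonic mean is dominated by the smaller of the two accuracies, a model that classifies almost everything into seen classes obtains a low H even if its seen class accuracy is high.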

The results of the proposed method are shown in Table 2. On the AWA1 dataset, the result of the proposed method is 2.2% lower than the best result. The proposed method achieves promising results on the AWA2 and aPY datasets. On the aPY dataset in particular, it outperforms the Spherical Zero-Shot Learning (SZSL) [24] method by 5.1%. The methods Semantic Autoencoder+Generic Plug-in Attribute Correction (SAE+GPAC) [23], SZSL [24], Transferable Contrastive Network (TCN) [21], and Modality Independent Adversarial Network (MIANet) [15] all use the unseen class semantic features in their models. SAE+GPAC, SZSL, and MIANet impose constraints on the semantic features, making the features of different classes more distinguishable. TCN uses the relationship between unseen class and seen class semantic features as the coefficients of the loss function. The proposed method achieves better results than these four methods on the AWA1, AWA2, and aPY datasets, while SZSL and TCN are better than the proposed method on the CUB dataset. In summary, the proposed method gives good results on AWA2 and aPY but is not as good as the best competing methods on AWA1 and CUB, especially on CUB. This is because CUB is a fine-grained image dataset: although the proposed method can provide features about unseen classes, these features are not sufficiently discriminative between classes, which lowers the classification results.

Table 2: The results in GZSL

4.2 Parameters Influences

Figs. 4–7 show the effects of β in Eq. (10) on the generalized zero-shot classification results.

Figure 4: The effects of β on AWA1

Figure 5: The effects of β on AWA2

In Figs. 4–7, ‘tr’ and ‘ts’ denote the average per-class top-1 accuracy of the seen classes and the unseen classes, respectively. For the AWA1 and AWA2 datasets, as β increases, the harmonic mean and the unseen class accuracy increase, while the seen class accuracy decreases. For the aPY dataset, increasing β has little effect on the harmonic mean, while the accuracy decreases for the seen classes and increases for the unseen classes. For the CUB dataset, accuracy increases for unseen class samples and decreases for seen class samples. In summary, as β increases, the accuracy of the unseen classes increases, while the accuracy of the seen classes decreases.

Figure 6: The effects of β on aPY

Figure 7: The effects of β on CUB

4.3 Ablation Experiments and tSNE

The results of the ablation experiments are shown in Table 3. The method without discriminators and feature fusion is denoted as the baseline; in the baseline, the visual features are used directly as the classifier inputs f1 and f2. ‘baseline+feature fusion’ indicates that the model contains no discriminators and that f1 and f2 are calculated using Eqs. (7) and (8). ‘baseline+feature fusion+one discriminator’ denotes the method that adds a discriminator related to the semantic features of the seen classes.

Table 3: Ablation experiments

Table 3 shows that for AWA1, AWA2, and CUB, feature fusion substantially improves the harmonic mean. ‘baseline+feature fusion’ improves the accuracy of the seen classes compared to the baseline without reducing the accuracy of the unseen classes too much, which indicates that it can improve the seen class accuracy while keeping the unseen class samples from being massively biased toward the seen classes. On aPY, ‘baseline+feature fusion’ increases the accuracy of both seen and unseen classes compared to the baseline. Table 3 also shows that adding the discriminator increases the harmonic mean; this is because the discriminator not only adds information about the unseen classes but also makes the features of the different modalities more consistent.

Figs. 8a and 8b show the tSNE visualization for the AWA2 dataset. Fig. 8a shows the unseen class visual features in the AWA2 dataset, and Fig. 8b shows the visual features f2 obtained using feature fusion. Since the training set samples are used to obtain f2, the number of samples obtained for each class is different. The figure shows that the proposed method can provide a part of the distribution similar to the original sample features.

Figure 8: The tSNE of AWA2

4.4 The Influence of the Features ΔX

Fig. 9 shows the results of replacing ΔX in Eq. (4) with the original visual features Xt. As Fig. 9 shows, although good results can be obtained using the original visual features, they are still lower than those of the proposed method.

Fig. 10 shows the classification accuracy for each unseen class on the aPY dataset when replacing ΔX with the original features Xt. As Fig. 10 shows, the accuracy is lower than that of the proposed method, except for very few classes where the accuracy increases when using the original features.

Figure 9: The harmonic mean of the original train features used in Eq.(4)

Figure 10: The accuracy of the unseen class samples of aPY

5 Conclusions

We propose a new model structure for the consistency problem of different modal features and the domain shift problem in generalized zero-shot learning. The dual discriminator structure in the proposed model leads to a better alignment of semantic and visual features and provides part of the information about the unseen classes. In addition, a new feature fusion method reduces the information about seen classes and provides information about unseen classes, so the model is less biased towards seen classes in generalized zero-shot classification and achieves a higher harmonic mean. We evaluated the proposed model on four datasets, and the experimental results show the effectiveness of our approach, especially on the aPY dataset. In future work, we will explore using an attention mechanism to extract more discriminative features, which should enable better alignment of features across modalities and improve the accuracy of zero-shot classification.

Acknowledgement: The authors sincerely appreciate the editors and reviewers for their valuable work.

Funding Statement: The authors received no specific funding for this study.

Author Contributions: Study design and draft manuscript preparation: Tianshu Wei; reviewing and editing the manuscript: Jinjie Huang.

Availability of Data and Materials: The datasets used in the manuscript are public datasets and are available from https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/zero-shot-learning/zero-shot-learning-the-good-the-bad-and-the-ugly.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.