Information about http://ssli.ee.washington.edu/vj/files/euro2005.pdf

            Maximum Margin Learning and Adaptation of MLP Classifiers

                                      Xiao Li, Jeff Bilmes and Jonathan Malkin

                                        Department of Electrical Engineering
                                         University of Washington, Seattle
                                      {lixiao,bilmes,jsm}@ee.washington.edu



Abstract

Conventional MLP classifiers used in phonetic recognition and speech
recognition may encounter local minima during training, and they often lack
an intuitive and flexible adaptation approach. This paper presents a hybrid
MLP-SVM classifier and its associated adaptation strategy, where the last
layer of a conventional MLP is learned and adapted in the maximum separation
margin sense. This structure also provides a support vector based adaptation
mechanism which better interpolates between a speaker-independent model and
speaker-dependent adaptation data. Preliminary experiments on vowel
classification have shown promising results for both MLP learning and
adaptation problems.

1. Introduction

Multilayer perceptron (MLP) classifiers have been widely used in vowel
classification and general phonetic recognition systems [1, 2, 3] because of
their efficient discriminative training ability. They have also been
integrated into HMMs to enhance speech recognition [4, 5]. The learning
objective of an MLP classifier is usually minimum relative entropy. Ideally
an MLP outputs the posterior probabilities of the classes given an
observation, and this will naturally minimize classification errors. However,
minimum relative entropy is too strict an objective to have an analytical
solution. In practice, the optimization of this non-convex objective function
is often achieved by back-propagation, which is not guaranteed to find a
global optimum. Similarly, most of the existing solutions to MLP adaptation
have the same objective as MLP learning, and a common adaptation strategy is
either to partially retrain network parameters or to add augmentative,
speaker-dependent neurons [4, 5, 6, 7, 8, 9].

In this work, we present an MLP classifier enhanced by support vector
machines (SVM) [10]. The idea is to replace the hidden-to-output layer of an
MLP by maximum margin hyperplanes. In fact this structure does not change
the nature of an MLP; it is essentially an MLP learned with the maximum
separation margin criterion. Maximum separation margin is a relatively
relaxed objective compared to minimum relative entropy in the sense that it
only intends to minimize classification errors instead of the divergence
between two distributions. More importantly, optimizing this objective is
guaranteed to converge to a unique optimal solution. Furthermore, we propose
a support vector based adaptation strategy which offers an intuitive and
flexible mechanism to balance the roles that the speaker-independent (SI)
model and the adaptation data play in adaptation. A user adaptation scheme
related to our work can be found in [11] for handwriting recognition. The
difference is that [11] adopts an incremental learning approach to
adaptation, while our approach attempts to minimize test errors on the
adaptation data. Though we investigate the application of vowel
classification in this work, our methods can be applied to general MLP
classification and adaptation problems, e.g. to hybrid speech recognition
systems [4, 5].

The rest of the paper is organized as follows: Section 2 and Section 3
discuss the learning and adaptation strategies respectively for the hybrid
MLP-SVM classifier. Section 4 provides a background of our vowel
classification application. Section 5 presents our preliminary evaluation,
followed by conclusions in Section 6.

2. Hybrid MLP-SVM Classifier

The essential idea of a hybrid MLP-SVM classifier is to replace the
hidden-to-output layer of an MLP by optimal margin hyperplanes [10]. We
believe this hybrid MLP-SVM classifier is superior to pure MLPs in that it
finds a unique solution to the last layer parameters via convex optimization
with a primal-dual interpretation, and that it guarantees an upper bound on
test errors [12]. Furthermore, this classifier can be implemented more
efficiently than nonlinear SVMs trained in the input space. This is because
a nonlinear SVM requires selecting and tuning a kernel to achieve a good
nonlinear mapping from the input space to a transformed feature space where
data are presumably linearly separable. In the case of a hybrid MLP-SVM
classifier, this nonlinear mapping is implicitly optimized during the MLP
training in the form of a sigmoid kernel.
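
The role of the hidden layer as a nonlinear mapping can be illustrated with
a small sketch. It is only an illustration under stated assumptions: the
weights below are hand-picked, hypothetical values rather than parameters
learned by back-propagation, and the hyperplane is set by hand rather than
obtained from SVM training. The sketch shows that XOR-labelled inputs, which
no hyperplane separates in the input space, become separable by a single
hyperplane once they are passed through a sigmoid hidden layer.

```python
import math


def sigmoid(z):
    """Logistic activation used at the hidden layer."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical, hand-picked input-to-hidden parameters (not learned).
# Hidden unit 0 behaves like a soft OR, hidden unit 1 like a soft AND.
W_HIDDEN = [[10.0, 10.0], [10.0, 10.0]]
B_HIDDEN = [-5.0, -15.0]

# Hypothetical hyperplane in the hidden space: d(h) = <w, h> + b.
W_OUT = [1.0, -1.0]
B_OUT = -0.5


def hidden_features(x):
    """Map an input vector x to the hidden-node vector h."""
    return [
        sigmoid(sum(w_i * x_i for w_i, x_i in zip(row, x)) + b)
        for row, b in zip(W_HIDDEN, B_HIDDEN)
    ]


def discriminant(h):
    """Linear discriminant evaluated in the hidden (transformed) space."""
    return sum(w_i * h_i for w_i, h_i in zip(W_OUT, h)) + B_OUT


def classify(x):
    """Return +1 or -1 according to the sign of the discriminant."""
    return 1 if discriminant(hidden_features(x)) > 0 else -1


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [-1, 1, 1, -1]  # XOR labels: not linearly separable in input space

predictions = [classify(x) for x in inputs]
# Functional margins y * d(h); all positive means every sample lies on the
# correct side of the hyperplane in the hidden space.
margins = [y * discriminant(hidden_features(x)) for x, y in zip(inputs, labels)]
print(predictions)
```

In the hybrid classifier the mapping would come from the trained MLP and the
hyperplane from maximum margin training on the hidden-node vectors; the
sketch only fixes both by hand to make the geometry visible.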

(This work was funded by NSF under Award ISS/ITR-0326382.)

Specifically, we first build up a simple MLP with one hidden layer. The
input layer consists of $D \cdot W$ nodes, where $D$ is the dimension of the
feature vectors, and $W$ is the input window size. The hidden layer has $N$
hidden nodes, and the output layer has $K$ output nodes representing $K$
class probabilities. The hidden-to-output layer weight vector and bias with
respect to the $k$-th output are denoted as $w_k$ and $b_k$. A sigmoid
function is used as the nonlinear activation function at the hidden layer,
and the output probabilities are normalized by a softmax function. At the
stage of training, the network is optimized via back-propagation to minimize
the relative entropy between the output distribution and the true label
distribution. At classification time, the softmax function only serves as a
normalizer, and the decision is essentially determined by a set of linear
discriminant functions

$$ d_k(h_t) = \langle w_k, h_t \rangle + b_k, \tag{1} $$

where $h_t$ is the hidden node vector of the $t$-th sample.

In the second training phase, we take as input the hidden node vectors
computed from the training data using the optimized input-to-hidden layer
parameters. We then train optimal margin hyperplanes, $\{w_k, b_k\}$, for
each class $k = 1, \ldots, K$ on these inputs. Note that we use the SVM
scheme of "one-versus-the-rest" [12] to deal with multiple classes for a
better comparison with MLP classifiers. Also, the MLP labels {1,0} for a
particular class are converted to {1,-1} to accommodate SVM formulas. In
fact, the resulting classifier has exactly the same discriminant functions
as in Equation (1). The only difference lies in the learning objective:
among all the oriented hyperplanes for a specific binary classifier, there
exists a unique optimal one which maximizes the margin between any training
sample and the hyperplane. This optimal margin hyperplane can be found by
solving the following constrained quadratic optimization problem [12] (here
we consider only one binary classifier and drop the index $k$ from our
notation for simplicity),

$$
\begin{aligned}
\min_{w, b, \xi} \quad & \frac{1}{2}\|w\|^2 + C \sum_{t \in R^g} \xi_t \\
\text{subject to} \quad & y_t(\langle w, h_t \rangle + b) + \xi_t - 1 \ge 0
  \ \text{and}\ \xi_t \ge 0,
\end{aligned}
\tag{2}
$$

where $R^g$ denotes the training set, $\xi_t$ are slack variables introduced
for non-separable data, and $C$ penalizes those samples across the boundary.
By introducing Lagrange multipliers $\alpha_t$, we have the solution
$w = \sum_{t \in R^g} \alpha_t y_t h_t$. The resulting hyperplane is
determined by those samples with nonzero $\alpha_t$ values, known as support
vectors (SV).

3. Adaptation Strategy

As mentioned in the introduction, nearly all the existing MLP adaptation
algorithms achieve the tradeoff between the SI model and the adaptation data
by partially retraining the original network or by training additional
neurons [4, 5, 6, 7, 8, 9]. This strategy, however, has the following
limitations: (a) Similar to MLP learning, the optimization is not guaranteed
to reach a global optimum; (b) The number of free parameters to estimate at
the adaptation stage often relies on input, hidden and output layer
dimensions; (c) The interpolation between the SI model and the adaptation
data is not always intuitive and flexible.

In this work, we propose to update only the hidden-to-output layer at
adaptation time in the maximum margin sense, while keeping the
input-to-hidden layer intact. As mentioned in Section 2, the input-to-hidden
layer acts as a nonlinear mapping from the original $D \cdot W$-dimensional
input space to a new $N$-dimensional feature space, while the
hidden-to-output layer simply acts as $K$ binary linear classifiers in this
transformed feature space. Fixing the input-to-hidden layer is akin to
fixing the kernel, while retraining the hidden-to-output layer is equivalent
to updating the SVs for a specific speaker.

Since only the SVs contribute to the decision boundary, training on the SVs
only would give exactly the same hyperplane as training on the whole data
set. This makes an SVM amenable to incremental learning [13] where only the
SVs are preserved and are combined with new data in the next training epoch.
The user adaptation problem has been tackled in the same fashion, where the
SVs trained using user-independent data are combined with a subset of
user-dependent data for adaptation. Examples of this can be found in the
field of handwritten character recognition [11, 14]. The adaptation problem,
however, is not entirely the same as the problem of incremental learning.
The former aims to reduce the test error of user-dependent data, whereas the
latter aims to reduce the test error of all data. Furthermore, the number of
SVs obtained from the training set is often larger than that of the
adaptation data. Therefore, it is not always effective to treat all the old
SVs and the adaptation data equally or even to discard a portion of the
adaptation data.

To solve the adaptation problem, we propose to weight the slack variables to
make an error on the training data less costly than one on the adaptation
data. Again we only consider one binary classifier. We define $SV^g$ and
$w^g$ as the SVs and the corresponding hyperplane obtained from the SI data
$R^g$; and $SV^a$ and $w^a$ as those obtained from the adaptation data
$R^a$. Note that $SV^g \cap R^a = \emptyset$. We then modify the objective
function in Equation (2) as

$$
\begin{aligned}
\min_{w, b, \xi} \quad & \frac{1}{2}\|w\|^2 + C \sum_{t \in SV^g} p_t \xi_t
  + C \sum_{t \in R^a} \xi_t \\
\text{subject to} \quad & y_t(\langle w, h_t \rangle + b) + \xi_t - 1 \ge 0
  \ \text{and}\ \xi_t \ge 0, \quad t \in SV^g \cup R^a.
\end{aligned}
\tag{3}
$$

In this way, we can adjust how important a role the SI data plays in the
adapted classifier. In an extreme case, where $p_t = 1$ for all
$t \in SV^g$, the above objective is equivalent to training an SVM using all
old SVs and all adaptation data. At the other extreme, where $p_t = 0$ for
all $t \in SV^g$, the adaptation leads to a completely new SVM trained using
only the adaptation data. Between these two extremes, we would like to
weight each sample in $SV^g$ by how likely it is to be generated from the
adaptation data distribution. Specifically

$$ p_t = g\left( \frac{1}{s} \left( \langle w^a, h_t \rangle + b^a \right) \right), \tag{4} $$

where $s = \sum_{t \in R^a} |\alpha_t|$ is a scalar for normalization, and
$g$ can be a monotonically increasing function converting a real number to a
probability. In this work, we use an indicator function
$g(x) = \mathbf{1}(x > d)$ for efficiency, where $d$ is a constant
controlling the amount of SI data information incorporated in adaptation.
All $SV^g$ are selected when $d = -\infty$, while none of them are selected
when $d = +\infty$.

Finally, it is worth noting that this idea can be considered analogous to
[9]. The adaptation scheme presented in [9] aims to minimize the relative
entropy. It interpolates the SI model and the adaptation data by retraining
only the most "active" hidden neurons, i.e. those with the maximum variance
over the adaptation data. In our work, instead, we use the maximum margin
objective for adaptation and achieve interpolation by combining part of the
SI support vectors with the adaptation data.

4. Application and Database

The Vocal Joystick (VJ) project, conducted at the University of Washington,
is intended to assist individuals with motor impairments in human-machine
interaction using non-verbal vocal parameters. The VJ system offers control
mechanisms for both continuous and discrete tasks. In an exemplary
application of cursor control, vowel quality is utilized to control the
direction of cursor movement, while voice intensity and pitch are used to
determine speed and acceleration. Using such an interface, the computer does
not need to wait for a complete command to actuate an action, but rather
continuously listens to a user and maps the user's voice into cursor
movement.

The VJ system performs frame-by-frame vowel classification in order to
capture the vocal tract shift in real time. Since the vowels pronounced in
the VJ framework may have a huge dynamic range in both intensity and pitch
values, a reliable frame-level vowel classifier robust to energy and pitch
variations is indispensable. Furthermore, this classifier should be amenable
to adaptation to further improve accuracy, since a vowel class articulated
by one speaker might overlap, in acoustic space, with a different vowel
class from another speaker. Therefore, our proposed MLP-SVM classifier and
its adaptation algorithm can be well applied to this problem.

We have collected a data set of constant-vowel utterances, consisting of 8
vowels whose IPA symbols and articulatory gestures are depicted in Figure 1.
This data set so far contains utterances from 15 speakers, but is expected
to have many more speakers eventually. Each speaker articulated each vowel
with all combinations of the following configurations: (a) long/short
(duration); (b) falling/level/rising (pitch); (c) loud/normal/quiet
(intensity). There are 2 × 3 × 3 = 18 utterances for each vowel from a
single speaker.

[Figure 1: Vowel set. Vowel chart with axes tongue advancement (front,
central, back) and tongue height (high, mid, low).]

The recordings of 10 speakers were allocated to the training set, while
those of the remaining 5 speakers were used for adaptation and evaluation.
There were 180 utterances (approximately 22,000 sample frames) for each
vowel class used for SI training. For a particular test speaker, the 18
utterances for each vowel class were further divided into 6 subsets with 3
utterances each. Each subset in turn was used for adaptation and the
remaining subsets for evaluation. There were 3/15 utterances (approximately
360/1,840 sample frames) for each vowel in the adaptation/evaluation subsets
respectively. We calculated the mean of the error rates over these 6
subsets, and we repeated this for each speaker. The final classification
error rate was an average over the 5 test speakers, and hence essentially an
average of 30 subsets.

5. Experiments and Results

Using these data, we conducted two sets of experiments. The first one only
included 4 vowel classes, /æ/, /a/, /u/ and /i/, while the second added the
other 4 vowels leading to 8 classes. For both experiments, we evaluated SI
and adapted classifiers on the same evaluation subsets. We varied the amount
of adaptation data by choosing either 1, 2 or 3 utterances in each
adaptation subset, corresponding to 1.2, 1.8 and 3.6 seconds, on average.

To construct the speaker-independent hybrid MLP-SVM classifier, we first
built a two-layer perceptron. The input layer consists of W=7 frames of
MFCCs and their deltas (with mean subtraction and variance normalization),
leading to 182 dimensions. The hidden layer has N=50 nodes. The W and N
values were empirically found to be the best for our task. The output layer
has either 4 or 8 nodes, corresponding to the 4 or 8 vowel classes. As
proposed, we replaced the hidden-to-output layer by optimal margin
hyperplanes trained by SVMTorch [15], where C=10 in the 4-class case and C=1
for the 8-class case. For comparison, we also built up a GMM classifier with
16 mixtures (empirically the best) per vowel class. Table 1 summarizes the
average error rates using these SI classifiers. Our proposed hybrid
classifier obtained the lowest error rate for both experiments. It improved
over
the pure MLP by a relative 13.9% error rate reduction in the 4-class case
and 7.3% in the 8-class case.

Table 1: Avg. error rates of SI classifiers

                 4-class     8-class
    GMM          14.87%      44.12%
    MLP          10.81%      39.95%
    MLP-SVM       9.30%      37.05%

Finally, we conducted adaptation experiments using the method proposed in
Section 3. The SVs of the training set, again denoted as $SV^g$, were
combined with the adaptation data to update the optimal margin hyperplanes.
In the 4-class case, the lowest error rate was obtained when $d = -\infty$,
meaning all $SV^g$ samples were used in adaptation. In the 8-class case, the
best performance was achieved when about 50% of the $SV^g$ samples were
used. It is interesting to notice that the old SVs were not always helpful
in adaptation. For comparison, we adapted the GMM classifier by maximum
likelihood linear regression (MLLR), and adapted the MLP classifier by
further training the hidden-to-output layer under the minimum relative
entropy criterion (this was done on all hidden neurons since we only had 50
hidden nodes). Table 2 and Table 3 show the lowest average error rates we
obtained using each classifier when the adaptation data was 1.2, 1.8 and 3.6
seconds respectively. The support vector based adaptation consistently
achieved the best performance when only a small amount of adaptation data
was available, while the simple MLP adaptation approach became superior when
more adaptation data was available in the 8-class case.

Table 2: Avg. error rates of adapted 4-class classifiers

    4-class      1.2s        1.8s        3.6s
    GMM+MLLR     12.10%      10.35%       9.27%
    MLP          10.25%       9.33%       8.34%
    MLP-SVM       8.59%       8.21%       7.37%

Table 3: Avg. error rates of adapted 8-class classifiers

    8-class      1.2s        1.8s        3.6s
    GMM+MLLR     36.64%      33.55%      32.48%
    MLP          31.82%      28.37%      25.48%
    MLP-SVM      28.37%      28.20%      27.27%

6. Conclusions

The hybrid MLP-SVM classifier presented in this work combines the MLP's
ability to model nonlinearity with the SVM's superior classification power.
Furthermore, the idea of weighting the slack variables in learning the
optimal margin hyperplanes offers an intuitive and flexible adaptation
mechanism. In a preliminary experiment for an in-house application, the
hybrid MLP-SVM classifier outperformed conventional MLP classifiers for SI
classification problems. Its associated adaptation strategy worked
remarkably well by using only a small amount of adaptation data. When more
adaptation data was available, a simple MLP adaptation scheme became the
best in the 8-class problem.

The authors would like to thank Richard Wright, Kelley Kilanski, Andrea
Macleod and Scott Drellishak for VJ data collection.

7. References

[1] H. Leung and V. Zue, "Phonetic classification using multi-layer
    perceptrons," in ICASSP, 1990.
[2] S. A. Zahorian and Z. B. Nossair, "A partitioned neural network approach
    for vowel classification using smoothed time/frequency features," IEEE
    Trans. on Speech and Audio Processing, 1999.
[3] P. Schmid and E. Barnard, "Explicit, n-best formant features for vowel
    classification," in ICASSP, 1997.
[4] H. Bourlard and N. Morgan, Connectionist Speech Recognition: A Hybrid
    Approach. Kluwer Academic Publishers, 1994.
[5] A. J. Robinson, "An application of recurrent nets to phone probability
    estimation," IEEE Trans. on Neural Networks, 1994.
[6] V. Abrash, H. Franco, A. Sankar, and M. Cohen, "Connectionist speaker
    normalization and adaptation," in Eurospeech, 1995.
[7] J. Neto, L. Almeida, M. Hochberg, C. Martins, L. Nunes, S. Renals, and
    T. Robinson, "Speaker-adaptation for hybrid HMM-ANN continuous speech
    recognition system," in Proc. Eurospeech, 1995.
[8] N. Strom, "Speaker adaptation by modeling the speaker variation in a
    continuous speech recognition system," in ICSLP, 1996.
[9] J. Stadermann and G. Rigoll, "Two-stage speaker adaptation of hybrid
    tied-posterior acoustic models," in ICASSP, 2005.
[10] V. Vapnik, The Nature of Statistical Learning Theory, Chapter 5.
     Springer-Verlag, New York, 1995.
[11] N. Matic, I. Guyon, J. Denker, and V. Vapnik, "Writer adaptation for
     on-line handwritten character recognition," in Proc. Intl. Conf. on
     Pattern Recognition and Document Analysis, 1993.
[12] B. Scholkopf and A. J. Smola, Learning with Kernels. The MIT Press,
     2001.
[13] N. Syed, H. Liu, and K. Sung, "Incremental learning with support vector
     machines," in Proc. Workshop on Support Vector Machines at the Intl.
     Joint Conf. on Artificial Intelligence, 1999.
[14] B.-B. Peng, Z.-X. Sun, and X.-G. Xu, "SVM-based incremental active
     learning for user adaptation for online graphics recognition system,"
     in Proc. Intl. Conf. on Machine Learning and Cybernetics, 2002.
[15] R. Collobert and S. Bengio, "SVMTorch: support vector machines for
     large-scale regression problems," Journal of Machine Learning Research,
     2001.