Information about http://www.elec.qmul.ac.uk/people/markp/2006/DaviesPlumbley06-dmrn.pdf

ENTROPY BASED BEAT TRACKING EVALUATION …

Pages: 4
Language: english
Created: Mon Jul 3 10:42:53 2006
                    ENTROPY BASED BEAT TRACKING EVALUATION

                                      Matthew E. P. Davies and Mark D. Plumbley
                                               Centre for Digital Music
                                           Queen Mary, University of London


Email: matthew.davies@elec.qmul.ac.uk

ABSTRACT

In this paper, we present a novel approach to beat tracking evaluation, based on finding the error between automatically generated beat locations and ground truth annotations. The error is normalised to the current inter-annotation-interval, such that the greatest observable error can be ±50% of a beat. We form a histogram of normalised beat error, from which we estimate the entropy as a measure of beat tracking performance, where low entropy indicates accurate beat locations, with the converse true for high entropy. We evaluate the performance of a human tapper in conjunction with three published beat tracking algorithms over an annotated test database and compare the results of our entropy based approach to existing evaluation methods.

Keywords: Beat tracking, evaluation, entropy

1. INTRODUCTION

The task of beat tracking of musical audio is well known and conceptually quite simple; the aim being to replicate the human ability of foot-tapping in time to a piece of music [1]. Recently, as techniques have improved, automatically extracted beat times have increasingly been used as a musically meaningful temporal segmentation for higher level analysis, for example in chord estimation [2], structural segmentation and thumb-nailing [3] and rhythmic pattern classification [4].

While many approaches exist for beat tracking (e.g. [1, 5, 6]) an important related area which has received less attention is that of evaluation - the problem of defining a suitable measure of beat accuracy. Given an agreed means of evaluation and publicly available annotated test databases (containing hand labelled beat locations), the ability to reliably compare the performance of different algorithms becomes much simpler. However, at present, there are no freely available databases, nor an agreed methodology for beat tracking evaluation. As a result, researchers use private databases [7, 5] and evaluate beat accuracy using their own metrics (e.g. [1, 5, 8, 9]), making comparative studies much harder to undertake.

For a given piece of music, where generated beat locations are compared to ground truth beat annotations, most existing methods (e.g. [1, 5, 9]) classify individual beats as correct or incorrect based on whether they occur within a pre-defined allowance window around a particular ground truth annotation. The overall accuracy for the piece is then given as the mean of the accuracy of each beat. Similarly, the mean of each piece is taken to indicate the accuracy over a test database of many pieces.

Klapuri et al [5] adopt a continuity based approach, where a given beat is only accurate if it falls within a specified allowance window and the same is true of the previous beat. Accuracy is calculated in two ways, first as the ratio of the longest continuously correct segment to the length of the input and then as the total number of correct beats. To reflect the metrical ambiguity of human subjects tapping in time to music, both measures are re-evaluated to allow for beats to occur at double or half the annotated metrical level. This leads to multiple measures of beat accuracy, which can be harder to interpret than a single all-encompassing value of accuracy.

The difficulties with beat evaluation are not limited to generating multiple measures of accuracy. The use of thresholds in defining allowance windows around ground truth annotations can also be problematic. Dixon [1] defines a fixed window of ±70ms around each annotation, in which a beat must fall for it to be accurate. For songs with a fast tempo the ±70ms window will cover a greater proportion of the current inter-annotation-interval (IAI), (the time between successive beat annotations) than for slower songs, and will therefore be biased towards music with a faster tempo. This bias can be removed by defining a tempo-dependent threshold, where beats are accurate if they fall within 17.5% of the current IAI, as in [5, 6, 9]. However this does not address a more inherent problem of using thresholds, that a beat at the edge of an allowance window would be deemed accurate, but the same beat would be inaccurate if it occurred 1ms later, even though human listeners are incapable of discriminating between events at this time scale [10, p.29].

The motivation for our approach to beat tracking evaluation is to provide a single measure of beat accuracy that is easy to interpret and does not rely on a fixed threshold. We achieve this, not by classifying individual beat accuracy, but by measuring the beat error. We extract the time between each beat and the nearest ground truth annotation and normalise it by the current IAI (to give a maximum error of ±50% of a beat). We then formulate a histogram of normalised beat error and estimate its entropy. High entropy will result from a uniform distribution, where beats are randomly distributed across the musical piece and provide no information about the locations of the ground truth annotations. Conversely, an entropy of zero will arise when all beats are exactly equal to all annotations. To permit comparisons against other evaluation metrics, we invert and normalise the entropy to give a beat accuracy measure between 0 and 100%. We evaluate the performance of three published beat tracking algorithms and a human tapper over a large annotated test database using our entropy based measure and compare results to existing evaluation approaches.

2. APPROACH

Figure 1. Calculation of beat error $\zeta_{j,q}$ from beat annotations $a_j$ and beat locations $b_i$.

To pursue an objective approach to beat tracking, a sequence of ground truth beat annotations is required against which the output of a beat tracker can be compared. The first step in our evaluation method is to find the error between the beats and the annotations. For each ground truth annotation $a_j$, we could measure the distance to the nearest beat $b_i$ however this would limit us to analysis of the annotated metrical level [8, 6]. To allow
us to measure each metrical level simultaneously we partition the input signal into beat length segments, each centred on a ground truth annotation, and extract all beats $\gamma_q$ that fall within this range,

\[ \gamma_q = b_i : a_j - \Delta_{j-1} \le b_i < a_j + \Delta_j \tag{1} \]

where $\Delta_{j-1} = \frac{a_j - a_{j-1}}{2}$ and $\Delta_j = \frac{a_{j+1} - a_j}{2}$ represent the boundaries of the beat length segment around $a_j$. We now find the beat error $\zeta_{j,q}$ as the distance between each $\gamma_q$ and $a_j$, normalised to the current IAI. The furthest any beat $\gamma_q$ can be from the nearest annotation is bounded between -50% and 50% of a beat as shown in figure 1,

\[ \zeta_{j,q} = \begin{cases} \dfrac{\gamma_q - a_j}{2\Delta_{j-1}} & \gamma_q \le a_j \\ \dfrac{\gamma_q - a_j}{2\Delta_j} & \gamma_q > a_j \\ U & \text{no beats } \gamma_q. \end{cases} \tag{2} \]

If beats are tapped at a metrical level above the annotated locations, e.g. every other beat is close to an annotation, then there will be instances where no beats $\gamma_q$ occur within each annotation centred beat segment. To overcome this missing data, we assign the error value $\zeta_{j,q} = U$, where $U$ is a uniformly distributed random variable in the range (-0.5, 0.5), equivalent to a beat error between -50% and 50%.

To provide a visual insight into the performance of a beat tracking system, we can formulate a histogram of beat error. If we believe that the beats are accurate (i.e. close to the annotations) then we can expect to observe a histogram that resembles a delta function, with a strong peak at 0% error. In the worst case, where beats are randomly distributed, and bear no relevance to the annotations, we can expect a wide, uniform-like distribution.

Figure 2 shows four example beat error histograms, those of a human tapper and three beat tracking algorithms, generated from the test database used in section 3. In addition to a strong peak centred at 0% error, we can also observe significant peaks at ±50% of a beat. These outer peaks occur either as a result of tapping consistently on the off-beat (in anti-phase to the annotations) or tapping at twice the annotated rate, where half of the beats will be close to 0% error and the remaining beats split between +50% and -50% error.

When looking to extract a quantitative measure of beat accuracy from a beat error histogram, we might first consider the variance, expecting it to be inversely proportional to beat accuracy. However the outer peaks of the distribution, which must be considered at least partially correct, would distort the variance, whereas erroneous beats that were closer to 0% error would not. As an alternative we extract the information theoretic measure of entropy [11] from the beat error histogram. In contrast to the variance, the entropy gives a measure of the peakiness of the distribution, therefore rewarding beats that are consistently related to the ground truth annotations, but is blind to the metrical level or phase shift at which the beats occur. We calculate the entropy as

\[ H = \sum_{k=1}^{K} x_k \log \frac{1}{x_k} \tag{3} \]

where there are $K$ bins and $x_k$ are the heights of the bins. The histogram bins are normalised such that $\sum_{k=1}^{K} x_k = 1$ and, to maintain a real-valued output, $x_k > 0$ for all $k$. As shown in [11] the entropy is non-negative and bounded between 0 in the best case, where beats exactly equal the annotations, and $\log(K)$ for the uniform case. We can invert and normalise the entropy value to give a measure of beat accuracy $\hat{H}$ between 0 and 100%,

\[ \hat{H} = \frac{H - \log(K)}{\log\left(\frac{1}{K}\right)} \times 100\%. \tag{4} \]

The normalisation removes any dependency on the number of bins used in the beat error histogram, however we empirically set $K = 40$. To give an overall measure of accuracy for a test database of multiple files we have two options. First, we can calculate the normalised entropy $\hat{H}_n$ for each file $n$ in the database, and find the Mean entropy $\bar{H}_n$ as

\[ \bar{H}_n = \frac{1}{N} \sum_{n=1}^{N} \hat{H}_n. \tag{5} \]

Alternatively, we can form a single histogram of beat error for all files in the database (as shown in figure 2) with the entropy calculated once using eqn. (3) but replacing $x_k$ with the mean bin height $\bar{x}_k$ over all $N$ files,

\[ \bar{x}_k = \frac{1}{N} \sum_{n=1}^{N} x_{n,k}. \tag{6} \]

After which we can normalise the entropy using eqn. (4) to give a measure of Global entropy $\hat{\bar{H}}_n$.

3. RESULTS

We include results for our evaluation measures over a beat annotated test database containing 222 files, each one minute in length over six musical genres: Dance, Rock, Jazz, Classical, Folk and Choral. For further details see [7, 6]. For comparison against our entropy based approaches, we include the evaluation method used by Dixon [1] as well as the continuity based method of Klapuri et al [5] also used in [6].
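The beat error calculation of section 2 (eqns. (1) and (2)) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `beat_errors`, the fixed random seed and the decision to skip the first and last annotations (which lack a neighbour on one side) are all assumptions made for the example.

```python
import random


def beat_errors(annotations, beats, rng=None):
    """Normalised beat errors following eqns. (1) and (2).

    annotations, beats: sorted lists of times in seconds.
    Returns one error per beat found in each annotation-centred segment,
    or one random value for each segment that contains no beats.
    """
    rng = rng or random.Random(0)  # fixed seed: an assumption, for repeatability
    errors = []
    # Assumption: only annotations with a neighbour on both sides are used,
    # because the segment boundaries need a_{j-1} and a_{j+1}.
    for j in range(1, len(annotations) - 1):
        a_prev, a_j, a_next = annotations[j - 1], annotations[j], annotations[j + 1]
        lower = a_j - (a_j - a_prev) / 2.0  # a_j - Delta_{j-1}
        upper = a_j + (a_next - a_j) / 2.0  # a_j + Delta_j
        segment_beats = [b for b in beats if lower <= b < upper]
        if not segment_beats:
            # Missing data: substitute U drawn uniformly between -0.5 and 0.5.
            errors.append(rng.uniform(-0.5, 0.5))
            continue
        for b in segment_beats:
            # Normalise by the inter-annotation-interval on the beat's side.
            iai = (a_j - a_prev) if b <= a_j else (a_next - a_j)
            errors.append((b - a_j) / iai)
    return errors


# Hypothetical data: annotations one second apart.
annotations = [0.0, 1.0, 2.0, 3.0, 4.0]
on_beat = beat_errors(annotations, [1.0, 2.1, 2.9])
double_time = beat_errors(annotations, [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
print(on_beat)
print(double_time)
```

In the double-time example every second tap falls exactly half way between two annotations, so it lands at the edge of a segment with an error of -50%, which corresponds to the outer histogram peaks described in section 2.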
[Figure 2 image: four beat error histograms in a 2 x 2 grid, titled Human Data (top left), Dixon (top right), DP (bottom left) and KEA (bottom right). Horizontal axis: Normalised beat error (beats), from -50% to 50%. Vertical axis ticks from 0 to 3000.]

Figure 2. Beat error histograms. Clockwise from Top Right: Dixon [1], Klapuri et al [5] labelled KEA, Davies and Plumbley [6] labelled DP, and a Human tapping performance.
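The entropy measures of section 2 (eqns. (3) to (6)) can be sketched as follows. This is an illustrative reading of the equations under stated assumptions, not the authors' code: the function names are hypothetical, the bins are assumed to be equal-width over [-0.5, 0.5], and empty bins are assumed to contribute zero to the sum in eqn. (3), whereas the text requires every bin height to be positive.

```python
import math

K = 40  # number of histogram bins, the value the paper sets empirically


def histogram(errors, k=K):
    """Normalised k-bin histogram of beat errors lying in [-0.5, 0.5]."""
    if not errors:
        raise ValueError("at least one beat error is required")
    counts = [0] * k
    for e in errors:
        # Assumption: equal-width bins, with both edges clamped into range.
        idx = min(max(int((e + 0.5) * k), 0), k - 1)
        counts[idx] += 1
    total = float(sum(counts))
    return [c / total for c in counts]


def entropy(x):
    """Eqn. (3); empty bins are skipped (assumed to contribute zero)."""
    return sum(p * math.log(1.0 / p) for p in x if p > 0.0)


def accuracy(x):
    """Eqn. (4): inverted, normalised entropy as a percentage."""
    k = len(x)
    # log(1/K) is written as -log(K) here.
    return (entropy(x) - math.log(k)) / -math.log(k) * 100.0


def mean_accuracy(histograms):
    """Eqn. (5): average of the per-file normalised entropies."""
    return sum(accuracy(x) for x in histograms) / len(histograms)


def global_accuracy(histograms):
    """Eqn. (6) followed by eqn. (4): accuracy of the mean histogram."""
    n = len(histograms)
    k = len(histograms[0])
    mean_hist = [sum(h[i] for h in histograms) / n for i in range(k)]
    return accuracy(mean_hist)


# Hypothetical example: two files, each perfectly consistent but in different bins.
files = [histogram([0.0, 0.0, 0.0]), histogram([-0.5, -0.5])]
print(mean_accuracy(files), global_accuracy(files))
```

In this example each file on its own has zero entropy, so the Mean value is 100%, while the averaged histogram is spread over two bins and the Global value is lower. The two summaries can therefore rank systems differently.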


Dixon's [1] measure of beat accuracy is calculated as follows

\[ \mathrm{Dixon}_{acc} = \frac{\mathrm{hits}}{\mathrm{hits} + F^{+} + F^{-}} \times 100\% \tag{7} \]

where a `hit' is defined as a beat that occurs within ±70ms of an annotated beat location. $F^{+}$ refers to the number of false positives, i.e. the number of beats which are not matched to any annotation, and $F^{-}$ refers to the number of unmatched annotations, or false negatives.

The continuity based beat evaluation approach [5, 6] differs from Dixon's [1] method as it requires beats to be continually correct. Beats must be within 17.5% of the nearest annotation and the local tempo must not differ by more than 17.5%. In all, four measures of beat accuracy can be calculated. These allow for continuous tracking at the correct metrical level (CML cont.) and the total number of correct beats at the correct metrical level (CML total). Both measures can then be recalculated allowing for the beats to be tapped at twice or half the annotated level, referred to as the allowed metrical levels, giving AML cont. and AML total. We retain only the strictest, CML cont., and the least strict, AML total. A more detailed description may be found in [6].

Results for our entropy based approach and those described above are given in Table 1 illustrating the performance of a human tapper and three published beat tracking algorithms: Dixon [1], Davies and Plumbley [6] (labelled DP) and Klapuri et al [5] (labelled KEA).

4. DISCUSSION

The first observation about the results is that the entropy based measures appear much stricter than the other evaluation methods. The principal reason for this is that 100% accuracy is not possible unless the beats are equal to the annotations, whereas with the Dixon [1] and continuity based approaches [6] perfect tracking is possible if beats are consistently within the allowance windows. We should also note that the relative ordering of the algorithms is not consistent across all evaluation measures. The Mean entropy and CML cont. place the human tapper as less accurate than KEA and DP, although for the remaining measures the human is most accurate. Under all measures the Dixon [1] approach is weakest. Intuitively we should expect the human tapper to be the highest performing, and as shown in [6] the human is more successful in finding the correct metrical level and the on-beat than the algorithmic approaches, but beat localisation is often poorer. We can confirm this by inspection of the beat error histograms in figure 2. Very few beats occur at ±50% for the human but the main peak, centred on 0% error, is wider than that of KEA and DP. In order for the Mean entropy for the human to be lower than KEA and DP, we infer that, for the algorithms, a greater proportion of very accurate cases with high normalised entropy $\hat{H}_n$ outweighs the more consistent, but generally less accurate, human performance. This suggests that the human beat error histogram is a more realistic representation of the performance for a given file, but the algorithmic histograms will vary from inaccurate uniform-like up to accurate delta-like distributions. We therefore believe the Global entropy measure to be more informative than the Mean entropy.

To further inspect the differences between the evaluation measures, we analysed each evaluation method with artificial input data. To create the artificial data we took the ground truth beat annotations and gradually degraded the performance, perturbing beats with uniform distributions of increasing width. Figure 3 shows the effect of increasing the width of the uniform distribution from a relative error of 2.5% of a beat, in 2.5% steps, up to totally random with 100% beat error. We can observe that all but the entropy approaches remain 100% accurate even for beats perturbed by a uniform distribution of width 20% (i.e. ±10% around each annotation). We also note that none of the approaches are linear. Therefore observing an accuracy of 80% for algorithm A would not be twice as good as algorithm B at 40%, as the numbers alone might suggest. It can be shown that the entropy of an exactly uniform distribution with a fixed number of bins is equal to log(n), where there are n bins of height 1/n. Although the Global entropy curve is not linear, it is at least log-linear, so some meaningful relative comparison can be made between competing algorithms. The Mean entropy curve is always greater than or equal to the Global entropy curve. The two can only be equal when every file has the same beat error distribution, for example when each is perfectly uniform, as any file-specific irregularities in bin height will reduce the entropy of that file and therefore increase its normalised beat accuracy.

A current limitation of our approach is that the entropy calculation is shift-invariant with respect to the ordering of the his-
togram bins. Although unlikely, beats consistently 10% ahead of the beat would be considered just as accurate as beats centred on 0%, even though the beats themselves would be perceptually out of time with the input. We intend to investigate this further, possibly incorporating the mean of the distribution into the beat accuracy calculation.

Beat Tracker   Mean Entropy (%)   Global Entropy (%)   CML Cont. (%)   AML Total (%)   Dixon Acc. (%)
Human          33.7               20.1                 52.8            87.7            77.2
DP [6]         36.8               14.2                 54.8            78.7            61.5
KEA [5]        38.3               15.5                 55.8            80.1            64.6
Dixon [1]      23.2                5.5                 21.9            52.0            35.4

Table 1. Beat accuracy results. The performance of a human tapper and three published algorithms is compared over our two proposed entropy measures, two continuity based measures from Davies and Plumbley [6] and Dixon's [1] approach.

[Figure 3 image: plot titled "Beat Accuracy for Noisy Beat Annotations". Horizontal axis: Width of Uniform Distribution (Beats), from 0.2 to 1. Vertical axis: Accuracy (%), from 0 to 100. Curves: CML cont, AML total, Dixon Acc, Mean Beat entropy, Global Beat entropy.]

Figure 3. Comparison of beat tracking accuracy for beat annotations perturbed by a white noise process.

5. CONCLUSIONS

We have presented an entropy based method for the evaluation of beat tracking systems. We extract the entropy from a distribution of normalised beat error, and scale the calculated value to indicate beat accuracy between 0 and 100%. We have demonstrated that the beat error histogram is a useful visualisation of beat tracking performance, within which the simultaneous analysis of multiple metrical levels is possible, a feature not present in other approaches to beat tracking evaluation. Although beat accuracy from our entropy measure suggests poorer performance than published approaches, it is an analytically meaningful statistic which is not reliant on pre-defined parametric thresholds. As part of our future work we intend to further investigate the validity of using entropy to evaluate beat trackers, and plan to conduct listening tests to infer which evaluation methods are most perceptually meaningful.

6. ACKNOWLEDGEMENTS

MEPD is supported by a college studentship from Queen Mary University of London. This research has been partially funded by EPSRC grants GR/S75802/01 and GR/S82213/01.

7. REFERENCES

[1] S. Dixon, "Automatic extraction of tempo and beat from expressive performances," Journal of New Music Research, vol. 30, pp. 39-58, 2001.
[2] J. P. Bello and J. Pickens, "A robust mid-level representation for harmonic content in music signals," in Proceedings of 6th International Conference on Music Information Retrieval, London, United Kingdom, 2005, pp. 304-311.
[3] M. Levy, M. Sandler, and M. Casey, "Extraction of high level musical structure from audio data and its application to thumbnail generation," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2006.
[4] S. Dixon, F. Gouyon, and G. Widmer, "Towards characterisation of music via rhythmic patterns," in Proceedings of 5th International Conference on Music Information Retrieval, Barcelona, Spain, 2004, pp. 509-517.
[5] A. P. Klapuri, A. Eronen, and J. Astola, "Analysis of the meter of acoustic musical signals," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 1, pp. 342-355, 2006.
[6] M. E. P. Davies and M. D. Plumbley, "Context-dependent beat tracking of musical audio," Tech. Rep., Queen Mary University of London, Centre for Digital Music, 2006, http://www.elec.qmul.ac.uk/people/markp/2006/C4DM-TR-06-02.pdf.
[7] S. Hainsworth, Techniques for the Automated Analysis of Musical Audio, Ph.D. thesis, Department of Engineering, Cambridge University, 2004.
[8] A. T. Cemgil, B. Kappen, P. Desain, and H. Honing, "On tempo tracking: Tempogram representation and Kalman filtering," Journal Of New Music Research, vol. 29, no. 4, pp. 259-273, 2000.
[9] M. Goto and Y. Muraoka, "Issues in evaluating beat tracking systems," in Working Notes of the IJCAI-97 Workshop on Issues in AI and Music - Evaluation and Assessment, 1997, pp. 9-16.
[10] J. London, Hearing in Time: Psychological Aspects of Musical Meter, Oxford University Press, 2004.
[11] M. D. Plumbley, "On information theory and unsupervised neural networks," Tech. Rep., Cambridge University, Department of Engineering, 1991.