ENTROPY BASED BEAT TRACKING EVALUATION
Matthew E. P. Davies and Mark D. Plumbley
Centre for Digital Music
Queen Mary, University of London
Email: matthew.davies@elec.qmul.ac.uk

ABSTRACT

In this paper, we present a novel approach to beat tracking evaluation, based on finding the error between automatically generated beat locations and ground truth annotations. The error is normalised to the current inter-annotation-interval, such that the greatest observable error can be ±50% of a beat. We form a histogram of normalised beat error, from which we estimate the entropy as a measure of beat tracking performance, where low entropy indicates accurate beat locations, with the converse true for high entropy. We evaluate the performance of a human tapper in conjunction with three published beat tracking algorithms over an annotated test database and compare the results of our entropy based approach to existing evaluation methods.

Keywords: Beat tracking, evaluation, entropy

1. INTRODUCTION

The task of beat tracking of musical audio is well known and conceptually quite simple; the aim being to replicate the human ability of foot-tapping in time to a piece of music [1]. Recently, as techniques have improved, automatically extracted beat times have increasingly been used as a musically meaningful temporal segmentation for higher level analysis, for example in chord estimation [2], structural segmentation and thumb-nailing [3] and rhythmic pattern classification [4].

While many approaches exist for beat tracking (e.g. [1, 5, 6]), an important related area which has received less attention is that of evaluation: the problem of defining a suitable measure of beat accuracy. Given an agreed means of evaluation and publicly available annotated test databases (containing hand labelled beat locations), the ability to reliably compare the performance of different algorithms becomes much simpler. However, at present, there are no freely available databases, nor an agreed methodology for beat tracking evaluation. As a result, researchers use private databases [7, 5] and evaluate beat accuracy using their own metrics (e.g. [1, 5, 8, 9]), making comparative studies much harder to undertake.

For a given piece of music, where generated beat locations are compared to ground truth beat annotations, most existing methods (e.g. [1, 5, 9]) classify individual beats as correct or incorrect based on whether they occur within a pre-defined allowance window around a particular ground truth annotation. The overall accuracy for the piece is then given as the mean of the accuracy of each beat. Similarly, the mean of each piece is taken to indicate the accuracy over a test database of many pieces.

Klapuri et al. [5] adopt a continuity based approach, where a given beat is only accurate if it falls within a specified allowance window and the same is true of the previous beat. Accuracy is calculated in two ways, first as the ratio of the longest continuously correct segment to the length of the input and then as the total number of correct beats. To reflect the metrical ambiguity of human subjects tapping in time to music, both measures are re-evaluated to allow for beats to occur at double or half the annotated metrical level. This leads to multiple measures of beat accuracy, which can be harder to interpret than a single all-encompassing value of accuracy.

The difficulties with beat evaluation are not limited to generating multiple measures of accuracy. The use of thresholds in defining allowance windows around ground truth annotations can also be problematic. Dixon [1] defines a fixed window of ±70ms around each annotation, in which a beat must fall for it to be accurate. For songs with a fast tempo the ±70ms window will cover a greater proportion of the current inter-annotation-interval (IAI), the time between successive beat annotations, than for slower songs, and will therefore be biased towards music with a faster tempo. This bias can be removed by defining a tempo-dependent threshold, where beats are accurate if they fall within 17.5% of the current IAI, as in [5, 6, 9]. However, this does not address a more inherent problem of using thresholds: a beat at the edge of an allowance window would be deemed accurate, but the same beat would be inaccurate if it occurred 1ms later, even though human listeners are incapable of discriminating between events at this time scale [10, p.29].

The motivation for our approach to beat tracking evaluation is to provide a single measure of beat accuracy that is easy to interpret and does not rely on a fixed threshold. We achieve this, not by classifying individual beat accuracy, but by measuring the beat error. We extract the time between each beat and the nearest ground truth annotation and normalise it by the current IAI (to give a maximum error of ±50% of a beat). We then formulate a histogram of normalised beat error and estimate its entropy. High entropy will result from a uniform distribution, where beats are randomly distributed across the musical piece and provide no information about the locations of the ground truth annotations. Conversely, an entropy of zero will arise when all beats are exactly equal to all annotations. To permit comparisons against other evaluation metrics, we invert and normalise the entropy to give a beat accuracy measure between 0 and 100%. We evaluate the performance of three published beat tracking algorithms and a human tapper over a large annotated test database using our entropy based measure and compare results to existing evaluation approaches.

2. APPROACH

To pursue an objective approach to beat tracking, a sequence of ground truth beat annotations is required against which the output of a beat tracker can be compared. The first step in our evaluation method is to find the error between the beats and the annotations. For each ground truth annotation $a_j$, we could measure the distance to the nearest beat $\gamma_b$; however, this would limit us to analysis of the annotated metrical level [8, 6]. To allow us to measure each metrical level simultaneously we partition the input signal into beat length segments, each centred on a ground truth annotation, and extract all beats $\gamma_q$ that fall within this range,

$$
\gamma_q = \gamma_b : a_j - \Delta_{j-1} \le \gamma_b < a_j + \Delta_j \tag{1}
$$

where $\Delta_{j-1} = \frac{a_j - a_{j-1}}{2}$ and $\Delta_j = \frac{a_{j+1} - a_j}{2}$ represent the boundaries of the beat length segment around $a_j$. We now find the beat error $\zeta_{j,q}$ as the distance between each $\gamma_q$ and $a_j$, normalised to the current IAI. The furthest any beat $\gamma_q$ can be from the nearest annotation is bounded between -50% and 50% of a beat as shown in figure 1,

$$
\zeta_{j,q} =
\begin{cases}
\dfrac{\gamma_q - a_j}{2\Delta_{j-1}} & \gamma_q \le a_j \\
\dfrac{\gamma_q - a_j}{2\Delta_j} & \gamma_q > a_j \\
U & \text{no beats } \gamma_q
\end{cases}
\tag{2}
$$

where the denominators $2\Delta_{j-1} = a_j - a_{j-1}$ and $2\Delta_j = a_{j+1} - a_j$ are the IAIs on either side of $a_j$.

Figure 1. Calculation of beat error $\zeta_{j,q}$ from beat annotations $a_j$ and beat locations $\gamma_b$.

If beats are tapped at a metrical level above the annotated locations, e.g. every other beat is close to an annotation, then there will be instances where no beats $\gamma_q$ occur within an annotation centred beat segment. To overcome this missing data, we assign the error value $\zeta_{j,q} = U$, where $U$ is a uniformly distributed random variable in the range (-0.5, 0.5), equivalent to a beat error between -50% and 50%.

To provide a visual insight into the performance of a beat tracking system, we can formulate a histogram of beat error. If we believe that the beats are accurate (i.e. close to the annotations) then we can expect to observe a histogram that resembles a delta function, with a strong peak at 0% error. In the worst case, where beats are randomly distributed, and bear no relevance to the annotations, we can expect a wide, uniform-like distribution.

Figure 2 shows four example beat error histograms, those of a human tapper and three beat tracking algorithms, generated from the test database used in section 3. In addition to a strong peak centred at 0% error, we can also observe significant peaks at ±50% of a beat. These outer peaks occur either as a result of tapping consistently on the off-beat (in anti-phase to the annotations) or tapping at twice the annotated rate, where half of the beats will be close to 0% error and the remaining beats split between +50% and -50% error.

When looking to extract a quantitative measure of beat accuracy from a beat error histogram, we might first consider the variance, expecting it to be inversely proportional to beat accuracy. However, the outer peaks of the distribution, which must be considered at least partially correct, would distort the variance, whereas erroneous beats that were closer to 0% error would not. As an alternative we extract the information theoretic measure of entropy [11] from the beat error histogram. In contrast to the variance, the entropy gives a measure of the peakiness of the distribution, therefore rewarding beats that are consistently related to the ground truth annotations, but is blind to the metrical level or phase shift at which the beats occur. We calculate the entropy as

$$
H = \sum_{k=1}^{K} x_k \log\left(\frac{1}{x_k}\right) \tag{3}
$$

where there are $K$ bins and $x_k$ are the heights of the bins. The histogram bins are normalised such that $\sum_{k=1}^{K} x_k = 1$ and, to maintain a real-valued output, $x_k > 0$ for all $k$. As shown in [11], the entropy is non-negative and bounded between 0 in the best case, where beats exactly equal the annotations, and $\log(K)$ for the uniform case. We can invert and normalise the entropy value to give a measure of beat accuracy $\hat{H}$ between 0 and 100%,

$$
\hat{H} = \frac{H - \log(K)}{\log\left(\frac{1}{K}\right)} \times 100\%. \tag{4}
$$

The normalisation removes any dependency on the number of bins used in the beat error histogram; however, we empirically set $K = 40$. To give an overall measure of accuracy for a test database of multiple files we have two options. First, we can calculate the normalised entropy $\hat{H}_n$ for each file $n$ in the database, and find the Mean entropy $\bar{H}_n$ as

$$
\bar{H}_n = \frac{1}{N} \sum_{n=1}^{N} \hat{H}_n. \tag{5}
$$

Alternatively, we can form a single histogram of beat error for all files in the database (as shown in figure 2) with the entropy calculated once using eqn. (3) but replacing $x_k$ with the mean bin height $\bar{x}_k$ over all $N$ files,

$$
\bar{x}_k = \frac{1}{N} \sum_{n=1}^{N} x_{n,k}. \tag{6}
$$

After which we can normalise the entropy using eqn. (4) to give a measure of Global entropy $\hat{\bar{H}}_n$.

3. RESULTS

We include results for our evaluation measures over a beat annotated test database containing 222 files, each one minute in length, over six musical genres: Dance, Rock, Jazz, Classical, Folk and Choral. For further details see [7, 6]. For comparison against our entropy based approaches, we include the evaluation method used by Dixon [1] as well as the continuity based method of Klapuri et al. [5] also used in [6].
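As a concrete illustration of the calculation in section 2, the following is a minimal Python sketch of eqns. (1) to (4), not the authors' implementation. It assumes NumPy, and the function names `beat_errors` and `entropy_accuracy` are hypothetical. Two further assumptions are made where the text is silent: the first and last annotations are skipped because they lack a neighbour on one side, and empty histogram bins are taken to contribute zero to the entropy.

```python
import numpy as np

def beat_errors(annotations, beats, rng=None):
    """Normalised beat errors (fractions of a beat), following eqns. (1) and (2)."""
    rng = np.random.default_rng(0) if rng is None else rng
    a = np.asarray(annotations, dtype=float)
    b = np.asarray(beats, dtype=float)
    errors = []
    # Assumption: only interior annotations are used, since the segment
    # boundaries need both the previous and the next annotation.
    for j in range(1, len(a) - 1):
        prev_iai = a[j] - a[j - 1]   # IAI before a_j
        next_iai = a[j + 1] - a[j]   # IAI after a_j
        lo = a[j] - prev_iai / 2.0   # lower boundary of the beat length segment
        hi = a[j] + next_iai / 2.0   # upper boundary of the beat length segment
        in_segment = b[(b >= lo) & (b < hi)]
        if in_segment.size == 0:
            # No beat in this segment: draw a uniform random error instead.
            errors.append(rng.uniform(-0.5, 0.5))
            continue
        for g in in_segment:
            iai = prev_iai if g <= a[j] else next_iai
            errors.append((g - a[j]) / iai)
    return np.array(errors)

def entropy_accuracy(errors, K=40):
    """Inverted, normalised histogram entropy in percent, following eqns. (3) and (4)."""
    errors = np.clip(errors, -0.5, 0.5)
    counts, _ = np.histogram(errors, bins=K, range=(-0.5, 0.5))
    x = counts / counts.sum()        # bin heights summing to 1
    nz = x[x > 0]                    # assumption: empty bins contribute 0
    H = float(np.sum(nz * np.log(1.0 / nz)))
    return 100.0 * (H - np.log(K)) / np.log(1.0 / K)

if __name__ == "__main__":
    annotations = np.arange(0.0, 10.0, 0.5)   # synthetic annotations, 120 bpm
    print(entropy_accuracy(beat_errors(annotations, annotations)))
    print(entropy_accuracy(beat_errors(annotations, annotations[::2])))
```

In this sketch, beats that coincide with the annotations all fall in a single bin and give an accuracy of 100%, while tapping at half the annotated rate lowers the score because every other segment receives a random error.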
[Figure 2 plots: four histogram panels titled Human Data, Dixon, DP and KEA. Horizontal axis: Normalised beat error (beats), from -50% to 50%. Vertical axis: bin count, from 0 to 3000.]
Figure 2. Beat error histograms. Clockwise from Top Right: Dixon [1], Klapuri et al. [5] labelled KEA, Davies and Plumbley [6] labelled DP, and a Human tapping performance.
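The difference between the Mean entropy of eqn. (5) and the Global entropy of eqn. (6), which corresponds to database-wide histograms like those in figure 2, can be sketched as follows. This is an illustrative Python example under stated assumptions rather than the authors' code: the function names are hypothetical, each file's histogram is assumed to be normalised to unit sum before averaging, and empty bins are assumed to contribute zero to the entropy.

```python
import numpy as np

def normalised_entropy(x):
    """Eqn. (4) applied to a vector of bin heights x that sums to 1."""
    K = len(x)
    nz = x[x > 0]                    # assumption: empty bins contribute 0
    H = np.sum(nz * np.log(1.0 / nz))
    return float(100.0 * (H - np.log(K)) / np.log(1.0 / K))

def mean_and_global_entropy(errors_per_file, K=40):
    """Return (Mean entropy, Global entropy) in percent for a list of error arrays."""
    hists = []
    for errors in errors_per_file:
        counts, _ = np.histogram(errors, bins=K, range=(-0.5, 0.5))
        hists.append(counts / counts.sum())   # per-file bin heights x_{n,k}
    hists = np.array(hists)
    # Eqn. (5): average the per-file normalised entropies.
    mean_acc = float(np.mean([normalised_entropy(x) for x in hists]))
    # Eqn. (6): average the bin heights over files, then apply eqns. (3) and (4) once.
    global_acc = normalised_entropy(hists.mean(axis=0))
    return mean_acc, global_acc

if __name__ == "__main__":
    # Two synthetic files, each perfectly consistent but at different offsets.
    file_a = np.full(100, 0.11)
    file_b = np.full(100, -0.31)
    print(mean_and_global_entropy([file_a, file_b]))
```

In this toy case each file scores 100% on its own, so the Mean entropy is 100%, whereas the pooled histogram has two equal peaks and the Global entropy is lower. The Global measure therefore penalises inconsistency across files that the Mean measure does not see.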
Dixon's [1] measure of beat accuracy is calculated as follows

$$
\mathrm{Dixon}_{acc} = \frac{\mathrm{hits}}{\mathrm{hits} + F^{+} + F^{-}} \times 100\% \tag{7}
$$

where a 'hit' is defined as a beat that occurs within ±70ms of an annotated beat location. $F^{+}$ refers to the number of false positives, i.e. the number of beats which are not matched to any annotation, and $F^{-}$ refers to the number of unmatched annotations, or false negatives.

The continuity based beat evaluation approach [5, 6] differs from Dixon's [1] method as it requires beats to be continually correct. Beats must be within 17.5% of the nearest annotation and the local tempo must not differ by more than 17.5%. In all, four measures of beat accuracy can be calculated. These allow for continuous tracking at the correct metrical level (CML cont.) and the total number of correct beats at the correct metrical level (CML total). Both measures can then be recalculated allowing for the beats to be tapped at twice or half the annotated level, referred to as the allowed metrical levels, giving AML cont. and AML total. We retain only the strictest, CML cont., and the least strict, AML total. A more detailed description may be found in [6].

Results for our entropy based approach and those described above are given in Table 1, illustrating the performance of a human tapper and three published beat tracking algorithms: Dixon [1], Davies and Plumbley [6] (labelled DP) and Klapuri et al. [5] (labelled KEA).

Beat Tracker   Mean Entropy (%)   Global Entropy (%)   CML Cont. (%)   AML Total (%)   Dixon Acc. (%)
Human          33.7               20.1                 52.8            87.7            77.2
DP [6]         36.8               14.2                 54.8            78.7            61.5
KEA [5]        38.3               15.5                 55.8            80.1            64.6
Dixon [1]      23.2               5.5                  21.9            52.0            35.4

Table 1. Beat accuracy results. The performance of a human tapper and three published algorithms is compared over our two proposed entropy measures, two continuity based measures from Davies and Plumbley [6] and Dixon's [1] approach.

4. DISCUSSION

The first observation about the results is that the entropy based measures appear much stricter than the other evaluation methods. The principal reason for this is that 100% accuracy is not possible unless the beats are equal to the annotations, whereas with the Dixon [1] and continuity based approaches [6] perfect tracking is possible if beats are consistently within the allowance windows. We should also note that the relative ordering of the algorithms is not consistent across all evaluation measures. The Mean entropy and CML cont. place the human tapper as less accurate than KEA and DP, although for the remaining measures the human is most accurate. Under all measures the Dixon [1] approach is weakest. Intuitively we should expect the human tapper to be the highest performing, and as shown in [6] the human is more successful in finding the correct metrical level and the on-beat than the algorithmic approaches, but beat localisation is often poorer. We can confirm this by inspection of the beat error histograms in figure 2. Very few beats occur at ±50% for the human but the main peak, centred on 0% error, is wider than that of KEA and DP. For the Mean entropy of the human to be lower than that of KEA and DP, we infer that, for the algorithms, a greater proportion of very accurate cases with high normalised entropy $\hat{H}_n$ outweighs the more consistent, but generally less accurate, human performance. This suggests that the human beat error histogram is a more realistic representation of the performance for a given file, but the algorithmic histograms will vary from inaccurate uniform-like up to accurate delta-like distributions. We therefore believe the Global entropy measure to be more informative than the Mean entropy.

To further inspect the differences between the evaluation measures, we analysed each evaluation method with artificial input data. To create the artificial data we took the ground truth beat annotations and gradually degraded the performance, perturbing beats with uniform distributions of increasing width. Figure 3 shows the effect of increasing the width of the uniform distribution from a relative error of 2.5% of a beat, in 2.5% steps, up to totally random with 100% beat error. We can observe that all but the entropy approaches remain 100% accurate even for beats perturbed by a uniform distribution of width 20% (i.e. ±10% around each annotation). We also note that none of the approaches are linear. Therefore observing an accuracy of 80% for algorithm A would not mean it is twice as good as algorithm B at 40%, as the numbers alone might suggest. It can be shown that the entropy of an exactly uniform distribution with a fixed number of bins is equal to $\log(n)$, where there are $n$ bins of height $1/n$. Although the Global entropy curve is not linear, it is at least log-linear, so some meaningful relative comparison can be made between competing algorithms. The Mean entropy curve is always greater than or equal to the Global entropy curve, as the two can only be equal when every beat error distribution is perfectly uniform, as any irregularities in bin height will reduce the entropy and therefore increase the normalised beat accuracy.

[Figure 3 plot, titled "Beat Accuracy for Noisy Beat Annotations". Horizontal axis: Width of Uniform Distribution (Beats), up to 1. Vertical axis: Accuracy (%), from 0 to 100. Curves: CML cont, AML total, Dixon Acc, Mean Beat entropy, Global Beat entropy.]
Figure 3. Comparison of beat tracking accuracy for beat annotations perturbed by a white noise process.

A current limitation of our approach is that the entropy calculation is shift-invariant with respect to the ordering of the histogram bins. Although unlikely, beats consistently 10% ahead of the beat would be considered just as accurate as beats centred on 0%, even though the beats themselves would be perceptually out of time with the input. We intend to investigate this further, possibly incorporating the mean of the distribution into the beat accuracy calculation.

5. CONCLUSIONS

We have presented an entropy based method for the evaluation of beat tracking systems. We extract the entropy from a distribution of normalised beat error, and scale the calculated value to indicate beat accuracy between 0 and 100%. We have demonstrated that the beat error histogram is a useful visualisation of beat tracking performance, within which the simultaneous analysis of multiple metrical levels is possible, a feature not present in other approaches to beat tracking evaluation. Although beat accuracy from our entropy measure suggests poorer performance than published approaches, it is an analytically meaningful statistic which is not reliant on pre-defined parametric thresholds. As part of our future work we intend to further investigate the validity of using entropy to evaluate beat trackers, and plan to conduct listening tests to infer which evaluation methods are most perceptually meaningful.

6. ACKNOWLEDGEMENTS

MEPD is supported by a college studentship from Queen Mary University of London. This research has been partially funded by EPSRC grants GR/S75802/01 and GR/S82213/01.

7. REFERENCES

[1] S. Dixon, "Automatic extraction of tempo and beat from expressive performances," Journal of New Music Research, vol. 30, pp. 39–58, 2001.
[2] J. P. Bello and J. Pickens, "A robust mid-level representation for harmonic content in music signals," in Proceedings of 6th International Conference on Music Information Retrieval, London, United Kingdom, 2005, pp. 304–311.
[3] M. Levy, M. Sandler, and M. Casey, "Extraction of high level musical structure from audio data and its application to thumbnail generation," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2006.
[4] S. Dixon, F. Gouyon, and G. Widmer, "Towards characterisation of music via rhythmic patterns," in Proceedings of 5th International Conference on Music Information Retrieval, Barcelona, Spain, 2004, pp. 509–517.
[5] A. P. Klapuri, A. Eronen, and J. Astola, "Analysis of the meter of acoustic musical signals," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 1, pp. 342–355, 2006.
[6] M. E. P. Davies and M. D. Plumbley, "Context-dependent beat tracking of musical audio," Tech. Rep., Queen Mary University of London, Centre for Digital Music, 2006, http://www.elec.qmul.ac.uk/people/markp/2006/C4DM-TR-06-02.pdf.
[7] S. Hainsworth, Techniques for the Automated Analysis of Musical Audio, Ph.D. thesis, Department of Engineering, Cambridge University, 2004.
[8] A. T. Cemgil, B. Kappen, P. Desain, and H. Honing, "On tempo tracking: Tempogram representation and Kalman filtering," Journal of New Music Research, vol. 29, no. 4, pp. 259–273, 2000.
[9] M. Goto and Y. Muraoka, "Issues in evaluating beat tracking systems," in Working Notes of the IJCAI-97 Workshop on Issues in AI and Music - Evaluation and Assessment, 1997, pp. 9–16.
[10] J. London, Hearing in Time: Psychological Aspects of Musical Meter, Oxford University Press, 2004.
[11] M. D. Plumbley, "On information theory and unsupervised neural networks," Tech. Rep., Cambridge University, Department of Engineering, 1991.