Tags: abstract models, elec, experimental results, introduction issue, london e1 4ns, manual input, mark levy, markov models, music track, musical structure, phrase length, probability distribution, qmul, queen mary university, queen mary university of london, real music, road london, segment lengths, segmentation, university of london,
NEW METHODS IN STRUCTURAL SEGMENTATION OF MUSICAL AUDIO
Mark Levy, Mark Sandler
Centre for Digital Music
Queen Mary, University of London
Mile End Road, London E1 4NS, UK
mark.levy@elec.qmul.ac.uk, mark.sandler@elec.qmul.ac.uk
ABSTRACT models (HMMs), where the hidden states correspond to seg-
We describe a simple model of musical structure and two ment types, are used for segmentation in [5], and a two-pass
related methods of extracting a high-level segmentation of a method, in which candidate segment boundaries are detected
music track from the audio data, including a novel use of hid- and the intervening segments then used to initialise an HMM,
den semi-Markov models. We introduce a semi-supervised is outlined in [6]. These approaches are much less restrictive,
segmentation process which finds musical structure with im- but unfortunately the chosen model implicitly defines a geo-
proved accuracy given some very limited manual input. We metric probability distribution for segment durations, which
give experimental results compared to existing methods and is not what we observe in real music, where segment lengths
human segmentations. are typically multiples of some basic phrase length, for ex-
ample 8, 16 or 32 bars, only very rarely taking other values.
In our own work [7] we have partially addressed the
1. INTRODUCTION issue of modelling expected segment lengths realistically
This paper introduces a simple model of musical structure, within a clustering framework, by including a term express-
and shows how it can be either formalised directly as a seg- ing the relative unlikelihood of segments being shorter than
ment model, or used to inform a clustering approach to au- an experimentally-determined minimum duration. In this pa-
dio segmentation. Although the model is very general, and per we introduce a method for estimating the underlying base
could be applied to symbolic data (notes, chords, etc.) drawn phrase length of a piece of music from audio data, and show
from score representations of music, our focus here is on its how this can either be passed to our clustering method, or
application directly to audio data, in order to extract high- used to initialise a different model for high-level segmenta-
level musical structure from audio tracks with at most a small tion, which allows us to specify a full probability distribution
amount of human intervention. Knowledge of this structure for segment lengths.
has immediate practical applications in the context of audio Although unsupervised approaches to segmentation have
or video editing, enabling the development of features such been shown to achieve results similar to human judgement
as `jump to start of next phrase', `align with current musical in some cases, research into parallel problems in image seg-
phrase', etc. It leads easily to the automatic extraction of mu- mentation [8] suggests that a small amount of supervision
sical summaries or `thumbnail' segments, for use in browsing may offer large gains in segmentation accuracy. In the case
and searching the large audio collections which are rapidly of musical audio segmentation, a major cause of fragility in
becoming commonplace through the popularity of download unsupervised methods is the difficulty of providing good ini-
sites, MP3 players and associated technologies. Compar- tial parameter values to models such as HMMs, whose train-
ison of segment models also offers possible new methods ing algorithms are well known to be susceptible to poor ini-
for content-based audio retrieval, similarity search and music tialisation [9]. We address this problem here with a simple
recommendation. semi-supervised approach to segmentation.
Early research into automatic segmentation of musical The organisation of the rest of this paper is as follows:
audio [1, 2] had some success in the partial extraction of section 2 describes the musical model and the audio fea-
high-level musical structure by a self-similarity search over tures used; section 3 describes how a segment-length distri-
spectral features to find repeated sections. This approach has bution is estimated; section 4 introduces the new segmen-
some drawbacks: only some sections are identified and la- tation methods; section 5 describes the semi-supervised ap-
belled, the choice of distance metrics and thresholds for sim- proach and illustrates output segmentations; section 6 evalu-
ilarity is somewhat ad hoc, and the search is computationally ates the performance of the methods in relation to previous
expensive, requiring the calculation of pairwise distances work and some manual segmentations; section 7 outlines fur-
between all analysis frames within a track. Self-similarity ther work.
search has been extended in [3, 4] to extract the structure
of complete pop tracks by incorporating information from 2. A MODEL OF MUSICAL STRUCTURE
beat-tracking and applying a set of heuristics to identify seg- 2.1 The nature of musical structure
ments as being one of a very small number of specific types
(instrumental, verse or chorus). These methods make some We can describe the structure of most music as follows. Ac-
highly limiting assumptions about the structure being sought tivity at the level of the beat, i.e. notes to be sung or played
that apply only to conventional pop music. Hidden Markov by particular instruments, is organised into regular phrases,
typically in Western music of 4 or 8 bars, where each bar con-
This research was supported by EPSRC grant GR/S84750/01 (Hierar- tains 3 or 4 beats. These phrases are concatenated, whether
chical Segmentation and Semantic Markup of Musical Signals). by the composer on manuscript paper, or the producer or mu-
sician in the recording studio, to form structural sections, or 40
segments, in accordance with the stylistic norms of the par-
ticular musical genre in question. A piece is then constructed 30
timbre type
from a sequence of segments of various types, in an order
20
again largely determined by its genre.
While this is clearly a weak account of the real act of 10
composition, it does suggest a straightforward formal model
of musical structure that we can reasonably expect to fit a 0
great many pieces of music, for example most popular, folk, 0 200 400 600 800 1000
world and much classical music. We assume further that we
have some way of labelling each beat or frame of the music in Figure 1: Sequence of timbre-types against manual segmen-
such a way that beats or frames which are musically similar tation. Note how related segments (shown in same back-
are assigned the same label. A piece can then be represented ground shade) contain similar sequences of states.
as a sequence of labels {yt },t = 1, 2, ..., T , and the task of
segmentation consists of assigning each of the yt to one of 1500
a set of segment-types Q = {q1 , q2 , ..., qM }, subject to suit-
able expectations, which we can express as conditional prob- 1000
B(l)
ability distributions, on the resulting segment durations and
the particular sequences of labels observed in each segment- 500
type.
0
0 10 20 30 40 50 60
2.2 The observation sequence lag l
Given an audio track, we aim to label each beat as belonging
to one of N possible timbre-types dividing the overall space Figure 2: Approximate autocorrelation of timbre-type se-
of timbre used in the track. We first estimate the beat using quence, showing first strong peak at 4-bar phrase length.
a beat-tracking algorithm [10] and then extract constant-Q
1
spectra at 8 -octave resolution, using a hop equal to the beat-
length (typically 300-400ms) and a window of three times the moving window of length w. We then sum pairwise distances
hop size. The spectra are normalised and subjected to Prin- between histograms at lags 1 l D to give an approximate
T -l
cipal Component Analysis. Finally we combine the first 20 autocorrelation B(l) = t=1 -dKL (xt , xt+l ) where dKL (x, x )
PCA components and the normalised envelope to yield 21- is a symmetrised Kullback-Liebler divergence, reflecting the
dimensional feature vectors. We train an N-state HMM on relative likelihood of the two histograms being drawn from
the sequence of feature vectors, with a single Gaussian out- two separate or one combined distribution of timbre-types,
put distribution for each state, and a single covariance matrix and is given by
tied across all states. We then Viterbi-decode the features
using the trained model to give the most likely sequence of N
x(i) x (i)
timbre-types. In music with a relatively small overall timbre dKL (x, x ) = x(i) log + x (i) log
i=1 q(i) q(i)
space, such as simple verse-chorus songs, we observe that the
labelled timbre-types correspond clearly to particular combi-
nations of notes performed with a similar instrumentation. where q(i) = x(i)+x (i) .
2
Figure 1 shows a typical sequence of timbre-type labels. The Figure 2 shows an example of the resulting function, in
number of timbre-types N should be large enough to show which candidate phrase lengths appear as peaks. The de-
clear variation in the labels observed in segments of differ- gree of smoothing of B(l), and therefore the smallest phrase
ent type, but small enough for the computational demands length to be considered, is controlled by the histogram win-
of (and the number of parameters to be learned by) a formal dow size w. To estimate a base phrase length d pl , we first
model to remain manageable. Following experiments over a calculate a moving baseline B0 (l) by smoothing B(l) with a
small test set of tracks we use a value of N = 40. Note that median filter of length 5, and then pick the larger of the first
although we use approximately beat-length analysis frames two peaks in B(l) - B0 (l). In experiments over a test set of
in this paper, this is not a requirement of our approach, but popular music, d pl was reliably found to be a four-bar phrase
rather is intended i) to aid clarity of discussion and ii) to limit length when using a histogram window size of w = 7 (with
computational requirements by keeping the maximum seg- beat-length frames). Given this base phrase length, it is sim-
ment length to a modest value. In music where full beat- ple to estimate the overall distribution of segment lengths for
tracking is possible, we are able to use strictly beat-length a track, because in most music the length of the great ma-
frames, but we reserve discussion of this for a future paper. jority of segments is some multiple of d pl , as illustrated in
Figure 3, which shows the distribution of segment lengths
3. ESTIMATING EXPECTED SEGMENT LENGTHS over the test set according to expert human segmentations.
Periodicity and rhythmic structure at the beat and bar level 4. SEGMENTATION METHODS
have previously been estimated from the autocorrelation or
`beat spectrum' of suitable extracted features [11, 12]. We 4.1 The segmental or hidden semi-Markov model
extend this to search for periodicity at the phrase level by In an HMM [9], at each time t the system is in one of M
a direct analysis of the sequence of timbre-type labels. We states {q1 , q2 , ..., qM } and generates an observation yt ac-
first create normalised histograms {xt } of timbre-types over a cording to a distribution P(yt |Qt = qi ) = bi (yt ). The sys-
initialised by T (i) = 1.
120
State initial and transition probabilities can then be re-
estimated as
100 i 0 (i)
i =
^
i i 0 (i )
number of segments
80
t=1 t (i)ai j t ( j)
T
ai j =
^
60 j t=1 t (i)ai j t ( j )
T
40 and state duration probabilities as
20 ^ t t-d (i)P(d|i)P(yt-d+1 , ..., yt |i)t (i)
P(d|i) =
d t t-d (i)P(d |i)P(yt-d +1 , ..., yt |i)t (i)
0
0 2 4 6 8 10 In the case of discrete observations, if we continue to
segment length as multiple of dpl
treat observations as conditionally independent, i.e.
Figure 3: Distribution of segment lengths over test set. t
P(yt-d+1 , ..., yt |i) = P(yt |i) (1)
t =t-d+1
tem then makes a transition to another state with probabil-
ity P(st+1 = q j |Qt = qi ) = ai j . Consequently i) the dura- the observation probabilities can be re-estimated as
tion d for which the system stays in any given state i has an
implicit geometric distribution P(d) = ad-1 (1 - aii); and ii) ^ t:yt =k