Information about http://www.elec.qmul.ac.uk/people/markl/eusipcoTEX.pdf

NEW METHODS IN STRUCTURAL SEGMENTATION OF MUSICAL AUDIO …

Tags: abstract models, elec, experimental results, introduction issue, london e1 4ns, manual input, mark levy, markov models, music track, musical structure, phrase length, probability distribution, qmul, queen mary university, queen mary university of london, real music, road london, segment lengths, segmentation, university of london,
Pages: 5
Language: english
Created: Mon Jan 30 16:21:46 2006
Display cached document
Page 1
image
Page 2
image
Page 3
image
Page 4
image
Page 5
image
      NEW METHODS IN STRUCTURAL SEGMENTATION OF MUSICAL AUDIO
                                                     Mark Levy, Mark Sandler
                                                    Centre for Digital Music
                                               Queen Mary, University of London
                                              Mile End Road, London E1 4NS, UK
                                    mark.levy@elec.qmul.ac.uk, mark.sandler@elec.qmul.ac.uk


                          ABSTRACT                                     models (HMMs), where the hidden states correspond to seg-
We describe a simple model of musical structure and two                ment types, are used for segmentation in [5], and a two-pass
related methods of extracting a high-level segmentation of a           method, in which candidate segment boundaries are detected
music track from the audio data, including a novel use of hid-         and the intervening segments then used to initialise an HMM,
den semi-Markov models. We introduce a semi-supervised                 is outlined in [6]. These approaches are much less restrictive,
segmentation process which finds musical structure with im-            but unfortunately the chosen model implicitly defines a geo-
proved accuracy given some very limited manual input. We               metric probability distribution for segment durations, which
give experimental results compared to existing methods and             is not what we observe in real music, where segment lengths
human segmentations.                                                   are typically multiples of some basic phrase length, for ex-
                                                                       ample 8, 16 or 32 bars, only very rarely taking other values.
                                                                            In our own work [7] we have partially addressed the
                    1. INTRODUCTION                                    issue of modelling expected segment lengths realistically
This paper introduces a simple model of musical structure,             within a clustering framework, by including a term express-
and shows how it can be either formalised directly as a seg-           ing the relative unlikelihood of segments being shorter than
ment model, or used to inform a clustering approach to au-             an experimentally-determined minimum duration. In this pa-
dio segmentation. Although the model is very general, and              per we introduce a method for estimating the underlying base
could be applied to symbolic data (notes, chords, etc.) drawn          phrase length of a piece of music from audio data, and show
from score representations of music, our focus here is on its          how this can either be passed to our clustering method, or
application directly to audio data, in order to extract high-          used to initialise a different model for high-level segmenta-
level musical structure from audio tracks with at most a small         tion, which allows us to specify a full probability distribution
amount of human intervention. Knowledge of this structure              for segment lengths.
has immediate practical applications in the context of audio                Although unsupervised approaches to segmentation have
or video editing, enabling the development of features such            been shown to achieve results similar to human judgement
as `jump to start of next phrase', `align with current musical         in some cases, research into parallel problems in image seg-
phrase', etc. It leads easily to the automatic extraction of mu-       mentation [8] suggests that a small amount of supervision
sical summaries or `thumbnail' segments, for use in browsing           may offer large gains in segmentation accuracy. In the case
and searching the large audio collections which are rapidly            of musical audio segmentation, a major cause of fragility in
becoming commonplace through the popularity of download                unsupervised methods is the difficulty of providing good ini-
sites, MP3 players and associated technologies. Compar-                tial parameter values to models such as HMMs, whose train-
ison of segment models also offers possible new methods                ing algorithms are well known to be susceptible to poor ini-
for content-based audio retrieval, similarity search and music         tialisation [9]. We address this problem here with a simple
recommendation.                                                        semi-supervised approach to segmentation.
     Early research into automatic segmentation of musical                  The organisation of the rest of this paper is as follows:
audio [1, 2] had some success in the partial extraction of             section 2 describes the musical model and the audio fea-
high-level musical structure by a self-similarity search over          tures used; section 3 describes how a segment-length distri-
spectral features to find repeated sections. This approach has         bution is estimated; section 4 introduces the new segmen-
some drawbacks: only some sections are identified and la-              tation methods; section 5 describes the semi-supervised ap-
belled, the choice of distance metrics and thresholds for sim-         proach and illustrates output segmentations; section 6 evalu-
ilarity is somewhat ad hoc, and the search is computationally          ates the performance of the methods in relation to previous
expensive, requiring the calculation of pairwise distances             work and some manual segmentations; section 7 outlines fur-
between all analysis frames within a track. Self-similarity            ther work.
search has been extended in [3, 4] to extract the structure
of complete pop tracks by incorporating information from                     2. A MODEL OF MUSICAL STRUCTURE
beat-tracking and applying a set of heuristics to identify seg-        2.1 The nature of musical structure
ments as being one of a very small number of specific types
(instrumental, verse or chorus). These methods make some               We can describe the structure of most music as follows. Ac-
highly limiting assumptions about the structure being sought           tivity at the level of the beat, i.e. notes to be sung or played
that apply only to conventional pop music. Hidden Markov               by particular instruments, is organised into regular phrases,
                                                                       typically in Western music of 4 or 8 bars, where each bar con-
    This research was supported by EPSRC grant GR/S84750/01 (Hierar-   tains 3 or 4 beats. These phrases are concatenated, whether
chical Segmentation and Semantic Markup of Musical Signals).           by the composer on manuscript paper, or the producer or mu-
sician in the recording studio, to form structural sections, or                      40
segments, in accordance with the stylistic norms of the par-
ticular musical genre in question. A piece is then constructed                       30




                                                                       timbre type
from a sequence of segments of various types, in an order
                                                                                     20
again largely determined by its genre.
    While this is clearly a weak account of the real act of                          10
composition, it does suggest a straightforward formal model
of musical structure that we can reasonably expect to fit a                           0
great many pieces of music, for example most popular, folk,                               0            200          400       600        800        1000
world and much classical music. We assume further that we
have some way of labelling each beat or frame of the music in         Figure 1: Sequence of timbre-types against manual segmen-
such a way that beats or frames which are musically similar           tation. Note how related segments (shown in same back-
are assigned the same label. A piece can then be represented          ground shade) contain similar sequences of states.
as a sequence of labels {yt },t = 1, 2, ..., T , and the task of
segmentation consists of assigning each of the yt to one of                          1500
a set of segment-types Q = {q1 , q2 , ..., qM }, subject to suit-
able expectations, which we can express as conditional prob-                         1000




                                                                      B(l)
ability distributions, on the resulting segment durations and
the particular sequences of labels observed in each segment-                          500
type.
                                                                                          0
                                                                                              0         10       20        30       40         50     60
2.2 The observation sequence                                                                                              lag l
Given an audio track, we aim to label each beat as belonging
to one of N possible timbre-types dividing the overall space          Figure 2: Approximate autocorrelation of timbre-type se-
of timbre used in the track. We first estimate the beat using         quence, showing first strong peak at 4-bar phrase length.
a beat-tracking algorithm [10] and then extract constant-Q
            1
spectra at 8 -octave resolution, using a hop equal to the beat-
length (typically 300-400ms) and a window of three times the          moving window of length w. We then sum pairwise distances
hop size. The spectra are normalised and subjected to Prin-           between histograms at lags 1  l  D to give an approximate
                                                                                                T -l
cipal Component Analysis. Finally we combine the first 20             autocorrelation B(l) = t=1 -dKL (xt , xt+l ) where dKL (x, x )
PCA components and the normalised envelope to yield 21-               is a symmetrised Kullback-Liebler divergence, reflecting the
dimensional feature vectors. We train an N-state HMM on               relative likelihood of the two histograms being drawn from
the sequence of feature vectors, with a single Gaussian out-          two separate or one combined distribution of timbre-types,
put distribution for each state, and a single covariance matrix       and is given by
tied across all states. We then Viterbi-decode the features
using the trained model to give the most likely sequence of                                                     N
                                                                                                                             x(i)             x (i)
timbre-types. In music with a relatively small overall timbre                                     dKL (x, x ) =  x(i) log         + x (i) log
                                                                                                                i=1          q(i)             q(i)
space, such as simple verse-chorus songs, we observe that the
labelled timbre-types correspond clearly to particular combi-                                                
nations of notes performed with a similar instrumentation.            where q(i) = x(i)+x (i) .
                                                                                         2
Figure 1 shows a typical sequence of timbre-type labels. The              Figure 2 shows an example of the resulting function, in
number of timbre-types N should be large enough to show               which candidate phrase lengths appear as peaks. The de-
clear variation in the labels observed in segments of differ-         gree of smoothing of B(l), and therefore the smallest phrase
ent type, but small enough for the computational demands              length to be considered, is controlled by the histogram win-
of (and the number of parameters to be learned by) a formal           dow size w. To estimate a base phrase length d pl , we first
model to remain manageable. Following experiments over a              calculate a moving baseline B0 (l) by smoothing B(l) with a
small test set of tracks we use a value of N = 40. Note that          median filter of length 5, and then pick the larger of the first
although we use approximately beat-length analysis frames             two peaks in B(l) - B0 (l). In experiments over a test set of
in this paper, this is not a requirement of our approach, but         popular music, d pl was reliably found to be a four-bar phrase
rather is intended i) to aid clarity of discussion and ii) to limit   length when using a histogram window size of w = 7 (with
computational requirements by keeping the maximum seg-                beat-length frames). Given this base phrase length, it is sim-
ment length to a modest value. In music where full beat-              ple to estimate the overall distribution of segment lengths for
tracking is possible, we are able to use strictly beat-length         a track, because in most music the length of the great ma-
frames, but we reserve discussion of this for a future paper.         jority of segments is some multiple of d pl , as illustrated in
                                                                      Figure 3, which shows the distribution of segment lengths
3. ESTIMATING EXPECTED SEGMENT LENGTHS                                over the test set according to expert human segmentations.

Periodicity and rhythmic structure at the beat and bar level                                        4. SEGMENTATION METHODS
have previously been estimated from the autocorrelation or
`beat spectrum' of suitable extracted features [11, 12]. We           4.1 The segmental or hidden semi-Markov model
extend this to search for periodicity at the phrase level by          In an HMM [9], at each time t the system is in one of M
a direct analysis of the sequence of timbre-type labels. We           states {q1 , q2 , ..., qM } and generates an observation yt ac-
first create normalised histograms {xt } of timbre-types over a       cording to a distribution P(yt |Qt = qi ) = bi (yt ). The sys-
                                                                                     initialised by T (i) = 1.
                     120
                                                                                          State initial and transition probabilities can then be re-
                                                                                     estimated as
                     100                                                                                             i 0 (i)
                                                                                                                        
                                                                                                             i =
                                                                                                             ^
                                                                                                                  i i 0 (i )
                                                                                                                          
number of segments




                      80
                                                                                                                     t=1 t (i)ai j t ( j)
                                                                                                                      T
                                                                                                      ai j =
                                                                                                      ^
                      60                                                                                            j t=1 t (i)ai j t ( j )
                                                                                                                        T


                      40                                                             and state duration probabilities as

                      20                                                                ^           t t-d (i)P(d|i)P(yt-d+1 , ..., yt |i)t (i)
                                                                                                        
                                                                                        P(d|i) =
                                                                                                   d  t t-d  (i)P(d  |i)P(yt-d  +1 , ..., yt |i)t (i)
                                                                                                         

                       0
                           0     2       4       6       8        10                     In the case of discrete observations, if we continue to
                                segment length as multiple of dpl
                                                                                     treat observations as conditionally independent, i.e.
                     Figure 3: Distribution of segment lengths over test set.                                                     t
                                                                                                   P(yt-d+1 , ..., yt |i) =               P(yt  |i)     (1)
                                                                                                                              t  =t-d+1
tem then makes a transition to another state with probabil-
ity P(st+1 = q j |Qt = qi ) = ai j . Consequently i) the dura-                       the observation probabilities can be re-estimated as
tion d for which the system stays in any given state i has an
implicit geometric distribution P(d) = ad-1 (1 - aii); and ii)                                     ^        t:yt =k