Transcript
Page 1: Fingerprinting to Identify Repeated Sound Events Lab in ...dpwe/talks/icassp07-fprint-poster.pdf · ROSA Laboratory for the Recognition and Organization of Speech and Audio • To

Fingerprinting to Identify Repeated Sound Events in Long-Duration Personal Audio Recordings

James P. Ogle and Daniel P.W. Ellis • LabROSA, Columbia University • [email protected], [email protected]

for ICASSP'07 • 2007-04-09 [email protected]

Summary: Body-worn audio recorders can collect huge “personal audio” archives of everything heard by the user, but navigating this data is a challenge. We investigate a noise-resistant audio fingerprint as a way to identify recurrent sound events. The fingerprint works well for data that is highly repeatable (e.g.phone rings) but not for more “organic” sounds (door closures etc.).

• Consumer MP3 players (e.g. iRiver T10) can also record continuously for over 12 hours on a single rechargable AA battery- Easy to collect a “personal audio archive” of everything heard throughout the day- .. but finding anything in the recordings can take close to real-time

• We are researching ways get useful information from this data- e.g. automatic retrospective calendar of activities/locations

• Work so far addresses segmenting and clustering archives [Ellis & Lee 06]- works with time frames of 6..120 sec- investigates best features to capture background ambience- segmentation via BIC criterion (like speaker segmentation)

- cluster recurring ambiences/environments with spectral clustering• This work looks for repeating foreground events based on fingerprinting

- repeating events may be relevant to user e.g. phone rings, theme songs- data-mining: repeats can be identified without user intervention- want to find repeats despite changes in channel and background (unlike exact repeats of [Johnson et al 00, Kashino et al 03, Herley 06])

• Vision is for interactive browser/calendar displaying multiple sources of information gleaned from recordings and other sources

COLUMBIA

UNIVERSITY

Personal Audio Archives

LabROSALaboratory for the Recognition andOrganization of Speech and Audio

• To find repeating events in the long-duration recordings, we use the fingerprinting technique from Shazam [Wang 02, 06]

- Prominent peaks – landmarks – are selected in a spectrogram, thresholded to have a roughly constant rate in six frequency bands

- Each landmark is paired with up to 9 neighbors nearby in time-frequency - Each pair gives a combinatorial hash defined by the frequencies of the

two landmarks and the time between them {f1, f2, ∆t} • quantizing each component to 6 bits gives 218 (262,144) distinct hashes - An index file records the

times when each hash occurs (a multi-hour recording has an index of <10MB)

- Multiple hashes occurring around two time locations with the same relative timing indicate a repeated sound event

Key Advantages of Shazam Fingerprint: - No time framing to influence the hash (unlike [Burges et al 03, Herley 06])

- Spectral peaks make hashes almost invariant to background noise- Missing any single hash does not preclude matching- Lower bound number of matching hashes allows precision/recall tradeoff

Sound Event Fingerprints• To find possible earlier instances of events in current window:

- retrieve times of all earlier instances of current hashes (fast because store is indexed by hash value)- make a histogram of relative timings- look for large peak → repeated event ... nearly constant time search

Example• Histogram of # shared hashes in

5 sec windows for 30 min personal audio recording, with two instances of a phone ring and three plays of music recording “Song A”.

Evaluation• Exactly-repeating sounds (alarms,

recordings) are detected well; “organic” sounds (speech, door closing) are not.

• Search for particular event (telephone ring)shows excellent resistance to backgroundnoise.

References[Burges et al 03] C.J.C. Burges, J.C. Platt, S. Jana “Distortion discriminant analysis for audio

fingerprinting” IEEE Tr. Speech and Audio Proc., 11(3):165–174, 2003. [Ellis & Lee 06] D.P.W. Ellis and K. Lee “Accessing minimal-impact personal audio archives” IEEE

MultiMedia Magazine, 13(4):30–38, 2006. [Herley 06] C. Herley “ARGOS: Automatically extracting repeating objects from multimedia streams”

IEEE Tr. Multimedia, 8(1):115–129, 2006.[Johnson et al 00] S. Johnson & P. Woodland ‘‘A Method for Direct Audio Search with Applications to

Indexing and Retrieval” Proc. ICASSP, III:1427–1430, 2000.[Kashino et al 03] K. Kashino, T. Kurozumi, H. Murase “A quick search method for audio and video

signals based on histogram pruning” IEEE Tr. Multimedia, 5(3):348–357, 2003. [Wang 02] A. Wang “An industrial-strength audio search algorithm” Proc. ISMIR 2003.[Wang 06] A. Wang “The Shazam music recognition service” Comm. ACM, 49(8):44–48, Aug 2006.

Finding Repeated Events

08:00

08:30

09:00

09:30

10:00

10:30

11:00

11:30

12:00

12:30

13:00

13:30

14:00

14:30

15:00

15:30

16:00

16:30

17:00

17:30

18:00

18:30

19:00

19:30

20:00

preschool

cafe

preschoolcafelecture

office

officeoutdoorgroup

lab

cafemeeting2

office

office

outdoorlab

cafe

Ron

Manuel

Arroyo?

Lesser

2004-09-13

L2cafeoffice

outdoorlecture

outdoor

meeting2

outdoorofficecafeoffice

outdoorofficepostlecoffice

DSP03

compmtg

Mike

Sambarta?

2004-09-14

L2cafeofficemeetinglecture2

office

officecafeoutdoor

preschool

graham

keansub

2004-09-15

meetingcafeofficemeeting2postlec

lecture

outdooroutdooroffice

outdooroutdoormeetingofficepostlecmeeting

officemeeting2

nathancafe

DSP04

office

2004-09-16

cafegroup

cafegroupgroup

outdoor

group

groupoffice

cafe

cafecafe

rdg

seminar

gp

2004-09-17

Category Recall PrecisionProduction Audio 29/30 97% 29/34 85%Alert Sounds 45/65 69% 45/45 100%OrganicSounds 0/20 0% 0/20 0%

ArchiveLength (min) 60 120 180 390Search time (ms) 21 31 37 131

SNR/ dB 3 -3 -9 -15 -21Recall 100% 89% 89% 89% 56%

freq

/ Bar

k

Clock time

Hand-markedboundaries

5

10

15

20

21:00 22:00 23:00 0:00 1:00 2:00 3:00 4:00

Spectrogram-like representationof 8 hour recording shows energy(intensity), spectral flatness (saturation)and variance (hue) in 1 min windows .

Phone ring - Shazam fingerprint

0 0.5 1 1.5 2 2.5 time / s level / dB0

1

2

3

4

freq

/ kH

z

-80

-70

-60

-50

-40

-30

-20

-10

f1

f2∆t

2007-04-11-0839.idx00|00|00: 7012.45 11052.33 96384.2800|00|01: 123.11 125.87 23004.66 61993.8300|00|02:00|00|03: 71552.34 101663.03...

Song A

Song A

Telephone ring

Telephone ring

time/sec

count

time/

sec

500

500

1000

1000

1500

1500

105

2

20

50100200

50010002000

Top Related