Subtitle-Music Alignment
Addressing the Problem
What is the Problem?
We want to build a system that aligns the singing in polyphonic music audio with subtitles, i.e., textual lyrics with a timestamp for each line.
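For illustration, an LRC-style subtitle file pairs each lyric line with a start time; the timestamps and text below are made-up placeholders, not taken from the test songs:

```
[00:12.40] first lyric line
[00:18.90] second lyric line
[00:25.10] third lyric line
```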
Difficulties
- The accompaniment can reduce recognition accuracy.
- The variability of phonation in singing (compared with speech) also affects the system.
System Overview
Solutions for the Mentioned Difficulties
- Apply source separation to extract the vocal line (a minimal sketch follows this list).
- Adapt HMMs trained on speech using singing-voice training data, to bridge the gap between speech and singing.
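The write-up does not name a specific separation method; as one minimal sketch, the nearest-neighbour-filtering vocal-separation example from the librosa documentation could serve as this step (file names are placeholders):

```python
# Minimal vocal-separation sketch, adapted from the librosa
# "vocal separation" example; "mix.wav" / "vocals.wav" are placeholders.
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("mix.wav", sr=None)
S_full, phase = librosa.magphase(librosa.stft(y))

# Nearest-neighbour filtering estimates the repeating (accompaniment) part.
S_filter = librosa.decompose.nn_filter(
    S_full,
    aggregate=np.median,
    metric="cosine",
    width=int(librosa.time_to_frames(2, sr=sr)),
)
S_filter = np.minimum(S_full, S_filter)

# A soft mask keeps the non-repeating (vocal) energy.
mask_v = librosa.util.softmask(S_full - S_filter, 10 * S_filter, power=2)
vocals = librosa.istft(mask_v * S_full * phase)
sf.write("vocals.wav", vocals, sr)
```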
Desired Structure
Implementation
- Trained HMMs on the CMU ARCTIC speech database, since an annotated singing-phoneme database was not available.
- Created a phonetic transcription of the subtitle with the CMU pronouncing dictionary.
- Used HCopy in HTK to extract LPC features from the audio.
- Used HVite in HTK to compute the Viterbi forced alignment (see the pipeline sketch after this list).
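A sketch of how these steps could be wired together, assuming HTK is installed and that the model, config, dictionary, and script files (all placeholder names here) already exist; the HVite flags follow the HTKBook forced-alignment recipe rather than this project's exact invocation:

```python
# Sketch of the alignment pipeline. Assumes HTK (HCopy, HVite) is on
# PATH, and that hmm/macros, hmm/hmmdefs, config.lpc, dict, monophones,
# words.mlf and features.scp exist -- all placeholder names.
import subprocess
import nltk

# 1. Phonetic transcription of each lyric line via the CMU dictionary.
cmu = nltk.corpus.cmudict.dict()  # requires: nltk.download('cmudict')

def line_to_phones(line):
    phones = []
    for word in line.lower().split():
        if word in cmu:
            # take the first listed pronunciation, strip stress digits
            phones += [p.rstrip("012") for p in cmu[word][0]]
        # out-of-vocabulary words are skipped here; handling them is
        # listed below under "Things that Could Be Improved"
    return phones

# 2. Feature extraction: HCopy converts audio to an HTK feature file
#    according to the TARGETKIND in the config (LPC here).
subprocess.run(["HCopy", "-C", "config.lpc", "vocals.wav", "vocals.fea"],
               check=True)

# 3. Viterbi forced alignment: HVite in alignment mode (-a) aligns the
#    known word sequence (words.mlf) against the features and writes
#    the time-stamped result to aligned.mlf.
subprocess.run(["HVite", "-a", "-m",
                "-C", "config.lpc",
                "-H", "hmm/macros", "-H", "hmm/hmmdefs",
                "-I", "words.mlf", "-i", "aligned.mlf",
                "-S", "features.scp",
                "dict", "monophones"],
               check=True)
```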
Things that Could Be Improved
- Should adapt this HMM model using a singing dataset.
- Should automatically add entries that are missing from the CMU pronouncing dictionary.
- Should apply vocal activity detection (a simple energy-based sketch follows this list).
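As an illustration of the last point, a simple energy-based vocal activity detector might look like the following; the threshold is an arbitrary illustrative choice, not this project's implementation:

```python
# Energy-based vocal activity detection sketch; "vocals.wav" is a
# placeholder for the separated vocal track.
import numpy as np
import librosa

y, sr = librosa.load("vocals.wav", sr=None)
rms = librosa.feature.rms(y=y)[0]
times = librosa.frames_to_time(np.arange(len(rms)), sr=sr)

# Frames above 1.5x the median frame energy count as "voice active";
# the factor is arbitrary and would need tuning.
active = rms > 1.5 * np.median(rms)
# times[active] gives the frame times judged to contain singing.
```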
Result
The Alignment Output
The alignment error is shown in this table:
| Song | Average Error | Standard Deviation (s) |
| --- | --- | --- |
| Creep | 1006 ms | 1.65 |
| Creep (vocals only) | 1129 ms | 1.16 |
| Blank Space | 429 ms | 0.81 |
After eliminating outliers (error greater than 3 s), the numbers improve, as shown in the table below (a sketch of this computation follows the table). Note: most outliers would be fixed by a better vocal activity detection algorithm, and some are caused by wrongly labelled subtitle files.
| Song | Average Error | Standard Deviation (s) |
| --- | --- | --- |
| Creep | 546 ms | 0.76 |
| Creep (vocals only) | 913 ms | 0.83 |
| Blank Space | 323 ms | 0.26 |
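For reference, a minimal sketch of how the per-line error statistics above could be computed, with the same 3 s outlier cut-off; the data arrays are made-up placeholders, not the actual measurements:

```python
import numpy as np

# Placeholder data: per-line reference and predicted start times (seconds).
reference = np.array([12.4, 18.9, 25.1, 31.6])
predicted = np.array([12.9, 19.2, 29.5, 31.8])

errors = np.abs(predicted - reference)
kept = errors[errors <= 3.0]  # drop outliers with error > 3 s
print(f"mean = {1000 * kept.mean():.0f} ms, std = {kept.std():.2f} s")
```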
The per-line alignment error is plotted below:
Thoughts about the Result
Given the performance gap between the two songs, one guess is that the length of the lyric lines provided in the subtitle files influences word detection, since a large portion of the accompaniment has already been eliminated. Also, under a normalized error measure, the two songs might look more similar.
About the effect of vocal separation: I did not expect this result. One thought is that our models include a noise model, whereas the vocals-only file does not contain much noise.
Tests on a larger dataset with more songs are still needed.
A visualization of the output
Video
References
Mesaros, A. and Virtanen, T. "Automatic alignment of music audio and lyrics." Proceedings of the 11th International Conference on Digital Audio Effects (DAFx-08), 2008.
Mesaros, A. "Singing voice identification and lyrics transcription for music information retrieval." Proceedings of the 7th Conference on Speech Technology and Human-Computer Dialogue (SpeD), IEEE, 2013, pp. 1-10.
Mauch, M., Fujihara, H. and Goto, M. "Song Prompter: An accompaniment system based on the automatic alignment of lyrics and chords to audio." Late-breaking session at the 10th International Conference on Music Information Retrieval (ISMIR), 2010.