Good news for karaoke kings: You may soon be singing like superstars.
Researcher Mark Smith of Purdue University and Georgia Tech graduate Matthew Lee are creating a computerized system that makes average singers sound like professionals. The researchers are designing computer models for voice analysis and synthesis that break down an original human singing voice and then modify it to sound better. Though more work is needed, the specialized programs can already alter certain qualities of a voice, such as pitch, duration, and vibrato (a periodic modulation in frequency).
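To make the three adjustable qualities concrete, here is a minimal sketch that synthesizes a pure sine-wave note from scratch. It is an illustration only, not the researchers' code, and it does not modify a recorded voice. The function name `render_note` and its parameters are hypothetical. Pitch is set by the base frequency, duration by the note length, and vibrato by a slow sinusoidal swing in the instantaneous frequency.

```python
import math

def render_note(f0_hz, duration_s, vibrato_rate_hz=0.0, vibrato_depth_hz=0.0,
                sample_rate=16000):
    """Synthesize a sine-wave note with optional vibrato (hypothetical helper).

    Instantaneous frequency:
        f(t) = f0_hz + vibrato_depth_hz * sin(2*pi*vibrato_rate_hz*t)
    Returns (samples, frequency_track).
    """
    n = int(round(duration_s * sample_rate))
    samples, freqs = [], []
    phase = 0.0
    for i in range(n):
        t = i / sample_rate
        f = f0_hz + vibrato_depth_hz * math.sin(2 * math.pi * vibrato_rate_hz * t)
        freqs.append(f)
        samples.append(math.sin(phase))
        # Accumulate phase so the waveform stays continuous as f changes.
        phase += 2 * math.pi * f / sample_rate
    return samples, freqs

# A plain 220 Hz note, then the same note raised an octave (pitch),
# stretched to twice the length (duration), with 5.5 Hz, +/- 6 Hz vibrato.
plain, _ = render_note(220.0, 0.5)
edited, track = render_note(440.0, 1.0, vibrato_rate_hz=5.5, vibrato_depth_hz=6.0)
print(len(plain), len(edited))  # 8000 16000
```

Integrating the frequency into a running phase, rather than computing `sin(2*pi*f*t)` directly, keeps the waveform free of jumps when the frequency changes from sample to sample.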
Smith, head of Purdue's School of Electrical and Computer Engineering, began work years ago on the underlying sinusoidal model, which divides the human singing voice into short segments and describes each one with sine waves. Smith and Lee have taken it further by developing a method for modifying the sine-wave parameters in each segment to improve the quality of the singing. Though the team has had some success improving the singing voices in its database, the system still cannot handle all types of voices reliably.
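The modify-and-resynthesize step can be sketched as follows, assuming each segment's sine-wave parameters (amplitude, frequency, phase) have already been estimated. The `SineComponent` type, the `shift_pitch` rule, and the harmonic amplitudes are hypothetical; the article does not describe the actual modification rules Smith and Lee use.

```python
import math
from dataclasses import dataclass

@dataclass
class SineComponent:
    amplitude: float
    frequency_hz: float
    phase_rad: float

def synthesize_segment(components, num_samples, sample_rate=16000):
    """Resynthesize one segment by summing its sine-wave components."""
    return [
        sum(c.amplitude * math.cos(2 * math.pi * c.frequency_hz * n / sample_rate
                                   + c.phase_rad)
            for c in components)
        for n in range(num_samples)
    ]

def shift_pitch(components, ratio):
    """Hypothetical modification: scale every frequency by `ratio`."""
    return [SineComponent(c.amplitude, c.frequency_hz * ratio, c.phase_rad)
            for c in components]

# A made-up segment: five harmonics of 200 Hz with decaying amplitudes.
segment = [SineComponent(1.0 / k, 200.0 * k, 0.0) for k in range(1, 6)]
higher = shift_pitch(segment, 2 ** (1 / 12))      # up one semitone
samples = synthesize_segment(higher, 320)         # one 20 ms segment at 16 kHz
print([round(c.frequency_hz, 1) for c in higher])
# [211.9, 423.8, 635.7, 847.6, 1059.5]
```

Scaling every frequency this way also shifts the voice's formants, which is one reason practical pitch modifiers often handle pitch and spectral envelope separately.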
"Characterizing the properties of a good voice in terms of computed sine-wave components is not a trivial task," says Smith. "The problem is further complicated by the wide variety of singing styles and voice types in our population." For example, the sine-wave components of male and female voices differ significantly. It turns out that the higher-pitched voices of females are easier to work with, says Smith.
A key aspect of the sinusoidal-model technique is an overlap-add construction, in which a singing voice is partitioned into overlapping segments that are processed in blocks and then added back together. Because the model is designed around overlapping blocks, the synthesized voice sounds natural rather than choppy, says Smith. Here is how it works: the singing is converted into a sequence of numbers, which is modified into a new sequence representing a more professional singing voice. The new numbers are then fed through a digital-to-analog converter to a speaker.
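The overlap-add idea can be sketched as follows, assuming a periodic Hann window with 50% overlap (a common choice; the article does not say which window or overlap the researchers use). Each block is windowed, passed to a `process` function (a hypothetical stand-in for the sine-wave modification step), and summed back into the output. Because the overlapping windows add up to a constant, unmodified blocks reconstruct the input exactly, which illustrates why overlapping blocks avoid the choppiness of abrupt block edges.

```python
import math

def periodic_hann(size):
    # Periodic Hann windows at 50% overlap sum to exactly 1 (constant overlap-add).
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / size) for n in range(size)]

def overlap_add(signal, frame_size, process):
    """Split `signal` into 50%-overlapping windowed blocks, apply `process`
    to each block, and overlap-add the results. `frame_size` must be even."""
    hop = frame_size // 2
    window = periodic_hann(frame_size)
    # Pad both ends so every input sample is covered by exactly two blocks.
    padded = [0.0] * hop + list(signal) + [0.0] * frame_size
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - frame_size + 1, hop):
        block = [padded[start + i] * window[i] for i in range(frame_size)]
        for i, value in enumerate(process(block)):
            out[start + i] += value
    return out[hop:hop + len(signal)]

# "Singing" as a sequence of numbers: 0.1 s of a 220 Hz tone sampled at 16 kHz.
fs = 16000
voice = [math.sin(2 * math.pi * 220 * n / fs) for n in range(1600)]
same = overlap_add(voice, 256, lambda block: block)                    # no change
quieter = overlap_add(voice, 256, lambda block: [0.5 * x for x in block])
print(max(abs(a - b) for a, b in zip(voice, same)) < 1e-9)  # True
```

In a real system, `process` would alter the block's sine-wave parameters, and the output sequence would go to a digital-to-analog converter rather than back into Python.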
Broader applications for such a system could include synthesizing musical instruments and improving the quality of text-to-speech programs.