Video sprites and concatenative sound synthesis

Sometimes crossing the borders of your own domain can yield some very interesting insights. After reading a nice blog post on recent applications of seam carving in the video processing world, I stumbled across an interesting article on controlled animation of video sprites. Allow me to explain why I think it is interesting for music technologists like us.

Let's start by briefly explaining what that video sprites article is about: using existing video recordings of an object (the "sprite") to create new video sequences of that same object following a path or set of movements that you define (think 3D object movement scripting, but with real footage). The recipe goes something like this:

  • make a long video recording of an object moving on a known background
  • isolate the object from the known background (e.g. by keying) and do some cleanup to eliminate unusable video frames
  • find pairs of frames in the video sequence that are sufficiently similar that they can be used for transitions from one part of the sequence to another without noticeable jerks (1); a rough sketch of this step follows the list
  • define a desired movement for the object using motion constraints related to location, path, collision, etc. (as in "the hamster from the original video recording should now run around on this circle")
  • find the transition sequence that best matches the defined movement
  • et voilà, you have your hamster running around on your circle in a natural and smooth frame sequence
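
To make that transition-finding step a bit more concrete, here is a minimal sketch in Python/numpy of how hunting for usable jump points could look. The function name, the plain mean-squared pixel difference, and the fixed threshold are my own simplifications; the article itself uses a trained classifier for this step, as explained in footnote (1).

    import numpy as np

    def find_transition_pairs(frames, threshold, min_gap=10):
        # frames: (n_frames, height, width) array, e.g. grayscale video
        # threshold: max mean squared pixel difference for a usable jump
        # min_gap: skip near neighbours, which are trivially similar
        n = len(frames)
        flat = frames.reshape(n, -1).astype(np.float64)
        pairs = []
        for i in range(n - min_gap):
            # distance from frame i to every frame at least min_gap later
            dists = np.mean((flat[i + min_gap:] - flat[i]) ** 2, axis=1)
            for k in np.flatnonzero(dists < threshold):
                pairs.append((i, i + min_gap + k))
        return pairs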

So, why is this in any way relevant to music technology?
Well, reading it immediately made me think of concatenative sound synthesis. That is a technique where you concatenate carefully selected items from a massive set of very small audio snippets to create a new sound. The selection process is usually driven by some kind of target described by one or more feature trajectories: a pitch contour, a timbre pattern, an amplitude curve, etc. The technique has been in use in the speech synthesis world for a long time, and gained a lot of traction there because it produces natural-sounding speech, in contrast to most pure sound synthesis techniques.
Actually, concatenative sound synthesis can be applied at several time scales: on a micro scale you can create new sounds from existing ones, whereas on a macro scale you can create mashups of songs where you replace parts of a song with similar parts from other songs (or dissimilar parts, for that matter, if you want something totally different).
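
To give you an idea of what the selection process looks like in its most naive form, here is a heavily simplified unit selection step in Python/numpy: for each point on a desired pitch contour we pick the corpus snippet whose pitch is closest, and glue the winners together. All names here are my own, and a real system (CataRT, for instance) would match on several features at once, add continuity costs between consecutive units, and at least cross-fade at the joins.

    import numpy as np

    def select_and_concatenate(units, unit_pitches, target_contour):
        # units: list of 1-D arrays, the small audio snippets in the corpus
        # unit_pitches: estimated pitch (Hz) of each snippet
        # target_contour: desired pitch contour, one value per output slot
        unit_pitches = np.asarray(unit_pitches, dtype=np.float64)
        chosen = [int(np.argmin(np.abs(unit_pitches - p)))
                  for p in target_contour]
        # naive join; a real system would cross-fade between snippets
        return np.concatenate([units[i] for i in chosen])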

Not that any of this is new, of course (look up the work by Diemo Schwarz for example, in particular this overview paper). It just always amazes me when something from a seemingly totally different domain crosses my path and turns out to use techniques similar to the ones we use in the music technology field. Makes you wonder what techniques we could borrow from financial time series analysis, heh? Or perhaps better not? ;-)


(1) This is done by first training a classifier to make this distinction, based on a manual classification of 1000 frame pairs as either acceptable or unacceptable for a transition, and then applying the classifier to all frame pairs. The authors actually use two different classifiers, and they use the decision function of the second, linear one as a measure of visual difference, from which a transition cost is computed.
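
In other words, the classifier is not just a yes/no gate: its signed distance to the decision boundary doubles as a visual-difference score. Assuming a linear classifier with weight vector w and bias b (the exact formulation in the paper may differ), turning that score into a transition cost could look roughly like this:

    import numpy as np

    def transition_cost(pair_features, w, b):
        # pair_features: feature vector describing one frame pair
        # w, b: linear classifier trained on the manually labelled pairs;
        # a positive score means "acceptable transition"
        score = np.dot(w, pair_features) + b
        # the deeper a pair sits on the unacceptable side of the
        # boundary, the more expensive we make the transition
        return max(0.0, -float(score))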