It turns out that the salience detector discussed previously is already a great tracker of proto-objects by itself. One could simply apply the algorithm to every frame of a video sequence and get a good idea of the location of the objects. However, what is getting lost is correspondence information. Imagine a video sequence of a busy scene, such as from a city center or a sports stadium. Although a saliency map could highlight all the proto-objects in every frame of a recorded video, the algorithm would have no way to know which proto-objects from the previous frame are still visible in the current frame. Also, the proto-objects map might contain some false-positives, such as in the following example:
Note that the bounding boxes extracted from the proto-objects map made (at least) three mistakes in the preceding example: it missed highlighting a player (upper-left), merged two players into the same bounding box, and highlighted some additional arguably non-interesting...