In learning mode, the user marks a rectangle in a training image specifying the region that has to be learned. Then, VOCUS computes the bottom-up saliency map and the most salient region inside the rectangle. So, the system is able to determine automatically what is important in a specified region. It concentrates on parts that are most salient and disregards the background or less salient parts.
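Next, weights are determined for the feature and conspicuity maps, indicating how important each feature is in the specified region. The weight $w_i$ of map $i$ is the quotient of the mean saliency in the target region, i.e., the most salient region (MSR), and the mean saliency in the background:

$$w_i = \frac{m_{i,(\mathrm{MSR})}}{m_{i,(\mathrm{image} - \mathrm{MSR})}},$$

where $m_{i,(\mathrm{MSR})}$ is the mean value of map $i$ inside the MSR and $m_{i,(\mathrm{image} - \mathrm{MSR})}$ is its mean value in the rest of the image. This computation considers not only which features are strongest in the region of interest but also which features best separate the region from the rest of the image.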
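The weight computation described above can be sketched as follows. This is a minimal illustration, not the VOCUS implementation: the function name `region_weight`, the list-based map and mask representation, and the `eps` guard against a zero background mean are all assumptions made for the example.

```python
from statistics import fmean


def region_weight(feature_map, msr_mask, eps=1e-9):
    """Return mean(map inside MSR) / mean(map outside MSR).

    feature_map: 2D list of non-negative floats (one feature or conspicuity map).
    msr_mask:    2D list of bools with the same shape, True inside the MSR.
    eps:         hypothetical guard against a zero background mean (assumption).
    """
    inside, outside = [], []
    for map_row, mask_row in zip(feature_map, msr_mask):
        for value, in_msr in zip(map_row, mask_row):
            (inside if in_msr else outside).append(value)
    if not inside or not outside:
        raise ValueError("the MSR must be a non-empty proper subset of the image")
    return fmean(inside) / max(fmean(outside), eps)


# Toy 3x3 map: two strong pixels inside the MSR, uniform weak background.
toy_map = [
    [0.125, 0.125, 0.125],
    [0.125, 0.5, 0.75],
    [0.125, 0.125, 0.125],
]
toy_mask = [
    [False, False, False],
    [False, True, True],
    [False, False, False],
]
weight = region_weight(toy_map, toy_mask)
print(weight)
```

A weight above 1 means the map responds more strongly inside the target region than in the background; a weight below 1 means the map is stronger in the background.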
Several training images: Learning weights from a single training image usually yields good results if the target object occurs in all test images in a similar way, i.e., on a similar background. To enable more stable recognition even on varying backgrounds, we determine the average weights from several training images by computing the geometric mean of the weights, i.e.,

$$w_{i,(1..n)} = \sqrt[n]{\prod_{j=1}^{n} w_{i,j}},$$

where $w_{i,j}$ is the weight of map $i$ learned from training image $j$ and $n$ is the number of training images. An algorithm for choosing the training images is proposed in [6]. It was shown that, even in complex scenarios, 5 training images usually suffice; for ball detection, two training images already yielded the best performance.
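Averaging the weights over several training images can be sketched as below. The function name `average_weights` and the representation of each training image as a plain list of per-map weights are assumptions for illustration; the selection of training images from [6] is not shown.

```python
import math


def average_weights(weights_per_image):
    """Geometric mean of per-map weights over n training images.

    weights_per_image: list of n weight vectors, one per training image;
    entry [j][i] is the weight of map i learned from image j.
    """
    n = len(weights_per_image)
    if n == 0:
        raise ValueError("at least one training image is required")
    num_maps = len(weights_per_image[0])
    if any(len(weights) != num_maps for weights in weights_per_image):
        raise ValueError("all weight vectors must have the same length")
    # n-th root of the product of the n weights, computed per map.
    return [
        math.prod(weights[i] for weights in weights_per_image) ** (1.0 / n)
        for i in range(num_maps)
    ]


# Two training images, two maps each (toy values).
averaged = average_weights([[2.0, 0.5], [8.0, 0.5]])
print(averaged)
```

Because the weights are ratios, the geometric mean treats a weight of 2 and a weight of 0.5 as equally strong evidence in opposite directions, which an arithmetic mean would not.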