In learning mode, the user marks a rectangle in a training image specifying the region that has to be learned. Then, VOCUS computes the bottom-up saliency map and the most salient region inside the rectangle. So, the system is able to determine automatically what is important in a specified region. It concentrates on parts that are most salient and disregards the background or less salient parts.
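Next, weights are determined for the feature and conspicuity
maps, indicating how important a feature is in the
specified region. For each map $i$, the weight $w_i$ is the quotient of the mean saliency in
the target region, i.e., the most salient region (MSR), and in the background:
$$ w_i = \frac{m_{i,(\mathrm{MSR})}}{m_{i,(\mathrm{image}-\mathrm{MSR})}} $$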
This computation considers not only which features are strongest
in the region of interest but also which features best separate the
region from the rest of the image.
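The quotient described above can be illustrated with a minimal sketch. It assumes that a feature or conspicuity map is available as a 2D list of non-negative values and that the most salient region is given as a boolean mask of the same size; the function name `region_weight`, the data layout, and the handling of a zero background mean are assumptions made for illustration and are not taken from VOCUS itself.

```python
from statistics import fmean


def region_weight(feature_map, msr_mask):
    """Return mean(map inside MSR) / mean(map outside MSR) for one map.

    feature_map: 2D list of non-negative floats (hypothetical layout).
    msr_mask:    2D list of bools, True where a pixel belongs to the MSR.
    """
    inside, outside = [], []
    for map_row, mask_row in zip(feature_map, msr_mask):
        for value, in_msr in zip(map_row, mask_row):
            (inside if in_msr else outside).append(value)
    if not inside or not outside:
        raise ValueError("MSR and background must both be non-empty")
    background_mean = fmean(outside)
    if background_mean == 0:
        # Assumption: the source does not say how a zero background is handled.
        raise ValueError("background mean is zero; weight is undefined")
    return fmean(inside) / background_mean


# Toy map: the top row is the MSR, the bottom row is the background.
toy_map = [[4.0, 2.0],
           [1.0, 1.0]]
toy_mask = [[True, True],
            [False, False]]
weight = region_weight(toy_map, toy_mask)  # mean 3.0 inside, mean 1.0 outside
```

A weight above 1 indicates that the map responds more strongly inside the target region than in the background, and a weight below 1 indicates the opposite.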
Several training images:
Learning weights from one single training image usually yields
good results if the target object occurs in all test images in a
similar way, i.e., on a similar background. To enable a more
stable recognition even on varying backgrounds, we determine the
average weights from several training images by computing the
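geometric mean of the weights, i.e.,
$$ w_{i,(1..n)} = \sqrt[n]{\prod_{j=1}^{n} w_{i,j}}, $$
where $w_{i,j}$ is the weight of map $i$ learned from training image $j$ and $n$ is the number of
training images. An algorithm for choosing the training images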
is proposed in [6]. It was shown that,
even in complex scenarios, five training images usually suffice; for
ball detection, two
training images already yielded the best performance.
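The averaging over several training images can be sketched as follows. The sketch assumes that each training image has already produced one list of strictly positive weights, one entry per map, in the same order; the function name `average_weights` and this list-of-lists layout are hypothetical.

```python
import math


def average_weights(weights_per_image):
    """Geometric mean of the weights of each map across training images.

    weights_per_image: list with one entry per training image; each entry
    is a list of strictly positive weights, one per map (assumed layout).
    """
    n = len(weights_per_image)
    if n == 0:
        raise ValueError("at least one training image is required")
    num_maps = len(weights_per_image[0])
    if any(len(weights) != num_maps for weights in weights_per_image):
        raise ValueError("all training images must provide the same maps")
    averaged = []
    for i in range(num_maps):
        # n-th root of the product of the n weights for map i.
        product = math.prod(weights[i] for weights in weights_per_image)
        averaged.append(product ** (1.0 / n))
    return averaged


# Two training images, two maps each (toy values).
mean_weights = average_weights([[2.0, 0.5],
                                [8.0, 0.5]])
```

Because the weights are ratios, the geometric mean treats a weight of 2 and a weight of 0.5 as cancelling to 1, which an arithmetic mean would not do.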