Computer vision algorithms, or models, perform three main groups of operations on regular images: image classification, object detection, and semantic/instance segmentation (which we will simply call “segmentation” from now on). Do not worry, you are not required to know anything about algorithm architectures!
How you label depends on the type of model you use: the information your labels feed into the algorithm must be of the same kind as the information you expect at its output.
Finally, the algorithm is trained on a set of images with human-labeled data (training data) and learns to predict classes, bounding boxes, or contours in previously unseen images. Note: currently, only image classification model training is available.
Let us briefly consider a few examples.
1. Image Classification - “What is in the image?”
Here, each image is labeled with the single class it belongs to. This is called single-label classification.
You may also want to find more than one object class in the image. This is called multi-label classification.
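To make the distinction concrete, here is a minimal sketch of what classification labels might look like. The file names, class names, and the `to_one_hot` helper are purely illustrative, not the format of any specific tool.

```python
# Single-label classification: each image maps to exactly one class.
single_label = {
    "cat.jpg": "cat",
    "dog.jpg": "dog",
}

# Multi-label classification: each image maps to every class present in it.
multi_label = {
    "park.jpg": ["dog", "person", "bicycle"],
    "kitchen.jpg": ["cat", "table"],
}

def to_one_hot(labels, classes):
    """Encode a list of class names as a one-hot (or multi-hot) vector."""
    return [1 if c in labels else 0 for c in classes]

classes = ["bicycle", "cat", "dog", "person", "table"]
print(to_one_hot(multi_label["park.jpg"], classes))  # -> [1, 0, 1, 1, 0]
```

The multi-hot vector is what a multi-label classifier typically predicts at its output, which is why the labels must take the same form.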
2. Object detection - “What is in the image and where?”
Here, the task is not only to predict what kind of object is in the image, but also to estimate the coordinates of a rectangular box around the object. Object detection is in a way similar to “multi-label” classification, because we may find several classes of objects in the same image, or even several instances of the same object class.
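A detection label therefore pairs each class with box coordinates. The sketch below assumes a common but hypothetical pixel-coordinate convention, `(x_min, y_min, x_max, y_max)`; actual tools may use other formats (e.g. center plus width/height).

```python
# Object-detection labels: one entry per object instance, so the same class
# can appear several times in one image.
detections = {
    "street.jpg": [
        {"class": "car", "box": (34, 120, 310, 260)},
        {"class": "car", "box": (400, 130, 620, 270)},  # second instance of the same class
        {"class": "person", "box": (150, 80, 210, 240)},
    ],
}

def box_area(box):
    """Area in pixels of an axis-aligned box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return max(0, x_max - x_min) * max(0, y_max - y_min)

for obj in detections["street.jpg"]:
    print(obj["class"], box_area(obj["box"]))
```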
3. Semantic/Instance segmentation - “What is in the image and where exactly?”
Here, the goal is to predict the exact contours of the object. In instance segmentation, each object instance we are looking for is outlined separately, while in semantic segmentation every pixel of the image is assigned a class label.
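The two flavors of segmentation lead to two different label shapes, sketched below with hypothetical data: polygon contours per instance, versus a per-pixel class mask (here a tiny 4x4 grid where 0 is background and 1 is "car").

```python
# Instance segmentation: each object instance has its own polygon contour.
instance_labels = {
    "street.jpg": [
        {"class": "car", "polygon": [(34, 120), (310, 120), (310, 260), (34, 260)]},
        {"class": "car", "polygon": [(400, 130), (620, 130), (620, 270), (400, 270)]},
    ],
}

# Semantic segmentation: every pixel gets a class index.
semantic_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

def class_pixel_count(mask, class_index):
    """Count how many pixels of the mask belong to a given class."""
    return sum(row.count(class_index) for row in mask)

print(class_pixel_count(semantic_mask, 1))  # -> 4
```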
This kind of task is popular in image manipulation and in creating special effects for video, where an object has to be cut out from an image or a video frame.