Weakly Supervised Object Detection (WSOD) is a task that detects objects in an image using a model trained only on image-level annotations. Current state-of-the-art models benefit from self-supervised instance-level supervision, but since weak supervision does not include count or location information, the most common "argmax" labeling method often ignores many instances of objects. To alleviate this issue, we propose a novel multiple instance labeling method called object discovery. We further introduce a new contrastive loss under weak supervision where no instance-level information is available for sampling, called weakly supervised contrastive loss (WSCL). WSCL aims to construct a credible similarity threshold for object discovery by leveraging consistent features for embedding vectors in the same class. As a result, we achieve new state-of-the-art results on MS-COCO 2014 and 2017 as well as PASCAL VOC 2012, and competitive results on PASCAL VOC 2007.
We introduce a novel multiple instance labeling method that addresses the limitations of current labeling methods in WSOD. Our proposed object discovery module explores all proposed candidates using a similarity measure to the highest-scoring representation. We further propose a weakly supervised contrastive loss (WSCL) to set a reliable similarity threshold. WSCL encourages the model to learn similar features for objects in the same class and discriminative features for objects in different classes. To make sure the model learns appropriate features, we provide a large number of positive and negative instances for WSCL through three feature augmentation methods suitable for WSOD.
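The idea behind a class-aware contrastive loss can be illustrated with a minimal sketch. This is not the WSCL formulation from the paper: it is a generic supervised-contrastive-style loss written in NumPy, and it assumes each embedding already carries a (pseudo) class label. The function name `contrastive_loss` and the `temperature` value are hypothetical choices made for illustration only.

```python
import numpy as np


def contrastive_loss(embeddings, labels, temperature=0.2):
    """Generic supervised-contrastive-style loss (illustrative, not the paper's WSCL).

    Embeddings that share a label are treated as positives for each other;
    all remaining embeddings act as negatives.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)

    # L2-normalize so that dot products are cosine similarities.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    logits = z @ z.T / temperature

    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    positives = (labels[:, None] == labels[None, :]) & not_self
    pos_count = positives.sum(axis=1)
    has_pos = pos_count > 0
    if not has_pos.any():
        raise ValueError("need at least one pair of embeddings with the same label")

    # Log-softmax over all other samples (the anchor itself is excluded).
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits) * not_self
    log_prob = logits - np.log(exp.sum(axis=1, keepdims=True))

    # Mean log-probability of the positives, for anchors that have any.
    pos_log_prob = (log_prob * positives).sum(axis=1)[has_pos] / pos_count[has_pos]
    return float(-pos_log_prob.mean())


# Toy embeddings: the first two point one way, the last two another way.
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_consistent = contrastive_loss(toy, [0, 0, 1, 1])    # labels match the geometry
loss_inconsistent = contrastive_loss(toy, [0, 1, 0, 1])  # labels contradict it
```

In this toy setting the loss is lower when embeddings with the same label are already close, which is the property a similarity threshold relies on: same-class features become consistently similar, so a single cutoff can separate them from other classes.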
We compare the proposed method with other state-of-the-art algorithms on MS-COCO 2014 and 2017. Regardless of backbone and dataset, our method achieves new state-of-the-art performance on all evaluation metrics.
We also compare our method with state-of-the-art methods on PASCAL VOC 2007 and 2012, using both Selective Search (SS) and Multiscale Combinatorial Grouping (MCG) proposals.
We provide qualitative results showing how our method, compared to OICR [1], addresses three main challenges of WSOD: part domination, grouped instances, and missing objects.
We compare the predictions of OICR [1] (left) with ours (right). Our model shows much better results on COCO, whose images contain more instances.
We visualize the pseudo ground truths captured by OICR [1], MIST [2], and ours. OICR [1] selects only the top-scoring proposal per category, ignoring all other instances. Although MIST [2] captures multiple objects, it also selects many false positives, e.g., object-like background. Ours successfully captures the true positives.
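The difference between argmax labeling and similarity-based labeling can be sketched for a single category in a single image. The snippet below is a simplified illustration, not the authors' implementation: the function names, the cosine-similarity measure, the fixed `threshold=0.8`, and the toy scores and features are all assumptions made for the example.

```python
import numpy as np


def argmax_labels(scores):
    """Argmax labeling: keep only the single top-scoring proposal."""
    return [int(np.argmax(scores))]


def discover_labels(scores, features, threshold=0.8):
    """Similarity-based labeling (illustrative): keep the top-scoring proposal
    and every proposal whose feature is similar enough to it."""
    f = np.asarray(features, dtype=np.float64)
    f = f / np.linalg.norm(f, axis=1, keepdims=True)  # unit-length features
    top = int(np.argmax(scores))
    similarity = f @ f[top]                            # cosine similarity to the top proposal
    return [int(i) for i in np.flatnonzero(similarity >= threshold)]


# Four hypothetical proposals for one category: proposals 0, 1, and 3 have
# similar features (same kind of object); proposal 2 looks different.
scores = np.array([0.9, 0.4, 0.3, 0.2])
features = np.array([[1.0, 0.1], [0.95, 0.2], [0.1, 1.0], [0.9, 0.0]])

pseudo_gt_argmax = argmax_labels(scores)                # [0]
pseudo_gt_discovery = discover_labels(scores, features)  # [0, 1, 3]
```

Here argmax labeling keeps one pseudo ground truth and misses the two lower-scoring instances, while the similarity-based variant recovers them. How well this works depends on the threshold, since a cutoff that is too low admits dissimilar proposals such as background.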
[1] Tang, Peng, et al. "Multiple Instance Detection Network With Online Instance Classifier Refinement."
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017.
[2] Ren, Zhongzheng, et al. "Instance-Aware, Context-Focused, and Memory-Efficient Weakly Supervised Object Detection."
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020.
@inproceedings{seo2022od-wscl,
author = {Seo, Jinhwan and Bae, Wonho and Sutherland, Danica J. and Noh, Junhyug and Kim, Daijin},
title = {{Object Discovery via Contrastive Learning for Weakly Supervised Object Detection}},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2022}
}