
4. EXPERIMENTAL RESULTS

Most of the evaluation focuses on the personal instance-level accuracy, since our modules do not affect the object-level detection accuracy or the bounding-box regression in any way.

For the sake of clarity, we assume that each input sample contains a single instance. Nonetheless, our method can handle multiple instances per input sample by running the instance-level prototype search independently for each detected object. Provided that the general object detection results are accurate, our results would not change.
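The independent per-object search described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the detection tuple layout, and the injected nearest-prototype routine are all assumptions.

```python
def personalize_detections(detections, prototypes, nearest_prototype):
    """Resolve each detection to a personal instance independently.

    detections: list of (object_class, box, embedding) produced by the
        frozen detector (layout assumed for illustration).
    prototypes: {object_class: {instance_id: prototype_embedding}}.
    nearest_prototype: callable mapping (embedding, candidates) to the
        closest instance_id in the metric space.
    Returns one (box, instance_id) pair per detection.
    """
    results = []
    for obj_cls, box, emb in detections:
        # Object-conditioned search: only prototypes belonging to the
        # predicted object-level class are candidates for this box, so
        # each instance decision is made independently of the others.
        inst = nearest_prototype(emb, prototypes[obj_cls])
        results.append((box, inst))
    return results
```

Because each box is resolved on its own, adding more detections per frame changes only the number of (cheap) prototype searches, not their outcome.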

Unless otherwise stated, we report all results on YOLOv8n, as it is the most suitable model for deployed applications, and in the setting with 2 instances per object-level category.

Same domain. The first scenario we design considers 1-Shot from All Sequences (1SAS); therefore, the same domain is seen during few-shot training (one sample per sequence) and testing (all remaining samples from all sequences). Table 1 reports Acc_i in multiple setups with a variable number of instances per object-level class. First, we observe that gradient-based fine-tuning methods (e.g., FT) are not effective and obtain results comparable to a random classifier (lower bound). OBoI via PFSL methods such as SimpleShot [13] and ProtoNet [12] shows large gains compared to FT by learning a metric space from the extracted features. In both cases, the augmentation of embeddings via our multi-order statistics boosts the recognition accuracy significantly, especially in the presence of multiple instances per object. Remarkably, we can personalize YOLOv8n to achieve 77.08% Acc_i when detecting 18 personal instances from just a few samples and via a backpropagation-free approach, assuming that the detection head outputs a correct object-level classification and bounding-box regression. Fig. 3a reports Acc_i and Acc_o of OBoIs via ProtoNet in the case of 2 instances per object. We consider three configurations for ProtoNet: at the logits level (i.e., the output of the last layer of the detector's head), at the encoder-embedding level (i.e., the output of the detector's encoder), or via our multi-statistics augmented encoder embeddings (i.e., with features augmented via multi-order statistics). We observe that our proposed solution consistently improves over or matches the alternatives on every object class.
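The backpropagation-free pipeline above — augment encoder embeddings with multi-order statistics, build one prototype per personal instance, then match by nearest prototype — can be sketched as follows. The statistic orders (mean, standard deviation, skewness) and the Euclidean matching are illustrative assumptions; the paper's exact pooling and distance are defined in its method section.

```python
import numpy as np

def multi_order_embedding(feat_map):
    """Augment an encoder feature map (C, H, W) with multi-order
    spatial statistics: per-channel mean, standard deviation, and
    skewness (the set of orders here is an assumption)."""
    x = feat_map.reshape(feat_map.shape[0], -1)          # (C, H*W)
    mu = x.mean(axis=1)
    sigma = x.std(axis=1) + 1e-6                         # avoid /0
    skew = (((x - mu[:, None]) / sigma[:, None]) ** 3).mean(axis=1)
    return np.concatenate([mu, sigma, skew])             # (3C,)

def build_prototypes(support_feats, support_ids):
    """ProtoNet-style, backpropagation-free: one prototype per
    personal instance as the mean of its augmented support embeddings."""
    protos = {}
    for inst in set(support_ids):
        embs = [multi_order_embedding(f)
                for f, i in zip(support_feats, support_ids) if i == inst]
        protos[inst] = np.mean(embs, axis=0)
    return protos

def classify(feat_map, protos):
    """Nearest-prototype search in the augmented embedding space."""
    q = multi_order_embedding(feat_map)
    return min(protos, key=lambda k: np.linalg.norm(q - protos[k]))
```

No gradients flow anywhere in this loop, which is what makes the personalization step cheap enough for on-device deployment.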

Other domain. We designed a more realistic, yet challenging, setup considering 1-Shot from 1 Sequence (1S1S) during training and all remaining samples during testing. The model therefore experiences different domains at training time (one sample from the first sequence only) and at testing time (all remaining samples from all sequences). Table 2 summarizes the main results. Reducing the training samples further decreases the accuracy of FT compared to the 1SAS setup. SimpleShot and ProtoNet also show lower accuracy, due to the fewer training samples and the domain gap of the 1S1S setup; nonetheless, they exhibit large gains over FT. Our method significantly improves the performance in every case, even in the presence of a domain shift, and especially with multiple instances per object category. We argue that the gain attained by our approach is lower than in the 1SAS setup due to the difficulty of reliably matching multi-order statistics between a single input sample from a single domain and several target samples from several domains. Fig. 3b reports Acc_o; as in the previous case, we confirm that our solution obtains robust results across most of the classes.

Variable training shots are studied in Fig. 4, where we observe that OBoIs with our augmented encoder embeddings (AEE) improve personal recognition accuracy regardless of the number of available training samples (i.e., shots) for both ProtoNet and SimpleShot.

Other YOLOv8 sizes are evaluated in Table 3 on both the general object detection and the personalized instance recognition tasks. Larger YOLOv8 models improve detection performance, and this correlates with personal instance recognition accuracy. The improvement of larger YOLOv8 models comes at the cost of a significantly larger model size and slower inference: YOLOv8x improves personal recognition by about 25% compared to YOLOv8n, while being about 22× larger and 3.6× slower. The final choice depends on the hardware specifications of the target devices.

Computational inference time of our AEE on top of the OBoI with ProtoNet increases by as little as 0.8%, making our method lightweight with a nearly negligible impact.

Additional ablation studies evaluating our design choices are reported here on ProtoNet. Removing the object-level conditioning lowers Acc_i in the 1SAS setup from 77.1% of our approach (ProtoNet + AEE) to 70.9%, which still represents a relative gain of about 3% over the baseline (68.8%); the drop is due to the larger search space in metric learning. Removing the mask S_{i,k} allows background noise regions to flow into the prototype computation, decreasing accuracy by 2.6% in the 1SAS setup and by even more (5.1%) in the 1S1S setup, since the background varies across sequences.
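The role of the mask S_{i,k} in the ablation above can be illustrated with a small sketch: a binary spatial mask restricts the pooling to the object region before any statistics are computed, so background pixels never contribute to the prototype. The exact form of the mask and the pooling is an assumption here.

```python
import numpy as np

def masked_mean_embedding(feat_map, mask):
    """Pool features only inside a binary object mask of shape (H, W),
    so background regions do not leak into the prototype (a sketch of
    the masking ablated in the text; details are assumptions)."""
    x = feat_map.reshape(feat_map.shape[0], -1)   # (C, H*W)
    m = mask.reshape(-1).astype(bool)             # (H*W,) keep-flags
    return x[:, m].mean(axis=1)                   # (C,) object-only mean
```

Without the mask, the mean is taken over the whole box, so a cluttered or sequence-specific background shifts the prototype — consistent with the larger drop observed in the cross-domain 1S1S setup.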

Another personal dataset (iCWT) is evaluated in Table 4 against the strongest baseline, ProtoNet. Our approach exhibits robust gains across all setups, ranging from 18 to 90 instances.

5. CONCLUSION

In this paper, we introduced the task of few-shot instance-level personalization of object detectors and proposed a new method (OBoI) to personalize detection models to recognize user-specific instances of object categories. OBoI is a backpropagation-free metric-learning approach operating on a multi-order statistics feature space. We believe that this setup and our method could pave the way to personal instance-level detection and stimulate future research and applications.

6. REFERENCES

[1] Ovidiu Vermesan and Joel Bacquet, Internet of Things – The Call of the Edge: Everything Intelligent Everywhere, CRC Press, 2022.

[2] Junfeng He, Khoi Pham, Nachiappan Valliappan, Pingmei Xu, Chase Roberts, Dmitry Lagun, and Vidhya Navalpakkam, “On-device few-shot personalization for real-time gaze estimation,” in ICCVW, 2019.

[3] Nidhi Arora, Daniel Ensslen, Lars Fiedler, Wei Wei Liu, Kelsey Robinson, Eli Stein, and Gustavo Schuler, “The value of getting personalization right - or wrong - is multiplying,” McKinsey & Company, pp. 1–12, 2021.

[4] “An Intelligent At-Home Helper – How the Bespoke Jet Bot™ AI+ Takes Care of Your Pet When You’re Away,” https://tinyurl.com/srjs6zcn.

[5] Raffaello Camoriano, Giulia Pasquale, Carlo Ciliberto, Lorenzo Natale, Lorenzo Rosasco, and Giorgio Metta, “Incremental robot learning of new objects with fixed update time,” in ICRA. IEEE, 2017, pp. 3207–3214.

[6] Vincenzo Lomonaco and Davide Maltoni, “CORe50: a new dataset and benchmark for continuous object recognition,” in CoRL, 2017, pp. 17–26.

[7] Yu-Xiong Wang, Liangke Gui, and Martial Hebert, “Few-shot hash learning for image retrieval,” in ICCVW, 2017, pp. 1228–1237.

[8] Deunsol Jung, Dahyun Kang, Suha Kwak, and Minsu Cho, “Few-shot metric learning: Online adaptation of embedding for retrieval,” in ACCV, 2022.

[9] Jiancai Zhu, Jiabao Zhao, Jiayi Zhou, Liang He, Jing Yang, and Zhi Zhang, “Uncertainty-aware few-shot class-incremental learning,” in ICASSP, 2023, pp. 1–5.

[10] Aymane Abdali, Vincent Gripon, Lucas Drumetz, and Bartosz Boguslawski, “Active learning for efficient few-shot classification,” in ICASSP, 2023, pp. 1–5.

[11] “YOLOv8 by Ultralytics,” https://github.com/ultralytics/, Accessed: 2023-08-20.

[12] Jake Snell, Kevin Swersky, and Richard Zemel, “Prototypical networks for few-shot learning,” NeurIPS, 2017.

[13] Yan Wang, Wei-Lun Chao, Kilian Q Weinberger, and Laurens Van Der Maaten, “Simpleshot: Revisiting nearest-neighbor classification for few-shot learning,” arXiv:1911.04623, 2019.

[14] Sai Yang, Fan Liu, Delong Chen, and Jun Zhou, “Few-shot classification via ensemble learning with multi-order statistics,” IJCAI, 2023.

[15] Umberto Michieli, Pablo Peso Parada, and Mete Ozay, “Online continual learning in keyword spotting for low-resource devices via pooling high-order temporal statistics,” INTERSPEECH, 2023.

[16] Umberto Michieli and Mete Ozay, “Online continual learning for robust indoor object recognition,” IROS, 2023.

[17] Umberto Michieli and Mete Ozay, “HOP to the Next Tasks and Domains for Continual Learning in NLP,” in AAAI, 2024.

[18] Vignesh Kothapalli, Ebrahim Rasromani, and Vasudev Awatramani, “Neural collapse: A review on modelling principles and generalization,” TMLR, 2023.

[19] Vardan Papyan, XY Han, and David L Donoho, “Prevalence of neural collapse during the terminal phase of deep learning training,” PNAS, vol. 117, no. 40, pp. 24652–24663, 2020.

[20] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick, “Microsoft COCO: Common objects in context,” in ECCV. Springer, 2014, pp. 740–755.

[21] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al., “The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale,” IJCV, vol. 128, no. 7, pp. 1956–1981, 2020.

[22] Giulia Pasquale, Carlo Ciliberto, Francesca Odone, Lorenzo Rosasco, and Lorenzo Natale, “Are we done with object recognition? The iCub robot’s perspective,” Robotics and Autonomous Systems, vol. 112, pp. 260–281, 2019.

[23] Athanasios Papoulis and S Unnikrishna Pillai, Probability, random variables and stochastic processes, 2002.

[24] Umberto Michieli and Pietro Zanuttigh, “Continual semantic segmentation via repulsion-attraction of sparse and disentangled latent representations,” in CVPR, 2021, pp. 1114–1124.

[25] Umberto Michieli and Pietro Zanuttigh, “Incremental learning techniques for semantic segmentation,” in ICCVW, 2019.

[26] Timothée Lesort, Vincenzo Lomonaco, Andrei Stoian, Davide Maltoni, David Filliat, and Natalia Díaz-Rodríguez, “Continual learning for robotics: Definition, framework, learning strategies, opportunities and challenges,” Information Fusion, 2020.

Authors:

(1) Umberto Michieli, Samsung Research UK;

(2) Jijoong Moon, Samsung Research Korea;

(3) Daehyun Kim, Samsung Research Korea;

(4) Mete Ozay, Samsung Research UK.


This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.