Selection of relevant information to improve image classification using Bag of Visual Words

Author:
  1. Fidalgo Fernández, Eduardo
Supervised by:
  1. Enrique Alegre Gutiérrez Director
  2. Víctor González Castro Co-director

Defence university: Universidad de León

Defence date: 09 December 2015

Defence committee:
  1. José María Sebastián Zúñiga Chair
  2. Manuel Castejón Limas Secretary
  3. David Martín Gómez Committee member

Type: Thesis


In this thesis we propose several approaches to refine the extraction of features related to the objects of interest present in an image, with the final objective of improving image classification when the Bag of Visual Words (BoW) framework is used.

One of the most common descriptors used in the BoW framework is SIFT, which is frequently combined with other features to improve image classification. This is the case of Edge-SIFT, which extracts SIFT descriptors from an edge image obtained with the compass operator. The resulting edges depend on the radius, one of the parameters of this operator. In this thesis we evaluate how different radius values of the compass operator used to compute Edge-SIFT descriptors affect image classification. We demonstrate that the radius recommended in the literature is not the most suitable in most situations, and that choosing its value for each image increases classification accuracy even further. Finally, we propose a method to estimate a value for the compass radius that yields better classification accuracy than the one obtained using the radius recommended in the literature.

The second main research line in this thesis deals with how to remove, or filter out, irrelevant information using different strategies based on masks obtained from a single saliency map. When dense SIFT descriptors are extracted from the whole image, they contain information coming from the background that makes the correct classification of the objects of interest more difficult. We present several filtering strategies based on a single saliency map and the separate dictionaries that can be created using foreground or background features, i.e. features extracted from points inside or outside the salient region, respectively. The presented strategies start by removing individual image keypoints based on these foreground and background dictionaries.
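The core of the keypoint-removal idea can be illustrated with a minimal sketch. This is not the thesis's actual pipeline: the threshold value, the coordinate convention, and the choice of a simple binary mask are assumptions made only for illustration, and a real system would use dense SIFT descriptors and learned foreground/background dictionaries rather than raw coordinates.

```python
import numpy as np

def filter_keypoints_by_saliency(keypoints, saliency, threshold=0.5):
    """Split keypoints into foreground and background sets using a
    binary mask obtained by thresholding a saliency map in [0, 1].
    `keypoints` is an (N, 2) array of (row, col) pixel coordinates."""
    mask = saliency >= threshold          # foreground = salient pixels
    rows, cols = keypoints[:, 0], keypoints[:, 1]
    inside = mask[rows, cols]             # True where keypoint is salient
    return keypoints[inside], keypoints[~inside]

# Toy example: an 8x8 saliency map with a salient central region.
saliency = np.zeros((8, 8))
saliency[2:6, 2:6] = 1.0
kps = np.array([[3, 3], [0, 0], [5, 5], [7, 1]])
fg, bg = filter_keypoints_by_saliency(kps, saliency)
# fg holds (3, 3) and (5, 5); bg holds (0, 0) and (7, 1)
```

Only the foreground keypoints would then be passed on to descriptor extraction and BoW encoding, discarding background clutter before it can pollute the visual dictionary.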
They continue by filtering semantic attention regions using two different methods: one based on the intersection of saliency map regions, and the other based on keypoint voting, again using the foreground and background dictionaries. As we present in the corresponding chapter, all of them produced very successful results.

Our last research line takes the previous one a step further. It explores how more than one saliency map, and several information levels within each of them, can be used and combined to improve image classification. A saliency map can be considered as a topographic surface that represents the level of visual attention. The amount of information found at different “heights” of this surface does not have the same relevance for image classification. We demonstrate how the information extracted at different “heights” of a saliency map affects the classification, and we refer to these levels of information as saliency information slices (SIS). After obtaining the global accuracy for each individual information slice using BoW to classify several image datasets, we demonstrate that their combination provides better results than using the features of each slice independently. When these SIS are combined, we also found that increasing the number of slices does not lead to an accuracy improvement, and that combining slices from different saliency maps yields accuracies that lie between those obtained when the maps are not combined.

We believe that all our contributions will enhance feature selection and provide the research community with several alternatives to overcome common problems in the initial stages of the image classification process, whether used standalone or combined with other strategies.
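The notion of slicing a saliency map by attention “height” can be sketched as follows. The equal-width banding used here is an assumption for illustration only; the thesis may define its slices differently, and the toy map is invented.

```python
import numpy as np

def saliency_slices(saliency, n_slices=3):
    """Partition a saliency map (values in [0, 1]) into `n_slices`
    equal-width bands of attention "height", returning one boolean
    mask per slice, ordered from least to most salient."""
    edges = np.linspace(0.0, 1.0, n_slices + 1)
    masks = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        if hi < 1.0:
            masks.append((saliency >= lo) & (saliency < hi))
        else:
            # top slice is closed so that saliency == 1.0 is included
            masks.append((saliency >= lo) & (saliency <= hi))
    return masks

# Toy 2x2 saliency map; each pixel falls in exactly one slice.
saliency = np.array([[0.1, 0.4],
                     [0.7, 1.0]])
masks = saliency_slices(saliency, n_slices=3)
```

Features could then be extracted per slice, classified with BoW independently to measure each slice's contribution, and finally combined, mirroring the experiments described above.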