Applications of scene text spotting to the darknet and industry 4.0

Blanco Medina, Pablo

Supervised by:

Enrique Alegre Gutiérrez Director
Eduardo Fidalgo Fernández Director

Defence university: Universidad de León

Defence date: 11 December 2023

Committee:

David Martín Gómez Chair
Laura Fernández Robles Secretary
Noelia Vállez Enano Committee member

Department:

DEP. INGENIERÍA ELÉCTRICA Y SIST Y AUTOM

Type: Thesis

Abstract

In this thesis, we work on the task of Text Spotting, within the field of Computer Vision. We propose new algorithms, methods, and datasets that can be used to detect, recognize, and enhance text character sequences found within images, based on the need for information retrieval on systems that cannot crawl or access such information by any means other than a graphical representation. Motivated by our work alongside the Spanish National Cybersecurity Institute (INCIBE), we focus our research on recovering character sequences found within visual media from both darknet and industrial sources. We intend to support INCIBE products and services related to cybersecurity that may monitor potential illegal activities and critical infrastructures. To improve scene text recognition performance, we analyze images in terms of their irregularity, because methods often claim to be robust on irregular datasets that contain a large amount of irregular text. After building a classification model for these categories, we created a new dataset, the Fully Irregular Text (FIT-Text) dataset, composed primarily of irregular images, with the intention that other methods oriented to this problem can use it to evaluate their performance. We propose a new performance metric, the Contained-Levenshtein (C-Lev) accuracy. Scene text recognizers in the literature have traditionally reported both accuracy and normalized edit distance on datasets as performance metrics, but have never combined the two into a single, effective metric that can help distinguish between severe and low-priority mistakes. C-Lev also serves as a label-checking tool, helping methods stay robust against minor human-generated labeling errors. To increase scene text accuracy, we propose the integration of string-distance measurements as components of the loss functions in both CTC and Attention recognizers.
Testing various distances as the proposed weight, we find the Hamming distance the most beneficial, with a total improvement of over 6% in accuracy on literature datasets. For scene text detectors, we propose a new metric that assigns value to scene text images according to their documented regions, the Text Density Distribution (TDD), which classifies visual media according to the spatial distribution of region clusters. We also propose using this metric to train scene text detectors, while monitoring the balance between their computational cost and performance. We note that the detection F1 score only drops 4% when using less than 30% of the training dataset, reducing the computational cost to below half of that of the original approaches and showing that scene text detectors can perform just as well with reduced data. In our last contribution, we implement morphological operation layers in scene text systems, both to make discarded regions more visible to any method and to reduce the number of text-like false negatives. Since such operations can negatively impact the recognition stage of end-to-end systems, we combine these techniques with our previous recognition contributions, improving performance in end-to-end systems by up to 1.5% with opening operations and smaller kernels. We also assist INCIBE in the classification of industrial screenshots into pre-established types, before post-processing techniques can be applied to further decision-making processes. Our proposals focus on computer vision, machine learning, data analysis, and data mining techniques, resulting in the creation of four datasets: TOICO-1K, related to the Tor darknet; CRINF-300 and CRINF-Text, for images related to the field of Industry 4.0; and FIT-Text, for general scene text tasks focused on irregular-only text. Using TOICO-1K, we evaluate the performance of scene text detectors, recognizers, end-to-end systems, and Optical Character Recognition (OCR) systems on Tor images.
We highlight the areas where each approach can be best utilized and the images and contexts in which they struggle most, proposing enhancements such as rectification, super-resolution, and string-matching techniques. Our CRINF-300 and CRINF-Text datasets provide a context for the classification of industrial system screenshots and the application of end-to-end scene text spotting within logging systems, using fine-tuning and transfer learning to create robust classifiers.
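
The abstract refers to three standard string measurements: word accuracy, normalized edit distance, and the Hamming distance. The following is a minimal Python sketch of those textbook definitions only. It is not the C-Lev formula or the loss weighting proposed in the thesis, neither of which is specified in this record. The function names, the handling of unequal-length strings in `hamming`, and the sample strings are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_edit_distance(pred: str, truth: str) -> float:
    """Edit distance divided by the longer length (0.0 for two empty strings)."""
    longest = max(len(pred), len(truth))
    return levenshtein(pred, truth) / longest if longest else 0.0


def hamming(a: str, b: str) -> int:
    """Position-wise mismatches. Assumption: a length difference counts
    as that many extra mismatches, since the classic definition only
    covers equal-length strings."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def word_accuracy(preds: list[str], truths: list[str]) -> float:
    """Fraction of predictions that match their label exactly."""
    return sum(p == t for p, t in zip(preds, truths)) / len(truths)


# Hypothetical recognizer outputs and labels, for illustration only.
preds = ["exit", "0pen", "xxxx"]
truths = ["exit", "open", "open"]
per_word_ned = [normalized_edit_distance(p, t) for p, t in zip(preds, truths)]
```

In this toy case, word accuracy scores "0pen" and "xxxx" identically as failures, whereas the normalized edit distance separates them (0.25 against 1.0). That gap between a minor and a severe mistake is the one the abstract says a combined metric is meant to capture.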