
Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes

Pengyuan Lyu, Minghui Liao, Cong Yao, Wenhao Wu, Xiang Bai
ECCV 2018 [pdf]

 

Abstract

      Recently, models based on deep neural networks have dominated the fields of scene text detection and recognition. In this paper, we investigate the problem of scene text spotting, which aims at simultaneous text detection and recognition in natural images. We propose an end-to-end trainable neural network model for scene text spotting. The proposed model, named Mask TextSpotter, is inspired by the recently published Mask R-CNN. Unlike previous methods that also accomplish text spotting with end-to-end trainable deep neural networks, Mask TextSpotter benefits from a simple and smooth end-to-end learning procedure, in which precise text detection and recognition are achieved via semantic segmentation. Moreover, it is superior to previous methods in handling text instances of irregular shapes, such as curved text. Experiments on ICDAR2013, ICDAR2015 and Total-Text demonstrate that the proposed method achieves state-of-the-art results in both scene text detection and end-to-end text recognition tasks.

 

Method

Illustration of the architecture of the method
Illustration of the mask branch. It consists of four convolutional layers, one de-convolutional layer, and a final convolutional layer that predicts maps of 38 channels (1 for the global text instance map; 36 for character maps; 1 for the background map of characters)
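
As a concrete illustration, here is a minimal PyTorch sketch of such a mask branch. Only the layer counts and the 38-channel output follow the caption; the kernel sizes, channel widths, and 2x upsampling factor are assumptions made for this sketch, not the authors' exact settings.

import torch
import torch.nn as nn

class MaskBranch(nn.Module):
    # Mask branch sketch: four conv layers, one de-conv layer, and a
    # final conv predicting 38 maps (1 global text instance map,
    # 36 character maps, 1 character-background map).
    def __init__(self, in_channels=256, hidden=256):
        super().__init__()
        layers = []
        for i in range(4):
            layers.append(nn.Conv2d(in_channels if i == 0 else hidden,
                                    hidden, kernel_size=3, padding=1))
            layers.append(nn.ReLU(inplace=True))
        self.convs = nn.Sequential(*layers)
        # De-convolution upsampling the feature map by 2x (assumed factor).
        self.deconv = nn.ConvTranspose2d(hidden, hidden, kernel_size=2, stride=2)
        self.predict = nn.Conv2d(hidden, 38, kernel_size=1)

    def forward(self, x):
        x = self.convs(x)
        x = torch.relu(self.deconv(x))
        maps = torch.sigmoid(self.predict(x))
        global_map = maps[:, 0:1]     # 1 channel: global text instance map
        char_maps = maps[:, 1:37]     # 36 channels: character classes
        char_bg_map = maps[:, 37:38]  # 1 channel: background map of characters
        return global_map, char_maps, char_bg_map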
Label generation of the mask branch. Left: the blue box is a proposal yielded by the RPN, the red polygon and yellow boxes are the ground-truth polygon and character boxes, and the green box is the minimal horizontal rectangle that covers the polygon. Right: the global map (top) and the character map (bottom)
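
A hypothetical sketch of generating these two targets for one matched proposal is given below, assuming the ground-truth polygon and character boxes have already been cropped and rescaled into the proposal's coordinate frame; the 28x28 target resolution and the character-box shrinking ratio are illustrative assumptions, not the authors' exact settings.

import numpy as np
import cv2

def make_mask_targets(polygon, char_boxes, size=28, shrink=0.25):
    # polygon: (N, 2) array of vertices; char_boxes: list of
    # (class_id, (x1, y1, x2, y2)) tuples with class ids in 1..36.
    # Global text instance map: 1 inside the ground-truth polygon.
    global_map = np.zeros((size, size), dtype=np.uint8)
    cv2.fillPoly(global_map, [polygon.astype(np.int32)], 1)
    # Character map: each shrunk character box is filled with its
    # class id; 0 is left as background.
    char_map = np.zeros((size, size), dtype=np.uint8)
    for cls, (x1, y1, x2, y2) in char_boxes:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        w, h = (x2 - x1) * shrink, (y2 - y1) * shrink
        cv2.rectangle(char_map,
                      (int(cx - w / 2), int(cy - h / 2)),
                      (int(cx + w / 2), int(cy + h / 2)),
                      int(cls), thickness=-1)
    return global_map, char_map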
Overview of the pixel voting algorithm. Left: the predicted character maps; right: for each connected region, we calculate a score for each character class by averaging the probability values within that region
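
A minimal sketch of this voting step is shown below, assuming char_maps is a (36, H, W) array of per-class probabilities and char_bg_map an (H, W) character-background probability map; the 0.5 threshold and the use of scipy's connected-component labeling are assumptions of the sketch.

import numpy as np
from scipy import ndimage

CHARSET = "0123456789abcdefghijklmnopqrstuvwxyz"  # 36 character classes

def pixel_voting(char_maps, char_bg_map, threshold=0.5):
    # Character pixels are those where the background probability is low.
    labeled, num = ndimage.label(char_bg_map < threshold)
    chars = []
    for region_id in range(1, num + 1):
        mask = labeled == region_id
        # Average each character map over this connected region and
        # keep the class with the highest mean score.
        scores = char_maps[:, mask].mean(axis=1)
        best = int(scores.argmax())
        xs = np.nonzero(mask)[1]  # column indices, for reading order
        chars.append((xs.mean(), CHARSET[best]))
    # Order the regions left to right to form the output string.
    chars.sort(key=lambda c: c[0])
    return "".join(c for _, c in chars)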

 

Results 

Visualization results on ICDAR 2013 (left), ICDAR 2015 (middle) and Total-Text (right)
Results on ICDAR2013. “S”, “W” and “G” denote recognition with the strong, weak and generic lexicons, respectively
Results on ICDAR2015. “S”, “W” and “G” denote recognition with the strong, weak and generic lexicons, respectively
Detection results on ICDAR2013 and ICDAR2015. For ICDAR2013, all methods are evaluated under the “DetEval” evaluation protocol. The short side of the input images in “Ours (det only)” and “Ours” is set to 1000 pixels
Results on Total-Text. “None” means recognition without any lexicon; the “Full” lexicon contains all words in the test set
Qualitative comparisons on Total-Text without lexicon. Top: results of TextBoxes [1]; bottom: our results
Ablation results. “Ours (a)”: trained without character annotations from the real images; “Ours (b)”: evaluated without the weighted edit distance
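
The weighted edit distance referenced in “Ours (b)” matches the recognized string against lexicon words using the character probabilities from pixel voting, rather than uniform edit costs. A minimal sketch under assumed cost definitions follows; probs is taken to be a per-position dict of character probabilities, and the exact cost formulas are illustrative, not the paper's.

def weighted_edit_distance(pred, probs, word):
    # pred: recognized string; probs[i]: dict of char -> probability at
    # position i; word: lexicon candidate. Lower distance = better match.
    n, m = len(pred), len(word)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        # Deleting a high-confidence prediction is expensive.
        dp[i][0] = dp[i - 1][0] + probs[i - 1].get(pred[i - 1], 1.0)
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + 1.0  # unit insertion cost (assumed)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Substituting in a lexicon character is cheap when that
            # character already has high probability at this position.
            sub = 0.0 if pred[i - 1] == word[j - 1] \
                else 1.0 - probs[i - 1].get(word[j - 1], 0.0)
            dp[i][j] = min(
                dp[i - 1][j] + probs[i - 1].get(pred[i - 1], 1.0),  # delete
                dp[i][j - 1] + 1.0,                                 # insert
                dp[i - 1][j - 1] + sub,                             # substitute
            )
    return dp[n][m]

def match_lexicon(pred, probs, lexicon):
    # Pick the lexicon word with the smallest weighted distance.
    return min(lexicon, key=lambda w: weighted_edit_distance(pred, probs, w))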

BibTeX

@article{lyu2018mask,
  title={Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes},
  author={Lyu, Pengyuan and Liao, Minghui and Yao, Cong and Wu, Wenhao and Bai, Xiang},
  journal={arXiv preprint arXiv:1807.02242},
  year={2018}
}
