GSoC Chronicles — commit the CRNN cometh the Text

Pulkit Mishra
Aug 18, 2020

This week's work mainly focused on creating REST APIs for Text Extraction from images and videos. This blog documents how exactly this has been implemented, what was learned in the process, and what can be improved in the future.


Scene text is the text that appears in an image captured by a camera in an outdoor environment. The problem of extracting text from such “in the wild” photos is called Scene Text Extraction. As opposed to text in documents, text in natural scenes exhibits much higher diversity and variability in terms of languages, colors, fonts, sizes, orientations, shapes, aspect ratios, and layouts. Also, the backgrounds of natural scenes are virtually unpredictable: they may contain patterns that are extremely similar to text, or foreign objects that occlude text, both of which can lead to confusion and mistakes. Add to the mix poor imaging conditions: text instances may be of low resolution and severely distorted due to an inappropriate shooting distance or angle, blurred because the shot is out of focus or shaky, noisy on account of low light, or corrupted by highlights or shadows. All of this makes Scene Text Extraction an extremely hard problem to solve.

As highlighted in the survey paper, there are two ways of approaching this problem.

First, and quite naturally, it can be broken down into two subproblems: Scene Text Detection and Scene Text Recognition. Scene Text Detection localizes the text by finding the bounding boxes of all regions that contain text in an image (or a video frame). Scene Text Recognition then crops out these localized regions, whose text can be in different fonts, and recognizes the text in them. The diagram below gives a top-level view of the entire flow.

Workflow of Scene Text Extraction as implemented in PMR

Poor Man’s Rekognition (PMR) in its current state follows this two-stage approach. It is worth noting that Scene Text Extraction from videos uses the same models: the video is broken down into frames and inference is run on each of them. So basically, the above flow runs for each frame to produce the output for a video. We’ll delve deeper into Scene Text Detection and Scene Text Recognition and see exactly how the neural nets for these two tasks have been trained and deployed.
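The two-stage flow described above can be sketched in a few lines of Python. This is only a minimal illustration and not PMR's actual code: `detect` and `recognize` are hypothetical stand-ins for the detection and recognition models, images are modelled as plain nested lists, and boxes are assumed to be axis-aligned `(x_min, y_min, x_max, y_max)` tuples.

```python
from typing import Callable, Iterable, List, Tuple

# Assumption for illustration: an image is a list of pixel rows and a box
# is an axis-aligned (x_min, y_min, x_max, y_max) tuple.
Image = List[List[int]]
Box = Tuple[int, int, int, int]
Detector = Callable[[Image], List[Box]]
Recognizer = Callable[[Image], str]


def crop(image: Image, box: Box) -> Image:
    """Cut the region described by `box` out of `image`."""
    x_min, y_min, x_max, y_max = box
    return [row[x_min:x_max] for row in image[y_min:y_max]]


def extract_text(image: Image, detect: Detector,
                 recognize: Recognizer) -> List[Tuple[Box, str]]:
    """Stage 1 localizes text regions, stage 2 reads each cropped region."""
    return [(box, recognize(crop(image, box))) for box in detect(image)]


def extract_text_from_video(frames: Iterable[Image], detect: Detector,
                            recognize: Recognizer) -> List[List[Tuple[Box, str]]]:
    """Run the same image pipeline independently on every frame."""
    return [extract_text(frame, detect, recognize) for frame in frames]
```

Because the recognizer only ever sees what the detector cropped, any detection mistake carries straight through to the final text, which is the error propagation that end-to-end approaches try to avoid.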

However, before we do that, the second approach, an end-to-end approach, must also be talked about, as it overcomes a major drawback of the two-stage approach by preventing the propagation of error between the detection and recognition models. Even among end-to-end approaches, the large majority of the work uses a two-stage pipeline where cropped feature maps, instead of images, are fed to the recognition modules. The first and most significant work on a one-stage end-to-end pipeline is by Xing et al. (2019), who predict character and text bounding boxes as well as character type segmentation maps in parallel. The text bounding boxes are then used to group character boxes to form the final word transcription results. As shown in Papers With Code, this approach outperforms almost all other approaches on prominent benchmarks like ICDAR and COCO-Text. However, the code for the same hasn’t been released yet, and the CharNet model is licensed as CC-BY-NC 4.0, which is not compatible with the GPL (the license that PMR follows). Thus it only seemed prudent to stick with the two-stage approach as described above for now.

Scene Text Detection

It is safe to say that most of the work on scene text detection is derived from general object detection. Algorithms are designed by modifying the region proposal and bounding box regression modules of general detectors to localize text instances directly. They mainly consist of stacked convolutional layers that encode the input images into feature maps, with each spatial location of the feature map corresponding to a region of the input image. The feature maps are then fed into a classifier to predict the existence and localization of text instances at each such spatial location.

Among one-stage detectors, SSD inspires TextBoxes by Liao et al., 2017, which tries to fit the varying orientations and aspect ratios of text by defining default boxes as quadrilaterals with different aspect-ratio specs. EAST by Zhou et al., 2017 simplifies anchor-based detection by adopting U-Net to integrate features from different levels. Input images are encoded as one multichannel feature map instead of multiple layers of different spatial sizes as in SSD. The feature at each spatial location is then used to regress the rectangular or quadrilateral bounding box of the underlying text instance directly. With its highly simplified pipeline, EAST is able to perform inference at real-time speed. Other methods adapt the two-stage object detection framework of R-CNN, where the second stage corrects the localization results based on features obtained by Region of Interest (ROI) pooling. This improves accuracy considerably, but efficiency takes a hit.

However, the detection of scene text has a different set of characteristics and challenges that require unique methodologies and solutions. Thus, the latest methods also rely on special representations based on sub-text components to solve the challenges of long text and irregular text. In pixel-level methods such as PixelLink, a dense prediction map is generated that indicates whether each pixel in the original image belongs to a text instance or not. Basically, they can be seen as a special case of instance segmentation. Connectionist Text Proposal Network (CTPN) is the torchbearer for component-level methods. CTPN models inherit the ideas of anchoring and of recurrent neural networks for sequence labeling: they stack an RNN on top of CNNs. Each position in the final feature map represents features in the region specified by the corresponding anchor. Assuming that text appears horizontally, each row of features is fed into an RNN and labeled as text/non-text. Overall, detection based on sub-text components enjoys better flexibility and generalization ability over shapes and aspect ratios of text instances. The main drawback is that the module or post-processing step used to group segments into text instances may be vulnerable to noise, and the efficiency of this step depends heavily on the actual implementation and may therefore vary among platforms.

Keeping all of this in mind, TextBoxes++: A Single-Shot Oriented Scene Text Detector has been implemented in PMR for scene text detection. It is heavily inspired by the TensorFlow implementation of the code that has been released by the authors. TextBoxes++ was chosen because of its ability to detect arbitrary-oriented scene text with both high accuracy and efficiency in a single network forward pass. There is no post-processing required other than an efficient non-maximum suppression. TextBoxes++ replaces the rectangular box representation of a conventional object detector with a quadrilateral or oriented rectangle representation. To achieve a better receptive field that covers text regions, which are usually long, long (non-square) convolutional kernels are used. Thus, TextBoxes++ directly outputs word bounding boxes at multiple layers by jointly predicting text presence and coordinate offsets to anchor boxes. The final outputs are produced after applying non-maximum suppression on all boxes. It builds upon the earlier mentioned TextBoxes, adding the ability to handle arbitrary-oriented text.
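To make the non-maximum suppression step concrete, here is a minimal greedy sketch in plain Python. It is a simplification under stated assumptions: boxes are axis-aligned `(x_min, y_min, x_max, y_max)` tuples, whereas TextBoxes++ works with quadrilaterals or rotated rectangles, and the `0.5` overlap threshold is just an illustrative default, not a value taken from PMR.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes: Sequence[Box], scores: Sequence[float],
                        iou_threshold: float = 0.5) -> List[int]:
    """Return indices of the boxes to keep, highest score first.

    A box is kept only if it does not overlap an already kept,
    higher-scoring box by more than `iou_threshold`.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

For example, of two heavily overlapping detections of the same word only the higher-scoring one survives, while a detection elsewhere in the image is left untouched.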

Architecture of TextBoxes++

The architecture of TextBoxes++ as shown above is essentially a fully convolutional network. It inherits from the popular VGG-16 architecture, keeping the 13 layers from conv1_1 through conv5_3 and converting the last two fully-connected layers of VGG-16 into convolutional layers by down-sampling their parameters. Eight more convolutional layers, divided into four stages (conv8 to conv11) with different resolutions obtained by max-pooling, are appended after conv7. After that, we have 6 text-box layers connected to 6 intermediate convolutional layers. Each location of a text-box layer predicts an n-dimensional vector for each default box, consisting of the text presence scores (2 dimensions), horizontal bounding rectangle offsets (4 dimensions), and either rotated rectangle bounding box offsets (5 dimensions) or quadrilateral bounding box offsets (8 dimensions). Non-maximum suppression is applied during the test phase to merge the results of all 6 text-box layers. Note that “#c” stands for the number of channels.
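The size of that per-box prediction vector follows directly from the dimensions listed above: 2 + 4 + 8 = 14 values for the quadrilateral representation, or 2 + 4 + 5 = 11 for the rotated rectangle one. The short sketch below just spells out this arithmetic. The ordering of the components inside the vector and all the names used are assumptions made for illustration, not something taken from the TextBoxes++ code.

```python
from typing import Dict, List, Sequence

SCORE_DIMS = 2            # text / non-text presence scores
HORIZONTAL_RECT_DIMS = 4  # offsets of the horizontal bounding rectangle
ORIENTED_DIMS = {"quadrilateral": 8, "rotated_rectangle": 5}


def prediction_dims(representation: str) -> int:
    """Length of the vector predicted for one default box."""
    return SCORE_DIMS + HORIZONTAL_RECT_DIMS + ORIENTED_DIMS[representation]


def split_prediction(vector: Sequence[float],
                     representation: str) -> Dict[str, List[float]]:
    """Split one prediction vector into its parts (assumed ordering)."""
    if len(vector) != prediction_dims(representation):
        raise ValueError("unexpected prediction length")
    rect_end = SCORE_DIMS + HORIZONTAL_RECT_DIMS
    return {
        "scores": list(vector[:SCORE_DIMS]),
        "horizontal_rect": list(vector[SCORE_DIMS:rect_end]),
        "oriented": list(vector[rect_end:]),
    }
```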

Most deep learning models are data-thirsty: they perform well only when enough data is available. In the field of text detection and recognition, this problem is more urgent since most human-labeled datasets are small, usually containing merely around 1K to 2K data instances. Thus, the model has been trained on the Synth90k dataset, which contains 8 million training images and their corresponding ground truth words.

Scene Text Recognition

As opposed to text detection, in case of recognition, the approaches largely revolve around convolutional recurrent neural networks. CNNs are used to encode images into feature spaces and the decoding module can either employ Connectionist Temporal Classification or the encoder-decoder framework.

The CTC decoding module is adopted from speech recognition. To apply it to scene text recognition, the input image is viewed as a sequence of vertical pixel frames. The network outputs a per-frame prediction: a probability distribution over label types for each frame. The CTC rule (merge repeated labels, then remove blanks) is then applied to turn the per-frame prediction into a text string. During training, the loss is the negative log of the total probability of all per-frame predictions that generate the target sequence under the CTC rule. CRNNs are built by stacking RNNs on top of CNNs and use CTC for training and inference. It is worth highlighting that both CTC and encoder-decoder frameworks were originally designed for 1-dimensional sequential input data, and are therefore applicable to the recognition of straight and horizontal text, which can be encoded into a sequence of feature frames by CNNs without losing important information. However, characters in oriented and curved text are distributed over a 2-dimensional space. It remains an open challenge to effectively represent oriented and curved text in feature spaces in order to fit the CTC and encoder-decoder frameworks, whose decoders require 1-dimensional inputs. For oriented and curved text, directly compressing the features into a 1-dimensional form may lose relevant information and bring in noise from the background, leading to inferior recognition accuracy.
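The CTC rule is easy to show in code. Below is a minimal sketch of greedy (best path) CTC decoding in plain Python: pick the most likely label for each frame, merge consecutive repeats, then drop the blank label. The blank symbol `-` and the toy alphabet in the usage note are assumptions for illustration. Production implementations usually rely on a framework's built-in CTC ops, often with beam search instead of this greedy variant.

```python
from itertools import groupby
from typing import Iterable, Sequence

BLANK = "-"  # assumed symbol for the CTC blank label


def ctc_collapse(frame_labels: Iterable[str]) -> str:
    """Apply the CTC rule: merge consecutive repeats, then remove blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != BLANK)


def ctc_greedy_decode(frame_probs: Sequence[Sequence[float]],
                      alphabet: Sequence[str]) -> str:
    """Take the most probable label in every frame, then collapse."""
    best_path = [
        alphabet[max(range(len(probs)), key=probs.__getitem__)]
        for probs in frame_probs
    ]
    return ctc_collapse(best_path)
```

With this rule, the per-frame prediction `--hh-e-l-ll-oo--` collapses to `hello`. The blank between the two runs of `l` is what keeps the double letter from being merged into one.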

CRNN has been used in TextBoxes++ and thus the same has been implemented in PMR for scene text recognition. It is heavily inspired by this TensorFlow implementation of CRNN. It can naturally handle sequences of arbitrary length and is not confined to any predefined lexicon. The reasoning behind using a CRNN for scene text recognition is that scene text tends to occur in the form of a sequence, not in isolation. Unlike general object recognition, recognizing such sequence-like objects often requires the system to predict a series of object labels instead of a single label. Therefore, recognition of such objects can be naturally cast as a sequence recognition problem. Another unique property of sequence-like objects is that their lengths may vary drastically. Consequently, popular deep models like DCNNs cannot be directly applied to sequence prediction, since they often operate on inputs and outputs with fixed dimensions and are thus incapable of producing a variable-length label sequence. A Convolutional Recurrent Neural Network is used instead since it combines a DCNN with an RNN: it can learn directly from sequence labels (like words) and requires no detailed annotations (like characters). Like a DCNN, it learns informative representations directly from image data, requiring neither hand-crafted features nor preprocessing steps like binarization/segmentation or component localization. Like an RNN, it can produce a sequence of labels and is thus not constrained by the lengths of sequence-like objects.

CRNN network architecture

The architecture consists of three parts: convolutional layers, which extract a feature sequence from the input image; recurrent layers, which predict a label distribution for each frame; and a transcription layer, which translates the per-frame predictions into the final label sequence. This model has also been trained on the Synth90k dataset, just like the detection model.

A post-processing step has been added that uses the wordninja library to probabilistically split concatenated words into their constituent words.
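To give a feel for what such a probabilistic split does, here is a small self-contained sketch. It is not wordninja's implementation: the vocabulary, the rank-based cost `log(rank + 2)`, and the penalty for unknown characters are all assumptions chosen for illustration. The idea is simply to pick, via dynamic programming, the segmentation whose words have the lowest total cost, where more frequent words are cheaper.

```python
from math import log
from typing import Dict, List, Sequence

UNKNOWN_CHAR_COST = 10.0  # assumed penalty per character of an unknown piece


def build_costs(words_by_frequency: Sequence[str]) -> Dict[str, float]:
    """Assign lower cost to more frequent words (most frequent word first)."""
    return {word: log(rank + 2) for rank, word in enumerate(words_by_frequency)}


def split_words(text: str, costs: Dict[str, float]) -> List[str]:
    """Split `text` into the word sequence with the lowest total cost."""
    max_len = max(map(len, costs))
    best = [0.0]  # best[i] = lowest cost of segmenting text[:i]
    back = [0]    # back[i] = start index of the last word in that segmentation
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            cost = costs.get(piece, UNKNOWN_CHAR_COST * len(piece))
            candidates.append((best[start] + cost, start))
        cost, start = min(candidates)
        best.append(cost)
        back.append(start)
    # Walk the back-pointers from the end of the string to recover the words.
    words: List[str] = []
    end = len(text)
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]
```

With the toy vocabulary `["the", "no", "entry", "parking", "exit", "park", "king", "in"]`, the recognized string `noparking` is split into `no` and `parking`.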

API Creation

The work is summarised in the following PRs:

  1. Scene Text Detection to localize text in images (PR #93)
  2. Scene Text Recognition to read the localized text (PR #92)
  3. Django REST API for text detection and recognition (PR #98)
  4. Text Extraction from Videos (PR #103)