"The Robust Reading Competition has moved to its new permanent space at http://rrc.cvc.uab.es. This site will remain available but will not accept any further submissions from January 2015 onwards. Please use the new site at http://rrc.cvc.uab.es for up to date information and to submit new results. You can continue to use your existing user accounts while all associated data have been transferred to the new site. If you encounter any problem, please contact us. Apologies for any inconvenience caused."
Why do I need to register?
Registering is important for us as it gives us an indication of the likely participation in the competition, and also a way to contact potential participants in case we need to communicate useful information about the competition (and only about the competition). You will need to be registered in order to get access to the "Downloads" and "Submit Results" sections.
Am I obliged to participate if I register?
No, registration is only meant to be an expression of interest, and it will give you access to the "Downloads" and "Submit Results" sections.
Do I have to participate in all of the tasks of the Challenge?
No. You can participate in any and as many of the tasks as you wish to.
I noticed there are three Challenges organised under the “ICDAR 2013 Robust Reading Competition”. Do I have to participate in other challenges as well?
No, you do not have to. But we would really appreciate it if you did!
We have strived to structure all challenges in the same way, so if you have a system that can be trained and produce results for one of them, then it should be easy to adapt for the rest, and the additional effort is minimal. Not to mention that you get more chances to win :)
Why have you organised two challenges on static images? The Real Scene and the Born-Digital images seem to be very similar.
There are crucial differences between the two application domains. Real scene images are captured by high-resolution cameras and may suffer from illumination problems, occlusions and shadows. Born-digital images, on the other hand, are designed directly on the computer: the text is designed in situ and may suffer from compression or anti-aliasing artefacts, the fonts used are very small, and the resolution is 72 dpi, as these images are designed to be transferred online. There are more differences to list, but the main point here is that algorithms that work well in one domain will not necessarily work well in the other. The idea of hosting two challenges and addressing both domains in parallel is to try to qualify and quantify the similarities and the differences, and to establish the state of the art in both domains.
How is Challenge 3 (videos) different from Challenge 2 (static images)? Isn't the case of video equivalent to running the text extraction algorithm on all frames one by one?
A key aspect of video text extraction is the ability of the algorithm to track the text box over different frames. We therefore expect solutions that can demonstrate this ability, and our evaluation framework penalises algorithms with faults in the tracking part.
I found a mistake in the ground truth! What can I do?
Please let us know by sending us a note at email@example.com. After the end of the competition the datasets will be archived at the TC10 and TC11 Web sites, and we will correct any mistakes found in the ground truth at that point. We will refrain from publishing updates to the training set during the training period in order not to interfere with the competition process. We really appreciate your help!
Challenges 1 and 2
Your "Text Localisation" ground truth seems to be at the level of words, but my algorithm is made to locate whole text lines! Are you going to penalise my algorithm during evaluation?
We will do our best not to penalise such behaviour. This was actually one of the few issues reported by authors after past Robust Reading competitions. For the evaluation of this task we have implemented the methodology described in C. Wolf and J.M. Jolion, "Object Count / Area Graphs for the Evaluation of Object Detection and Segmentation Algorithms", International Journal on Document Analysis and Recognition, vol. 8, no. 4, pp. 280-296, 2006. This methodology addresses the problem of one-to-many and many-to-one correspondences of detected areas in a satisfactory way, and algorithms that are not designed to work at the word level should not be penalised.
I see that not every piece of text in the images is ground truthed. Is this an error?
We aim to ground truth every bit of text in the images. There are, however, cases where we consciously do not include certain text in the ground truth description. These are the following:
- Characters that are partially cut (see for example the cut line at the bottom of Figure 1a, which is not included in the ground truth). Cut text usually appears when a large image is split into a collage of many smaller ones; traditionally this practice was used to speed up the download of Web pages, but it is not encountered often nowadays.
- Text that was not meant to be read but appears in the image accidentally as part of photographic content (see for example the names of the actors on the "The Ugly Truth" DVD in Figure 1b). The text there can only be inferred from the context; it was never meant to be read. In contrast, we do include text which is part of photographic content when its presence in the image is not accidental (for example, the names of the movies in Figure 1b are indeed included in the ground truth).
- Text that we cannot read in general. This can be because of very low resolution, for example, but there are other cases as well. See for example the image of Figure 1c, where the word "twitter" seems to be used as the background, behind "follow". This is treated as background and is not included in the ground truth.
In any other case, we probably have made a mistake, so please let us know!
Why are there two evaluation protocols ("ICDAR 2013" and "DetEval") for Text Localisation?
The "ICDAR 2013" evaluation protocol for the text localisation task is as described in the report of the competition [1], and is based on the framework described in [2]. The "ICDAR 2013" evaluation protocol is a custom implementation, tightly integrated with the competition Web portal in order to enable the advanced evaluation services offered through the competition Web site, and as such it does not make use of the DetEval tool (code offered by the authors of [2]).
Over time, it has come to our attention that slight differences exist between the results of the ICDAR 2013 evaluation protocol and the results obtained by using DetEval. These are due to a number of heuristics in the DetEval tool that are not documented in the paper [2]. These include the following:
- The DetEval tool implements two-pass matching for one-to-one matches: even if a one-to-one match is found in the beginning (according to the overlap thresholds set), it is still considered as a possible one-to-many or many-to-one match if it overlaps with more regions. The decision as to what type of match to consider is taken at the end. This heuristic makes intuitive sense and in many cases produces results that are easier to interpret, especially for methods that consistently over- or under-segment. The ICDAR 2013 implementation applies the one-to-one matching rule first (as described in [2]) and does not consider any alternative interpretations if a one-to-one match is found.
- The DetEval tool looks for many-to-one matches before one-to-many matches. The ICDAR 2013 implementation follows the order described in [2] and looks for one-to-many matches before many-to-one matches. This actually has minimal impact on the results.
To ensure compatibility and to assist authors who make parallel use of the DetEval framework offline, we have implemented an alternative evaluation protocol which has been tested to be consistent with the DetEval tool and takes into account all undocumented heuristics. Any method submitted to Task 1 of Challenge 1 or 2 will be automatically evaluated using both evaluation protocols, while results and ranking tables can be visualised for either.
Note that the final numerical results produced by either protocol are very similar, while the methods' ranking obtained by either evaluation protocol rarely changes.
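As a rough illustration of the matching order discussed above, here is a minimal Python sketch. It is not the code used by the competition portal or by DetEval. It assumes axis-aligned boxes given as (x1, y1, x2, y2), area recall and area precision thresholds of 0.8 and 0.4 (commonly used defaults, assumed here), and hypothetical function names. The flag many_to_one_first only swaps the order of the last two passes; the DetEval two-pass heuristic for one-to-one matches is not modelled.

```python
def area(r):
    """Area of an axis-aligned box (x1, y1, x2, y2); zero if degenerate."""
    x1, y1, x2, y2 = r
    return max(0, x2 - x1) * max(0, y2 - y1)


def overlap(a, b):
    """Area of the intersection of two boxes."""
    return area((max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3])))


def match(gts, dets, tr=0.8, tp=0.4, many_to_one_first=False):
    """Return a list of (match_type, gt_indices, det_indices) tuples."""
    # rec[i][j]: fraction of ground-truth box i covered by detection j.
    # prec[i][j]: fraction of detection j covered by ground-truth box i.
    rec = [[overlap(g, d) / area(g) for d in dets] for g in gts]
    prec = [[overlap(g, d) / area(d) for d in dets] for g in gts]
    free_g, free_d = set(range(len(gts))), set(range(len(dets)))
    matches = []

    # Pass 1: one-to-one matches are accepted first and never revisited.
    for i in range(len(gts)):
        for j in range(len(dets)):
            if (i in free_g and j in free_d
                    and rec[i][j] >= tr and prec[i][j] >= tp):
                matches.append(("one-to-one", [i], [j]))
                free_g.discard(i)
                free_d.discard(j)

    def one_to_many():
        # One ground-truth box split over several detections.
        for i in sorted(free_g):
            js = [j for j in sorted(free_d) if prec[i][j] >= tp]
            if len(js) >= 2 and sum(rec[i][j] for j in js) >= tr:
                matches.append(("one-to-many", [i], js))
                free_g.discard(i)
                free_d.difference_update(js)

    def many_to_one():
        # Several ground-truth boxes merged into one detection,
        # e.g. a text-line detection covering several words.
        for j in sorted(free_d):
            gs = [i for i in sorted(free_g) if rec[i][j] >= tr]
            if len(gs) >= 2 and sum(prec[i][j] for i in gs) >= tp:
                matches.append(("many-to-one", gs, [j]))
                free_d.discard(j)
                free_g.difference_update(gs)

    passes = [many_to_one, one_to_many] if many_to_one_first \
        else [one_to_many, many_to_one]
    for run_pass in passes:
        run_pass()
    return matches
```

For example, with three word-level ground-truth boxes lying inside a single detected text line, no word passes the one-to-one test (the line is much larger than any single word), but the line is accepted as a many-to-one match covering all three words. This is the mechanism by which line-level methods avoid being penalised.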
1. D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. Gomez, S. Robles, J. Mas, D. Fernandez, J. Almazan, L.P. de las Heras, "ICDAR 2013 Robust Reading Competition", In Proc. 12th International Conference on Document Analysis and Recognition, 2013, IEEE CPS, pp. 1115-1124.
2. C. Wolf and J.M. Jolion, "Object Count / Area Graphs for the Evaluation of Object Detection and Segmentation Algorithms", International Journal on Document Analysis and Recognition, vol. 8, no. 4, pp. 280-296, 2006.
Challenge 3
Your ground truth seems to be at the level of words, but my algorithm is made to locate whole text lines or paragraphs of text. Are you going to penalise my algorithm during evaluation?
It is very difficult, if not impossible, to decide what the right level for the ground truth should be in the case of real scenes, be it videos or static images. We have decided to create ground truth at the level of words, because they are the smallest common denominator. We are aware that this makes little sense in many real-life applications, but we also believe that this is a matter to be taken care of during the evaluation, and not during the ground truthing process.
Unfortunately, and unlike Challenges 1 and 2, the current evaluation framework we are using (based on CLEAR MOT: K. Bernardin and R. Stiefelhagen, "Evaluating multiple object tracking performance: the CLEAR MOT metrics", J. Image Video Process., 2008) cannot be easily adapted to take this mismatch between the semantic level of the ground truth and that of the results into account. Therefore, your method will be penalised if results are not given at the level of words. We plan to work on an updated evaluation scheme for the next edition of the competition.
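To give an idea of what the CLEAR MOT metrics measure, here is a minimal Python sketch of the two summary scores defined by Bernardin and Stiefelhagen: MOTA (tracking accuracy) and MOTP (tracking precision). It is not the competition's evaluation code. The FrameStats structure and its field names are hypothetical, the per-frame matching between ground-truth words and detections is assumed to have been done already, and MOTP is computed here over overlap scores (the metric can equally be defined over distances).

```python
from dataclasses import dataclass


@dataclass
class FrameStats:
    gt: int               # ground-truth words present in the frame
    misses: int           # ground-truth words with no matched detection
    false_positives: int  # detections with no matched ground-truth word
    id_switches: int      # matched words whose track ID changed
    matches: int          # ground-truth words matched to a detection
    overlap_sum: float    # sum of the overlap scores of those matches


def clear_mot(frames):
    """Return (MOTA, MOTP) accumulated over a sequence of frames."""
    total_gt = sum(f.gt for f in frames)
    errors = sum(f.misses + f.false_positives + f.id_switches for f in frames)
    total_matches = sum(f.matches for f in frames)
    # MOTA: one minus the ratio of all errors to all ground-truth objects.
    mota = 1.0 - errors / total_gt if total_gt else float("nan")
    # MOTP: average overlap over all matched pairs.
    motp = (sum(f.overlap_sum for f in frames) / total_matches
            if total_matches else float("nan"))
    return mota, motp
```

Because identity switches count as errors in MOTA, a method that detects every word in every frame but fails to keep a consistent ID for each word across frames still loses accuracy. Likewise, a single text-line detection covering several ground-truth words can be matched to at most one of them, so the remaining words count as misses.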
How did you create your ground truth? Did you follow a particular protocol?
You can download the protocol we followed to create the ground truth for Challenge 3 from here. Important aspects to note are that the ground truth is made at the level of words, and that the quality attribute is used to control whether a region is good enough to be taken into account during the evaluation. If it is not, the region is treated as a "don't care" region during the evaluation.