Scene texts contain rich semantic information that may be used in many vision-based applications. With the rise of deep learning, significant advances have been made in scene text detection and recognition in natural images. However, in real scenes, text instances may suffer severe occlusions, which makes their identification harder. Moreover, the lack of consistent real-world datasets, richer annotations, and evaluations specific to the occlusion problem means that the severe impact of occlusion on algorithm performance remains an open issue. Therefore, unlike previous works in this field, our research addresses occlusions in scene text recognition. The goal is to assess how robust and efficient existing deep architectures for scene text detection and recognition are when facing various occlusion levels. First, we surveyed the state of the art in scene text identification (detection and recognition), choosing four algorithms for scene text detection and four for scene text recognition. We then evaluated the performance of these deep architectures on the ICDAR 2015 dataset without any generated occlusion. Second, we created a methodology to generate large datasets of scene text in natural images with occlusion levels ranging from 0 to 100%. From this methodology, we produced ISTD-OC, a dataset derived from the ICDAR 2015 database, which we used to evaluate the chosen deep architectures under different levels of occlusion. The results demonstrate that these deep architectures, despite achieving state-of-the-art results, are still far from understanding text instances in real-world scenarios. Unlike the human visual system, which can comprehend occluded instances through contextual reasoning and association, our extensive experimental evaluations show that current scene text recognition models are inefficient when high occlusions exist in a scene.
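The occlusion-generation step can be illustrated with a minimal sketch: cover a chosen fraction of each annotated text bounding box with a synthetic occluder. This is a simplified assumption for illustration; the function `occlude_box` and the rectangular occlusion pattern are hypothetical and may differ from the actual ISTD-OC generation procedure.

```python
import numpy as np

def occlude_box(image, box, level, fill=0):
    """Cover the left-most `level` fraction of a text bounding box.

    image: HxWxC uint8 array; box: (x0, y0, x1, y1) in pixels;
    level: occlusion ratio in [0, 1]. Returns a new image; the
    original is left untouched. A plain rectangular occluder is
    assumed here for simplicity.
    """
    x0, y0, x1, y1 = box
    occ_w = int(round((x1 - x0) * level))  # occluder width in pixels
    out = image.copy()
    out[y0:y1, x0:x0 + occ_w] = fill       # paint the occluded strip
    return out

# Example: occlude 70% of a 100-pixel-wide text region
img = np.full((40, 200, 3), 255, dtype=np.uint8)
occ = occlude_box(img, (50, 10, 150, 30), 0.7)
```

Sweeping `level` from 0 to 1 over every ground-truth box of a dataset such as ICDAR 2015 yields a family of increasingly occluded test sets, which is the evaluation setting described above.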
Nevertheless, for scene text detection, PSENet has shown robustness to high occlusion levels, achieving 87% precision on text instances with around 70% occlusion. At higher levels, the model learns only to detect the pattern of the occlusion employed rather than the text itself. These results provide insights into the capabilities and limitations of recently proposed deep models facing occlusion, which can serve as a reference for future studies in complex and diverse scenes.