MASTER'S DISSERTATION DEFENSE No. 316

Student: João Soares da Silva Macedo

Title: “The Impact of Optical Character Recognition System Changes on Named Entity Recognition Models”

Advisor: Byron Leite Dantas Bezerra

Co-advisor: Cleber Zanchettin (UFPE)

External Examiner: Flavio Arthur Oliveira Santos (UFPE)

Internal Examiner: Wellington Pinheiro dos Santos

Date and time: February 25, 2025, at 2:00 p.m.
Location: Hybrid format - PPGEC mini-auditorium and Google Meet.


Abstract:

         "In today's data-driven world, data has become extremely valuable, so it is important to gather as much high-quality structured data as possible. Data extraction plays a key role in this scenario, as it retrieves valuable information from unstructured documents. The state-of-the-art approach to building data extraction pipelines is an Optical Character Recognition (OCR) system followed by a layout-aware Named Entity Recognition (NER) model. Although this pipeline performs very well, its architecture has a drawback: it relies on two different models. In this work, we analyze the relationship between them and whether the NER model depends on the OCR system. To evaluate whether changing the OCR system affects the NER model, we trained NER models using four different OCR sources (Ground Truth, PaddleOCR, EasyOCR, and Azure OCR) and then compared each model's original F1 score with its F1 score on the other OCRs. This test showed a significant drop in performance. We then propose two techniques that successfully mitigated the impact of OCR changes: mixed OCR and data augmentation. In addition, we propose a novel dataset for data extraction: a Brazilian ID dataset that differs from most current NER datasets in that it is in Portuguese and tackles a new class of documents."
