Development of a Methodology for the Collection and Identification of Administrative Acts of Interest in the Official Gazettes of Jurisdictions Overseen by the Rio de Janeiro State Court of Auditors (TCE-RJ)
DOI: https://doi.org/10.70690/m6zcsw72
Keywords: Official Gazettes, Natural Language Processing (NLP), Random Forest, Large Language Models (LLMs), Public Audit
Abstract
This project addresses a critical challenge in the digital era: transforming the vast volume of unstructured data in Official Gazettes into accessible and actionable information. Using data mining, machine learning, and natural language processing (NLP) techniques, the research aims to enhance control and oversight in the public sector. Official Gazettes are essential for administrative transparency, but they are published in various formats (PDF, HTML, etc.), which complicates access and analysis. The project proposed and tested a methodology based on the CRISP-DM model that structures the process from data collection to the classification of administrative acts such as appointments and dismissals. Two approaches were explored: Random Forest for segmented and structured data, and a Large Language Model (LLM), such as Gemini, for more complex contexts and full texts. Both approaches achieved a high level of accuracy. Random Forest excelled in efficiency with structured data, while the LLM handled varied texts flexibly and maintained precision even in ambiguous cases. The research also demonstrated the feasibility of using emerging technologies, such as large language models, to automate repetitive tasks, thereby facilitating the work of TCE-RJ auditors. The study confirmed the technical feasibility of automation in the public sector and provided practical insights for future applications, such as analyzing contracts, tenders, and agreements. The methodology promises not only to modernize oversight but also to promote greater transparency and efficiency in administrative control, cementing the role of artificial intelligence as a strategic tool in public management.
License
Copyright (c) 2025 Síntese

This work is licensed under a Creative Commons Attribution 4.0 International License.

