Automatic generation of Portuguese summary from audiovisual content in English: method, validation, and "homemade" application

DOI:

https://doi.org/10.5380/atoz.v14.94711

Keywords:

Audiovisual synthesis, English-Portuguese translation, Artificial Intelligence, Machine learning models, Web application, Quantitative evaluation

Abstract

Introduction: Audiovisual synthesis (the summarization of audiovisual content) is essential to democratize knowledge, facilitate research and learning, enhance the user experience, and promote digital inclusion. Manual summarization, however, is laborious and does not scale. AI can automate the process, but a complete, low-cost, and user-friendly automated solution is still lacking. This work proposes a roadmap for building a "homemade" automated pipeline that generates Portuguese summaries from videos in English.

Method: We implemented a pipeline of four algorithms that (1) extract the audio from the video, (2) transcribe it into text, (3) summarize the text in the original language, and (4) translate the summary into Portuguese. The algorithms use machine learning models, and each step is validated with metrics specific to it: WER, CER, ROUGE, and BLEU.

Results: The work presents "Smart Summy", an architecture and integrated solution for the automatic generation of Portuguese summaries from videos in English. It runs in the cloud, requires no installation or technical knowledge on the user's part, and offers a lightweight, simple, and intuitive interface. Quantitative evaluation of the pipeline stages with established metrics shows very high transcription quality, good quality of the English summary, and excellent quality of the translation into Portuguese.

Conclusions: "Smart Summy" and its guided usage roadmap show that it is possible to fill an existing gap in the integration of Artificial Intelligence tools (or models) to boost, through automation, the productivity of the "average" user.
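Of the metrics named in the abstract, WER (word error rate) and CER (character error rate) are the standard measures of transcription quality. Both are an edit distance between a reference transcript and the system output, divided by the reference length. The following is a minimal sketch of that definition, not the authors' implementation; the function names `edit_distance`, `wer`, and `cer` are illustrative assumptions, and no text normalization (casing, punctuation) is applied.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, if the reference is "the cat sat on the mat" and the transcript drops one "the", `wer` gives one error over six reference words. Lower values mean better transcription, and both metrics can exceed 1.0 when the hypothesis contains many insertions.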

Author Biographies

Keomas da Silva Monteiro, Universidade Federal de Sergipe, São Cristóvão, Sergipe

Programa de Pós Graduação em Ciência da Computação (Mestrado acadêmico)

Hendrik Teixeira Macedo, Universidade Federal de Sergipe, Aracaju, Sergipe

Departamento de Computação

Leonardo Nogueira Matos, Universidade Federal de Sergipe, São Cristóvão, Sergipe

Departamento de Computação

Kalil Araújo Bispo, Universidade Federal de Sergipe, São Cristóvão, Sergipe

Departamento de Computação

Published

2026-04-04

How to Cite

Monteiro, K. da S., Macedo, H. T., Matos, L. N., & Bispo, K. A. (2026). Automatic generation of Portuguese summary from audiovisual content in English: method, validation, and "homemade" application. AtoZ: Novas práticas Em informação E Conhecimento, 14, 1–15. https://doi.org/10.5380/atoz.v14.94711

Section

Papers