Automatic generation of Portuguese summary from audiovisual content in English: method, validation, and "homemade" application
DOI:
https://doi.org/10.5380/atoz.v14.94711
Keywords:
Audiovisual synthesis, English-Portuguese translation, Artificial Intelligence, Machine learning models, Web application, Quantitative evaluation
Abstract
Introduction: Audiovisual synthesis is essential to democratize knowledge, facilitate research and learning, enhance user experience, and promote digital inclusion. However, manual summarization is laborious and does not scale. AI can automate this process, but a complete, low-cost, and user-friendly automated solution is still lacking. This work proposes a roadmap for building a "homemade" automated pipeline that generates summaries in Portuguese from videos in English.
Method: We implemented four algorithms in a pipeline that (1) extracts the audio from the video, (2) transcribes it into text, (3) summarizes the text in the original language, and (4) translates the summary into Portuguese. The algorithms use machine learning models and are validated with step-specific metrics: WER, CER, ROUGE, and BLEU.
Results: The work presents "Smart Summy," an architecture and integrated solution for automatically generating Portuguese summaries from videos in English. It runs in the cloud, requires no installation or technical knowledge on the user's part, and offers a lightweight, simple, and intuitive interface. Quantitative evaluations of the pipeline stages using established metrics show very high transcription quality, good quality of the English summary, and excellent translation quality into Portuguese.
Conclusions: "Smart Summy" and its guided usage roadmap show that it can fill an existing gap: integrating Artificial Intelligence tools (or models) to automate productivity tasks for the "average" user.
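Of the metrics named in the Method, WER (word error rate) and CER (character error rate) are conventionally used to score transcriptions. The sketch below shows one common way to compute them: the Levenshtein edit distance between a reference and a hypothesis, divided by the reference length. It is a minimal illustration only, not the authors' implementation; the function names and the sample sentences are assumptions, and no text normalization (casing, punctuation) is applied.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical sample pair, not taken from the paper's evaluation data.
    reference = "the cat sat on the mat"
    hypothesis = "the cat sit on mat"
    print(f"WER = {wer(reference, hypothesis):.3f}")
    print(f"CER = {cer(reference, hypothesis):.3f}")
```

Lower values are better for both metrics, and a value of 0 means the hypothesis matches the reference exactly. Because insertions are counted, either metric can exceed 1 when the hypothesis is much longer than the reference.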
License
Copyright (c) 2026 AtoZ: novas práticas em informação e conhecimento

This work is licensed under a Creative Commons Attribution 4.0 International License.
AtoZ is an open access journal, and authors have permission and are encouraged to deposit their papers on personal web pages, institutional repositories, or portals before (pre-print) or after (post-print) publication in AtoZ. We ask only that, when and where possible, AtoZ be cited as a bibliographic reference (including the attributed URL).
The authors license AtoZ solely for the purpose of disseminating the published work (peer-reviewed version/post-print) in aggregation, curation, and indexing systems.
AtoZ is a Diadorim/IBICT green academic journal.
All journal content (including instructions, editorial policies, and templates), except where otherwise indicated, has been under a Creative Commons Attribution 4.0 International license since October 2020.
When published by this journal, articles are free to share (copy and redistribute the material in any medium or format for any purpose, even commercially) and adapt (remix, transform, and build upon the material for any purpose, even commercially). You must give appropriate credit, provide a link to the license, and indicate if changes were made.
AtoZ does not charge any fees for manuscript submission, processing, or publication.