• Abstractive Summarization: Abstractive summarization generates a concise summary of the input document without necessarily reusing the exact words or sentences of the original text. Instead, it aims to capture the main ideas and concepts in a more abstract and coherent way. This approach typically relies on natural language generation techniques and offers more flexibility in conveying the essence of the source text.
• Extractive Summarization: Extractive summarization, on the other hand, selects the most important sentences or phrases directly from the source document and assembles them, typically verbatim, into the summary. It is simpler and does not involve generating new sentences, but it can sometimes produce less coherent summaries.
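To make the extractive approach concrete, here is a minimal sketch of frequency-based extractive summarization: each sentence is scored by how frequent its words are in the whole document, and the top-k sentences are kept in their original order. This is only an illustration of the idea; real extractive systems (e.g. TextRank) use more robust scoring, and none of the function names below come from this project.

```python
import re
from collections import Counter

def extractive_summary(text: str, k: int = 2) -> str:
    """Pick the k highest-scoring sentences, preserving document order."""
    # Split into sentences on end-of-sentence punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    # Word frequencies over the whole document act as importance weights.
    freq = Counter(re.findall(r"\w+", text.lower()))
    # Score each sentence as the sum of its word frequencies.
    scores = [(sum(freq[w] for w in re.findall(r"\w+", s.lower())), i)
              for i, s in enumerate(sentences)]
    # Keep the k best sentences, then restore their original order.
    top = sorted(sorted(scores, reverse=True)[:k], key=lambda t: t[1])
    return " ".join(sentences[i] for _, i in top)
```

Because the selected sentences are copied verbatim, the output is always grammatical at the sentence level, but transitions between the chosen sentences may be abrupt, which is exactly the coherence limitation noted above.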
In this project, three models were trained and fine-tuned for abstractive summarization, and the google/mt5-small model is currently deployed. Trained on XSum (English) and MLSum (French), the model generates high-quality abstractive summaries and supports multilingual summarization in both languages. Evaluation with ROUGE scores confirms that it consistently produces precise and coherent summaries.
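To illustrate the evaluation metric, the sketch below computes ROUGE-1 F1 (unigram overlap between a candidate summary and a reference) in plain Python. Real evaluations typically use a library such as rouge-score, which also handles stemming and ROUGE-2/ROUGE-L; this toy version only shows what the score measures.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each unigram counts at most as often as in either side.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Example: 5 of 6 unigrams overlap on each side, so F1 = 5/6.
print(round(rouge1_f1("the cat sat on the mat", "the cat lay on the mat"), 3))  # 0.833
```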
Technical information:
• Model: google/mt5-small (~300 million parameters)
• Framework: PyTorch / Hugging Face
• Machine: NVIDIA RTX 4090 GPU
• Training: ~10 epochs | 22 hours
• Datasets: XSum (English) & MLSum (French)
PS: The model can summarize in both English and French.
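For reference, an mT5 checkpoint can be loaded for summarization inference through the Hugging Face transformers API as sketched below. Note that the checkpoint name used here is the public google/mt5-small base model, not this project's fine-tuned weights (which are not named in this document); without the fine-tuning described above, the base model will not produce useful summaries.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def summarize(text: str, model_name: str = "google/mt5-small",
              max_new_tokens: int = 60) -> str:
    """Generate an abstractive summary with a seq2seq mT5 checkpoint."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    # Truncate long documents to the encoder's input budget.
    inputs = tokenizer(text, return_tensors="pt",
                       truncation=True, max_length=512)
    # Beam search tends to give more fluent summaries than greedy decoding.
    ids = model.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=4)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

if __name__ == "__main__":
    print(summarize("Votre document à résumer, en anglais ou en français."))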