Terms
Natural Language Processing (NLP)
Computers understanding spoken or written natural language as humans do.
Composed of:
- natural-language understanding
- natural-language generation
Stop words
Unimportant terms, e.g. "a", "and", "from".
Examples: stop-words/english.txt
Extractive Summarisation vs Abstractive Summarisation
Extractive
Selecting important content from a document and re-combining it into a shorter form.
The importance is based on statistical & linguistic features of sentences and phrases.
Abstractive
Developing an understanding of the main concepts in a document and then re-expressing them using clear natural language.
Based on linguistic methods.
Why use text summarisation?
- Reduce “noise”.
- Reduce the time & effort required to understand the content of an article.
- Reduce bias.
Overview
In this article we will use extractive summarisation based primarily on statistical analysis (with some heuristics applied).
For this, we will leverage the Catalyst NLP library. This library
- Is a fast, modern, pure-C# NLP library supporting .NET Standard 2.0.
- Includes pre-built models.
- Is inspired by the design of spaCy, one of the NLP libraries most commonly used by data scientists.
- Supports training, custom models, entity recognition, etc.
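If you want to follow along, Catalyst and its English model can be added from NuGet. (The package names below are as published at the time of writing; check NuGet if they have since changed.)

dotnet add package Catalyst
dotnet add package Catalyst.Models.English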
The Process
1. Data cleansing
Convert the document to plain text.
public static string NormaliseText(string sourceText)
{
    const string delimiter = ". ";

    // use plain text as a baseline
    sourceText = HtmlConverters.ToText(sourceText, delimiter);

    // collapse ' . ' or '.. ' sequences into a single '. '
    sourceText = Regex.Replace(sourceText, @"(\ |\.)+\.+\ ", delimiter);

    return sourceText;
}
Load the document and process it using the standard Catalyst English model.
English.Register();
Pipeline nlp = Pipeline.For(Language.English);
var doc = new Document(sourceText, Language.English);
nlp.ProcessSingle(doc);
Identify terms AKA tokens.
Using Catalyst here handles all of the standard delimiters.
var tokens = doc.ToTokenList();
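As a quick sanity check, the tokens can be dumped to the console. This sketch assumes the Value and POS properties exposed on Catalyst's token interface:

// illustrative only: print the first few tokens with their part-of-speech tags
foreach (var token in tokens.Take(10))
{
    Console.WriteLine($"{token.Value} ({token.POS})");
}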
Identify phrases AKA spans.
Once again, Catalyst greatly simplifies this.
var sentences = doc.Spans.ToList();
2. Statistical Analysis
A. Term Scores
Assign a relative importance score to each unique alphanumeric term.
Exclude stop words; these terms add no semantic value to the content and can be ignored for the purposes of this analysis.
Assumption: more important terms will be used more frequently.
Score = #occurrences of the term / #occurrences of the most frequently occurring term
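For example (with made-up counts): if a term occurs 6 times and the most frequent non-stop-word term occurs 12 times, the term scores 6 / 12 = 0.5; the most frequent term itself always scores 1.0.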
var stopWords = StopWords.English;
var tokenFreq = new Dictionary<string, float>();

// collate raw token counts
foreach (var token in tokens)
{
    var tokenValue = token.Value.ToLowerInvariant();
    if (!stopWords.Contains(tokenValue) && !tokenValue.IsNonAlphaNumeric())
    {
        if (tokenFreq.ContainsKey(tokenValue))
            tokenFreq[tokenValue] = tokenFreq[tokenValue] + 1;
        else
            tokenFreq.Add(tokenValue, 1);
    }
}

if (tokenFreq.Count > 0)
{
    var maxOccurrence = tokenFreq.Max(x => x.Value);

    // normalise each count to a relative score; iterate a snapshot of the
    // entries so we don't mutate the dictionary while enumerating it
    foreach (var word in tokenFreq.ToList())
    {
        tokenFreq[word.Key] = word.Value / maxOccurrence;
    }
}
Future improvement: Lemmatization & stemming would make this significantly more accurate.
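As a rough illustration of the idea, even a naive suffix-stripping stemmer (a hypothetical stand-in for proper lemmatization, not part of the article's code) would let forms such as "summary" and "summaries" accumulate a single combined count:

// crude, purely illustrative stemmer; a real lemmatizer handles far more cases correctly
static string CrudeStem(string term)
{
    // "summaries" -> "summary"
    if (term.Length > 4 && term.EndsWith("ies"))
        return term.Substring(0, term.Length - 3) + "y";

    // "tokens" -> "token" (crude: will also mangle words like "analysis")
    if (term.Length > 3 && term.EndsWith("s"))
        return term.Substring(0, term.Length - 1);

    return term;
}

// usage when collating counts:
// var tokenValue = CrudeStem(token.Value.ToLowerInvariant());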
B. Phrase Score
De-duplicate the phrases.
If a phrase is contained within another phrase in the document, the shorter phrase is removed (this is useful for excluding sub-headings). Even a simple approach here is sufficient to reduce bias.
// remove any phrase whose text also appears inside another phrase
for (int i = 0; i < sentences.Count; i++)
{
    if (sentences[i].TokensCount > 3 &&
        sentences.Count(x => x.Value.Contains(sentences[i].Value.Trim('.'))) > 1)
    {
        sentences.RemoveAt(i);
        i--; // re-check the index that the next element has shifted into
    }
}
Score = ∑ term scores, over all terms in the phrase.
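For example (with made-up scores), a phrase containing terms scored 1.0, 0.5 and 0.25 receives a phrase score of 1.75. Note that, as implemented below, each unique term contributes once, however many times it repeats within the phrase.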
var sentenceScore = new Dictionary<string, (int docOrder, float score)>();
var sentenceDocOrder = 0; // we record the order in which each scored sentence first appears

foreach (var sentence in sentences)
{
    var spanTokens = sentence.Tokens.Select(x => x.Value.ToLowerInvariant()).ToList();
    foreach (var word in tokenFreq)
    {
        if (spanTokens.Contains(word.Key))
        {
            if (sentenceScore.ContainsKey(sentence.Value))
            {
                var scoreCard = sentenceScore[sentence.Value];
                scoreCard.score += word.Value;
                sentenceScore[sentence.Value] = scoreCard;
            }
            else
            {
                sentenceScore.Add(sentence.Value, (sentenceDocOrder, word.Value));
                sentenceDocOrder++;
            }
        }
    }
}
3. Summary generation
Extract a percentage of the content.
Determine the number of sentences to return (minimum of 1).
If 5 or fewer sentences are selected:
- Assume individual ranked statements are more useful.
- Return the phrases from most important to least important, each on a separate line.
If more than 5 sentences are selected:
- Generate a readable paragraph.
- Assume that presenting the phrases in their order of appearance in the original document improves readability and context.
- Return only the most highly ranked phrases.
var selection = (int)Math.Round(sentenceScore.Count * (extractionPercentage / 100), 0); // extractionPercentage is assumed to be a floating-point value (e.g. 20d), avoiding integer division
var summary = string.Empty;

if (selection < 1) selection = 1; // ensure at least one sentence is returned

if (selection <= 5)
{
    // assume individual ranked statements are more useful
    var statementsToInclude = sentenceScore.OrderByDescending(x => x.Value.score)
                                           .Take(selection)
                                           .Select(x => x.Key.TrimEnd('.'));
    summary = string.Join(Environment.NewLine, statementsToInclude);
}
else
{
    // attempt to build a readable paragraph from the most highly ranked statements
    var toInclude = sentenceScore.OrderByDescending(x => x.Value.score)
                                 .Take(selection)
                                 .OrderBy(x => x.Value.docOrder) // restore original document order for readability and context
                                 .Select(x => x.Key.TrimEnd('.'));

    // lines don't always end in a full stop, so trim then join ensures they all do
    summary = string.Concat(string.Join(". ", toInclude), ".");
}
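Putting it all together, an end-to-end call might look like the sketch below; the Summarise wrapper and the 20% figure are hypothetical, standing in for the snippets above:

// hypothetical usage: Summarise wraps the scoring and selection steps shown above
var text = NormaliseText(File.ReadAllText("article.html"));
var summary = Summarise(text, extractionPercentage: 20d);
Console.WriteLine(summary);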
So there we have it, a simple way to tokenise, score and rank the phrases in a document.