Simple Extractive Text Summarization using Natural Language Processing with .NET

Terms

Natural Language Processing (NLP)

The ability of computers to understand spoken or written natural language as humans do.

Composed of:

natural-language understanding
natural-language generation

Stop words

Unimportant terms e.g. a, and, from

Examples: stop-words/english.txt

Extractive Summarisation vs Abstractive Summarisation

Extractive

Selecting important content from a document and re-combining it into a shorter form.

The importance is based on statistical & linguistic features of sentences and phrases.

Abstractive

Developing an understanding of the main concepts in a document and then re-expressing them using clear natural language.

Based on linguistic methods.

Why use text summarization?

Reduce “noise”.
Reduce the time & effort required to understand the content of an article.
Reduce bias.

Overview

In this article we will use extractive summarization based primarily on statistical analysis (with some heuristics applied).

For this, we will use the Catalyst NLP library. This library:

Is a fast, modern pure-C# NLP library, supporting .NET Standard 2.0.
Includes pre-built models.
Is inspired by the design of spaCy, one of the NLP libraries commonly used by data scientists.
Supports training, custom models, entity-recognition etc.

The Process

1. Data cleansing

Convert the document to plain text.

  public static string NormaliseText(string sourceText)
  {
      const string delimiter = ". ";
      sourceText = HtmlConverters.ToText(sourceText, delimiter); // use plain text as a baseline
      sourceText = Regex.Replace(sourceText, @"(\ |\.)+\.+\ ", delimiter); // collapse runs such as ' . ' or '.. ' into a single delimiter
      return sourceText;
  }
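
As a quick check of what the regular expression in NormaliseText does, the sketch below applies the same pattern to a literal string. HtmlConverters is not reproduced here, and NormaliseSketch and CollapseFullStops are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NormaliseSketch
{
    // Same pattern as NormaliseText above: a run of spaces and/or full stops,
    // followed by at least one full stop and a space, becomes ". ".
    public static string CollapseFullStops(string text)
    {
        const string delimiter = ". ";
        return Regex.Replace(text, @"(\ |\.)+\.+\ ", delimiter);
    }

    public static void Demo()
    {
        // ".. " and " . " are collapsed; an ordinary ". " is left alone.
        Console.WriteLine(CollapseFullStops("Title.. Body text . End. Done"));
        // prints "Title. Body text. End. Done"
    }
}
```

The pattern needs at least one extra space or full stop before the final full stop, so a normal sentence boundary does not match and is left unchanged.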

Load the document and process it using the standard Catalyst English model.

English.Register();
Pipeline nlp = Pipeline.For(Language.English);
var doc = new Document(sourceText, Language.English);
nlp.ProcessSingle(doc);

Identify terms AKA tokens.

Using Catalyst here handles all of the standard delimiters.

var tokens = doc.ToTokenList();

Identify phrases AKA spans.

Once again, Catalyst greatly simplifies this.

var sentences = doc.Spans.ToList();

2. Statistical Analysis

A. Terms Score

Assign a relative importance score to each unique alphanumeric term.

Exclude stop words. These terms add no semantic value to the content and can be ignored in terms of analysis.

Assumption: more important terms will be used more frequently.

Score = #occurrences of the term / #occurrences of the most frequently occurring term

var stopWords = StopWords.English;
var tokenFreq = new Dictionary<string, float>();

// collate raw token values
foreach (var token in tokens)
{
    var tokenValue = token.Value.ToLowerInvariant();
    if (!stopWords.Contains(tokenValue) && !tokenValue.IsNonAlphaNumeric())
    {
        if (tokenFreq.ContainsKey(tokenValue))
            tokenFreq[tokenValue] = tokenFreq[tokenValue] + 1;
        else
            tokenFreq.Add(tokenValue, 1);
    }
}

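if (tokenFreq.Count > 0)
{
    var maxOccurrence = tokenFreq.Max(x => x.Value);

    // update the token score; iterate over a copy of the keys because
    // older runtimes throw if a dictionary is modified while it is enumerated
    foreach (var key in tokenFreq.Keys.ToList())
    {
        tokenFreq[key] = tokenFreq[key] / maxOccurrence;
    }
}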
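
To make the formula concrete, here is a minimal sketch of the same calculation on plain strings, without Catalyst. It assumes the terms are already tokenised and lower-cased; TermScoreSketch and its Score method are hypothetical names, not part of the article's code or of Catalyst.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TermScoreSketch
{
    // Returns count(term) / count(most frequent term) for every non-stop-word term.
    public static Dictionary<string, float> Score(IEnumerable<string> terms, ISet<string> stopWords)
    {
        var counts = new Dictionary<string, float>();
        foreach (var term in terms)
        {
            if (stopWords.Contains(term))
                continue; // stop words are ignored entirely

            counts.TryGetValue(term, out var current); // current is 0 for a new term
            counts[term] = current + 1;
        }

        if (counts.Count == 0)
            return counts;

        var max = counts.Values.Max();
        // Build a new dictionary rather than updating in place while enumerating.
        return counts.ToDictionary(x => x.Key, x => x.Value / max);
    }
}
```

For the terms the, cat, sat, the, cat, slept with "the" as the only stop word, "cat" occurs twice and scores 1.0, while "sat" and "slept" occur once each and score 0.5.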

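Future improvement: lemmatization or stemming would group inflected forms of the same word (e.g. "run", "runs", "running") under a single term, which should make the term scores more accurate.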

B. Phrase Score

De-duplicate the phrases.

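If a phrase is contained within another phrase in the document, the contained (shorter) phrase is removed; this is useful for excluding sub-headings that are repeated in the body text. Only phrases of more than three tokens are checked. Even a simple approach here is sufficient to reduce bias.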

for (int i = 0; i < sentences.Count; i++)
{
    if (sentences[i].TokensCount > 3 && sentences.Count(x => x.Value.Contains(sentences[i].Value.Trim('.'))) > 1)
    {
        sentences.RemoveAt(i);
        i--;
    }
}
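
The effect of this loop can be seen on plain strings. The sketch below mirrors the logic above under one stated assumption: a whitespace word count stands in for Catalyst's TokensCount, which also counts punctuation tokens. DedupSketch and RemoveContained are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DedupSketch
{
    // Removes any phrase of more than three words whose text (minus full stops
    // at either end) also appears inside another phrase in the list.
    public static List<string> RemoveContained(IEnumerable<string> phrases)
    {
        var result = phrases.ToList();
        for (int i = 0; i < result.Count; i++)
        {
            var candidate = result[i].Trim('.');
            var wordCount = candidate
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                .Length;

            // A count above 1 means the phrase matches itself plus at least one other phrase.
            if (wordCount > 3 && result.Count(x => x.Contains(candidate)) > 1)
            {
                result.RemoveAt(i);
                i--; // stay on the same index after removal
            }
        }
        return result;
    }
}
```

Given "Why use text summarization", "Why use text summarization when reading long reports." and "It saves time.", the first phrase is removed because it reappears inside the second; the third is kept because it has only three words.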

Score = ∑ scores of the terms in the phrase.

  // named sentenceScores to match the summary generation code below
  var sentenceScores = new Dictionary<string, (int docOrder, float score)>();
  var sentenceDocOrder = 0; // we record the order of the sentence

  foreach (var sentence in sentences)
  {
      var spanTokens = sentence.Tokens.Select(x => x.Value.ToLowerInvariant()).ToList();

      foreach (var word in tokenFreq)
      {
          if (spanTokens.Contains(word.Key))
          {
              if (sentenceScores.ContainsKey(sentence.Value))
              {
                  // add this term's score to the running total for the sentence
                  var scoreCard = sentenceScores[sentence.Value];
                  scoreCard.score += word.Value;
                  sentenceScores[sentence.Value] = scoreCard;
              }
              else
              {
                  // first scored term: record the sentence and its position
                  sentenceScores.Add(sentence.Value, (sentenceDocOrder, word.Value));
                  sentenceDocOrder++;
              }
          }
      }
  }
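
Stripped of the bookkeeping, the score for a single phrase is a sum over the distinct scored terms it contains. The sketch below shows that on plain strings; PhraseScoreSketch is a hypothetical name and the term scores are made-up illustrative values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PhraseScoreSketch
{
    // Sums the score of each distinct term in the phrase.
    // Terms without a score (stop words, punctuation) contribute 0.
    public static float Score(IEnumerable<string> phraseTerms,
                              IReadOnlyDictionary<string, float> termScores)
    {
        return phraseTerms
            .Distinct() // as in the loop above, a repeated term is counted once
            .Sum(t => termScores.TryGetValue(t, out var s) ? s : 0f);
    }
}
```

With term scores cat = 1.0 and sat = 0.5, the phrase terms the, cat, sat, cat give a score of 1.5: "the" has no score and the second "cat" is not counted again.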

3. Summary generation

Extract a percentage of the content.

Determine the number of sentences to return (minimum of 1).

If fewer than 6 sentences then:

Assume individual ranked statements are more useful.
Return the phrases from most important to least important, each on a separate line.

If more than 5 sentences then:

Generate a readable paragraph.
Assume that using the order of appearance in the original document will provide improved readability and context.
Return only the most highly ranked phrases.

var selection = (int)Math.Round(sentenceScores.Count * (extractionPercentage / 100.0), 0); // 100.0 avoids integer division if extractionPercentage is an int
var summary = string.Empty;
if (selection < 1)
    selection = 1; // ensure at least one sentence is returned

if (selection <= 5)
{
    // assume individual ranked statements are more useful
    var statementsToInclude = sentenceScores.OrderByDescending(x => x.Value.score)
    .Take(selection)
    .Select(x => x.Key.TrimEnd('.'));
    summary = string.Join(Environment.NewLine, statementsToInclude);
}
else
{
  // attempt to build a readable paragraph from the most highly ranked statements
  var toInclude = sentenceScores.OrderByDescending(x => x.Value.score)
      .Take(selection)
      .OrderBy(x => x.Value.docOrder) // restore the original document order for readability and context
      .Select(x => x.Key.TrimEnd('.')); // lines don't always end in a full stop.  So trim, then join, ensures they all do.
  summary = string.Concat(string.Join(". ", toInclude), ".");
}
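
The two branches can be exercised without Catalyst by passing in a hand-built score table. This is a minimal sketch that assumes extractionPercentage is a double between 0 and 100; SummarySketch and Build are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SummarySketch
{
    public static string Build(
        IReadOnlyDictionary<string, (int docOrder, float score)> scores,
        double extractionPercentage)
    {
        // Number of sentences to keep, never fewer than one.
        var selection = Math.Max(1,
            (int)Math.Round(scores.Count * (extractionPercentage / 100.0), 0));

        var top = scores.OrderByDescending(x => x.Value.score).Take(selection);

        if (selection <= 5)
        {
            // Short summary: ranked statements, one per line.
            return string.Join(Environment.NewLine, top.Select(x => x.Key.TrimEnd('.')));
        }

        // Longer summary: back in document order, joined into a paragraph.
        var ordered = top.OrderBy(x => x.Value.docOrder).Select(x => x.Key.TrimEnd('.'));
        return string.Join(". ", ordered) + ".";
    }
}
```

Note that Math.Round uses banker's rounding by default (MidpointRounding.ToEven), so a computed value of exactly 2.5 sentences rounds to 2, not 3.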

So there we have it, a simple way to tokenise, score and rank the phrases in a document.
