Simple Extractive Text Summarization using Natural Language Processing with .NET



Natural Language Processing (NLP)
The ability of computers to understand spoken or written natural language as humans do.

Composed of:
  • natural-language understanding
  • natural-language generation

Stop words
Very common terms that carry little meaning on their own, e.g. a, and, from.

Extractive Summarization vs Abstractive Summarization

Extractive: selecting important content from a document and re-combining it into a shorter form.
The importance is based on statistical & linguistic features of sentences and phrases.

Abstractive: developing an understanding of the main concepts in a document and then re-expressing them using clear natural language.
Based on linguistic methods.

Why use text summarization?

  • Reduce “noise”.
  • Reduce the time & effort required to understand the content of an article.
  • Reduce bias.


In this article we will use extractive summarization based primarily on statistical analysis (with some heuristics applied).

For this, we will leverage the Catalyst NLP library.  This library:
  • Is a fast, modern pure-C# NLP library, supporting .NET Standard 2.0.
  • Includes pre-built models.
  • Is modelled on the design of spaCy, one of the NLP libraries commonly used by data scientists.
  • Supports training, custom models, entity recognition etc.

The Process

1. Data cleansing

Convert the document to plain text.
  public static string NormaliseText(string sourceText)
  {
      const string delimiter = ". ";
      sourceText = HtmlConverters.ToText(sourceText, delimiter); // use plain text as a baseline
      sourceText = Regex.Replace(sourceText, @"(\ |\.)+\.+\ ", delimiter); // collapse ' . ' or '.. ' into a single delimiter
      return sourceText;
  }
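
To make the effect of the clean-up expression concrete, here is a minimal sketch that applies the same pattern to a plain string. The names `NormaliseDemo` and `CollapseStops` are made up for this illustration, and the HTML-to-text step is left out because `HtmlConverters` is not shown in this article.

```csharp
using System.Text.RegularExpressions;

public static class NormaliseDemo
{
    // Collapses runs such as " . " or ".. " into a single ". " delimiter,
    // using the same pattern as NormaliseText.
    public static string CollapseStops(string text)
    {
        const string delimiter = ". ";
        return Regex.Replace(text, @"(\ |\.)+\.+\ ", delimiter);
    }
}
```

For example, `NormaliseDemo.CollapseStops("Heading . First point.. Second point")` returns `"Heading. First point. Second point"`, while a string that already uses single full stops is left unchanged.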

Load the document and process it using the standard Catalyst English model.
Pipeline nlp = Pipeline.For(Language.English);
var doc = new Document(sourceText, Language.English);
nlp.ProcessSingle(doc); // run the pipeline so the document is tokenised and split into spans

Identify terms AKA tokens.
Catalyst's tokenizer handles all of the standard delimiters.
var tokens = doc.ToTokenList();

Identify phrases AKA spans.
Once again, Catalyst greatly simplifies this.
var sentences = doc.Spans.ToList();

2. Statistical Analysis

A. Terms Score

Assign a relative importance score to each unique alphanumeric term.
Exclude stop words.  These terms add no semantic value to the content and can be ignored in terms of analysis.
Assumption: more important terms will be used more frequently.

Score =   #occurrences of the term / #occurrences of the most frequently occurring term

var stopWords = StopWords.English;
var tokenFreq = new Dictionary<string, float>();

// collate raw token values
foreach (var token in tokens)
{
    var tokenValue = token.Value.ToLowerInvariant();
    if (!stopWords.Contains(tokenValue) && !tokenValue.IsNonAlphaNumeric())
    {
        if (tokenFreq.ContainsKey(tokenValue))
            tokenFreq[tokenValue] = tokenFreq[tokenValue] + 1;
        else
            tokenFreq.Add(tokenValue, 1);
    }
}

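if (tokenFreq.Count > 0)
{
    var maxOccurrence = tokenFreq.Max(x => x.Value);
    // update the token score; iterate over a copy of the keys so the
    // dictionary is not modified while it is being enumerated
    foreach (var key in tokenFreq.Keys.ToList())
        tokenFreq[key] = tokenFreq[key] / maxOccurrence;
}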
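
As a worked illustration of the term score, the sketch below computes it for a plain string without Catalyst. The class name `TermScoreDemo`, the tiny stop-word list and the naive split on punctuation are all assumptions made for this example; they stand in for the Catalyst tokens and stop words used in this article.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TermScoreDemo
{
    // Hypothetical, tiny stop-word list used only for this illustration.
    private static readonly HashSet<string> StopWords =
        new HashSet<string> { "a", "and", "from", "the", "on" };

    // Returns each non-stop-word term with its count divided by the highest count.
    public static Dictionary<string, float> Score(string text)
    {
        var counts = new Dictionary<string, float>();
        var separators = new[] { ' ', '.', ',', ';', ':', '!', '?' };

        foreach (var raw in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            var term = raw.ToLowerInvariant();
            if (StopWords.Contains(term))
                continue;
            counts[term] = counts.TryGetValue(term, out var n) ? n + 1 : 1;
        }

        if (counts.Count == 0)
            return counts;

        var max = counts.Values.Max();
        // Build a new dictionary rather than updating the one being enumerated.
        return counts.ToDictionary(kv => kv.Key, kv => kv.Value / max);
    }
}
```

For the text "The cat sat on the mat. The cat slept." the term "cat" occurs twice and scores 1, while "sat", "mat" and "slept" occur once and score 0.5. The stop words "the" and "on" are not scored at all.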

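Future improvement: lemmatization or stemming would let inflected forms of the same word (e.g. “summary” and “summaries”) count as one term, which should make the frequency scores more representative.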

B. Phrase Score

De-duplicate the phrases.
If a phrase contains another phrase that is in the document, the shorter phrase is removed (this is useful for excluding sub-headings).  Even a simple approach here is sufficient to reduce bias.
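for (int i = 0; i < sentences.Count; i++)
    if (sentences[i].TokensCount > 3 && sentences.Count(x => x.Value.Contains(sentences[i].Value.Trim('.'))) > 1)
        sentences.RemoveAt(i--); // loop body is missing in the source; assumed from the description: remove the contained phrase, then re-check this index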
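
The de-duplication rule can be sketched on plain strings as follows. This is an illustration only: `DedupeDemo` and `RemoveContained` are made-up names, strings stand in for Catalyst spans, and the minimum token count used in the snippet above is left out for brevity.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class DedupeDemo
{
    // Removes any phrase whose text also appears inside another phrase.
    // Plain strings stand in for Catalyst spans in this illustration.
    public static List<string> RemoveContained(IEnumerable<string> phrases)
    {
        var result = phrases.ToList();
        // Walk backwards so that removing an item does not skip the next one.
        for (int i = result.Count - 1; i >= 0; i--)
        {
            var candidate = result[i].Trim('.');
            // A count above 1 means some other phrase contains this text too.
            if (result.Count(p => p.Contains(candidate)) > 1)
                result.RemoveAt(i);
        }
        return result;
    }
}
```

Given the phrases "Why summarize", "Why summarize text? It saves time." and "It reduces noise.", the first is dropped because the second contains it, which is the sub-heading case mentioned above.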

Score = ∑ of the scores of the terms in the phrase.

  var sentenceScores = new Dictionary<string, (int docOrder, float score)>();
  var sentenceDocOrder = 0; // we record the order of the sentence

  foreach (var sentence in sentences)
  {
      var spanTokens = sentence.Tokens.Select(x => x.Value.ToLowerInvariant()).ToList();

      foreach (var word in tokenFreq)
      {
          if (spanTokens.Contains(word.Key))
          {
              if (sentenceScores.ContainsKey(sentence.Value))
              {
                  var scoreCard = sentenceScores[sentence.Value];
                  scoreCard.score += word.Value;
                  sentenceScores[sentence.Value] = scoreCard;
              }
              else
                  sentenceScores.Add(sentence.Value, (sentenceDocOrder, word.Value));
          }
      }
      sentenceDocOrder++; // move on to the next position in the document
  }
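
The phrase score can be illustrated for a single phrase with the sketch below. `PhraseScoreDemo` is a made-up name and the split on punctuation is a simplification of Catalyst's tokenisation. As in the snippet above, each distinct term contributes its score once, no matter how often it is repeated in the phrase.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PhraseScoreDemo
{
    // Sums the score of every scored term that occurs in the phrase.
    // Each distinct term counts once, however often it is repeated.
    public static float Score(string phrase, IReadOnlyDictionary<string, float> termScores)
    {
        var separators = new[] { ' ', '.', ',', ';', ':', '!', '?' };
        var terms = new HashSet<string>(
            phrase.ToLowerInvariant().Split(separators, StringSplitOptions.RemoveEmptyEntries));

        return termScores.Where(kv => terms.Contains(kv.Key)).Sum(kv => kv.Value);
    }
}
```

With term scores of cat = 1, sat = 0.5 and mat = 0.5, the phrase "The cat sat." scores 1.5, while "The cat and the cat." scores 1 because "cat" is only counted once.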

3. Summary generation

Extract a percentage of the content.

Determine the number of sentences to return (minimum of 1).

If 5 or fewer sentences are to be returned:
  • Assume individual ranked statements are more useful.
  • Return the phrases from most important to least important, each on a separate line.

If more than 5 sentences are to be returned:
  • Generate a readable paragraph.
  • Assume that using the order of appearance in the original document will provide improved readability and context.
  • Return only the most highly ranked phrases.
var selection = (int)Math.Round(sentenceScores.Count * (extractionPercentage / 100.0), 0); // 100.0 avoids integer division if extractionPercentage is an int
var summary = string.Empty;
if (selection < 1)
    selection = 1; // ensure at least one sentence is returned

if (selection <= 5)
{
    // assume individual ranked statements are more useful
    var statementsToInclude = sentenceScores.OrderByDescending(x => x.Value.score)
        .Take(selection) // keep only the most highly ranked phrases
        .Select(x => x.Key.TrimEnd('.'));
    summary = string.Join(Environment.NewLine, statementsToInclude);
}
else
{
    // attempt to build a readable paragraph from the most highly ranked statements
    var toInclude = sentenceScores.OrderByDescending(x => x.Value.score)
        .Take(selection) // keep only the most highly ranked phrases
        .OrderBy(x => x.Value.docOrder) // restore the original document order for readability and context
        .Select(x => x.Key.TrimEnd('.')); // lines don't always end in a full stop.  So trim, then join, ensures they all do.
    summary = string.Concat(string.Join(". ", toInclude), ".");
}
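
The selection rules can be pulled together into one self-contained sketch. `SummaryDemo.Build` is a made-up helper for illustration, and treating `extractionPercentage` as a `double` is an assumption, since its type is not shown in this article.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SummaryDemo
{
    // Picks the top-scoring sentences and formats them using the rules above.
    public static string Build(
        IReadOnlyDictionary<string, (int docOrder, float score)> scores,
        double extractionPercentage)
    {
        // Round to the nearest whole sentence, but always return at least one.
        var selection = Math.Max(1, (int)Math.Round(scores.Count * (extractionPercentage / 100.0)));

        var top = scores.OrderByDescending(x => x.Value.score).Take(selection);

        if (selection <= 5)
        {
            // Short summaries: one statement per line, most important first.
            return string.Join(Environment.NewLine, top.Select(x => x.Key.TrimEnd('.')));
        }

        // Longer summaries: restore document order and join into a paragraph.
        var ordered = top.OrderBy(x => x.Value.docOrder).Select(x => x.Key.TrimEnd('.'));
        return string.Join(". ", ordered) + ".";
    }
}
```

With four scored sentences and an extraction percentage of 50, two sentences are selected and returned as ranked lines. With twelve sentences and the same percentage, six are selected and returned as a single paragraph in document order. Note that `Math.Round` rounds midpoints to the nearest even number by default, so a computed value of 2.5 selects 2 sentences.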

So there we have it, a simple way to tokenise, score and rank the phrases in a document.
