Stemming

Reduce words to their root forms

Stemming is a text normalization technique in natural language processing that reduces words to their root forms, or stems. It works primarily by removing affixes such as suffixes, so the resulting stem is not always a valid dictionary word.
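As a rough illustration of affix removal, the sketch below strips one suffix from a word. The suffix list and the minimum stem length are assumptions chosen for this example; they are not taken from any particular stemming algorithm.

```python
# Hypothetical suffix list, ordered so that longer suffixes are tried first.
SUFFIXES = ("ies", "ing", "ed", "es", "s")


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem_len:
            return lowered[: -len(suffix)]
    # No suffix matched, or stripping would leave too short a stem.
    return lowered


for w in ["Jumping", "jumped", "jumps", "ponies", "bus"]:
    print(w, "->", naive_stem(w))
```

Here "ponies" becomes "pon", which is not a dictionary word, while "bus" is left alone because stripping the "s" would leave too short a stem.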

Stemming is commonly used for:

  • Information retrieval, where words that share a stem are treated as equivalent to expand search criteria
  • Dimensionality reduction in machine learning applications, where stemming leaves fewer distinct words to track and use in a model
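Both uses can be sketched on a toy corpus. The documents, the `toy_stem` helper, and the `search` function below are hypothetical and exist only for illustration; a real system would use a full stemmer and tokenizer.

```python
from collections import defaultdict


def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


docs = {
    "doc1": "the server connected to the database",
    "doc2": "connecting clients wait for connections",
    "doc3": "the database rejects the client",
}

# Dimensionality: compare the number of distinct words before and after stemming.
raw_vocab = {w for text in docs.values() for w in text.split()}
stem_vocab = {toy_stem(w) for w in raw_vocab}
print(len(raw_vocab), "distinct words,", len(stem_vocab), "distinct stems")

# Retrieval: index documents by stem so that a query matches related word forms.
index = defaultdict(set)
for doc_id, text in docs.items():
    for w in text.split():
        index[toy_stem(w)].add(doc_id)


def search(query):
    """Return the sorted ids of documents containing a word with the same stem."""
    return sorted(index.get(toy_stem(query), set()))


print(search("connect"))
print(search("clients"))
```

Note that this crude helper maps "connections" to "connection" rather than "connect", so that word form is not matched by the query "connect". A more complete rule set would be needed to merge them.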

Porter’s Stemming Algorithm

The Porter stemming algorithm, published by Martin Porter in 1980, is one of the most popular stemming approaches for the English language. It applies simple heuristic rules, which makes it fast but not always accurate. Many other algorithms have been proposed since, but Porter’s stemming algorithm remains popular due to its speed and simplicity.
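To give a flavor of these heuristic rules, the sketch below covers only a small fragment of the algorithm: the "measure" that Porter's rules use to judge how long a stem is, the plural-handling rules of step 1a, and a generic helper for rules that fire only above a given measure. It is a partial illustration, not a complete Porter stemmer, and the function names are chosen for this example.

```python
def is_consonant(word, i):
    """Return True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' is a consonant at the start of a word or after a vowel,
        # and a vowel when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-consonant sequences: the m in the form [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1  # a vowel run has just been closed by a consonant
        prev_vowel = not cons
    return m


def step_1a(word):
    """Plural rules: SSES -> SS, IES -> I, SS -> SS, S -> (removed)."""
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word


def strip_if_measure(word, suffix, min_measure):
    """Remove suffix only if the remaining stem has measure > min_measure."""
    if word.endswith(suffix):
        stem = word[: -len(suffix)]
        if measure(stem) > min_measure:
            return stem
    return word


print(step_1a("caresses"), step_1a("ponies"), step_1a("cats"))
print(strip_if_measure("requirement", "ement", 1))
print(strip_if_measure("cement", "ement", 1))
```

The measure condition is what stops short words from being over-stripped: "cement" keeps its ending because the remaining stem "c" has measure 0, whereas "requirement" is reduced to "requir".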

Stemming vs. Lemmatization

Lemmatization is a related but more sophisticated approach. Compared with stemming:

  • Lemmatization uses vocabulary and morphological analysis, whereas stemming uses simple heuristic rules
  • Lemmatization returns dictionary forms of the words, whereas stemming may result in invalid words

The differences between lemmatization and stemming are shown below.

Actual Word    Lemmatization    Stemming
Requiring      Require          Requir
Required       Require          Requir
Requirement    Requirement      Requir
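The contrast can be reproduced with a small sketch. The three-entry lexicon and both helper functions are hypothetical: the dictionary lookup merely stands in for the vocabulary and morphological analysis a real lemmatizer performs, and the suffix list is chosen to fit these example words.

```python
# Hypothetical mini-lexicon mapping word forms to their dictionary forms.
LEMMA_LEXICON = {
    "requiring": "require",
    "required": "require",
    "requirement": "requirement",
}


def toy_lemmatize(word):
    """Look the word up in the lexicon; unknown words are returned unchanged."""
    lowered = word.lower()
    return LEMMA_LEXICON.get(lowered, lowered)


def toy_stem(word):
    """Strip one suffix by rule, with no knowledge of the vocabulary."""
    lowered = word.lower()
    for suffix in ("ement", "ing", "ed"):
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


rows = [(w, toy_lemmatize(w), toy_stem(w))
        for w in ("Requiring", "Required", "Requirement")]
for word, lemma, stem in rows:
    print(f"{word:<14}{lemma:<14}{stem}")
```

The lemmatizer keeps "requirement" distinct from "require" because they are separate dictionary entries, while the rule-based stemmer collapses all three forms to "requir".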

In MATLAB, stemming can be done using the “normalizeWords” function with its default style option, ‘stem’. To learn more about stemming and building models with text data, see Text Analytics Toolbox™.

See also: natural language processing, sentiment analysis, word2vec, n-gram, text mining with MATLAB, data science, deep learning, Deep Learning Toolbox™, Statistics and Machine Learning Toolbox™