Extract features from text to build models for natural language processing (NLP) applications
The bag-of-words (BoW) model is one of the simplest feature extraction techniques, used in many natural language processing (NLP) applications such as text classification, sentiment analysis, and topic modeling. A bag-of-words model is built by counting how many times each unique feature, such as a word or symbol, occurs in a document.
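The counting step can be sketched in a few lines. The following is a minimal illustration in plain Python, not the MATLAB bagOfWords function; the helper names tokenize and bag_of_words, the lowercase regular-expression tokenizer, and the two sample sentences are all assumptions made for this sketch.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, counts); counts[i][j] is how often vocabulary[j] occurs in documents[i]."""
    token_lists = [tokenize(doc) for doc in documents]
    # The vocabulary is the sorted set of unique tokens across all documents.
    vocabulary = sorted({tok for toks in token_lists for tok in toks})
    counts = []
    for toks in token_lists:
        freq = Counter(toks)
        counts.append([freq[word] for word in vocabulary])
    return vocabulary, counts


# Hypothetical sample documents, for illustration only.
documents = ["The cat sat on the mat.", "The dog chased the cat."]
vocabulary, counts = bag_of_words(documents)

print(vocabulary)
for row in counts:
    print(row)
```

Each row of counts is the feature vector for one document. Most entries are zero once the vocabulary grows, which is why bag-of-words matrices are usually stored in sparse form.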
Example
In this example, the MATLAB® function bagOfWords creates a bag-of-words model from a collection of abstracts of math papers published on arXiv. One of the easiest ways to visualize the model is to plot a word cloud using the MATLAB function wordcloud(bag). Words displayed in bigger fonts and in orange are the most dominant (frequent) in the bag-of-words model.
When to Use Bag-of-Words Models
Bag-of-words is easy to understand and implement. As a result, it is often the first method used to build models with text data. However, bag-of-words has several limitations, including:
- Lack of context: Bag-of-words models do not preserve the order in which features appear in a document, which can remove important information. For example, “is this a good day” and “this is a good day” contain the same words with the same counts, so a bag-of-words model represents the question and the statement identically.
- Unpredictable model quality: Including all features from a document in a bag-of-words model can increase the model size, resulting in sparsity and numerical instabilities. Careful preprocessing of the document text is often required to build a useful bag-of-words model.
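Both limitations can be seen in a small sketch. The Python below is an illustration under stated assumptions, not part of any toolbox: the helper names bag and preprocess, the whitespace tokenizer, and the three-word stop list are hypothetical choices made only for this example.

```python
from collections import Counter


def bag(text):
    """Count whitespace-separated, lowercased tokens (a simplifying assumption)."""
    return Counter(text.lower().split())


question = "is this a good day"
statement = "this is a good day"

# Word order is discarded, so the question and the statement get the same bag.
same_bag = bag(question) == bag(statement)
print(same_bag)

# Hypothetical stop list; a real one would be much longer.
STOP_WORDS = {"is", "this", "a"}


def preprocess(text):
    """Drop stop words so that fewer, more informative features remain."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


full = bag(statement)
reduced = Counter(preprocess(statement))
print(len(full), len(reduced))
```

Removing stop words is only one preprocessing choice; stemming, lemmatization, and dropping very rare words reduce the vocabulary in a similar way.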
Alternatives to Bag-of-Words Models
Several alternative models avoid some of the limitations of bag-of-words:
- bag-of-n-grams: counts sequences of n consecutive words (n-grams) instead of single words, which preserves some local word order
- term frequency–inverse document frequency (tf-idf): weights each word count by how rare the word is across the document collection, so that the weights reflect how important a word is to a document
- word embedding: maps features to dense numerical vectors (distributed representations) using methods such as word2vec, GloVe, and fastText
- transformer models: use pretrained deep learning models for transfer learning
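The first two alternatives are simple enough to sketch directly. The Python below is a minimal illustration, not a library implementation: the helper names ngrams and tfidf and the toy documents are assumptions, and the tf-idf weighting used here (raw count multiplied by log(N / document frequency)) is only one of several common variants.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (joined with spaces) in a token sequence."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(token_lists):
    """Weight each term count by log(N / df), where df is the number of documents containing the term."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for toks in token_lists:
        doc_freq.update(set(toks))  # count each term once per document
    weights = []
    for toks in token_lists:
        tf = Counter(toks)
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights


# Bigrams keep some word order, so the question and the statement now differ.
question = "is this a good day".split()
statement = "this is a good day".split()
bigrams_differ = Counter(ngrams(question, 2)) != Counter(ngrams(statement, 2))
print(bigrams_differ)

# Hypothetical toy collection: "good" occurs in every document, so its weight is zero.
docs = [["good", "day"], ["good", "night"], ["good", "good", "morning"]]
weights = tfidf(docs)
print(weights[0])
```

In this sketch a word that appears in every document receives a weight of zero regardless of its count, which is how tf-idf downplays features that do not help tell documents apart.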
However, bag-of-words is easy to understand and implement and is sufficient for many use cases. To learn more about bag-of-words and other modeling techniques for text data, see Text Analytics Toolbox™ for use with MATLAB.
See also: natural language processing, text analytics, sentiment analysis, word2vec, text mining with MATLAB, lemmatization, stemming, n-gram, data science, deep learning, ngram