bagOfWords
Bag-of-words model
Description
A bag-of-words model (also known as a term-frequency counter) records the number of times that words appear in each document of a collection.
bagOfWords does not split text into words. To create an array of tokenized documents, see tokenizedDocument.
Creation
Description
bag = bagOfWords creates an empty bag-of-words model.
bag = bagOfWords(documents) counts the words appearing in documents and returns a bag-of-words model.
bag = bagOfWords(uniqueWords,counts) creates a bag-of-words model using the words in uniqueWords and the corresponding frequency counts in counts.
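The three creation syntaxes can be sketched as follows. This is a minimal illustration, assuming Text Analytics Toolbox is installed. The sample strings, the vocabulary, and the count matrix are made-up illustration data, and the variable names emptyBag and bagFromCounts are hypothetical.

```matlab
% Hypothetical sample text. tokenizedDocument splits each string into
% words, because bagOfWords itself does not tokenize.
str = [
    "an example of a short sentence"
    "a second short sentence"];
documents = tokenizedDocument(str);

% Syntax 1: create an empty model.
emptyBag = bagOfWords;

% Syntax 2: count the words in an array of tokenized documents.
bag = bagOfWords(documents);

% Syntax 3: build a model from a vocabulary and a count matrix.
% Each row of counts is one document; each column matches one word.
uniqueWords = ["a" "an" "another" "example" "final" "sentence" "third"];
counts = [
    1 2 0 1 0 1 0
    0 0 3 1 0 4 0
    1 0 0 5 0 3 1
    1 0 0 1 7 0 0];
bagFromCounts = bagOfWords(uniqueWords,counts);

% Inspect the result through the model properties.
disp(bag.NumDocuments)
disp(bag.NumWords)
disp(bag.Vocabulary)
```

In the third syntax, the number of columns in counts is assumed to equal the number of words in uniqueWords, with one row per document.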
Input Arguments
Properties
Object Functions
encode | Encode documents as matrix of word or n-gram counts |
tfidf | Term Frequency–Inverse Document Frequency (tf-idf) matrix |
topkwords | Most important words in bag-of-words model or LDA topic |
addDocument | Add documents to bag-of-words or bag-of-n-grams model |
removeDocument | Remove documents from bag-of-words or bag-of-n-grams model |
removeEmptyDocuments | Remove empty documents from tokenized document array, bag-of-words model, or bag-of-n-grams model |
removeWords | Remove selected words from documents or bag-of-words model |
removeInfrequentWords | Remove words with low counts from bag-of-words model |
join | Combine multiple bag-of-words or bag-of-n-grams models |
wordcloud | Create word cloud chart from text, bag-of-words model, bag-of-n-grams model, or LDA model |
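A few of the object functions listed above can be chained in a typical cleanup workflow. The sketch below is one possible sequence, not a recommended recipe. The sample sentences and the list of words to remove are hypothetical.

```matlab
% Hypothetical sample documents.
str = [
    "the quick brown fox jumps over the lazy dog"
    "the lazy dog sleeps"
    "a quick fox"];
documents = tokenizedDocument(str);
bag = bagOfWords(documents);

% Table of the three most frequent words (variables Word and Count).
tbl = topkwords(bag,3);

% Remove a hand-picked list of words from the model.
bag = removeWords(bag,["the" "a"]);

% Remove words that appear at most once in total.
bag = removeInfrequentWords(bag,1);

% tf-idf matrix with one row per document and one column per word.
M = tfidf(bag);

% Encode a new document as word counts over the model vocabulary.
% Words outside the vocabulary are not counted.
newDocuments = tokenizedDocument("the quick dog");
X = encode(bag,newDocuments);
```

Both tfidf and encode return matrices whose columns follow the order of bag.Vocabulary, so the two outputs can be compared column by column.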
Examples
Tips
If you intend to use a held-out test set for your work, then partition your text data before using bagOfWords. Otherwise, the bag-of-words model may bias your analysis.
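One way to follow this tip is to split the raw text first, build the model from the training portion only, and then encode the held-out documents against that model. The sketch below assumes a simple fixed split of made-up sentences. With real data, a random partition would be more typical. The names strTrain, strTest, XTrain, and XTest are hypothetical.

```matlab
% Hypothetical sample text.
str = [
    "the cat sat on the mat"
    "the dog chased the cat"
    "a bird sang in the tree"
    "the cat climbed the tree"
    "the fish swam in the pond"
    "a dog barked at the bird"];

% Partition BEFORE creating the model: the first four strings are used
% for training and the last two are held out.
numTrain = 4;
strTrain = str(1:numTrain);
strTest = str(numTrain+1:end);

documentsTrain = tokenizedDocument(strTrain);
documentsTest = tokenizedDocument(strTest);

% The vocabulary comes from the training documents only.
bag = bagOfWords(documentsTrain);

% Training counts come straight from the model. Test documents are
% encoded against the training vocabulary, so words seen only in the
% test set (for example "fish") are not counted.
XTrain = bag.Counts;
XTest = encode(bag,documentsTest);
```

Because the vocabulary is fixed by the training documents, XTrain and XTest have the same number of columns and no information from the test set reaches the model.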
Version History
Introduced in R2017b
See Also
bagOfNgrams | addDocument | removeDocument | removeInfrequentWords | removeWords | removeEmptyDocuments | topkwords | encode | tfidf | tokenizedDocument