tfidf
Term Frequency–Inverse Document Frequency (tf-idf) matrix
Description
Examples
Create tf-idf Matrix
Create a Term Frequency–Inverse Document Frequency (tf-idf) matrix from a bag-of-words model.
Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
Create a bag-of-words model using bagOfWords.
bag = bagOfWords(documents)
bag =
  bagOfWords with properties:
          Counts: [154x3092 double]
      Vocabulary: ["fairest" "creatures" "desire" "increase" "thereby" "beautys" "rose" "might" "never" "die" "riper" "time" "decease" "tender" "heir" "bear" "memory" "thou" "contracted" ... ]
        NumWords: 3092
    NumDocuments: 154
Create a tf-idf matrix. View the first 10 rows and columns.
M = tfidf(bag);
full(M(1:10,1:10))
ans = 10×10
3.6507 4.3438 2.7344 3.6507 4.3438 2.2644 3.2452 3.8918 2.4720 2.5520
0 0 0 0 0 4.5287 0 0 0 0
0 0 0 0 0 0 0 0 0 2.5520
0 0 0 0 0 2.2644 0 0 0 0
0 0 0 0 0 2.2644 0 0 0 0
0 0 0 0 0 2.2644 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 2.2644 0 0 0 2.5520
0 0 2.7344 0 0 0 0 0 0 0
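The values above are consistent with the default weighting, that is, raw term counts multiplied by log(N/NT) with the natural logarithm; for example, 3.6507 matches log(154/4). The following is a minimal sketch of that computation on a small made-up count matrix, using only base MATLAB. The names counts, idf, and Mmanual are illustrative and are not part of the toolbox.

```matlab
% Toy term-count matrix: rows are documents, columns are words (made-up data).
counts = [1 0 2
          0 1 1
          0 0 1];

N  = size(counts,1);       % number of documents
NT = sum(counts > 0,1);    % number of documents containing each word

idf = log(N ./ NT);        % 'normal' IDF factor, natural logarithm
Mmanual = counts .* idf;   % raw TF times IDF (implicit expansion across rows)

disp(Mmanual)
```

In this sketch the third word appears in every document, so its IDF factor is log(1) = 0 and its column of Mmanual is all zeros.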
Create tf-idf Matrix from New Documents
Create a Term Frequency-Inverse Document Frequency (tf-idf) matrix from a bag-of-words model and an array of new documents.
Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
Create a bag-of-words model from the documents.
bag = bagOfWords(documents)
bag =
  bagOfWords with properties:
          Counts: [154x3092 double]
      Vocabulary: ["fairest" "creatures" "desire" "increase" "thereby" "beautys" "rose" "might" "never" "die" "riper" "time" "decease" "tender" "heir" "bear" "memory" "thou" "contracted" ... ]
        NumWords: 3092
    NumDocuments: 154
Create a tf-idf matrix for an array of new documents using the inverse document frequency (IDF) factor computed from bag.
newDocuments = tokenizedDocument([
    "what's in a name? a rose by any other name would smell as sweet."
    "if music be the food of love, play on."]);
M = tfidf(bag,newDocuments)
M =
   (1,7)       3.2452
   (1,36)      1.2303
   (2,197)     3.4275
   (2,313)     3.6507
   (2,387)     0.6061
   (1,1205)    4.7958
   (1,1835)    3.6507
   (2,1917)    5.0370
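As a rough sketch of what this syntax does, the IDF factor comes from the original collection while the term counts come from the new documents. The example below uses base MATLAB with a made-up three-word vocabulary, and it assumes that words outside the vocabulary are simply ignored. The names vocab, trainCounts, newWords, and Mnew are hypothetical.

```matlab
% Made-up vocabulary and training counts (rows are documents).
vocab = ["rose" "love" "sweet"];
trainCounts = [1 0 0
               0 2 1
               0 1 0
               0 0 0];

% IDF factor computed from the training collection only.
N   = size(trainCounts,1);
NT  = sum(trainCounts > 0,1);
idf = log(N ./ NT);

% Count vocabulary words in one new document.
% Words that are not in vocab are skipped (assumption).
newWords = ["a" "rose" "by" "any" "other" "name" "would" "smell" "as" "sweet"];
[inVocab,idx] = ismember(newWords,vocab);
newCounts = accumarray(idx(inVocab).',1,[numel(vocab) 1]).';

% tf-idf row for the new document, using the training IDF.
Mnew = newCounts .* idf;
```

The result has one column per vocabulary word, so it lines up with a matrix computed from the original collection.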
Specify TF Weight Formulas
Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
Create a bag-of-words model using bagOfWords.
bag = bagOfWords(documents)
bag =
  bagOfWords with properties:
          Counts: [154x3092 double]
      Vocabulary: ["fairest" "creatures" "desire" "increase" "thereby" "beautys" "rose" "might" "never" "die" "riper" "time" "decease" "tender" "heir" "bear" "memory" "thou" "contracted" ... ]
        NumWords: 3092
    NumDocuments: 154
Create a tf-idf matrix. View the first 10 rows and columns.
M = tfidf(bag);
full(M(1:10,1:10))
ans = 10×10
3.6507 4.3438 2.7344 3.6507 4.3438 2.2644 3.2452 3.8918 2.4720 2.5520
0 0 0 0 0 4.5287 0 0 0 0
0 0 0 0 0 0 0 0 0 2.5520
0 0 0 0 0 2.2644 0 0 0 0
0 0 0 0 0 2.2644 0 0 0 0
0 0 0 0 0 2.2644 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 2.2644 0 0 0 2.5520
0 0 2.7344 0 0 0 0 0 0 0
You can change the contributions made by the TF and IDF factors to the tf-idf matrix by specifying the TF and IDF weight formulas.
To ignore how many times a word appears in a document, use the binary option of 'TFWeight'. Create a tf-idf matrix and set 'TFWeight' to 'binary'. View the first 10 rows and columns.
M = tfidf(bag,'TFWeight','binary');
full(M(1:10,1:10))
ans = 10×10
3.6507 4.3438 2.7344 3.6507 4.3438 2.2644 3.2452 1.9459 2.4720 2.5520
0 0 0 0 0 2.2644 0 0 0 0
0 0 0 0 0 0 0 0 0 2.5520
0 0 0 0 0 2.2644 0 0 0 0
0 0 0 0 0 2.2644 0 0 0 0
0 0 0 0 0 2.2644 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 2.2644 0 0 0 2.5520
0 0 2.7344 0 0 0 0 0 0 0
Input Arguments
bag
— Input bag-of-words or bag-of-n-grams model
bagOfWords object | bagOfNgrams object
Input bag-of-words or bag-of-n-grams model, specified as a bagOfWords object or a bagOfNgrams object.
documents
— Input documents
tokenizedDocument array | string array of words | cell array of character vectors
Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To specify multiple documents, use a tokenizedDocument array.
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: 'Normalized',true specifies to normalize the frequency counts.
TFWeight
— Method to set term frequency factor
'raw' (default) | 'binary' | 'log'
Method to set term frequency (TF) factor, specified as the comma-separated pair consisting of 'TFWeight' and one of the following:
'raw' – Set the TF factor to the unchanged term counts.
'binary' – Set the TF factor to the matrix of ones and zeros where the ones indicate whether a term is in a document.
'log' – Set the TF factor to 1 + log(bag.Counts).
Example: 'TFWeight','binary'
Data Types: char
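The three options can be sketched on a small made-up count matrix using base MATLAB. For 'log', this sketch assumes the formula is applied only to nonzero counts so that absent words keep a factor of zero; that handling is an assumption and is not stated above. The variable names are illustrative.

```matlab
% Made-up term counts: rows are documents, columns are words.
counts = [1 0 2
          0 3 1];

% 'raw': the unchanged term counts.
tfRaw = counts;

% 'binary': 1 where the word occurs in the document, 0 otherwise.
tfBinary = double(counts > 0);

% 'log': 1 + log(count), applied to nonzero counts only (assumption).
tfLog = zeros(size(counts));
nz = counts > 0;
tfLog(nz) = 1 + log(counts(nz));
```

A count of 1 maps to 1 under all three options; the options differ only for words that occur more than once in a document.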
IDFWeight
— Method to compute inverse document frequency factor
'normal' (default) | 'textrank' | 'classic-bm25' | 'unary' | 'smooth' | 'max' | 'probabilistic'
Method to compute inverse document frequency factor, specified as the comma-separated pair consisting of 'IDFWeight' and one of the following:
'textrank' – Use TextRank IDF weighting [1]. For each term, set the IDF factor to
log((N-NT+0.5)/(NT+0.5)) if the term occurs in more than half of the documents, where N is the number of documents in the input data and NT is the number of documents in the input data containing each term.
IDFCorrection*avgIDF if the term occurs in half of the documents or fewer, where avgIDF is the average IDF of all tokens.
'classic-bm25' – For each term, set the IDF factor to log((N-NT+0.5)/(NT+0.5)).
'normal' – For each term, set the IDF factor to log(N/NT).
'unary' – For each term, set the IDF factor to 1.
'smooth' – For each term, set the IDF factor to log(1+N/NT).
'max' – For each term, set the IDF factor to log(1+max(NT)/NT).
'probabilistic' – For each term, set the IDF factor to log((N-NT)/NT).
where N is the number of documents in the input data and NT is the number of documents in the input data containing each term.
Example: 'IDFWeight','smooth'
Data Types: char
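To compare the closed-form options, the formulas can be evaluated directly in base MATLAB on made-up document frequencies. This is only a sketch of the formulas as written above, not of the function's internals; the 'textrank' option is left out because it also depends on IDFCorrection and the average IDF. All variable names are illustrative.

```matlab
N  = 4;          % number of documents (made-up)
NT = [1 2 4];    % number of documents containing each of three words (made-up)

idfNormal = log(N ./ NT);                       % 'normal'
idfUnary  = ones(size(NT));                     % 'unary'
idfSmooth = log(1 + N ./ NT);                   % 'smooth'
idfMax    = log(1 + max(NT) ./ NT);             % 'max'
idfProb   = log((N - NT) ./ NT);                % 'probabilistic'
idfBM25   = log((N - NT + 0.5) ./ (NT + 0.5));  % 'classic-bm25'
```

For the word that occurs in every document (NT = N), the 'normal' formula gives 0, the 'classic-bm25' formula gives a negative value, and the 'probabilistic' formula evaluates to -Inf in this sketch, while 'smooth' and 'max' stay positive.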
IDFCorrection
— Inverse document frequency correction factor
0.25 (default) | nonnegative scalar
Inverse document frequency correction factor, specified as the comma-separated pair consisting of 'IDFCorrection' and a nonnegative scalar.
This option only applies when 'IDFWeight' is 'textrank'.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
Normalized
— Option to normalize term counts
false (default) | true
Option to normalize term counts, specified as the comma-separated pair consisting of 'Normalized' and true or false. If true, then the function normalizes each vector of term counts in the Euclidean norm.
Example: 'Normalized',true
Data Types: logical
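Normalizing in the Euclidean norm means dividing a vector by its 2-norm so that the result has unit length. A minimal base MATLAB sketch on one made-up vector follows. How tfidf treats an all-zero vector is not stated above, so leaving it unchanged here is an assumption; v and vNormalized are illustrative names.

```matlab
v = [3 0 4];    % made-up vector for one document

n = norm(v);    % Euclidean (2-)norm, here 5
if n > 0
    vNormalized = v ./ n;    % unit-length vector
else
    vNormalized = v;         % all-zero vector left unchanged (assumption)
end
```

After normalization, documents of different lengths become comparable because every nonzero vector has norm 1.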
DocumentsIn
— Orientation of output documents
'rows' (default) | 'columns'
Orientation of output documents in the frequency count matrix, specified as the comma-separated pair consisting of 'DocumentsIn' and one of the following:
'rows' – Return a matrix of frequency counts with rows corresponding to documents.
'columns' – Return a transposed matrix of frequency counts with columns corresponding to documents.
Data Types: char
ForceCellOutput
— Indicator for forcing output to be returned as cell array
false (default) | true
Indicator for forcing output to be returned as cell array, specified as the comma-separated pair consisting of 'ForceCellOutput' and true or false.
Data Types: logical
Output Arguments
M
— Output Term Frequency-Inverse Document Frequency matrix
sparse matrix | cell array of sparse matrices
Output Term Frequency-Inverse Document Frequency matrix, returned as a sparse matrix or a cell array of sparse matrices.
If bag is a non-scalar array or 'ForceCellOutput' is true, then the function returns the outputs as a cell array of sparse matrices. Each element in the cell array is the tf-idf matrix calculated from the corresponding element of bag.
References
[1] Barrios, Federico, Federico López, Luis Argerich, and Rosa Wachenchauzer. "Variations of the Similarity Function of TextRank for Automated Summarization." arXiv preprint arXiv:1602.03606 (2016).
Version History
Introduced in R2017b
See Also
bagOfWords | bagOfNgrams | topkwords | topkngrams | encode | tokenizedDocument