correctSpelling
Syntax
updatedDocuments = correctSpelling(documents)
updatedWords = correctSpelling(words)
updatedWords = correctSpelling(words,'Language',language)
[___,unknownWords] = correctSpelling(___)
___ = correctSpelling(___,Name,Value)
Description
Use correctSpelling to correct the spelling of words in string arrays or documents.
The function supports English, German, and Korean text.
updatedDocuments = correctSpelling(documents) corrects the spelling of the words in the tokenizedDocument array documents.
updatedWords = correctSpelling(words) corrects the spelling of the words in the string vector words.
updatedWords = correctSpelling(words,'Language',language) also specifies the language of the words in the string vector words.
[___,unknownWords] = correctSpelling(___) also returns a vector of the input words that were not found in the dictionary and for which no suggestion was found.
___ = correctSpelling(___,Name,Value) specifies additional options using one or more name-value pair arguments.
Examples
Correct Spelling of Words in Documents
Create a tokenized document array.
str = [
    "A documnent containing some misspelled worrds."
    "Another documnent cntaining typos."];
documents = tokenizedDocument(str);
Correct the spelling of the words in the documents using the correctSpelling function.
updatedDocuments = correctSpelling(documents)
updatedDocuments =
  2x1 tokenizedDocument:
    7 tokens: A document containing some misspelled words .
    5 tokens: Another document containing typos .
Correct Spelling of Words in String Array
Create a string array of words.
words = ["A" "strng" "array" "containing" "misspelled" "worrds" "."];
Correct the spelling of the words in the string array using the correctSpelling function.
updatedWords = correctSpelling(words)
updatedWords = 1x7 string
"A" "string" "array" "containing" "misspelled" "words" "."
Specify Known Words
Create a tokenized document array.
str = [
    "Analyze text data using MATLAB."
    "Another documnent cntaining typos."];
documents = tokenizedDocument(str);
Correct the spelling of the words in the documents using the correctSpelling function.
updatedDocuments = correctSpelling(documents)
updatedDocuments =
  2x1 tokenizedDocument:
    7 tokens: Analyze text data using MAT LAB .
    5 tokens: Another document containing typos .
Notice that the word "MATLAB" gets split into the two words "MAT" and "LAB".
Correct the spelling of the documents and specify "MATLAB" as a known word using the 'KnownWords' option.
updatedDocuments = correctSpelling(documents,'KnownWords',"MATLAB")
updatedDocuments =
  2x1 tokenizedDocument:
    6 tokens: Analyze text data using MATLAB .
    5 tokens: Another document containing typos .
Input Arguments
documents — Input documents
tokenizedDocument array
Input documents, specified as a tokenizedDocument array.
words — Input words
string vector | character vector | cell array of character vectors
Input words, specified as a string vector, character vector, or cell array of character vectors. If you specify words as a character vector, then the function treats the argument as a single word.
Data Types: string | char | cell
language — Word language
'en' | 'de' | 'ko'
Word language, specified as one of the following:
- 'en' – English language
- 'de' – German language
- 'ko' – Korean language
If you do not specify language, then the software detects the language automatically.
Data Types: char | string
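As a minimal sketch of the 'Language' option, the following passes a German string vector and sets the language explicitly rather than relying on automatic detection. The sample words are made up for illustration, and the corrected result depends on the installed dictionary, so no output is shown.

```matlab
% Hypothetical German input containing two deliberate misspellings.
words = ["Ein" "Dokumment" "mit" "Tipfehlern" "."];

% Set the language explicitly to German ('de') instead of relying on
% automatic detection. Requires Text Analytics Toolbox.
updatedWords = correctSpelling(words,'Language','de');
```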
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: correctSpelling(documents,'KnownWords',["bat365" "MATLAB"]) corrects the spelling of the words in documents and treats the words "bat365" and "MATLAB" as correctly spelled words.
KnownWords — Words to be treated as correct
[] (default) | string array | cell array of character vectors
Words to be treated as correct, specified as the comma-separated pair consisting of 'KnownWords' and a string array or a cell array of character vectors.
If you specify a list of known words, then these words remain unchanged when the function corrects spelling. The software may also substitute misspelled words with words from the list of known words.
Example: ["bat365" "MATLAB"]
Data Types: char | string | cell
ExtensionDictionary — Hunspell extension dictionary file
'' (default) | file path
Hunspell extension dictionary file (also known as personal dictionary file), specified as the comma-separated pair consisting of 'ExtensionDictionary' and a file path of a Hunspell extension dictionary file.
A Hunspell extension dictionary file is a .dic file containing a list of words, one entry per line, in the following format:
word1/affixWord1
word2/affixWord2
...
wordN/affixWordN
*forbiddenWord1
*forbiddenWord2
...
*forbiddenWordM
- word1, word2, …, wordN is a list of words with which to extend the Hunspell dictionary.
- affixWord1, affixWord2, …, affixWordN (optional) indicate words in the Hunspell dictionary that share affixes. Indicate affixes by concatenating them to the corresponding word with a forward slash (/). For example, the entry exxxtreme/extreme indicates that affixes that apply to the word "extreme" also apply to the custom word "exxxtreme".
- forbiddenWord1, forbiddenWord2, …, forbiddenWordM is a list of forbidden words to use for spelling correction. Indicate forbidden words using an asterisk (*).
The entries in the Hunspell extension dictionary file can appear in any order.
When you specify words in a Hunspell dictionary file, you must specify words in their base form. For example, to ensure that the correctSpelling function does not convert the string "decrese" to "decrees" using an extension dictionary, specify the base word "decree" as a forbidden word.
For example, to create a Hunspell extension dictionary file specifying that:
- The words "bat365", "MATLAB", and "exxxtreme" are known words.
- The affixes that apply to the word "extreme" also apply to the word "exxxtreme".
- The word "NaN" is a forbidden word.
use:
bat365
MATLAB
exxxtreme/extreme
*NaN
For an example showing how to create Hunspell extension dictionary files, see Create Extension Dictionary for Spelling Correction. For more information about the options of Hunspell dictionary files, see https://manpages.ubuntu.com/manpages/trusty/en/man4/hunspell.4.html.
Data Types: char | string
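The extension dictionary described above can be written to disk from MATLAB and then passed to the function. This is a sketch only: the file name customWords.dic and the sample words are hypothetical, and the corrected output depends on the installed dictionary.

```matlab
% Entries taken from the extension dictionary example above.
entries = [
    "bat365"
    "MATLAB"
    "exxxtreme/extreme"
    "*NaN"];

% Hypothetical file name, written to the current folder.
filename = "customWords.dic";
fid = fopen(filename,'w');
assert(fid ~= -1,'Could not open the extension dictionary file for writing.')
fprintf(fid,'%s\n',join(entries,newline));   % one entry per line
fclose(fid);

% Correct spelling using the extension dictionary.
% Requires Text Analytics Toolbox.
words = ["A" "documnent" "about" "MATLAB" "."];
updatedWords = correctSpelling(words,'ExtensionDictionary',filename);
```

Writing the file from a string array keeps the dictionary entries next to the code that uses them, which can make small experiments easier to reproduce.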
Dictionary — Hunspell dictionary file
'' (default) | file path
Hunspell dictionary file, specified as the comma-separated pair consisting of 'Dictionary' and a file path of a Hunspell dictionary file.
A Hunspell dictionary file is a .dic file containing the number of words in the dictionary followed by a list of the words in the following format:
N
word1/flags1
word2/flags2
...
wordN/flagsN
where N is the number of words in the dictionary file, word1, word2, …, wordN are the N words in the dictionary, and flags1, …, flagsN specify optional flags corresponding to the words word1, word2, …, wordN, respectively. Use flags to specify word attributes, for example affixes. To specify a Hunspell affix file, use the 'Affixes' option.
For example, to create a Hunspell dictionary file containing the four words "bat365", "MATLAB", "correctSpelling", and "tokenizedDocument", use:
4
bat365
MATLAB
correctSpelling
tokenizedDocument
For more information about the options of Hunspell dictionary files, see https://manpages.ubuntu.com/manpages/trusty/en/man4/hunspell.4.html.
Data Types: char | string
Affixes — Hunspell affix file
'' (default) | file path
Hunspell affix file, specified as the comma-separated pair consisting of 'Affixes' and a file path of a Hunspell affix file.
A Hunspell affix file is a .aff file containing a list of options in the following format:
option1 values1
option2 values2
...
optionM valuesM
where M is the number of options in the affix file, option1, option2, …, optionM are the M options, and values1, …, valuesM specify the values corresponding to the options option1, option2, …, optionM, respectively. Use these options to specify affixes.
Prefixes
To define a prefix rule, use the PFX option with the format:
PFX flag crossProduct K
PFX flag stripping1 prefix1 condition1
...
PFX flag strippingK prefixK conditionK
- flag corresponds to the flags used in the Hunspell dictionary file.
- crossProduct indicates whether prefixes and suffixes can be mixed, specified as Y or N.
- K is the number of prefixes defined for the specified flag.
- stripping1, stripping2, …, strippingK indicate characters to be stripped from the word when applying the prefix. If the stripping value is 0, then no stripping takes place.
- prefix1, prefix2, …, prefixK specify the prefixes to use.
- condition1, condition2, …, conditionK specify the optional conditions for which to apply the prefixes prefix1, prefix2, …, prefixK, respectively. For the trivial condition, specify ".".
Suffixes
To define a suffix rule, use the SFX option with the format:
SFX flag crossProduct K
SFX flag stripping1 suffix1 condition1
...
SFX flag strippingK suffixK conditionK
suffix1, suffix2, …, suffixK specify the suffixes to use, and the flag, cross product, K, stripping, and condition values are the same as in the prefix format.
Example
To create a Hunspell affix file defining the following affix rules:
- Flag A:
  - Prefix words with "re".
- Flag B:
  - Suffix words not ending with "y" with "ed".
  - Suffix words ending with "y" with "ied", removing "y".
use the Hunspell affix file:
PFX A Y 1
PFX A 0 re .
SFX B Y 2
SFX B 0 ed [^y]
SFX B y ied y
To use these flags in a Hunspell dictionary file, append the appropriate flags to the words using a forward slash ("/"). For each word, you can specify multiple flags. For example, to specify a dictionary file containing:
- The words "ptest" and "ptry".
- For the word "ptest" only, also include the prefix "re" using flag A.
- For both words, also include the suffixes "ed" or "ied" where appropriate using flag B.
use:
2
ptest/AB
ptry/B
For more information about the options of Hunspell affix files, see https://manpages.ubuntu.com/manpages/trusty/en/man4/hunspell.4.html.
Data Types: char | string
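The affix and dictionary formats described above can be combined in a short script. The sketch below writes an affix file with the flag A and flag B rules and a dictionary file for the words "ptest" and "ptry", then passes both files to the function. The file names and the test words are hypothetical, the suffix header count is written as 2 because two suffix rules follow it (per the definition of K), and using 'Dictionary' and 'Affixes' together in one call is an assumption based on the option descriptions. No output is shown because the result depends on how the function applies the custom files.

```matlab
% Affix rules: flag A adds the prefix "re"; flag B adds "ed", or "ied"
% after stripping a trailing "y".
affLines = [
    "PFX A Y 1"
    "PFX A 0 re ."
    "SFX B Y 2"
    "SFX B 0 ed [^y]"
    "SFX B y ied y"];

% Dictionary: the word count, then each word followed by its flags.
dicLines = [
    "2"
    "ptest/AB"
    "ptry/B"];

% Hypothetical file names, written to the current folder.
affFile = "custom.aff";
dicFile = "custom.dic";

fid = fopen(affFile,'w');
assert(fid ~= -1,'Could not open the affix file for writing.')
fprintf(fid,'%s\n',join(affLines,newline));
fclose(fid);

fid = fopen(dicFile,'w');
assert(fid ~= -1,'Could not open the dictionary file for writing.')
fprintf(fid,'%s\n',join(dicLines,newline));
fclose(fid);

% Correct spelling against the custom dictionary and affix rules.
% Requires Text Analytics Toolbox.
words = ["reptest" "ptryed" "."];
updatedWords = correctSpelling(words,'Dictionary',dicFile,'Affixes',affFile);
```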
RetokenizeMethod — Method to retokenize documents
'split' (default) | 'none'
Method to retokenize documents, specified as the comma-separated pair consisting of 'RetokenizeMethod' and one of the following:
- 'split' – Correct spelling by splitting tokens. For example, split the incorrectly spelled token "twowords" into the correctly spelled tokens "two" and "words".
- 'none' – Do not split tokens for spelling correction.
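To see the effect of the two retokenization methods, one can compare token counts before and after correction. The following is a sketch that reuses the "MATLAB" sentence from the examples. The variable names are illustrative, and the corrected tokens themselves depend on the dictionary, so only the counts are compared.

```matlab
% Requires Text Analytics Toolbox.
documents = tokenizedDocument("Analyze text data using MATLAB.");

% Default behavior: tokens may be split during correction.
splitDocuments = correctSpelling(documents);

% Disable splitting: each input token maps to exactly one output token.
unsplitDocuments = correctSpelling(documents,'RetokenizeMethod','none');

% Compare the number of tokens in each result with the input.
numInput = doclength(documents);
numSplit = doclength(splitDocuments);
numUnsplit = doclength(unsplitDocuments);
```

With 'none', numUnsplit equals numInput because no token is split, whereas numSplit can be larger.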
Output Arguments
updatedDocuments — Corrected documents
tokenizedDocument array
Corrected documents, returned as a tokenizedDocument array. If the 'RetokenizeMethod' option is 'split', then the number of words in each updated document may differ from the number in the corresponding input document.
If there are multiple candidates for corrected words, then the function automatically selects a single word for correction.
updatedWords — Corrected words
string vector
Corrected words, returned as a string vector. If the 'RetokenizeMethod' option is 'split', then the number of updated words may differ from the number of input words.
If there are multiple candidates for corrected words, then the function automatically selects a single word for correction.
unknownWords — Unknown words
string vector
Unknown words, returned as a string vector. The string vector unknownWords contains the input words that are not in the spelling correction dictionary and for which no suggestions are found.
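A minimal sketch of requesting the second output follows. The input words are made up for illustration, with one token ("xqzvkt") chosen as an unlikely dictionary match. Which words end up in unknownWords depends on the dictionary, so no output is shown.

```matlab
% Hypothetical input: one fixable typo and one token that is unlikely to
% have any suggestion.
words = ["A" "strng" "array" "with" "xqzvkt" "."];

% Request the words that are not in the dictionary and have no suggestion.
% Requires Text Analytics Toolbox.
[updatedWords,unknownWords] = correctSpelling(words);

% Report how many words could not be corrected.
numUnknown = numel(unknownWords);
```

Inspecting unknownWords is one way to find candidates for the 'KnownWords' option or for an extension dictionary.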
Version History
Introduced in R2020a