Korean Language Support
This topic summarizes the Text Analytics Toolbox™ features that support Korean text.
Tokenization
The tokenizedDocument function automatically detects Korean input. Alternatively, set the 'Language' option in tokenizedDocument to 'ko'. This option specifies the language details of the tokens. To view the language details of the tokens, use tokenDetails. These language details determine the behavior of the removeStopWords, addPartOfSpeechDetails, normalizeWords, addSentenceDetails, and addEntityDetails functions on the tokens.
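As a minimal sketch of the tokenization step described above, the following code tokenizes a short Korean string with the 'Language' option set explicitly and then inspects the token details. The sample sentence and the variable names are illustrative assumptions, not part of the toolbox.

```matlab
% Illustrative Korean input (assumed sample text): "Hello. The weather is nice today."
str = "안녕하세요. 오늘은 날씨가 좋습니다.";

% Tokenize and set the language details explicitly instead of relying on detection.
documents = tokenizedDocument(str,'Language','ko');

% View the token details, including the Language column.
tdetails = tokenDetails(documents);
disp(head(tdetails))
```

Omitting the 'Language' option should give the same language details when the input is detected as Korean; setting it explicitly is mainly useful for short or mixed-language input.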
To specify additional MeCab options for tokenization, create a mecabOptions object. To tokenize using the specified MeCab tokenization options, use the 'TokenizeMethod' option of tokenizedDocument.
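The following sketch shows one way to pass a mecabOptions object to tokenizedDocument. It assumes that mecabOptions accepts a 'Language' name-value argument with the value 'ko'; check the mecabOptions reference page for the options available in your release. The sample sentence is illustrative.

```matlab
% Create MeCab options for Korean (the 'Language' argument is an assumption
% about the mecabOptions interface; other options are left at their defaults).
options = mecabOptions('Language','ko');

% Tokenize using the specified MeCab options.
str = "오늘은 날씨가 좋습니다.";   % assumed sample text: "The weather is nice today."
documents = tokenizedDocument(str,'TokenizeMethod',options);
disp(documents)
```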
Part of Speech Details
The tokenDetails function, by default, includes part of speech details with the token details.
Named Entity Recognition
The tokenDetails function, by default, includes entity details with the token details.
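To illustrate the part of speech and entity details that tokenDetails returns for Korean text, the sketch below selects those columns from the details table. The sample sentence is an illustrative assumption, and the column names PartOfSpeech and Entity are the ones expected in the returned table.

```matlab
% Assumed sample text: "Seoul is the capital of South Korea."
str = "서울은 대한민국의 수도입니다.";
documents = tokenizedDocument(str,'Language','ko');

% For Korean text, the details table is expected to include
% PartOfSpeech and Entity columns by default.
tdetails = tokenDetails(documents);
disp(tdetails(:,["Token" "PartOfSpeech" "Entity"]))
```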
Stop Words
To remove stop words from documents according to the token language details, use removeStopWords.
For a list of Korean stop words, set the 'Language' option in stopWords to 'ko'.
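A minimal sketch of both steps follows: listing the Korean stop words and removing stop words from a tokenized document. The sample sentence and variable names are illustrative assumptions.

```matlab
% Get the list of Korean stop words as a string array.
words = stopWords('Language','ko');

% Assumed sample text: "I go to school."
str = "나는 학교에 갑니다.";
documents = tokenizedDocument(str,'Language','ko');

% Remove stop words according to the token language details.
cleaned = removeStopWords(documents);
disp(cleaned)
```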
Lemmatization
To lemmatize tokens according to the token language details, use normalizeWords and set the 'Style' option to 'lemma'.
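For example, the following sketch lemmatizes a tokenized Korean document. The sample sentence is an illustrative assumption.

```matlab
% Assumed sample text: "The students studied at school."
str = "학생들은 학교에서 공부했습니다.";
documents = tokenizedDocument(str,'Language','ko');

% Reduce each token to its lemma using the token language details.
lemmatized = normalizeWords(documents,'Style','lemma');
disp(lemmatized)
```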
Language-Independent Features
Word and N-Gram Counting
The bagOfWords and bagOfNgrams functions support tokenizedDocument input regardless of language. If you have a tokenizedDocument array containing your data, then you can use these functions.
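As a sketch of word and n-gram counting on Korean input, the code below builds a bag-of-words model and a bag-of-n-grams model from a small tokenizedDocument array. The sample sentences are illustrative assumptions.

```matlab
% Assumed sample text: two short Korean sentences about the weather.
str = [
    "오늘 날씨가 좋습니다."
    "내일은 날씨가 흐립니다."];
documents = tokenizedDocument(str,'Language','ko');

% Count individual words.
bag = bagOfWords(documents);

% Count n-grams (bigrams by default).
ngramBag = bagOfNgrams(documents);

disp(bag)
disp(ngramBag)
```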
Modeling and Prediction
The fitlda and fitlsa functions support bagOfWords and bagOfNgrams input regardless of language. If you have a bagOfWords or bagOfNgrams object containing your data, then you can use these functions.
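The following sketch fits an LDA model and an LSA model to a bag-of-words model built from Korean text. The four sample sentences, the choice of two topics, and the choice of two components are illustrative assumptions; a corpus this small only demonstrates the calls and does not produce meaningful topics.

```matlab
% Assumed sample corpus: four short Korean sentences.
str = [
    "나는 학교에 갑니다."
    "학생들은 학교에서 공부합니다."
    "오늘 날씨가 좋습니다."
    "내일은 날씨가 흐립니다."];
documents = tokenizedDocument(str,'Language','ko');
bag = bagOfWords(documents);

% Fit an LDA model with two topics (illustrative choice), suppressing output.
ldaMdl = fitlda(bag,2,'Verbose',0);

% Fit an LSA model with two components (illustrative choice).
lsaMdl = fitlsa(bag,2);
```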
The trainWordEmbedding function supports tokenizedDocument or file input regardless of language. If you have a tokenizedDocument array or a file containing your data in the correct format, then you can use this function.
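As a sketch of training a word embedding from a tokenizedDocument array, the code below uses a very small Korean corpus. The sample sentences and the option values ('Dimension' of 10, 'MinCount' of 1) are illustrative assumptions chosen so that the call runs on tiny input; real training data should be far larger.

```matlab
% Assumed sample corpus: four short Korean sentences.
str = [
    "나는 학교에 갑니다."
    "학생들은 학교에서 공부합니다."
    "오늘 날씨가 좋습니다."
    "내일은 날씨가 흐립니다."];
documents = tokenizedDocument(str,'Language','ko');

% Train a small embedding. 'MinCount' is lowered to 1 so that words
% appearing only once in this tiny corpus are kept in the vocabulary.
emb = trainWordEmbedding(documents, ...
    'Dimension',10, ...
    'MinCount',1, ...
    'Verbose',0);
```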
See Also
tokenizedDocument | removeStopWords | stopWords | addPartOfSpeechDetails | tokenDetails | normalizeWords | addLanguageDetails | addEntityDetails