Korean Language Support

This topic summarizes the Text Analytics Toolbox™ features that support Korean text.

Tokenization

The tokenizedDocument function automatically detects Korean input. Alternatively, set the 'Language' option in tokenizedDocument to 'ko'. This option specifies the language details of the tokens. To view the language details of the tokens, use tokenDetails. These language details determine the behavior of the removeStopWords, addPartOfSpeechDetails, normalizeWords, addSentenceDetails, and addEntityDetails functions on the tokens.
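
As a minimal sketch of the two approaches described above, the following tokenizes one Korean sentence with automatic detection and then with the 'Language' option set explicitly. The sample sentence is made up for illustration and is not from the toolbox documentation.

```matlab
% Illustrative sample text (roughly: "Sunny weather is expected tomorrow.")
str = "내일은 맑은 날씨가 예상됩니다.";

% Let tokenizedDocument detect the language automatically.
documentsAuto = tokenizedDocument(str);

% Alternatively, state the language explicitly.
documentsKo = tokenizedDocument(str,'Language','ko');

% View the token details, including the Language variable.
tdetails = tokenDetails(documentsKo);
head(tdetails)
```

Setting the language explicitly can be useful for very short strings, where automatic detection has little text to work with.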

To specify additional MeCab options for tokenization, create a mecabOptions object. To tokenize using the specified MeCab tokenization options, use the 'TokenizeMethod' option of tokenizedDocument.
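
The following sketch shows how a mecabOptions object might be passed to tokenizedDocument. It uses default options only; the commented-out line assumes that mecabOptions accepts a 'UserDictionary' name-value argument, and the dictionary file name there is hypothetical.

```matlab
% Illustrative sample text (not from the toolbox documentation).
str = "내일은 맑은 날씨가 예상됩니다.";

% Create a MeCab options object with default settings.
options = mecabOptions;

% Assumption: a custom dictionary can be supplied by name-value argument.
% "myDictionary.dic" is a hypothetical file name.
% options = mecabOptions('UserDictionary',"myDictionary.dic");

% Tokenize using the specified MeCab options.
documents = tokenizedDocument(str, ...
    'Language','ko', ...
    'TokenizeMethod',options);
```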

Part of Speech Details

The tokenDetails function, by default, includes part of speech details with the token details.
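
For example, a short sketch that tokenizes a made-up Korean sentence and displays each token next to its part-of-speech tag. It assumes that the table returned by tokenDetails stores the tags in a variable named PartOfSpeech.

```matlab
% Illustrative sample text (not from the toolbox documentation).
str = "내일은 맑은 날씨가 예상됩니다.";
documents = tokenizedDocument(str,'Language','ko');

% Token details for Korean text include part-of-speech tags by default.
tdetails = tokenDetails(documents);

% Show only the tokens and their tags.
tdetails(:,{'Token','PartOfSpeech'})
```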

Named Entity Recognition

The tokenDetails function, by default, includes entity details with the token details.
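
A sketch of how the entity details might be used to keep only the tokens tagged as entities. It assumes that the entity tags are stored in a variable named Entity and that untagged tokens carry the category "non-entity". The sample sentence is made up, and which tokens are tagged depends on the toolbox model.

```matlab
% Illustrative sample text (roughly: "Kim Minsu works in Seoul.")
str = "김민수는 서울에서 일합니다.";
documents = tokenizedDocument(str,'Language','ko');

% Token details for Korean text include entity tags by default.
tdetails = tokenDetails(documents);

% Keep only rows whose entity tag is something other than "non-entity"
% (assumed name of the category used for untagged tokens).
isEntity = tdetails.Entity ~= "non-entity";
entities = tdetails(isEntity,{'Token','Entity'})
```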

Stop Words

To remove stop words from documents according to the token language details, use removeStopWords. For a list of Korean stop words, set the 'Language' option in stopWords to 'ko'.
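
As a brief sketch, the following retrieves the Korean stop word list and removes stop words from a tokenized document. The sample sentence is illustrative only.

```matlab
% Get the list of Korean stop words as a string array.
words = stopWords('Language','ko');

% Tokenize illustrative sample text.
str = "내일은 맑은 날씨가 예상됩니다.";
documents = tokenizedDocument(str,'Language','ko');

% Remove stop words according to the token language details.
documents = removeStopWords(documents);
```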

Lemmatization

To lemmatize tokens according to the token language details, use normalizeWords and set the 'Style' option to 'lemma'.
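
A minimal sketch of lemmatizing Korean tokens, using a made-up sample sentence:

```matlab
% Illustrative sample text (not from the toolbox documentation).
str = "내일은 맑은 날씨가 예상됩니다.";
documents = tokenizedDocument(str,'Language','ko');

% Reduce each token to its lemma form.
lemmatized = normalizeWords(documents,'Style','lemma');

% Compare the original and lemmatized tokens.
documents
lemmatized
```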

Language-Independent Features

Word and N-Gram Counting

The bagOfWords and bagOfNgrams functions support tokenizedDocument input regardless of language. If you have a tokenizedDocument array containing your data, then you can use these functions.
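
The following sketch builds a bag-of-words model and a bag-of-n-grams model from a small, made-up Korean corpus. The four sentences are illustrative; a real corpus would be much larger.

```matlab
% Illustrative corpus of four short Korean sentences.
str = [
    "내일은 맑은 날씨가 예상됩니다."
    "오늘은 비가 많이 내렸습니다."
    "주말에 친구와 영화를 봤습니다."
    "새로 나온 영화가 재미있었습니다."];
documents = tokenizedDocument(str,'Language','ko');

% Count individual words.
bag = bagOfWords(documents);

% Count n-grams (bigrams by default).
bagNgrams = bagOfNgrams(documents);

% List the most frequent words.
topkwords(bag,5)
```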

Modeling and Prediction

The fitlda and fitlsa functions support bagOfWords and bagOfNgrams input regardless of language. If you have a bagOfWords or bagOfNgrams object containing your data, then you can use these functions.
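
As a rough sketch, the following fits an LDA model and an LSA model to a bag-of-words model of a tiny, made-up Korean corpus. The number of topics and components (2) is an arbitrary choice for illustration, and a corpus this small does not produce meaningful topics.

```matlab
% Illustrative corpus of four short Korean sentences.
str = [
    "내일은 맑은 날씨가 예상됩니다."
    "오늘은 비가 많이 내렸습니다."
    "주말에 친구와 영화를 봤습니다."
    "새로 나온 영화가 재미있었습니다."];
documents = tokenizedDocument(str,'Language','ko');
bag = bagOfWords(documents);

% Fit an LDA model with 2 topics, suppressing progress output.
numTopics = 2;
ldaMdl = fitlda(bag,numTopics,'Verbose',0);

% Fit an LSA model with 2 components.
numComponents = 2;
lsaMdl = fitlsa(bag,numComponents);

% Predict the most probable LDA topic for each document.
topicIdx = predict(ldaMdl,documents);
```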

The trainWordEmbedding function supports tokenizedDocument or file input regardless of language. If you have a tokenizedDocument array or a file containing your data in the correct format, then you can use this function.
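
A sketch of training a word embedding directly from tokenized Korean documents. The corpus is made up and far too small for a useful embedding; the 'MinCount' value of 1 is chosen only so that words appearing once are kept in the vocabulary, and the dimension and epoch count are arbitrary.

```matlab
% Illustrative corpus of four short Korean sentences.
str = [
    "내일은 맑은 날씨가 예상됩니다."
    "오늘은 비가 많이 내렸습니다."
    "주말에 친구와 영화를 봤습니다."
    "새로 나온 영화가 재미있었습니다."];
documents = tokenizedDocument(str,'Language','ko');

% Train a small embedding, keeping words that occur at least once.
emb = trainWordEmbedding(documents, ...
    'Dimension',10, ...
    'MinCount',1, ...
    'NumEpochs',10, ...
    'Verbose',0);

% Inspect the size of the learned vocabulary.
numWords = numel(emb.Vocabulary)
```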