Japanese Language Support
This topic summarizes the Text Analytics Toolbox™ features that support Japanese text. For an example showing how to analyze Japanese text data, see Analyze Japanese Text Data.
Tokenization
The tokenizedDocument function automatically detects Japanese input. Alternatively, set the 'Language' option in tokenizedDocument to 'ja'. This option specifies the language details of the tokens. To view the language details of the tokens, use tokenDetails. These language details determine the behavior of the removeStopWords, addPartOfSpeechDetails, normalizeWords, addSentenceDetails, and addEntityDetails functions on the tokens.
To specify additional MeCab options for tokenization, create a mecabOptions object. To tokenize using the specified MeCab tokenization options, use the 'TokenizeMethod' option of tokenizedDocument.
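For example, a minimal sketch that creates a mecabOptions object with default settings and passes it to tokenizedDocument; adjust the object properties for your own MeCab setup:

% Create MeCab tokenization options with default settings. Customize the
% object properties, for example to use a custom dictionary, as needed.
options = mecabOptions;
documents = tokenizedDocument("恋に悩み、苦しむ。",'TokenizeMethod',options);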
Tokenize Japanese Text
Tokenize Japanese text using tokenizedDocument. The function automatically detects Japanese text.
str = [
    "恋に悩み、苦しむ。"
    "恋の悩みで苦しむ。"
    "空に星が輝き、瞬いている。"
    "空の星が輝きを増している。"];
documents = tokenizedDocument(str)

documents =
  4x1 tokenizedDocument:

     6 tokens: 恋 に 悩み 、 苦しむ 。
     6 tokens: 恋 の 悩み で 苦しむ 。
    10 tokens: 空 に 星 が 輝き 、 瞬い て いる 。
    10 tokens: 空 の 星 が 輝き を 増し て いる 。
Part of Speech Details
By default, the tokenDetails function includes part-of-speech details with the token details.
Get Part of Speech Details of Japanese Text
Tokenize Japanese text using tokenizedDocument.
str = [
    "恋に悩み、苦しむ。"
    "恋の悩みで苦しむ。"
    "空に星が輝き、瞬いている。"
    "空の星が輝きを増している。"
    "駅までは遠くて、歩けない。"
    "遠くの駅まで歩けない。"
    "すもももももももものうち。"];
documents = tokenizedDocument(str);
For Japanese text, you can get the part-of-speech details using tokenDetails. For English text, you must first use addPartOfSpeechDetails.
tdetails = tokenDetails(documents);
head(tdetails)

     Token     DocumentNumber    LineNumber       Type        Language    PartOfSpeech     Lemma       Entity
    _______    ______________    __________    ___________    ________    ____________    _______    __________

    "恋"             1               1         letters           ja       noun            "恋"       non-entity
    "に"             1               1         letters           ja       adposition      "に"       non-entity
    "悩み"           1               1         letters           ja       verb            "悩む"     non-entity
    "、"             1               1         punctuation       ja       punctuation     "、"       non-entity
    "苦しむ"         1               1         letters           ja       verb            "苦しむ"   non-entity
    "。"             1               1         punctuation       ja       punctuation     "。"       non-entity
    "恋"             2               1         letters           ja       noun            "恋"       non-entity
    "の"             2               1         letters           ja       adposition      "の"       non-entity
Named Entity Recognition
By default, the tokenDetails function includes entity details with the token details.
Add Named Entity Tags to Japanese Text
Tokenize Japanese text using tokenizedDocument.
str = [
    "マリーさんはボストンからニューヨークに引っ越しました。"
    "駅へ鈴木さんを迎えに行きます。"
    "東京は大阪より大きいですか?"
    "東京に行った時、新宿や渋谷などいろいろな所を訪れました。"];
documents = tokenizedDocument(str);
For Japanese text, the software automatically adds named entity tags, so you do not need to use the addEntityDetails function. The software detects person names, locations, organizations, and other named entities. To view the entity details, use the tokenDetails function.
tdetails = tokenDetails(documents);
head(tdetails)

        Token         DocumentNumber    LineNumber     Type      Language    PartOfSpeech        Lemma          Entity
    ______________    ______________    __________    _______    ________    ____________    ______________    __________

    "マリー"                1               1         letters       ja       proper-noun     "マリー"          person
    "さん"                  1               1         letters       ja       noun            "さん"            person
    "は"                    1               1         letters       ja       adposition      "は"              non-entity
    "ボストン"              1               1         letters       ja       proper-noun     "ボストン"        location
    "から"                  1               1         letters       ja       adposition      "から"            non-entity
    "ニューヨーク"          1               1         letters       ja       proper-noun     "ニューヨーク"    location
    "に"                    1               1         letters       ja       adposition      "に"              non-entity
    "引っ越し"              1               1         letters       ja       verb            "引っ越す"        non-entity
View the words tagged with the entity types "person", "location", "organization", or "other". These are the words not tagged "non-entity".
idx = tdetails.Entity ~= "non-entity";
tdetails(idx,:).Token
ans = 11x1 string
    "マリー"
    "さん"
    "ボストン"
    "ニューヨーク"
    "鈴木"
    "さん"
    "東京"
    "大阪"
    "東京"
    "新宿"
    "渋谷"
Stop Words
To remove stop words from documents according to the token language details, use removeStopWords.

For a list of Japanese stop words, set the 'Language' option in stopWords to 'ja'.
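For example, list the Japanese stop words and inspect the first few entries:

% Return the list of Japanese stop words as a string array.
words = stopWords('Language','ja');
words(1:5)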
Remove Japanese Stop Words
Tokenize Japanese text using tokenizedDocument. The function automatically detects Japanese text.
str = [
    "ここは静かなので、とても穏やかです"
    "企業内の顧客データを利用し、今年の売り上げを調べることが出来た。"
    "私は先生です。私は英語を教えています。"];
documents = tokenizedDocument(str);
Remove stop words using removeStopWords. The function uses the language details from documents to determine which language stop words to remove.
documents = removeStopWords(documents)
documents =
  3x1 tokenizedDocument:

     4 tokens: 静か 、 とても 穏やか
    10 tokens: 企業 顧客 データ 利用 、 今年 売り上げ 調べる 出来 。
     5 tokens: 先生 。 英語 教え 。
Lemmatization
To lemmatize tokens according to the token language details, use normalizeWords and set the 'Style' option to 'lemma'.
Lemmatize Japanese Text
Tokenize Japanese text using the tokenizedDocument function. The function automatically detects Japanese text.
str = [
    "空に星が輝き、瞬いている。"
    "空の星が輝きを増している。"
    "駅までは遠くて、歩けない。"
    "遠くの駅まで歩けない。"];
documents = tokenizedDocument(str);
Lemmatize the tokens using normalizeWords.
documents = normalizeWords(documents)
documents =
  4x1 tokenizedDocument:

    10 tokens: 空 に 星 が 輝く 、 瞬く て いる 。
    10 tokens: 空 の 星 が 輝き を 増す て いる 。
     9 tokens: 駅 まで は 遠い て 、 歩ける ない 。
     7 tokens: 遠く の 駅 まで 歩ける ない 。
Language-Independent Features
Word and N-Gram Counting
The bagOfWords and bagOfNgrams functions support tokenizedDocument input regardless of language. If you have a tokenizedDocument array containing your data, then you can use these functions.
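For example, a minimal sketch that counts words and bigrams, assuming documents is a tokenizedDocument array of Japanese text such as the one created in the examples above:

% Count words and bigrams in a tokenizedDocument array.
bag = bagOfWords(documents);
bagNgrams = bagOfNgrams(documents,'NgramLengths',2);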
Modeling and Prediction
The fitlda and fitlsa functions support bagOfWords and bagOfNgrams input regardless of language. If you have a bagOfWords or bagOfNgrams object containing your data, then you can use these functions.
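For example, a minimal sketch that fits an LDA topic model, again assuming documents is a tokenizedDocument array; the choice of 4 topics is arbitrary and for illustration only:

% Fit a latent Dirichlet allocation model to a bag-of-words model.
bag = bagOfWords(documents);
mdl = fitlda(bag,4);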
The trainWordEmbedding function supports tokenizedDocument or file input regardless of language. If you have a tokenizedDocument array or a file containing your data in the correct format, then you can use this function.
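For example, a minimal sketch that trains a word embedding from a tokenizedDocument array; a useful embedding requires a much larger corpus, and 'MinCount' is lowered here only so that the example runs on a handful of documents:

% Train a word embedding; small corpora require a lower MinCount.
emb = trainWordEmbedding(documents,'MinCount',1);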
See Also
tokenizedDocument | removeStopWords | stopWords | addPartOfSpeechDetails | tokenDetails | normalizeWords | addLanguageDetails | addEntityDetails