Stanza - A Python NLP Library for Many Human Languages


I tested out Stanza. The English tokenizer definitely works. I then ran a quick test with the Japanese model, and the output was somewhat unexpected.

import stanza

# use "ja" for the Japanese model, "en" for English

stanza.download("ja")
nlp = stanza.Pipeline("ja")
doc = nlp("皆さんおはようございます! ご機嫌いかがですか?")

for i, sentence in enumerate(doc.sentences):
    print(f"===== Sentence {i+1} tokens =====")
    print(*[f"word: {word.text}\t upos: {word.upos} xpos: {word.xpos}" for word in sentence.words], sep="\n")

The output is:

===== Sentence 1 tokens =====
word: 皆さん	 upos: PRON xpos: NP
word: おは	 upos: VERB xpos: VV
word: よう	 upos: AUX xpos: AV
word: ござい	 upos: VERB xpos: VV
word: ます	 upos: AUX xpos: AV
word: !	 upos: PUNCT xpos: SYM
===== Sentence 2 tokens =====
word: ご	 upos: NOUN xpos: XP
word: 機	 upos: NOUN xpos: XS
word: 嫌い	 upos: NOUN xpos: NN
word: か	 upos: PART xpos: PF
word: が	 upos: ADP xpos: PS
word: です	 upos: AUX xpos: AV
word: か	 upos: PART xpos: PE
word: ?	 upos: PUNCT xpos: SYM

I’m not qualified to evaluate the accuracy of the POS tags, but at least as far as tokenization goes, I would expect

["皆さん", "おはよう", "ござい", "ます", "!"]

and

["ご", "機嫌", "いかが", "です", "か", "?"]
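Rather than eyeballing printed output, the comparison can be done programmatically by pulling the word lists out of the doc with a comprehension. A minimal sketch — here the `doc` object is stubbed with the tokens the Japanese model actually produced, so it runs without downloading a model; with Stanza installed you would use the real `doc` from the pipeline above instead:

```python
from types import SimpleNamespace

# Stub mimicking Stanza's doc.sentences[i].words structure,
# filled with the tokens the Japanese model actually produced.
doc = SimpleNamespace(sentences=[
    SimpleNamespace(words=[SimpleNamespace(text=t)
                           for t in ["皆さん", "おは", "よう", "ござい", "ます", "!"]]),
    SimpleNamespace(words=[SimpleNamespace(text=t)
                           for t in ["ご", "機", "嫌い", "か", "が", "です", "か", "?"]]),
])

# The token lists I would expect from a Japanese tokenizer.
expected = [
    ["皆さん", "おはよう", "ござい", "ます", "!"],
    ["ご", "機嫌", "いかが", "です", "か", "?"],
]

# Extract the surface forms per sentence and compare.
actual = [[w.text for w in s.words] for s in doc.sentences]
for exp, act in zip(expected, actual):
    print("match" if exp == act else f"mismatch: {act}")
```

With the output above, both sentences print a mismatch, which makes the difference between the expected and actual segmentation explicit.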

Cheers!
