I tested out Stanza. The English tokenizer definitely works. I ran a quick test with the Japanese language model and the output was somewhat unexpected.
import stanza
# download the Japanese model ("ja"); use "en" for the English model
stanza.download("ja")
nlp = stanza.Pipeline("ja")
doc = nlp("皆さんおはようございます! ご機嫌いかがですか?")
for i, sentence in enumerate(doc.sentences):
    print(f"===== Sentence {i+1} tokens =====")
    print(*[f"word: {word.text}\t upos: {word.upos} xpos: {word.xpos}" for word in sentence.words], sep="\n")
The output is:
===== Sentence 1 tokens =====
word: 皆さん upos: PRON xpos: NP
word: おは upos: VERB xpos: VV
word: よう upos: AUX xpos: AV
word: ござい upos: VERB xpos: VV
word: ます upos: AUX xpos: AV
word: ! upos: PUNCT xpos: SYM
===== Sentence 2 tokens =====
word: ご upos: NOUN xpos: XP
word: 機 upos: NOUN xpos: XS
word: 嫌い upos: NOUN xpos: NN
word: か upos: PART xpos: PF
word: が upos: ADP xpos: PS
word: です upos: AUX xpos: AV
word: か upos: PART xpos: PE
word: ? upos: PUNCT xpos: SYM
I’m not qualified to evaluate the accuracy of the POS tags etc., but at least as far as tokenization goes, I would expect
["皆さん", "おはよう", "ござい", "ます", "!"]
and
["ご", "機嫌", "いかが", "です", "か". "?"]
Cheers!