Stanza - A Python NLP Library for Many Human Languages
I tested out Stanza. The English tokenizer definitely works. I then ran a quick test with the Japanese model, and the output was somewhat unexpected.

```python
import stanza

# "ja" for the Japanese model, "en" for the English model
stanza.download("ja")
nlp = stanza.Pipeline("ja")

doc = nlp("皆さんおはようございます! ご機嫌いかがですか?")
for i, sentence in enumerate(doc.sentences):
    print(f"===== Sentence {i+1} tokens =====")
    print(*[f"word: {word.text}\t upos: {word.upos} xpos: {word.xpos}"
            for word in sentence.words], sep="\n")
```

The output is:

```
===== Sentence 1 tokens =====
word: 皆さん    upos: PRON xpos: NP
word: おは      upos: VERB xpos: VV
word: よう      upos: AUX xpos: AV
word: ござい    upos: VERB xpos: VV
word: ます      upos: AUX xpos: AV
word: !         upos: PUNCT xpos: SYM
===== Sentence 2 tokens =====
word: ご        upos: NOUN xpos: XP
word: 機        upos: NOUN xpos: XS
word: 嫌い      upos: NOUN xpos: NN
word: か        upos: PART xpos: PF
word: が        upos: ADP xpos: PS
word: です      upos: AUX xpos: AV
word: か        upos: PART xpos: PE
word: ?         upos: PUNCT xpos: SYM
```

I'm not qualified to evaluate the accuracy of the POS tags and so on, but at least as far as the tokenization goes, I would expect ...
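Since my question is really about the segmentation rather than the tagging, it may be cleaner to run only the tokenizer. Below is a minimal sketch of how I believe that can be done with the pipeline's `processors` argument; it simply prints each sentence's tokens so surprising splits like the ones above are easier to spot. The sample sentence is the same as before, and the printing format is my own.

```python
import stanza

# Assumes the Japanese model has already been fetched with stanza.download("ja").
# Restrict the pipeline to the tokenizer so only segmentation is involved.
nlp_tok = stanza.Pipeline("ja", processors="tokenize")

doc = nlp_tok("皆さんおはようございます! ご機嫌いかがですか?")
for i, sentence in enumerate(doc.sentences):
    # As far as I understand, Japanese has no multi-word tokens,
    # so tokens and words should line up one-to-one here.
    print(f"Sentence {i+1}:", " / ".join(token.text for token in sentence.tokens))
```

If tokens mirror the words shown above, this should print おは / よう separately in the first sentence and ご / 機 / 嫌い in the second, which is exactly the part that surprised me.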