
[Question]: how does llama_index work with Chinese text? #14618

Open
lizeyubuaa opened this issue Jul 8, 2024 · 1 comment
Labels: question (Further information is requested)

Comments

@lizeyubuaa

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

How does llama_index work with Chinese text?
In the documentation, the code is as follows:
[image: code sample from the documentation]
Since my data is in Chinese, I translated the entities into Chinese; however, it doesn't work and I can't get any output. If I use the default settings without changing the entities, there is an output graph.
[image]

So, does llama_index work well with Chinese text?

lizeyubuaa added the question label on Jul 8, 2024

dosubot bot commented Jul 8, 2024

@lizeyubuaa Hello! I'm Dosu, a bot that can help you fix bugs, answer questions, and become a contributor. I'll assist you while you wait for a human maintainer. 🤖

Yes, llama_index supports Chinese text well. The provided test fixtures include both English and Chinese text, indicating that the framework is designed to handle multiple languages, including Chinese.

If you are experiencing issues with your translated schema, it might be due to differences in tokenization between languages. The comment in the code mentions a significant difference in token counts between GPT2 and cl100k_base for non-English text:

# There's a pretty big difference between GPT2 and cl100k_base for non-English
# The same text goes from 1178 tokens to 665 tokens.

This suggests that the tokenization process can vary significantly depending on the model and the language, which might be causing issues with your translated schema.
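To see the effect concretely, here is a minimal sketch (assuming the third-party tiktoken package, which ships both encodings) that compares token counts for the same Chinese string:

import tiktoken

text = "如何使用 llama_index 处理中文文本？" * 20  # repeat to make the gap visible

gpt2 = tiktoken.get_encoding("gpt2")           # older byte-level BPE vocabulary
cl100k = tiktoken.get_encoding("cl100k_base")  # vocabulary used by newer OpenAI models

# CJK text typically expands into far more tokens under gpt2 than cl100k_base.
print("gpt2:", len(gpt2.encode(text)))
print("cl100k_base:", len(cl100k.encode(text)))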

Additionally, a user in the LlamaIndex community mentioned encountering issues with the built-in sentence text splitter not fitting languages without white spaces, like Chinese. They had to preprocess the document with a Chinese word splitter before creating the index:

"content": "@jerryjliu98 hi, jerry. I saw the feature on new Sentence Text splitter. it will be called automatically during the operation of creating new index? another question is: if it can split words in languages not using white space between words, like Chinese? I am using 0.4.32 mainly, and I saw error message about over length term (longer than max_chunk_limit), so I have to process document by a Chinese word splitter before creating index, thus I think the built-in splitter not fits languages without white space...",
"author": {
    "id": "1071293300629311528",
    "name": "heihei",
    "discriminator": "9614",
    "nickname": "heihei",
    "color": null
}

This indicates that the built-in splitter may not handle languages without white spaces effectively, and preprocessing with a language-specific splitter might be necessary.
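A hedged sketch of that workaround, assuming the third-party jieba word segmenter and the llama_index.core import paths used by recent releases; the idea is simply to insert whitespace between segmented words before building the index:

import jieba
from llama_index.core import Document, VectorStoreIndex

raw_text = "检索增强生成是一种结合检索与生成的技术。它可以提升问答质量。"

# jieba.cut yields segmented words; join them with spaces so the
# whitespace-based splitter has boundaries to work with.
segmented = " ".join(jieba.cut(raw_text))

index = VectorStoreIndex.from_documents([Document(text=segmented)])

Note that injecting spaces changes the text the embedding model sees, so whether this helps depends on your embedding model; it mainly works around the splitter's whitespace assumption.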

Moreover, the SemanticDoubleMergingSplitterNodeParser class in llama_index currently supports only a few languages (English, German, and Spanish) as indicated by the LANGUAGES list:

# TODO test more languages
LANGUAGES: List[str] = ["english", "german", "spanish"]

This means that the built-in language models and configurations might not be optimized for Chinese text. You may need to implement or integrate a custom sentence splitter and language model that better supports Chinese text processing [1][2][3][4].
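For example, one possible approach (a sketch, assuming the chunking_tokenizer_fn hook on SentenceSplitter in recent llama_index releases; the regex is an illustrative stand-in for a real segmenter) is to split on CJK sentence-final punctuation:

import re
from typing import List

from llama_index.core.node_parser import SentenceSplitter

def split_chinese_sentences(text: str) -> List[str]:
    # Split after 。！？ while keeping the punctuation attached to its sentence.
    parts = re.split(r"(?<=[。！？])", text)
    return [p for p in parts if p.strip()]

splitter = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=50,
    chunking_tokenizer_fn=split_chinese_sentences,
)

chunks = splitter.split_text("第一句话。第二句话！第三句话？")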

To continue talking to Dosu, mention @dosu.
