[Bug]: UnstructuredReader does not respect encoding and errors in the SimpleDirectoryReader inputs #14635

Open
Alavi1412 opened this issue Jul 8, 2024 · 1 comment
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

Comments

@Alavi1412

Bug Description

I am reading files from a local folder with this code:

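# Import paths assume llama-index 0.10.x with llama-index-readers-file installed.
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import UnstructuredReader

# Wrapper function added so the original `return` is valid; the name is illustrative.
def load_documents(local_folder, required_exts):
    loader = SimpleDirectoryReader(
        local_folder,
        encoding='utf-8',  # expected to reach the file readers, but does not
        file_extractor={
            '.pdf': UnstructuredReader(),
            '.png': UnstructuredReader(),
            '.jpg': UnstructuredReader(),
            '.pptx': UnstructuredReader(),
            '.ppt': UnstructuredReader(),
            '.docx': UnstructuredReader(),
            '.doc': UnstructuredReader(),
            '.xlsx': UnstructuredReader(),
            '.xls': UnstructuredReader(),
            '.txt': UnstructuredReader(),
        },
        required_exts=required_exts,
        recursive=True
    )
    documents = loader.load_data()
    return documents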

The problem is that inside unstructured's partition_text function the encoding is None, and the errors parameter is never passed to the place where the file is opened.

I have attached a debugger screenshot from unstructured.partition.text.partition_text, which shows that the encoding is None:

image
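To make the two parameters concrete, here is a minimal standard-library sketch of what `encoding` and `errors` control when bytes are decoded. It does not reproduce unstructured's internals; the sample bytes and the `decode` helper are illustrative assumptions, not code from either library.

```python
# Illustrative only: plain Python decoding, not unstructured's code path.
raw = "café".encode("latin-1")  # b'caf\xe9', which is not valid UTF-8


def decode(data: bytes, encoding: str, errors: str = "strict") -> str:
    """Decode bytes with an explicit encoding and error-handling policy."""
    return data.decode(encoding, errors=errors)


# With the default errors="strict", an invalid byte raises an exception.
try:
    decode(raw, "utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes U+FFFD for the undecodable byte.
replaced = decode(raw, "utf-8", errors="replace")

# errors="ignore" silently drops the undecodable byte.
ignored = decode(raw, "utf-8", errors="ignore")

# Passing the encoding the file was actually written in recovers the text.
correct = decode(raw, "latin-1")
```

If neither value reaches the code that opens the file, the caller has no way to choose between these behaviours.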

Version

0.10.42

Steps to Reproduce

Set encoding on the SimpleDirectoryReader and use UnstructuredReader as a file_extractor. The encoding is never used.

Relevant Logs/Tracebacks

No response

@logan-markewich
Collaborator

These should be args on the UnstructuredReader class itself, but that isn't currently supported. It would need a PR to add them.
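Until such a PR exists, one possible stopgap for plain-text files is to normalize each file to UTF-8 before it reaches the reader. The sketch below is an assumption-laden illustration, not the library's API: `EncodingAwareReader` and `inner_load` are hypothetical names, and the inner loader is any callable that takes a path and returns a list of results.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


class EncodingAwareReader:
    """Hypothetical wrapper: decode with a chosen encoding/errors policy,
    write a UTF-8 copy, and hand that copy to an inner loader."""

    def __init__(
        self,
        inner_load: Callable[[Path], List[str]],
        encoding: str = "utf-8",
        errors: str = "strict",
    ) -> None:
        self.inner_load = inner_load
        self.encoding = encoding
        self.errors = errors

    def load_data(self, file: Path) -> List[str]:
        # Decode explicitly so the caller's encoding and errors are honoured.
        text = Path(file).read_bytes().decode(self.encoding, errors=self.errors)
        with tempfile.TemporaryDirectory() as tmp:
            # The inner loader only ever sees a clean UTF-8 file.
            normalized = Path(tmp) / Path(file).name
            normalized.write_bytes(text.encode("utf-8"))
            return self.inner_load(normalized)


def read_utf8(path: Path) -> List[str]:
    """Stand-in inner loader used for demonstration."""
    return [path.read_text(encoding="utf-8")]


# Demonstration with a Latin-1 encoded file.
with tempfile.TemporaryDirectory() as demo_dir:
    sample = Path(demo_dir) / "sample.txt"
    sample.write_bytes("café".encode("latin-1"))

    latin1_result = EncodingAwareReader(read_utf8, encoding="latin-1").load_data(sample)
    replaced_result = EncodingAwareReader(
        read_utf8, encoding="utf-8", errors="replace"
    ).load_data(sample)
```

To plug this into SimpleDirectoryReader's file_extractor, the wrapper would presumably need to subclass the library's base reader and delegate to UnstructuredReader's load_data; that wiring is untested here. The approach also only makes sense for text formats such as .txt, not for binary formats like .pdf or .docx.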
