import pandas as pd
from sklearn.model_selection import train_test_split

def split_data(path):
  df = pd.read_csv(path)
  return train_test_split(df, test_size=0.1, random_state=100)

train, test = split_data(DATA_DIR)
train_texts, train_labels = train['text'].to_list(), train['sentiment'].to_list() 
test_texts, test_labels = test['text'].to_list(), test['sentiment'].to_list() 

train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.1, random_state=100)

from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
valid_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

When I tried to tokenize the texts split from the dataframe using the BERT tokenizer, I got an error like the one above.

  • The reason is that the tokenizer is trying to tokenize something that is not a string; this can happen when the tokenizer is passed None or any other non-string object. Commented Feb 9, 2023 at 13:56
  • I had NaNs in my data, which I quickly 'fixed' with pandas' data.dropna().
    – Herbert
    Commented Feb 8 at 13:29
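
A quick way to confirm this diagnosis is to look for non-string entries in the text column before tokenizing. A minimal sketch (the 'text' column name is taken from the question's code; the path is a placeholder):

import pandas as pd

df = pd.read_csv('data.csv')  # placeholder path

# Rows whose 'text' entry is not a plain Python string (NaN, None, floats, ...)
bad_rows = df[~df['text'].apply(lambda x: isinstance(x, str))]
print(bad_rows)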

5 Answers


I had the same error. The problem was that I had None in my list, e.g.:

import pandas as pd
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-german-cased')

# create test dataframe
texts = ['Vero Moda Damen Übergangsmantel Kurzmantel Chic Business Coatigan SALE',
         'Neu Herren Damen Sportschuhe Sneaker Turnschuhe Freizeit 1975 Schuhe Gr. 36-46',
         'KOMBI-ANGEBOT Zuckerpaste STRONG / SOFT / ZUBEHÖR -Sugaring Wachs Haarentfernung',
         None]

labels = [1, 2, 3, 1]

d = {'texts': texts, 'labels': labels} 
test_df = pd.DataFrame(d)

So, before converting the DataFrame columns to lists, I removed all rows containing None.

test_df = test_df.dropna()
texts = test_df["texts"].tolist()
texts_encodings = tokenizer(texts, truncation=True, padding=True)

This worked for me.
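
One detail worth knowing: dropna() with no arguments drops a row if any column contains a missing value. To drop rows only when the text itself is missing, pandas' subset parameter restricts the check:

test_df = test_df.dropna(subset=['texts'])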

  • Similarly, I didn't have null values, but I had empty strings, which cause the same error. Commented Nov 3, 2022 at 0:30

In my case I had to set is_split_into_words=True.

https://huggingface.co/transformers/main_classes/tokenizer.html

The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
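
For reference, a minimal sketch of the pretokenized case (the example word lists are made up):

from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# Each example is already a list of words, so tell the tokenizer that
batch = [['Hello', 'world'], ['a', 'pretokenized', 'sentence']]
encodings = tokenizer(batch, is_split_into_words=True, truncation=True, padding=True)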

  • Can confirm this also solved the problem in my case. Commented Oct 1, 2021 at 9:08

Similar to MarkusOdenthal, I had a non-string type in my list. I fixed it by converting the column to strings, then converting it to a list, before splitting it into train and test segments. So you would do:

train_texts = train['text'].astype(str).tolist()
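
One caveat (my addition, not the answer's): astype(str) converts NaN entries into the literal string 'nan', so those rows will tokenize without error but carry junk text. If that matters, drop them first:

train_texts = train['text'].dropna().astype(str).tolist()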
  • Excuse me, if I need to encode the items in the list, how can I do that?
    – user
    Commented Sep 28, 2022 at 23:41
import pandas as pd
from sklearn.model_selection import train_test_split

def split_data(path):
  df = pd.read_csv(path)
  return train_test_split(df, test_size=0.2, random_state=100)

train, test = split_data(DATA_DIR)
train_texts, train_labels = train['text'].to_list(), train['sentiment'].to_list() 
test_texts, test_labels = test['text'].to_list(), test['sentiment'].to_list() 

train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.2, random_state=100)

from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
valid_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

Try changing the size of the split. It worked for me, which suggests the split didn't leave enough data for the tokenizer to tokenize.

  • train_texts just needs to be a list of strings?
    – Evan Zamir
    Commented Jan 20, 2021 at 20:23

In the tokenizer, the text must be a str. For example: train_encodings = tokenizer(str(train_texts), truncation=True, padding=True)
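
Note that str(train_texts) turns the whole list into one long string, so the tokenizer returns a single encoding rather than one per example. If per-example encodings are the goal, casting each element individually is probably the intent (a sketch, not taken from the answer):

train_encodings = tokenizer([str(t) for t in train_texts], truncation=True, padding=True)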
