I am getting the following error: AttributeError: 'DataFrame' object has no attribute 'data_type'. I am trying to recreate the code from this link, which is based on this article, using my own dataset, which is similar to the article's.

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(df.index.values, 
                                                  df.label.values, 
                                                  test_size=0.15, 
                                                  random_state=42, 
                                                  stratify=df.label.values)

df['data_type'] = ['not_set']*df.shape[0]

df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

df.groupby(['Conference', 'label', 'data_type']).count()

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
                                          do_lower_case=True)

encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].example.values,
    add_special_tokens=True,
    return_attention_mask=True,
    pad_to_max_length=True,
    max_length=256,
    return_tensors='pt'
)

and this is the error I get:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_24180/2662883887.py in <module>
      3 
      4 encoded_data_train = tokenizer.batch_encode_plus(
----> 5     df[df.data_type=='train'].example.values,
      6     add_special_tokens=True,
      7     return_attention_mask=True,

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5485         ):
   5486             return self[name]
-> 5487         return object.__getattribute__(self, name)
   5488 
   5489     def __setattr__(self, name: str, value) -> None:

AttributeError: 'DataFrame' object has no attribute 'data_type'

I am using Python 3.9; PyTorch 1.10.1; pandas 1.3.5; transformers 4.15.0.

1 Answer

The error means there is no data_type column in your dataframe, because you missed this step:

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(df.index.values, 
                                                  df.label.values, 
                                                  test_size=0.15, 
                                                  random_state=42, 
                                                  stratify=df.label.values)

df['data_type'] = ['not_set']*df.shape[0]  # <- HERE

df.loc[X_train, 'data_type'] = 'train'  # <- HERE
df.loc[X_val, 'data_type'] = 'val'  # <- HERE

df.groupby(['Conference', 'label', 'data_type']).count()

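As a quick sanity check (a minimal sketch, assuming df is the dataframe from your question), you can confirm the column exists before filtering on it:

assert 'data_type' in df.columns, "run the train/val split step first"
print(df['data_type'].value_counts())  # should show counts for 'train' and 'val'
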
Demo

  1. Setup
import pandas as pd
from sklearn.model_selection import train_test_split

# The Data
df = pd.read_csv('data/title_conference.csv')
df['label'] = pd.factorize(df['Conference'])[0]

# Train and Validation Split
X_train, X_val, y_train, y_val = train_test_split(df.index.values, 
                                                  df.label.values, 
                                                  test_size=0.15, 
                                                  random_state=42, 
                                                  stratify=df.label.values)

df['data_type'] = ['not_set']*df.shape[0]

df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'
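
For reference, pd.factorize maps each distinct value to an integer code, which is how the label column is built above. A minimal standalone example (the conference names here are made up for illustration, not read from the actual dataset):

import pandas as pd

s = pd.Series(['VLDB', 'ISCAS', 'VLDB', 'SIGGRAPH'])
codes, uniques = pd.factorize(s)
print(codes)    # [0 1 0 2]
print(uniques)  # Index(['VLDB', 'ISCAS', 'SIGGRAPH'], dtype='object')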

  2. Code
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', 
                                          do_lower_case=True)

encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].Title.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    pad_to_max_length=True, 
    max_length=256, 
    return_tensors='pt'
)

Output:

>>> encoded_data_train
{'input_ids': tensor([[  101,  8144,  1999,  ...,     0,     0,     0],
        [  101,  2152,  2836,  ...,     0,     0,     0],
        [  101, 22454, 25806,  ...,     0,     0,     0],
        ...,
        [  101,  1037,  2047,  ...,     0,     0,     0],
        [  101, 13229,  7375,  ...,     0,     0,     0],
        [  101,  2006,  1996,  ...,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}
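
Note that pad_to_max_length is deprecated in transformers 4.x (the version you are running) in favor of the padding and truncation arguments. A sketch of the equivalent call, assuming the same df, tokenizer, and Title column as the demo above:

encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type == 'train'].Title.values.tolist(),
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',   # replaces the deprecated pad_to_max_length=True
    truncation=True,        # explicitly truncate sequences longer than max_length
    max_length=256,
    return_tensors='pt'
)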

7 Comments

I do have that exact step as well; I was following the article.
It works; I tested the code. What is the output of print(df.columns) after step 1 (Setup)?
Index(['layer1', 'example', 'label', 'data_type'], dtype='object')
Does df['data_type'] raise an error?
Nope, it does not raise any error now. Thanks, that cleared one of the issues.