In Python, adding an empty list to a DataFrame column using a lambda raises ValueError

Question

I was able to reproduce the error on synthetic data:

import pandas as pd
from datetime import datetime

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': [datetime.now(), datetime.now(), datetime.now(), datetime.now()],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3]);
df2 = pd.DataFrame({'A': ['A1', 'A2', 'A3', 'A4'],
                    'E': ['E1', 'E2', 'E3', 'E4']},
                   index=[0,1,2,3]);

df = pd.merge(df1, df2, how='left', on=['A', 'A']);

def getList(row):
    r = [];
    if row["A"] == "A1": r.append("test-01");
    if row["B"] == "B1": r.append("test-02");
    if row["B"] == "B2": r.append("test-03");
    return r;

df["NEW_COLUMN"] = df.apply(lambda row: getList(row), axis = 1);

Original post: I would like to create a new column in a pandas DataFrame based on multiple conditions. The value of the new column should be a list. However, I get "ValueError: Empty data passed with indices specified." when the list is empty.

def getList(p_row):
  r = [];
  if p_row["field1"] > 0: r.append("x");
  ...
  return r;

df["new_list_field"] = df.apply(lambda row: getList(row), axis = 1);

The complete error:

ValueError                                Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals.py in create_block_manager_from_arrays(arrays, names, axes)
   4636     try:
-> 4637         blocks = form_blocks(arrays, names, axes)
   4638         mgr = BlockManager(blocks, axes)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals.py in form_blocks(arrays, names, axes)
   4728     if len(object_items) > 0:
-> 4729         object_blocks = _simple_blockify(object_items, np.object_)
   4730         blocks.extend(object_blocks)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals.py in _simple_blockify(tuples, dtype)
   4758     """
-> 4759     values, placement = _stack_arrays(tuples, dtype)
   4760

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals.py in _stack_arrays(tuples, dtype)
   4822     for i, arr in enumerate(arrays):
-> 4823         stacked[i] = _asarray_compat(arr)
   4824

ValueError: could not broadcast input array from shape (2) into shape (195)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
in ()
----> 1 df["new_list_field"] = df.apply(lambda row: getList(row), axis = 1);

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
   4875                         f, axis,
   4876                         reduce=reduce,
-> 4877                         ignore_failures=ignore_failures)
   4878         else:
   4879             return self._apply_broadcast(f, axis)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in _apply_standard(self, func, axis, ignore_failures, reduce)
   4988             index = None
   4989
-> 4990         result = self._constructor(data=results, index=index)
   4991         result.columns = res_index
   4992

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
    328                                  dtype=dtype, copy=copy)
    329         elif isinstance(data, dict):
--> 330             mgr = self._init_dict(data, index, columns, dtype=dtype)
    331         elif isinstance(data, ma.MaskedArray):
    332             import numpy.ma.mrecords as mrecords

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in _init_dict(self, data, index, columns, dtype)
    459             arrays = [data[k] for k in keys]
    460
--> 461         return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
    462
    463     def _init_ndarray(self, values, index, columns, dtype=None, copy=False):

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
   6171     axes = [_ensure_index(columns), _ensure_index(index)]
   6172
-> 6173     return create_block_manager_from_arrays(arrays, arr_names, axes)
   6174
   6175

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals.py in create_block_manager_from_arrays(arrays, names, axes)
   4640         return mgr
   4641     except ValueError as e:
-> 4642         construction_error(len(arrays), arrays[0].shape, axes, e)
   4643
   4644

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals.py in construction_error(tot_items, block_shape, axes, e)
   4604         raise e
   4605     if block_shape[0] == 0:
-> 4606         raise ValueError("Empty data passed with indices specified.")
   4607     raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
   4608         passed, implied))

ValueError: Empty data passed with indices specified.

Welcome to SO! Can you provide the dataframe in a Minimal, Complete, and Verifiable manner?
– moshevi, Commented Jul 26, 2018 at 12:13
Unfortunately I cannot share the original data. It is a dataset of a million records merged from several data sources, and it contains very sensitive information. It has an index on 2 columns. I cannot reproduce the error with sample data because that code works properly. I also tried creating a copy of a slice of the large dataset, but got the same error message.
– spyder, Commented Jul 26, 2018 at 12:43
The data comes from an RDBMS via ODBC and also from CSV files. I made a small chunk with copy() and kept only a few fields: ID int64, PAYMENT_ID float64, CREDIT_FL object, TRX_MSG object, CALENDAR_DATE datetime64[ns]. The object-type fields should be strings. With the CALENDAR_DATE field the ValueError is raised; without that field it works. Something might be wrong with the datatypes, although I could already create a few fields with a simple condition using int and string types.
– spyder, Commented Jul 26, 2018 at 13:21
One sincere piece of advice: can you try not to use semicolons in Python scripts? They don't do anything in Python and don't throw errors, but they make the code clumsier...
– xyzjayne, Commented Jul 26, 2018 at 14:18

xyzjayne · Accepted Answer · 2018-07-26 14:24:10Z

3

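The length of the list returned by this function varies from row to row, and apply cannot assemble results of unequal lengths into a new pandas column here (the traceback shows it failing while passing the per-row results to the DataFrame constructor). You can verify the varying lengths with: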

for idx,row in df.iterrows():
    print(getList(row))

An alternative would be converting the output to a string instead:

def getListString(row):
    r = ''
    if row["A"] == "A1": r+="test-01"
    if row["B"] == "B1": r+="test-02"
    if row["B"] == "B2": r+="test-03"
    return r
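
If the string route is taken, a separator keeps the individual labels recoverable afterwards. The sketch below is one possible way to apply the idea; the three-row frame and the name get_labels are made up for illustration, and the comma separator is an assumption rather than part of the answer above.

```python
import pandas as pd

# Hypothetical sample frame, loosely mirroring columns A and B from the question.
df = pd.DataFrame({"A": ["A0", "A1", "A2"],
                   "B": ["B0", "B1", "B2"]})


def get_labels(row):
    """Collect the labels that match a row (same conditions as getList)."""
    labels = []
    if row["A"] == "A1":
        labels.append("test-01")
    if row["B"] == "B1":
        labels.append("test-02")
    if row["B"] == "B2":
        labels.append("test-03")
    return labels


# Join with a comma so every row yields one scalar string (possibly empty)
# and the labels can later be split apart again with str.split(",").
df["NEW_COLUMN"] = df.apply(lambda row: ",".join(get_labels(row)), axis=1)
print(df)
```

Because every row now returns a single string, apply has nothing to stack, whatever the number of matching conditions.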


spyder Over a year ago

df["C"] = df['C'].dt.date fixed the issue. No clue why :-(
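
A minimal sketch of the workaround from this comment, on made-up data. The thread does not establish why the conversion helps, so this only illustrates the steps: after .dt.date the column holds plain Python date objects (object dtype), and the list-returning apply is then assigned as a column. How apply treats list results differs between pandas versions, so treat this as an illustration, not a diagnosis.

```python
import pandas as pd

# Hypothetical frame with a datetime64[ns] column, like column C in the question.
df = pd.DataFrame({"A": ["A0", "A1", "A2"],
                   "B": ["B0", "B1", "B2"],
                   "C": pd.to_datetime(["2018-07-26"] * 3)})


def get_list(row):
    """Same conditions as getList in the question, returning a list."""
    r = []
    if row["A"] == "A1":
        r.append("test-01")
    if row["B"] == "B1":
        r.append("test-02")
    if row["B"] == "B2":
        r.append("test-03")
    return r


# Workaround reported in the comment: replace the datetime64 column
# with plain datetime.date objects before calling apply.
df["C"] = df["C"].dt.date

df["NEW_COLUMN"] = df.apply(get_list, axis=1)
print(df["NEW_COLUMN"].tolist())
```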

arturomp · Answer · 2019-01-25 20:43:41Z

0

I ended up building a list of lists, turning it into a pd.Series, and assigning that to a new column. Here key2list is a dictionary whose values are variable-length lists:

new_col_list = []

for _, row in my_df.iterrows():
    new_col_list.append(key2list[row[u'key']])

my_df[u'new_col'] = pd.Series(new_col_list)
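
One caveat with this pattern: assigning a Series aligns on index labels, so pd.Series(new_col_list) with its default 0..n-1 index only lines up when my_df uses that same index. A hedged sketch, with a made-up key2list and frame, that passes the frame's index explicitly:

```python
import pandas as pd

# Hypothetical data: a frame whose index is NOT the default 0..n-1.
my_df = pd.DataFrame({"key": ["a", "b", "c"]}, index=[10, 20, 30])
key2list = {"a": [], "b": ["x"], "c": ["x", "y"]}

# Build the list of lists without iterrows.
new_col_list = [key2list[k] for k in my_df["key"]]

# Reuse the frame's index so each list lands on the row it was built from.
my_df["new_col"] = pd.Series(new_col_list, index=my_df.index)
print(my_df)
```

Without index=my_df.index, the labels 0, 1, 2 would not match 10, 20, 30 and the new column would be filled with NaN.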

