I have a DataFrame named df4; you can build it with the following code:

from io import StringIO

import numpy as np
import pandas as pd

df4s = """
contract    RB  BeginDate   ValIssueDate    EndDate Valindex0   48  46  47  49  50
2   A00118  46  19850100    19880901    99999999    50  1   2   3   7   7
3   A00118  47  19000100    19880901    19831231    47  1   2   3   7   7
5   A00118  47  19850100    19880901    99999999    50  1   2   3   7   7
6   A00253  48  19000100    19820101    19811231    47  1   2   3   7   7
7   A00253  48  19820100    19820101    19841299    47  1   2   3   7   7
8   A00253  48  19850100    19820101    99999999    50  1   2   3   7   7
9   A00253  50  19000100    19820101    19781231    47  1   2   3   7   7
10  A00253  50  19790100    19820101    19841299    47  1   2   3   7   7
11  A00253  50  19850100    19820101    99999999    50  1   2   3   7   7

"""

# Raw string so that \s is passed through as a regex, not treated as a string escape
df4 = pd.read_csv(StringIO(df4s.strip()), sep=r'\s+',
                  dtype={"RB": int, "BeginDate": int, "EndDate": int, 'ValIssueDate': int, 'Valindex0': int})

The output would be:

contract    RB  BeginDate   ValIssueDate    EndDate Valindex0   48  46  47  49  50
2   A00118  46  19850100    19880901    99999999    50  1   2   3   7   7
3   A00118  47  19000100    19880901    19831231    47  1   2   3   7   7
5   A00118  47  19850100    19880901    99999999    50  1   2   3   7   7
6   A00253  48  19000100    19820101    19811231    47  1   2   3   7   7
7   A00253  48  19820100    19820101    19841299    47  1   2   3   7   7
8   A00253  48  19850100    19820101    99999999    50  1   2   3   7   7
9   A00253  50  19000100    19820101    19781231    47  1   2   3   7   7
10  A00253  50  19790100    19820101    19841299    47  1   2   3   7   7
11  A00253  50  19850100    19820101    99999999    50  1   2   3   7   7

I'm trying to build a new column using the following logic; its values are based on the values of two existing columns:

def test(RB):
    n=1
    for i in np.arange(RB,50):
        n = n * df4[str(i)].values
    return  n


vfunc=np.vectorize(test)
df4['n']=vfunc(df4['RB'].values)

I then received this error:

    res = array(outputs, copy=False, subok=True, dtype=otypes[0])

ValueError: setting an array element with a sequence.
  • df4[str(i)].values is an array, so your return of n (assuming RB is low enough that you do loop) is an array like [6 6 6 6 6 6 6 6 6]. vectorize is attempting to assign this back to a 1D array. Are you looking to create a 2d array here? Commented Aug 27, 2021 at 14:44
  • Yes, I think so. Thank you for your reply. Commented Aug 27, 2021 at 14:45
  • @HenryEcker, my answer shows that the error occurs in vectorize, not the assignment to the dataframe column. Commented Aug 27, 2021 at 15:43

1 Answer

Reconstructing your dataframe (thanks for using the StringIO approach):

In [82]: df4['RB'].values
Out[82]: array([46, 47, 47, 48, 48, 48, 50, 50, 50])
In [83]: test(46)
Out[83]: array([42, 42, 42, 42, 42, 42, 42, 42, 42])
In [84]: test(50)
Out[84]: 1
In [85]: [test(i) for i in df4['RB'].values]
Out[85]: 
[array([42, 42, 42, 42, 42, 42, 42, 42, 42]),
 array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
 array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
 array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
 array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
 array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
 1,
 1,
 1]
In [86]: vfunc=np.vectorize(test)
In [87]: vfunc(df4['RB'].values)
TypeError: only size-1 arrays can be converted to Python scalars

The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "<ipython-input-87-8db8cd5dc5ab>", line 1, in <module>
    vfunc(df4['RB'].values)
  File "/usr/local/lib/python3.8/dist-packages/numpy/lib/function_base.py", line 2163, in __call__
    return self._vectorize_call(func=func, args=vargs)
  File "/usr/local/lib/python3.8/dist-packages/numpy/lib/function_base.py", line 2249, in _vectorize_call
    res = asanyarray(outputs, dtype=otypes[0])
ValueError: setting an array element with a sequence.

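Note the full traceback. vectorize is having trouble creating the return array from this set of mixed-size results (length-9 arrays and scalars). It 'guessed', based on a trial calculation with the first input, that it should return an int dtype, and these results cannot be packed into an int array.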
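The same failure can be reproduced without the dataframe. This is only a rough sketch of the conversion step seen in the traceback, not a copy of numpy's internals; `ragged` is a hypothetical stand-in for test that returns an array for some inputs and a scalar for others.

```python
import numpy as np

def ragged(k):
    # Hypothetical stand-in for test(): an array for small k, a scalar otherwise
    return np.full(3, k) if k < 3 else 1

outputs = [ragged(k) for k in (1, 2, 3)]

# The result for the first input is an int array, so int is the dtype
# a trial calculation would suggest
guessed = np.asarray(outputs[0]).dtype

try:
    # Two length-3 arrays plus a scalar cannot form a regular int array
    np.asarray(outputs, dtype=guessed)
    failed = False
except ValueError as exc:
    failed = True
    message = str(exc)

# An object array avoids the conversion: each slot holds one result unchanged
boxed = np.empty(len(outputs), dtype=object)
for i, out in enumerate(outputs):
    boxed[i] = out
```

Here `failed` ends up True, while `boxed` simply stores the three results side by side.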

If we tell it to return an object dtype array, we get:

In [88]: vfunc=np.vectorize(test, otypes=['object'])
In [89]: vfunc(df4['RB'].values)
Out[89]: 
array([array([42, 42, 42, 42, 42, 42, 42, 42, 42]),
       array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
       array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
       array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
       array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
       array([7, 7, 7, 7, 7, 7, 7, 7, 7]), 1, 1, 1], dtype=object)

We can assign that to a df column:

In [90]: df4['n']=_
In [91]: df4
Out[91]: 
   contract  RB  BeginDate  ...  49  50                                     n
2    A00118  46   19850100  ...   7   7  [42, 42, 42, 42, 42, 42, 42, 42, 42]
3    A00118  47   19000100  ...   7   7  [21, 21, 21, 21, 21, 21, 21, 21, 21]
5    A00118  47   19850100  ...   7   7  [21, 21, 21, 21, 21, 21, 21, 21, 21]
6    A00253  48   19000100  ...   7   7           [7, 7, 7, 7, 7, 7, 7, 7, 7]
7    A00253  48   19820100  ...   7   7           [7, 7, 7, 7, 7, 7, 7, 7, 7]
8    A00253  48   19850100  ...   7   7           [7, 7, 7, 7, 7, 7, 7, 7, 7]
9    A00253  50   19000100  ...   7   7                                     1
10   A00253  50   19790100  ...   7   7                                     1
11   A00253  50   19850100  ...   7   7                                     1

We could just as well assign the Out[85] list:

df4['n']=Out[85]

Time is about the same:

In [94]: timeit vfunc(df4['RB'].values)
211 µs ± 5.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [95]: timeit [test(i) for i in df4['RB'].values]
217 µs ± 6.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Usually vectorize is slower than a list comprehension, but here test itself may be slow enough that the iteration method doesn't make much difference. Remember (reread the docs if necessary), vectorize is not a performance tool. It does not 'compile' your function or make it run faster.
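One way to see that vectorize is still a Python-level loop is to count how often the wrapped function runs. This is a small sketch; `counted` and `counter` are names made up for the illustration.

```python
import numpy as np

counter = {"n": 0}

def counted(x):
    # Record every Python-level call
    counter["n"] += 1
    return x * 2

data = np.arange(4)

counter["n"] = 0
np.vectorize(counted)(data)
calls_guessing = counter["n"]      # may include an extra trial call

counter["n"] = 0
np.vectorize(counted, otypes=[int])(data)
calls_with_otypes = counter["n"]   # one call per element

counter["n"] = 0
_ = [counted(x) for x in data]
calls_loop = counter["n"]          # one call per element
```

With otypes given, the function runs once per element, exactly like the list comprehension. Without otypes, the docs describe a trial call on the first element to determine the output type, so the count can be one higher unless cache=True is passed.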

An alternative for returning an object dtype array:

In [96]: vfunc=np.frompyfunc(test,1,1)
In [97]: vfunc(df4['RB'].values)
Out[97]: 
array([array([42, 42, 42, 42, 42, 42, 42, 42, 42]),
       array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
       array([21, 21, 21, 21, 21, 21, 21, 21, 21]),
       array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
       array([7, 7, 7, 7, 7, 7, 7, 7, 7]),
       array([7, 7, 7, 7, 7, 7, 7, 7, 7]), 1, 1, 1], dtype=object)
In [98]: timeit vfunc(df4['RB'].values)
202 µs ± 6.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
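If the goal was actually one number per row (the product of that row's columns from RB up to 49), the Python-level loop can be dropped entirely. This is an assumption about the intent, not something the question states; the sketch below uses a small hypothetical frame `df` with the same column values as df4, and replaces the columns below each row's RB with 1 before multiplying.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of df4: same column values, one row per RB case
df = pd.DataFrame({
    "RB": [46, 47, 48, 50],
    "48": [1] * 4, "46": [2] * 4, "47": [3] * 4, "49": [7] * 4, "50": [7] * 4,
})

cols = np.arange(46, 50)                       # same range as np.arange(RB, 50) at its widest
vals = df[[str(c) for c in cols]].to_numpy()   # shape (rows, 4), ordered 46..49

# True where the column number is >= that row's RB
mask = cols[None, :] >= df["RB"].to_numpy()[:, None]

# Columns outside the range contribute a neutral factor of 1
df["n"] = np.where(mask, vals, 1).prod(axis=1)
```

For these values `df["n"]` is 42, 21, 7, 1, matching the scalars inside the arrays that test returns for RB = 46, 47, 48, 50.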