I have an array of strings, each containing one or more words. I want to split/partition the array on a separator (a blank in my case), with as many splits as there are separators in the element containing the most separators. numpy.char.partition, however, only performs a single split, regardless of how often the separator appears:

I've got:

>>> a = np.array(['word', 'two words', 'and three words'])
>>> np.char.partition(a, ' ')

array([['word', '', ''],
       ['two', ' ', 'words'],
       ['and', ' ', 'three words']], dtype='<U8')

I'd like to have:

array([['word', '', '', '', ''],
       ['two', ' ', 'words', '', ''],
       ['and', ' ', 'three', ' ', 'words']], dtype='<U5')

3 Answers

Approach #1

Those partition functions don't seem to partition on all the occurrences. To solve our case, we can use np.char.split to get the split strings and then use masking and array-assignment, like so -

import numpy as np

def partitions(a, sep):
    # Split based on sep
    s = np.char.split(a,sep)

    # Get concatenated split strings
    cs = np.concatenate(s)

    # Get params
    N = len(a)
    l = np.array(list(map(len,s)))
    el = 2*l-1
    ncols = el.max()

    out = np.zeros((N,ncols),dtype=cs.dtype)

    # Setup valid mask that starts at first col until the end for each row
    mask = el[:,None] > np.arange(el.max())

    # Assign separator into all valid positions (word slots get overwritten next)
    out[mask] = sep

    # Setup valid mask that has True at positions where words are to be assigned
    mask[:,1::2] = 0

    # Assign words
    out[mask] = cs
    return out

Sample runs -

In [32]: a = np.array(['word', 'two words', 'and three words'])

In [33]: partitions(a, sep=' ')
Out[33]: 
array([['word', '', '', '', ''],
       ['two', ' ', 'words', '', ''],
       ['and', ' ', 'three', ' ', 'words']], dtype='<U5')

In [44]: partitions(a, sep='ord')
Out[44]: 
array([['w', 'ord', ''],
       ['two w', 'ord', 's'],
       ['and three w', 'ord', 's']], dtype='<U11')
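
To make the two masking steps concrete, here's what the intermediate masks look like in the first sample run, traced by hand from the code above (l = [1, 2, 3], so el = [1, 3, 5] and ncols = 5):

mask = el[:,None] > np.arange(el.max())
# valid slots: the first el[i] columns of each row
# [[ True False False False False]
#  [ True  True  True False False]
#  [ True  True  True  True  True]]

mask[:,1::2] = 0
# word slots only; the odd columns are the separator slots
# [[ True False False False False]
#  [ True False  True False False]
#  [ True False  True False  True]]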

Approach #2

Here's another with a loop, to save on memory -

def partitions_loopy(a, sep):
    # Get params
    N = len(a)
    l = np.char.count(a, sep)+1
    ncols = 2*l.max()-1
    out = np.zeros((N,ncols),dtype=a.dtype)
    for i,(a_i,L) in enumerate(zip(a,l)):
        ss = a_i.split(sep)
        out[i,1:2*L-1:2] = sep
        out[i,:2*L:2] = ss
    return out
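
For reference, a hand-checked run of the loopy version on the same sample input. The rows match Approach #1; only the dtype differs, since out is allocated with a.dtype (here '<U15') rather than the dtype of the split pieces:

partitions_loopy(a, sep=' ')
# array([['word', '', '', '', ''],
#        ['two', ' ', 'words', '', ''],
#        ['and', ' ', 'three', ' ', 'words']], dtype='<U15')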

I came up with my own recursive solution that uses np.char.partition. However, when timing it, it turns out to be less performant: the time is similar to @Divakar's solution for a single split, but it multiplies with the number of splits necessary.

def partitions(a, sep):
    # While any element still contains the separator, split off the
    # first (head, sep) column pair and recurse on the remainder
    if np.any(np.char.count(a, sep) >= 1):
        a2 = np.char.partition(a, sep)
        return np.concatenate([a2[:, 0:2], partitions(a2[:, 2], sep)], axis=1)
    # No separators left: each element becomes its own single column
    return a.reshape(-1, 1)
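
To see why the time multiplies with the number of splits: each recursion level runs one np.char.partition over the full remainder column, so the recursion depth equals the maximum separator count. For the sample array the remainders unroll like this (hand trace):

# level 0 remainder a2[:, 2]: ['', 'words', 'three words']
# level 1 remainder a2[:, 2]: ['', '', 'words']
# level 2: no separator left, so a.reshape(-1, 1) ends the recursion
#          and the column pairs are concatenated on the way back up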

2 Comments

Can you test out my Approach #2 too? Thanks! Would like to know if it's any better than #1 and if so, by what margin.
The execution time of @Divakar's Approach #2 turns out to be ~80% of Approach #1 on my example vector (size 3), but ~200% on a size-300 vector: a = np.array(['word', 'two', 'and three words'] * 100). The same tendency holds if the number of splits increases (~200% for 4 splits instead of 2). My recursive partitions, however, quickly loses ground, with up to 20 times the execution time of Approach #1.
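
For anyone wanting to reproduce such a comparison, a minimal harness along these lines works (a sketch; partitions_rec is a hypothetical rename of my recursive version above, to avoid the name clash with Approach #1):

import timeit
import numpy as np

a = np.array(['word', 'two', 'and three words'] * 100)

# Time each implementation on the same input; partitions and
# partitions_loopy are Approaches #1 and #2 above
for fn in (partitions, partitions_loopy, partitions_rec):
    t = timeit.timeit(lambda: fn(a, ' '), number=100)
    print(f'{fn.__name__}: {t:.4f}s')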

The function-based approaches are great but seem too complex. You can solve this with just data-structure transforms and re.split, in a single line of code.

import re

import numpy as np
import pandas as pd

a = np.array(['word', 'two words', 'and three words'])

# Use re.split to get the partitions, then transform to a DataFrame,
# fillna, and transform back!
np.array(pd.DataFrame([re.split('( )', i) for i in a]).fillna(''))

# You can change the '( )' to '(\W)' if you want it to separate on all
# non-word characters!
array([['word', '', '', '', ''],
       ['two', ' ', 'words', '', ''],
       ['and', ' ', 'three', ' ', 'words']], dtype=object)
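
If the DataFrame round trip is too heavy for small inputs, the same pad-the-ragged-rows idea can be written with itertools.zip_longest instead (a sketch along the same lines, not part of the original one-liner):

import re
from itertools import zip_longest

import numpy as np

a = np.array(['word', 'two words', 'and three words'])

# Split each element (keeping the separators), then pad the ragged
# rows with '' by zipping column-wise and transposing back
rows = [re.split('( )', s) for s in a]
out = np.array(list(zip_longest(*rows, fillvalue=''))).T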

1 Comment

Elegant solution :). Timing: it turns out to be similar to @Divakar's Approach #2 for large arrays (i.e. slower than Approach #1 by a factor of ~2). For small ones, however, there's a big overhead in creating the DataFrame.
