I have an array of strings, each containing one or more words. I want to split/partition the array on a separator (a blank in my case), with as many splits as there are separators in the element containing the most separators. numpy.char.partition, however, only performs a single split, regardless of how often the separator appears:

I've got:

>>> a = np.array(['word', 'two words', 'and three words'])
>>> np.char.partition(a, ' ')

array([['word', '', ''],
       ['two', ' ', 'words'],
       ['and', ' ', 'three words']], dtype='<U8')

I'd like to have:

array([['word', '', '', '', ''],
       ['two', ' ', 'words', '', ''],
       ['and', ' ', 'three', ' ', 'words']], dtype='<U5')

3 Answers

Approach #1

Those partition functions don't seem to partition on all the occurrences. To solve our case, we can use np.char.split to get the split strings and then use masking and array-assignment, like so -

import numpy as np

def partitions(a, sep):
    # Split based on sep
    s = np.char.split(a,sep)

    # Get concatenated split strings
    cs = np.concatenate(s)

    # Get params
    N = len(a)
    l = np.array(list(map(len,s)))
    el = 2*l-1
    ncols = el.max()

    out = np.zeros((N,ncols),dtype=cs.dtype)

    # Setup valid mask that starts at first col until the end for each row
    mask = el[:,None] > np.arange(el.max())

    # Assign separator into all valid positions (word slots get overwritten next)
    out[mask] = sep

    # Setup valid mask that has True at positions where words are to be assigned
    mask[:,1::2] = 0

    # Assign words
    out[mask] = cs
    return out

Sample runs -

In [32]: a = np.array(['word', 'two words', 'and three words'])

In [33]: partitions(a, sep=' ')
Out[33]: 
array([['word', '', '', '', ''],
       ['two', ' ', 'words', '', ''],
       ['and', ' ', 'three', ' ', 'words']], dtype='<U5')

In [44]: partitions(a, sep='ord')
Out[44]: 
array([['w', 'ord', ''],
       ['two w', 'ord', 's'],
       ['and three w', 'ord', 's']], dtype='<U11')
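
To make the two masking steps concrete, here's what the intermediate masks look like in the first sample run, traced by hand from the code above (l = [1, 2, 3], so el = [1, 3, 5] and ncols = 5):

mask = el[:,None] > np.arange(el.max())
# valid slots: the first el[i] columns of each row
# [[ True False False False False]
#  [ True  True  True False False]
#  [ True  True  True  True  True]]

mask[:,1::2] = 0
# word slots only; the odd columns are the separator slots
# [[ True False False False False]
#  [ True False  True False False]
#  [ True False  True False  True]]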

Approach #2

Here's another with a loop, to save on memory -

def partitions_loopy(a, sep):
    # Get params
    N = len(a)
    l = np.char.count(a, sep)+1
    ncols = 2*l.max()-1
    out = np.zeros((N,ncols),dtype=a.dtype)
    for i,(a_i,L) in enumerate(zip(a,l)):
        ss = a_i.split(sep)
        out[i,1:2*L-1:2] = sep
        out[i,:2*L:2] = ss
    return out
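
For reference, a hand-checked run of the loopy version on the same sample input. The rows match Approach #1; only the dtype differs, since out is allocated with a.dtype (here '<U15') rather than the dtype of the split pieces:

partitions_loopy(a, sep=' ')
# array([['word', '', '', '', ''],
#        ['two', ' ', 'words', '', ''],
#        ['and', ' ', 'three', ' ', 'words']], dtype='<U15')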

I came up with my own recursive solution that uses np.char.partition. However, when timing it, it turns out to be less performant: the time is similar to @Divakar's solution for a single split, but it multiplies with the number of splits necessary.

def partitions(a, sep):
    # While any element still contains the separator, split off the
    # first (head, sep) column pair and recurse on the remainder
    if np.any(np.char.count(a, sep) >= 1):
        a2 = np.char.partition(a, sep)
        return np.concatenate([a2[:, 0:2], partitions(a2[:, 2], sep)], axis=1)
    # No separators left: each element becomes its own single column
    return a.reshape(-1, 1)
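
To see why the time multiplies with the number of splits: each recursion level runs one np.char.partition over the full remainder column, so the recursion depth equals the maximum separator count. For the sample array the remainders unroll like this (hand trace):

# level 0 remainder a2[:, 2]: ['', 'words', 'three words']
# level 1 remainder a2[:, 2]: ['', '', 'words']
# level 2: no separator left, so a.reshape(-1, 1) ends the recursion
#          and the column pairs are concatenated on the way back up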

2 Comments

Can you test out my Approach #2 too? Thanks! Would like to know if it's any better than #1 and if so, by what margin.
The execution time of @Divakar's Approach #2 turns out to be ~80% of Approach #1 on my example vector (size 3), but ~200% on a size-300 vector: a = np.array(['word', 'two', 'and three words'] * 100). The same tendency holds if the number of splits increases (~200% for 4 splits instead of 2). My recursive partitions, however, quickly loses ground, with up to 20 times the execution time of Approach #1.
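
For anyone wanting to reproduce such a comparison, a minimal harness along these lines works (a sketch; partitions_rec is a hypothetical rename of my recursive version above, to avoid the name clash with Approach #1):

import timeit
import numpy as np

a = np.array(['word', 'two', 'and three words'] * 100)

# Time each implementation on the same input; partitions and
# partitions_loopy are Approaches #1 and #2 above
for fn in (partitions, partitions_loopy, partitions_rec):
    t = timeit.timeit(lambda: fn(a, ' '), number=100)
    print(f'{fn.__name__}: {t:.4f}s')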

The function-based approaches are great but seem too complex. You can solve this with just data-structure transforms and re.split, in a single line of code.

import re

import numpy as np
import pandas as pd

a = np.array(['word', 'two words', 'and three words'])

# Use re.split to get the partitions, then transform to a DataFrame,
# fillna, and transform back!
np.array(pd.DataFrame([re.split('( )', i) for i in a]).fillna(''))

# You can change the '( )' to '(\W)' if you want it to separate on all
# non-word characters!
array([['word', '', '', '', ''],
       ['two', ' ', 'words', '', ''],
       ['and', ' ', 'three', ' ', 'words']], dtype=object)
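
If the DataFrame round trip is too heavy for small inputs, the same pad-the-ragged-rows idea can be written with itertools.zip_longest instead (a sketch along the same lines, not part of the original one-liner):

import re
from itertools import zip_longest

import numpy as np

a = np.array(['word', 'two words', 'and three words'])

# Split each element (keeping the separators), then pad the ragged
# rows with '' by zipping column-wise and transposing back
rows = [re.split('( )', s) for s in a]
out = np.array(list(zip_longest(*rows, fillvalue=''))).T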

1 Comment

Elegant solution :). Timing: it turns out to be similar to @Divakar's Approach #2 for large arrays (i.e. slower than Approach #1 by a factor of ~2). For small ones, however, there's a big overhead in creating the DataFrame.
