Representing a ragged array in numpy by padding

Question

I have a 1-dimensional numpy array, scores, holding the scores of some objects. The objects belong to disjoint groups, and the array is ordered by group: all the scores for the first group come first, followed by those for the second group, and so on.

I'd like to create a 2-dimensional array where each row corresponds to a group, and each entry is the score of one of its items. If all the groups are of the same size I can just do:

scores.reshape((numGroups, groupSize))

Unfortunately, my groups may be of varying size. I understand that numpy doesn't support ragged arrays, but it is fine for me if the resulting array simply pads each row with a specified value to make all rows the same length.
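
To illustrate the limitation, here is a small sketch with made-up values (the arrays are illustrative, not the real data): reshape works when every group has the same size, but raises an error when the element count does not fill the target shape.

```python
import numpy as np

# Equal-sized groups: 2 groups of 3 scores each reshape cleanly.
equal_scores = np.arange(6.0)
by_group = equal_scores.reshape((2, 3))

# Ragged groups (sizes 3, 2 and 4, i.e. 9 scores) cannot fill a 3 x 4 array.
ragged_scores = np.arange(9.0)
try:
    ragged_scores.reshape((3, 4))
    reshape_failed = False
except ValueError:
    # numpy refuses to reshape 9 elements into 12 slots.
    reshape_failed = True
```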

To make this concrete, suppose I have set A with 3 items, B with 2 items, and C with 4 items.

scores = numpy.array([f(a[0]), f(a[1]), f(a[2]), f(b[0]), f(b[1]), 
                       f(c[0]), f(c[1]), f(c[2]), f(c[3])])
rowStarts = numpy.array([0, 3, 5])
paddingValue = -1.0
scoresByGroup = groupIntoRows(scores, rowStarts, paddingValue)

The desired value of scoresByGroup would be:

[[f(a[0]), f(a[1]), f(a[2]), -1.0],
 [f(b[0]), f(b[1]), -1.0, -1.0],
 [f(c[0]), f(c[1]), f(c[2]), f(c[3])]]

Is there some numpy function or composition of functions I can use to create groupIntoRows?

Background:

This operation will be used in calculating the loss for a minibatch for a gradient descent algorithm in Theano, so that's why I need to keep it as a composition of numpy functions if possible, rather than falling back on native Python.
It's fine to assume there is some known maximum row size.
The original objects being scored are vectors and the scoring function is a matrix multiplication, which is why we flatten things out in the first place. It would be possible to pad everything to the maximum item set size before doing the matrix multiplication, but the biggest set is over ten times bigger than the average set size, so this is undesirable for speed reasons.
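
As a rough sketch of the setup described above (the vector dimension, the random data and the weight vector are all made-up assumptions), scoring the flattened items touches fewer rows than scoring a batch that was padded first:

```python
import numpy as np

rng = np.random.default_rng(0)

group_sizes = np.array([3, 2, 4])   # sizes of A, B and C from the example
dim = 5                             # hypothetical vector dimension
# One row per item, with the groups laid out back to back.
items = rng.normal(size=(group_sizes.sum(), dim))
weights = rng.normal(size=dim)      # hypothetical scoring weights

# Score every item once, with no padding rows involved.
scores = items @ weights

# Start index of each group within the flat score array.
row_starts = np.concatenate(([0], np.cumsum(group_sizes)[:-1]))

flat_rows = items.shape[0]                          # rows multiplied when flat
padded_rows = len(group_sizes) * group_sizes.max()  # rows if padded first
```

In this toy case the flat version multiplies 9 rows against 12 for the padded one; the gap grows as the largest group gets bigger relative to the average.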

So you have groups with items that other groups don't have, which is why the lengths may differ? Where are you reading the data into the array from?
– Ryan Saxe, Commented May 2, 2013 at 20:30
@RyanSaxe: The "items" are numeric vector representations of the noun phrases in a corpus of text. They are grouped by which noun phrases occur in the same sentence, which is why the sizes of the groups vary.
– Ryan Gabbard, Commented May 3, 2013 at 14:22

Jaime · Accepted Answer · 2013-05-02 20:33:34Z

Try this, which inserts the padding values into the flat array with np.insert and then reshapes the result:

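import numpy as np

scores = np.random.rand(9)
row_starts = np.array([0, 3, 5])
# Group boundaries: each start index, plus the total length as the final end.
row_ends = np.concatenate((row_starts, [len(scores)]))
lens = np.diff(row_ends)             # size of each group
pad_len = np.max(lens) - lens        # padding needed per group
# Insert each group's padding just before the next group starts.
where_to_pad = np.repeat(row_ends[1:], pad_len)
padding_value = -1.0
padded_scores = np.insert(scores, where_to_pad,
                          padding_value).reshape(-1, np.max(lens))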

>>> padded_scores
array([[ 0.05878244,  0.40804443,  0.35640463, -1.        ],
       [ 0.39365072,  0.85313545, -1.        , -1.        ],
       [ 0.133687  ,  0.73651147,  0.98531828,  0.78940163]])
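
The steps above could be wrapped into the groupIntoRows function the question asks for. The following is a minimal sketch rather than part of the original answer: the snake_case function names are illustrative, and the second, mask-based variant is a swapped-in alternative (boolean-mask assignment) included only as a cross-check.

```python
import numpy as np


def group_into_rows(scores, row_starts, padding_value):
    """Pad each group of a flat score array to the longest group's length."""
    scores = np.asarray(scores)
    # Group boundaries: every start index plus the total length.
    bounds = np.concatenate((row_starts, [len(scores)]))
    lens = np.diff(bounds)
    max_len = lens.max()
    # Padding for a group goes just before the next group's start
    # (or at the very end for the last group).
    where_to_pad = np.repeat(bounds[1:], max_len - lens)
    padded = np.insert(scores, where_to_pad, padding_value)
    return padded.reshape(-1, max_len)


def group_into_rows_masked(scores, row_starts, padding_value):
    """Alternative sketch: fill a padded array through a boolean mask."""
    scores = np.asarray(scores)
    bounds = np.concatenate((row_starts, [len(scores)]))
    lens = np.diff(bounds)
    # mask[i, j] is True where row i holds a real score in column j.
    mask = np.arange(lens.max()) < lens[:, None]
    out = np.full(mask.shape, padding_value, dtype=scores.dtype)
    out[mask] = scores  # filled in row-major order, matching the flat layout
    return out


# Deterministic stand-in for the scores of groups A (3), B (2) and C (4).
example_scores = np.arange(1.0, 10.0)
example_starts = np.array([0, 3, 5])
by_insert = group_into_rows(example_scores, example_starts, -1.0)
by_mask = group_into_rows_masked(example_scores, example_starts, -1.0)
```

Both variants return a 3 x 4 array for this input, with -1.0 filling the unused slots in the first two rows.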

