I have a 1-dimensional numpy array, scores, holding the scores associated with some objects. These objects belong to disjoint groups, and the array is ordered by group: all the scores of the items in the first group come first, followed by the scores of the items in the second group, and so on.

I'd like to create a 2-dimensional array where each row corresponds to a group, and each entry is the score of one of its items. If all the groups are of the same size I can just do:

scores.reshape((numGroups, groupSize))

Unfortunately, my groups may be of varying size. I understand that numpy doesn't support ragged arrays, but it is fine for me if the resulting array simply pads each row with a specified value to make all rows the same length.

To make this concrete, suppose I have set A with 3 items, B with 2 items, and C with 4 items.

scores = numpy.array([f(a[0]), f(a[1]), f(a[2]), f(b[0]), f(b[1]), 
                       f(c[0]), f(c[1]), f(c[2]), f(c[3])])
rowStarts = numpy.array([0, 3, 5])
paddingValue = -1.0
scoresByGroup = groupIntoRows(scores, rowStarts, paddingValue)

The desired value of scoresByGroup would be:

[[f(a[0]), f(a[1]), f(a[2]), -1.0],
 [f(b[0]), f(b[1]), -1.0, -1.0],
 [f(c[0]), f(c[1]), f(c[2]), f(c[3])]]

Is there some numpy function or composition of functions I can use to create groupIntoRows?

Background:

  • This operation will be used in calculating the loss for a minibatch for a gradient descent algorithm in Theano, so that's why I need to keep it as a composition of numpy functions if possible, rather than falling back on native Python.
  • It's fine to assume there is some known maximum row size.
  • The original objects being scored are vectors and the scoring function is a matrix multiplication, which is why we flatten things out in the first place. It would be possible to pad everything to the maximum item set size before doing the matrix multiplication, but the biggest set is over ten times bigger than the average set size, so this is undesirable for speed reasons.
  • So you have groups that have items that other groups don't, which is why the lengths may differ? Where are you reading the info into the array from? Commented May 2, 2013 at 20:30
  • @RyanSaxe: The "items" are numeric vector representations of the noun phrases in a corpus of text. They are grouped by which noun phrases occur in the same sentence, which is why the sizes of the groups vary. Commented May 3, 2013 at 14:22

1 Answer

Try this:

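import numpy as np

scores = np.random.rand(9)
row_starts = np.array([0, 3, 5])
# Group boundaries: every start, plus the end of the last group.
row_ends = np.concatenate((row_starts, [len(scores)]))
lens = np.diff(row_ends)             # group sizes: [3, 2, 4]
pad_len = np.max(lens) - lens        # padding needed per row: [1, 2, 0]
# Insert each row's padding just before the start of the next group.
where_to_pad = np.repeat(row_ends[1:], pad_len)
padding_value = -1.0
padded_scores = np.insert(scores, where_to_pad,
                          padding_value).reshape(-1, np.max(lens))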

>>> padded_scores
array([[ 0.05878244,  0.40804443,  0.35640463, -1.        ],
       [ 0.39365072,  0.85313545, -1.        , -1.        ],
       [ 0.133687  ,  0.73651147,  0.98531828,  0.78940163]])
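
If it helps, here is a minimal sketch of the groupIntoRows helper from the question, written with a different technique from the np.insert approach above: it builds a boolean mask of the "real" cells and assigns the scores through it. The name group_into_rows and its signature are my own, for illustration only. It assumes row_starts is sorted and begins at 0, and it is plain NumPy; I have not checked how well it carries over to Theano.

```python
import numpy as np

def group_into_rows(scores, row_starts, padding_value):
    """Hypothetical helper: pad variable-length groups into a 2-D array."""
    scores = np.asarray(scores)
    # Group boundaries: every start, plus the end of the last group.
    bounds = np.concatenate((row_starts, [len(scores)]))
    lens = np.diff(bounds)                     # size of each group
    max_len = lens.max()
    # mask[i, j] is True where row i holds a real item in column j.
    mask = np.arange(max_len) < lens[:, None]
    out = np.full((len(lens), max_len), padding_value, dtype=scores.dtype)
    # Boolean assignment fills the True cells in row-major order,
    # which matches the group-by-group layout of `scores`.
    out[mask] = scores
    return out

scores = np.arange(9, dtype=float)
scores_by_group = group_into_rows(scores, np.array([0, 3, 5]), -1.0)
print(scores_by_group)
```

With scores 0.0 through 8.0 this gives the rows [0, 1, 2, -1], [3, 4, -1, -1] and [5, 6, 7, 8]. The mask is also handy afterwards if the loss needs to ignore the padded cells.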