My question is how to efficiently expand an array by copying it many times. I am trying to expand my survey sample to the full-size dataset by copying every sample N times, where N is the influence factor assigned to that sample. I wrote two loops to do this task (script pasted below). It works, but it is slow: my sample size is 20,000, and I am trying to expand it to 3 million rows. Is there any function I can try? Thank you for your help!

----My script----

import numpy as np

lines = np.asarray(person.read().split('\n'))
df_array = np.asarray(lines[0].split(' '))
for j in range(1, len(lines) - 1):             # skip the empty trailing line
    subarray = np.asarray(lines[j].split(' '))
    factor = int(round(float(subarray[-1])))   # N copies for this sample
    for i in range(factor):                    # stack the row N times -- slow!
        df_array = np.vstack((df_array, subarray))
print(len(df_array))

3 Answers

First, you can try to load the data all at once with numpy.loadtxt.

Then, to repeat each row according to its last column, use numpy.repeat:

>>> data = np.array([[1, 2, 3],
...                  [4, 5, 6]])
>>> np.repeat(data, data[:,-1], axis=0)
array([[1, 2, 3],
       [1, 2, 3],
       [1, 2, 3],
       [4, 5, 6],
       [4, 5, 6],
       [4, 5, 6],
       [4, 5, 6],
       [4, 5, 6],
       [4, 5, 6]])

Finally, if you need to round data[:,-1], replace it with np.round(data[:,-1]).astype(int).
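Putting the two steps together, a minimal sketch (assuming a whitespace-delimited file, here called samples.txt, whose last column holds the expansion factor):

import numpy as np

data = np.loadtxt('samples.txt')              # hypothetical file name
factors = np.round(data[:, -1]).astype(int)   # integer repeat counts
expanded = np.repeat(data, factors, axis=0)   # each row repeated factor times
print(len(expanded))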

Stacking NumPy arrays over and over is not very efficient, because they are not optimized for dynamic growth like that. Every time you call np.vstack, it allocates a whole new block of memory and copies everything accumulated so far, so the loop does quadratic work overall.

Use a list, then build your array once at the end, for example with a generator like this:

def upsample(stream):
    # Yield each input record `factor` times.
    for line in stream:
        rec = line.strip().split()
        factor = int(round(float(rec[-1])))
        for i in range(factor):
            yield rec

df_array = np.array(list(upsample(person)))

The concept you are looking for is called broadcasting. It allows you to fill an n-dimensional array with an (n-1)-dimensional array's contents.

Looking at your code example, you are calling np.vstack() in a loop. Broadcasting will eliminate the loop.

For example, if you have a 1D array of n elements,

>>> n = 5
>>> df_array = np.arange(n)
>>> df_array
array([0, 1, 2, 3, 4])

you can then create a new 10 x n array:

>>> bigger_array = np.empty([10,n])
>>> bigger_array[:] = df_array
>>> bigger_array
array([[ 0.,  1.,  2.,  3.,  4.],
       [ 0.,  1.,  2.,  3.,  4.],
       [ 0.,  1.,  2.,  3.,  4.],
       [ 0.,  1.,  2.,  3.,  4.],
       [ 0.,  1.,  2.,  3.,  4.],
       [ 0.,  1.,  2.,  3.,  4.],
       [ 0.,  1.,  2.,  3.,  4.],
       [ 0.,  1.,  2.,  3.,  4.],
       [ 0.,  1.,  2.,  3.,  4.],
       [ 0.,  1.,  2.,  3.,  4.]])

So with a single line of code, you can fill it with the contents of the smaller array.

bigger_array[:] = df_array

NB. Avoid using Python lists here; they are far, far slower than the NumPy ndarray.

2 Comments

Thank you. If my understanding is right, you are saying to apply bigger_array[:] to expand each small sample. After expanding them one by one, I still need to combine all of them into one big dataset. At that stage it is not expanding, it is combining. Is there a more efficient way than np.vstack()?
The most efficient way is likely to use np.empty() to allocate the space/memory for your final dataset up front, and then load the data and broadcast into it using slice indexing, as in the sketch below. This is inherently faster than growing the array in a Python loop.
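A minimal sketch of that idea (the names rows and factors are hypothetical; rows holds the samples and factors the integer repeat count for each one):

import numpy as np

rows = np.array([[1., 2., 3.],
                 [4., 5., 6.]])    # hypothetical sample data
factors = np.array([2, 3])         # repeat count per row

# Allocate the full-size array once.
out = np.empty((factors.sum(), rows.shape[1]))

# Broadcast each row into its slice of the preallocated array.
start = 0
for row, n in zip(rows, factors):
    out[start:start + n] = row
    start += n

print(out)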
