I have a large text file with three elements in each row - user, question, value. I would like to create a 2d numpy array from this data. The data sample is something like this:

114250 3 1
124400 7 4
111304 1 1

Unfortunately I don't know the size of the resulting matrix beforehand and thus cannot initialize it.

I managed to read the data into a list of 3-tuples with this code (converting the arbitrary user ids to linear 1,2,3... representation):

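import fileinput
import sys

args = sys.argv[1:]  # assumption: args holds the positional command-line arguments

users = dict()       # maps original user id -> consecutive id (1, 2, 3, ...)
data = list()        # collected (user_id, question, response) tuples
user_counter = 0     # number of distinct users seen so far

for line in fileinput.input( args[0] ):
    tokens = line.split("\t")
    tokens = [ t.strip() for t in tokens ]  # drop trailing "\r" and "\n"
    user = tokens[0]
    question = tokens[1]
    response = tokens[2]

    if user in users:
        user_id = users[user]           # existing user
    else:
        user_counter = user_counter + 1 # add new user
        users[user] = user_counter
        user_id = user_counter

    data.append( (int(user_id), int(question), int(response)) )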

I am not sure how to convert this list of tuples to a 2D numpy array. I would love to know how to do this in a Pythonic way.

There should be some method that reads every tuple, uses user_id and question as the row and column indices, and puts the response value at that position in the 2D numpy array. For example a tuple like

(10,3,1)

means that I would like to put the value 1 into a 2D matrix row 10, column 3.

  • You have 3 values to store; how do you want the numpy array to be structured? Commented Aug 30, 2015 at 7:51
  • @ZdaR I would like to use the first two values from each tuple as indices (column and row) and put the third value from the tuple at the indexed location. The values in the tuple are integers. Commented Aug 30, 2015 at 7:52

2 Answers

Simply generate the matrix afterwards:

import numpy as np

data = np.array(data)
result = np.zeros(shape=(data[:, 0].max() + 1, data[:, 1].max() + 1), dtype=int)
result[data[:, 0], data[:, 1]] = data[:, 2]
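
To make the indexing step concrete, here is a small self-contained sketch of the same approach. The three tuples are made-up sample values standing in for the `data` list from the question, and the names `rows`, `cols` and `vals` are only for illustration.

```python
import numpy as np

# Hypothetical (user_id, question, response) tuples, as built in the question
data = [(1, 3, 1), (2, 7, 4), (3, 1, 1)]

arr = np.array(data)
rows, cols, vals = arr[:, 0], arr[:, 1], arr[:, 2]

# Size the matrix from the largest row and column index (+1 because indexing is 0-based)
result = np.zeros((rows.max() + 1, cols.max() + 1), dtype=int)

# Integer-array indexing assigns every value in a single step
result[rows, cols] = vals

print(result.shape)
print(result[2, 7])
```

Because the matrix is sized from the largest index, this is most economical once the user ids have been remapped to small consecutive integers, as the question already does.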

import numpy

data = []
with open('filename', 'r') as f:
    for line in f:
        # one row per line: user, question, value
        data.append([int(t) for t in line.split()])

# largest row index and largest column index that occur in the data
r = max(row[0] for row in data)
c = max(row[1] for row in data)
A = numpy.zeros(shape=(r + 1, c + 1))
for i, j, val in data:
    A[i, j] = val

I haven't tried this, but it should work. Note that indexing starts from 0.
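
If the raw user ids are used directly as row indices, the sample data would need a matrix with more than 100,000 rows. As one possible alternative to the dictionary-based remapping in the question, the sketch below uses `numpy.unique` with `return_inverse=True` to map ids to 0-based indices. The fourth row of `raw` is invented for illustration, and unlike the question's code this numbers users in sorted order rather than in order of first appearance.

```python
import numpy as np

# First three rows are the question's sample; the fourth is a made-up repeat user
raw = np.array([
    [114250, 3, 1],
    [124400, 7, 4],
    [111304, 1, 1],
    [114250, 7, 2],
])

# unique() returns the sorted distinct ids plus, for each row, the index of its id
user_ids, user_idx = np.unique(raw[:, 0], return_inverse=True)
question_ids, question_idx = np.unique(raw[:, 1], return_inverse=True)

# One row per distinct user, one column per distinct question
matrix = np.zeros((len(user_ids), len(question_ids)), dtype=int)
matrix[user_idx, question_idx] = raw[:, 2]

print(matrix)
```

`user_ids[i]` and `question_ids[j]` then give the original ids for row `i` and column `j`, so the compact indices can be translated back when needed.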
