
I have data in the form of sets and I want to convert it into a 2D numpy array. The data looks like this:

term = contains the words
document_number = contains the document numbers
tf-idf = contains the tf-idf of each word with respect to each document, in order

I want it to be in a 2D numpy array like this:

            doc1    doc2   doc3....
term1        1        5      6
term2        0        4      1
term3        6        8      10
.
.

How should I implement it?


1 Answer


Your description of the structure of tf-idf is not clear, so I have to make some assumptions about your data structure.

term_len = len(term)
doc_len = len(document_number)

So, assuming that tf-idf is a flat list (not a list of lists) that holds the values of the first term for all the documents, then those of the second term, and so on:

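import numpy
# tf-idf values are fractional, so the array is float rather than int
term_freq = numpy.zeros((term_len, doc_len), dtype=float)
for (i, freq) in enumerate(tf_idf):
    # row = term index (advances every doc_len values), column = document index
    term_freq[i // doc_len, i % doc_len] = freq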

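If the opposite is true (all the terms of the first document come first, then those of the second document, and so on), swap the roles of the two operations: the row index becomes i % term_len and the column index becomes i // term_len.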
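
The same mapping can also be done without an explicit loop. The following is a minimal sketch, assuming the flat, term-by-term layout described above; the sample values and the name tf_idf are made up for illustration and may not match your real data.

```python
import numpy

# Hypothetical sample data: 3 terms, 2 documents, flat and ordered term by term.
term = ["term1", "term2", "term3"]
document_number = [1, 2]
tf_idf = [0.0, 0.5, 0.25, 0.0, 0.75, 1.0]

term_len = len(term)
doc_len = len(document_number)

# reshape fills row by row, so each row holds one term and each column one document.
term_freq = numpy.array(tf_idf, dtype=float).reshape(term_len, doc_len)

# If the list were ordered document by document instead, reshape to
# (doc_len, term_len) and transpose to get the terms back on the rows.
term_freq_doc_major = numpy.array(tf_idf, dtype=float).reshape(doc_len, term_len).T

print(term_freq)
```

reshape raises a ValueError if the list does not contain exactly term_len * doc_len values, which is a useful early check on the assumed layout.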


5 Comments

Your assumption is right, but I don't understand what the modulo and division operations are used for. I am actually new to Python. Are they for the 2D array titles?
A 2D array is the same as a matrix. So you have N rows and M columns. Its dimensions are N x M. You have a list that contains N * M elements. enumerate creates a running index from 0 to N * M - 1. We want to map that index to a column and row index. So the modulo lets you cycle through the index quickly whereas the division steps more slowly.
Thank you for the explanation, but I am getting this error: ValueError: invalid literal for float(): 0.0,0.1524,0.0,0.45678
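
To make the index mapping concrete, here is a small sketch with hypothetical sizes (N = 2 rows, M = 3 columns) that records the row and column each running index lands on.

```python
# Hypothetical sizes chosen only for illustration.
n_rows, n_cols = 2, 3

positions = []
for i in range(n_rows * n_cols):
    row = i // n_cols  # changes only after a full row of n_cols items
    col = i % n_cols   # cycles 0, 1, 2, 0, 1, 2
    positions.append((row, col))

print(positions)
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```
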
Is that a string that contains all those numbers? It's really hard to help without knowing your data exactly. If it is a string you can split it on the comma and then transform each to a float: [float(num) for num in data_string.split(",")].
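
As a sketch of that suggestion, assuming each term's values arrive as one comma-separated string (the sample strings below are invented, not your real data), the parsed rows can be stacked directly into a 2D array:

```python
import numpy

# Hypothetical input: one comma-separated string of tf-idf values per term.
rows_as_text = ["0.0,0.1524,0.0,0.45678", "0.5,0.0,0.25,0.0"]

# Split each string on commas and convert every piece to a float.
parsed = [[float(num) for num in line.split(",")] for line in rows_as_text]
matrix = numpy.array(parsed)  # shape: (number of terms, number of documents)

print(matrix.shape)
```
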
I want the output as [['TERM' 'TF-IDF1 ' 'TF-IDF2 ' ..., 'TF-IDF11 ' '' ''] ['acquire' '0.0' '0.0' ..., '0.027882503172' '' ''] ['act' '0.0' '0.0' ..., '0.0' '' ''] ..., ['year' '0.0' '0.0' ..., '0.0' '' ''] ['yet' '0.0' '0.0' ..., '0.0' '' ''] ['york' '0.0757230086146' '0.0' ..., '0.0' '' '']]. You can see what my data looks like in this output.
