0

I have a file like this:

2 qid:1 1:0.32 2:0.50 3:0.78 4:0.02 10:0.90
5 qid:2 2:0.22 5:0.34 6:0.87 10:0.56 12:0.32 19:0.24 20:0.55
...

The structure I want is as follows:

output = []
rel = 2
qid = 1
features = {}  # the feature list "1:0.32 2:0.50 3:0.78 4:0.02 10:0.90"
output.append([rel, qid, features])
...

How can I write Python code to load the data? Thanks.

1
  • 4
    It would be helpful if you describe the desired output data structure. Commented Mar 7, 2010 at 9:42

3 Answers

1

For reading use something like this (data is in file 'fname'):

with open(fname) as f:
    for line in f:
        # split() with no argument also drops the trailing newline
        elements = line.split()
        if not elements:
            continue  # skip blank lines
        rel = int(elements[0])
        qid = int(elements[1].split(':')[1])
        featurelist = elements[2:]
        # get the various features again with splitting at ':'
        # you get the idea ...
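
The remaining step, splitting each feature at ':', could be filled in roughly as follows. This is only a sketch: the helper names `parse_line` and `load_data` are made up for illustration, and it assumes every feature has the form `index:value` with an integer index and a float value, as in the sample lines from the question.

```python
def parse_line(line):
    """Parse one 'rel qid:N idx:val ...' line into [rel, qid, features]."""
    elements = line.split()
    rel = int(elements[0])
    qid = int(elements[1].split(':')[1])
    features = {}
    for item in elements[2:]:
        # Assumption: each item looks like '10:0.90'
        index, value = item.split(':')
        features[int(index)] = float(value)
    return [rel, qid, features]


def load_data(lines):
    """Parse an iterable of lines (e.g. an open file), skipping blank ones."""
    return [parse_line(line) for line in lines if line.strip()]


# The two sample lines from the question
sample = [
    "2 qid:1 1:0.32 2:0.50 3:0.78 4:0.02 10:0.90\n",
    "5 qid:2 2:0.22 5:0.34 6:0.87 10:0.56 12:0.32 19:0.24 20:0.55\n",
]
output = load_data(sample)
```

Since `load_data` accepts any iterable of lines, an open file object can be passed to it directly in place of `sample`.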

0

The following should work nicely and leaves your data in a handy format:

import numpy as np

regexp = r"(\d+)\s+qid:(\d+)\s+(.+)"
data = np.fromregex(file_name, regexp,
                    dtype=[('rel', int), ('qid', int), ('features', object)])

From here you can select rel, qid or features by calling:

>>> data['rel']
array([2, 5])
>>> data['qid']
array([1, 2])
>>> data['features']
array(['1:0.32 2:0.50 3:0.78 4:0.02 10:0.90',
       '2:0.22 5:0.34 6:0.87 10:0.56 12:0.32 19:0.24 20:0.55'], dtype=object)


0

It looks like your input files are in svmlight format. If so, scikit-learn includes a parser for it, `sklearn.datasets.load_svmlight_file` (its `query_id` option returns the `qid` values as well), which might be handy to use. See the source at:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/svmlight_format.py#L32

