Python data extraction and search

Question

I have data in a file in this format:

+1 1:4 2:11 3:3 4:11 5:1 6:13 7:4 8:2 9:2 10:13
-1 1:2 2:7 3:4 4:12 5:3 6:4 7:3 8:12 9:2 10:12
+1 1:4 2:6 3:3 4:2 5:3 6:5 7:4 8:2 9:3 10:6

and so on....

where the number to the left of each colon is an 'index' and the number to the right is an integer that describes a certain attribute. For each line, if the number on the right of the colon is the same for the same index on another line, I want to store the total number of +1's and -1's in two separate variables. This is my code so far:

for i in lines:
   for word in i:
        if word.find(':')!=-1:
            att = word.split(':', 1)[-1]
            idx = word.split(':', 1)[0]
            for j in lines:
                clas = j.split(' ', 1)[0]
                if word.find(':')!=-1:
                        if idx ==word.split(':', 1)[0]:
                            if att ==word.split(':', 1)[0]:
                                if clas>0:
                                    ifattandyes = ifattandyes+1
                                else:
                                    ifattandno = ifattandno+1

My problem is that att and idx do not seem to update; I think word.find(':') just finds the first instance of a colon and runs with it. Can anyone help?

EDIT:

The above explanation has been confusing. I'm a bit stubborn about how the count of 1s and -1s is acquired. As each pair on each line is read, I want to search through the data for the number of +1s and -1s that the pair is involved in and store them into 2 separate variables. The reason for doing so is to calculate probabilities of each pair leading to a +1 or -1.

line 'if att ==word.split(':', 1)[0]:' should read 'if att ==word.split(':', 1)[-1]:'
– user2951046, Commented Dec 10, 2013 at 20:48
Is the 'index' guaranteed to be 1 2 3 4 5 .. 10 in that order on each line?
– damienfrancois, Commented Dec 10, 2013 at 20:52
@damienfrancois yes, but different files will have different indices. But the indices will be the same for each line in a file.
– user2951046, Commented Dec 10, 2013 at 20:55
Erm -- 2 is the number after '1:4', and 2 is the number after '3:3' in what I wrote above. Am I missing something?
– DSM, Commented Dec 10, 2013 at 21:20

damienfrancois · Accepted Answer · 2013-12-10 21:38:54Z

3

Here is a suggestion (provided I understand the question correctly) :

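#!/usr/bin/env python
# Python 2 (uses the print statement)
from collections import defaultdict

# Per-pair counts of lines labelled +1 and -1
positives = defaultdict(int)
negatives = defaultdict(int)

with open('data') as f:
    for line in f:
        theclass = line[0:2] == '+1'  # True for a +1 line, False otherwise
        for pair in line[2:].split():
            # a bool counts as 0 or 1 when added to an int
            positives[pair] += theclass
            negatives[pair] += not theclass

for key in positives.keys():
    print key, "\t+1:", positives[key], "\t-1:", negatives[key]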

Applied to the following data:

$ cat data
+1 1:4 2:11 3:3 4:11 5:1 6:13 7:4 8:2 9:2 10:13
-1 1:2 2:7 3:4 4:12 5:3 6:4 7:3 8:12 9:2 10:12
+1 1:4 2:6 3:3 4:2 5:3 6:5 7:4 8:2 9:3 10:6

it gives:

$ python t.py 
9:2     +1: 1   -1: 1
9:3     +1: 1   -1: 0
8:2     +1: 2   -1: 0
10:6    +1: 1   -1: 0
6:13    +1: 1   -1: 0
10:13   +1: 1   -1: 0
10:12   +1: 0   -1: 1
2:7     +1: 0   -1: 1
2:6     +1: 1   -1: 0
4:11    +1: 1   -1: 0
4:12    +1: 0   -1: 1
4:2     +1: 1   -1: 0
1:2     +1: 0   -1: 1
1:4     +1: 2   -1: 0
3:3     +1: 2   -1: 0
5:1     +1: 1   -1: 0
3:4     +1: 0   -1: 1
5:3     +1: 1   -1: 1
8:12    +1: 0   -1: 1
7:4     +1: 2   -1: 0
7:3     +1: 0   -1: 1
2:11    +1: 1   -1: 0
6:5     +1: 1   -1: 0
6:4     +1: 0   -1: 1
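
For readers on Python 3, here is a minimal sketch of the same single-pass counting, wrapped in a function so that the two counts for any one pair can be pulled out into two plain variables afterwards. The function name count_pairs and the variable names n_pos and n_neg are made up for illustration; they are not part of the answer above.

```python
from collections import defaultdict

def count_pairs(lines):
    """Return (positives, negatives): per-pair counts on +1 and -1 lines."""
    positives = defaultdict(int)
    negatives = defaultdict(int)
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        label, pairs = parts[0], parts[1:]
        # Any label other than '+1' is treated as negative (an assumption)
        target = positives if label == '+1' else negatives
        for pair in pairs:
            target[pair] += 1
    return positives, negatives

sample = [
    "+1 1:4 2:11 3:3 4:11 5:1 6:13 7:4 8:2 9:2 10:13",
    "-1 1:2 2:7 3:4 4:12 5:3 6:4 7:3 8:12 9:2 10:12",
    "+1 1:4 2:6 3:3 4:2 5:3 6:5 7:4 8:2 9:3 10:6",
]
positives, negatives = count_pairs(sample)

# Two plain variables holding the counts for a single pair
n_pos, n_neg = positives["9:2"], negatives["9:2"]
print(n_pos, n_neg)  # 1 1
```

Because the dictionaries are built in one pass, looking up a pair afterwards is a constant-time operation rather than another scan of the file.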

edited Dec 10, 2013 at 21:38

answered Dec 10, 2013 at 21:23

damienfrancois

4 Comments

user2951046 Over a year ago

Is there a way to obtain the +1s and -1s of each pair as you read them on each line? I understand it might not be efficient but as long as the code runs in under 3 mins it is alright.

damienfrancois Over a year ago

Would adding a print positives[pair], negatives[pair] in the loop do what you are asking?

damienfrancois Over a year ago

Anyway, @DSM's answer is more generic and uses less memory.

user2951046 Over a year ago

I don't want to print it; I want to store only the number of +1's and -1's for each pair as I read them on the line.

Geoff Gerrietts · Accepted Answer · 2013-12-10 21:24:57Z

1

I'm not sure if I've got this or not.

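# Python 2; input_file is assumed to be an open file (or any iterable of lines)
tot_up = {}  # pair -> count on +1 lines
tot_dn = {}  # pair -> count on -1 lines
for line in input_file:
    parts = line.split()   # split on whitespace
    if not parts:
        continue           # skip blank lines
    up_or_down = parts[0]
    parts = parts[1:]
    if up_or_down == '-1':
        store = tot_dn
    else:
        store = tot_up
    for part in parts:
        store[part] = store.get(part, 0) + 1
print "Total +1s: ", sum(tot_up.values())
print "Total -1s: ", sum(tot_dn.values())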

What this does not do, but could be done easily enough, is strip out the att:val pairs where no match was found.

But I'm not sure I've understood your requirements properly.
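
The filtering step mentioned above might look roughly like the following Python 3 sketch. It assumes that "no match was found" means the pair occurs on only one line in the whole file; that reading is an interpretation, and the helper name matched_only is hypothetical.

```python
from collections import Counter

def matched_only(tot_up, tot_dn):
    """Drop pairs whose combined count is 1, i.e. seen on one line only."""
    total = Counter(tot_up) + Counter(tot_dn)
    keep = {pair for pair, n in total.items() if n > 1}
    up = {p: c for p, c in tot_up.items() if p in keep}
    dn = {p: c for p, c in tot_dn.items() if p in keep}
    return up, dn

# Small hand-made dictionaries in the same shape as tot_up / tot_dn
tot_up = {"1:4": 2, "9:2": 1, "2:11": 1}
tot_dn = {"9:2": 1, "2:7": 1}
up, dn = matched_only(tot_up, tot_dn)
print(up)  # {'1:4': 2, '9:2': 1}
print(dn)  # {'9:2': 1}
```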

answered Dec 10, 2013 at 21:24

Geoff Gerrietts

DSM · Accepted Answer · 2013-12-10 21:36:06Z

0

I'll make this community wiki because it's too close (in spirit, anyway) to an answer already posted, but it has a few advantages:

from collections import Counter
with open("datafile.dat") as fp:
    counts = {}
    for line in fp:
        parts = line.split()
        sign, keys = parts[0], parts[1:]
        counts.setdefault(sign, Counter()).update(keys)

all_keys = set().union(*counts.values())
for key in sorted(all_keys):
    print '{:8}'.format(key), 
    print ' '.join('{}: {}'.format(c, counts[c].get(key, 0)) for c in counts)

which produces

10:12    +1: 0 -1: 1
10:13    +1: 1 -1: 0
10:6     +1: 1 -1: 0
1:2      +1: 0 -1: 1
1:4      +1: 2 -1: 0
[etc.]

Note that nowhere is there any reference to +1 or -1; sign can really be anything.
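
Since the question mentions a probability calculation, here is a hedged Python 3 sketch of how the per-sign counters could feed one. It assumes "probability" means the fraction of lines containing a pair that carry each label; the function name label_probabilities is invented for illustration.

```python
from collections import Counter

def label_probabilities(lines):
    """Map each pair to {label: fraction of its occurrences with that label}."""
    counts = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        sign, keys = parts[0], parts[1:]
        counts.setdefault(sign, Counter()).update(keys)

    all_keys = set().union(*counts.values())
    probs = {}
    for key in all_keys:
        # A missing key in a Counter reads as 0, so no special-casing is needed
        total = sum(counts[s][key] for s in counts)
        probs[key] = {s: counts[s][key] / total for s in counts}
    return probs

sample = [
    "+1 1:4 2:11 3:3 4:11 5:1 6:13 7:4 8:2 9:2 10:13",
    "-1 1:2 2:7 3:4 4:12 5:3 6:4 7:3 8:12 9:2 10:12",
    "+1 1:4 2:6 3:3 4:2 5:3 6:5 7:4 8:2 9:3 10:6",
]
probs = label_probabilities(sample)
print(probs["9:2"])  # {'+1': 0.5, '-1': 0.5}
print(probs["1:4"])  # {'+1': 1.0, '-1': 0.0}
```

As in the code above, the labels are never hard-coded, so the same function works for any set of class labels in the first column.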

answered Dec 10, 2013 at 21:36

community wiki

DSM

5 Comments

user2951046 Over a year ago

This could be useful, but I would rather the number of +1s and -1s of each pair is stored as a variable as the data is read line by line.....

DSM Over a year ago

I'm not sure what you mean. Are you referring to the cumulative counts?

user2951046 Over a year ago

Yes. Meaning when 10:12 is read, it reads through all the data and stores the number of occurrences of 10:12 and -1 that are on one line in one variable and number of occurrences of 10:12 and +1 that are on one line in another variable. I can then use these variables in a probability calculation.

DSM Over a year ago

I'm afraid I still don't follow: although you said "yes", what you're referring to isn't a cumulative count. It sounds like you want to read through the entire data file every time you see 10:12 for some reason, rather than doing it all in one pass. That will be very slow, because instead of reading through the file once, you'll have to read through it N times, where N is the number of "10:12"-like objects. Hopefully someone else will be able to help you!

user2951046 Over a year ago

Sorry, I may not have understood what you meant by cumulative count. I completely understand that this process will have a high runtime, but if the runtime is less than 3 minutes then it is fine. There are 1024 lines. The reason one pass seems worse to me is that I would then have to store every count of every single pair in a variable to be able to use them in a probability calculation. If it is done N times, the calculations can be made while reusing only 2 variables.... Does that make sense...?

Has QUIT--Anony-Mousse · Accepted Answer · 2013-12-11 08:19:10Z

0

Your first error is in the second line:

for word in i:

this iterates over each character.

You meant to use:

for word in i.split():
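
A short Python 3 illustration of the difference, using a shortened made-up line in the same format as the question's data:

```python
line = "+1 1:4 2:11"

# Iterating over a string yields single characters
print(list(line)[:4])  # ['+', '1', ' ', '1']

# Splitting on whitespace yields the label and the index:value pairs
print(line.split())    # ['+1', '1:4', '2:11']

# Each pair can then be separated at the colon
for word in line.split()[1:]:
    idx, att = word.split(':', 1)
    print(idx, att)
```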

answered Dec 11, 2013 at 8:19

Has QUIT--Anony-Mousse
