0

I have data in a file in this format:

+1 1:4 2:11 3:3 4:11 5:1 6:13 7:4 8:2 9:2 10:13
-1 1:2 2:7 3:4 4:12 5:3 6:4 7:3 8:12 9:2 10:12
+1 1:4 2:6 3:3 4:2 5:3 6:5 7:4 8:2 9:3 10:6

and so on....

where the number to the left of each colon is an 'index' and the number to the right is an integer that describes a certain attribute. For each pair on a line, if the same index has the same value on another line, I want to store the total number of +1's and -1's in two separate variables. This is my code so far:

for i in lines:
   for word in i:
        if word.find(':')!=-1:
            att = word.split(':', 1)[-1]
            idx = word.split(':', 1)[0]
            for j in lines:
                clas = j.split(' ', 1)[0]
                if word.find(':')!=-1:
                        if idx ==word.split(':', 1)[0]:
                            if att ==word.split(':', 1)[0]:
                                if clas>0:
                                    ifattandyes = ifattandyes+1
                                else:
                                    ifattandno = ifattandno+1

My problem is that att and idx do not seem to update; I think word.find(':') just finds the first instance of a colon and runs with it. Can anyone help?

EDIT:

The above explanation has been confusing. I'm a bit stubborn about how the count of 1s and -1s is acquired. As each pair on each line is read, I want to search through the data for the number of +1s and -1s that the pair is involved in and store them into 2 separate variables. The reason for doing so is to calculate probabilities of each pair leading to a +1 or -1.

11
  • line 'if att ==word.split(':', 1)[0]:' should read 'if att ==word.split(':', 1)[-1]:' Commented Dec 10, 2013 at 20:48
  • Is the 'index' guaranteed to be 1 2 3 4 5 .. 10 in that order on each line? Commented Dec 10, 2013 at 20:52
  • @damienfracois yes, but different files will have different indices. But the indices will be the same for each line in a file. Commented Dec 10, 2013 at 20:55
  • What is the +1 at the beginning of the line? Commented Dec 10, 2013 at 20:59
  • Erm -- 2 is the number after '1:4', and 2 is the number after '3:3' in what I wrote above. Am I missing something? Commented Dec 10, 2013 at 21:20

4 Answers

3

Here is a suggestion (provided I understand the question correctly) :

#!/usr/bin/env python
from collections import defaultdict

positives=defaultdict(int)
negatives=defaultdict(int)

for line in open('data'):
    theclass = line[0:2] == '+1'
    for pair in line[2:].split():
        positives[pair]+=theclass
        negatives[pair]+=not theclass

for key in positives.keys():
    print key, "\t+1:",  positives[key], "\t-1:", negatives[key]

Applied to the following data:

$ cat data
+1 1:4 2:11 3:3 4:11 5:1 6:13 7:4 8:2 9:2 10:13
-1 1:2 2:7 3:4 4:12 5:3 6:4 7:3 8:12 9:2 10:12
+1 1:4 2:6 3:3 4:2 5:3 6:5 7:4 8:2 9:3 10:6

it gives:

$ python t.py 
9:2     +1: 1   -1: 1
9:3     +1: 1   -1: 0
8:2     +1: 2   -1: 0
10:6    +1: 1   -1: 0
6:13    +1: 1   -1: 0
10:13   +1: 1   -1: 0
10:12   +1: 0   -1: 1
2:7     +1: 0   -1: 1
2:6     +1: 1   -1: 0
4:11    +1: 1   -1: 0
4:12    +1: 0   -1: 1
4:2     +1: 1   -1: 0
1:2     +1: 0   -1: 1
1:4     +1: 2   -1: 0
3:3     +1: 2   -1: 0
5:1     +1: 1   -1: 0
3:4     +1: 0   -1: 1
5:3     +1: 1   -1: 1
8:12    +1: 0   -1: 1
7:4     +1: 2   -1: 0
7:3     +1: 0   -1: 1
2:11    +1: 1   -1: 0
6:5     +1: 1   -1: 0
6:4     +1: 0   -1: 1

4 Comments

Is there a way to obtain the +1s and -1s of each pair as you read them on each line? I understand it might not be efficient but as long as the code runs in under 3 mins it is alright.
Would adding a print positives[pair], negatives[pair] in the loop do what you are asking?
Anyway, @DSM's answer is more generic and uses less memory.
I don't want to print it; I want to store only the number of +1's and -1's of each pair as I read them on the line.
1

I'm not sure if I've got this or not.

tot_up = {}; tot_dn = {}
for line in input_file:
    parts = line.split()   # split on whitespace
    up_or_down = parts[0]
    parts = parts[1:]
    if up_or_down == '-1':
        store = tot_dn
    else:
        store = tot_up
    for part in parts:
        store[part] = store.get(part, 0) + 1
print "Total +1s: ", sum(tot_up.values())
print "Total -1s: ", sum(tot_dn.values())

What this does not do, but could be done easily enough, is strip out the att:val pairs where no match was found.
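One way that filtering step might look, as a rough sketch rather than a definitive version: it assumes "no match was found" means the pair occurs on only one line in the whole file. It is written for Python 3, and the helper names count_pairs and drop_unmatched are made up for illustration.

```python
from collections import Counter

def count_pairs(lines):
    """Return (tot_up, tot_dn): per-pair counts for '+1' and '-1' lines."""
    tot_up, tot_dn = Counter(), Counter()
    for line in lines:
        parts = line.split()              # split on whitespace
        if not parts:
            continue                      # skip blank lines
        store = tot_dn if parts[0] == '-1' else tot_up
        store.update(parts[1:])           # add 1 for every pair on this line
    return tot_up, tot_dn

def drop_unmatched(tot_up, tot_dn):
    """Keep only pairs seen on at least two lines (assumed meaning of 'matched')."""
    all_pairs = set(tot_up) | set(tot_dn)
    matched = {p for p in all_pairs if tot_up[p] + tot_dn[p] >= 2}
    return ({p: tot_up[p] for p in matched if tot_up[p]},
            {p: tot_dn[p] for p in matched if tot_dn[p]})

# The three sample lines from the question.
sample = [
    "+1 1:4 2:11 3:3 4:11 5:1 6:13 7:4 8:2 9:2 10:13",
    "-1 1:2 2:7 3:4 4:12 5:3 6:4 7:3 8:12 9:2 10:12",
    "+1 1:4 2:6 3:3 4:2 5:3 6:5 7:4 8:2 9:3 10:6",
]
up, dn = drop_unmatched(*count_pairs(sample))
```

On the sample data this leaves six pairs in up (1:4, 3:3, 5:3, 7:4, 8:2, 9:2) and two in dn (5:3, 9:2).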

But I'm not sure I've understood your requirements properly.

0

I'll make this community wiki because it's too close (in spirit, anyway) to an answer already posted, but it has a few advantages:

from collections import Counter
with open("datafile.dat") as fp:
    counts = {}
    for line in fp:
        parts = line.split()
        sign, keys = parts[0], parts[1:]
        counts.setdefault(sign, Counter()).update(keys)

all_keys = set().union(*counts.values())
for key in sorted(all_keys):
    print '{:8}'.format(key), 
    print ' '.join('{}: {}'.format(c, counts[c].get(key, 0)) for c in counts)

which produces

10:12    +1: 0 -1: 1
10:13    +1: 1 -1: 0
10:6     +1: 1 -1: 0
1:2      +1: 0 -1: 1
1:4      +1: 2 -1: 0
[etc.]

Note that nowhere is there any reference to +1 or -1; sign can really be anything.
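Since the question mentions probabilities, here is a hedged sketch of how the per-sign counts could be turned into them after the single pass. It assumes the wanted quantity is the fraction of lines containing a pair that carry each sign, which the question does not spell out. It targets Python 3 (true division), and count_by_sign and sign_probabilities are hypothetical helper names.

```python
from collections import Counter

def count_by_sign(lines):
    """Map each sign label to a Counter of the pairs seen on lines with that sign."""
    counts = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue                      # skip blank lines
        sign, keys = parts[0], parts[1:]
        counts.setdefault(sign, Counter()).update(keys)
    return counts

def sign_probabilities(counts, key):
    """Return {sign: fraction of the lines containing `key` that have that sign}."""
    total = sum(c[key] for c in counts.values())
    if total == 0:
        return {}                         # pair never seen in the data
    return {sign: c[key] / total for sign, c in counts.items()}

# The three sample lines from the question.
sample = [
    "+1 1:4 2:11 3:3 4:11 5:1 6:13 7:4 8:2 9:2 10:13",
    "-1 1:2 2:7 3:4 4:12 5:3 6:4 7:3 8:12 9:2 10:12",
    "+1 1:4 2:6 3:3 4:2 5:3 6:5 7:4 8:2 9:3 10:6",
]
counts = count_by_sign(sample)
p_shared = sign_probabilities(counts, '9:2')   # pair seen once with each sign
p_plus = sign_probabilities(counts, '1:4')     # pair seen only on +1 lines
```

Each lookup is a dictionary access, so a probability can be computed for any pair on demand without re-reading the file.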

5 Comments

This could be useful, but I would rather the number of +1s and -1s of each pair is stored as a variable as the data is read line by line.....
I'm not sure what you mean. Are you referring to the cumulative counts?
Yes. Meaning when 10:12 is read, it reads through all the data and stores the number of occurrences of 10:12 and -1 that are on one line in one variable and number of occurrences of 10:12 and +1 that are on one line in another variable. I can then use these variables in a probability calculation.
I'm afraid I still don't follow: although you said "yes", what you're referring to isn't a cumulative count. It sounds like you want to read through the entire data file every time you see 10:12 for some reason, rather than doing it all in one pass. That will be very slow, because instead of reading through the file once, you'll have to read through it N times, where N is the number of "10:12"-like objects. Hopefully someone else will be able to help you!
Sorry, I may not have understood what you meant by cumulative count. I completely understand that this process will have a high runtime, but if the runtime is less than 3 minutes then it is fine. There are 1024 lines. The reason I consider one pass worse is that I would then have to store the count of every single pair in a variable to be able to use them in a probability calculation. If it is done N times, the calculations can be made while reusing only 2 variables. Does that make sense?
0

Your first error is in the second line:

for word in i:

this iterates over each character.

You meant to use:

for word in i.split():
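Putting that fix together with the comment under the question (compare the attribute against the part after the colon), the nested loop might look like the sketch below. Two things in it are my own assumptions and not from the original code: the inner scan is moved into a hypothetical helper count_for_pair, and clas is converted with int(), because it is a string and clas>0 does not test its sign (Python 3 raises a TypeError for that comparison). The sketch is written for Python 3.

```python
def count_for_pair(lines, pair):
    """Scan every line and count the +1 and -1 lines containing this 'idx:att' pair."""
    ifattandyes = ifattandno = 0
    for j in lines:
        fields = j.split()
        if not fields:
            continue                      # skip blank lines
        clas = int(fields[0])             # int('+1') == 1, int('-1') == -1
        if pair in fields[1:]:            # same idx and same att on this line
            if clas > 0:
                ifattandyes += 1
            else:
                ifattandno += 1
    return ifattandyes, ifattandno

# The three sample lines from the question.
lines = [
    "+1 1:4 2:11 3:3 4:11 5:1 6:13 7:4 8:2 9:2 10:13",
    "-1 1:2 2:7 3:4 4:12 5:3 6:4 7:3 8:12 9:2 10:12",
    "+1 1:4 2:6 3:3 4:2 5:3 6:5 7:4 8:2 9:3 10:6",
]

results = []
for i in lines:
    for word in i.split():                # split into fields, not characters
        if ':' in word:
            yes, no = count_for_pair(lines, word)
            results.append((word, yes, no))   # use yes/no here, e.g. for a probability
```

This keeps the re-scan-per-pair approach from the question, so only two counters are live at a time, at the cost of one full pass over the data for every pair read.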
