
I'm very new to Python. I've searched extensively for a solution to my problem, but I'm hitting dead ends left and right.

I've produced a series of arrays using the following code:

fh = open(short_seq, 'r')
line_counter = 0
array = [0.0 for x in range(101)]
for line in fh:
    line_counter += 1
    pos = 0  # integer index, reset to the first column for each line
    for i in line:
        score = ord(i) - 33.0
        array[pos] += score
        pos += 1

After printing inside the loop I get a large series of arrays.

[1,2,3,4.....]
[2,3,4,5,6.....]
[3,4,5,6,7,8.....100]
...

I'd like to use NumPy to run stats on each column, keeping the alignment in which the arrays are printed, but once I'm outside the loop I can only access the sum over the entire loop. I tried np.concatenate, but that still left me with the sum of the arrays. If I use NumPy inside the loop, I can only run stats on one iteration at a time rather than on the whole series. My next idea was to add each iteration to a two-dimensional matrix, but I couldn't figure out how to keep the alignment.

Any help would be greatly appreciated.

EDIT: Here is a sample of my data (in a text editor, each of the four strings sits directly beneath the previous one). I'm trying to convert a few thousand lines of ASCII to numerical values. Each line has to go into an array 100 characters long, and then I need to run stats on each column.

CCCFFFFFHHHHHIJJJJJJIJJJJJJJJIJJJIJJJJJJJIJJIJJGIIIHIIIFGIGFHFGIIIHIHHGEHHFDFFFFFDDDDDBDDDDDDDDEDEEDD
CCCFFFFFHHHHHJJJJJJJJJJIIIJJIGJJJJJJJJJJIJJJJJIJJJJJJIJIJJIJJIJJIJJHGHHHHFFCEFFFEEDAEEEFEEDDDB:ADDDD:
CCCFFFFFHHHHHJIJJJIJJJIJJIJJIIJIIJJJJJJJJJJJJJIIJJJJJJJJJGHHHHFFFFFFEEEEEEEDDDDDEDDDDDDDDDDDDDDDDD>9<
BCCFFFDFHHHHHJJJJJJJJJJJIIJJJI@HGIIIJJJJJIJJIJIIJJJJJJJJJHHHHHHFFFDDDDDDDDDDDDDDDD?BDDDD@CDDDDDBDDDDD

  • Try numpy.sum(array, axis=0). Commented Jul 3, 2016 at 23:10

1 Answer

array = [0.0 for x in range(101)]

is a Python list. array = np.zeros((101,), float) is a NumPy array of the same size.
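
To illustrate why the distinction matters, here is a minimal sketch. The names as_list and as_array and the value 1.5 are made up for the illustration; the point is only that the array accepts elementwise arithmetic directly while the list needs a loop or comprehension.

```python
import numpy as np

as_list = [0.0 for x in range(101)]      # plain Python list
as_array = np.zeros((101,), float)       # NumPy array of the same length

# Elementwise addition works directly on the array ...
as_array += 1.5
# ... but a list needs an explicit loop or comprehension.
as_list = [v + 1.5 for v in as_list]

print(type(as_list).__name__, type(as_array).__name__)
print(as_array.shape, as_array.dtype)
```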

With for line in fh: you get a line, a string. I expect for i in line: to iterate over the characters in that string. Is that really what you want?

for i in line:
    score = ord(i) - 33.0
    array[pos] += score
    pos += 1

Usually when people read a text file they want the values of columns separated by spaces or commas, e.g.

 123, 345, 344, 233
 343, 342, 343, 343

We use line.split(',') to split such a string into substrings, and float or int to turn those into numbers, e.g.

 data = [float(substring) for substring in line.split(',')]

Show us some of your data file, or a simplified version. It will be easier to help. A key question is whether the number of 'columns' is consistent across lines.

Often when we iterate over the lines of a file, we collect the values from each line in a list. If the number of elements in the sublists is consistent, we can turn the list of lists into a 2d array.

 lines = []
 for line in fh:
     data = [float(i) for i in line.split(',')]
     lines.append(data)
 print(lines)
 # A = np.array(lines) 
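
As a hedged sketch of that last step, one way to confirm the sublists really are consistent before converting is to compare their lengths. The two rows here are hard-coded stand-ins for whatever the loop over fh collected; the explicit ValueError is my own choice for handling ragged input, not something the file format requires.

```python
import numpy as np

# Stand-in for the rows collected in the loop (values are illustrative).
lines = [[123.0, 345.0, 344.0, 233.0],
         [343.0, 342.0, 343.0, 343.0]]

# A regular 2d array needs every row to have the same length.
lengths = {len(row) for row in lines}
if len(lengths) != 1:
    raise ValueError(f"ragged rows, lengths found: {sorted(lengths)}")

A = np.array(lines)
print(A.shape)
```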

===============================

With your sample lines I can do:

In [258]: with open('stack38175089.txt') as f:
   .....:     lines = f.readlines()
   .....:

In [259]: [len(l) for l in lines]
Out[259]: [102, 102, 102, 102]

In [260]: data=np.array([[ord(i) for i in l.strip()] for l in lines])

In [261]: data.shape
Out[261]: (4, 101)

In [262]: data
Out[262]: 
array([[67, 67, 67, 70, 70, 70, 70, 70, 72, 72, 72, 72, 72, 73, 74, 74, 74,
        74, 74, 74, 73, 74, 74, 74, 74, 74, 74, 74, 74, 73, 74, 74, 74, 73,
        74, 74, 74, 74, 74, 74, 74, 73, 74, 74, 73, 74, 74, 71, 73, 73, 73,
        72, 73, 73, 73, 70, 71, 73, 71, 70, 72, 70, 71, 73, 73, 73, 72, 73,
        72, 72, 71, 69, 72, 72, 70, 68, 70, 70, 70, 70, 70, 68, 68, 68, 68,
        68, 66, 68, 68, 68, 68, 68, 68, 68, 68, 69, 68, 69, 69, 68, 68],
       ...
       [66, 67, 67, 70, 70, 70, 68, 70, 72, 72, 72, 72, 72, 74, 74, 74, 74,
        74, 74, 74, 74, 74, 74, 74, 73, 73, 74, 74, 74, 73, 64, 72, 71, 73,
        73, 73, 74, 74, 74, 74, 74, 73, 74, 74, 73, 74, 73, 73, 74, 74, 74,
        74, 74, 74, 74, 74, 74, 72, 72, 72, 72, 72, 72, 70, 70, 70, 68, 68,
        68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 63, 66, 68,
        68, 68, 68, 64, 67, 68, 68, 68, 68, 68, 66, 68, 68, 68, 68, 68]])

With a 2d array like this I can easily shift the values (-33), and apply statistical calculations over rows or columns.
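
A minimal sketch of what that can look like, assuming the rows are shortened to the first five characters of each sample line so the output stays readable. The names rows, scores, and the col_/row_ variables are mine; axis=0 runs the calculation down each column and axis=1 along each row.

```python
import numpy as np

# First five characters of each of the four sample lines.
rows = ["CCCFF", "CCCFF", "CCCFF", "BCCFF"]
data = np.array([[ord(c) for c in r] for r in rows])

scores = data - 33                 # shift every value in one step

col_sum = scores.sum(axis=0)       # one value per column
col_mean = scores.mean(axis=0)
col_std = scores.std(axis=0)
row_mean = scores.mean(axis=1)     # one value per line

print(col_sum)
print(col_mean)
```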

I could have read the lines individually and collected the values in a list of lists. But this sample, and I suspect your whole file, is small enough to use readlines.
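
For a file too large for readlines, the line-by-line alternative might look roughly like this. io.StringIO stands in for the real file handle so the sketch is self-contained, and the three short lines are made up. Stripping each line before converting keeps the trailing newline from turning into an extra column.

```python
import io
import numpy as np

# io.StringIO is a stand-in for the open file; the content is illustrative.
fh = io.StringIO("CCCFF\nCCCFF\nBCCFF\n")

rows = []
for line in fh:
    text = line.strip()            # drop the trailing newline
    if not text:
        continue                   # skip blank lines
    rows.append([ord(c) - 33 for c in text])

data = np.array(rows)              # works because all rows have equal length
print(data.shape)
```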

2 Comments

Thank you for the response. The raw data (ASCII characters) is consistent across the lines in the file. However, when I convert the characters and start populating the array in a loop, the result is skewed, but only at the beginning.
I loaded your sample into a 2d array.