
Is there an efficient way to store each column of a tab-delimited file in a separate dictionary using Python?

A sample input file (the real input file contains thousands of lines and hundreds of columns; the number of columns is not fixed and changes frequently):

A B C
1 4 7
2 5 8
3 6 9

I need to print values in column A:

for cell in mydict["A"]:
    print(cell)

and to print values in the same row:

for i in range(1, numrows):
    for key in keysOfMydict:
        print(mydict[key][i])
  • Why don't you just store the rows and use a dictionary to map column names to their index? Commented Aug 26, 2014 at 4:58
  • If the number of columns is not fixed, what would you expect to print in a row where the column is missing? Commented Aug 26, 2014 at 5:05
  • Depending on what else you're doing with your data, you might find the pandas library interesting: pandas.pydata.org/pandas-docs/stable/10min.html#getting Commented Aug 26, 2014 at 5:06
  • @GWW, the main computation is on columns. Retrieving a whole row may be inefficient, since only one cell in the row will be used and the others will not. Commented Aug 26, 2014 at 5:19
  • @alfasin, the number of cells in each row is the same. I meant that I do not want solutions that hard-code the column count or column names, because such code is hard to maintain when the number of columns changes frequently. Commented Aug 26, 2014 at 5:23
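
For reference, the approach suggested in the first comment (keep the rows and map column names to their index) might look roughly like the sketch below. It assumes the first line is the header; the in-memory sample stands in for the real file, and `col_index` is a name chosen here for illustration.

```python
import csv
import io

# Stand-in for the real tab-delimited file (assumption: first line is the header).
sample = io.StringIO("A\tB\tC\n1\t4\t7\n2\t5\t8\n3\t6\t9\n")

reader = csv.reader(sample, delimiter="\t")
header = next(reader)

# Map each column name to its position, e.g. {"A": 0, "B": 1, "C": 2}.
col_index = {name: i for i, name in enumerate(header)}

# Keep the data rows as plain lists of strings.
rows = list(reader)

# Column access goes through the name-to-index mapping.
column_a = [row[col_index["A"]] for row in rows]

# Row access is a direct list lookup.
first_row = rows[0]
```

Nothing here depends on the number or names of the columns, since both come from the header line.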

2 Answers


The simplest way is to use DictReader from the csv module:

import csv

with open('somefile.txt', 'r') as f:
    reader = csv.DictReader(f, delimiter='\t')
    rows = list(reader)  # If your file is not large, you can
                         # consume it entirely

    # If your file is large, you might want to
    # step over each row instead:
    # for row in reader:
    #     print(row['A'])

for row in rows:
    print(row['A'])

@Marius made a good point: you might be looking to collect the values of each column separately, keyed by its header.

If that's the case, you'll have to adjust your reading logic a bit:

from collections import defaultdict

by_column = defaultdict(list)

for row in rows:
    for k, v in row.items():  # items() works on both Python 2 and 3
        by_column[k].append(v)
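
If only the column-wise dictionary is needed, the intermediate list of row dictionaries can be skipped. Here is a minimal, self-contained sketch, assuming the first line is the header and every row has the same number of cells (as stated in the question's comments). `read_columns` is a hypothetical helper name, and the in-memory sample stands in for the real file:

```python
import csv
import io


def read_columns(lines, delimiter="\t"):
    """Return {header: [values in that column]} for delimited text.

    `lines` is any iterable of text lines, such as an open file object.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)  # assumption: the first line holds the column names
    columns = {name: [] for name in header}
    for row in reader:
        # Pair each cell with its column name and append it to that column.
        for name, value in zip(header, row):
            columns[name].append(value)
    return columns


# Stand-in for the real tab-delimited file.
sample = io.StringIO("A\tB\tC\n1\t4\t7\n2\t5\t8\n3\t6\t9\n")
columns = read_columns(sample)

print(columns["A"])  # ['1', '2', '3']
```

The values stay as strings, so convert them (for example with `int`) if you need numbers. A row can be rebuilt with `[columns[name][i] for name in columns]`, where `i` is the zero-based row index.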

Another option is pandas:

>>> import pandas as pd
>>> i = pd.read_csv('foo.csv', sep='\t')  # the question's file is tab-delimited
>>> i
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
>>> i['A']
0    1
1    2
2    3
Name: A, dtype: int64

1 Comment

I think OP wants a dict that looks like {'A': [all vals in column A], 'B': [all vals in column B]}, not individual dicts for each row like DictReader provides.

Not sure this is relevant, but you can do this using rpy2.

from rpy2 import robjects

# sep='\t' because the question's file is tab-delimited
dframe = robjects.DataFrame.from_csvfile('/your/csv/file.csv', sep='\t')
d = {k: list(v) for k, v in dframe.items()}

output:

{'A': [1, 2, 3], 'C': [7, 8, 9], 'B': [4, 5, 6]}
