Structuring .csv in Python

Question

I'm wondering how I could build a .csv file with a proper structure. As an example, my data has the form:

(index, latitude, longitude, value)

- 0 - lat=-51.490000 lon=264.313000 value=7.270077
- 1 - lat=-51.490000 lon=264.504000 value=7.231014
- 2 - lat=-51.490000 lon=264.695000 value=21.199764
- 3 - lat=-51.490000 lon=264.886000 value=49.176327
- 4 - lat=-51.490000 lon=265.077000 value=91.160702
- 5 - lat=-51.490000 lon=265.268000 value=147.152889
- 6 - lat=-51.490000 lon=265.459000 value=217.152889
- 7 - lat=-51.490000 lon=265.650000 value=301.160702
- 8 - lat=-51.490000 lon=265.841000 value=399.176327
- 9 - lat=-51.490000 lon=266.032000 value=511.199764
- 10 - lat=-51.490000 lon=266.223000 value=637.231014
- 11 - lat=-51.490000 lon=266.414000 value=777.270077
- 12 - lat=-51.490000 lon=266.605000 value=931.316952
- 13 - lat=-51.490000 lon=266.796000 value=1099.371639
- 14 - lat=-51.490000 lon=266.987000 value=1281.434139
- 15 - lat=-51.490000 lon=267.178000 value=1477.504452
- 16 - lat=-51.490000 lon=267.369000 value=1687.582577
- 17 - lat=-51.490000 lon=267.560000 value=1911.668514
- 18 - lat=-51.490000 lon=267.751000 value=2149.762264
- 19 - lat=-51.490000 lon=267.942000 value=2401.863827
- 20 - lat=-51.490000 lon=268.133000 value=2667.973202
- 21 - lat=-51.490000 lon=268.324000 value=2948.090389

I would like to be able to save this data in a .csv file with the format:

         | longitude | 
latitude |   value   |

That is, all the values with the same latitude would be on the same line and all the values with the same longitude would be in the same column. I know how to write a .csv file in Python; I'm wondering how I could perform this transformation properly.

Thank you in advance.

You will first have to loop over the data to collect all longitudes; those will be your columns. Then I would probably create a dictionary for each latitude that contains longitude/value pairs, so you can write a line for each latitude. You should take a look at the csv.DictWriter class.
– rje, Commented Sep 16, 2014 at 15:33
I'd break up the lines with a regex and then use nested dicts to record the values: mydict[latitude][longitude] = value. I'd also make a set of longitudes. The size of this set is the number of columns; make it a list and sort it to get an indexer into the nested dicts. Sort the latitude keys and off you go.
– tdelaney, Commented Sep 16, 2014 at 15:36
What happens if there is more than one value per lat/lon pair? What if there are two latitudes or longitudes that are almost the same but not exactly?
– Krab, Commented Sep 16, 2014 at 16:17
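The approach suggested in these comments could be sketched roughly as follows, using csv.DictWriter. The sample records, the field names and the function name pivot_to_csv are assumptions made up for illustration; they are not taken from the question's real data set.

```python
import csv
import io

# Hypothetical sample records; the field names are assumptions for illustration.
records = [
    {"lat": -51.49, "lon": 264.313, "value": 7.270077},
    {"lat": -51.49, "lon": 264.504, "value": 7.231014},
    {"lat": -52.49, "lon": 264.504, "value": 9.5},
]


def pivot_to_csv(records, out):
    """Write one row per latitude and one column per longitude."""
    # First pass: collect every longitude; these become the columns.
    lons = sorted({r["lon"] for r in records})
    # Second pass: one dict of longitude -> value per latitude.
    rows = {}
    for r in records:
        rows.setdefault(r["lat"], {"latitude": r["lat"]})[r["lon"]] = r["value"]
    # restval fills the cells for lat/lon pairs that never occurred.
    writer = csv.DictWriter(out, fieldnames=["latitude"] + lons, restval="")
    writer.writeheader()
    for lat in sorted(rows):
        writer.writerow(rows[lat])


buffer = io.StringIO()
pivot_to_csv(records, buffer)
csv_text = buffer.getvalue()
```

Regarding the last comment: in this sketch a repeated lat/lon pair silently keeps the last value seen, and coordinates that differ only slightly end up in separate rows or columns. If that matters, the keys would have to be rounded or the duplicates handled explicitly.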

rje · Accepted Answer · 2014-09-16 16:53:10Z

I wrote a little program for you :) see below.

I'm assuming for now that your data is stored as a list of dicts, but if it is a list of lists the code shouldn't be too hard to fix.

#!/usr/bin/env python

import csv

data = [
    dict(lat=1, lon=1, val=10),
    dict(lat=1, lon=2, val=20),
    dict(lat=2, lon=1, val=30),
    dict(lat=2, lon=2, val=40),
    dict(lat=3, lon=1, val=50),
    dict(lat=3, lon=2, val=60),
]

# get a unique list of all longitudes
headers = list({d['lon'] for d in data})
headers.sort()

# make a dict of latitudes
data_as_dict = {}
for item in data:
    # default value: a list of empty strings
    lst = data_as_dict.setdefault(item['lat'], ['']*len(headers))
    # get the longitude for this item
    lon = item['lon']
    # where in the line should it be?
    idx = headers.index(lon)
    # save value in the list
    lst[idx]=item['val']


# in the actual file, we start with an extra header for the latitude
headers.insert(0,'latitude')

with open('latitude.csv', 'w') as csvfile:
    writer = csv.writer(csvfile, delimiter=' ',
                            quotechar='|', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(headers)
    # sorted() works on Python 2 and 3; dict.keys() has no .sort() on Python 3
    lats = sorted(data_as_dict)
    for latitude in lats:
        # a line starts with the latitude, followed by list of values
        l = data_as_dict[latitude]
        l.insert(0, latitude)
        writer.writerow(l)

output:

latitude 1 2
1 10 20
2 30 40
3 50 60

Granted, it's not the prettiest code, but I hope you get the idea.

edited Sep 16, 2014 at 16:53

answered Sep 16, 2014 at 15:54

4 Comments

pceccon Over a year ago

Hi @rje. Thank you for your answer. One small thing I forgot to ask: is it possible to order by lat and lon? My data is ordered, but with this code the result isn't. Thank you.

rje Over a year ago

Nope, ordering is not necessary, it'll work with unordered data too!

pceccon Over a year ago

Yes, but I guess I wasn't clear. It worked; however, the data in the output file isn't ordered as it was in the input. I'm trying to work that out now.

rje Over a year ago

Ah, I see. Changed the code a bit to sort the headers and keys :)
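If the goal is to keep rows and columns in the order they first appear in the input, rather than sorted, one possible sketch is shown here. The helper first_seen and the sample records are hypothetical and only serve to illustrate the idea.

```python
import csv
import io

# Hypothetical records, deliberately not sorted by latitude or longitude.
data = [
    dict(lat=3, lon=2, val=60),
    dict(lat=3, lon=1, val=50),
    dict(lat=1, lon=2, val=20),
    dict(lat=1, lon=1, val=10),
]


def first_seen(values):
    """Return the unique values in the order they first appear."""
    seen = set()
    ordered = []
    for v in values:
        if v not in seen:
            seen.add(v)
            ordered.append(v)
    return ordered


lons = first_seen(d["lon"] for d in data)  # column order as in the input
lats = first_seen(d["lat"] for d in data)  # row order as in the input
cells = {(d["lat"], d["lon"]): d["val"] for d in data}

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["latitude"] + lons)
for lat in lats:
    # empty string for lat/lon pairs that are missing from the input
    writer.writerow([lat] + [cells.get((lat, lon), "") for lon in lons])
table = out.getvalue().splitlines()
```

Whether sorted order or input order is preferable depends on the data; sorting is more predictable when the input order is not guaranteed.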

Adam Smith · Accepted Answer · 2014-09-16 15:45:38Z

I'm assuming you have this data in a text file. Let's use regular expressions to parse the data (though string splitting looks like it could work if your format stays the same).

import re

data = list()

with open('path/to/data/file','r') as infile:
    for line in infile:
        matches = re.match(r".*(?<=lat=)(?P<lat>(?:\+|-)?[\d.]+).*(?<=value=)(?P<longvalue>(?:\+|-)?[\d.]+)", line)
        if matches:  # skip lines that do not match the pattern
            data.append((matches.group('lat'), matches.group('longvalue')))

To unroll that nasty regex:

pat = re.compile(r"""
  .*                         # Match anything any number of times
  (?<=lat=)                  # assert that the last 4 characters are "lat="
  (?P<lat>                   # begin named capturing group "lat"
      (?:\+|-)?              #   allow one or none of either + or -
      [\d.]+                 #   and one or more digits or decimal points
  )                          # end named capturing group "lat"
  .*                         # Another wildcard
  (?<=value=)                # assert that the last 6 characters are "value="
  (?P<longvalue>             # begin named capturing group "longvalue"
      (?:\+|-)?              #   allow one or none of either + or -
      [\d.]+                 #   and one or more digits or decimal points
  )                          # end named capturing group "longvalue"
""", re.X)

# and a terser way of writing the code, since we've compiled the pattern above:

with open('path/to/data/file', 'r') as infile:
    data = [(matches.group('lat'), matches.group('longvalue')) for line in infile for
            matches in (re.match(pat, line),)]
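The pattern above captures only the latitude and the value. Since the pivot also needs the longitude, a variant that pulls out all three fields might look like the sketch below. The names NUMBER, LINE and parse_lines are made up for illustration, and the line layout is assumed to match the sample in the question.

```python
import re

# Assumed line layout, taken from the question: "- 0 - lat=... lon=... value=..."
NUMBER = r"[+-]?\d+(?:\.\d+)?"
LINE = re.compile(
    r"lat=(?P<lat>{n})\s+lon=(?P<lon>{n})\s+value=(?P<value>{n})".format(n=NUMBER)
)


def parse_lines(lines):
    """Return (lat, lon, value) float triples, skipping lines that do not match."""
    parsed = []
    for line in lines:
        m = LINE.search(line)
        if m is None:
            continue  # ignore blank or malformed lines instead of raising
        parsed.append(tuple(float(m.group(k)) for k in ("lat", "lon", "value")))
    return parsed


sample = [
    "- 0 - lat=-51.490000 lon=264.313000 value=7.270077",
    "not a data line",
    "- 1 - lat=-51.490000 lon=264.504000 value=7.231014",
]
triples = parse_lines(sample)
```

Converting to float here is a design choice: it makes numeric sorting easy later, but the original text formatting (trailing zeros) is lost, so keep the strings if the output must match the input exactly.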

exhuma · Accepted Answer · 2014-09-16 16:03:12Z

Given your input data, I came up with the following:

from __future__ import print_function


def decode(line):
    line = line.replace('- ', ' ')
    fields = line.split()
    index = fields[0]
    data = dict([_.split('=') for _ in fields[1:]])
    return index, data


def transform(filename):
    transformed = {}
    columns = set()
    with open(filename) as infile:  # make sure the file is closed afterwards
        for line in infile:
            index, data = decode(line.strip())
            element = transformed.setdefault(data['lat'], {})
            element[data['lon']] = data['value']
            columns.add(data['lon'])
    return columns, transformed


def main(filename):
    columns, transformed = transform(filename)
    columns = sorted(columns)
    print(',', ','.join(columns))
    for lat, data in transformed.items():
        print(lat, ',', ', '.join([data.get(_, 'NULL') for _ in columns]))

if __name__ == '__main__':
    main('so.txt')

To cover the case where the data contains more than one latitude, I added one extra line to the example, so my input data (so.txt) contained this:

- 0 - lat=-51.490000 lon=264.313000 value=7.270077
- 1 - lat=-51.490000 lon=264.504000 value=7.231014
- 2 - lat=-51.490000 lon=264.695000 value=21.199764
- 3 - lat=-51.490000 lon=264.886000 value=49.176327
- 4 - lat=-51.490000 lon=265.077000 value=91.160702
- 5 - lat=-51.490000 lon=265.268000 value=147.152889
- 6 - lat=-51.490000 lon=265.459000 value=217.152889
- 7 - lat=-51.490000 lon=265.650000 value=301.160702
- 8 - lat=-51.490000 lon=265.841000 value=399.176327
- 9 - lat=-51.490000 lon=266.032000 value=511.199764
- 10 - lat=-51.490000 lon=266.223000 value=637.231014
- 11 - lat=-51.490000 lon=266.414000 value=777.270077
- 12 - lat=-51.490000 lon=266.605000 value=931.316952
- 13 - lat=-51.490000 lon=266.796000 value=1099.371639
- 14 - lat=-51.490000 lon=266.987000 value=1281.434139
- 15 - lat=-51.490000 lon=267.178000 value=1477.504452
- 16 - lat=-51.490000 lon=267.369000 value=1687.582577
- 17 - lat=-51.490000 lon=267.560000 value=1911.668514
- 18 - lat=-51.490000 lon=267.751000 value=2149.762264
- 19 - lat=-51.490000 lon=267.942000 value=2401.863827
- 20 - lat=-51.490000 lon=268.133000 value=2667.973202
- 21 - lat=-51.490000 lon=268.324000 value=2948.090389
- 22 - lat=-52.490000 lon=268.324000 value=2948.090389

(note the last line)

With that input file, the above program creates the following output:

, 264.313000,264.504000,264.695000,264.886000,265.077000,265.268000,265.459000,265.650000,265.841000,266.032000,266.223000,266.414000,266.605000,266.796000,266.987000,267.178000,267.369000,267.560000,267.751000,267.942000,268.133000,268.324000
-51.490000 , 7.270077, 7.231014, 21.199764, 49.176327, 91.160702, 147.152889, 217.152889, 301.160702, 399.176327, 511.199764, 637.231014, 777.270077, 931.316952, 1099.371639, 1281.434139, 1477.504452, 1687.582577, 1911.668514, 2149.762264, 2401.863827, 2667.973202, 2948.090389
-52.490000 , NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, 2948.090389
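One thing to watch with this approach: the coordinates stay strings, so sorted(columns) orders them character by character. That happens to work for the sample data because all longitudes have the same number of digits. A small sketch of the difference, using made-up values, and of sorting numerically with key=float:

```python
# Coordinates kept as strings, as in the program above; the values are made up.
columns = {"99.500000", "100.250000", "264.313000"}

as_text = sorted(columns)                 # lexicographic: compares character by character
as_numbers = sorted(columns, key=float)   # numeric order, strings left untouched

# The same idea applies to the row keys, including negative latitudes.
latitudes = ["-51.490000", "-52.490000", "3.000000"]
lat_order = sorted(latitudes, key=float)
```

With key=float the strings themselves are not modified, so the output keeps the original formatting.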

tdelaney · Accepted Answer · 2014-09-16 16:33:15Z

You can pull lat/lon/value from each line using a regex. You'll want to look up lat and lon later, so use a nested dict of the form d[lat][lon]=value to track it all. Add a set to keep track of the unique longitudes you see, and it's pretty straightforward to generate the csv.

I sorted it in the example, but you may not care about that.

import re
import collections

data = """- 0 - lat=-51.490000 lon=264.313000 value=7.270077
- 1 - lat=-51.490000 lon=264.504000 value=7.231014
- 2 - lat=-51.490000 lon=264.695000 value=21.199764
- 3 - lat=-51.490000 lon=264.886000 value=49.176327
- 4 - lat=-51.490000 lon=265.077000 value=91.160702"""

regex = re.compile(r'- \d+ - lat=([\+\-]?[\d\.]+) lon=([\+\-]?[\d\.]+) value=([\+\-]?[\d\.]+)')

# lat/lon index will hold lats[latitude][longitude] = value
lats = collections.defaultdict(dict)
# longitude columns
lonset = set()

for line in data.split('\n'):
    match = regex.match(line)
    if match:
        lat, lon, val = match.groups()
        lats[lat][lon] = val
        lonset.add(lon)

latkeys = sorted(lats.keys())
lonkeys = sorted(list(lonset))

header = ['latitude'] + lonkeys
print(header)

for lat in latkeys:
    lons = lats[lat]
    row = [lat] + [lons.get(lon, '') for lon in lonkeys]
    print(row)
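The example prints the header and rows as Python lists. To get actual CSV text, the same lists could be handed to csv.writer, roughly as in this sketch. The header and rows below are hypothetical stand-ins for the lists built above, and write_table is a name made up for illustration.

```python
import csv
import io

# Hypothetical stand-ins for the header and row lists built above.
header = ["latitude", "264.313000", "264.504000"]
rows = [
    ["-52.490000", "", "2948.090389"],
    ["-51.490000", "7.270077", "7.231014"],
]


def write_table(out, header, rows):
    """Write the header followed by every row as comma-separated lines."""
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(rows)


# An in-memory buffer keeps the sketch self-contained; on Python 3 a real file
# opened with open("out.csv", "w", newline="") could be passed instead
# ("out.csv" is a placeholder name).
buffer = io.StringIO()
write_table(buffer, header, rows)
csv_lines = buffer.getvalue().splitlines()
```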
