Parsing two different types of csv format in python

Question

I have several CSV files, each with a different format. Here is a sample from two of them. Please look at the format, not the values.

 csv_2   "xxxx-0147-xxxx-194443,""Jan 1, 2017"",7:43:43 AM PST,,Google fee,,Smart Plan (Calling & Texting),com.yuilop,1,unlimited_usca_tariff_and,mimir,US,TX,76501,USD,-3.00,0.950210,EUR,-2.85"
 csv_2  "1305-xxxx-0118-54476..1,""Jan 1, 2017"",7:17:31 AM PST,,Google fee,,Smart Plan (Calling & Texting),com.yuilop,1,unlimited_usca_tariff_and,htc_a13wlpp,US,TX,79079,USD,-3.00,0.950210,EUR,-2.85"
 csv_1 GPA.xxxx-2612-xxxx-44448..0,2017-02-01,1485950845,Charged,m1,Freedom Plan (alling & Texting),com.yuilop,subscription,basic_usca_tariff_and,USD,2.99,0.00,2.99,,,07605,US
 csv_1 GPA.xxxx-6099-9725-56125,2017-02-01,1485952917,Charged,athene_f,Buy 100 credits (Calling & Texting),com.yuilop,inapp,100_credits,INR,138.41,0.00,138.41,Kolkata,West Bengal,700007,IN

As you can see, csv_2 contains " and sometimes "", whereas csv_1 is a simple format. I receive the CSV files on demand; there are many of them and they are huge. I tried csv.Sniffer to detect the dialect automatically, but that is not enough: I don't get a reasonable result for the files that contain "". Can anybody guide me on how to solve this problem?

Python 2.7 code:

import csv

with open(file, 'rU') as csvfile:
    # Let the Sniffer guess the dialect from the start of the file
    dialect = csv.Sniffer().sniff(csvfile.read(2024))
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
    for line in reader:
        print line

Parameter Values:

 dialect.escapechar     None
 dialect.quotechar      "
 dialect.quoting        0
 dialect.delimiter      ,
 dialect.doublequote    False

result

csv_1 ['GPA.13xx-xxxx-9725-5xxx', '2017-02-01', '1485952917', 'Charged', 'athene_f', 'Buy 100 credits (Calling & Texting)', 'com.yuilop', 'inapp', '100_credits', 'INR', '138.41', '0.00', '138.41', 'Kolkata', 'West Bengal', '700007', 'IN']
csv_2  ['1330-xxxx-5560-xxxx,"Jan 1', ' 2017""', '12:35:13 AM PST', '', 'Google fee', '', 'Smart Plan (Calling & Texting)', 'com.yuilop', '1', 'unlimited_usca_tariff_and', 'astar-y3', 'US', 'NC', '27288', 'USD', '-3.00', '0.950210', 'EUR', '-2.85"']

In csv_2 the result is a mess: the date field is split at its comma, and the whole row is treated as one string. How can I change my code to get the same kind of result as for csv_1?

Antonio Beamud · Accepted Answer · 2017-02-27 17:36:59Z

0

Why not pre-process the CSV to clean up the " characters and normalize it, and then load the data like the other CSV?
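
One way this pre-processing could look, as a minimal sketch rather than a definitive implementation: each csv_2 sample line looks like an ordinary CSV row that was quoted a second time as one big field (outer quotes around the whole line, inner quotes doubled). Parsing such a line once gives a single field, and parsing that field again gives the real columns, with the date kept intact. The helper name normalize_csv_text is made up for illustration, and the code assumes Python 3 and a comma delimiter.

```python
import csv
import io


def normalize_csv_text(text):
    """Return CSV text in which whole-line-quoted rows are unwrapped."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text, newline="")):
        # Assumption: a row that comes back as one field containing commas
        # was quoted as a whole, so that field is itself a CSV line.
        if len(row) == 1 and "," in row[0]:
            row = next(csv.reader([row[0]]))
        if row:  # skip blank lines
            writer.writerow(row)
    return out.getvalue()
```

The returned text can then be handed to a plain csv.reader, the same way as the csv_1 files.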

answered Feb 27, 2017 at 17:36

Antonio Beamud


3 Comments

pm1359 Over a year ago

The problem is that I don't know the CSV format of each file. There are around 1000 CSV files, so opening each of them is time-consuming. Do you have any suggestion for that?

Antonio Beamud Over a year ago

You need to know how many formats the 1000 CSV files use. After all, you will need to process that information after parsing all the CSV files, no?

pm1359 Over a year ago

OK, but I don't receive all the files at once, so I don't know what will come next! So, as I understand it, you mean something like catching exceptions, figuring out the different CSV formats, and handling each of them separately. I thought the sniffer could do this job automatically, so that we wouldn't need to take care of this part. @Antonio Beamud
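
For the situation discussed in these comments, where the layout of an incoming file is not known ahead of time, one option is to let the code look at the first line and pick the parsing path itself. The following is only a rough sketch: looks_wrapped and read_rows are hypothetical names, and it assumes Python 3, a comma delimiter, and a first line that is representative of the whole file (for example, no header in a different layout).

```python
import csv


def looks_wrapped(line):
    """True if the line parses to one field that itself contains commas."""
    fields = next(csv.reader([line]), [])
    return len(fields) == 1 and "," in fields[0]


def read_rows(fileobj):
    """Yield rows from a seekable text file object in either layout."""
    first = fileobj.readline()
    if not first:
        return  # empty file, nothing to yield
    fileobj.seek(0)
    if looks_wrapped(first):
        # Whole-line-quoted layout: parse each line, then parse its only field
        for outer in csv.reader(fileobj):
            if outer:
                yield next(csv.reader([outer[0]]), [])
    else:
        # Plain layout: a single pass is enough
        for row in csv.reader(fileobj):
            yield row
```

Because read_rows is a generator, rows are produced one at a time, which may help with huge files.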

Community · Accepted Answer · 2017-05-23 12:25:19Z

0

You're one step from working code. All you have to do is first remove the "s from the contents of csvfile; then your current approach will work just fine.

EDIT: However, if you're interested in merging the date strings that were separated after reading in the CSV file, your best bet is a regex match. I've added some code to my original answer. I've copied most of the regex code (with edits) from this older answer.

import re
import csv

# Month name followed by a day number, e.g. "Jan 1"
pattern = r'(%s) +(%s)'

thirties = pattern % (
    "Sep|Apr|Jun|Nov",
    r'[1-9]|[12]\d|30')

thirtyones = pattern % (
    "Jan|Mar|May|Jul|Aug|Oct|Dec",
    r'[1-9]|[12]\d|3[01]')

feb = pattern % (
    "Feb",
    r'[1-9]|[12]\d')  # 1-29 any year (including potential leap years)

result = '|'.join('(?:%s)' % x for x in (thirties, thirtyones, feb))
r = re.compile(r'(?:%s)$' % result, re.IGNORECASE)

# 'rU' is the Python 2 universal-newline mode
with open(file, 'rU') as csvfile:
    # Remove the pesky double-quotes
    no_quotes_data = csvfile.read().replace('"', '')

# Sniff on a sample only, but parse the whole file
dialect = csv.Sniffer().sniff(no_quotes_data[:2024])
csv_data = csv.reader(no_quotes_data.splitlines(), dialect)

for row in csv_data:
    line = []
    for field in row:
        year = field.strip()
        # If you've found a date string ("Jan 1"), a year string will follow:
        # merge the two back into a single field ("Jan 1, 2017")
        if line and r.match(line[-1]) and re.match(r'\d{4}$', year):
            line[-1] = ", ".join((line[-1], year))
        else:
            line.append(field)
    print(line)
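
If the merged date should also end up in the same shape as the csv_1 dates (for example 2017-02-01), the standard library can convert it once the two pieces are joined. This is a small sketch, not part of the original approach: it assumes datetime.strptime with the format '%b %d, %Y' and an English locale for the month abbreviations, and the helper name to_iso_date is only illustrative.

```python
from datetime import datetime


def to_iso_date(text):
    """Convert a date such as 'Jan 1, 2017' to ISO form, e.g. '2017-01-01'."""
    return datetime.strptime(text.strip(), "%b %d, %Y").date().isoformat()
```

strptime raises ValueError for text that does not match the format, so a caller can catch that and leave non-date fields untouched.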

edited May 23, 2017 at 12:25


answered Feb 27, 2017 at 17:54

Vladislav Martin


7 Comments

pm1359 Over a year ago

Why are you not using the sniffer? It will detect everything automatically, right?

Vladislav Martin Over a year ago

You're right, it would. That's a perfectly fine approach. I just assumed that specifying the delimiter explicitly would make my answer more obvious to you. You could define dialect = csv.Sniffer().sniff(new_data) and provide that as input to the csv.reader() line: csv.reader(new_data.splitlines(), dialect). NOTE: I would avoid referring to csvfile after removing the double-quotes, since the original csvfile will still have the double quotes and Sniffer won't automatically detect the proper delimiter / format of the CSV.

pm1359 Over a year ago

I tried your approach, but the date is not correct: the result is 'Jan 1', '2017' as two separate fields, whereas it should be 'Jan 1, 2017'. In fact, this is the problem I had originally: when the comma is recognised as the separator, the date field is split as well, which is not the result I want.

Vladislav Martin Over a year ago

@MaryamPashmi That's not necessarily a "problem" with the code. The code is doing exactly what it says it's doing: separating the contents of your CSV at the comma delimiter. Since there's a comma within the date, the date will inevitably be separated into two items within the resulting list. If you don't want that result, then you'll have to do some post-processing of the list you extract from the CSV file. I'll update my code to do something like this, but I'm telling you, this isn't the nicest solution. Making sure your CSV files are properly formatted is usually a simpler solution.

Vladislav Martin Over a year ago

@MaryamPashmi Did this solution sufficiently answer your question?
