
I have several CSV files, each in a different format. Here is a sample from two of them. Please look at the format, not the values.

 csv_2   "xxxx-0147-xxxx-194443,""Jan 1, 2017"",7:43:43 AM PST,,Google fee,,Smart Plan (Calling & Texting),com.yuilop,1,unlimited_usca_tariff_and,mimir,US,TX,76501,USD,-3.00,0.950210,EUR,-2.85"
 csv_2  "1305-xxxx-0118-54476..1,""Jan 1, 2017"",7:17:31 AM PST,,Google fee,,Smart Plan (Calling & Texting),com.yuilop,1,unlimited_usca_tariff_and,htc_a13wlpp,US,TX,79079,USD,-3.00,0.950210,EUR,-2.85"
 csv_1 GPA.xxxx-2612-xxxx-44448..0,2017-02-01,1485950845,Charged,m1,Freedom Plan (alling & Texting),com.yuilop,subscription,basic_usca_tariff_and,USD,2.99,0.00,2.99,,,07605,US
 csv_1 GPA.xxxx-6099-9725-56125,2017-02-01,1485952917,Charged,athene_f,Buy 100 credits (Calling & Texting),com.yuilop,inapp,100_credits,INR,138.41,0.00,138.41,Kolkata,West Bengal,700007,IN

As you can see, csv_2 contains " and sometimes "", whereas csv_1 is in a simple format. I receive the CSV files on demand; there are many of them and they are large. I tried csv.Sniffer to detect the dialect automatically, but that is not enough: I don't get a reasonable result for the files that contain "". Can anybody guide me on how to solve this problem?

Python 2.7 code:

import csv

with open(file, 'rU') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read(2024))
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
    for line in reader:
        print line

Parameter Values:

 dialect.escapechar     None
 dialect.quotechar      "
 dialect.quoting        0
 dialect.delimiter      ,
 dialect.doublequote    False

Result:

csv_1 ['GPA.13xx-xxxx-9725-5xxx', '2017-02-01', '1485952917', 'Charged', 'athene_f', 'Buy 100 credits (Calling & Texting)', 'com.yuilop', 'inapp', '100_credits', 'INR', '138.41', '0.00', '138.41', 'Kolkata', 'West Bengal', '700007', 'IN']
csv_2  ['1330-xxxx-5560-xxxx,"Jan 1', ' 2017""', '12:35:13 AM PST', '', 'Google fee', '', 'Smart Plan (Calling & Texting)', 'com.yuilop', '1', 'unlimited_usca_tariff_and', 'astar-y3', 'US', 'NC', '27288', 'USD', '-3.00', '0.950210', 'EUR', '-2.85"']

In csv_2 the result is a mess: the date field is split at its comma, and the whole row is treated as a single quoted string. How can I change my code to get the same kind of result as for csv_1?

2 Answers


Why not pre-process the CSV to strip the " characters and normalize it, and then load the data like the other CSV files?
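One way to do that normalization, as a minimal sketch: it assumes that in the csv_2 format each whole row has been wrapped in one extra layer of quotes (with the inner quotes doubled), which is what the samples in the question look like. Under that assumption, parsing a row a second time unwraps it and keeps the date in one field. The name normalize_rows is hypothetical, and the sample lines are shortened versions of the rows in the question.

```python
import csv

def normalize_rows(lines):
    """Yield parsed rows, unwrapping rows that were quoted as one big field."""
    # The default dialect treats "" inside a quoted field as a literal quote.
    for row in csv.reader(lines):
        # A wrapped row parses as exactly one field that itself contains commas.
        if len(row) == 1 and ',' in row[0]:
            row = next(csv.reader([row[0]]))
        yield row

# Shortened sample lines in the two formats from the question
sample = [
    '"xxxx-0147-xxxx-194443,""Jan 1, 2017"",7:43:43 AM PST,,Google fee,-2.85"',
    'GPA.xxxx-2612-xxxx-44448..0,2017-02-01,1485950845,Charged,m1',
]
rows = list(normalize_rows(sample))
for row in rows:
    print(row)
```

Because the check is made per row, the format of a file does not have to be known in advance. An open file object can be passed in place of the sample list, since csv.reader accepts any iterable of lines.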


3 Comments

The problem is that I don't know the CSV format of each file. There are around 1000 CSV files, so opening each one is time-consuming. Do you have any suggestion for that?
You need to know how many formats the 1000 CSV files use. After all, you need to process that information after parsing all the CSV files, no?
OK, but I don't receive all the files at once, so I don't know what will come next! So, as I understand it, you mean something like catching exceptions, figuring out the different CSV formats, and handling each one separately. I thought Sniffer could do this job automatically so that we wouldn't need to take care of this part. @Antonio Beamud

You're one step away from working code. All you have to do is first remove the " characters from the contents of csvfile; then your current approach will work just fine.

EDIT: However, if you want to merge the date strings that were split when the CSV file was read, your best bet is a regex match. I've added code for this to my original answer. I copied most of the regex code (with edits) from this older answer.

import re
import csv

# Matches a "Mon D" fragment such as 'Jan 1' (the first half of a split date)
pattern = r'(%s) +(%s)$'
thirties = pattern % ("Sep|Apr|Jun|Nov", r'[1-9]|[12]\d|30')
thirtyones = pattern % ("Jan|Mar|May|Jul|Aug|Oct|Dec", r'[1-9]|[12]\d|3[01]')
feb = r'(Feb) +([1-9]|[12]\d)$'  # 1-29 any year (including potential leap years)

result = '|'.join('(?:%s)' % x for x in (thirties, thirtyones, feb))
r = re.compile(result, re.IGNORECASE)

# `file` is the path to the CSV file, as in the question
with open(file, 'rU') as csvfile:
    data = csvfile.read()

# Remove the pesky double-quotes
no_quotes_data = data.replace('"', '')

# Sniff the dialect from a sample of the cleaned text
dialect = csv.Sniffer().sniff(no_quotes_data[:2024])

for row in csv.reader(no_quotes_data.splitlines(), dialect):
    merged = []
    for field in row:
        # If the previous field is a date fragment, the year follows: rejoin them
        if merged and r.match(merged[-1]):
            merged[-1] = merged[-1] + ", " + field.strip()
        else:
            merged.append(field)
    print(merged)

7 Comments

Why are you not using Sniffer? It would find everything automatically, right?
You're right, it would. That's a perfectly fine approach. I just assumed that specifying the delimiter explicitly would make my answer more obvious to you. You could define dialect = csv.Sniffer().sniff(new_data) and provide that as input to the csv.reader() line: csv.reader(new_data.splitlines(), dialect). NOTE: I would avoid referring to csvfile after removing the double-quotes, since the original csvfile will still have the double quotes and Sniffer won't automatically detect the proper delimiter / format of the CSV.
I tried your approach, but the date is not correct: the result is 'Jan 1', '2017' as separate fields, whereas it should be 'Jan 1, 2017'. In fact this is the problem I had: when the comma is recognised as the separator, the date field is also split, which is not the result I want.
@MaryamPashmi That's not necessarily a "problem" with the code. The code is doing exactly what it says it's doing: splitting the contents of your CSV at the comma delimiter. Since there's a comma within the date, the date will inevitably be split into two items in the resulting list. If you don't want that result, then you'll have to do some post-processing of the list you extract from the CSV file. I'll update my code to do something like this, but I'm telling you, this isn't the nicest solution. Making sure your CSV files are properly formatted is usually a simpler solution.
@MaryamPashmi Did this solution sufficiently answer your question?
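To illustrate the point above about properly formatted CSV files, here is a small sketch. It assumes the date field is quoted exactly once, which is not how the csv_2 files in the question arrive; the line is a shortened, re-quoted version of one sample row. In that form the csv module treats the comma inside the date as data, so no post-processing is needed.

```python
import csv

# The date is quoted once, so its comma is data rather than a delimiter.
line = '1305-xxxx-0118-54476..1,"Jan 1, 2017",7:17:31 AM PST'
fields = next(csv.reader([line]))
print(fields)
```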