
I have a set of bulletins in txt format that contain some junk data (big headers, trailing lines, etc.), which I was able to clean up using pandas. Then I had to append all the DataFrames into one new DataFrame to produce a single file, because I need to process about 10 years of data. The code is:

import os
import glob
import pandas as pd

os.chdir(r'D:\Inves\Catalogs\OSC')
path = os.getcwd()
csv_files = glob.glob(os.path.join(path, "*.txt"))

new_data = []

for f in csv_files:
    print('Location File:', f)
    print('File Name:', os.path.basename(f))
    # line index 10 holds the column names; fields are whitespace-separated
    df = pd.read_csv(f, header=10, sep=r'\s+')
    df.drop(df.tail(2).index, inplace=True)  # drop the 2-line footer
    df = df[df.YYYY != '----']  # deleting the '----' row
    print('...Appending...')
    new_data.append(df)

new_data = pd.concat(new_data, ignore_index=True)
new_data.to_csv(r'D:\Inves\Catalogs\Full_1988-2008.csv',
                index=False, header=True, sep=',')

The CSV file "Full_1988-2008.csv" is about 10MB (~173395 rows), the data inside the file looks like:

YYYY,MM,JJ,HH,MI,SS,STIME,LAT,SLAT,LON,SLON,DEPTH,ML,ORID,RMS,Num,Fase
1988,07,05,03,01,44,.92,-16.420,"8,41",-68.810,"7,56",94.00,1.01,34,",4",6,
1988,07,05,03,45,00,1.70,-16.990,"10,57",-68.910,"10,15",65.00,-1.00,35,"1,12",11,
1988,07,05,04,40,00,.00,-999.000,0,-999.000,0,-999.00,-1.00,36,0,5,
1988,07,05,05,13,12,1.50,-16.600,"5,51",-68.550,"3,64",15.00,1.97,37,",92",10,
1988,07,05,06,25,45,1.21,-16.960,"4,27",-68.520,"5,92",2.00,2.03,38,",74",8,
1988,07,05,07,24,42,2.04,-19.410,"74,58",-68.910,"23,03",160.00,2.78,39,"1,18",8,
1988,07,05,09,03,00,.00,-999.000,0,-999.000,0,-999.00,-1.00,41,0,3,

I need the data from YYYY (year), LAT & LON (coordinates), DEPTH (depth), and ML (magnitude), so I do:

DF = pd.read_csv(kat, sep=',',
                 usecols=['YYYY', 'LAT', 'LON', 'DEPTH', 'ML'],
                 dtype={'YYYY': int, 'LAT': float, 'LON': float,
                        'DEPTH': float, 'ML': float})

but I got the error:

  File "pandas\_libs\parsers.pyx", line 1050, in pandas._libs.parsers.TextReader._convert_tokens

TypeError: Cannot cast array data from dtype('O') to dtype('int32') according to the rule 'safe'


During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "<ipython-input-13-b2a95a2d83fd>", line 46, in <module>
    'DEPTH': float, 'ML': float})

  File "C:\Users\Director\anaconda3\envs\obspy\lib\site-packages\pandas\io\parsers.py", line 610, in read_csv
    return _read(filepath_or_buffer, kwds)

  File "C:\Users\Director\anaconda3\envs\obspy\lib\site-packages\pandas\io\parsers.py", line 468, in _read
    return parser.read(nrows)

  File "C:\Users\Director\anaconda3\envs\obspy\lib\site-packages\pandas\io\parsers.py", line 1057, in read
    index, columns, col_dict = self._engine.read(nrows)

  File "C:\Users\Director\anaconda3\envs\obspy\lib\site-packages\pandas\io\parsers.py", line 2061, in read
    data = self._reader.read(nrows)

  File "pandas\_libs\parsers.pyx", line 756, in pandas._libs.parsers.TextReader.read

  File "pandas\_libs\parsers.pyx", line 771, in pandas._libs.parsers.TextReader._read_low_memory

  File "pandas\_libs\parsers.pyx", line 850, in pandas._libs.parsers.TextReader._read_rows

  File "pandas\_libs\parsers.pyx", line 982, in pandas._libs.parsers.TextReader._convert_column_data

  File "pandas\_libs\parsers.pyx", line 1056, in pandas._libs.parsers.TextReader._convert_tokens

ValueError: invalid literal for int() with base 10: 'YYYY'

In my understanding, the header YYYY, LAT, LON, DEPTH, ML becomes part of the data and so cannot be cast to int or float. However, if I skip the header, I can no longer select the columns I need, because the first data row (1998,-16.65,-66.65,12,3.2) becomes the header.
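
A quick way to check this hypothesis is to read only that column without forcing a dtype and see what pandas infers (a minimal sketch against the merged file written above):

import pandas as pd

# if any non-numeric value (e.g. a stray header row) is present, pandas
# infers object (strings) for YYYY, so the 'safe' cast to int32 fails
check = pd.read_csv(r'D:\Inves\Catalogs\Full_1988-2008.csv', usecols=['YYYY'])
print(check['YYYY'].dtype)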

Does anyone have some clues to improve the way I process my data? I attach two complete files in case you would like to reproduce my error:

https://drive.google.com/drive/folders/18xrDC7vqEm_pY3D2sxwou3dlBdkZ6nHF?usp=sharing
  • There is usually no need to provide dtype: pandas is good at inferring it.

2 Answers


Your code works well with your two files 1988.txt and 1989.txt. To debug, I propose removing the cast from read_csv:

DF = pd.read_csv(kat, sep=',', usecols=['YYYY', 'LAT', 'LON', 'DEPTH', 'ML'])

Now, check the values of the YYYY column in your concatenated new_data frame:

new_data['YYYY'].unique()

Perhaps you will see 'YYYY' among the values. To locate those lines:

new_data[new_data['YYYY'] == 'YYYY']
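
If 'YYYY' does appear among the values, one way forward (a sketch, assuming new_data is your concatenated frame) is to drop those rows first; the cast then succeeds:

# drop the stray header rows, then the numeric cast works
clean = new_data[new_data['YYYY'] != 'YYYY']
clean = clean.astype({'YYYY': int, 'LAT': float, 'LON': float,
                      'DEPTH': float, 'ML': float})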

1 Comment

Dear @Correlien, you are right: the YYYY, MM, JJ... header appears as data, at lines 59583 and 75969, so basically I will try to find the duplicates of the header and drop them.

After your suggestion, I added the following line to the code:

new_data_d = new_data[new_data.YYYY.str.contains('YYYY') == False]

Then the duplicated header rows (YYYY, MM, JJ, ...) are removed. The final code looks like:

import os
import glob
import pandas as pd

os.chdir(r'D:\Inves\Catalogs\OSC')
path = os.getcwd()
csv_files = glob.glob(os.path.join(path, "*.txt"))

new_data = []

for f in csv_files:
    print('Location File:', f)
    print('File Name:', os.path.basename(f))
    # line index 10 holds the column names; fields are whitespace-separated
    df = pd.read_csv(f, header=10, sep=r'\s+')
    df.drop(df.tail(2).index, inplace=True)  # drop the 2-line footer
    df = df[df.YYYY != '----']  # deleting the '----' row
    new_data.append(df)

new_data = pd.concat(new_data, ignore_index=True)
# remove the duplicated header rows that slipped into the data
new_data_d = new_data[new_data.YYYY.str.contains('YYYY') == False]
new_data_d.to_csv(r'D:\Inves\Catalogs\Full_1988-2008.csv',
                  index=False, header=True, sep=',')
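
As a side note, since the stray values are exactly the string 'YYYY', a plain inequality does the same job as str.contains without the substring matching:

new_data_d = new_data[new_data.YYYY != 'YYYY']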

1 Comment

Explore other options to parse your file like this: df = pd.read_csv('1988.txt', skiprows=[*range(11), 12], skipfooter=3, sep='\s+', engine='python')
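
Expanded to the whole folder, that per-file cleanup could look like this (a sketch, assuming every bulletin shares the same header block, '----' separator, and 3-line footer, so no string rows ever reach the data):

import glob
import os
import pandas as pd

frames = []
for f in glob.glob(os.path.join(r'D:\Inves\Catalogs\OSC', '*.txt')):
    # skiprows removes the header block and the '----' line up front;
    # skipfooter removes the trailing summary lines (python engine required)
    frames.append(pd.read_csv(f, skiprows=[*range(11), 12], skipfooter=3,
                              sep='\s+', engine='python'))

pd.concat(frames, ignore_index=True).to_csv(
    r'D:\Inves\Catalogs\Full_1988-2008.csv', index=False)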
