
I have a set of bulletins in txt format that contain some junk data (big headers, trailing lines, etc.), which I was able to clean up using pandas. Then I had to append all the DataFrames into one new DataFrame to produce a single file, because I need to process about 10 years of data. The code is:

import os
import glob
import pandas as pd

os.chdir(r'D:\Inves\Catalogs\OSC')
path = os.getcwd()
csv_files = glob.glob(os.path.join(path, "*.txt"))

new_data = []

for f in csv_files:
    print('Location File:', f)
    print('File Name:', os.path.basename(f))
    # line index 10 holds the column names; fields are whitespace-separated
    df = pd.read_csv(f, header=10, sep=r'\s+')
    df.drop(df.tail(2).index, inplace=True)  # drop the 2-line footer
    df = df[df.YYYY != '----']  # deleting the '----' row
    print('...Appending...')
    new_data.append(df)

new_data = pd.concat(new_data, ignore_index=True)
new_data.to_csv(r'D:\Inves\Catalogs\Full_1988-2008.csv',
                index=False, header=True, sep=',')

The CSV file "Full_1988-2008.csv" is about 10MB (~173395 rows), the data inside the file looks like:

YYYY,MM,JJ,HH,MI,SS,STIME,LAT,SLAT,LON,SLON,DEPTH,ML,ORID,RMS,Num,Fase
1988,07,05,03,01,44,.92,-16.420,"8,41",-68.810,"7,56",94.00,1.01,34,",4",6,
1988,07,05,03,45,00,1.70,-16.990,"10,57",-68.910,"10,15",65.00,-1.00,35,"1,12",11,
1988,07,05,04,40,00,.00,-999.000,0,-999.000,0,-999.00,-1.00,36,0,5,
1988,07,05,05,13,12,1.50,-16.600,"5,51",-68.550,"3,64",15.00,1.97,37,",92",10,
1988,07,05,06,25,45,1.21,-16.960,"4,27",-68.520,"5,92",2.00,2.03,38,",74",8,
1988,07,05,07,24,42,2.04,-19.410,"74,58",-68.910,"23,03",160.00,2.78,39,"1,18",8,
1988,07,05,09,03,00,.00,-999.000,0,-999.000,0,-999.00,-1.00,41,0,3,

I need the data from YYYY (year), LAT & LON (coordinates), DEPTH (depth), and ML (magnitude), so I do:

DF = pd.read_csv(kat, sep=',',
                 usecols=['YYYY', 'LAT', 'LON', 'DEPTH', 'ML'],
                 dtype={'YYYY': int, 'LAT': float, 'LON': float,
                        'DEPTH': float, 'ML': float})

but I got the error:

  File "pandas\_libs\parsers.pyx", line 1050, in pandas._libs.parsers.TextReader._convert_tokens

TypeError: Cannot cast array data from dtype('O') to dtype('int32') according to the rule 'safe'


During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "<ipython-input-13-b2a95a2d83fd>", line 46, in <module>
    'DEPTH': float, 'ML': float})

  File "C:\Users\Director\anaconda3\envs\obspy\lib\site-packages\pandas\io\parsers.py", line 610, in read_csv
    return _read(filepath_or_buffer, kwds)

  File "C:\Users\Director\anaconda3\envs\obspy\lib\site-packages\pandas\io\parsers.py", line 468, in _read
    return parser.read(nrows)

  File "C:\Users\Director\anaconda3\envs\obspy\lib\site-packages\pandas\io\parsers.py", line 1057, in read
    index, columns, col_dict = self._engine.read(nrows)

  File "C:\Users\Director\anaconda3\envs\obspy\lib\site-packages\pandas\io\parsers.py", line 2061, in read
    data = self._reader.read(nrows)

  File "pandas\_libs\parsers.pyx", line 756, in pandas._libs.parsers.TextReader.read

  File "pandas\_libs\parsers.pyx", line 771, in pandas._libs.parsers.TextReader._read_low_memory

  File "pandas\_libs\parsers.pyx", line 850, in pandas._libs.parsers.TextReader._read_rows

  File "pandas\_libs\parsers.pyx", line 982, in pandas._libs.parsers.TextReader._convert_column_data

  File "pandas\_libs\parsers.pyx", line 1056, in pandas._libs.parsers.TextReader._convert_tokens

ValueError: invalid literal for int() with base 10: 'YYYY'

In my understanding, the header YYYY, LAT, LON, DEPTH, ML becomes part of the data and so cannot be cast to int or float. However, if I skip the header, I can no longer select the columns I need, because the first data row (1998,-16.65,-66.65,12,3.2) becomes the header.
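
A quick way to check this hypothesis is to read only that column without forcing a dtype and see what pandas infers (a minimal sketch against the merged file written above):

import pandas as pd

# if any non-numeric value (e.g. a stray header row) is present, pandas
# infers object (strings) for YYYY, so the 'safe' cast to int32 fails
check = pd.read_csv(r'D:\Inves\Catalogs\Full_1988-2008.csv', usecols=['YYYY'])
print(check['YYYY'].dtype)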

Does anyone have some clues to improve the way I process my data? I attach two complete files in case you would like to reproduce my error:

https://drive.google.com/drive/folders/18xrDC7vqEm_pY3D2sxwou3dlBdkZ6nHF?usp=sharing
  • There is usually no need to provide dtype: pandas is good at inferring it.

2 Answers


Your code works well with your two files 1988.txt and 1989.txt. To debug, I propose removing the cast from read_csv:

DF = pd.read_csv(kat, sep=',', usecols=['YYYY', 'LAT', 'LON', 'DEPTH', 'ML'])

Now, check the values of the YYYY column in your concatenated new_data frame:

new_data['YYYY'].unique()

Perhaps you will see 'YYYY' among the values. To locate those lines:

new_data[new_data['YYYY'] == 'YYYY']
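
If 'YYYY' does appear among the values, one way forward (a sketch, assuming new_data is your concatenated frame) is to drop those rows first; the cast then succeeds:

# drop the stray header rows, then the numeric cast works
clean = new_data[new_data['YYYY'] != 'YYYY']
clean = clean.astype({'YYYY': int, 'LAT': float, 'LON': float,
                      'DEPTH': float, 'ML': float})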

1 Comment

Dear @Correlien, you are right: the YYYY, MM, JJ... header appears as data, at lines 59583 and 75969, so basically I will try to find the duplicates of the header and drop them.

After your suggestion, I added the following line to the code:

new_data_d = new_data[new_data.YYYY.str.contains('YYYY') == False]

Then the duplicated header rows (YYYY, MM, JJ, ...) are removed. The final code looks like:

import os
import glob
import pandas as pd

os.chdir(r'D:\Inves\Catalogs\OSC')
path = os.getcwd()
csv_files = glob.glob(os.path.join(path, "*.txt"))

new_data = []

for f in csv_files:
    print('Location File:', f)
    print('File Name:', os.path.basename(f))
    # line index 10 holds the column names; fields are whitespace-separated
    df = pd.read_csv(f, header=10, sep=r'\s+')
    df.drop(df.tail(2).index, inplace=True)  # drop the 2-line footer
    df = df[df.YYYY != '----']  # deleting the '----' row
    new_data.append(df)

new_data = pd.concat(new_data, ignore_index=True)
# remove the duplicated header rows that slipped into the data
new_data_d = new_data[new_data.YYYY.str.contains('YYYY') == False]
new_data_d.to_csv(r'D:\Inves\Catalogs\Full_1988-2008.csv',
                  index=False, header=True, sep=',')
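
As a side note, since the stray values are exactly the string 'YYYY', a plain inequality does the same job as str.contains without the substring matching:

new_data_d = new_data[new_data.YYYY != 'YYYY']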

1 Comment

Explore other options to parse your file like this: df = pd.read_csv('1988.txt', skiprows=[*range(11), 12], skipfooter=3, sep='\s+', engine='python')
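
Expanded to the whole folder, that per-file cleanup could look like this (a sketch, assuming every bulletin shares the same header block, '----' separator, and 3-line footer, so no string rows ever reach the data):

import glob
import os
import pandas as pd

frames = []
for f in glob.glob(os.path.join(r'D:\Inves\Catalogs\OSC', '*.txt')):
    # skiprows removes the header block and the '----' line up front;
    # skipfooter removes the trailing summary lines (python engine required)
    frames.append(pd.read_csv(f, skiprows=[*range(11), 12], skipfooter=3,
                              sep='\s+', engine='python'))

pd.concat(frames, ignore_index=True).to_csv(
    r'D:\Inves\Catalogs\Full_1988-2008.csv', index=False)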
