This is my first time using Python and I can't figure this out. I'm scraping data from a website, and pandas reads the values as object dtype even though they are numbers. I've tried all the approaches described here but keep getting errors. I want the precip column to be numeric, but I keep getting the following error: ValueError: invalid literal for int() with base 10: '4.364.36'

Script that scrapes the data from the website:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import requests
from bs4 import BeautifulSoup
# Get URL where data we want is located
URL = "https://climate.rutgers.edu/stateclim_v1/nclimdiv/"
# Scrape data from website
result = requests.get(URL)
soup = BeautifulSoup(result.content, 'lxml')
table = soup.find_all('table')[0]
df = pd.read_html(str(table))
df = pd.concat(df)  # Converts list to dataframe
# Reshape data from wide to long
df = pd.melt(df, id_vars='Year', var_name='Month', value_name="precip")
# Get rid of missing data
df.dropna(subset=["precip", "Year"], inplace=True)
# Filter dataframe to clean up for plotting
df = df[df["precip"].str.contains("M") == False]
df = df[df["Year"].str.contains("Max|Min|Count|Median|Normal|POR") == False]
  • There is a problem with the way the HTML is being parsed: values are being combined. You can't convert 4.364.36 to numeric because it's not a valid number. Commented Feb 10, 2022 at 18:34
  • @MatthewBorish Gotcha. Is there any way to have it parsed and not give that output? Commented Feb 10, 2022 at 18:41
  • Looking into it now. You will likely need a custom solution. Commented Feb 10, 2022 at 18:48
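
One possible shape for such a custom solution, offered only as a rough sketch: it assumes every broken cell is the same text repeated twice (as in '4.364.36'), which has not been verified against the actual page. The helper name `undouble` and the sample values are made up for illustration.

```python
import pandas as pd

def undouble(text):
    """Return the first half of `text` if it is one token repeated twice."""
    text = str(text).strip()
    half, remainder = divmod(len(text), 2)
    if remainder == 0 and half > 0 and text[:half] == text[half:]:
        try:
            float(text)  # already a valid number such as "11": leave it alone
        except ValueError:
            return text[:half]
    return text

# Made-up sample that mimics the doubled cells; not scraped from the site
raw = pd.Series(["4.364.36", "11.3711.37", "2.06", "MM", "M"])

# Collapse doubled cells, then convert; non-numeric markers like "M" become NaN
precip = pd.to_numeric(raw.map(undouble), errors="coerce")
print(precip.tolist())
```

The `float(text)` check is there so that a legitimate value made of two identical halves (for example "11") is not cut in half. It also means this sketch only suits columns with decimal values such as precip, not the Year column.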

1 Answer

The tables at your URL are structured in an unusual way, which is why the parser is struggling. As a workaround, you can copy the upper table to your clipboard and read it with this:

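import pandas as pd

# Read the copied table from the clipboard without treating the first row as a header
df = pd.read_clipboard(header=None)
# Keep the first 128 rows (1895-2022) and drop the first column
df = df.iloc[0:128, 1:]
df.columns = ['Year', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Annual']
# 'M' marks missing values; here they are replaced with 0
df = df.replace('M', 0)
# Convert every column to a numeric dtype
for c in df.columns:
    df[c] = pd.to_numeric(df[c])
print(df)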

     Year   Jan   Feb   Mar   Apr   May   Jun    Jul   Aug   Sep   Oct   Nov   Dec  Annual
0    1895  4.36  1.24  3.28  5.08  3.13  3.09   4.15  2.06  1.06  3.56  3.07  2.78   36.86
1    1896  1.61  6.88  5.65  1.35  3.54  5.49   5.38  1.68  4.25  2.41  3.12  1.21   42.57
2    1897  2.65  3.67  2.74  3.92  5.37  3.37  11.37  4.89  1.76  2.26  4.87  4.48   51.35
3    1898  4.10  3.45  3.15  3.58  6.77  2.07   4.63  5.45  2.05  5.51  6.60  3.63   50.99
4    1899  3.75  5.71  6.32  1.67  1.94  2.57   5.74  3.91  5.40  2.44  2.29  2.07   43.81
..    ...   ...   ...   ...   ...   ...   ...    ...   ...   ...   ...   ...   ...     ...
123  2018  2.72  6.08  4.64  4.17  5.80  3.30   5.91  5.56  7.57  4.46  8.65  5.90   64.76
124  2019  4.49  3.26  3.84  3.97  6.75  5.15   6.14  3.73  1.25  5.71  1.94  5.32   51.55
125  2020  2.29  2.79  3.61  3.98  2.47  3.05   6.69  6.09  4.41  5.03  4.09  5.35   49.85
126  2021  1.86  4.72  3.82  2.35  3.84  3.37   7.62  6.59  6.45  5.06  0.98  1.28   47.94
127  2022  3.45  0.00  0.00  0.00  0.00  0.00   0.00  0.00  0.00  0.00  0.00  0.00    0.00
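
Note that the 0.00 entries for 2022 come from replacing 'M' with 0, so months with no data yet look like months with no precipitation. If that matters for plotting or averaging, one alternative is to let `pd.to_numeric` turn anything non-numeric into NaN with `errors="coerce"`. A minimal sketch on a made-up two-row sample (the `sample` frame is hypothetical and only stands in for the clipboard data):

```python
import pandas as pd

# Hypothetical sample with "M" marking a missing month; not the real table
sample = pd.DataFrame({
    "Year": ["2021", "2022"],
    "Jan": ["1.86", "3.45"],
    "Feb": ["4.72", "M"],
})

# Convert every column; values that cannot be parsed (here "M") become NaN
numeric = sample.apply(pd.to_numeric, errors="coerce")
print(numeric)

# NaN values are skipped by default, so the missing month does not drag the mean down
print(numeric["Feb"].mean())
```

With this approach the `replace('M', 0)` step is not needed, and the missing months stay visibly missing instead of showing up as zeros.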

2 Comments

Thank you so much!
You're welcome! Sometimes a simple approach like this is a lot easier than trying to write custom parsing rules.
