I am trying to scrape time series data into a pandas DataFrame with Python 2.7 from this web page: http://owww.met.hu/eghajlat/eghajlati_adatsorok/bp/Navig/202_EN.htm. Could somebody please help me write the code? Thanks!

I tried the following code:

html = urllib.urlopen("http://owww.met.hu/eghajlat/eghajlati_adatsorok/bp/Navig/202_EN.htm")
text = html.read()
df = pd.DataFrame(index=datum, columns=['m_ta','m_tax','m_taxd', 'm_tan','m_tand'])

But it doesn't produce any output. I want to display the table as it appears on the page.

  • The main problem is that it looks like a table, but it is not a table. Commented Mar 29, 2016 at 9:26
  • @jezrael... so how can I write the code? Commented Mar 29, 2016 at 9:39

1 Answer

You can use BeautifulSoup to parse all the font tags, then split column a on whitespace, set the index from column idx with set_index, and remove the index name with rename_axis(None):

import pandas as pd
import urllib
from bs4 import BeautifulSoup

# download the page (urllib.urlopen is the Python 2 API)
html = urllib.urlopen("http://owww.met.hu/eghajlat/eghajlati_adatsorok/bp/Navig/202_EN.htm")
soup = BeautifulSoup(html)
#print soup

# find all font tags
fontTags = soup.findAll('font')
#print fontTags

# get the text from each font tag
li = [x.text for x in fontTags]

# skip the first 13 tags, which do not contain the data
df = pd.DataFrame(li[13:], columns=['a'])

# split each row on arbitrary whitespace
df = df.a.str.split(r'\s+', expand=True)

# set column names
df.columns = ['idx','m_ta','m_tax','m_taxd', 'm_tan','m_tand']

# convert column idx to a monthly period
df['idx'] = pd.to_datetime(df['idx']).dt.to_period('M')

# convert the date columns to datetime
df['m_taxd'] = pd.to_datetime(df['m_taxd'])
df['m_tand'] = pd.to_datetime(df['m_tand'])

# set column idx as the index and remove the index name
df = df.set_index('idx').rename_axis(None)
print df

         m_ta m_tax     m_taxd  m_tan     m_tand
1901-01  -4.7   5.0 1901-01-23  -12.2 1901-01-10
1901-02  -2.1   3.5 1901-02-06   -7.9 1901-02-15
1901-03   5.8  13.5 1901-03-20    0.6 1901-03-01
1901-04  11.6  18.2 1901-04-10    7.4 1901-04-23
1901-05  16.8  22.5 1901-05-31   12.2 1901-05-05
1901-06  21.0  24.8 1901-06-03   14.6 1901-06-17
1901-07  22.4  27.4 1901-07-30   16.9 1901-07-04
1901-08  20.7  25.9 1901-08-01   14.7 1901-08-29
1901-09  15.9  19.9 1901-09-01   11.8 1901-09-09
1901-10  12.6  17.9 1901-10-04    8.3 1901-10-31
1901-11   4.7  11.1 1901-11-14   -0.2 1901-11-26
1901-12   4.2   8.4 1901-12-22   -1.4 1901-12-07
1902-01   3.4   7.5 1902-01-25   -2.2 1902-01-15
1902-02   2.8   6.6 1902-02-09   -2.8 1902-02-06
1902-03   5.3  13.3 1902-03-22   -3.5 1902-03-13
1902-04  10.5  15.8 1902-04-21    6.1 1902-04-08
1902-05  12.5  20.6 1902-05-31    8.5 1902-05-10
1902-06  18.5  23.8 1902-06-30   14.4 1902-06-19
1902-07  20.2  25.2 1902-07-01   15.5 1902-07-03
1902-08  21.1  25.4 1902-08-07   14.7 1902-08-13
1902-09  16.1  23.8 1902-09-05    9.5 1902-09-24
1902-10  10.8  15.4 1902-10-12    4.9 1902-10-25
1902-11   2.4   9.1 1902-11-01   -4.2 1902-11-18
1902-12  -3.1   7.2 1902-12-27  -17.6 1902-12-15
1903-01  -0.5   8.3 1903-01-11  -11.5 1903-01-23
1903-02   4.6  13.4 1903-02-23   -2.7 1903-02-17
1903-03   9.0  16.1 1903-03-28    4.9 1903-03-09
1903-04   9.0  16.5 1903-04-29    2.6 1903-04-19
1903-05  16.4  21.2 1903-05-03   11.3 1903-05-19
1903-06  19.0  23.1 1903-06-03   15.6 1903-06-07
...       ...   ...        ...    ...        ...
1998-07  22.5  30.7 1998-07-23   15.0 1998-07-09
1998-08  22.3  30.5 1998-08-03   14.8 1998-08-29
1998-09  16.0  21.0 1998-09-12   10.4 1998-09-14
1998-10  11.9  17.2 1998-10-07    8.2 1998-10-27
1998-11   3.8   8.4 1998-11-05   -1.6 1998-11-21
1998-12  -1.6   6.2 1998-12-14   -8.2 1998-12-26
1999-01   0.6   4.7 1999-01-15   -4.8 1999-01-31
1999-02   1.5   6.9 1999-02-05   -4.8 1999-02-01
1999-03   8.2  15.5 1999-03-31    3.0 1999-03-16
1999-04  13.1  17.1 1999-04-16    6.1 1999-04-18
1999-05  17.2  25.2 1999-05-31   11.1 1999-05-06
1999-06  19.8  24.4 1999-06-07   12.2 1999-06-22
1999-07  22.3  28.0 1999-07-06   16.3 1999-07-23
1999-08  20.6  26.7 1999-08-09   17.3 1999-08-23
1999-09  19.3  22.9 1999-09-26   15.0 1999-09-02
1999-10  11.5  19.0 1999-10-03    5.7 1999-10-18
1999-11   3.9  12.6 1999-11-04   -2.2 1999-11-21
1999-12   1.3   6.4 1999-12-13   -8.1 1999-12-25
2000-01  -0.7   8.7 2000-01-31   -6.6 2000-01-25
2000-02   4.5  10.2 2000-02-01   -0.1 2000-02-23
2000-03   6.7  11.6 2000-03-09    0.6 2000-03-17
2000-04  14.8  22.1 2000-04-21    5.8 2000-04-09
2000-05  18.7  23.9 2000-05-27   12.3 2000-05-22
2000-06  21.9  29.3 2000-06-14   15.4 2000-06-17
2000-07  20.3  26.6 2000-07-03   14.0 2000-07-16
2000-08  23.8  29.7 2000-08-20   18.5 2000-08-31
2000-09  16.1  21.5 2000-09-14   12.7 2000-09-24
2000-10  14.1  18.7 2000-10-04    8.0 2000-10-23
2000-11   9.0  14.9 2000-11-15    3.7 2000-11-30
2000-12   3.0   9.4 2000-12-14   -6.8 2000-12-24

[1200 rows x 5 columns]
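
The splitting and indexing steps can also be tried offline. The sketch below is only an illustration: it assumes the text inside each font tag has the same whitespace-separated layout as the printed rows, and it uses three rows copied from the output above instead of downloading the page. It also converts the temperature columns with pd.to_numeric, since str.split returns strings; that extra step and the explicit format arguments are not in the code above.

```python
import pandas as pd

# Sample rows copied from the printed output; assumed to match the
# whitespace-separated text found in the page's font tags.
rows = [
    "1901-01  -4.7   5.0 1901-01-23  -12.2 1901-01-10",
    "1901-02  -2.1   3.5 1901-02-06   -7.9 1901-02-15",
    "1901-03   5.8  13.5 1901-03-20    0.6 1901-03-01",
]

df = pd.DataFrame(rows, columns=['a'])

# Split each line on runs of whitespace into separate columns.
df = df['a'].str.split(r'\s+', expand=True)
df.columns = ['idx', 'm_ta', 'm_tax', 'm_taxd', 'm_tan', 'm_tand']

# Monthly period index and datetime columns, with explicit formats.
df['idx'] = pd.to_datetime(df['idx'], format='%Y-%m').dt.to_period('M')
df['m_taxd'] = pd.to_datetime(df['m_taxd'], format='%Y-%m-%d')
df['m_tand'] = pd.to_datetime(df['m_tand'], format='%Y-%m-%d')

# The temperature columns are still strings after the split;
# convert them to floats so they can be used in calculations.
num_cols = ['m_ta', 'm_tax', 'm_tan']
df[num_cols] = df[num_cols].apply(pd.to_numeric)

# Use idx as the index and drop the index name.
df = df.set_index('idx').rename_axis(None)
print(df.dtypes)
```

With numeric columns, calls such as df['m_ta'].mean() work directly. The print(...) form with a single argument runs on both Python 2.7 and Python 3.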
2 Comments

Thank you so much!
You can also upvote my answer: click the small triangle above the 0, above the accept mark. Thank you.