Python - regex matching in HTML Body

Question

I need to extract the Device Time (e.g. 2012-01-17 13:12:09) from the text below using Python. Could you please tell me how I can do this with Python's standard regular expression library? Thanks.

  <html><head><style type="text/css">h1 {color:blue;}h2 {color:red;}</style>
  <h1>Device #1   Root Content</h1><h2>Device Addr: 127.0.0.1:8080</h1>
  <h2>Device Time: 2012-01-17 13:12:09</h2></body></html>

I think regex is appropriate in this context. He is extracting the Device Time, which has a perfectly regular format.
– Shiplu Mokaddim, Commented Jan 17, 2012 at 12:57

Tim Pietzcker · Accepted Answer · 2012-01-17 14:18:45Z

2

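Just to add:

import re
# html is assumed to hold the page source as a string
pattern = re.compile(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})')
first_match = pattern.search(html)
# search() returns None when there is no match, so check before calling group()
timestamp = first_match.group(1) if first_match else None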
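For what it's worth, here is a minimal sketch of the same pattern run against the sample from the question. The `html` variable name and the `datetime.strptime` conversion at the end are my own additions for illustration; the answer itself only covers the regex.

```python
import re
from datetime import datetime

# Sample page from the question, stored under an assumed variable name.
html = """<html><head><style type="text/css">h1 {color:blue;}h2 {color:red;}</style>
<h1>Device #1   Root Content</h1><h2>Device Addr: 127.0.0.1:8080</h1>
<h2>Device Time: 2012-01-17 13:12:09</h2></body></html>"""

pattern = re.compile(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})')
first_match = pattern.search(html)

# Guard against None in case the page has no timestamp.
device_time = first_match.group(1) if first_match else None
print(device_time)  # 2012-01-17 13:12:09

# Optional: convert the matched text into a datetime object.
parsed = None
if device_time is not None:
    parsed = datetime.strptime(device_time, "%Y-%m-%d %H:%M:%S")
    print(parsed.hour)  # 13
```

If the string is only needed for display, the `strptime` step can be skipped.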

edited Jan 17, 2012 at 14:18

Tim Pietzcker


answered Jan 17, 2012 at 12:53

Shadow


1 Comment

Tim Pietzcker Over a year ago

Although it happens to work here, it's better to use a raw string for regexes. I've edited your answer accordingly. If you get used to this convention, you can avoid a lot of grief later (for example, when your regex contains \b).
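To illustrate the point about `\b`, here is a small sketch (the `text` value is just a stand-in). In a normal string literal `"\b"` is a single backspace character, so the regex engine never sees a word-boundary token. `\d` only happens to survive because Python leaves unknown escapes alone.

```python
import re

text = "Device Time: 2012-01-17 13:12:09"

# "\b" in a normal string is one character (backspace); r"\b" is two (backslash, b).
print(len("\b"), len(r"\b"))  # 1 2

# Without the r prefix the pattern searches for literal backspace characters.
plain = re.search("\bTime\b", text)
raw = re.search(r"\bTime\b", text)

print(plain)         # None
print(raw.group(0))  # Time
```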

xueyumusic · Accepted Answer · 2012-01-17 12:57:27Z

1

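Maybe like this:

import re

# Named html rather than str, which would shadow the built-in type
html = """ Your HTML String here"""

# Capture the run of spaces, digits, hyphens and colons after the label
pattern = re.compile(r"""Device Time:([ \d\-:]*)""")
s = pattern.search(html)

time = s.group(1)  # still includes the leading space; .strip() removes it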

answered Jan 17, 2012 at 12:57

xueyumusic


3 Comments

F. Aydemir Over a year ago

How about parsing the time excluding the date? (e.g. 14:00:51) Thanks.

xueyumusic Over a year ago

Maybe add: day_time = time.strip().split(' ')[1]

F. Aydemir Over a year ago

The following does everything:

# html is the page source string
pattern = re.compile(r"""Device Time:([ \d\-:]*)""")
s = pattern.search(html)
time = s.group(1).strip()
print(time)

pattern = re.compile(r'(\d{4}-\d{2}-\d{2})')
s = pattern.search(time)
date_ = s.group(1)
print(date_)

pattern = re.compile(r'(\d{2}:\d{2}:\d{2})')
s = pattern.search(time)
hour = s.group(1)
print(hour)
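As a possible shortcut, the three searches in the comment above could be folded into one pattern with named groups. This is only a sketch: the group names `date` and `clock` and the one-line `html` stand-in are made up for the example.

```python
import re

# Stand-in for the page source; only the relevant line is included.
html = "<h2>Device Time: 2012-01-17 13:12:09</h2>"

# One pattern with two named groups instead of three separate searches.
pattern = re.compile(
    r"Device Time:\s*(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<clock>\d{2}:\d{2}:\d{2})"
)

date_ = hour = None
match = pattern.search(html)
if match:
    date_ = match.group("date")
    hour = match.group("clock")
    print(date_)  # 2012-01-17
    print(hour)   # 13:12:09
```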

bw_üezi · Accepted Answer · 2012-01-17 12:57:58Z

1

Try this regex

Device Time: ([^<]+)

This returns everything after the words "Device Time: " up to the start of the next HTML tag. As shown in another answer, you could also search for a more specific date-time format.
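In Python, this regex could be used roughly as follows. It is a sketch with a shortened stand-in for the page, and the variable names are mine.

```python
import re

html = "<h2>Device Addr: 127.0.0.1:8080</h1>\n<h2>Device Time: 2012-01-17 13:12:09</h2>"

# [^<]+ consumes one or more characters that are not "<",
# so the capture stops right before the closing tag.
match = re.search(r"Device Time: ([^<]+)", html)
value = match.group(1) if match else None
print(value)  # 2012-01-17 13:12:09
```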

In general it's considered bad practice to parse HTML files with regex. However, your example is more like parsing normal text that happens to be part of an HTML file, so in this case it's fine. ;-)
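If the markup ever does start to matter, one option (not something the question requires) is to let the standard library's `html.parser` pull out the text nodes first and apply the regex only to those. A rough sketch follows; the `TextCollector` class is a hypothetical helper written for this example.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects every text node and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


html = """<html><head><style type="text/css">h1 {color:blue;}h2 {color:red;}</style>
<h1>Device #1   Root Content</h1><h2>Device Addr: 127.0.0.1:8080</h1>
<h2>Device Time: 2012-01-17 13:12:09</h2></body></html>"""

collector = TextCollector()
collector.feed(html)
collector.close()

# Apply the regex to plain text only, never to the tags themselves.
pattern = re.compile(r"Device Time:\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
device_time = None
for chunk in collector.chunks:
    found = pattern.search(chunk)
    if found:
        device_time = found.group(1)
        break

print(device_time)  # 2012-01-17 13:12:09
```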

edited Jan 17, 2012 at 12:57

answered Jan 17, 2012 at 12:51

bw_üezi


Shiplu Mokaddim · Accepted Answer · 2012-01-17 14:57:12Z

1

You need this regex.

/Device Time: (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})/

or this,

/Device Time: (\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d)/

Use this regular expression with the global switch on.
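Python's `re` module has no global switch as such; the closest equivalents are `findall()` and `finditer()`, which return every match. A small sketch, with two made-up device blocks so there is more than one match:

```python
import re

# Two hypothetical device blocks, to show more than one match.
html = (
    "<h2>Device Time: 2012-01-17 13:12:09</h2>"
    "<h2>Device Time: 2012-01-17 14:00:51</h2>"
)

pattern = re.compile(r"Device Time: (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

# With a single capture group, findall() returns a list of the captured strings.
all_times = pattern.findall(html)
print(all_times)  # ['2012-01-17 13:12:09', '2012-01-17 14:00:51']
```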

edited Jan 17, 2012 at 14:57

answered Jan 17, 2012 at 12:49

Shiplu Mokaddim


3 Comments

F. Aydemir Over a year ago

The printed result is None when I do this. Any suggestions? Thanks.

Tim Pietzcker Over a year ago

Probably because of /.../gi delimiters/modifiers which don't work this way in Python.

Shiplu Mokaddim Over a year ago

I am not a Python expert, so I tried to provide a standard regex. Fixed it now.
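To make the delimiter issue from the comments concrete, here is a short sketch (the one-line `html` is a stand-in). Python passes the pattern as a plain string, so a leading and trailing `/` are matched as literal slash characters, and the search comes back as None.

```python
import re

html = "<h2>Device Time: 2012-01-17 13:12:09</h2>"

# Copied with the /.../ delimiters: the slashes are treated as literal characters.
with_slashes = re.search(
    r"/Device Time: (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})/", html
)
print(with_slashes)  # None

# Same pattern without the delimiters.
without_slashes = re.search(
    r"Device Time: (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})", html
)
print(without_slashes.group(1))  # 2012-01-17 13:12:09
```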
