Regular expression in Python does not match non-greedily

Question

I have a string pattern:

"(https?://finance\.sina\.com\.cn/.+?shtml)"

and I use the findall method of re to match some content, but the result contains:

'http://finance.sina.com.cn/nmetal/" target="_blank" style="margin-right:7px">黄金</a><a href="http://finance.sina.com.cn/futures/quotes/CL.shtml'

I have used the non-greedy operator, but the result is still wrong. Where am I going wrong?

Nick · Accepted Answer · 2019-11-23 11:17:21Z


Your problem is that the first part of your regex:

https?://finance\.sina\.com\.cn/

matches the URL in the first <a> tag, and the second part

.+?shtml

then matches until it sees the .shtml in the second <a> tag, because there is no .shtml in the first href. A non-greedy quantifier only makes the match end as early as possible; it does not move the start of the match, which is still the first URL. Ideally you should use a DOM parser to parse HTML; then you would not run into this problem. In the interim, changing .+? into [^'"]+ so that part of the regex cannot go past the end of the current href will solve your problem, i.e.

(https?://finance\.sina\.com\.cn/[^'"]+shtml)
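A minimal sketch of both suggestions, assuming markup modeled on the output shown in the question (the closing tags and link text are made up for illustration). The HrefCollector class is a hypothetical helper built on the standard library's html.parser, not something from the original post.

```python
import re
from html.parser import HTMLParser

# Sample markup modeled on the question's output (closing tags are assumed).
html = (
    '<a href="http://finance.sina.com.cn/nmetal/" target="_blank" '
    'style="margin-right:7px">黄金</a>'
    '<a href="http://finance.sina.com.cn/futures/quotes/CL.shtml">CL</a>'
)

# Lazy .+? still starts at the first URL and grows until it first sees "shtml".
lazy = re.findall(r'(https?://finance\.sina\.com\.cn/.+?shtml)', html)

# The negated class cannot cross a quote, so the first href cannot match.
fixed = re.findall(r'''(https?://finance\.sina\.com\.cn/[^'"]+shtml)''', html)


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.hrefs.extend(v for k, v in attrs if k == 'href' and v)


collector = HrefCollector()
collector.feed(html)
# Filter the parsed attribute values instead of scanning raw markup.
shtml_links = [h for h in collector.hrefs if h.endswith('.shtml')]

print(lazy)         # one over-long match spanning both tags
print(fixed)        # only the CL.shtml URL
print(shtml_links)  # only the CL.shtml URL
```

With a parser, each href value is already isolated, so a simple suffix check is enough and quoting style no longer matters.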

edited Nov 23, 2019 at 11:17

answered Nov 23, 2019 at 9:52


2 Comments

littlely Over a year ago

This also has a problem. For example:

a = "http://finance.sina.com.cn/money/gold/AUTD/quote.shtml',name: '黄金',symbol: 'hf_AUTD',type: 'hf' },\n\t\t\t\t\t\t\t{ link: 'http://finance.sina.com.cn/money/gold/AGTD/quote.shtml'"
b = re.findall(r'(https?://finance\.sina\.com\.cn/[^"]+shtml)', a)
print(b)

The result is still wrong.

Nick Over a year ago

@littlely I wasn't sure whether you might have hrefs with single quotes as well. See my edit and demo
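As a rough check of the single-quote case, the sketch below runs the string from the comment above against both character classes. The variable names double_only and both_quotes are illustrative only.

```python
import re

# The string from the comment: URLs delimited by single quotes.
a = ("http://finance.sina.com.cn/money/gold/AUTD/quote.shtml',name: '黄金',"
     "symbol: 'hf_AUTD',type: 'hf' },\n\t\t\t\t\t\t\t"
     "{ link: 'http://finance.sina.com.cn/money/gold/AGTD/quote.shtml'")

# [^"]+ never meets a double quote here, so it runs on to the last "shtml".
double_only = re.findall(r'(https?://finance\.sina\.com\.cn/[^"]+shtml)', a)

# Excluding both quote characters stops each match at its own delimiter.
both_quotes = re.findall(r'''(https?://finance\.sina\.com\.cn/[^'"]+shtml)''', a)

print(len(double_only))  # a single match covering both URLs
print(both_quotes)       # the two URLs as separate matches
```

The triple-quoted raw string is used only so that the pattern can contain both quote characters without escaping.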
