I have a string pattern:

"(https?://finance\.sina\.com\.cn/.+?shtml)"

I use re.findall to search some content with it, but the result contains:

'http://finance.sina.com.cn/nmetal/" target="_blank" style="margin-right:7px">黄金</a><a href="http://finance.sina.com.cn/futures/quotes/CL.shtml'

I used the non-greedy operator, but the match is still wrong. What am I doing wrong?

1 Answer

Your problem is that the first part of your regex:

https?://finance\.sina\.com\.cn/

matches the URL in the first <a> tag, and the second part

.+?shtml

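then matches until it sees the shtml in the second <a> tag, because there is no shtml in the first href. A non-greedy quantifier only makes the match as short as possible from the position where it already started; it does not move the starting point forward to find a shorter overall match. Ideally you should be using a DOM parser to parse HTML; then you couldn't run into this problem. In the interim, changing .+? into [^'"]+ so that part of the regex cannot go past the closing quote of the current href will solve your problem, i.e.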

(https?://finance\.sina\.com\.cn/[^'"]+shtml)
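
To make the difference concrete, here is a minimal sketch in Python. The html and js strings are assumptions reconstructed from the bad match in the question and the snippet in the comment below (the link text 原油 is made up), and HrefCollector is a hypothetical helper built on the standard library's html.parser, shown only as one possible way to follow the "use a parser" advice.

```python
import re
from html.parser import HTMLParser

# Assumed inputs, reconstructed from the outputs quoted in this thread.
html = (
    '<a href="http://finance.sina.com.cn/nmetal/" target="_blank" '
    'style="margin-right:7px">黄金</a>'
    '<a href="http://finance.sina.com.cn/futures/quotes/CL.shtml">原油</a>'
)
js = (
    "{ link: 'http://finance.sina.com.cn/money/gold/AUTD/quote.shtml',"
    "name: '黄金' },\n"
    "{ link: 'http://finance.sina.com.cn/money/gold/AGTD/quote.shtml' }"
)

# Original pattern: lazy, but free to run past the end of the first href.
lazy = re.compile(r"(https?://finance\.sina\.com\.cn/.+?shtml)")
# Suggested pattern: the path may not contain either kind of quote.
bounded = re.compile(r"""(https?://finance\.sina\.com\.cn/[^'"]+shtml)""")

lazy_html = lazy.findall(html)        # one match spanning both <a> tags
bounded_html = bounded.findall(html)  # only the CL.shtml URL
bounded_js = bounded.findall(js)      # both quote.shtml URLs, separately


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


collector = HrefCollector()
collector.feed(html)
prefixes = ("http://finance.sina.com.cn/", "https://finance.sina.com.cn/")
parsed_links = [
    h for h in collector.hrefs if h.startswith(prefixes) and h.endswith("shtml")
]

print(lazy_html)
print(bounded_html)
print(bounded_js)
print(parsed_links)
```

The parser route only helps when the URLs live in real HTML attributes; for links embedded in JavaScript text, as in the comment below, the bounded regex is the one that applies.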

2 Comments

It still has a problem. For example:

a = "http://finance.sina.com.cn/money/gold/AUTD/quote.shtml',name: '黄金',symbol: 'hf_AUTD',type: 'hf' },\n\t\t\t\t\t\t\t{ link: 'http://finance.sina.com.cn/money/gold/AGTD/quote.shtml'"
b = re.findall(r'(https?://finance\.sina\.com\.cn/[^"]+shtml)', a)
print(b)

The result is still wrong.
@littlely I wasn't sure whether your hrefs might use single quotes as well. See my edit and demo.
