I want to extract the bot name together with its version from user-agent strings. I tried using the split function, but the format of the user-agent string differs from one crawler to another. What is the best way to get my expected output? (Please note that I need a general solution.)

Input (user-agent strings)

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; AhrefsBot/4.0; +http://ahrefs.com/robot/)
msnbot/2.0b (+http://search.msn.com/msnbot.htm)

Expected output

Googlebot/2.1
AhrefsBot/4.0
msnbot/2.0b
1 Answer

Try the following:

import re

lines = [
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    'Mozilla/5.0 (compatible; AhrefsBot/4.0; +http://ahrefs.com/robot/)',
    'msnbot/2.0b (+http://search.msn.com/msnbot.htm)'
]

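# One or more word characters ending in "bot", then "/" and a version made of
# word characters and dots. A raw string keeps \w from being read as a string escape.
botname = re.compile(r'\w+bot/[.\w]+', flags=re.IGNORECASE)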
for line in lines:
    matched = botname.search(line)
    if matched:
        print(matched.group())

This prints:

Googlebot/2.1
AhrefsBot/4.0
msnbot/2.0b

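This assumes that the bot's agent name ends in "bot" and is immediately followed by "/" and the version (for example Googlebot/2.1). Crawlers whose names do not follow that pattern will not be matched.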
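Since the question asks for a general solution, here is a rough sketch that does not require the name to end in "bot". It scans every name/version token in the string and returns the first one whose name contains a crawler-like keyword. The names PRODUCT, BOT_HINT and extract_bot, and the keyword list itself (bot, crawl, spider, slurp), are my own assumptions for illustration and not an exhaustive or standard list, so extend them for the crawlers you actually see.

```python
import re

# A token of the form name/version; the version must start with a digit,
# which keeps URL fragments such as "com/bot.html" from matching.
PRODUCT = re.compile(r'([A-Za-z][\w-]*)/(\d[\w.]*)')

# Hypothetical keywords suggesting that a product name belongs to a crawler.
BOT_HINT = re.compile(r'bot|crawl|spider|slurp', re.IGNORECASE)


def extract_bot(user_agent):
    """Return 'name/version' for the first crawler-like token, or None."""
    for match in PRODUCT.finditer(user_agent):
        if BOT_HINT.search(match.group(1)):
            return match.group(0)
    return None


lines = [
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    'Mozilla/5.0 (compatible; AhrefsBot/4.0; +http://ahrefs.com/robot/)',
    'msnbot/2.0b (+http://search.msn.com/msnbot.htm)',
    'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)',
]
for line in lines:
    print(extract_bot(line))
```

Tokens such as Mozilla/5.0 are skipped because the name contains none of the keywords. A crawler that announces itself without a name/version token gives None, so treat this as a starting point, not a complete detector.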

1 Comment

Thanks falsetru! :) This is what I expected!
