0

I have a lot of long strings. They don't all have the same length and content, so I can't use fixed indices, and I want to extract a substring from each of them. This is what I want to extract:

http://www.someDomainName.com/anyNumber 

someDomainName doesn't contain any digits, and anyNumber is different in each long string. The code should extract the desired substring from any possible input and should cope with spaces and any other odd thing that might appear in the long string. That should be possible with a regex, right? Could anybody help me with this? Thank you.

Update: I should have said that www. and .com are always the same, and so is someDomainName. But there's another http://www. elsewhere in the string.

6
  • What about "www." and ".com"? Commented Sep 30, 2012 at 17:01
  • That's always the same, luckily! Commented Sep 30, 2012 at 17:02
  • As in "I don't care about them"? Commented Sep 30, 2012 at 17:02
  • No! I mean they are always www. and .com. See my update please. Commented Sep 30, 2012 at 17:05
  • That still doesn't answer my question. Commented Sep 30, 2012 at 17:05

4 Answers

2
import re
results = re.findall(r'\bhttp://www\.someDomainName\.com/\d+\b', long_string)
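
As a quick check of the pattern above, here is a minimal sketch run against a made-up input. The contents of long_string, including the second http://www. URL and the numbers, are assumptions chosen for illustration and do not come from the question.

```python
import re

# Hypothetical input: the surrounding text, the other URL and the
# numbers are invented for this example.
long_string = (
    "noise http://www.other.com/page  more text "
    "http://www.someDomainName.com/12345 trailing, "
    "http://www.someDomainName.com/678."
)

# Same pattern as above: literal domain with escaped dots, then digits.
pattern = re.compile(r'\bhttp://www\.someDomainName\.com/\d+\b')

# findall returns every non-overlapping match as a list of strings.
results = pattern.findall(long_string)
print(results)
```

The other http://www. URL is skipped because the domain is spelled out literally. Note that the trailing \b means a number directly followed by a letter or underscore would not match at all.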

1 Comment

Thanks. Exactly what I wanted.
1
>>> import re
>>> pattern = re.compile("(http://www\\.)(\\w*)(\\.com/)(\\d+)")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
        print(matches.group(0))
        print(matches.group(1))
        print(matches.group(2))
        print(matches.group(3))
        print(matches.group(4))

http://www.someDomainName.com/2134
http://www.
someDomainName
.com/
2134

The pattern above has four capturing groups, so together with group 0 there are five groups to read:

  • Group 0 is the complete string that was matched
  • Groups 1 to 4 follow the order of the parentheses in the pattern, so the one you are looking for is group 2: (\\w*)

If you want, you can capture only the part of the string you are interested in. Remove the parentheses from the parts of the pattern you don't need and keep only (\\w*):

>>> pattern = re.compile("http://www\\.(\\w*)\\.com/\\d+")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
       print(matches.group(1))

someDomainName

In the above example there are no groups 2, 3 and 4 as in the previous one, because only one group is captured. Group 0 is still always available: it is the complete string that matched.
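
If the domain name is fixed and only the number varies, the same idea can be turned around: write the domain literally and put the parentheses around the digits instead. This is only a sketch. The sample input string, including the extra http://www. URL, is an assumption made up for illustration.

```python
import re

# Assumption: the domain is always someDomainName, so it is written
# literally (with escaped dots) and only the number is captured.
pattern = re.compile(r"http://www\.someDomainName\.com/(\d+)")

# Hypothetical input containing another http://www. URL first.
text = "see http://www.other.com/1 and http://www.someDomainName.com/2134 here"

match = pattern.search(text)
if match:
    print(match.group(0))  # the complete matched URL
    print(match.group(1))  # only the digits after the slash
```

Here group 0 is the whole URL and group 1 is just the number, so either can be used depending on which part is needed.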

5 Comments

Are you sure this works for every string? It doesn't match anything in my case. Also, how do I use a fixed string instead of \w*? I know the name, so there's no need for a wildcard.
Only the number is variable each time.
What input string are you using? As I showed, it matches in my case: variable number, any domain name.
If your domain name is fixed, you can replace (\\w*) with the domain name itself, someDomainName, and it will match.
J.F. Sebastian's answer solved my problem. Thank you though for your explanation and time.
0

Yeah, your simplest bet is a regex. Here's something that will probably get the job done:

import re
matcher = re.compile(r'www\.(.+)\.com/(.+)')
matches = matcher.search(yourstring)
if matches:
    str1,str2 = matches.groups()


0

If you are sure that there are no dots in someDomainName, you can just find the first occurrence of the string ".com/" and take everything from that index on.

This avoids regexes, which are harder to maintain.

exp = 'http://www.aejlidjaelidjl.com/alieilael'
print(exp[exp.find('.com/')+5:])
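
One thing to watch with this approach: str.find returns -1 when the marker is missing, and the slice then silently returns the wrong text instead of failing. A minimal sketch of a guarded version follows. The helper name after_dot_com is hypothetical and introduced only for this example.

```python
def after_dot_com(text):
    """Return what follows the first '.com/', or None if it is absent."""
    marker = '.com/'
    index = text.find(marker)  # -1 when the marker does not occur
    if index == -1:
        return None
    return text[index + len(marker):]

exp = 'http://www.aejlidjaelidjl.com/alieilael'
print(after_dot_com(exp))
print(after_dot_com('no url here'))
```

This still returns everything up to the end of the string, and it uses the first ".com/" it finds, so it only fits inputs where the wanted URL comes first and nothing follows the number.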

