I would like suggestions on extracting a substring from a range of URLs. The code I'm writing should extract this piece of info (the actual id of the URL) from URLs in incoming events from our web tracker.
Take these URLs (the URLs that contain the substrings I'm looking for are in the format of the first three):
https://www.rbnett.no/sport/i/LA8gxP/_
https://www.itromso.no/sport/sprek/i/GGobq6/derfor-vraker-tromsoes-beste-loeper-sesongens-eneste-konkurranse-det-er-for-risikabelt-aa-delta
https://www.adressa.no/sport/fotball/i/9vyQGW/brann-treneren-ferdig-avsluttet-pressekonferansen-med-aa-sitere-max-manus
https://www.rbnett.no/dakapo/banner/
https://www.adressa.no/search/
I want to extract the substrings "LA8gxP", "GGobq6" and "9vyQGW" from the first three URLs respectively, without hitting "dakapo", "banner" or "search" in the last two.
I'm asking for suggestions on a regexp to extract that piece of info. As far as I know, the substrings only contain a-z, A-Z, and 0-9. The substrings seem to be only 6 chars long, but that will probably change over time.
The best solution (using Python) I have found so far is this:
import re
match = re.search(r"/i/([a-zA-Z0-9]+)/", url)
substring = match.group(1) if match else None  # re.search returns None when the URL has no /i/<id>/ part
It works, but I don't find it to be very elegant.
Also, it relies on /i/ always being present as a prefix. Even though it looks like a consistent pattern, I'm not 100% sure that it is.
Then again, beyond my /i/ attempt, I don't really see any other syntax that seems as reliable as /i/.
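For reference, a slightly more defensive version of the same idea might look like the sketch below. It still assumes that the id is the alphanumeric path segment directly after /i/, which is exactly the part I'm unsure about. The function name extract_id and the constant ID_PATTERN are made up for illustration. It only looks at the path of the URL (so a query string cannot match by accident), accepts an id at the very end of the path, and returns None instead of raising when there is no id.

```python
import re
from urllib.parse import urlsplit

# Assumption: the id is the alphanumeric path segment right after "/i/".
# The id must be followed by "/" or by the end of the path.
ID_PATTERN = re.compile(r"/i/([A-Za-z0-9]+)(?:/|$)")


def extract_id(url):
    """Return the id that follows /i/ in the URL path, or None if there is none."""
    path = urlsplit(url).path  # ignore scheme, host, query string and fragment
    match = ID_PATTERN.search(path)
    return match.group(1) if match else None


urls = [
    "https://www.rbnett.no/sport/i/LA8gxP/_",
    "https://www.adressa.no/search/",
]
for url in urls:
    print(url, "->", extract_id(url))
```

With the example URLs above, this gives "LA8gxP", "GGobq6" and "9vyQGW" for the first three and None for the last two.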