bash script on specific URL string manipulation

Question

I need to manipulate a string (a URL) whose length I don't know.

the string is something like

https://x.xx.xxx.xxx/dontcare1/dontcare2/dontcareN/keyword/restofstring

I basically need a regular expression which returns this:

https://x.xx.xxx.xxx/keyword/restofstring

where x.xx.xxx.xxx is the current IP, which can vary every time, and I don't know the number of dontcares.

I actually have no idea how to do it, been 2 hours on the problem but didn't find a solution.

thanks!

It's good that you are putting in effort to solve your own problem. On SO we encourage all members to include their attempts in their posts, so kindly add yours and let us know.
– RavinderSingh13, Commented Apr 29, 2019 at 11:31
There are a lot of answers here. If one of them solved your problem you should accept it. If your problem wasn't solved, edit your question and point out why.
– Socowi, Commented May 9, 2019 at 6:49

Socowi · Accepted Answer · 2019-04-29 11:41:48Z

1

You can use sed as follows:

sed -E 's=(https://[^/]*).*(/keyword/.*)=\1\2='

s stands for substitute and has the form s=search pattern=replacement pattern=.
The search pattern is a regex in which we grouped (...) the parts you want to extract.
The replacement pattern accesses these groups with \1 and \2.

You can feed a file or stdin to sed and it will process the input line by line.
If you have a string variable and use bash, zsh, or something similar you also can feed that variable directly into stdin using <<<.

Example usage for bash:

input='https://x.xx.xxx.xxx/dontcare1/dontcare2/dontcareN/keyword/restofstring'
output="$(sed -E 's=(https://[^/]*).*(/keyword/.*)=\1\2=' <<< "$input")"
echo "$output" # prints https://x.xx.xxx.xxx/keyword/restofstring
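As a minimal sketch of how the same substitution could be reused, the command can be wrapped in a small function. The name strip_to_keyword and the idea of passing the keyword as a second argument are assumptions for illustration, not part of the answer above. The sketch also assumes the keyword contains no characters that are special to sed or to regular expressions (such as =, ., or *).

```bash
# Hypothetical helper: keep scheme and host, drop everything up to /<keyword>/.
# $1 = URL, $2 = keyword (assumed to contain no regex or sed special characters).
strip_to_keyword() {
    # Double quotes let the shell expand $2; \1 and \2 still reach sed unchanged.
    printf '%s\n' "$1" | sed -E "s=(https://[^/]*).*(/$2/.*)=\1\2="
}

# Several unwanted path components
strip_to_keyword 'https://10.0.0.1/a/b/c/keyword/restofstring' keyword
# No unwanted components at all: .* matches the empty string
strip_to_keyword 'https://10.0.0.1/keyword/restofstring' keyword
```

Both calls print https://10.0.0.1/keyword/restofstring. Because .* is greedy, a URL that contains /keyword/ more than once is cut at the last occurrence.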

edited Apr 29, 2019 at 11:41

answered Apr 29, 2019 at 11:35

Socowi


TheNavigat · Accepted Answer · 2019-04-29 11:44:02Z

0

echo "https://x.xx.xxx.xxx/dontcare1/dontcare2/dontcareN/keyword/restofstring" | sed "s/dontcare[0-9]\+\///g"

sed is used to manipulate text. dontcare[0-9]\+\/ is the escaped form of the regular expression dontcare[0-9]+/, which matches the word "dontcare" followed by one or more digits, followed by the / character.

sed's substitute command has the form s/find/replace/g, where the g flag replaces every match on a line instead of only the first.

You can see that regular expression in action here.

Note that this assumes there are no dontcareNs in the rest of the string. If there are, Socowi's answer works better.
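For comparison, here is a small sketch of the same substitution written both with basic regular expressions (the sed default, where GNU sed accepts \+ as "one or more") and with extended regular expressions (sed -E, where a bare + is the quantifier). The sample URL uses dontcare3 rather than the literal placeholder dontcareN, because [0-9] only matches digits. The variable names are illustrative, and # as the delimiter in the second command is just a way to avoid escaping the slash.

```bash
url='https://x.xx.xxx.xxx/dontcare1/dontcare2/dontcare3/keyword/restofstring'

# Basic regular expressions: + must be written \+ (a GNU sed extension),
# and / must be escaped because it is also the delimiter.
bre=$(printf '%s\n' "$url" | sed 's/dontcare[0-9]\+\///g')

# Extended regular expressions: + needs no backslash; '#' is the delimiter,
# so the slash in the pattern needs no escaping either.
ere=$(printf '%s\n' "$url" | sed -E 's#dontcare[0-9]+/##g')

echo "$bre"
echo "$ere"
```

With GNU sed both lines print https://x.xx.xxx.xxx/keyword/restofstring.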

edited Apr 29, 2019 at 11:44

answered Apr 29, 2019 at 11:42

TheNavigat


3 Comments

TheNavigat Over a year ago

That is correct, but replacing dontcare with any string should work. For \+, I had to escape it locally for the command to work correctly. According to tldp.org/LDP/abs/html/special-chars.html it might be interpreted as an arithmetic operator, but it also clearly says that it should work here: tldp.org/LDP/abs/html/x17129.html#PLUSREF

Socowi Over a year ago

Ah, I see. I didn't think of the difference between sed and sed -E. For sed -E \+ is a literal, but for sed it's a quantifier. However, "replacing dontcare with any string" might be a bit harder than it seems. I guess OP expects to match anything other than keyword instead of dontcare, so you have to invert the regex /keyword/ which is not so easy in sed.

TheNavigat Over a year ago

Ah! I see what you mean. I am not exactly sure what is the approach explained by the OP here, then. If the intention is to target the position of /keyword/, then your answer is correct (and mine isn't), but if the intention is to find /dontcareN/ then mine should work. The format of the question doesn't clearly explain which is intended, at least to me.

Paul Hodges · Accepted Answer · 2019-04-29 13:20:43Z

0

You could also use read with a / value for $IFS to parse out the trash.

$: IFS=/ read proto trash url trash trash trash keyword rest <<< "https://x.xx.xxx.xxx/dontcare1/dontcare2/dontcareN/keyword/restofstring"
$: echo "$proto//$url/$keyword/$rest"
https://x.xx.xxx.xxx/keyword/restofstring

This is more general when the dontcare... values aren't known, predictable strings, but as written it assumes exactly three of them, one per trash field.

This one is pure bash, though I like Socowi's answer better.
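When the number of unwanted components is not known in advance, a pure-shell variant can use parameter expansion instead of read. This is only a sketch under the assumption that the keyword is the literal string keyword and that it appears in the URL; the variable names are made up for illustration.

```bash
input='https://x.xx.xxx.xxx/dontcare1/dontcare2/dontcareN/keyword/restofstring'

scheme=${input%%://*}      # everything before the first "://"
hostpath=${input#*://}     # everything after the first "://"
host=${hostpath%%/*}       # up to the first slash: the host part
rest=${input#*/keyword/}   # everything after the first "/keyword/"

output="$scheme://$host/keyword/$rest"
echo "$output"             # prints https://x.xx.xxx.xxx/keyword/restofstring
```

If /keyword/ does not occur in the input, ${input#*/keyword/} leaves the string unchanged, so a real script would want to check for that case first.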

answered Apr 29, 2019 at 13:20

Paul Hodges


tripleee · Accepted Answer · 2019-04-29 14:41:40Z

0

Here's a sed variation which picks out the host part and the last two components from the path.

url='http://example.com:1234/ick/poo/bar/quux/fnord'
newurl=$(echo "$url" | sed 's%\(https*://[^/?]*[^?/]\)[^ <>'"'"'"]*/\([^/ <>'"'"'"]*/[^/ <>'"'"'"]*\)%\1/\2%')

The general form is sed 's%pattern%replacement%' where the pattern matches through the end of the host name part (captured into one set of backslashed parentheses), then skips through the penultimate slash, then captures the remainder of the URL including the last slash. The replacement recalls the two captured groups, joined by a single slash, without the skipped part between them. Each '"'"' sequence closes the single-quoted string, adds a literal single quote, and reopens it, so every bracket expression also excludes both kinds of quote.

edited Apr 29, 2019 at 14:41

answered Apr 29, 2019 at 13:58

tripleee
