0

I need to manipulate a string (a URL) whose length I don't know.

The string looks something like this:

https://x.xx.xxx.xxx/dontcare1/dontcare2/dontcareN/keyword/restofstring

I basically need a regular expression which returns this:

https://x.xx.xxx.xxx/keyword/restofstring, where x.xx.xxx.xxx is the current IP, which can vary every time, and I don't know the number of dontcares.

I actually have no idea how to do it; I've spent 2 hours on the problem but haven't found a solution.

thanks!

2
  • Good that you are putting effort to solve your own problem, on SO we do encourage all members to add their efforts in their posts, so kindly do so and let us know then. Commented Apr 29, 2019 at 11:31
  • There are a lot of answers here. If one of them solved your problem you should accept it. If your problem wasn't solved, edit your question and point out why. Commented May 9, 2019 at 6:49

4 Answers 4

1

You can use sed as follows:

sed -E 's=(https://[^/]*).*(/keyword/.*)=\1\2='

s stands for substitute and has the form s=search pattern=replacement pattern=.
The search pattern is a regex in which we grouped (...) the parts you want to extract.
The replacement pattern accesses these groups with \1 and \2.

You can feed a file or stdin to sed and it will process the input line by line.
If you have a string variable and use bash, zsh, or something similar, you can also feed that variable directly into stdin using a here-string (<<<).

Example usage for bash:

input='https://x.xx.xxx.xxx/dontcare1/dontcare2/dontcareN/keyword/restofstring'
output="$(sed -E 's=(https://[^/]*).*(/keyword/.*)=\1\2=' <<< "$input")"
echo "$output" # prints https://x.xx.xxx.xxx/keyword/restofstring
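
Since sed works line by line, the same command can also clean up a whole list of URLs at once. The following is only a sketch: the sample addresses and the variable names urls and result are made up for illustration. It also shows that a line without /keyword/ is passed through unchanged.

```bash
# Hypothetical sample input: two URLs with different numbers of path segments,
# plus one line that has no /keyword/ component.
urls='https://1.22.333.444/a/keyword/rest1
https://1.22.333.444/a/b/c/d/keyword/rest2
https://1.22.333.444/no/match/here'

# sed applies the substitution to each line; non-matching lines are printed as-is.
result="$(printf '%s\n' "$urls" | sed -E 's=(https://[^/]*).*(/keyword/.*)=\1\2=')"
printf '%s\n' "$result"
```

Because .* is greedy, a line containing /keyword/ more than once keeps only the part starting at the last occurrence.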

0

echo "https://x.xx.xxx.xxx/dontcare1/dontcare2/dontcareN/keyword/restofstring" | sed "s/dontcare[0-9]\+\///g"

sed is used to manipulate text. In s/dontcare[0-9]\+\///g, the search pattern dontcare[0-9]\+\/ is the escaped form of the regular expression dontcare[0-9]+/, which matches the word "dontcare" followed by one or more digits, followed by the / character.

sed's substitute command has the form s/find/replace/g, where the g flag makes sed replace every match on a line instead of only the first one.

You can see that regular expression in action here.

Note that this assumes there are no dontcareNs in the rest of the string. If there are, Socowi's answer works better.
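
To make the effect of the g flag and of the escaping visible, here is a small sketch using sed -E, where + is a quantifier without the backslash. It assumes the dontcare segments really do end in digits (dontcareN is replaced by dontcare3 here), and it uses # as the delimiter so the slash needs no escaping. The variable names all and first are just for illustration.

```bash
input='https://x.xx.xxx.xxx/dontcare1/dontcare2/dontcare3/keyword/restofstring'

# With g: every dontcare<digits>/ segment on the line is removed.
all="$(printf '%s\n' "$input" | sed -E 's#dontcare[0-9]+/##g')"

# Without g: only the first matching segment is removed.
first="$(printf '%s\n' "$input" | sed -E 's#dontcare[0-9]+/##')"

echo "$all"   # https://x.xx.xxx.xxx/keyword/restofstring
echo "$first" # https://x.xx.xxx.xxx/dontcare2/dontcare3/keyword/restofstring
```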

3 Comments

That is correct, but replacing dontcare with any string should work. For \+, I had to escape it locally for the command to work correctly. According to tldp.org/LDP/abs/html/special-chars.html it might be interpreted as an arithmetic operator, but it also clearly says that it should work here: tldp.org/LDP/abs/html/x17129.html#PLUSREF
Ah, I see. I didn't think of the difference between sed and sed -E. For sed -E \+ is a literal, but for sed it's a quantifier. However, "replacing dontcare with any string" might be a bit harder than it seems. I guess OP expects to match anything other than keyword instead of dontcare, so you have to invert the regex /keyword/ which is not so easy in sed.
Ah! I see what you mean. I am not exactly sure what is the approach explained by the OP here, then. If the intention is to target the position of /keyword/, then your answer is correct (and mine isn't), but if the intention is to find /dontcareN/ then mine should work. The format of the question doesn't clearly explain which is intended, at least to me.
0

You could also use read with a / value for $IFS to parse out the trash.

$: IFS=/ read proto trash url trash trash trash keyword rest <<< "https://x.xx.xxx.xxx/dontcare1/dontcare2/dontcareN/keyword/restofstring"
$: echo "$proto//$url/$keyword/$rest"
https://x.xx.xxx.xxx/keyword/restofstring

This is more general when the dontcare... values aren't known, predictable strings, but as written it assumes there are exactly three of them.

This one is pure bash, though I like Socowi's answer better.
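
If the number of dontcare segments isn't fixed, shell parameter expansion is another option that avoids external tools. This is a different technique from read, shown only as a sketch: it assumes the URL contains :// and a literal /keyword/ component, and the variable names are illustrative.

```bash
url='https://x.xx.xxx.xxx/dontcare1/dontcare2/dontcareN/keyword/restofstring'

scheme="${url%%://*}"     # text before the first "://"      -> https
hostpath="${url#*://}"    # everything after the first "://"
host="${hostpath%%/*}"    # up to the first slash            -> x.xx.xxx.xxx
rest="${url#*/keyword/}"  # after the first "/keyword/"      -> restofstring

newurl="$scheme://$host/keyword/$rest"
echo "$newurl" # https://x.xx.xxx.xxx/keyword/restofstring
```

If the URL has no /keyword/ component, ${url#*/keyword/} leaves the string untouched, so it is worth checking for that case before building newurl.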

0

Here's a sed variation which picks out the host part and the last two components from the path.

url='http://example.com:1234/ick/poo/bar/quux/fnord'
newurl=$(echo "$url" | sed 's%\(https*://[^/?]*[^?/]\)[^ <>'"'"'"]*/\([^/ <>'"'"'"]*/[^/ <>'"'"'"]*\)%\1/\2%')

The general form is sed 's%pattern%replacement%'. The pattern matches through the end of the host name part (captured in one set of backslashed parentheses), then skips through the penultimate slash, then captures the last two path components with the slash between them. The replacement recalls the two captured groups, joined by a slash, without the skipped part between them. The '"'"' sequences are shell quoting that puts a literal single quote inside the single-quoted script.
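
For comparison, the same idea is easier to read with sed -E. The version below is a simplified sketch, not a drop-in equivalent: it assumes each input line holds exactly one URL with no surrounding text, so the quote and angle-bracket exclusions are dropped and the match is anchored at the end of the line.

```bash
url='http://example.com:1234/ick/poo/bar/quux/fnord'

# Group 1: scheme and host (with optional port); group 2: last two path components.
newurl="$(printf '%s\n' "$url" | sed -E 's%(https?://[^/?]+)[^ ]*/([^/ ]+/[^/ ]+)$%\1/\2%')"
echo "$newurl" # http://example.com:1234/quux/fnord
```

Applied to the URL from the question, this keeps keyword/restofstring only because those happen to be the last two path components.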
