2

Let's say I have this string, which contains an HTML a tag:

<a href="abgeordnete-1128-0----w8397.html" class="small_link">Berlin-Treptow-Köpenick</a>

How do I use a regex in Ruby to extract the text "Berlin-Treptow-Köpenick"?

Thanks! :)

4
  • 1
    Possible duplicate of RegEx match open tags except XHTML self-contained tags Commented Nov 29, 2015 at 21:19
  • 1
    Why the rush to select an answer? Commented Nov 29, 2015 at 21:46
  • 2
    You should specify the extraction rule. For example, it appears from the example that it is the text comprised of alphanumeric characters and '-' following the character '>', but the reader cannot determine if that would always be the case. Also, when you give an example, it is helpful to assign all input objects to variables (e.g., str = "<a href...") so that readers can refer to those variables in answers and comments without having to define them. Commented Nov 29, 2015 at 22:04
  • I know this question is pretty old, but I think it's still worth noting: your title clearly states that you want to extract text from between two tags, but the question does not. Furthermore, you don't specify what those tags are. Commented Jan 5, 2022 at 17:14

4 Answers

4

You can use:

html = '<a href="abgeordnete-1128-0----w8397.html" class="small_link">Berlin-Treptow-Köpenick</a>'

html[/>(.*)</, 1]
#=> "Berlin-Treptow-Köpenick"
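
As a rough sketch of what the one-liner does: String#[] accepts a regexp plus a capture index and returns that capture group, or nil when nothing matches. The nested sample string and the variable names below are made up for illustration; they are not part of the question.

```ruby
html = '<a href="abgeordnete-1128-0----w8397.html" class="small_link">Berlin-Treptow-Köpenick</a>'

# String#[] with a regexp and an index returns capture group 1
# (or nil when the pattern does not match).
text = html[/>(.*)</, 1]

# The same thing spelled out with String#match and MatchData#[].
m = html.match(/>(.*)</)
same_text = m && m[1]

# Hypothetical nested input: greedy .* runs up to the LAST '<',
# so inner tags end up in the result.
nested = '<a href="#"><b>Berlin</b></a>'
greedy = nested[/>(.*)</, 1]         # "<b>Berlin</b>"

# [^<]+ stops at the first '<' and needs at least one text character.
first_text = nested[/>([^<]+)</, 1]  # "Berlin"
```

The greedy form is fine for a single flat tag like the one in the question; once tags nest, the pattern has to be tightened or replaced by a parser.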

When your HTML partials are more complex, I recommend using a library like Nokogiri:

html = '<a href="abgeordnete-1128-0----w8397.html" class="small_link">Berlin-Treptow-Köpenick</a>'

require 'nokogiri'

Nokogiri::HTML(html).text
#=> "Berlin-Treptow-Köpenick"

1 Comment

This is awesome, but it looks like magic :) Could you please link to the docs or describe how it works?
2

I have made the assumption that the string to be extracted is comprised of alphanumeric characters--including accented letters--and hyphens, and that the string immediately follows the first instance of the character '>'.

string =
'<a href="abgeordnete-1128-0----w8397.html" class="small_link">Berlin-Treptow-Köpenick</a>'

r = /
    (?<=\>)       # match '>' in a positive lookbehind
    [\p{Alnum}-]+ # match >= 1 alphanumeric characters and hyphens
    /x            # extended or free-spacing mode

string[r] #=> "Berlin-Treptow-Köpenick"

Note that /[A-Za-z0-9]/ does not match accented characters such as 'ö'.

Alternatively, one can use the POSIX syntax:

r = /(?<=\>)[[[:alnum:]]-]+/
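
To make the point about accented characters concrete, here is a small comparison of the three character classes on the question's string. The variable names are only illustrative, and the POSIX class is written in its simpler un-nested form.

```ruby
string = '<a href="abgeordnete-1128-0----w8397.html" class="small_link">Berlin-Treptow-Köpenick</a>'

# Unicode property class: letters and digits from any script, plus '-'.
unicode_result = string[/(?<=\>)[\p{Alnum}-]+/]

# POSIX bracket class: also Unicode-aware for UTF-8 strings in Ruby.
posix_result = string[/(?<=\>)[[:alnum:]-]+/]

# ASCII-only class: the match stops right before 'ö'.
ascii_result = string[/(?<=\>)[A-Za-z0-9-]+/]

puts unicode_result  # Berlin-Treptow-Köpenick
puts posix_result    # Berlin-Treptow-Köpenick
puts ascii_result    # Berlin-Treptow-K
```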

1
string = '<a href="abgeordnete-1128-0----w8397.html" class="small_link">Berlin-Treptow-Köpenick</a>'

string.scan(/<[a][^>]*>(.+?)<\/[a]>/).flatten
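
A brief sketch of why flatten is there: with one capture group, scan returns one single-element array per match, so flatten turns the result into a plain list of link texts. The two-link input is invented for illustration, and the pattern uses a\b instead of [a] so that only the tag name "a" followed by a word boundary is accepted.

```ruby
# Hypothetical input with two links.
html = '<a href="one.html">First</a> and <a href="two.html" class="small_link">Second</a>'

pattern = /<a\b[^>]*>(.+?)<\/a>/

# With a capture group, scan yields an array of capture arrays.
per_match = html.scan(pattern)

# flatten collapses them into a flat list of strings.
link_texts = per_match.flatten

p per_match   # [["First"], ["Second"]]
p link_texts  # ["First", "Second"]
```

The lazy (.+?) keeps each match inside its own link instead of running from the first opening tag to the last closing tag.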

0

ActionController::Base.helpers.strip_tags(html)

This Rails helper returns only the text. For example, given:

html = "<a href=\" https://something.com/\"></a><br><strong style=\"color: red;\"><em><del>this</del></em></strong> <strong style=\"color: red;\"><em style=\"color: red;\">works</em></strong>"

it returns "this works".
