Using String methods instead of Regex

Question

As I am not very familiar with regex, is it possible (whether it's hard to do or not) to extract certain text in between symbols? For example:

<meta name="description" content="THIS IS THE TEXT I WANT TO EXTRACT" />

I don't believe it will be hard, but for any non-trivial implementation you are looking at a reasonably large amount of code to write and maintain. And it is unlikely that you will get anywhere near the performance of a Regex.
– Gregory, Commented Oct 14, 2009 at 5:06
RegEx is one of those horribly confusing things that ought not be avoided simply because of its complexity. It is a great deal more efficient than any standard string method (in most cases) and chances are it's a better choice, even if it does boggle the mind. :-!
– Nathan Taylor, Commented Oct 14, 2009 at 8:28

Marc Gravell · Accepted Answer · 2009-10-14 05:28:59Z

5

Since you give an xml example, just use an xml parser:

string s = (string) XElement.Parse(xml).Attribute("content");

xml is not a simple text format, and Regex isn't really a very good fit; using an appropriate tool will protect you from a range of evils... for example, the following is identical as xml:

<meta
    name="description"
    content=
        'THIS IS THE TEXT I WANT TO EXTRACT'
/>

It also means that when the requirement changes, you have a simple tweak to make to the code, rather than trying to unpick a regex and put it back together again (which can be tricky if you are accessing a non-trivial node). Equally, xpath might be an option; so for your data the xpath:

/meta/@content

is all you need.

If you haven't got .NET 3.5:

XmlDocument doc = new XmlDocument();
doc.LoadXml(xml);
string s = doc.DocumentElement.GetAttribute("content");
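
As a rough sketch of the xpath route mentioned above: XmlDocument.SelectSingleNode can evaluate /meta/@content directly. The helper name ReadContentAttribute is made up for illustration, and the sketch assumes the input is well-formed XML with <meta> as the root element, which real-world HTML often is not.

```csharp
using System.Xml;

// Hypothetical helper (not from the answer): returns the content attribute of
// a root <meta> element, or null when the XPath matches nothing.
string ReadContentAttribute(string xml)
{
    XmlDocument doc = new XmlDocument();
    doc.LoadXml(xml); // throws XmlException if the input is not well-formed XML
    XmlNode node = doc.SelectSingleNode("/meta/@content");
    return node == null ? null : node.Value;
}
```

Returning null for a missing attribute is just one possible choice; throwing may suit callers that treat a missing attribute as an error.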

edited Oct 14, 2009 at 5:28

answered Oct 14, 2009 at 5:05

Marc Gravell

Alex Martelli · Accepted Answer · 2009-10-14 05:28:28Z

2

Sure, you can identify the start and the end of your desired substring with string methods such as IndexOf, then get the desired Substring! In your example, you want to locate (with IndexOf) the content=" and then the first following ", right? And once you have those indices into the string, Substring will work fine. (Not posting C# code because I'm not entirely sure of what exactly it IS that you want, beyond IndexOf and Substring...!-)

If so, then:

// 9 is the length of content=" (IndexOf returns -1 if the marker is missing)
int first = str.IndexOf("content=\"");
int last = str.IndexOf("\"", first + 9);
return str.Substring(first + 9, last - first - 9);

should more or less do what you want (apologies in advance if there's an off-by-one or so in those hardcoded 9s -- they're meant to stand for the length of the first substring you're looking for; adjust them a little bit up or down until you get exactly the result you want!-), but this is the general concept. Locate the start with single-argument IndexOf, locate the end with two-args IndexOf, slice off the desired piece with Substring...!
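
One way to avoid hardcoding the marker length, and to notice when a marker is missing, is sketched below. The helper name ExtractBetween and its parameters are hypothetical, not part of the answer; it assumes you want the text between the first occurrence of a start marker and the next end marker after it.

```csharp
using System;

// Hypothetical helper: returns the text between the first startMarker and the
// next endMarker after it, or null when either marker is missing.
string ExtractBetween(string text, string startMarker, string endMarker)
{
    int markerAt = text.IndexOf(startMarker, StringComparison.Ordinal);
    if (markerAt == -1) return null; // start marker not found

    // First character after the marker; no hand-counted offsets needed.
    int valueStart = markerAt + startMarker.Length;
    int valueEnd = text.IndexOf(endMarker, valueStart, StringComparison.Ordinal);
    if (valueEnd == -1) return null; // end marker not found

    return text.Substring(valueStart, valueEnd - valueStart);
}

string html = "<meta name=\"description\" content=\"THIS IS THE TEXT I WANT TO EXTRACT\" />";
Console.WriteLine(ExtractBetween(html, "content=\"", "\""));
// prints: THIS IS THE TEXT I WANT TO EXTRACT
```

Checking for -1 matters here: if the marker text does not match the document exactly, IndexOf returns -1 and the offsets silently point near the start of the string.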

edited Oct 14, 2009 at 5:28

answered Oct 14, 2009 at 4:57

Alex Martelli

3 Comments

jay_t55 Over a year ago

That's right, what I'm after is the text in between the quotes inside the content attribute, like this: content="i need this text"

jay_t55 Over a year ago

Thanks for the code, Alex, but it's nowhere near close; it always extracts the first 15 or so chars from the beginning of the file... weird???

Alex Martelli Over a year ago

What do you see when you add output statements to show the value of first and last?

solairaja · Accepted Answer · 2009-10-14 05:01:02Z

1

If the input is: text1/text2/text3

The regex below captures the third segment, text3, in group 2:

^([^/]*/){2}([^/]*)$

If you always need the last segment, use the one below (captured in group 1):

^.*/([^/]*)$
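
For reference, a minimal sketch of running such patterns from C# with Regex.Match. It assumes the sample input text1/text2/text3 with no trailing slash, so the patterns here end in $ without a final /; the variable names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "text1/text2/text3";

// Third segment: skip two "segment/" prefixes, then capture the rest in group 2.
Match third = Regex.Match(input, @"^([^/]*/){2}([^/]*)$");

// Last segment: the greedy .* runs up to the final slash; group 1 is what follows it.
Match last = Regex.Match(input, @"^.*/([^/]*)$");

Console.WriteLine(third.Success ? third.Groups[2].Value : "(no match)"); // prints: text3
Console.WriteLine(last.Success ? last.Groups[1].Value : "(no match)");   // prints: text3
```

Checking Success before reading a group avoids mistaking an empty capture for a failed match.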

answered Oct 14, 2009 at 5:01

solairaja

1 Comment

Vlad the Impala Over a year ago

I think OP is looking for a non-regex solution.

noctonura · Accepted Answer · 2009-10-14 05:00:15Z

0

Sure, you can do it without Regex. Say you want to get the text between < and >...

string GetTextBetween(string content)
{
  int start = content.IndexOf("<");
  if (start == -1) return null; // Opening "<" not found.
  int end = content.IndexOf(">", start + 1); // Only look after the "<".
  if (end == -1) return null;  // Closing ">" not found.
  // Skip the "<" itself and stop just before the ">".
  return content.Substring(start + 1, end - start - 1);
}

answered Oct 14, 2009 at 5:00

noctonura
