
As I am not very familiar with regex, is it possible (whether it's hard to do or not) to extract certain text in between symbols without it? For example:

<meta name="description" content="THIS IS THE TEXT I WANT TO EXTRACT" />
  • I don't believe it will be hard, but for any non-trivial implementation you are looking at a reasonably large amount of code to write and maintain. And it is unlikely that you will get anywhere near the performance of a Regex. Commented Oct 14, 2009 at 5:06
  • RegEx is one of those horribly confusing things that ought not be avoided simply because of its complexity. It is a great deal more efficient than any standard string method (in most cases), and chances are it's a better choice, even if it does boggle the mind. :-! Commented Oct 14, 2009 at 8:28

4 Answers


Since you give an xml example, just use an xml parser:

string s = (string) XElement.Parse(xml).Attribute("content");

XML is not a simple text format, and Regex isn't really a very good fit; using an appropriate tool will protect you from a range of evils... for example, the following is equivalent XML:

<meta
    name="description"
    content=
        'THIS IS THE TEXT I WANT TO EXTRACT'
/>

It also means that when the requirement changes, you have a simple tweak to make to the code, rather than trying to unpick a regex and put it back together again (which can be tricky if you are accessing a non-trivial node). Equally, XPath might be an option; so in your data the XPath:

/meta/@content

is all you need.
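As a rough sketch of how that XPath could be evaluated, the snippet below uses XmlDocument.SelectSingleNode. The class name XPathDemo and the inline sample string are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.Xml;

class XPathDemo
{
    static void Main()
    {
        // Sample input copied from the question.
        string xml = "<meta name=\"description\" content=\"THIS IS THE TEXT I WANT TO EXTRACT\" />";

        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);

        // SelectSingleNode returns null when the XPath matches nothing,
        // so check before reading Value.
        XmlNode attr = doc.SelectSingleNode("/meta/@content");
        Console.WriteLine(attr == null ? "(no content attribute)" : attr.Value);
    }
}
```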

If you haven't got .NET 3.5:

XmlDocument doc = new XmlDocument();
doc.LoadXml(xml);
string s = doc.DocumentElement.GetAttribute("content");
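To illustrate the point about formatting, here is a small sketch that parses both the one-line element from the question and the multi-line, single-quoted variant shown earlier, then compares the results. The class and variable names are made up for the example.

```csharp
using System;
using System.Xml.Linq;

class ParserDemo
{
    static void Main()
    {
        // One-line form from the question.
        string compact = "<meta name=\"description\" content=\"THIS IS THE TEXT I WANT TO EXTRACT\" />";

        // Same element spread over several lines, with single quotes.
        string spread = "<meta\n    name=\"description\"\n    content=\n        'THIS IS THE TEXT I WANT TO EXTRACT'\n/>";

        // The cast yields null if the attribute is missing.
        string a = (string) XElement.Parse(compact).Attribute("content");
        string b = (string) XElement.Parse(spread).Attribute("content");

        Console.WriteLine(a);
        // Both forms should yield the same string, since the parser
        // ignores quoting style and whitespace around the attribute.
        Console.WriteLine(a == b);
    }
}
```

A regex or IndexOf approach written for the one-line form would need extra work to cope with the second form; the parser handles both without changes.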

Sure, you can find the start and the end of your desired substring with string methods such as IndexOf, then cut it out with Substring! In your example, you want to locate (with IndexOf) the content=" marker and then the first " that follows it, right? Once you have those indices into the string, Substring will work fine.

If so, then:

const string marker = "content=\"";
int first = str.IndexOf(marker);            // -1 if the marker is absent
int start = first + marker.Length;          // first character of the value
int last = str.IndexOf("\"", start);        // closing quote, -1 if absent
return str.Substring(start, last - start);

should more or less do what you want. Using marker.Length instead of a hardcoded offset avoids off-by-one errors if the marker changes. IndexOf returns -1 when nothing is found, so real code should check first and last before calling Substring. This is the general concept: locate the start with single-argument IndexOf, locate the end with two-argument IndexOf, and slice off the desired piece with Substring.
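The locate-then-slice idea above can be wrapped in a small helper that also handles the not-found case. This is only a sketch: TextExtractor and Between are hypothetical names, and it assumes the value is delimited by double quotes and contains no escaped quotes.

```csharp
using System;

static class TextExtractor
{
    // Hypothetical helper: returns the text between startMarker and the
    // next endMarker after it, or null if either marker is missing.
    public static string Between(string text, string startMarker, string endMarker)
    {
        int markerAt = text.IndexOf(startMarker, StringComparison.Ordinal);
        if (markerAt < 0) return null;              // start marker not found

        int valueStart = markerAt + startMarker.Length;
        int valueEnd = text.IndexOf(endMarker, valueStart, StringComparison.Ordinal);
        if (valueEnd < 0) return null;              // end marker not found

        return text.Substring(valueStart, valueEnd - valueStart);
    }

    static void Main()
    {
        string html = "<meta name=\"description\" content=\"THIS IS THE TEXT I WANT TO EXTRACT\" />";
        Console.WriteLine(Between(html, "content=\"", "\""));
    }
}
```

Printing the intermediate indices, as suggested in the comments below, is a quick way to see which of the two searches failed on real input.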

3 Comments

That's right, what I'm after is the text in between both quotes inside the content attribute, like this: content="i need this text"
Thanks for the code Alex, but it's nowhere near close; it always extracts the first 15 or so chars from the beginning of the file. Weird?
What do you see when you add output statements to show the value of first and last?

If the input is: text1/text2/text3

The regex below captures the third segment, text3, in group 2:

^([^/]*/){2}([^/]*)$


If you always need the last segment, use the one below (the segment is in group 1):

^.*/([^/]*)$
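For reference, a sketch of how such patterns might be applied from C# with Regex.Match. It assumes the input has no trailing slash, so the patterns here end at `$` directly after the last segment; the class name SegmentDemo is made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

class SegmentDemo
{
    static void Main()
    {
        string input = "text1/text2/text3";

        // Skip exactly two slash-terminated segments, then capture the third.
        Match third = Regex.Match(input, @"^([^/]*/){2}([^/]*)$");
        if (third.Success)
            Console.WriteLine(third.Groups[2].Value);

        // Greedy .* consumes everything up to the final slash,
        // so group 1 holds the last segment.
        Match last = Regex.Match(input, @"^.*/([^/]*)$");
        if (last.Success)
            Console.WriteLine(last.Groups[1].Value);
    }
}
```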

1 Comment

I think OP is looking for a non-regex solution.

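Sure, you can do it without Regex. Say you want to get the text between < and >...

string GetTextBetween(string content)
{
  int start = content.IndexOf("<");
  if (start == -1) return null; // Opening symbol not found.
  // Search for the closing symbol only after the opening one.
  int end = content.IndexOf(">", start + 1);
  if (end == -1) return null;   // Closing symbol not found.
  // Skip the "<" itself and stop just before the ">".
  return content.Substring(start + 1, end - start - 1);
}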
