1

I have a text document containing a bunch of URLs of the form /courses/......./.../.. From among these, I only want to extract the URLs of the form /courses/.../lecture-notes, meaning the URLs that begin with /courses and end with /lecture-notes. Would anyone know of a good way to do this with regular expressions or just by string matching?

3 Answers

5

Here's one alternative:

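// Needs: import java.io.FileReader; import java.util.Scanner;
Scanner s = new Scanner(new FileReader("filename.txt"));

String str;
// A horizon of 0 means the search is not bounded: the whole input is scanned.
while (null != (str = s.findWithinHorizon("/courses/\\S*/lecture-notes", 0)))
    System.out.println(str);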

Given a filename.txt with the content

Here /courses/lorem/lecture-notes and
here /courses/ipsum/dolor/lecture-notes perhaps.

the above snippet prints

/courses/lorem/lecture-notes
/courses/ipsum/dolor/lecture-notes
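
For anyone who wants to try this without creating a file, here is a minimal self-contained sketch of the same idea. The class name LectureNoteScanner and the method extract are made up for illustration, and the text is fed in as a String rather than read from filename.txt:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class LectureNoteScanner {
    // Hypothetical helper: collects every match of /courses/<non-whitespace>/lecture-notes.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        try (Scanner s = new Scanner(text)) {
            String match;
            // Horizon 0 tells the Scanner to search the remaining input without bound.
            while ((match = s.findWithinHorizon("/courses/\\S*/lecture-notes", 0)) != null) {
                found.add(match);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Here /courses/lorem/lecture-notes and\n"
                    + "here /courses/ipsum/dolor/lecture-notes perhaps.";
        // Prints each extracted URL on its own line.
        extract(text).forEach(System.out::println);
    }
}
```

Note that the pattern is not anchored at the end of a token, so a longer path such as /courses/a/lecture-notes/extra would still yield the prefix /courses/a/lecture-notes.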

1

The following will only return the middle part (i.e., it excludes /courses/ and /lecture-notes):

Pattern p = Pattern.compile("/courses/(.*)/lecture-notes");
Matcher m = p.matcher(yourString);

if (m.find())
  return m.group(1); // group(1) is the text captured by the first pair of parentheses in the regex.
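
If the text contains several URLs, a loop over find() is needed. Below is a rough sketch under a couple of assumptions: the names MiddlePartExtractor and middleParts are invented for the example, and the group uses \S* instead of .* so that a match cannot run across whitespace into the next URL on the same line (a greedy .* would).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MiddlePartExtractor {
    // \S* keeps each match inside a single whitespace-delimited token.
    private static final Pattern P = Pattern.compile("/courses/(\\S*)/lecture-notes");

    // Hypothetical helper: returns the captured middle part of every matching URL.
    static List<String> middleParts(String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = P.matcher(text);
        while (m.find()) {
            parts.add(m.group(1)); // group(1) = text between /courses/ and /lecture-notes
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(middleParts(
            "/courses/lorem/lecture-notes /courses/ipsum/dolor/lecture-notes"));
        // prints [lorem, ipsum/dolor]
    }
}
```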

1

Assuming that you have one URL per line, you could use:

    BufferedReader br = new BufferedReader(new FileReader("urls.txt"));
    String urlLine;
    while ((urlLine = br.readLine()) != null) {
        if (urlLine.matches("/courses/.*/lecture-notes")) {
            // use url
        }
    }
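
Since the question also asks about plain string matching, the same per-line check can be sketched without a regex. The names LectureNoteFilter, isLectureNotesUrl and filter are assumptions made up for this example. The length check is there so that the prefix and suffix cannot share the same slash, which makes the test agree with matches("/courses/.*/lecture-notes") for single-line strings:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LectureNoteFilter {
    private static final String PREFIX = "/courses/";
    private static final String SUFFIX = "/lecture-notes";

    // True when the URL starts with /courses/ and ends with /lecture-notes,
    // with the two parts not overlapping (rejects "/courses/lecture-notes").
    static boolean isLectureNotesUrl(String url) {
        return url.length() >= PREFIX.length() + SUFFIX.length()
            && url.startsWith(PREFIX)
            && url.endsWith(SUFFIX);
    }

    // Keeps only the matching lines; each line is trimmed first.
    static List<String> filter(List<String> lines) {
        return lines.stream()
                    .map(String::trim)
                    .filter(LectureNoteFilter::isLectureNotesUrl)
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "/courses/lorem/lecture-notes",
            "/courses/lorem/assignments",
            "/about/lecture-notes",
            "/courses/lecture-notes");
        System.out.println(filter(urls)); // prints [/courses/lorem/lecture-notes]
    }
}
```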

2 Comments

Nothing in the description precludes processing the urls. This check is within a loop.
Unless you explain how to traverse a text token by token (or at least line by line) this answer is not complete. (Also, ^ and $ are not needed when using matches.)
