3

I have some code that takes a URL, reads through the file, searches for strings that match a given regular expression, and adds any matches to an ArrayList until it reaches the end of the file. How can I modify my code so that, while reading through the file, I can also check for strings matching other regular expressions in the same pass, rather than reading the file multiple times, once for each regex?

    //Pattern currently being checked for
    Pattern name = Pattern.compile("<a id=.dg__ct(.+?)_hpl1.>(.+?)</a>");

    //Pattern I want to check for as well, currently not implemented
    Pattern date = Pattern.compile("[0-9]{2}/[0-9]{2}/[0-9]{4}");

    Matcher m;
    InputStream inputStream = null;
    arrayList = new ArrayList<String>();
    try {
        URL url = new URL(
                "URL to be read");
        inputStream = (InputStream) url.getContent();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        InputStreamReader isr = new InputStreamReader(inputStream);
        BufferedReader buf = new BufferedReader(isr);
        String str = null;
        String s = null;

        try {
            while ((str = buf.readLine()) != null) {

                m = name.matcher(str);
                while(m.find()){
                    s = m.group();
                    arrayList.add(s);
                }

            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

4 Answers

6

As soon as you have two or more patterns, you should keep them in a List. And you shouldn't do the reading in the finally block, which is entered even if opening one of the streams fails. Instead, the finally block should be used to close the resources.

    List <Pattern> patterns = new ArrayList <Pattern> ();
    //Pattern currently being checked for
    patterns.add (Pattern.compile ("<a id=.dg__ct(.+?)_hpl1.>(.+?)</a>"));
    //Pattern I want to check for as well, currently not implemented
    patterns.add (Pattern.compile ("[0-9]{2}/[0-9]{2}/[0-9]{4}"));
    BufferedReader buf = null;
    List <String> matches = new ArrayList <String> ();
    try {
        URL url = new URL ("URL to be read");
        InputStream inputStream = (InputStream) url.getContent ();
        InputStreamReader isr = new InputStreamReader (inputStream);
        buf = new BufferedReader (isr);
        String str = null;
        while ((str = buf.readLine ()) != null) 
        {
            for (Pattern p : patterns) 
            {
                Matcher m = p.matcher (str);
                while (m.find ()) 
                    matches.add (m.group ());
            }
        }       
    } 
    catch (Exception e) 
    {
        e.printStackTrace();
    }
    finally  
    {
        if (buf != null) 
            try { buf.close (); } catch (IOException ignored) { /*empty*/}
    }

Not corrected in the code: instead of catching 'Exception', you should list the specific exceptions. Also, a Matcher is only used inside the innermost loop, so declare it there, not in a wider scope. A small scope makes it easier to reason about how a variable is used.
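
The two points above can be sketched as follows. This is only an illustration, not the answerer's code: the class name `SinglePassMatcher`, both method names, the UTF-8 charset, and the choice to return an empty list on failure are assumptions made for the example. It uses try-with-resources (Java 7+) in place of the explicit finally block, and catches `MalformedURLException` and `IOException` instead of `Exception`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class SinglePassMatcher {

    // Reads the input once and applies every pattern to each line.
    static List<String> collectMatches(BufferedReader reader, List<Pattern> patterns)
            throws IOException {
        List<String> matches = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            for (Pattern p : patterns) {
                // The Matcher lives only in the innermost scope that needs it.
                Matcher m = p.matcher(line);
                while (m.find()) {
                    matches.add(m.group());
                }
            }
        }
        return matches;
    }

    // Opens the URL and closes the reader automatically, even on failure.
    // Assumption: the page is encoded as UTF-8.
    static List<String> collectFromUrl(String address, List<Pattern> patterns) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new URL(address).openStream(), StandardCharsets.UTF_8))) {
            return collectMatches(reader, patterns);
        } catch (MalformedURLException e) {
            // The address itself is not a valid URL.
            e.printStackTrace();
        } catch (IOException e) {
            // Connecting or reading failed.
            e.printStackTrace();
        }
        // Assumption: an empty result is acceptable when reading fails.
        return new ArrayList<>();
    }
}
```

Keeping the matching loop separate from the URL handling also means it can be tried on a `StringReader` without any network access.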

I'm not sure whether java.util.Scanner can be used to make reading from a URL easier. Have a look at the documentation.
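
For what it's worth, here is a rough sketch of what a Scanner-based version might look like. The class name `ScannerMatches`, the method name, and the UTF-8 charset are assumptions for the example. For a real URL the stream would come from something like `url.openStream()`; the `main` method uses an in-memory stream so the sketch runs without a network connection.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class; the names are illustrative only.
class ScannerMatches {

    // Reads the stream line by line with a Scanner and applies every pattern.
    // Assumption: the content is encoded as UTF-8.
    static List<String> collectMatches(InputStream in, List<Pattern> patterns) {
        List<String> matches = new ArrayList<>();
        // Closing the Scanner also closes the underlying stream.
        try (Scanner scanner = new Scanner(in, "UTF-8")) {
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine();
                for (Pattern p : patterns) {
                    Matcher m = p.matcher(line);
                    while (m.find()) {
                        matches.add(m.group());
                    }
                }
            }
            // Scanner does not throw IOExceptions from the source; it records
            // the last one, which can be inspected afterwards.
            if (scanner.ioException() != null) {
                scanner.ioException().printStackTrace();
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        InputStream sample = new ByteArrayInputStream(
                "Due 01/02/2003\nno date here\n".getBytes(StandardCharsets.UTF_8));
        List<Pattern> patterns =
                Arrays.asList(Pattern.compile("[0-9]{2}/[0-9]{2}/[0-9]{4}"));
        System.out.println(collectMatches(sample, patterns));
    }
}
```

Whether this is really easier than a BufferedReader is debatable: it saves the checked IOException in the loop, but errors then have to be checked via `ioException()`.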

1 Comment

Very clean and simple solution. Thanks.

2

Instead of using a regular expression, use a Java library that understands how to parse HTML properly.

For example, check out the answers for: Java HTML Parsing

1

Simply obtain a new matcher for the other pattern:

    Matcher m2 = date.matcher(str);
    ... // do whatever you want to do with this pattern match
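
Spelled out, that could look roughly like the sketch below. The class name `TwoMatchersPerLine`, the method `scanLine`, and the decision to keep names and dates in separate lists are assumptions for the example. The two patterns are the ones from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class; the names are illustrative only.
class TwoMatchersPerLine {

    static final Pattern NAME = Pattern.compile("<a id=.dg__ct(.+?)_hpl1.>(.+?)</a>");
    static final Pattern DATE = Pattern.compile("[0-9]{2}/[0-9]{2}/[0-9]{4}");

    // Checks one line against both patterns; call this once per line read.
    static void scanLine(String str, List<String> names, List<String> dates) {
        Matcher m = NAME.matcher(str);
        while (m.find()) {
            names.add(m.group());
        }
        // A second matcher for the other pattern, on the same line.
        Matcher m2 = DATE.matcher(str);
        while (m2.find()) {
            dates.add(m2.group());
        }
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        List<String> dates = new ArrayList<>();
        scanLine("<a id=\"dg__ct01_hpl1\">Alice</a> joined on 01/02/2003", names, dates);
        System.out.println(names);
        System.out.println(dates);
    }
}
```

Separate lists are just one option; if the order of matches within the file matters more than their kind, a single list as in the question works too.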

BTW, in general it's not a very good idea to parse HTML with regular expressions. (Obligatory link, by the Assistant Don't Parse HTML With Regex Officer in charge.)

1
  1. Create two Matcher objects

    // matcher(...) needs an input; start with an empty string and reset(...) it for each line

    //Pattern currently being checked for
    Matcher nameMatcher = Pattern.compile("<a id=.dg__ct(.+?)_hpl1.>(.+?)</a>").matcher("");
    
    //Pattern I want to check for as well, currently not implemented
    Matcher dateMatcher = Pattern.compile("[0-9]{2}/[0-9]{2}/[0-9]{4}").matcher("");
    
    
    // other stuff...
    
  2. Check the read string against each matcher

    while ((str = buf.readLine()) != null) {
    
        // point the name matcher at the current line
        nameMatcher.reset(str);
    
        while(nameMatcher.find()){
            s = nameMatcher.group();
            arrayList.add(s);
        }
    
        // point the date matcher at the same line
        dateMatcher.reset(str);
    
        while(dateMatcher.find()){
            s = dateMatcher.group();
            arrayList.add(s);
        }
    }
    

Important

Use reset(CharSequence) instead of allocating a new Matcher object for every line.
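
As a self-contained illustration of the reuse idea, here is a minimal sketch. The class name `ReusedMatchers` and the method `collect` are assumptions for the example, and the lines are passed in as an `Iterable<String>` rather than read from a stream so that it runs on its own. Each Matcher is created once, on an empty string, and then pointed at each new line with `reset(CharSequence)`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class; the names are illustrative only.
class ReusedMatchers {

    static List<String> collect(Iterable<String> lines) {
        // Created once; the empty input is replaced by reset(...) below.
        Matcher nameMatcher =
                Pattern.compile("<a id=.dg__ct(.+?)_hpl1.>(.+?)</a>").matcher("");
        Matcher dateMatcher =
                Pattern.compile("[0-9]{2}/[0-9]{2}/[0-9]{4}").matcher("");

        List<String> found = new ArrayList<>();
        for (String line : lines) {
            // reset(CharSequence) swaps in the new input and clears match state.
            nameMatcher.reset(line);
            while (nameMatcher.find()) {
                found.add(nameMatcher.group());
            }
            dateMatcher.reset(line);
            while (dateMatcher.find()) {
                found.add(dateMatcher.group());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(collect(Arrays.asList(
                "<a id=\"dg__ct01_hpl1\">Alice</a>",
                "Joined 01/02/2003, left 04/05/2006")));
    }
}
```

Note that a Matcher is not safe for use by multiple threads at once, so reused matchers like these should stay local to the method that does the reading.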
