4

I have an extremely long string that I want to parse for a numeric value that occurs after the substring "ISBN". However, this grouping of 13 digits can be arranged differently via the "-" character. Examples: (these are all valid ISBNs) 123-456-789-123-4, OR 1-2-3-4-5-67891234, OR 12-34-56-78-91-23-4. Essentially, I want to use a regex pattern matcher on the potential ISBN to see if there is a valid 13 digit ISBN. How do I 'ignore' the "-" character so I can just regex for a \d{13} pattern? My function:

public String parseISBN (String sourceCode) {
  int location = sourceCode.indexOf("ISBN") + 5;
  String ISBN = sourceCode.substring(location); //substring after "ISBN" occurs
  int i = 0;
  while ( ISBN.charAt(i) != ' ' )
    i++;
  ISBN = ISBN.substring(0, i); //should contain potential ISBN value
  Pattern pattern = Pattern.compile("\\d{13}"); //this clearly will find 13 consecutive numbers, but I need it to ignore the "-" character
  Matcher matcher = pattern.matcher(ISBN); 
  if (matcher.find()) return ISBN;
  else return null;
}
2
  • 2
    My suggestion would be to just replace "-" with nothing, then run your ISBN check on the result. If it passes, you can use whichever form you need. Commented Aug 18, 2011 at 20:19
  • Some good regexes in the answers, and I'd certainly use one because it would catch all valid ISBN formats -- but I just wanted to point out that the dashes in an ISBN aren't arbitrarily inserted; there are only certain combinations that are valid. (Of course you're not trying to validate the form of the number, just get the number) Commented Aug 18, 2011 at 23:30

8 Answers

7
  • Alternative 1:

    pattern.matcher(ISBN.replace("-", ""))
    
  • Alternative 2: Something like

    Pattern.compile("(\\d-?){13}")
    

Demo of second alternative:

String ISBN = "ISBN: 123-456-789-112-3, ISBN: 1234567891123";

Pattern pattern = Pattern.compile("(\\d-?){13}");
Matcher matcher = pattern.matcher(ISBN);

while (matcher.find())
    System.out.println(matcher.group());

Output:

123-456-789-112-3
1234567891123
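For comparison, the first alternative might look roughly like the sketch below. The class and method names (`IsbnStripDemo`, `normalize`) are made up for illustration. The sketch assumes the candidate token has already been isolated, as in the question's code, because removing hyphens from a whole document could fuse unrelated digit runs together.

```java
import java.util.regex.Pattern;

public class IsbnStripDemo {
    // Compiled once; matches a string made of exactly 13 digits.
    private static final Pattern THIRTEEN_DIGITS = Pattern.compile("\\d{13}");

    // Hypothetical helper: returns the hyphen-free ISBN, or null if the
    // candidate is not exactly 13 digits once the hyphens are removed.
    static String normalize(String candidate) {
        String digits = candidate.replace("-", "");
        return THIRTEEN_DIGITS.matcher(digits).matches() ? digits : null;
    }

    public static void main(String[] args) {
        System.out.println(normalize("123-456-789-112-3")); // 13 digits after stripping
        System.out.println(normalize("12-34"));             // too short, so null
    }
}
```

Using `matches()` rather than `find()` here means a token with extra digits or stray characters is rejected instead of being partially matched.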

4 Comments

You'll be hard pressed to find a better combination of readability and code efficiency. You can make it more efficient, but you'll lose a lot of readability.
More efficient how? I'm going to be calling this function potentially thousands of times, so anything would be helpful...
Some regular expressions get compiled into nasty backtracking routines. This one will probably be quite efficient, though. The obvious alternative is to go through the string "manually" in a for loop. If you end up comparing the two approaches, please report your results back here... would be interesting to hear about :-)
Why not use String.replace("-",""); instead?
5

Try this:

Pattern.compile("\\d(-?\\d){12}")

1 Comment

+1 Best answer yet. But I would add a \b word boundary anchor on each end.
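A minimal sketch of why the digit-first form and the suggested word boundaries can matter; the class and helper names are hypothetical. `(\d-?){13}` lets the last digit absorb a trailing hyphen, whereas `\d(-?\d){12}` has to end on a digit, and the `\b` anchors keep it from matching inside a longer run of digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IsbnBoundaryDemo {
    // Pattern from the other answers: each digit may be followed by a hyphen.
    static final Pattern LOOSE = Pattern.compile("(\\d-?){13}");
    // Pattern from this answer, with the word boundaries the comment suggests.
    static final Pattern STRICT = Pattern.compile("\\b\\d(-?\\d){12}\\b");

    // Hypothetical helper: first match of the pattern in the text, or null.
    static String firstMatch(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "ISBN: 1234567891123- (draft)";
        System.out.println(firstMatch(LOOSE, text));   // includes the trailing hyphen
        System.out.println(firstMatch(STRICT, text));  // ends on the last digit
        System.out.println(firstMatch(STRICT, "12345678911234")); // 14 digits: no match
    }
}
```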
3

Use this pattern:

Pattern.compile("(?:\\d-?){13}")

and then strip all dashes from the matched ISBN.


2

Do it in one step with a pattern recognizing everything, and optional dashes between digits. No need to fiddle with ISBN offset + substrings.

ISBN(\d(-?\d){12})

If you want the raw number, strip dashes from the first matched subgroup afterwards. I am not a Java guy so I won't show you code.
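One possible Java rendering of this idea, as a sketch only: the class and method names are invented for illustration. The pattern is used exactly as written above, so it assumes the digits follow "ISBN" immediately; input with a separator such as "ISBN: " would need the pattern extended.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IsbnOneStepDemo {
    // "ISBN" followed directly by 13 digits with optional single hyphens between them.
    private static final Pattern ISBN = Pattern.compile("ISBN(\\d(-?\\d){12})");

    // Hypothetical helper: returns the 13 raw digits, or null if nothing matches.
    static String extractDigits(String source) {
        Matcher m = ISBN.matcher(source);
        if (!m.find()) {
            return null;
        }
        // Group 1 is the number with its hyphens; strip them afterwards.
        return m.group(1).replace("-", "");
    }

    public static void main(String[] args) {
        System.out.println(extractDigits("some text ISBN978-3-16-148410-0 more text"));
        System.out.println(extractDigits("ISBN: 978-3-16-148410-0")); // separator not covered
    }
}
```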


2

If you're going to be calling the method a lot, the best thing you can do is not compile the Pattern inside it. Otherwise, each time you call the method you'll spend more time creating the regex than you will actually searching for it.

But after looking at your code again, I think you have a bigger problem, performance-wise. All that business of locating "ISBN" and then creating substrings to apply the regex to is completely unnecessary. Let the regex do that stuff; it's what they're for. The following regex finds the "ISBN" sentinel and the following thirteen digits, if they're there:

static final Pattern isbnPattern = Pattern.compile(
    "\\bISBN[^A-Z0-9]*+(\\d(?:-*+\\d){12})", Pattern.CASE_INSENSITIVE );

The [^A-Z0-9]*+ gobbles up whatever characters may appear between the "ISBN" and the first digit. The possessive quantifier (*+) prevents needless backtracking; if the next character is not a digit, the regex engine immediately quits that match attempt and resumes scanning for another "ISBN" instance.

I used another possessive quantifier for the optional hyphens, plus a non-capturing group ((?:...)) for the repeated portion; that gives another slight performance gain over the capturing groups most of the other responders are using. But I used a capturing group for the whole number, so it can be extracted from the overall match easily. With these changes, your method reduces to this:

public String parseISBN (String source) {
  Matcher m = isbnPattern.matcher(source); 
  return m.find() ? m.group(1) : null;
}

...and it's much more efficient, too. Note that we haven't addressed how the strings are getting into memory. If you're doing the I/O yourself, it's possible there are significant performance gains to be achieved in that area, too.
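Put together as a self-contained class, the pieces above could be exercised like this. The wrapper class `IsbnParser`, the `static` modifier on the method, and the sample strings are assumptions added for the sketch; the pattern and method body are the ones shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IsbnParser {
    // Compiled once and reused across calls.
    static final Pattern isbnPattern = Pattern.compile(
        "\\bISBN[^A-Z0-9]*+(\\d(?:-*+\\d){12})", Pattern.CASE_INSENSITIVE);

    // Returns the matched number (hyphens included), or null if none is found.
    public static String parseISBN(String source) {
        Matcher m = isbnPattern.matcher(source);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String isbn = parseISBN("Paperback, isbn: 978-3-16-148410-0, 2011");
        System.out.println(isbn);                    // hyphenated form
        System.out.println(isbn.replace("-", ""));   // raw digits, if needed
        System.out.println(parseISBN("ISBN pending")); // no digits follow, so null
    }
}
```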


1

You can strip out the dashes with string manipulation, or you could use this:

"\\b(?:\\d-?){13}\\b"

It has the added bonus of making sure the string doesn't start or end with -.


0

Try stripping the dashes out, then run the regex against the new string.


0

You can try this:

"(?:[0-9]{9}[0-9X]|[0-9]{13}|[0-9][0-9-]{11}[0-9X]|[0-9][0-9-]{15}[0-9])(?![0-9-])"
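A short sketch of how this pattern behaves on a few inputs; the class and helper names are hypothetical. Reading the alternatives, it appears to accept 10-character forms (optionally ending in X) as well as 13-digit ones, and its hyphenated branches expect a fixed overall length (13 or 17 characters), so freely placed hyphens such as 1-2-3-4-5-67891234 are not covered as a whole.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IsbnFormatsDemo {
    // Pattern copied from the answer above.
    static final Pattern ISBN = Pattern.compile(
        "(?:[0-9]{9}[0-9X]|[0-9]{13}|[0-9][0-9-]{11}[0-9X]|[0-9][0-9-]{15}[0-9])(?![0-9-])");

    // Hypothetical helper: first match in the text, or null.
    static String firstMatch(String text) {
        Matcher m = ISBN.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("ISBN 1234567891123"));      // 13 plain digits
        System.out.println(firstMatch("ISBN 123-456-789-112-3"));  // 13 digits, 4 hyphens
        System.out.println(firstMatch("ISBN 0-306-40615-2"));      // 10 digits, 3 hyphens
    }
}
```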
