Extract an ISBN with regex

Question

I have an extremely long string that I want to parse for a numeric value that occurs after the substring "ISBN". However, the 13 digits can be grouped in different ways with the "-" character. Examples (these are all valid ISBNs): 123-456-789-123-4, 1-2-3-4-5-67891234, or 12-34-56-78-91-23-4. Essentially, I want to run a regex pattern matcher on the potential ISBN to see whether it contains a valid 13-digit ISBN. How do I 'ignore' the "-" character so I can just regex for a \d{13} pattern? My function:

public String parseISBN (String sourceCode) {
  int location = sourceCode.indexOf("ISBN") + 5;
  String ISBN = sourceCode.substring(location); //substring after "ISBN" occurs
  int i = 0;
  while ( ISBN.charAt(i) != ' ' )
    i++;
  ISBN = ISBN.substring(0, i); //should contain potential ISBN value
  Pattern pattern = Pattern.compile("\\d{13}"); //this clearly will find 13 consecutive numbers, but I need it to ignore the "-" character
  Matcher matcher = pattern.matcher(ISBN); 
  if (matcher.find()) return ISBN;
  else return null;
}

My suggestion would be to just replace - with nothing, then use your ISBN checking function. If it is correct, you can use whichever form you need.
– Igoris, Commented Aug 18, 2011 at 20:19
Some good regexes in the answers, and I'd certainly use one because it would catch all valid ISBN formats, but I just wanted to point out that the dashes in an ISBN aren't arbitrarily inserted; there are only certain combinations that are valid. (Of course you're not trying to validate the form of the number, just get the number.)
– Stephen P, Commented Aug 18, 2011 at 23:30

aioobe · Accepted Answer · 2011-08-18 20:21:45Z

7

Alternative 1:
```
pattern.matcher(ISBN.replace("-", ""))
```
Alternative 2: Something like
```
Pattern.compile("(\\d-?){13}")
```

Demo of second alternative:

String ISBN = "ISBN: 123-456-789-112-3, ISBN: 1234567891123";

Pattern pattern = Pattern.compile("(\\d-?){13}");
Matcher matcher = pattern.matcher(ISBN);

while (matcher.find())
    System.out.println(matcher.group());

Output:

123-456-789-112-3
1234567891123
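For comparison, the first alternative could be sketched as below. The class and method names are made up for illustration, and the sketch assumes the input is a single candidate token rather than the whole document, since stripping hyphens from a long text could fuse unrelated digit runs. It also uses `matches()` instead of `find()` so that the token must be exactly 13 digits.

```java
import java.util.regex.Pattern;

// Hypothetical helper illustrating Alternative 1; names are not from the original post.
final class StripThenMatch {
    // Compiled once and reused across calls.
    private static final Pattern THIRTEEN_DIGITS = Pattern.compile("\\d{13}");

    // Returns the 13 bare digits if the candidate consists of exactly 13 digits
    // once hyphens are removed, otherwise null.
    static String normalize(String candidate) {
        String digits = candidate.replace("-", "");
        return THIRTEEN_DIGITS.matcher(digits).matches() ? digits : null;
    }

    public static void main(String[] args) {
        System.out.println(normalize("123-456-789-112-3")); // 1234567891123
        System.out.println(normalize("123-456"));           // null
    }
}
```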

4 Comments

corsiKa Over a year ago

You'll be hard pressed to find a better combination of readability and code efficiency. You can make it more efficient, but you'll lose a lot of readability.

Adam Storm Over a year ago

More efficient how so? I'm going to be calling this function potentially thousands of times, so anything would be helpful...

aioobe Over a year ago

Some regular expressions can get compiled into nasty backtracking routines. This one will probably be quite efficient, though. The obvious alternative is to go through the string "manually" in a for loop. If you end up comparing the two approaches, please report your results back here... would be interesting to hear about :-)

Dojo Over a year ago

Why not use String.replace("-",""); instead?
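The hand-written loop mentioned in the comments above might look roughly like this. It is only a sketch: the class name, method name and the `from` parameter are invented here, it accepts the same shapes as `\d(-?\d){12}` starting at a given index, and whether it actually beats the regex would have to be measured.

```java
// Hypothetical hand-written scanner; a sketch of the loop-based approach, not code from the thread.
final class ManualIsbnScan {
    // Reads 13 digits starting at index `from`, allowing at most one hyphen
    // between neighbouring digits. Returns the bare digits, or null on failure.
    static String readIsbn(String s, int from) {
        StringBuilder digits = new StringBuilder(13);
        boolean lastWasHyphen = true; // true at the start so a leading hyphen is rejected
        for (int i = from; i < s.length() && digits.length() < 13; i++) {
            char c = s.charAt(i);
            if (c >= '0' && c <= '9') {
                digits.append(c);
                lastWasHyphen = false;
            } else if (c == '-' && !lastWasHyphen) {
                lastWasHyphen = true;
            } else {
                break; // any other character (or a doubled hyphen) ends the candidate
            }
        }
        return digits.length() == 13 ? digits.toString() : null;
    }
}
```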

Jonathan M · Accepted Answer · 2011-08-18 20:22:00Z

5

Try this:

Pattern.compile("\\d(-?\\d){12}")

1 Comment

ridgerunner Over a year ago

+1 Best answer yet. But I would add a \b word boundary anchor on each end.
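A possible rendering of the word-boundary suggestion, with hypothetical class and method names. The inner group is made non-capturing here, which is a change from the pattern as posted. The point of the `\b` anchors is that a 13-digit slice of a longer digit run no longer counts as a match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the pattern with \b added on both ends.
final class BoundedIsbn {
    private static final Pattern BOUNDED = Pattern.compile("\\b\\d(?:-?\\d){12}\\b");

    // Returns the first match (hyphens included), or null if there is none.
    static String firstMatch(String text) {
        Matcher m = BOUNDED.matcher(text);
        return m.find() ? m.group() : null;
    }
}
```

Without the anchors, a 17-digit run such as `12345678901234567` would yield its first 13 digits; with them it yields no match.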

Sean Patrick Floyd · Accepted Answer · 2011-08-18 20:22:22Z

3

Use this pattern:

Pattern.compile("(?:\\d-?){13}")

and strip all dashes from the ISBN it finds.
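Put together, the two steps could look like the sketch below. `FindThenStrip` and `extractAll` are illustrative names, not part of the answer. Stripping afterwards also removes a trailing hyphen, which this pattern can consume after the last digit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: find every candidate, then normalize it to bare digits.
final class FindThenStrip {
    private static final Pattern CANDIDATE = Pattern.compile("(?:\\d-?){13}");

    static List<String> extractAll(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            // m.group() may still contain hyphens; remove them all.
            result.add(m.group().replace("-", ""));
        }
        return result;
    }
}
```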

Jürgen Strobel · Accepted Answer · 2011-08-18 20:25:55Z

2

Do it in one step with a pattern that recognizes everything, including the optional dashes between digits. No need to fiddle with the ISBN offset and substrings.

ISBN(\d(-?\d){12})

If you want the raw number, strip dashes from the first matched subgroup afterwards. I am not a Java guy so I won't show you code.
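In Java this might be written along the following lines. The `[:\\s]*` part is an added assumption so that inputs such as "ISBN: 123..." or "ISBN 123..." match; the pattern as posted expects the digits immediately after "ISBN". The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical Java rendering of the one-step idea.
final class OneStepIsbn {
    // Assumption: an optional colon and/or whitespace may follow "ISBN".
    private static final Pattern ISBN = Pattern.compile("ISBN[:\\s]*(\\d(-?\\d){12})");

    // Returns the 13 raw digits of the first ISBN found, or null.
    static String parse(String text) {
        Matcher m = ISBN.matcher(text);
        // Group 1 is the number with its hyphens; strip them afterwards.
        return m.find() ? m.group(1).replace("-", "") : null;
    }
}
```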

Alan Moore · Accepted Answer · 2011-08-18 23:51:52Z

If you're going to be calling the method a lot, the best thing you can do is not compile the Pattern inside it. Otherwise, each time you call the method you'll spend more time creating the regex than you will actually searching for it.

But after looking at your code again, I think you have a bigger problem, performance-wise. All that business of locating "ISBN" and then creating substrings to apply the regex to is completely unnecessary. Let the regex do that stuff; it's what they're for. The following regex finds the "ISBN" sentinel and the following thirteen digits, if they're there:

static final Pattern isbnPattern = Pattern.compile(
    "\\bISBN[^A-Z0-9]*+(\\d(?:-*+\\d){12})", Pattern.CASE_INSENSITIVE );

The [^A-Z0-9]*+ gobbles up whatever characters may appear between the "ISBN" and the first digit. The possessive quantifier (*+) prevents needless backtracking; if the next character is not a digit, the regex engine immediately quits that match attempt and resumes scanning for another "ISBN" instance.

I used another possessive quantifier for the optional hyphens, plus a non-capturing group ((?:...)) for the repeated portion; that gives another slight performance gain over the capturing groups most of the other responders are using. But I used a capturing group for the whole number, so it can be extracted from the overall match easily. With these changes, your method reduces to this:

public String parseISBN (String source) {
  Matcher m = isbnPattern.matcher(source); 
  return m.find() ? m.group(1) : null;
}

...and it's much more efficient, too. Note that we haven't addressed how the strings are getting into memory. If you're doing the I/O yourself, it's possible there are significant performance gains to be achieved in that area, too.
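A usage sketch for the method above, wrapped in a hypothetical class `IsbnParserDemo` so it can be run on its own; the sample inputs are invented. Note that the method returns the number with its hyphens intact, and that `-*+` accepts runs of hyphens as well as single ones.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class; the pattern and method body are the ones shown above.
final class IsbnParserDemo {
    // Compiled once for the lifetime of the class instead of once per call.
    static final Pattern isbnPattern = Pattern.compile(
        "\\bISBN[^A-Z0-9]*+(\\d(?:-*+\\d){12})", Pattern.CASE_INSENSITIVE);

    static String parseISBN(String source) {
        Matcher m = isbnPattern.matcher(source);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(parseISBN("Title ... ISBN: 123-456-789-123-4 ...")); // 123-456-789-123-4
        System.out.println(parseISBN("ISBN: 12345"));                           // null
    }
}
```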

Justin Morgan · Accepted Answer · 2011-08-19 14:10:03Z

1

You can strip out the dashes with string manipulation, or you could use this:

"\\b(?:\\d-?){13}\\b"

It has the added bonus of making sure the string doesn't start or end with -.
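The difference shows up when a hyphen directly follows the thirteenth digit. A small comparison, using invented class and method names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical side-by-side of the pattern with and without \b anchors.
final class BoundaryComparison {
    private static final Pattern PLAIN = Pattern.compile("(?:\\d-?){13}");
    private static final Pattern BOUNDED = Pattern.compile("\\b(?:\\d-?){13}\\b");

    private static String first(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }

    static String plain(String text)   { return first(PLAIN, text); }
    static String bounded(String text) { return first(BOUNDED, text); }

    public static void main(String[] args) {
        String text = "1234567891123- next";
        System.out.println(plain(text));   // 1234567891123-
        System.out.println(bounded(text)); // 1234567891123
    }
}
```

In the bounded version the final `-?` first grabs the hyphen, the `\b` check fails between the hyphen and the space, and the engine backtracks to end the match on the digit.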

edited Aug 19, 2011 at 14:10

answered Aug 18, 2011 at 20:22

beefyhalo · Accepted Answer · 2011-08-18 20:20:25Z

0

Try stripping the dashes out, then run the regex on the new string.

Pankaj Rastogi · Accepted Answer · 2022-09-13 15:05:53Z

0

You can try this:

"(?:[0-9]{9}[0-9X]|[0-9]{13}|[0-9][0-9-]{11}[0-9X]|[0-9][0-9-]{15}[0-9])(?![0-9-])"
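Reading the alternatives, this pattern constrains the total length (10, 13 or 17 characters) rather than the hyphen positions, and it also accepts 10-character ISBNs ending in X. A sketch of how it behaves when applied to a single token with `matches()`; the class and method names are hypothetical, and using `matches()` rather than `find()` is a choice made here, not something the answer specifies.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper that tests one token against the length-based pattern.
final class FixedLayoutIsbn {
    private static final Pattern ISBN = Pattern.compile(
        "(?:[0-9]{9}[0-9X]|[0-9]{13}|[0-9][0-9-]{11}[0-9X]|[0-9][0-9-]{15}[0-9])(?![0-9-])");

    // True if the whole token fits one of the four layouts.
    static boolean looksLikeIsbn(String token) {
        return ISBN.matcher(token).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeIsbn("978-0-306-40615-7"));   // true (17 characters)
        System.out.println(looksLikeIsbn("030640615X"));          // true (10 characters)
        System.out.println(looksLikeIsbn("12-34-56-78-91-23-4")); // false (19 characters)
    }
}
```

So the freely hyphenated examples from the question, such as the one with six hyphens, are not accepted by this pattern.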
