
I have very little experience with regex, so thanks in advance.

I have a string like this:

Doe Jane, Doe John. The Works of Dr. Suess. Harvard Press(1984).

I am using string.split(regex) to split the string into a String[] based on the regex I supply. I want to split it into authors, title, and publication info. The problem is that splitting on [.] alone also breaks the string after Dr.

How can I write a regex that splits on '.' but not when it is part of something like 'Dr.' or 'Mr.'?

Thanks

    If you want to write a general parser for bibliography entries, you'll have to whip up something "smarter" than a regex. Using just a regex means that it will have to account for every possible period-delimited abbreviation, which is basically not feasible. Commented Feb 7, 2012 at 19:09

4 Answers

4

I'd recommend using a specialized package for parsing bibliography entries, such as ParsCit.

I've tried their Web interface, and it seems to correctly parse your example out of the box.

With regular expressions, you'll be faced with an uphill struggle in that you'll have to figure out and account for every single possible use of the full stop in a title.
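To make that struggle concrete, here is a minimal sketch that builds the split pattern from a list of abbreviations. The class name, method names, and the abbreviation list are hypothetical and chosen only for illustration; they do not come from ParsCit or any other library. It handles the example from the question, but any abbreviation missing from the list (here "vs.") still splits the title.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class AbbreviationSplitter {
    // Hypothetical, deliberately incomplete list of abbreviations (an assumption
    // made for this sketch, not taken from any library).
    static final List<String> ABBREVIATIONS =
            List.of("Dr", "Mr", "Mrs", "Ms", "Jr", "Sr");

    // Builds a pattern equivalent to "(?<!Dr|Mr|...)\.\s*": split on a full stop
    // plus trailing whitespace, unless a listed abbreviation ends right before it.
    static Pattern buildSplitter(List<String> abbreviations) {
        String alternatives = abbreviations.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("(?<!" + alternatives + ")\\.\\s*");
    }

    static String[] split(String entry) {
        return buildSplitter(ABBREVIATIONS).split(entry);
    }

    public static void main(String[] args) {
        // "Dr." is covered by the list, so this yields three fields.
        String covered = "Doe Jane, Doe John. The Works of Dr. Suess. Harvard Press(1984).";
        for (String field : split(covered)) {
            System.out.println(field);
        }

        // "vs." is not in the list, so the title is cut in two (four fields).
        String uncovered = "Doe Jane. Cats vs. Dogs. Harvard Press(1984).";
        for (String field : split(uncovered)) {
            System.out.println(field);
        }
    }
}
```

Extending the list fixes one case at a time, which is why a dedicated parser tends to be the more robust choice for arbitrary entries.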


1

You can use a negative lookbehind, which matches a dot only when it is not immediately preceded by Dr or Mr:

(?<!Dr|Mr)\.
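As a rough sketch of how this pattern could be applied to the string from the question in Java: the class and method names below are made up for illustration, and the trailing `\\s*` is an addition (not part of the pattern above) so that the whitespace after each full stop is consumed as well.

```java
public class CitationSplitDemo {
    // Split on a full stop (and any whitespace after it) unless the stop
    // directly follows "Dr" or "Mr". Backslashes are doubled for the Java string.
    static String[] splitCitation(String citation) {
        return citation.split("(?<!Dr|Mr)\\.\\s*");
    }

    public static void main(String[] args) {
        String citation = "Doe Jane, Doe John. The Works of Dr. Suess. Harvard Press(1984).";
        String[] parts = splitCitation(citation);

        // This sketch assumes exactly three fields: authors, title, publication info.
        if (parts.length == 3) {
            System.out.println("Authors:     " + parts[0]);
            System.out.println("Title:       " + parts[1]);
            System.out.println("Publication: " + parts[2]);
        } else {
            System.out.println("Unexpected number of fields: " + parts.length);
        }
    }
}
```

String.split drops trailing empty strings, so the final full stop does not produce an extra empty field.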


1

Use a negative lookbehind in the regex, like this:

str.split("(?<![DMJS]r)\\.\\s*");

Sample Code:

String str = "Mr. Doe Jane, Doe John Sr.. The Works of Dr. Suess. Harvard Press(1984).";
// Split on a dot (plus trailing whitespace) unless it follows Dr, Mr, Jr or Sr
String[] arr = str.split("(?<![DMJS]r)\\.\\s*");
for (int i = 0; i < arr.length; i++)
    System.out.println(arr[i]);

OUTPUT:

Mr. Doe Jane, Doe John Sr.
The Works of Dr. Suess
Harvard Press(1984)


0

This has to use some sort of negative lookbehind, like in this sample:

String input = "Doe Jane, Doe John. The Works of Dr. Suess. Harvard Press(1984)";
String[] tokens = input.split("(?<!Dr|Mr)\\.");
for (String token : tokens) {
    // this prints 3 tokens; trim() removes the space left after each dot
    System.out.println(token.trim());
}

What this says is: split on . (dot), but only if the text immediately behind the dot (the (?< part is the lookbehind, and ! negates it) is neither Dr nor Mr (| means "or").

Cheers, Eugene.
