
I am working on a project that takes a file and saves its sections. Section numbers can be

1.

2.

3.

etc., but can also be

1.1

2.3.1.II.

etc.

I know the basics of reading the file. What I need to know is whether there is a good way to detect the section headings and split the text into sections.

I thought about using regex, but I don't know enough regex to do it. Any suggestions?

Update

example:

1. Position
1.1. Position.
1.2. Scope
1.3. Location. 
2. Compensation
2.1. Schedule
2.2. 
3. Term
3.1. Term.
3.1.i. bla
3.1.ii. bla bla
  • Can you share some sample data? And expected output? Commented Jan 28, 2019 at 18:39
  • Using regex seems like a good lead. If your section lines contain only the section number, you can avoid false positives by testing whether anything follows the match (for instance, [your regex] vs. [your regex] something else). Commented Jan 28, 2019 at 18:47
  • Are your sections guaranteed to start with numerals? Commented Jan 28, 2019 at 18:51
  • @PushpeshKumarRajwanshi added example Commented Jan 28, 2019 at 18:57
  • @Compass yes, they are guaranteed Commented Jan 28, 2019 at 18:58

2 Answers

You can use this regex to capture the section number in group 1 and the paragraph text in group 2.

^((?:[a-zA-Z\d]{1,2}\.)+)\s+(.*)

Here, ^((?:[a-zA-Z\d]{1,2}\.)+) captures the section number: one or two alphanumeric characters followed by a literal dot, with that unit repeated one or more times. \s+ then matches the whitespace that follows, and (.*) captures the rest of the line, which is assumed to be the paragraph text. This is based on your sample data. If you need other cases handled differently, please add more samples and I will refine the solution.

Demo

Here is some sample Java code:

import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

List<String> list = Arrays.asList("1. Position", "1.1. Position.", "1.2. Scope", "1.3. Location. ",
        "2. Compensation", "2.1. Schedule", "2.2. ", "3. Term", "3.1. Term.", "3.1.i. bla", "3.1.ii. bla bla",
        "12.a. some para", "13.a. some para", "1.a. some para", "A.1.a. another para", "B.1.a. some para");
// Group 1 is the section number, group 2 is the rest of the line.
Pattern p = Pattern.compile("^((?:[a-zA-Z\\d]{1,2}\\.)+)\\s+(.*)");

list.forEach(x -> {
    Matcher m = p.matcher(x);
    // matches() succeeds only if the whole line fits the pattern.
    if (m.matches()) {
        System.out.println(x + " --> " + "number section: ("+m.group(1)+")" + " para section: ("+m.group(2)+")");
    }
});

Prints,

1. Position --> number section: (1.) para section: (Position)
1.1. Position. --> number section: (1.1.) para section: (Position.)
1.2. Scope --> number section: (1.2.) para section: (Scope)
1.3. Location.  --> number section: (1.3.) para section: (Location. )
2. Compensation --> number section: (2.) para section: (Compensation)
2.1. Schedule --> number section: (2.1.) para section: (Schedule)
2.2.  --> number section: (2.2.) para section: ()
3. Term --> number section: (3.) para section: (Term)
3.1. Term. --> number section: (3.1.) para section: (Term.)
3.1.i. bla --> number section: (3.1.i.) para section: (bla)
3.1.ii. bla bla --> number section: (3.1.ii.) para section: (bla bla)
12.a. some para --> number section: (12.a.) para section: (some para)
13.a. some para --> number section: (13.a.) para section: (some para)
1.a. some para --> number section: (1.a.) para section: (some para)
A.1.a. another para --> number section: (A.1.a.) para section: (another para)
B.1.a. some para --> number section: (B.1.a.) para section: (some para)
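
Going one step further toward the original goal of saving each section, here is a minimal sketch that groups the lines of a file under the heading that precedes them. It reuses the pattern above. The class name SectionSplitter and the method split are made up for illustration, and the sketch assumes that section numbers are unique, that a heading is followed by at least one whitespace character, and that any text before the first heading can be dropped.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original answer.
final class SectionSplitter {
    // Group 1 is the section number, group 2 is the rest of the heading line.
    private static final Pattern HEADING =
            Pattern.compile("^((?:[a-zA-Z\\d]{1,2}\\.)+)\\s+(.*)");

    // Maps each section number to its text (heading text plus the lines that follow it).
    static Map<String, String> split(List<String> lines) {
        Map<String, String> sections = new LinkedHashMap<>(); // keeps document order
        String current = null;
        StringBuilder body = new StringBuilder();
        for (String line : lines) {
            Matcher m = HEADING.matcher(line);
            if (m.matches()) {
                // A new heading closes the previous section, if there was one.
                if (current != null) {
                    sections.put(current, body.toString().trim());
                }
                current = m.group(1);
                body = new StringBuilder(m.group(2));
            } else if (current != null) {
                // A non-heading line belongs to the section currently open.
                body.append('\n').append(line);
            }
            // Assumption: lines before the first heading are ignored.
        }
        if (current != null) {
            sections.put(current, body.toString().trim());
        }
        return sections;
    }
}
```

The lines could come from Files.readAllLines(path), for example. A LinkedHashMap is used so that the sections come back in the order they appear in the file.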

2 Comments

This works well, but it also gives some false positives, such as a line that starts with "Inventions."
If you have that kind of data, we can restrict the length to 2 characters (or maybe 3, as your needs dictate) so that it doesn't match a longer word. Let me update my post.
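
To make the trade-off in these comments concrete, here is a small sketch comparing the unbounded quantifier (+) with the bounded one ({1,2}). The class name QuantifierCheck and the sample lines are made up for illustration. Only the word "Inventions." comes from the comment above.

```java
import java.util.regex.Pattern;

// Hypothetical comparison for illustration; not part of the original answer.
final class QuantifierCheck {
    // Any number of alphanumeric characters before each dot.
    static final Pattern UNBOUNDED =
            Pattern.compile("^((?:[a-zA-Z\\d]+\\.)+)\\s+(.*)");
    // At most two alphanumeric characters before each dot.
    static final Pattern BOUNDED =
            Pattern.compile("^((?:[a-zA-Z\\d]{1,2}\\.)+)\\s+(.*)");

    static boolean isHeading(Pattern p, String line) {
        return p.matcher(line).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "Inventions. The Employee agrees", // ordinary sentence (made-up text)
            "3.1.ii. bla bla",                 // heading from the question
            "3.1.iii. bla"                     // made-up heading with a 3-letter numeral
        };
        for (String s : samples) {
            System.out.println(s + " | unbounded: " + isHeading(UNBOUNDED, s)
                    + " | bounded: " + isHeading(BOUNDED, s));
        }
    }
}
```

The unbounded pattern accepts all three lines, including the false positive. The bounded pattern rejects "Inventions." but also rejects "3.1.iii.", so if your numbering can reach iii you would probably want {1,3}, at the cost of also matching three-letter words that are followed by a dot.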

You can match headings with a regex like this one (assuming that the Roman numerals only go up to X):

^((?:(?:\d+|I{1,3}|IV|VI{0,3}|IX|X)\.)+)

Demo

Explanation:

  • ^ beginning of the line
  • \d+|I{1,3}|IV|VI{0,3}|IX|X matches a numeral:
    • \d+ digits
    • I{1,3}|IV|VI{0,3}|IX|X Roman numerals (I to X)
  • (?:...) non-capturing groups
  • \. dot separating the numerals
  • (...)+ repeats the numeral-plus-dot group one or more times

EDIT:

In Java you need to escape the backslashes in the pattern string (so that the Java compiler passes them through to the regex engine) and use Pattern.MULTILINE (so that ^ matches at the beginning of every line, not only at the beginning of the input):

Pattern.compile("^((?:(?:\\d+|I{1,3}|IV|VI{0,3}|IX|X)\\.)+)", Pattern.MULTILINE)
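
As a rough usage sketch, the compiled pattern can be run over the whole file content with Matcher.find() to collect every heading number. The class name HeadingFinder and the method findHeadings are made up for illustration. Pattern.CASE_INSENSITIVE is an addition that is not in the pattern above: without it, the lowercase numerals in the question's sample would not match, and "3.1.ii. bla bla" would yield only "3.1.".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original answer.
final class HeadingFinder {
    // MULTILINE: ^ matches at the start of every line.
    // CASE_INSENSITIVE (an assumption added here): accept "i", "ii", ... as well as "I", "II", ...
    private static final Pattern HEADING = Pattern.compile(
            "^((?:(?:\\d+|I{1,3}|IV|VI{0,3}|IX|X)\\.)+)",
            Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);

    // Returns the section numbers found at the start of lines, in document order.
    static List<String> findHeadings(String text) {
        List<String> headings = new ArrayList<>();
        Matcher m = HEADING.matcher(text);
        while (m.find()) {
            headings.add(m.group(1));
        }
        return headings;
    }
}
```

Because each unit must be a digit run or a Roman numeral followed directly by a dot, a line that starts with an ordinary word such as "Inventions." is skipped: the leading "I" matches, but the character after it is not a dot.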
