Java - splitting a file into sections

Question

I am working on a project that takes a file and saves its sections. Section numbers can be

1.

2.

3.

etc., but can also be

1.1

2.3.1.II.

etc.

I know the basics of how to read the file. What I need to know is whether there is a good way to detect the section numbers and separate the text into sections.

I thought about regex, but I don't know enough regex to do it. Any suggestions?

Update

example:

1. Position
1.1. Position.
1.2. Scope
1.3. Location. 
2. Compensation
2.1. Schedule
2.2. 
3. Term
3.1. Term.
3.1.i. bla
3.1.ii. bla bla

Using regex seems to be a good lead. If your section lines contain only the regex match, you can avoid false positives by testing whether anything follows the match (for instance, [your regex] vs. [your regex] something else).
– Romano, Commented Jan 28, 2019 at 18:47

Pushpesh Kumar Rajwanshi · Accepted Answer · 2019-01-29 06:02:13Z


You can use this regex to capture the section number in group 1 and the paragraph text in group 2.

^((?:[a-zA-Z\d]{1,2}\.)+)\s+(.*)

Here, ^((?:[a-zA-Z\d]{1,2}\.)+) captures the section number: one or two alphanumeric characters followed by a literal dot, with that whole unit repeated one or more times. Next, \s+ matches the whitespace after the number, and (.*) captures the remaining text, which is assumed to be the paragraph. This is what I came up with for your sample data. If you need other cases handled differently, please add more samples and I will refine the solution.


Here is some sample Java code:

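import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SectionDemo {
    public static void main(String[] args) {
        List<String> list = Arrays.asList("1. Position", "1.1. Position.", "1.2. Scope", "1.3. Location. ",
                "2. Compensation", "2.1. Schedule", "2.2. ", "3. Term", "3.1. Term.", "3.1.i. bla", "3.1.ii. bla bla",
                "12.a. some para", "13.a. some para", "1.a. some para", "A.1.a. another para", "B.1.a. some para");
        // Group 1: section number (1-2 alphanumerics plus a dot, repeated); group 2: rest of the line.
        Pattern p = Pattern.compile("^((?:[a-zA-Z\\d]{1,2}\\.)+)\\s+(.*)");

        list.forEach(x -> {
            Matcher m = p.matcher(x);
            if (m.matches()) { // matches() requires the whole line to fit the pattern
                System.out.println(x + " --> " + "number section: (" + m.group(1) + ")" + " para section: (" + m.group(2) + ")");
            }
        });
    }
}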

This prints:

1. Position --> number section: (1.) para section: (Position)
1.1. Position. --> number section: (1.1.) para section: (Position.)
1.2. Scope --> number section: (1.2.) para section: (Scope)
1.3. Location.  --> number section: (1.3.) para section: (Location. )
2. Compensation --> number section: (2.) para section: (Compensation)
2.1. Schedule --> number section: (2.1.) para section: (Schedule)
2.2.  --> number section: (2.2.) para section: ()
3. Term --> number section: (3.) para section: (Term)
3.1. Term. --> number section: (3.1.) para section: (Term.)
3.1.i. bla --> number section: (3.1.i.) para section: (bla)
3.1.ii. bla bla --> number section: (3.1.ii.) para section: (bla bla)
12.a. some para --> number section: (12.a.) para section: (some para)
13.a. some para --> number section: (13.a.) para section: (some para)
1.a. some para --> number section: (1.a.) para section: (some para)
A.1.a. another para --> number section: (A.1.a.) para section: (another para)
B.1.a. some para --> number section: (B.1.a.) para section: (some para)
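
To go from matching single lines to actually splitting a file into sections, one option is to start a new section whenever a line matches and to attach every non-matching line to the most recent section. The following is only a minimal sketch under that assumption; the class name SectionSplitter, the method split, and the sample body lines are made up for illustration, and the lines are assumed to have been read from the file already.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SectionSplitter {
    // Same heading pattern as in the answer: 1-2 alphanumerics plus a dot, repeated.
    private static final Pattern HEADING =
            Pattern.compile("^((?:[a-zA-Z\\d]{1,2}\\.)+)\\s+(.*)");

    /** Maps each section number to its lines (heading text first, then body lines). */
    public static Map<String, List<String>> split(List<String> lines) {
        Map<String, List<String>> sections = new LinkedHashMap<>(); // keeps file order
        List<String> current = null; // lines before the first heading are dropped
        for (String line : lines) {
            Matcher m = HEADING.matcher(line);
            if (m.matches()) {
                // A heading line opens a new section.
                current = new ArrayList<>();
                current.add(m.group(2));
                sections.put(m.group(1), current);
            } else if (current != null) {
                // Any other line belongs to the section opened last.
                current.add(line);
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "1. Position", "The employee is hired as an engineer.",
                "1.1. Scope", "2. Compensation", "Paid monthly.");
        split(lines).forEach((num, body) -> System.out.println(num + " -> " + body));
    }
}
```

If the same section number appears twice in a file, the later section replaces the earlier one in this sketch, so a real implementation may want a list of sections instead of a map.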

edited Jan 29, 2019 at 6:02

answered Jan 28, 2019 at 19:22

Pushpesh Kumar Rajwanshi


2 Comments

No Idea For Name Over a year ago

This works well, but it also gives some false positives, such as a line that starts with "Inventions."

Pushpesh Kumar Rajwanshi Over a year ago

If you have that kind of data, we can restrict each component to 2 characters (or maybe 3, depending on your needs) so that it doesn't match a longer word. Let me update my post.
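
The difference between the unbounded and the length-restricted pattern can be checked with a short comparison. This is just a sketch; the class name FalsePositiveCheck and the sample sentence are invented for illustration.

```java
import java.util.regex.Pattern;

public class FalsePositiveCheck {
    // Unbounded variant: any run of alphanumerics before a dot counts as a number.
    static final Pattern LOOSE = Pattern.compile("^((?:[a-zA-Z\\d]+\\.)+)\\s+(.*)");
    // Bounded variant: each component is at most two characters long.
    static final Pattern STRICT = Pattern.compile("^((?:[a-zA-Z\\d]{1,2}\\.)+)\\s+(.*)");

    public static void main(String[] args) {
        String prose = "Inventions. The employee assigns all inventions.";
        String heading = "3.1.ii. bla bla";
        System.out.println(LOOSE.matcher(prose).matches());    // true
        System.out.println(STRICT.matcher(prose).matches());   // false
        System.out.println(STRICT.matcher(heading).matches()); // true
    }
}
```

The length limit only reduces false positives. A line that starts with a short word and a dot, such as "No. ", still matches the bounded pattern.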

mrzasa · Accepted Answer · 2019-01-28 22:23:23Z


You can match headings with a regex like this one (assuming that Roman numerals go up to X):

^((?:(?:\d+|I{1,3}|IV|VI{0,3}|IX|X)\.)+)


Explanation:

^ beginning of the line
\d+|I{1,3}|IV|VI{0,3}|IX|X - matches a numeral:
- \d+ digits
- I{1,3}|IV|VI{0,3}|IX|X Roman numerals
(?:...) non capturing groups
\. dot separating the numerals
(...)+ repeating NUMERAL DOT groups once or more

EDIT:

In Java you need to escape the backslashes in the string literal (so that Java interprets the pattern correctly) and use Pattern.MULTILINE (so that ^ marks the beginning of each line, not just the beginning of the string):

Pattern.compile("^((?:(?:\\d+|I{1,3}|IV|VI{0,3}|IX|X)\\.)+)", Pattern.MULTILINE)
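
With Pattern.MULTILINE, the pattern can be run over the whole file content as a single string, and the match offsets can be used to cut the text into sections. The following is a rough sketch of that idea, not a complete solution; the class name MultilineHeadings, the method split, and the sample text are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultilineHeadings {
    static final Pattern HEADING = Pattern.compile(
            "^((?:(?:\\d+|I{1,3}|IV|VI{0,3}|IX|X)\\.)+)", Pattern.MULTILINE);

    /** Cuts the text at every heading match; each piece starts with its heading. */
    public static List<String> split(String text) {
        // First pass: record the offset at which every heading starts.
        List<Integer> starts = new ArrayList<>();
        Matcher m = HEADING.matcher(text);
        while (m.find()) {
            starts.add(m.start());
        }
        // Second pass: each section runs from its heading to the next heading.
        List<String> sections = new ArrayList<>();
        for (int i = 0; i < starts.size(); i++) {
            int end = (i + 1 < starts.size()) ? starts.get(i + 1) : text.length();
            sections.add(text.substring(starts.get(i), end).trim());
        }
        return sections;
    }

    public static void main(String[] args) {
        String text = "1. Position\nSome text.\n2.3.1.II. Scope\nMore text.\n";
        split(text).forEach(s -> System.out.println("[" + s + "]"));
    }
}
```

Note that this pattern only recognizes uppercase Roman numerals, so a heading such as 3.1.ii. from the question would be matched only up to 3.1. unless the pattern is extended or compiled with Pattern.CASE_INSENSITIVE as well.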

edited Jan 28, 2019 at 22:23

answered Jan 28, 2019 at 18:47

mrzasa

