3

I have a file in the following format. Records are separated by newlines, but some records contain line breaks themselves, as shown below. I need to read each record and process it separately. The file could be a few MB in size.

 <?aaaaa>
 <?bbbb
     bb>
 <?cccccc>

I have the code:

 FileInputStream fs = new FileInputStream(FILE_PATH_NAME);
 Scanner scanner = new Scanner(fs);
 scanner.useDelimiter(Pattern.compile("<\\?"));
 while (scanner.hasNext()) {
     String line = scanner.next();
     System.out.println(line);
 } 
 scanner.close();

But in the output I get, the leading <? has been removed from each record:

aaaaa>
bbbb
   bb>
cccccc>

I know that Scanner consumes any input that matches the delimiter pattern. All I can think of is adding the delimiter back to each record manually.

Is there a way to NOT have the delimiter pattern removed?

3 Answers

5

Break on a newline only when preceded by a ">" char:

scanner.useDelimiter("(?<=>)\\R"); // Note you can pass a string directly

\R matches any line break sequence (\n, \r\n and others), so it is system independent; it is available from Java 8 onwards
(?<=>) is a look behind that asserts (without consuming) that the previous char is a >

Plus it's cool because <=> looks like Darth Vader's TIE fighter.
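
For reference, here is a minimal self-contained sketch of this delimiter in use. The class name RecordSplitter and the helper splitRecords are made up for illustration, and the Scanner reads from a String to keep the example short; a Scanner over a FileInputStream should be configured the same way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class RecordSplitter {

    // Hypothetical helper: splits the input on line breaks that directly follow '>'.
    static List<String> splitRecords(String input) {
        List<String> records = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            // The look behind checks for '>' without consuming it,
            // so only the line break itself is dropped from each record.
            scanner.useDelimiter("(?<=>)\\R");
            while (scanner.hasNext()) {
                records.add(scanner.next());
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String input = "<?aaaaa>\n<?bbbb\n     bb>\n<?cccccc>\n";
        for (String record : splitRecords(input)) {
            System.out.println("[" + record + "]");
        }
    }
}
```

With the sample data from the question this yields three records, and the second one still contains its internal line break.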


5 Comments

I tested with more records, and with this approach some records ended up on the same line. Can you please help?
@jlp do you mean a "missing" newline like in "<?aaa>\n<?bbb><?ccc>\n<?ddd>" between bbb and ccc?
btw, if that is the case it's easy to handle - just add ? to the end of the regex
@Bohemian Hi, sorry, I have another question. If the data is like "<aaaa><bbbb><cccc>" with no space or newline between the records, how do I write my regex so that it breaks them into 3 records: <aaaa>, then <bbbb>, then <cccc>? Please let me know if I should create another post for this question. Thanks.
@jlp as per previous comment; add ? to the regex: scanner.useDelimiter("(?<=>)\\R?");. This makes the newline optional, but will consume it if it's there.
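
A small sketch of the optional-newline variant from the comments above, run against both sample inputs mentioned there. The class and method names are hypothetical; only the delimiter string comes from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class OptionalNewlineSplitter {

    // Hypothetical helper: a record ends after every '>', whether or not a line break follows.
    static List<String> splitRecords(String input) {
        List<String> records = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            // The trailing '?' makes the line break optional, so the delimiter
            // may also be an empty match directly after '>'.
            scanner.useDelimiter("(?<=>)\\R?");
            while (scanner.hasNext()) {
                records.add(scanner.next());
            }
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(splitRecords("<aaaa><bbbb><cccc>"));
        System.out.println(splitRecords("<?aaa>\n<?bbb><?ccc>\n<?ddd>"));
    }
}
```

Note that with this variant any '>' inside a record would also end it, so it only fits data where '>' appears solely as the record terminator.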
1

I'm assuming you want to ignore the newline character '\n' everywhere.

I would read the whole file into a String and then remove all of the '\n's in the String. The part of the code this question is about looks like this:

String fileString = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
fileString = fileString.replace("\n", "");
Scanner scanner = new Scanner(fileString);
...  //your code
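
One possible way to finish this approach, as a sketch only: once the line breaks are gone, every record ends at '>', so the string can be split with a zero-width look behind instead of a Scanner. The class InMemorySplitter and the helper splitRecords are hypothetical names, and using \R rather than "\n" (so that Windows line endings are removed too) is an assumption that goes beyond the code above.

```java
import java.util.Arrays;
import java.util.List;

class InMemorySplitter {

    // Hypothetical helper: takes the whole file content as a String.
    static List<String> splitRecords(String fileString) {
        // Remove every line break sequence (\n, \r\n, ...), not just "\n".
        String joined = fileString.replaceAll("\\R", "");
        // Split after each '>' without consuming it.
        return Arrays.asList(joined.split("(?<=>)"));
    }

    public static void main(String[] args) {
        String content = "<?aaaaa>\r\n<?bbbb\r\n     bb>\r\n<?cccccc>\r\n";
        for (String record : splitRecords(content)) {
            System.out.println(record);
        }
    }
}
```

Unlike the Scanner-based answer, this joins a multi-line record into a single line, which matches the assumption that newlines should be ignored everywhere.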

Feel free to ask any further questions you might have!

2 Comments

The file may be a few MB in size; I'm not sure whether storing the entire file in a String is going to cause any issues.
@jlp I wouldn't worry about files being a couple of megabytes in size, but you're right that this approach wouldn't scale very well.
0

Here is one way of doing it by using a StringBuilder:

// Requires: import java.io.File; import java.io.FileNotFoundException; import java.util.Scanner;
public static void main(String[] args) throws FileNotFoundException {
    // try-with-resources closes the Scanner even if reading fails
    try (Scanner in = new Scanner(new File("C:\\test.txt"))) {
        StringBuilder builder = new StringBuilder();

        while (in.hasNextLine()) {
            String input = in.nextLine();
            for (int x = 0; x < input.length(); x++) {
                char c = input.charAt(x);
                builder.append(c);
                // '>' ends a record: print it and start collecting the next one
                if (c == '>') {
                    System.out.println(builder);
                    builder.setLength(0);
                }
            }
        }
    }
}

Input:

 <?aaaaa>
 <?bbbb
     bb>
 <?cccccc>

Output:

 <?aaaaa>
 <?bbbb     bb>
 <?cccccc>
