Java - parsing text file - Scanner, Reader or something else?

Question

I'd like to parse a UTF-8 encoded text file that may contain something like this:

int 1
text " some text with \" and \\ "
int list[-45,54, 435 ,-65]
float list [ 4.0, 5.2,-5.2342e+4]

The numbers in the list are separated by commas. Whitespace is permitted but not required between any number and any symbol, such as the commas and brackets here. The same goes for words and symbols, as in list[.

I've done the quoted string reading by forcing Scanner to give me single chars (setting its delimiter to an empty pattern), because I thought it would still be useful for reading the ints and floats, but I'm not sure anymore.

The Scanner always takes a complete token and then tries to match it. What I need is to match as much (or as little) as possible, disregarding delimiters.

Basically for this input

int list[-45,54, 435 ,-65]

I'd like to be able to call and get this

s.nextWord()   // int 
s.nextWord()   // list
s.nextSymbol() // [
s.nextInt()    // -45
s.nextSymbol() // ,
s.nextInt()    // 54
s.nextSymbol() // ,
s.nextInt()    // 435
s.nextSymbol() // ,
s.nextInt()    // -65
s.nextSymbol() // ]

and so on.

Or, if it couldn't parse doubles and other types itself, at least a method that takes a regex, returns the biggest string that matches it (or an error) and sets the stream position to just after what it matched.

Can the Scanner somehow be used for this? Or is there another approach? I feel this must be quite a common thing to do, but I don't seem to be able to find the right tool for it.

I'd parse that file line by line, using regular expressions to extract the tokens. That would double as a syntax check.
– Dmitry, Commented Sep 3, 2012 at 21:07
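
The line-by-line regex idea from the comment could be sketched roughly as follows. This is only an illustration, not code from the commenter: the class name `LineTokenizer`, the method `tokenize`, and the exact token patterns are assumptions chosen to fit the sample input in the question. Each line is scanned with one alternation pattern, and anything other than whitespace between two tokens is reported as a syntax error.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineTokenizer {
    // One alternative per token kind (hypothetical patterns for the sample input).
    // Order matters: floats are tried before ints so that "4.0" is not split.
    private static final Pattern TOKEN = Pattern.compile(
          "(?<string>\"(?:\\\\.|[^\"\\\\])*\")"
        + "|(?<float>-?(?:\\d+\\.\\d*|\\.\\d+)(?:[eE][+-]?\\d+)?|-?\\d+[eE][+-]?\\d+)"
        + "|(?<int>-?\\d+)"
        + "|(?<word>[A-Za-z_]\\w*)"
        + "|(?<symbol>[\\[\\],])");

    /** Splits one line into tokens; throws if the line contains unrecognized input. */
    public static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        int pos = 0;
        while (m.find()) {
            requireBlank(line, pos, m.start()); // only whitespace may separate tokens
            tokens.add(m.group());
            pos = m.end();
        }
        requireBlank(line, pos, line.length());
        return tokens;
    }

    // Rejects any non-whitespace character in line[from, to).
    private static void requireBlank(String line, int from, int to) {
        for (int i = from; i < to; i++) {
            if (!Character.isWhitespace(line.charAt(i))) {
                throw new IllegalArgumentException(
                    "Unexpected character '" + line.charAt(i) + "' at index " + i);
            }
        }
    }

    public static void main(String[] args) {
        for (String token : tokenize("int list[-45,54, 435 ,-65]")) {
            System.out.println(token);
        }
    }
}
```

A limitation of this sketch is that it only yields the raw token text. To tell an int from a word, the caller would still have to check which named group matched, for example with `m.group("int") != null`.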

davidbuzatto · Accepted Answer · 2012-09-03 22:02:55Z

1

I'm not an ANTLR expert, but this ANTLR grammar is capable of parsing your input:

grammar Expressions;

expressions 
    :   expression+ EOF
    ;

expression 
    :   intExpression
    |   intListExpression
    |   floatExpression
    |   floatListExpression
    |   textExpression
    |   textListExpression
    ;

intExpression        :  intType INT;
intListExpression    :  intType listType '[' ( INT (',' INT)* )? ']';
floatExpression      :  floatType FLOAT;
floatListExpression  :  floatType listType '[' ( (INT|FLOAT) (',' (INT|FLOAT))* )? ']';
textExpression       :  textType STRING;
textListExpression   :  textType listType '[' ( STRING (',' STRING)* )? ']';

intType   :  'int';
floatType :  'float';
textType  :  'text';
listType  :  'list';

// An optional leading '-' lets the lexer accept negative values such as -45.
INT :   '-'? '0'..'9'+
    ;

FLOAT
    :   '-'? ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
    |   '-'? '.' ('0'..'9')+ EXPONENT?
    |   '-'? ('0'..'9')+ EXPONENT
    ;

STRING
    :  '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
    ;

fragment
EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;

fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;

fragment
ESC_SEQ
    :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
    |   UNICODE_ESC
    |   OCTAL_ESC
    ;

fragment
OCTAL_ESC
    :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7')
    ;

fragment
UNICODE_ESC
    :   '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
    ;

WS  :   ( ' '
        | '\t'
        | '\r'
        | '\n'
        ) {$channel=HIDDEN;}
    ;

Of course you will need to improve it, but I think that with this structure it is easy to insert code into the parser to do what you want (a kind of token stream). Try it in the ANTLRWorks debugger to see what happens.

For your input, this is the parse tree:

[Image: parse tree for the OP's input]

Edit: I changed it to support empty lists.

edited Sep 3, 2012 at 22:02

answered Sep 3, 2012 at 21:38

davidbuzatto


2 Comments

Neil Over a year ago

Thank you, this looks like a good way to go. I've never used ANTLR but I guess I should look into it.

davidbuzatto Over a year ago

You are welcome! This is a good book about ANTLR (the author is the mind behind ANTLR): amazon.com/The-Definitive-Antlr-Reference-Domain-Specific/dp/…

Victor Mukherjee · Answer · 2012-09-03 21:15:23Z

0

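Initialize the Scanner with the file in the class constructor. Then, for the nextWord method, do this:

// Assumes sc is a static java.util.Scanner field that was opened on the file.
public static String nextWord() {
    // findInLine ignores delimiters; it returns null if nothing on the current line matches.
    return sc.findInLine("\\w+");
}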

You can derive the code for other methods using the above example with the findInLine method of the Scanner class and changing the regex pattern.
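
One thing to watch for with findInLine is that it searches ahead on the current line and silently skips any input before the match, so asking for a word at `[-45` would return `45` instead of failing. A possible alternative, sketched below, anchors every match at the current position using `Matcher.region` and `Matcher.lookingAt`. This is not the approach from this answer: the class `Tokenizer`, its method names (taken from the question's wish list), and the patterns are all assumptions made for illustration, and the sketch works on text already loaded into memory.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Tokenizer {
    // Hypothetical token patterns, chosen to fit the sample input in the question.
    private static final Pattern WHITESPACE = Pattern.compile("\\s*");
    private static final Pattern WORD = Pattern.compile("[A-Za-z_]\\w*");
    private static final Pattern INT = Pattern.compile("-?\\d+");
    private static final Pattern SYMBOL = Pattern.compile("[\\[\\],]");

    private final CharSequence input;
    private int pos = 0; // index of the next unread character

    public Tokenizer(CharSequence input) {
        this.input = input;
    }

    /**
     * Skips leading whitespace, then requires the pattern to match exactly at
     * the cursor. On success the cursor moves to just after the match.
     */
    public String next(Pattern pattern) {
        Matcher ws = WHITESPACE.matcher(input).region(pos, input.length());
        if (ws.lookingAt()) {
            pos = ws.end();
        }
        Matcher m = pattern.matcher(input).region(pos, input.length());
        if (!m.lookingAt()) {
            throw new IllegalStateException(
                "Expected " + pattern.pattern() + " at index " + pos);
        }
        pos = m.end();
        return m.group();
    }

    public String nextWord()   { return next(WORD); }
    public String nextSymbol() { return next(SYMBOL); }
    public int nextInt()       { return Integer.parseInt(next(INT)); }

    public static void main(String[] args) {
        Tokenizer s = new Tokenizer("int list[-45,54, 435 ,-65]");
        System.out.println(s.nextWord());   // int
        System.out.println(s.nextWord());   // list
        System.out.println(s.nextSymbol()); // [
        System.out.println(s.nextInt());    // -45
        System.out.println(s.nextSymbol()); // ,
        System.out.println(s.nextInt());    // 54
    }
}
```

For a file, the text could be loaded first with something like `new String(Files.readAllBytes(path), StandardCharsets.UTF_8)`. That keeps the sketch simple but reads the whole file into memory, which may not suit very large inputs.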

answered Sep 3, 2012 at 21:15

Victor Mukherjee

