I'd like to parse a UTF-8 encoded text file that may contain something like this:

int 1
text " some text with \" and \\ "
int list[-45,54, 435 ,-65]
float list [ 4.0, 5.2,-5.2342e+4]

The numbers in the list are separated by commas. Whitespace is permitted but not required between a number and a symbol such as a comma or bracket. The same holds between words and symbols, as in list[.

I've implemented reading of quoted strings by forcing Scanner to give me single characters (setting its delimiter to an empty pattern), because I thought that would still be useful for reading the ints and floats, but I'm not sure anymore.

Scanner always takes a complete token and then tries to match it. What I need is to match as much (or as little) as possible, disregarding delimiters.

Basically for this input

int list[-45,54, 435 ,-65]

I'd like to be able to make these calls and get these results:

s.nextWord()   // int 
s.nextWord()   // list
s.nextSymbol() // [
s.nextInt()    // -45
s.nextSymbol() // ,
s.nextInt()    // 54
s.nextSymbol() // ,
s.nextInt()    // 435
s.nextSymbol() // ,
s.nextInt()    // -65
s.nextSymbol() // ]

and so on.

Or, if it can't parse doubles and other types itself, I'd at least like a method that takes a regex, returns the longest string that matches it (or signals an error), and sets the stream position to just after the match.

Can the Scanner somehow be used for this? Or is there another approach? I feel this must be quite a common thing to do, but I don't seem to be able to find the right tool for it.

  • I'd parse that file line by line, using regular expressions to extract the tokens. That would also serve as a syntax check. Commented Sep 3, 2012 at 21:07
  • I would write a parser using ANTLR. Commented Sep 3, 2012 at 21:15

2 Answers

I'm not an ANTLR expert, but this ANTLR grammar can parse your input:

grammar Expressions;

expressions 
    :   expression+ EOF
    ;

expression 
    :   intExpression
    |   intListExpression
    |   floatExpression
    |   floatListExpression
    |   textExpression
    |   textListExpression
    ;

intExpression        :  intType INT;
intListExpression    :  intType listType '[' ( INT (',' INT)* )? ']';
floatExpression      :  floatType FLOAT;
floatListExpression  :  floatType listType '[' ( (INT|FLOAT) (',' (INT|FLOAT))* )? ']';
textExpression       :  textType STRING;
textListExpression   :  textType listType '[' ( STRING (',' STRING)* )? ']';

intType   :  'int';
floatType :  'float';
textType  :  'text';
listType  :  'list';

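// The optional leading '-' lets values such as -45 and -5.2342e+4 lex as single tokens.
INT :   '-'? '0'..'9'+
    ;

FLOAT
    :   '-'? ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
    |   '-'? '.' ('0'..'9')+ EXPONENT?
    |   '-'? ('0'..'9')+ EXPONENT
    ;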

STRING
    :  '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
    ;

fragment
EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;

fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;

fragment
ESC_SEQ
    :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
    |   UNICODE_ESC
    |   OCTAL_ESC
    ;

fragment
OCTAL_ESC
    :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7')
    ;

fragment
UNICODE_ESC
    :   '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
    ;

WS  :   ( ' '
        | '\t'
        | '\r'
        | '\n'
        ) {$channel=HIDDEN;}
    ;

Of course you will need to improve it, but I think that with this structure it is easy to insert code into the parser to do what you want (a kind of token stream). Try it in the ANTLRWorks debugger to see what happens.

For your input, this is the parse tree:

[Image: parse tree for the OP's input]

Edit: I changed it to support empty lists.

2 Comments

Thank you, this looks like a good way to go. I've never used ANTLR but I guess I should look into it.
You are welcome! This is a good book about ANTLR (the author is the mind behind ANTLR): amazon.com/The-Definitive-Antlr-Reference-Domain-Specific/dp/…

Initialize the scanner with the file in the class constructor. Then, for the nextWord method, do this:

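// Assumes sc is a static java.util.Scanner field opened on the file.
public static String nextWord() {
    // findInLine ignores delimiters and returns null if the rest of the line has no match.
    return sc.findInLine("\\w+");
}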

You can derive the code for other methods using the above example with the findInLine method of the Scanner class and changing the regex pattern.
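One caveat when deriving the other methods: findInLine searches forward, so a call such as findInLine("\\d+") would silently jump over a symbol that comes first. If that matters, Scanner.skip may be a better fit, since it only matches at the current position, ignores delimiters, and throws NoSuchElementException without consuming the token when the pattern does not match there. Below is a minimal sketch of that idea. The class name TokenReader, its method names (borrowed from the question) and the token patterns are my own assumptions, not part of the JDK, and the patterns would need adjusting to the real file format.

```java
import java.util.NoSuchElementException;
import java.util.Scanner;
import java.util.regex.Pattern;

/** Hypothetical helper (not a JDK class): reads tokens with anchored matches. */
public class TokenReader {
    // Assumed token shapes; adjust these to the actual format.
    private static final Pattern WHITESPACE = Pattern.compile("\\s*");
    private static final Pattern WORD = Pattern.compile("[A-Za-z_]\\w*");
    private static final Pattern INT = Pattern.compile("-?\\d+");
    private static final Pattern FLOAT =
            Pattern.compile("-?(?:\\d+\\.?\\d*|\\.\\d+)(?:[eE][-+]?\\d+)?");
    private static final Pattern SYMBOL = Pattern.compile("[\\[\\],]");

    private final Scanner sc;

    public TokenReader(Scanner sc) {
        this.sc = sc;
    }

    /**
     * Skips leading whitespace, then matches the pattern at the current
     * position and returns the matched text. Throws NoSuchElementException
     * if the pattern does not match there; the token itself is not consumed.
     */
    public String next(Pattern pattern) {
        sc.skip(WHITESPACE);       // may match the empty string, so this never throws
        sc.skip(pattern);          // anchored match, delimiters are ignored
        return sc.match().group(); // text matched by the last skip
    }

    public String nextWord()   { return next(WORD); }
    public String nextSymbol() { return next(SYMBOL); }
    public int nextInt()       { return Integer.parseInt(next(INT)); }
    public double nextDouble() { return Double.parseDouble(next(FLOAT)); }

    public static void main(String[] args) {
        TokenReader r = new TokenReader(new Scanner("int list[-45,54, 435 ,-65]"));
        System.out.println(r.nextWord());   // int
        System.out.println(r.nextWord());   // list
        System.out.println(r.nextSymbol()); // [
        System.out.println(r.nextInt());    // -45
        System.out.println(r.nextSymbol()); // ,
        try {
            r.nextSymbol();                 // an int comes next, so this fails
        } catch (NoSuchElementException e) {
            System.out.println("not a symbol here");
        }
        System.out.println(r.nextInt());    // 54
    }
}
```

The generic next(Pattern) method is roughly the "takes a regex, returns what matched, moves the position" method asked for in the question. Note that nextInt on input such as 4.0 would return 4 and leave .0 behind, so the caller has to know which kind of number to expect, for example from the preceding int or float keyword.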
