I'm trying to parse large text files with Python.

These files have a syntax like this:

<option1> {
<variable1>=<value1>; //<comment> 
<variable2>=<value2>;
..
<variableN>=<valueN>; //<comment> 
}

<option2> {
<variable1>=<value1>; //<comment> 
<variable2>=<value2>;
..
<variableN>=<valueN>; //<comment> 
}

...
...

<optionN> {
<variable1>=<value1>; //<comment> 
<variable2>=<value2>;
..
<variableN>=<valueN>; //<comment> 
}

I want to get, for instance, the value of <optionK>[<variableT>].

Is there an optimal way to do this using a file parser?

  • @sshashank124: The OP stated the file is huge; regex would require you to read the whole file into memory, perhaps not the most practical advice? Commented Mar 20, 2014 at 9:57
  • @MartijnPieters: mmap allows you to apply regex to a huge file. See How to read tokens without reading whole line or file Commented Mar 20, 2014 at 10:07
  • you could try something like lepl (discontinued) to parse the file, here's a code example Commented Mar 20, 2014 at 10:14
  • @JFSebastian: Can't look it up right now, but Jon Clements the other day had found you couldn't if the file was larger than available memory. But I have no first-hand experience there and I'll happily defer to you. I'd read the file line by line, detecting sections, myself. Commented Mar 20, 2014 at 10:34
  • @MartijnPieters: My answer explicitly says "It works even if the file doesn't fit in memory." I wouldn't have said that if I hadn't tried it. I also would not use a single regex to parse the file. I just mentioned it to say that it is possible Commented Mar 26, 2014 at 21:30
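
For reference, the mmap-plus-regex idea discussed in these comments might look roughly like the sketch below. It is not taken from the linked answer. find_value is a hypothetical helper, and it assumes that each block starts with the option name followed by "{", that the closing "}" sits on its own line, that values contain no ";", and that the file is not empty (mmap refuses to map an empty file).

```python
import mmap
import re


def find_value(path, option, variable):
    """Return the value of `variable` inside block `option`, or None.

    Hypothetical helper: the names are matched literally, so pass them
    exactly as they appear in the file.
    """
    # Block: "<option> {" ... up to a line that starts with "}"
    block_re = re.compile(
        rb"^[ \t]*" + re.escape(option.encode()) + rb"[ \t]*\{(.*?)^[ \t]*\}",
        re.MULTILINE | re.DOTALL,
    )
    # Assignment: "<variable>=<value>;" at the start of a line
    var_re = re.compile(
        rb"^[ \t]*" + re.escape(variable.encode()) + rb"[ \t]*=[ \t]*([^;]*);",
        re.MULTILINE,
    )
    with open(path, "rb") as f:
        # Map the file read-only; the regex then searches the mapping
        # directly instead of a string built with f.read().
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            block = block_re.search(mm)
            if block is None:
                return None
            match = var_re.search(block.group(1))
            return match.group(1).decode().strip() if match else None


# Usage (file name is a placeholder):
#     find_value("settings.txt", "option2", "variable1")
```

Only the matched block is copied out of the mapping, so the rest of the file is paged in by the operating system as the search walks over it.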

1 Answer

For the example above, one (admittedly ugly) solution is to treat each <...> token as a tag and use HTMLParser (http://docs.python.org/2/library/htmlparser.html) as follows:

test = """
<option1> {
<variable1>=<value1>; //<comment>
<variable2>=<value2>;
..
<variableN>=<valueN>; //<comment>
}

<option2> {
<variable1>=<value1>; //<comment>
<variable2>=<value2>;
..
<variableN>=<valueN>; //<comment>
}

...
...

<optionN> {
<variable1>=<value1>; //<comment>
<variable2>=<value2>;
..
<variableN>=<valueN>; //<comment>
}

"""

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

# Create a subclass and override the handler methods.
# Note: HTMLParser lowercases tag names, so "variableN" is stored as "variablen".
class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.option = ""
        self.key = ""
        self.value = ""
        self.currentData = ""  # text seen since the previous tag
        self.r = {}            # result: {option: {variable: value}}

    def handle_starttag(self, tag, attrs):
        data = self.currentData
        if "option" in tag:
            # a new <optionK> block starts
            self.option = tag
            self.r[self.option] = {}
        elif "{" in data or ("=" not in data and "//" not in data):
            # first tag after "{" or after the end of a line: a variable name
            self.key = tag
            self.r[self.option][self.key] = ""
        elif "=" in data:
            # tag that follows "=": the value
            self.value = tag
            self.r[self.option][self.key] = self.value
        # any other tag follows "//" and is a comment, so it is skipped

    def handle_endtag(self, tag):
        # reset to an empty string (not None) so the "in" tests above stay valid
        self.currentData = ""

    def handle_data(self, data):
        self.currentData = data
        # a "}" in data marks the end of a block; results could be yielded here

# Instantiate the parser and feed it the sample text
parser = MyHTMLParser()
parser.feed(test)
print(parser.r)
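
Since the file is large and only one value is needed, a plain line-by-line scan may be simpler than pushing the text through HTMLParser. The sketch below is only one way to do it: iter_settings and lookup are hypothetical names, the option and variable names are compared literally as they appear in the file, and it assumes one assignment per line and no "//" inside a value.

```python
def iter_settings(lines):
    """Yield (option, variable, value) triples from an iterable of lines."""
    option = None
    for raw in lines:
        # Drop a trailing "//" comment, then surrounding whitespace.
        line = raw.split("//", 1)[0].strip()
        if not line:
            continue
        if line.endswith("{"):
            option = line[:-1].strip()      # start of a block
        elif line == "}":
            option = None                   # end of the block
        elif option is not None and "=" in line:
            name, _, value = line.partition("=")
            # Strip the terminating ";" and any spaces around the value.
            yield option, name.strip(), value.strip().rstrip(";").strip()


def lookup(path, option, variable):
    """Return the first matching value, or None if it is not found."""
    with open(path) as f:
        # Iterating over the file object reads one line at a time,
        # so memory use does not grow with the file size.
        for opt, name, value in iter_settings(f):
            if opt == option and name == variable:
                return value
    return None


# Usage (file name is a placeholder):
#     lookup("settings.txt", "option2", "variable1")
```

The scan stops at the first match, so a lookup near the top of the file returns without reading the rest.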