I have many markdown files with titles, subheadings, sub-subheadings etc.
I'm interested in parsing them into a JSON that'll separate for each heading the text and "subheadings" in it.
For example, I've got the following markdown file, I want it to be parsed into something of the form:
outer1
outer2
# title 1
text1.1
## title 1.1
text1.1.1
# title 2
text 2.1
to:
{
"text": [
"outer1",
"outer2"
],
"inner": [
{
"section": [
{
"title": "title 1",
"inner": [
{
"text": [
"text1.1"
],
"inner": [
{
"section": [
{
"title": "title 1.1",
"inner": [
{
"text": [
"text1.1.1"
]
}
]
}
]
}
]
}
]
},
{
"title": "title 2",
"inner": [
{
"text": [
"text2.1"
]
}
]
}
]
}
]
}
To further illustrate the need - notice how the inner heading is nested inside the outer one, whereas the 2nd outer heading is not.
I tried using pyparser to solve this but it seems to me that it's not able to achieve this because to get section "title 2" to be on the same level as "title 1" I need some sort of "counting logic" to check that the number or "#" in the new header is less than or equal which is something I can't seem to do.
Is this an issue with the expressibility of pyparser? Is there another kind of parser that could achieve this?
I could implement this in pure python but I wanted to do something better.
Here is my current pyparsing implementation which doesn't work as explained above:
section = pp.Forward()("section")
inner_block = pp.Forward()("inner")
start_section = pp.OneOrMore(pp.Word("#"))
title_section = line
title = start_section.suppress() + title_section('title')
line = pp.Combine(
pp.OneOrMore(pp.Word(pp.unicode.Latin1.printables), stop_on=pp.LineEnd()),
join_string=' ', adjacent=False)
text = \~title + pp.OneOrMore(line, stop_on=(pp.LineEnd() + pp.FollowedBy("#")))
inner_block \<\< pp.Group(section | (text('text') + pp.Optional(section.set_parse_action(foo))))
section \<\< pp.Group(title + pp.Optional(inner_block))
markdown = pp.OneOrMore(inner_block)
test = """\
out1
out2
# title 1
text1.1
# title 2
text2.1
"""
res = markdown.parse_string(test, parse_all=True).as_dict()
test_eq(res, dict(
inner=[
dict(
text = ["out1", "out2"],
section=[
dict(title="title 1", inner=[
dict(
text=["text1.1"]
),
]),
dict(title="title 2", inner=[
dict(
text=["text2.1"]
),
]),
]
)
]
))