How to parse a markdown file to json in python?

Question

I have many markdown files with titles, subheadings, sub-subheadings etc.

I'd like to parse them into JSON that separates, for each heading, its own text from the subheadings nested under it.

For example, I'd like to convert the following markdown file:

outer1
outer2

# title 1
text1.1

## title 1.1
text1.1.1

# title 2
text 2.1

to:

{
  "text": [
    "outer1",
    "outer2"
  ],
  "inner": [
    {
      "section": [
        {
          "title": "title 1",
          "inner": [
            {
              "text": [
                "text1.1"
              ],
              "inner": [
                {
                  "section": [
                    {
                      "title": "title 1.1",
                      "inner": [
                        {
                          "text": [
                            "text1.1.1"
                          ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        },
        {
          "title": "title 2",
          "inner": [
            {
              "text": [
                "text 2.1"
              ]
            }
          ]
        }
      ]
    }
  ]
}

To illustrate the need further: notice how the inner heading is nested inside the outer one, whereas the second outer heading is not.

I tried using pyparsing to solve this, but it doesn't seem able to do it. To get section "title 2" onto the same level as "title 1", I need some sort of "counting logic" that checks whether the number of "#" characters in the new header is less than or equal to the number in the current one, and I can't see how to express that.

Is this a limit on the expressiveness of pyparsing? Is there another kind of parser that could achieve this?

I could implement this in pure Python, but I wanted to do something better.

Here is my current pyparsing implementation which doesn't work as explained above:

import pyparsing as pp

section = pp.Forward()("section")
inner_block = pp.Forward()("inner")

# a line of text: printable words up to the end of the line, joined by spaces
# (defined before title, which uses it)
line = pp.Combine(
    pp.OneOrMore(pp.Word(pp.unicode.Latin1.printables), stop_on=pp.LineEnd()),
    join_string=' ', adjacent=False)

start_section = pp.OneOrMore(pp.Word("#"))
title_section = line
title = start_section.suppress() + title_section('title')

text = ~title + pp.OneOrMore(line, stop_on=(pp.LineEnd() + pp.FollowedBy("#")))

# foo is a parse action that is not shown in this question
inner_block << pp.Group(section | (text('text') + pp.Optional(section.set_parse_action(foo))))

section << pp.Group(title + pp.Optional(inner_block))

markdown = pp.OneOrMore(inner_block)


test = """\
out1
out2

# title 1
text1.1

# title 2
text2.1

"""

res = markdown.parse_string(test, parse_all=True).as_dict()
test_eq(res, dict(
    inner=[
        dict(
            text = ["out1", "out2"],
            section=[
                dict(title="title 1", inner=[
                    dict(
                        text=["text1.1"]
                    ),
                ]),
                dict(title="title 2", inner=[
                    dict(
                        text=["text2.1"]
                    ),
                ]),
            ]
        )
    ]
))

PaulMcG · Accepted Answer · 2022-12-12 08:53:31Z

I took a slightly different approach to this problem, using scan_string instead of parse_string, and doing more of the data structure management and storage in a scan_string loop instead of in the parser itself with parse actions.

scan_string scans the input and for each match found, returns the matched tokens as a ParseResults, and the start and end locations of the match in the source string.
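As a small illustration of what scan_string yields, here is a minimal sketch that is unrelated to markdown. The input string and the integer expression are made up for the example; the point is only that each match comes back with its tokens plus the start and end offsets, so the matched text can be sliced straight out of the source string.

```python
import pyparsing as pp

# hypothetical input, just to show the (tokens, start, end) triples
text = "a 12 bc 345"
integer = pp.Word(pp.nums)

matches = []
for tokens, start, end in integer.scan_string(text):
    # text[start:end] is the exact slice of the input that matched
    matches.append((tokens[0], start, end))
    print(tokens[0], start, end, repr(text[start:end]))
```

Text between matches is simply skipped, which is why the loop further down has to slice it out of the source itself.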

Starting with an import, I define an expression for a title line:

import pyparsing as pp

# define a pyparsing expression that will match a line with leading '#'s
title = pp.AtLineStart(pp.Word("#")) + pp.rest_of_line

To get ready to gather data by title, I define a title_stack list, and a last_end int to keep track of the end of the last title found (so we can slice out the contents of the last title that was parsed). I initialize this stack with a fake entry representing the start of the file:

# initialize title_stack with a level-0 title at the start of the file
title_stack = []
title_stack.append([0, '<start of file>'])

# end position of the last title matched (nothing matched yet)
last_end = 0

Here is the scan loop using scan_string:

# sample is the markdown source to scan, as a str
for t, start, end in title.scan_string(sample):
    # save content since last title in the last item in title_stack
    title_stack[-1].append(sample[last_end:start].lstrip("\n"))

    # add a new entry to title_stack
    marker, title_content = t
    level = len(marker)
    title_stack.append([level, title_content.lstrip()])

    # update last_end to the end of the current match
    last_end = end

# add trailing text to the final parsed title
# (strip the leading newline, as in the loop, so all bodies are consistent)
title_stack[-1].append(sample[last_end:].lstrip("\n"))

At this point, title_stack contains a list of 3-element lists: the title level, the title text, and the body text for that title. Here is the output for a sample like yours (with an additional level-3 heading, title 1.1.1):

[[0, '<start of file>', 'outer1\nouter2\n\n'],
 [1, 'title 1', 'text1.1\n\n'],
 [2, 'title 1.1', 'text1.1.1\n\n'],
 [3, 'title 1.1.1', 'text 1.1.1\n\n'],
 [1, 'title 2', 'text 2.1']]

From here, you should be able to walk this list and convert it into your desired tree structure.
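One way to do that walk is sketched below. It is not part of the original answer, and it makes some assumptions: build_tree is a hypothetical helper name, and the node layout (title, text, inner) is a simplification of the JSON in the question rather than an exact reproduction of it. It keeps a stack of currently open headings and, for each new title, pops every heading at the same or a deeper level before attaching the new node. That pop is the "counting logic" asked about in the question.

```python
import json


def build_tree(title_stack):
    """Nest flat [level, title, body] entries by heading level."""
    root = {"title": None, "text": [], "inner": []}
    # chain of currently open headings as (level, node); the root is level 0
    open_nodes = [(0, root)]
    for level, title, body in title_stack:
        # keep the non-blank lines of the body
        text = [line for line in body.splitlines() if line.strip()]
        if level == 0:
            # content before the first heading belongs to the root
            root["text"].extend(text)
            continue
        # close every heading at the same or a deeper level
        while open_nodes[-1][0] >= level:
            open_nodes.pop()
        node = {"title": title, "text": text, "inner": []}
        open_nodes[-1][1]["inner"].append(node)
        open_nodes.append((level, node))
    return root


# the flat list shown above
flat = [
    [0, '<start of file>', 'outer1\nouter2\n\n'],
    [1, 'title 1', 'text1.1\n\n'],
    [2, 'title 1.1', 'text1.1.1\n\n'],
    [3, 'title 1.1.1', 'text 1.1.1\n\n'],
    [1, 'title 2', 'text 2.1'],
]

tree = build_tree(flat)
print(json.dumps(tree, indent=2))
```

Because the root sits at level 0 and real headings start at level 1, the while loop never pops the root, so every heading always has a parent to attach to.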
