How to replace Markdown tags with HTML in Python?

Question

I want to convert some Markdown-style tags into HTML tags.

For example:

#Title1#
##title2##
Welcome to **My Home Page**

will be turned into

<h1>Title1</h1>
<h2>title2</h2>
Welcome to <b>My Home Page</b>

I don't know how to do that. For Title1, I tried this:

#!/usr/bin/env python3
import re
text = '''
        #Title1#
        ##title2##
'''
p = re.compile('^#\w*#\n$')
print(p.sub('<h1>\w*</h1>',text))

but nothing happens; the text is printed unchanged:

 #Title1#
 ##title2##

How can BBCode/Markdown-style markup like this be converted into HTML tags?

Look for a Markdown parser. Searching for "pypi markdown parser" gives several results. I don't have any experience with them, so I suggest downloading a few and trying them out on some Markdown-formatted text.
– nhahtdh, Commented Oct 5, 2015 at 7:59
Thanks, but I want to understand how these markup languages work, and I would like to define my own Markdown-style syntax for my homepage, which is served by a Python 3 CGI program.
– Bing Sun, Commented Oct 5, 2015 at 8:59
That is why I don't want to solve the problem with a Markdown package.
– Bing Sun, Commented Oct 5, 2015 at 9:00
@BingSun: The actual parsing algorithm is described in detail in the CommonMark spec. If I remember correctly, it is a two-pass algorithm: the first pass identifies block constructs, and the second pass parses the rest. If you want to learn to write a parser, the best way is to look at how existing parsers are written.
– nhahtdh, Commented Oct 5, 2015 at 9:13

Asunez · Accepted Answer · 2015-10-05 09:49:43Z

4

Check this regex: demo

Here you can see how I substituted #...# with <h1>...</h1>. I believe you can extend this to double # and so on to cover other Markdown features, but you should still follow the advice in the comments from @Thomas and @nhahtdh and use a Markdown parser. Using regexes for this kind of task is unreliable, slow and unsafe.

As for inline text, such as turning **...** into <b>...</b>, you can try this regex with substitution: demo. I hope you can tweak it for other features like underlining.
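The linked demos are not reproduced in this thread, so the following is only a rough Python sketch of the two substitutions described above, not the regexes from the demos. The function name `to_html` and both patterns are assumptions chosen to fit the sample input in the question. Note that it performs no HTML escaping, which is one reason a real parser is the safer choice.

```python
import re

# Hypothetical patterns: a heading is a line wrapped in the same number of '#'
# on both sides; bold is any text wrapped in '**'.
HEADING = re.compile(r'^(#{1,6})([^#\n]+)\1$', re.MULTILINE)
BOLD = re.compile(r'\*\*(.+?)\*\*')


def to_html(text):
    """Convert #...# headings and **...** bold spans to HTML tags."""
    def heading(match):
        level = len(match.group(1))  # the number of '#' gives the heading level
        return '<h{0}>{1}</h{0}>'.format(level, match.group(2))

    text = HEADING.sub(heading, text)
    return BOLD.sub(r'<b>\1</b>', text)


sample = "#Title1#\n##title2##\nWelcome to **My Home Page**\n"
print(to_html(sample))
```

Using a function as the replacement lets one pattern handle every heading level, instead of one regex per level.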

edited Oct 5, 2015 at 9:49

answered Oct 5, 2015 at 9:04

Asunez

4 Comments

Bing Sun Over a year ago

Thanks, I will try a Markdown parser later. I tried setting text='#title#' and calling print(p.sub('<h1>\1</h1>',text)), but the program returns <h1></h1>. What does \1 mean? And how do I mark the content that should be kept unmodified?

Asunez Over a year ago

\1 is a backreference to capture group 1. You will need to check how to do this in Python, as I am not familiar with this language.
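For the Python side of this, the replacement string generally needs to be a raw string. In an ordinary string literal, '\1' is the single control character U+0001, which is a likely reason the output looked like an empty <h1></h1>. The pattern below is an assumption, since the comment does not show which one was used.

```python
import re

pattern = re.compile(r'^#(\w*)#$')  # group 1 captures the title text
text = '#title#'

# Without the r prefix, '\1' is the single control character U+0001,
# so the captured text is never inserted.
wrong = pattern.sub('<h1>\1</h1>', text)

# With a raw string, re receives a backslash followed by '1' and treats
# it as a backreference to group 1.
right = pattern.sub(r'<h1>\1</h1>', text)

# \g<1> is an equivalent, more explicit spelling.
explicit = pattern.sub(r'<h1>\g<1></h1>', text)

print(repr(wrong))
print(right)
print(explicit)
```

The parentheses in the pattern are what mark the part to keep: whatever group 1 matches is copied into the output unchanged.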

Asunez Over a year ago

@BingSun I added regex for **...** part, you might want to check this out too.

Bing Sun Over a year ago

Thank you very much! It is very kind of you!

Community · Accepted Answer · 2020-06-20 09:12:55Z

1

Your regular expression does not work because, in the default mode, ^ and $ match the beginning and the end of the whole string (respectively), not of each line.

'^'

(Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline (my emph.)

'$'

Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both ‘foo’ and ‘foobar’, while the regular expression foo$ matches only ‘foo’. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches ‘foo2’ normally, but ‘foo1’ in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.

(7.2.1. Regular Expression Syntax)
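The examples from the quoted documentation can be checked directly. This is a small sketch using only the standard `re` module; the variable names are illustrative.

```python
import re

s = 'foo1\nfoo2\n'

# Default mode: $ matches only at the end of the string
# (or just before a final newline).
default_match = re.search(r'foo.$', s).group()

# MULTILINE mode: $ also matches before every newline, so the first line wins.
multiline_match = re.search(r'foo.$', s, re.MULTILINE).group()

# A lone $ matches twice in 'foo\n': before the newline and at the very end.
dollar_matches = re.findall(r'$', 'foo\n')

# ^ behaves the same way: one start by default, one per line in MULTILINE.
default_starts = re.findall(r'^\w+', s)
multiline_starts = re.findall(r'^\w+', s, re.MULTILINE)

print(default_match, multiline_match, dollar_matches)
print(default_starts, multiline_starts)
```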

Add the flag re.MULTILINE in your compile line:

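p = re.compile(r'^#(\w*)#$', re.MULTILINE)

and it should work, at least for single-word titles such as your example. (The \n before $ is dropped here: in MULTILINE mode $ already matches just before each newline, and \n$ would only match a title followed by a blank line or the end of the string.) A better check would be

p = re.compile(r'^#([^#\n]*)#$', re.MULTILINE)

that is, any sequence on one line that does not contain a #.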

In both expressions, you need to add parentheses around the part you want to copy so you can use that text in your replacement code. See the official documentation on Grouping for that.
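As a sketch of how the script from the question might look with these changes applied: it assumes the indentation inside the triple-quoted string is unintended and strips it with `textwrap.dedent`, since ^ will not match a line that starts with spaces. It also omits the \n before $, because in MULTILINE mode $ already matches just before each newline.

```python
import re
import textwrap

# dedent strips the common leading whitespace so each title starts its line.
text = textwrap.dedent('''
        #Title1#
        ##title2##
''')

# Group 1 captures the title; [^#\n] keeps the match on one line and
# stops '##title2##' from being treated as an <h1>.
h1 = re.compile(r'^#([^#\n]+)#$', re.MULTILINE)

result = h1.sub(r'<h1>\1</h1>', text)
print(result)
```

Only the first title is converted; the ##title2## line is left for a separate <h2> rule.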

edited Jun 20, 2020 at 9:12

CommunityBot

answered Oct 5, 2015 at 9:13

Jongware

1 Comment

nhahtdh Over a year ago

When you mention "single line mode", it is going to be confused with the s flag, which makes . match newlines. (The naming is quite confusing, and I was bitten by it when I started out.) It is more accurate to say that, by default, ^ and $ match the beginning and the end of the whole string. You need MULTILINE mode (the m flag) to make them also match the beginning and the end of each line.
