Python multiple regular expression replace

Question

I'm a Python newbie. I've been searching for days but have found only bits and pieces of what I need. I'm using Python 2.7 on Windows (I chose Python because it's cross-platform and the result can be portable on Windows).

I'd like to write a script that searches a folder for *.txt UTF-8 text files, loads each file's content in turn, converts non-ASCII characters to HTML entities, and then adds HTML tags at the start and end of each line. There are two variations of the tags: one for the head of the file and one for the tail, where head and tail are separated by an empty line. Finally, the result has to be written out to other text file(s), such as *.htm. To illustrate:

unicode1.txt:

űnícődé text line1
űnícődé text line2
[empty line]
űnícődé text line3
űnícődé text line4

The result in unicode1.htm should be:

<p class='aaa'>&#369;n&iacute;c&#337;d&eacute; text line1</p>
<p class='aaa'>&#369;n&iacute;c&#337;d&eacute; text line2</p>
[empty line]
<p class='bbb'>&#369;n&iacute;c&#337;d&eacute; text line3</p>
<p class='bbb'>&#369;n&iacute;c&#337;d&eacute; text line4</p>

I started developing the core of my solution, but I got stuck. See the script versions below (for simplicity I chose to encode with xmlcharrefreplace).

V1:

import re, cgi, fileinput
file="_utf8.txt"
text=""
for line in fileinput.input(file, inplace=0):
  line=cgi.escape(line.decode('utf8'),1).encode('ascii', 'xmlcharrefreplace')
  line=re.sub(r"^", "<p>", line, 1)
  text=text+re.sub(r"$", "</p>", line, 1)
print text

It worked and gave a good result, but I don't think fileinput is a suitable approach for this task.

V2:

import re, cgi, codecs
file="_utf8.txt"
text=""
f=codecs.open(file, encoding='utf-8')
for line in f:
  line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
  line=re.sub(r"^", "<p>", line, 1)
  text=text+re.sub(r"$", "</p>", line, 1)
f.close()
print text

It messed up the result: the closing tag appears at the start of the line, replacing the first letters, etc.

V3 (tried multiline flag):

import re, cgi, codecs
file="_utf8.txt"
text=""
f=codecs.open(file, encoding='utf-8')
for line in f:
  line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
  line=re.sub(r"^", "<p>", line, 1, flags=re.M)
  text=text+re.sub(r"$", "</p>", line, 1, flags=re.M)
f.close()
print text

Same result.

V4 (tried 1 regex instead of 2):

import re, cgi, codecs
file="_utf8.txt"
text=""
f=codecs.open(file, encoding='utf-8')
for line in f:
  line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
  text=text+re.sub(r"^(.*)$", r"<p>\1</p>", line, 1)
f.close()
print text

Same result. Please help.

Edit: I just checked the result file with a hex editor, and there is a 0x0D byte before each closing tag! Why?

Edit2: changes for a more logical approach

text+=re.sub(r"^(.*)$", r"<p>\1</p>", line, 1)

Edit3: with a hex editor I saw the reason for the messed-up result: an extra CR (0x0D) byte before each CRLF. I tracked the CR problem down to the concatenation with +:

# -*- coding: utf-8 -*-
text=""
f=u"unicode text line1\r\n unicode text line2"
for line in f:
  text+=line
print text

This results in:

unicode text line1\r\r\n unicode text line2

Any idea how to fix this?

Indenting 4 spaces creates a code block. Edit your question so that it is more readable. — sgallen
– sgallen, Commented Jan 22, 2012 at 14:29
I used indenting for the first time, but missed the empty paragraph before each indented block. — user1163487
– user1163487, Commented Jan 22, 2012 at 14:35
I'm not sure I really understand your question, I've tried your last script and it seems to get the result you are looking for and the result looks OK in the browser. Can you show the results of your testing with notes where the result is wrong? — snim2
– snim2, Commented Jan 22, 2012 at 14:42
@snim2 For me it messed up the result: closing tag at the line start, deleting the first letter, nothing at the line end. I'll try to show the result here for the source line 'text': </p>ext — user1163487
– user1163487, Commented Jan 22, 2012 at 14:48
Why do you need line=re.sub(r"^", "<p>", line, 1) text=text+re.sub(r"$", "</p>", line, 1)? Can't you just do concatenation: text += "\n<p>" + line + "</p>" — Roman Susi
– Roman Susi, Commented Jan 22, 2012 at 15:02

Gandaro · Accepted Answer · 2012-01-23 03:00:02Z

3

There's no need for regular expressions at all here, just do this:

with open('utf8.txt') as f:
    class_name = 'aaa'  # class for the head part of the file
    for line in f:
        if line == '\n':
            # an empty line separates head and tail: switch class
            class_name = 'bbb'
        else:
            # decode / convert line
            line = '<p class="{0}">{1}</p>\n'.format(class_name, line.rstrip())
        # write line to file

The results you are getting do not look to be caused by the regular expressions as they appear to be correct. The problem is most likely in the line where you do your encoding / converting. Print that line without adding the tags to see if it is as expected.
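One way to follow that advice is to look at repr() of a line rather than printing it directly. The sketch below is written for Python 3 (the question uses 2.7) and uses a hard-coded sample line; the CRLF ending is an assumption about what a file saved on Windows contains when it is read without newline translation. It suggests that `$` matches just before the final '\n' but after the '\r', so the closing tag lands between the two. On a console the '\r' moves the cursor back to the start of the line, which would be consistent with the closing tag overwriting the first letters.

```python
import re

# Hypothetical sample: a line as it might be read from a CRLF file
# when no newline translation takes place.
line = "text\r\n"

# '.' matches '\r', and '$' matches just before the trailing '\n',
# so the closing tag is placed after the '\r'.
tagged = re.sub(r"^(.*)$", r"<p>\1</p>", line, count=1)
print(repr(tagged))

# Stripping the line ending first and appending a plain '\n'
# keeps the stray CR out of the output.
fixed = "<p>{0}</p>\n".format(line.rstrip("\r\n"))
print(repr(fixed))
```

If the first repr() shows a '\r' in front of the closing tag, the regular expressions are doing what they were asked to do and the line ending is the part to fix.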

edited Jan 23, 2012 at 3:00

Gandaro

answered Jan 22, 2012 at 15:01

Rob Wouters

15 Comments

jfs Over a year ago

it will leave newline before </p>

Rob Wouters Over a year ago

@J.F.Sebastian, nice catch. Added rstrip to the answer.

jfs Over a year ago

.rstrip('\n\r') will preserve ' \t' at EOL.

Rob Wouters Over a year ago

@J.F.Sebastian, I thought about that, but trailing whitespace in a <p> didn't seem very useful. If the OP still wants it, your suggestion should do it.

Rob Wouters Over a year ago

@Tib, did you use my original method without rstrip? If so, try my edited answer with rstrip()
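The difference between the two rstrip variants discussed in these comments can be seen in a short sketch. The sample string is made up for illustration; the behaviour shown is the same in Python 2 and 3.

```python
# Hypothetical sample line with trailing space, tab and a CRLF ending.
line = "some text \t\r\n"

# rstrip() with no argument removes all trailing whitespace.
stripped_all = line.rstrip()
print(repr(stripped_all))   # 'some text'

# rstrip('\n\r') removes only the line-ending characters,
# so the trailing space and tab survive.
stripped_eol = line.rstrip("\n\r")
print(repr(stripped_eol))   # 'some text \t'
```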

jfs · Accepted Answer · 2012-01-22 17:10:14Z

1

#!/usr/bin/env python
import cgi
import fileinput
import os
import shutil
import sys

def textfiles(rootdir, extensions=('.txt',)):
    # yield the path of every matching file below rootdir
    for dirpath, dirs, files in os.walk(rootdir):
        for f in files:
            if f.lower().endswith(extensions):
                yield os.path.join(dirpath, f)

def htmlfiles(files):
    # copy each text file to a '.html' sibling and yield the copy's name
    for f in files:
        root, _ = os.path.splitext(f)
        newf = root + '.html'
        shutil.copy2(f, newf)
        yield newf

# inplace=True redirects print() output into the file being read
for line in fileinput.input(htmlfiles(textfiles(sys.argv[1])), inplace=True):
    if fileinput.isfirstline():
        klass = 'aaa' # start head part
    line = cgi.escape(line.decode('utf-8').strip())
    line = line.encode('ascii', 'xmlcharrefreplace')
    if not line: # empty line
        klass = 'bbb' # start tail part
        print(line)
    else:
        print('<p class="%s">%s</p>' % (klass, line))

Example

$ python txt2html.py c:\root\dir
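Note that xmlcharrefreplace emits numeric references only, while the expected output in the question mixes named entities (&iacute;, &eacute;) with numeric ones. A minimal sketch of one way to get that mix, assuming Python 3: the helper name to_entities is made up for illustration, and the lookup table is html.entities.codepoint2name (in Python 2 the same table lives in the htmlentitydefs module).

```python
import html
from html.entities import codepoint2name

def to_entities(text):
    """Escape markup characters, then turn each non-ASCII character into an entity."""
    out = []
    for ch in html.escape(text, quote=True):
        code = ord(ch)
        if code < 128:
            # plain ASCII (including the escapes just produced) passes through
            out.append(ch)
        elif code in codepoint2name:
            # a named HTML 4 entity exists, e.g. 237 -> 'iacute'
            out.append("&{0};".format(codepoint2name[code]))
        else:
            # fall back to a numeric character reference
            out.append("&#{0};".format(code))
    return "".join(out)

print(to_entities("űnícődé text line1"))
```

The named form is used only where the HTML 4 table has an entry; every other non-ASCII character falls back to the numeric form that xmlcharrefreplace would also produce.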

edited Jan 22, 2012 at 17:10

answered Jan 22, 2012 at 16:11

jfs

6 Comments

user1163487 Over a year ago

Added import sys. Now works, but only prints lines out, and I'd like it written out to *.htm text file(s). Is there a fileoutput also like fileinput?

jfs Over a year ago

@Tib: there are multiple options e.g., you could wrap textfiles() to copy each '.txt' file with shutil.copy2() and then yield '.html' filenames to fileinput (use inplace=True in this case). Or close/open new file inside if fileinput.isfirstline().

user1163487 Over a year ago

I have to investigate and learn the docs because I did not understand totally what you wrote :-) Remember, I just started with python :-)

jfs Over a year ago

@Tib: I've added htmlfiles() function to illustrate the previous comment. Note: the data is read/written twice in this case.

jfs Over a year ago

@Tib: shutil.copy reads .txt file, writes .html file; fileinput.input() reads .html, writes .html: 4 times in total.
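The extra reads and writes mentioned in these comments can be avoided by reading each .txt file once and writing the .htm file directly. This is a sketch under stated assumptions rather than the answer's method: it targets Python 3, the function names convert_file and convert_tree are made up for illustration, and the non-ASCII conversion is delegated to the errors="xmlcharrefreplace" handler of the output file.

```python
import html
import os

def convert_file(src_path, dst_path):
    """Read a UTF-8 text file once and write the tagged ASCII lines once."""
    klass = "aaa"  # head part, until the first empty line
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="ascii", errors="xmlcharrefreplace") as dst:
        for line in src:
            line = html.escape(line.strip())
            if not line:
                klass = "bbb"  # the empty line starts the tail part
                dst.write("\n")
            else:
                dst.write('<p class="{0}">{1}</p>\n'.format(klass, line))

def convert_tree(rootdir):
    """Convert every .txt file below rootdir to a sibling .htm file."""
    for dirpath, _dirs, files in os.walk(rootdir):
        for name in files:
            if name.lower().endswith(".txt"):
                src = os.path.join(dirpath, name)
                convert_file(src, os.path.splitext(src)[0] + ".htm")
```

Calling convert_tree with the root folder (for example the path passed on the command line above) would then leave the .txt files untouched and create one .htm file next to each of them.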
