I'm a Python newbie. I've been searching for days but have found only fragments of what I need. I'm using Python 2.7 on Windows (I chose Python because it's multiplatform and the result can be portable on Windows).

I'd like to write a script that searches a folder for *.txt UTF-8 text files, loads their content one file after another, converts non-ASCII characters to HTML entities, and then adds HTML tags at the start and end of each line. There are two variations of the tags: one for the head of the file and one for the tail, where head and tail are separated by an empty line. Finally, the result has to be written to other text file(s), such as *.htm. To illustrate:

unicode1.txt:

űnícődé text line1
űnícődé text line2
[empty line]
űnícődé text line3
űnícődé text line4

The result in unicode1.htm should be:

<p class='aaa'>&#369;n&iacute;c&#337;d&eacute; text line1</p>
<p class='aaa'>&#369;n&iacute;c&#337;d&eacute; text line2</p>
[empty line]
<p class='bbb'>&#369;n&iacute;c&#337;d&eacute; text line3</p>
<p class='bbb'>&#369;n&iacute;c&#337;d&eacute; text line4</p>
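
The expected output above mixes named entities (&iacute;) with numeric ones (&#369;), whereas encode('ascii', 'xmlcharrefreplace') produces numeric references only. A minimal sketch of one way to get the mixed style, assuming the standard library's HTML 4 entity table is acceptable; the helper name to_entities is hypothetical and not part of the original scripts:

```python
# -*- coding: utf-8 -*-
try:
    from html.entities import codepoint2name   # Python 3
except ImportError:
    from htmlentitydefs import codepoint2name  # Python 2

def to_entities(text):
    """Replace each non-ASCII character of a unicode string with an HTML entity.

    A named entity is used when the HTML 4 table defines one,
    otherwise a decimal character reference.
    """
    parts = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            parts.append(ch)                              # plain ASCII stays as-is
        elif code in codepoint2name:
            parts.append('&%s;' % codepoint2name[code])   # e.g. &iacute;
        else:
            parts.append('&#%d;' % code)                  # e.g. &#369;
    return ''.join(parts)

print(to_entities(u'\u0171n\u00edc\u0151d\u00e9 text line1'))
# &#369;n&iacute;c&#337;d&eacute; text line1
```

This helper does not escape &, < or >, so it would still need to be combined with cgi.escape (or an equivalent) before wrapping the line in tags.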

I started developing the core of my solution, but I got stuck. See the script versions below (for simplicity I chose to encode with xmlcharrefreplace).

V1:

import re, cgi, fileinput
file="_utf8.txt"
text=""
for line in fileinput.input(file, inplace=0):
  line=cgi.escape(line.decode('utf8'),1).encode('ascii', 'xmlcharrefreplace')
  line=re.sub(r"^", "<p>", line, 1)
  text=text+re.sub(r"$", "</p>", line, 1)
print text

It worked and gave a good result, but I don't think fileinput is a usable approach for this task.

V2:

import re, cgi, codecs
file="_utf8.txt"
text=""
f=codecs.open(file, encoding='utf-8')
for line in f:
  line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
  line=re.sub(r"^", "<p>", line, 1)
  text=text+re.sub(r"$", "</p>", line, 1)
f.close()
print text

It messed up the result: the closing tag appeared at the start of the line, replacing the first letter, and so on.

V3 (tried multiline flag):

import re, cgi, codecs
file="_utf8.txt"
text=""
f=codecs.open(file, encoding='utf-8')
for line in f:
  line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
  line=re.sub(r"^", "<p>", line, 1, flags=re.M)
  text=text+re.sub(r"$", "</p>", line, 1, flags=re.M)
f.close()
print text

Same result.

V4 (tried 1 regex instead of 2):

import re, cgi, codecs
file="_utf8.txt"
text=""
f=codecs.open(file, encoding='utf-8')
for line in f:
  line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
  text=text+re.sub(r"^(.*)$", r"<p>\1</p>", line, 1)
f.close()
print text

Same result. Please help.

Edit: I just checked the result file with a hex editor, and there is a 0x0D byte before each closing tag! Why?

Edit2: changes for a more logical approach

text+=re.sub(r"^(.*)$", r"<p>\1</p>", line, 1)

Edit3: with a hex editor I found the reason for the messed-up result: an extra CR (0x0D) byte before each CRLF. I tracked the CR problem down to what seems to cause it, the concatenation with +:

# -*- coding: utf-8 -*-
text=""
f=u"unicode text line1\r\n unicode text line2"
for line in f:
  text+=line
print text

This results in:

unicode text line1\r\r\n unicode text line2

Any idea how to fix this?
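
A likely explanation, offered as an assumption rather than something confirmed in this thread: the concatenation leaves the string unchanged, and the extra CR is added on output. On Windows, text-mode output expands every '\n' to '\r\n', so a string that already contains '\r\n' comes out as '\r\r\n'. The sketch below imitates that translation on any platform by passing newline='\r\n' to io.open; the scratch file exists only for the demonstration.

```python
import io
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

text = u'unicode text line1\r\n unicode text line2'

# newline='\r\n' imitates Windows text mode: each '\n' written becomes '\r\n'.
with io.open(path, 'w', encoding='utf-8', newline='\r\n') as out:
    out.write(text)

# Read the raw bytes back to see what actually landed in the file.
with io.open(path, 'rb') as raw:
    data = raw.read()
os.remove(path)

print(repr(data))  # the CR before the LF is now doubled
```

codecs.open always opens the file in binary mode, so lines read through it keep their '\r\n' endings, which may be why V1 (fileinput, text mode) behaved differently from V2 to V4. Stripping the line ending with line.rstrip('\r\n') before adding the tags, or reading with io.open(file, encoding='utf-8'), should avoid the stray CR.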

  • Indenting 4 spaces creates a code block. Edit your question so that it is more readable. Commented Jan 22, 2012 at 14:29
  • I used indenting for the first time, but missed the empty paragraph before each indented block. Commented Jan 22, 2012 at 14:35
  • I'm not sure I really understand your question, I've tried your last script and it seems to get the result you are looking for and the result looks OK in the browser. Can you show the results of your testing with notes where the result is wrong? Commented Jan 22, 2012 at 14:42
  • @snim2 For me it messed up the result: closing tag at the line start, deleting the first letter, nothing at the line end. I try here a line to show the result if source line is 'text': </p>ext Commented Jan 22, 2012 at 14:48
  • Why do you need line=re.sub(r"^", "<p>", line, 1) text=text+re.sub(r"$", "</p>", line, 1)? Can't you just do concatenation: text += "\n<p>" + line + "</p>" Commented Jan 22, 2012 at 15:02

2 Answers

There's no need for regular expressions at all here, just do this:

with open('utf8.txt') as f:
    class_name = 'aaa'  # class used for the head part
    for line in f:
        if line == '\n':
            # empty line: switch to the class used for the tail part
            class_name = 'bbb'
        else:
            # decode / convert line
            line = '<p class="{0}">{1}</p>\n'.format(class_name, line.rstrip())
        # write line to file

The results you are getting do not look to be caused by the regular expressions as they appear to be correct. The problem is most likely in the line where you do your encoding / converting. Print that line without adding the tags to see if it is as expected.
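
To check this, here is a small sketch of what the question's regexes do to a line that still carries a CRLF ending. It assumes the input file uses Windows line endings, as the hex dump in the question suggests; the variable names are illustrative only.

```python
import re

# A line as read from a CRLF file without newline translation.
line = 'text\r\n'

# '$' matches just before the final '\n', i.e. after the '\r'.
tagged_end = re.sub(r'$', '</p>', line, 1)
print(repr(tagged_end))   # 'text\r</p>\n'

# '.' matches '\r' but not '\n', so the CR ends up inside the paragraph.
tagged_both = re.sub(r'^(.*)$', r'<p>\1</p>', line, 1)
print(repr(tagged_both))  # '<p>text\r</p>\n'

# Stripping the line ending first keeps the CR out of the markup.
fixed = '<p>%s</p>' % line.rstrip('\r\n')
print(repr(fixed))        # '<p>text</p>'
```

When '<p>text\r</p>' is printed to a console, the CR moves the cursor back to the start of the line and '</p>' overwrites the first four characters, which would match the '</p>ext' output reported in the comments.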

15 Comments

it will leave newline before </p>
@J.F.Sebastian, nice catch. Added rstrip to the answer.
.rstrip('\n\r') will preserve ' \t' at EOL.
@J.F.Sebastian, I thought about that, but trailing whitespace in a <p> didn't seem very useful. If the OP still wants it, your suggestion should do it.
@Tib, did you use my original method without rstrip? If so, try my edited answer with rstrip()
#!/usr/bin/env python
import cgi
import fileinput
import os
import shutil
import sys

def textfiles(rootdir, extensions=('.txt',)):
    for dirpath, dirs, files in os.walk(rootdir):
        for f in files:
            if f.lower().endswith(extensions):
                yield os.path.join(dirpath, f)

def htmlfiles(files):
    for f in files:
        root, _ = os.path.splitext(f)
        newf = root + '.html'
        shutil.copy2(f, newf)
        yield newf

for line in fileinput.input(htmlfiles(textfiles(sys.argv[1])), inplace=True):
    if fileinput.isfirstline():
        klass = 'aaa'  # start head part
    line = cgi.escape(line.decode('utf-8').strip())
    line = line.encode('ascii', 'xmlcharrefreplace')
    if not line:  # empty line
        klass = 'bbb'  # start tail part
        print(line)
    else:
        print('<p class="%s">%s</p>' % (klass, line))

Example

$ python txt2html.py c:\root\dir
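
As an alternative to copying each file and rewriting it in place, the sketch below reads every .txt file once and writes the matching .htm file once. It is a hedged variant, not the code from this answer: the names convert_lines and convert_tree are hypothetical, io.open is used so that line endings are normalised on read, and the try/except import is only there so the sketch also runs where cgi.escape is unavailable.

```python
import io
import os

try:
    from html import escape   # Python 3
except ImportError:
    from cgi import escape    # Python 2

def convert_lines(lines):
    """Yield one HTML line per input line.

    Lines before the first empty line get class 'aaa', later ones 'bbb'.
    """
    klass = 'aaa'
    for line in lines:
        line = line.strip()
        if not line:
            klass = 'bbb'   # the empty line separates head and tail
            yield u''
        else:
            # Escape markup characters, then turn non-ASCII into &#NNN; references.
            body = escape(line, True).encode('ascii', 'xmlcharrefreplace').decode('ascii')
            yield u'<p class="%s">%s</p>' % (klass, body)

def convert_tree(rootdir):
    """Write an .htm file next to every .txt file found under rootdir."""
    for dirpath, dirs, files in os.walk(rootdir):
        for name in files:
            if not name.lower().endswith('.txt'):
                continue
            src = os.path.join(dirpath, name)
            dst = os.path.splitext(src)[0] + '.htm'
            with io.open(src, encoding='utf-8') as fin:
                html_lines = list(convert_lines(fin))
            with io.open(dst, 'w', encoding='ascii') as fout:
                for html_line in html_lines:
                    fout.write(html_line + u'\n')
```

Calling convert_tree(sys.argv[1]) from a script would give the same command-line usage as the example above, with each file read and written only once.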

6 Comments

Added import sys. Now works, but only prints lines out, and I'd like it written out to *.htm text file(s). Is there a fileoutput also like fileinput?
@Tib: there are multiple options e.g., you could wrap textfiles() to copy each '.txt' file with shutil.copy2() and then yield '.html' filenames to fileinput (use inplace=True in this case). Or close/open new file inside if fileinput.isfirstline().
I have to investigate and learn the docs because I did not understand totally what you wrote :-) Remember, I just started with python :-)
@Tib: I've added htmlfiles() function to illustrate the previous comment. Note: the data is read/written twice in this case.
@Tib: shutil.copy reads .txt file, writes .html file; fileinput.input() reads .html, writes .html: 4 times in total.