1

I have a folder full of html files as follows:

aaa.html
bbb.html
ccc.html
....
......
.........
zzz.html

All these HTML files are created by a Python script and hence follow the same template.

Now, I want to link all these html files, for which I already have the placeholders in the html as follows:

<nav>
    <ul class="pager">
        <li class="previous"><a href="#">Previous</a></li>
        <li class="next"><a href="#">Next</a></li>
    </ul>
</nav>

I want to fill these placeholders with the filenames in the folder. For example, bbb.html will have

<nav>
    <ul class="pager">
        <li class="previous"><a href="aaa.html">Previous</a></li>
        <li class="next"><a href="ccc.html">Next</a></li>
    </ul>
</nav>

and the ccc.html file will contain:

<nav>
    <ul class="pager">
        <li class="previous"><a href="bbb.html">Previous</a></li>
        <li class="next"><a href="ddd.html">Next</a></li>
    </ul>
</nav>

And so on for the rest of the files. Can this task be done using Python? I don't even know where to start. Any hints or suggestions would be really helpful.

3
  • Is the order of the HTML files truly alphabetical? If you have AAA.html and aaa.html, which comes first? Commented Apr 7, 2017 at 8:20
  • You can use os.walk to list the files in that directory, sort them with the custom sorting function that you use for the template in web scraping, then iterate over that list and read each file with Beautiful Soup to point those two placeholders at the previous and next elements of the list. Commented Apr 7, 2017 at 8:23
  • @philshem The order really doesn't matter. Each file just has to be linked to two others, so any order would do. Commented Apr 7, 2017 at 8:27

2 Answers

2

You can use the Beautiful Soup library to modify the HTML:

from bs4 import BeautifulSoup

# '...' stands for the files in between; list every file name here
file_names = ['aaa.html', 'bbb.html', ... , 'zzz.html']
# we skip the first and last files (not sure what to do with them ?)

for ind in range(1, len(file_names) - 1):
    file_name = file_names[ind]
    with open(file_name, 'r+', encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
        # we suppose that there is only one link for previous and next;
        # the href sits on the <a> inside each <li>, not on the <li> itself
        soup.find(class_='previous').a['href'] = file_names[ind - 1]
        soup.find(class_='next').a['href'] = file_names[ind + 1]
        # erase contents and replace with new html
        f.seek(0)
        f.truncate()
        f.write(soup.prettify())  # prettify() returns readable HTML as str

If the filenames aren't as consistent as in your example, and you want to generate the list from the files in the directory, you can use os.walk or glob.glob.
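
As a rough sketch of the glob.glob route, assuming all pages sit in one folder: the helper names list_html_files and neighbours below are made up for illustration, and returning None at either end is just one possible way to treat the first and last files.

```python
import glob
import os


def list_html_files(folder):
    """Return the base names of all .html files in `folder`, sorted."""
    # glob.escape guards against special characters in the folder name
    pattern = os.path.join(glob.escape(folder), '*.html')
    return sorted(os.path.basename(path) for path in glob.glob(pattern))


def neighbours(file_names, ind):
    """Return (previous, next) for position `ind`; None at either end."""
    prev_name = file_names[ind - 1] if ind > 0 else None
    next_name = file_names[ind + 1] if ind < len(file_names) - 1 else None
    return prev_name, next_name
```

The sorted list can then replace the hard-coded file_names. os.walk is only needed if the pages are spread over subfolders.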


1

You can replace the placeholders in your template with the neighbouring names from the file list, wrapping around at the ends of the list. Here's an example for aaa.html using the files aaa, bbb and ccc:

#f = ['aaa.html','bbb.html','ccc.html']
f = sorted(['aaa.html','bbb.html','ccc.html'])  # explicit sorting

t = """<nav>
    <ul class="pager">
        <li class="previous"><a href="#">Previous</a></li>
        <li class="next"><a href="#">Next</a></li>
    </ul>
</nav>"""  # sample aaa.html file

i = f.index('aaa.html')  # position of the file being processed
n = len(f)
# modulo indexing wraps around the list: before aaa comes ccc, after ccc comes aaa
t = t.replace('<li class="previous"><a href="#">Previous', '<li class="previous"><a href="' + f[(i - 1) % n] + '">Previous')
t = t.replace('<li class="next"><a href="#">Next', '<li class="next"><a href="' + f[(i + 1) % n] + '">Next')

print(t)

The list wrapping relies on the index wrapping around the ends of the list (after zzz comes aaa).

This gives the following output for aaa.html:

<nav>
    <ul class="pager">
        <li class="previous"><a href="ccc.html">Previous</a></li>
        <li class="next"><a href="bbb.html">Next</a></li>
    </ul>
</nav>

To complete the code, you'd have to loop over the *.html files (see glob.glob).
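
One way that loop might look, as a sketch only: it assumes every file is UTF-8 and contains the exact placeholder markup from the question, and the names fill_placeholders and link_folder are invented here.

```python
import glob
import os

PREV_PLACEHOLDER = '<li class="previous"><a href="#">'
NEXT_PLACEHOLDER = '<li class="next"><a href="#">'


def fill_placeholders(html, prev_name, next_name):
    """Replace the two href="#" placeholders with the given file names."""
    html = html.replace(PREV_PLACEHOLDER,
                        '<li class="previous"><a href="' + prev_name + '">')
    html = html.replace(NEXT_PLACEHOLDER,
                        '<li class="next"><a href="' + next_name + '">')
    return html


def link_folder(folder):
    """Rewrite every .html file in `folder` so it links to its neighbours."""
    paths = sorted(glob.glob(os.path.join(glob.escape(folder), '*.html')))
    names = [os.path.basename(p) for p in paths]
    for i, path in enumerate(paths):
        with open(path, encoding='utf-8') as fh:  # assumption: UTF-8 files
            html = fh.read()
        # modulo indexing wraps around: after the last file comes the first
        html = fill_placeholders(html,
                                 names[(i - 1) % len(names)],
                                 names[(i + 1) % len(names)])
        with open(path, 'w', encoding='utf-8') as fh:
            fh.write(html)
```

Calling link_folder('.') would process the current directory. Because only the href="#" placeholders are matched, running it a second time leaves already linked files unchanged.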
