Python 3.4.2. I am trying to dynamically load a custom module whose name is passed as a command-line argument, so that I can plug in site-specific code to scrape particular HTML files. Example: scrape.py -m name_of_module_to_load file_to_scrape.html

I have tried a number of solutions, including this one: importing a module when the module name is in a variable

The module loads fine when I use the actual module name instead of the variable name args.module.

Code:

$ cat scrape.py 
#!/usr/bin/env python3
from urllib.error import HTTPError
from urllib.request import urlopen
from bs4 import BeautifulSoup
import argparse
import os, sys
import importlib

parser = argparse.ArgumentParser(description='HTML web scraper')
parser.add_argument('filename', help='File to act on')
parser.add_argument('-m', '--module', metavar='MODULE_NAME', help='File with code specific to the site--must be a defined class named Scrape')
args = parser.parse_args()

if args.module:
#    from get_div_content import Scrape #THIS WORKS#
    sys.path.append(os.getcwd())
    #EDIT--change this:
    #wrong# module_name = importlib.import_module(args.module, package='Scrape')
    #to this:
    module = importlib.import_module(args.module) # correct

try:
    html = open(args.filename, 'r')
except:
    try:
        html = urlopen(args.filename)
    except HTTPError as e:
        print(e)
try:
    soup = BeautifulSoup(html.read())
except:
    print("Error... Sorry... not sure what happened")

#EDIT--change this
#wrong#scraper = Scrape(soup)
#to this:
scraper = module.Scrape(soup) # correct

Module:

$ cat get_div_content.py 
class Scrape:
    def __init__(self, soup):
        content = soup.find('div', {'id':'content'})
        print(content)

Command run and Error:

$ ./scrape.py -m get_div_content.py file.html 
Traceback (most recent call last):
  File "./scrape.py", line 16, in <module>
    module_name = importlib.import_module(args.module, package='Scrape')
  File "/usr/lib/python3.4/importlib/__init__.py", line 109, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 2249, in _gcd_import
  File "<frozen importlib._bootstrap>", line 2199, in _sanity_check
SystemError: Parent module 'Scrape' not loaded, cannot perform relative import

Working Command -- No Errors:

$ ./scrape.py -m get_div_content file.html
<div id="content">
...
</div>

1 Answer

You don't need the package argument; it is only used to anchor relative imports. Pass only the module name:

module = importlib.import_module(args.module)

Then you have a module object whose namespace contains everything that was defined in the module:

scraper = module.Scrape(soup)
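
As a minimal, self-contained sketch of the same idea: the module name demo_scraper, the temporary directory, and the helper load_scrape_class are all made up for illustration and are not part of the original script. The snippet writes a tiny module to disk, imports it by a name held in a variable, and looks up the Scrape class on the returned module object:

```python
import importlib
import os
import sys
import tempfile

# Hypothetical stand-in for get_div_content.py (assumption, for illustration only).
MODULE_SOURCE = '''
class Scrape:
    def __init__(self, soup):
        self.soup = soup
'''


def load_scrape_class(module_name, search_dir):
    """Import module_name from search_dir and return its Scrape class."""
    if search_dir not in sys.path:
        sys.path.append(search_dir)
    # The file may have been created after the interpreter started.
    importlib.invalidate_caches()
    module = importlib.import_module(module_name)
    return getattr(module, "Scrape")


# Write the stand-in module into a temporary directory.
tmp_dir = tempfile.mkdtemp()
with open(os.path.join(tmp_dir, "demo_scraper.py"), "w") as f:
    f.write(MODULE_SOURCE)

name_from_variable = "demo_scraper"  # plays the role of args.module
ScrapeClass = load_scrape_class(name_from_variable, tmp_dir)
scraper = ScrapeClass("<fake soup>")
```

Using getattr here is equivalent to writing module.Scrape; it is just convenient if the class name also comes from a variable.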

Remember, when calling, to use the module name, not the filename:

./scrape.py -m get_div_content file.html 
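
If you want the script to tolerate a filename as well, one possible approach is to normalize the argument before importing. This is only a sketch; module_name_from_arg is a hypothetical helper, not something in the original script or in importlib:

```python
import os


def module_name_from_arg(arg):
    """Turn 'get_div_content.py' or './get_div_content.py' into 'get_div_content'.

    Hypothetical helper: strips any directory part and a trailing '.py'.
    Other names (including dotted module names) are returned unchanged.
    """
    base = os.path.basename(arg)
    name, ext = os.path.splitext(base)
    return name if ext == ".py" else base


print(module_name_from_arg("get_div_content.py"))
print(module_name_from_arg("get_div_content"))
```

The result can then be passed to importlib.import_module. Note that the directory part is simply dropped, so the module's directory still has to be on sys.path.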

2 Comments

Great! that did it. After seeing your answer it makes sense. Thanks.
I corrected the original post to show working changes. Look for #EDIT
