Revisions to Parsing locally stored HTML files

Rollback to Revision 6

edited May 17, 2016 at 21:22

EDIT: If I call the f.write() method in my code:

from bs4 import BeautifulSoup
import glob
import os
import contextlib


@contextlib.contextmanager


def trade_spider():
    os.chdir(r"C:\Users\6930p\FLO'S DATEIEN\Master FAU\Sommersemester 2016\02_Masterarbeit\04_Testumgebung\01_Probedateien für Analyseaspekt\Independent Auditors Report")
    for file in glob.iglob('**/*.html', recursive=True):
        with open(file, encoding="utf8") as f:
            contents = f.read()
            soup = BeautifulSoup(contents, "html.parser")
            for item in soup.findAll("ix:nonfraction"):
                if item['name'].endswith("AuditFeesExpenses"):
                    print(file.split(os.path.sep)[-1], end="| ")
                    print(item['name'], end="| ")
                    print(item.get_text())
                    f.write(item)
                    break
trade_spider()

I get the output of the first HTML file that has been parsed, but afterwards I get these error messages:

Prod224_0010_00178176_20131231.html| ns19:AuditFeesExpenses| 3,420
Traceback (most recent call last):
  File "C:/Users/6930p/PycharmProjects/untitled/Versuch/SomeTesting.py", line 23, in <module>
    trade_spider()
  File "C:\Users\6930p\Anaconda3\lib\contextlib.py", line 133, in helper
    return _GeneratorContextManager(func, args, kwds)
  File "C:\Users\6930p\Anaconda3\lib\contextlib.py", line 38, in __init__
    self.gen = func(*args, **kwds)
  File "C:/Users/6930p/PycharmProjects/untitled/Versuch/SomeTesting.py", line 21, in trade_spider
    f.write(item)
TypeError: write() argument must be str, not Tag

What am I doing wrong? It seems the code doesn't get through the 'for' loop correctly?
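For reference, the TypeError above can be reproduced in isolation. This is only a minimal sketch: the one-line HTML string is a made-up stand-in for one of the real files, and `io.StringIO` stands in for a file opened for writing. It suggests that `write()` wants a `str`, while `item` is a BeautifulSoup `Tag`, so the element has to be converted first.

```python
import io

from bs4 import BeautifulSoup

# Hypothetical one-line document standing in for one of the HTML files.
html = '<p><ix:nonfraction name="ns19:AuditFeesExpenses">3,420</ix:nonfraction></p>'
soup = BeautifulSoup(html, "html.parser")
item = soup.find("ix:nonfraction")  # a bs4 Tag object, not a str

out = io.StringIO()  # in-memory stand-in for a file opened for writing

try:
    out.write(item)  # same mistake as f.write(item) in the code above
    write_failed = False
except TypeError:
    write_failed = True

# Either conversion yields a str that write() accepts.
markup = str(item)      # the whole element, including the tag markup
text = item.get_text()  # only the text inside the element
out.write(text + "\n")
```

Two further things are probably worth checking in the original code: `open(file, encoding="utf8")` opens the file in the default read mode, so writing to `f` would likely fail even with a string, and the `@contextlib.contextmanager` decorator does not appear to be needed for a plain function like `trade_spider()`.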

added 1874 characters in body

edited May 17, 2016 at 21:18

Florian Schramm

added 708 characters in body

edited May 17, 2016 at 6:48

Florian Schramm

Further explanation: Basically, I want to find a certain name attribute (name=".+AuditFeesExpenses") in each HTML document. If this attribute is found, I want the name of the file, the name attribute and the corresponding HTML text to be written to a separate text file.
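The goal described above could be sketched roughly as follows. This is an illustration under assumptions, not tested against the real files: the function name `collect_audit_fees`, its parameters and the `"| "` separator (borrowed from the print calls) are hypothetical, and the matches are written to a separate output file rather than back into the HTML file being read.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def collect_audit_fees(root_dir, output_path, suffix="AuditFeesExpenses"):
    """Write one 'file| name| text' line per HTML file containing a match."""
    records = []
    # rglob walks root_dir and all of its subdirectories.
    for html_file in sorted(Path(root_dir).rglob("*.html")):
        contents = html_file.read_text(encoding="utf8")
        soup = BeautifulSoup(contents, "html.parser")
        for item in soup.find_all("ix:nonfraction"):
            # .get() avoids a KeyError for tags without a name attribute.
            if item.get("name", "").endswith(suffix):
                records.append((html_file.name, item["name"], item.get_text()))
                break  # keep only the first match per file

    # Open the output file once, in write mode, after all files are parsed.
    with open(output_path, "w", encoding="utf8") as out:
        for record in records:
            out.write("| ".join(record) + "\n")
    return records
```

It would be called with the directory holding the HTML files and the path of the text file to create, for example `collect_audit_fees("reports", "audit_fees.txt")`, where both names are placeholders.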

An example string that I extracted from a single HTML file is:

added 708 characters in body

edited May 17, 2016 at 6:41

Florian Schramm

Tweeted twitter.com/StackCodeReview/status/732349704080642052

occurred May 16, 2016 at 23:19

deleted 87 characters in body; edited tags; edited title; edited tags

edited May 16, 2016 at 17:36

200_success

edited title

edited May 16, 2016 at 17:33

Florian Schramm

edited title

edited May 16, 2016 at 17:18

Florian Schramm

asked May 16, 2016 at 17:06

Florian Schramm
