Rollback to Revision 6
Mathieu Guindon

EDIT: If I call f.write() in my code:

from bs4 import BeautifulSoup
import glob
import os
import contextlib


@contextlib.contextmanager


def trade_spider():
    os.chdir(r"C:\Users\6930p\FLO'S DATEIEN\Master FAU\Sommersemester 2016\02_Masterarbeit\04_Testumgebung\01_Probedateien für Analyseaspekt\Independent Auditors Report")
    for file in glob.iglob('**/*.html', recursive=True):
        with open(file, encoding="utf8") as f:
            contents = f.read()
            soup = BeautifulSoup(contents, "html.parser")
            for item in soup.findAll("ix:nonfraction"):
                if item['name'].endswith("AuditFeesExpenses"):
                    print(file.split(os.path.sep)[-1], end="| ")
                    print(item['name'], end="| ")
                    print(item.get_text())
                    f.write(item)
                    break
trade_spider()

I get the output of the first HTML file that has been parsed, but afterwards I get these error messages:

Prod224_0010_00178176_20131231.html| ns19:AuditFeesExpenses| 3,420
Traceback (most recent call last):
  File "C:/Users/6930p/PycharmProjects/untitled/Versuch/SomeTesting.py", line 23, in <module>
    trade_spider()
  File "C:\Users\6930p\Anaconda3\lib\contextlib.py", line 133, in helper
    return _GeneratorContextManager(func, args, kwds)
  File "C:\Users\6930p\Anaconda3\lib\contextlib.py", line 38, in __init__
    self.gen = func(*args, **kwds)
  File "C:/Users/6930p/PycharmProjects/untitled/Versuch/SomeTesting.py", line 21, in trade_spider
    f.write(item)
TypeError: write() argument must be str, not Tag

What am I doing wrong? It seems the 'for' loop is not being entered correctly.
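
Reading the traceback above, the loop itself does seem to run: the first match is printed before the exception. The failure is raised by `f.write(item)`, which receives a BeautifulSoup `Tag` where a `str` is required. Separately, `f` is opened with the default read mode, so writing to it would fail even with a string, and the `@contextlib.contextmanager` decorator appears unnecessary because the function never yields. The following is only a minimal sketch of one way around this, assuming the results should go to a separate text file; the helper names `format_record` and `find_audit_fee` and the parameters `root` and `out_path` are hypothetical and not part of the original code.

```python
import glob
import os


def format_record(file_name, attr_name, text):
    """Build one pipe-separated output line; every field is a plain str."""
    return "| ".join([file_name, attr_name, text]) + "\n"


def find_audit_fee(html, suffix="AuditFeesExpenses"):
    """Return (name, text) of the first matching ix:nonfraction tag, else None."""
    # Imported here so the helpers above stay usable without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    for item in soup.find_all("ix:nonfraction"):
        if item.get("name", "").endswith(suffix):
            # get_text() returns a str, which is what write() expects.
            return item["name"], item.get_text()
    return None


def trade_spider(root, out_path):
    """Scan every HTML file under root and write the matches to out_path."""
    # The output file is opened once, in write mode, separately from the inputs.
    with open(out_path, "w", encoding="utf8") as out:
        pattern = os.path.join(root, "**", "*.html")
        for path in glob.iglob(pattern, recursive=True):
            with open(path, encoding="utf8") as f:
                match = find_audit_fee(f.read())
            if match:
                out.write(format_record(os.path.basename(path), *match))
```

It could be called as `trade_spider(some_directory, "audit_fees.txt")`, where both arguments are placeholders. Passing the directory as an argument instead of calling `os.chdir` is a design choice of this sketch, not a requirement.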


Further explanation: Basically, I want to find a certain name attribute (name=".+AuditFeesExpenses") in each HTML document, and if this attribute is found, I want the file name, the name attribute, and the corresponding HTML text to be written to a separate text file.

An example string that I extracted from a single HTML file is:

Tweeted twitter.com/StackCodeReview/status/732349704080642052