
I am having a hard time understanding this code.

I would like to extract HTML comments using BeautifulSoup and Python3.

Given:

from bs4 import BeautifulSoup, Comment

html = '''
       <!-- Python is awesome -->
       <!-- Lambda is confusing -->
       <title>I don't grok it</title>
       '''

soup = BeautifulSoup(html, 'html.parser')

I searched for solutions and most people said:

comments = soup.find_all(text=lambda text: isinstance(text, Comment))

Which in my case would result in:

[' Python is awesome ', ' Lambda is confusing ']

This is what I understand:

  • isinstance asks if text is an instance of Comment and returns a boolean.
  • I sort of understand lambda. Takes text as an argument and evaluates the isinstance expression.
  • You can pass a function to find_all

This is what I do not understand:

  • What is text in text=?
  • What is text in lambda text?
  • What argument from html is passed into lambda text?
  • soup.text returns "I don't grok it". Why is lambda text being passed <!-- Python is awesome --> as an argument?
  • It really sounds like you need a basic Python tutorial, followed by a detailed reading of the relevant beautifulsoup docs. The docs are pretty hard to understand if you do not have a working knowledge of basic Python constructs. I am telling you this because even the explanations you get here will not help you as much as they ought until you understand the underlying concepts. While jumping head first into a subject is indeed a good way to learn, fundamentals are also very important. Commented Apr 24, 2018 at 13:44
  • @MadPhysicist can you please clarify what fundamentals I need? I think I am strong with basic Python. Commented Apr 24, 2018 at 15:10

2 Answers


Summary

.find_all() goes through each string in the parsed document and tries to match it against text=<our_text>. Instead of an actual string (as in the example below), <our_text> here is a lambda function that expresses a condition.

I'll explain each part of this question.

text=

from bs4 import BeautifulSoup

html = '''
       <!--Python is awesome-->
       <!--Lambda is confusing-->
       <title>I don't grok it</title>
       '''

soup = BeautifulSoup(html, 'html.parser')

print(soup.find_all(text='Python is awesome'))

Output:

['Python is awesome']

Here text= is just a keyword argument of find_all. Its value can be a plain string, a regular expression, a function, or a variable holding any of these. It just happens to be a lambda in our case. Next, we'll explain what the lambda does.
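
To make the point about what this argument accepts more concrete, here is a small sketch that is not part of the original answer. It assumes bs4 4.4.0 or later, where the same argument is also available under the name string=, and it reuses the sample HTML from above. The variable names exact, by_regex and by_function are made up for illustration.

```python
import re

from bs4 import BeautifulSoup, Comment

html = '''
       <!--Python is awesome-->
       <!--Lambda is confusing-->
       <title>I don't grok it</title>
       '''

soup = BeautifulSoup(html, 'html.parser')

# 1. A plain string: keeps strings that are exactly equal to it.
exact = soup.find_all(string='Python is awesome')

# 2. A compiled regex: keeps strings in which the pattern can be found.
by_regex = soup.find_all(string=re.compile('confusing'))

# 3. A function: keeps strings for which the function returns True.
by_function = soup.find_all(string=lambda s: isinstance(s, Comment))

print(exact)        # ['Python is awesome']
print(by_regex)     # ['Lambda is confusing']
print(by_function)  # ['Python is awesome', 'Lambda is confusing']
```

All three calls go through the same keyword argument; only the kind of value passed in changes.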

Lambda

This lambda function takes a single argument, which we have named text.

We never call the lambda ourselves: .find_all calls it automatically, once for each string it finds in the document, and passes that string in as text.

lambda text: isinstance(text, Comment) 

isinstance checks whether its first argument, text, is a Comment, and returns either True or False. For example, given some_var = 'Ey man', isinstance(some_var, str) returns True because some_var is a string (str).

Next, we combine both of these.

soup.find_all(text=lambda text: isinstance(text, Comment))

  1. soup.find_all goes through each piece of the document: <!--Python is awesome-->, <!--Lambda is confusing-->, <title>I don't grok it</title>.

  2. We put a condition inside .find_all(<the_condition>) and keep only the pieces that fulfill it.

  3. The condition in our case is:

    3.1. First, we don't check everything, only the strings: plain text inside tags, comments, and whatever other strings there are. That's text=.

    3.2. The text also has a condition: not every string is accepted, only those for which the lambda function returns True.

    3.3. The lambda's condition is that the string has to be an instance of Comment, so it returns True only for comments.

Only if all of these conditions are met do we keep that piece and store it in the result.
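
The steps above can be imitated in plain Python without BeautifulSoup at all. The sketch below is a toy model, not the library's actual implementation: ToyComment and toy_find_all are hypothetical names invented here, and the list of nodes is written by hand instead of being parsed from HTML.

```python
class ToyComment(str):
    """Hypothetical stand-in for bs4's Comment: a string with its own type."""


def toy_find_all(nodes, string):
    """Call the `string` function on every node and keep the ones it accepts."""
    matches = []
    for node in nodes:
        # This is the moment the lambda's `text` parameter gets its value.
        if string(node):
            matches.append(node)
    return matches


# Hand-written stand-ins for the strings in the sample document.
nodes = [
    ToyComment(' Python is awesome '),
    ToyComment(' Lambda is confusing '),
    "I don't grok it",
]

comments = toy_find_all(nodes, string=lambda text: isinstance(text, ToyComment))
print(comments)  # [' Python is awesome ', ' Lambda is confusing ']
```

The text in string= (or text=) is the name of the keyword argument, while the text in lambda text is just a parameter name that the caller fills in on each iteration. Renaming either one does not affect the other.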


3 Comments

@tomorodonez I have explained it in great detail in my answer. Let me know if I should clarify anything else. I would also appreciate edits or suggestions if they improve the answer in any way: readability, functionality, effectiveness.
Thanks for the details. The only change I'd make comes from digging into the BS4 docs: they say that the text argument was renamed to string. Also, the text variable inside the lambda could be named anything, for example: soup.find_all(string=lambda html_comment: isinstance(html_comment, Comment)). Thanks.
text is still supported. Yes, exactly, awesome understanding. Thank you too for making the effort to look into the docs. You asked an educated and well-formed question, and without that research it would have been hard for anyone to explain all of this. Take care!

What is text in text=?

A keyword argument to the find_all function

What is text in lambda text?

The parameter for the function, same as

def <name>(text)...

What argument from html is passed into lambda text

that would be up to you, in the sample the variable Comments refers to the text to parse.

soup.text returns I don't grok it. Why is lambda text passing <!-- Python is awesome --> as an argument?

that's just an example to be replaced with real HTML

2 Comments

Thanks. Can you please clarify? I know text= is a keyword argument, but what is its value? I know lambda text is the same as def name(text), but what is the value of text there? There is no variable Comments. Above, comments refers to a list of results, not to the text to parse. What do you mean by "to be replaced with real HTML"? Thanks.
The value is the lambda (a function definition, in other words). The value of text in the lambda definition is set at run time, when it is called by find_all. Comment is the type being tested in isinstance. The sample code you posted has sample HTML; in the real world it will be, well, real! Like the HTML for this page you are looking at right now.
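
One way to see this for yourself is to hand find_all a named function that records what it receives. This is only a sketch under a few assumptions: is_comment and seen_types are hypothetical names made up here, string= assumes bs4 4.4.0 or later (text= is the older spelling), and the HTML is the sample from the question.

```python
from bs4 import BeautifulSoup, Comment

html = '''
       <!-- Python is awesome -->
       <!-- Lambda is confusing -->
       <title>I don't grok it</title>
       '''

soup = BeautifulSoup(html, 'html.parser')

seen_types = []  # hypothetical helper: type names of everything passed in


def is_comment(text):
    """Same test as the lambda, but it also records what find_all hands over."""
    seen_types.append(type(text).__name__)
    return isinstance(text, Comment)


comments = soup.find_all(string=is_comment)
print(comments)         # [' Python is awesome ', ' Lambda is confusing ']
print(set(seen_types))  # the kinds of objects find_all passed to is_comment

# soup.text / get_text() leaves comments out, which is why only the
# title text shows up there even though find_all still visits the comments.
print('Python is awesome' in soup.get_text())  # False
print("I don't grok it" in soup.get_text())    # True
```

So the argument does not come from soup.text: find_all walks the parsed document itself, and comments are part of that document even though get_text() skips them.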
