I am scraping a website and trying to get specific values from a <script> tag within the HTML page. The page has many other <script> tags; the one I am targeting contains the URLs of all the images I need to scrape.

I am not able to scrape the images directly using Cheerio because they are not available on the main HTML page unless I click on the main image to see all other images.

What I need is something like this:

Find the <script> tag that contains the key someImages, then, for each key named large, return its value.

I have created an example below to explain my problem.

Your help is very much appreciated.

Thank you very much

<body>
    <script type="text/javascript"> ... </script>
    <script type="text/javascript"> ... </script>
    <script type="text/javascript"> ... </script>
    <script type="text/javascript"> ... </script>
    .
    .
    .
    
    <script type="text/javascript">
        var data = {
            'someImages': {
                'initial': [
                    {
                        "hiRes": "https://somewebsite/images/imageName1.jpg",
                        "thumb": "https://somewebsite/images/imageName1.jpg",
                        "large": "https://somewebsite/images/imageName1.jpg", // I would like to be able to get the value of large from this line
                        "main": {
                            "https://somewebsite/images/imageName1.jpg": [1654],
                            "https://somewebsite/images/imageName1.jpg": [3416],
                            "https://somewebsite/images/imageName1.jpg": [7560]
                        }
                    },

                    {
                        "hiRes": "https://somewebsite/images/imageName2.jpg",
                        "thumb": "https://somewebsite/images/imageName2.jpg",
                        "large": "https://somewebsite/images/imageName2.jpg", // I would like to be able to get the value of large from this line
                        "main": {
                            "https://somewebsite/images/imageName2.jpg": [2234],
                            "https://somewebsite/images/imageName2.jpg": [3616],
                            "https://somewebsite/images/imageName2.jpg": [7849]
                        }
                    },

                    {
                        "hiRes": "https://somewebsite/images/imageName3.jpg",
                        "thumb": "https://somewebsite/images/imageName3.jpg",
                        "large": "https://somewebsite/images/imageName3.jpg", // I would like to be able to get the value of large from this line
                        "main": {
                            "https://somewebsite/images/imageName3.jpg": [2344],
                            "https://somewebsite/images/imageName3.jpg": [3556],
                            "https://somewebsite/images/imageName3.jpg": [7490]
                        }
                    },
                ]
            }
        };
    </script>
    
    
    <script type="text/javascript"> ... </script>
    <script type="text/javascript"> ... </script>
    <script type="text/javascript"> ... </script>
    <script type="text/javascript"> ... </script>
    .
    .
    .

</body>
Once the script loads, you should be able to access the data variable directly. Commented Nov 15, 2020 at 16:30

1 Answer

A simple regex should do the trick. Use a capture group to capture the URLs.

/"large": ?"(.+?)",/g

You can test it in Regexpal if you want.
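
As a rough sketch of how the regex could be applied in Node without any extra libraries: the snippet below first isolates the script block that mentions someImages, then collects capture group 1 of every match. The function name extractLargeUrls, the marker parameter, and the shortened sample page are assumptions made for illustration; they are not part of the original question.

```javascript
// Hypothetical sample page, shortened from the question's example.
const html = `
<body>
  <script type="text/javascript">var unrelated = { "large": "https://somewebsite/images/other.jpg", };</script>
  <script type="text/javascript">
    var data = {
      'someImages': {
        'initial': [
          { "thumb": "https://somewebsite/images/imageName1.jpg",
            "large": "https://somewebsite/images/imageName1.jpg", },
          { "thumb": "https://somewebsite/images/imageName2.jpg",
            "large": "https://somewebsite/images/imageName2.jpg", }
        ]
      }
    };
  </script>
</body>`;

// Hypothetical helper: returns every "large" URL found in the script
// block that mentions `marker`, or an empty array if there is none.
function extractLargeUrls(pageHtml, marker = 'someImages') {
  // Collect the text content of every <script> element.
  const scriptBodies = [...pageHtml.matchAll(/<script\b[^>]*>([\s\S]*?)<\/script>/gi)]
    .map((match) => match[1]);

  // Keep only the first script that mentions the marker key.
  const target = scriptBodies.find((body) => body.includes(marker));
  if (target === undefined) return [];

  // The regex from the answer; capture group 1 holds the URL.
  return [...target.matchAll(/"large": ?"(.+?)",/g)].map((match) => match[1]);
}

const largeUrls = extractLargeUrls(html);
console.log(largeUrls);
```

Note that the pattern expects a comma right after the closing quote, so a "large" entry that happens to be the last key in its object would not match as written. If the page is already loaded with Cheerio, the script text could presumably be taken from there instead of the first matchAll step.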
