I need to process 10k JSON files in a folder called "All_Files", having names as Result_1.json, Result_2.json and so on, and having the following structure:

{
    "kind": "string",
    "url": {
        "type": "application/json",
        "template": "string"
    },
    "items": [
        {
            "kind": "string",
            "link": "https://www.somewhere.com/..."            
        },
        {
            "kind": "string",
            "link": "https://www.anywhere.com/..."            
        },
        {
            "kind": "string",
            "link": "https://www.nowhere.com/..."            
        },
        ...
        
        }
    ]
}

Note that a single JSON file may or may not contain an "items" array. If the "items" array is present, it can contain one or more objects like those in the example above. The "link" key contains a full URL. If the "items" array is present, I need to read each "link" value and check whether it begins with "https://www.nowhere.com". There may be additional characters after "https://www.nowhere.com" in a "link" value, but I only need to match that first part. If it matches, I need to save the name of the JSON file containing that value in a text file called "Found.txt", one unquoted filename per line.

Please help me in writing a Python script that does this.

  • Please share what you have tried and what you found while researching your problem. Commented Jul 11, 2021 at 20:22
  • This is part of a large research problem that I am part of. We are analysing responses from various search engines on some given criteria. Commented Jul 12, 2021 at 5:17
  • I already gave you an answer with an idea for solving your problem. If you had provided some code that you had tried, I might have given you some code as well, but you didn't present anything you had tried, so do it on your own! Commented Jul 12, 2021 at 5:21

1 Answer

To get the list of files, you can use os.listdir(). Note that this works as-is only if the directory contains nothing but JSON files.

import os
print(os.listdir("/path/to/All_Files"))

Now that you have the names of all the files in that directory, you can open each one in a with open(...) block and either parse it with the json module or simply read it as text.
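As a rough sketch of the "open with json" route, the snippet below lists only the .json files in a folder and parses each one. The helper name load_json_files is hypothetical (it is not part of the answer above), and skipping files that fail to parse is an assumption about what you would want.

```python
import json
from pathlib import Path


def load_json_files(folder):
    """Yield (filename, parsed_data) for every .json file in `folder`.

    Hypothetical helper for illustration only.
    """
    # glob("*.json") ignores any non-JSON files that happen to be in the folder
    for path in sorted(Path(folder).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            try:
                data = json.load(f)
            except json.JSONDecodeError:
                continue  # assumption: silently skip files that are not valid JSON
        yield path.name, data
```

You could then iterate with something like `for name, data in load_json_files("/path/to/All_Files"): ...`.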

Loop through all the files, read each one, and convert its content to a string with content = str(content), where content is the data you have already read from the file. You then have the file's data as text on every iteration. Add this code inside the loop:

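# Inside the loop: `content` is the file's text, `filename` is the current name from os.listdir().
pattern = r"(?i)\bhttps?://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]"

urls = re.findall(pattern, content)

# Match any URL that starts with the target prefix, not only an exact match.
if any(url.startswith("https://www.nowhere.com") for url in urls):
    with open("Found.txt", "a") as f:
        f.write(filename + "\n")  # one unquoted filename per line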

This uses re, so don't forget to import it in your program.
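An alternative to the regex approach is to parse each file and look only at the "link" values inside "items", which avoids matching URLs that appear elsewhere in the file (for example under "url"). The following is a minimal sketch under a few assumptions: the function names are made up for illustration, files that cannot be read or parsed are skipped, and the folder and output names are taken from the question.

```python
import json
from pathlib import Path

PREFIX = "https://www.nowhere.com"  # prefix given in the question


def has_matching_link(data, prefix=PREFIX):
    """Return True if any object in data["items"] has a "link" starting with prefix."""
    if not isinstance(data, dict):
        return False
    items = data.get("items")  # the "items" array may be absent
    if not isinstance(items, list):
        return False
    return any(
        isinstance(item, dict) and str(item.get("link", "")).startswith(prefix)
        for item in items
    )


def find_matching_files(folder, prefix=PREFIX):
    """Return the names of the .json files in `folder` that contain a matching link."""
    matches = []
    for path in sorted(Path(folder).glob("*.json")):
        try:
            with open(path, encoding="utf-8") as f:
                data = json.load(f)
        except (OSError, ValueError):
            continue  # assumption: skip unreadable or malformed files
        if has_matching_link(data, prefix):
            matches.append(path.name)
    return matches


def write_found(folder="All_Files", output="Found.txt"):
    """Write one unquoted filename per line to `output`."""
    with open(output, "w", encoding="utf-8") as out:
        out.writelines(name + "\n" for name in find_matching_files(folder))
```

Calling `write_found("All_Files", "Found.txt")` from the directory that contains All_Files should produce the list. Two caveats: names are sorted as strings, so Result_10.json comes before Result_2.json, and a plain prefix test would also accept a host such as www.nowhere.community, so compare against "https://www.nowhere.com/" if that matters for your data.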
