
I want to get a string value from a script in an HTML page using jsoup. There are two problems:

  1. There are six scripts in the page, and I want to select the fourth one with jsoup. I don't know how to do that.
  2. There is a key in that script, and I want to get the value of that key.

Here is the script I want:

<script type="text/javascript">window._sharedData={

  "entry_data": {
    "PostPage": [
      {
        "media": {

          "key": "This is the key and i wanna catch it!!!",

        },      
      }
    ]
  },

};</script>

I have tried many ways, but I wasn't successful.

I'm looking forward to an answer, so please don't let me down!

  • Please provide the link to the website so I can inspect the problem for you. Commented Nov 14, 2015 at 11:22

1 Answer


JSoup will only help you get the contents of the script tag as a string. It parses HTML, not the script content, which is JavaScript. Since in your case the content of the script is a simple object in JSON notation, you can run a JSON parser on it once you have the script string and have stripped off the variable assignment. In the code below I use the JSON.simple parser.

String html = "<script></script><script></script><script></script>"
    +"<script type=\"text/javascript\">window._sharedData={"
    +"  \"entry_data\": {"
    +"    \"PostPage\": ["
    +"      {"
    +"        \"media\": {"
    +"          \"key\": \"This is the key and i wanna catch it!!!\","
    +"        },"
    +"      }"
    +"    ]"
    +"  },"
    +"};</script><script></script>";
Document doc = Jsoup.parse(html);
// get the 4th script (the index is zero-based)
Element scriptEl = doc.select("script").get(3);
String scriptContentStr = scriptEl.html();
// strip the variable assignment and the trailing semicolon to get JSON
String jsonStr = scriptContentStr
     .replaceFirst("^.*=\\{", "{") // clean beginning
     .replaceFirst(";$", ""); // clean end
JSONObject jo = (JSONObject) JSONValue.parse(jsonStr);
// walk down the structure: entry_data -> PostPage[0] -> media -> key
JSONArray postPageJA = (JSONArray) ((JSONObject) jo.get("entry_data")).get("PostPage");
JSONObject postJO = (JSONObject) postPageJA.get(0);
JSONObject mediaJO = (JSONObject) postJO.get("media");
String keyStr = (String) mediaJO.get("key");

System.out.println("keyStr = " + keyStr);
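
The cleanup step above assumes that the assignment and the opening brace sit at the very start of the script content; the `^.*=\{` pattern may not match if the content begins with a line break. As a rough alternative sketch, you could cut out everything between the first `{` and the last `}` instead. The class and method names here (`ScriptJson`, `extractJsonObject`) are made up for illustration and are not part of jsoup or JSON.simple, and the approach assumes the script holds exactly one object literal:

```java
final class ScriptJson {

    // Hypothetical helper: returns the text from the first '{' to the last '}'.
    // Assumes the script content contains a single object literal.
    static String extractJsonObject(String scriptContent) {
        int start = scriptContent.indexOf('{');
        int end = scriptContent.lastIndexOf('}');
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("no object literal found in script content");
        }
        return scriptContent.substring(start, end + 1);
    }

    public static void main(String[] args) {
        String script = "\nwindow._sharedData={\"entry_data\": {\"PostPage\": []}};";
        // The leading line break and the trailing semicolon are both dropped.
        System.out.println(extractJsonObject(script));
    }
}
```

The returned string can then be handed to `JSONValue.parse` exactly as in the code above.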

This is a bit complicated, and it also depends on your knowing the structure of the JSON. A much simpler way may be to use regular expressions:

Pattern p = Pattern.compile(
    "media[\":\\s\\{]+key[\":\\s\\{]+\"([^\"]+)\"", 
    Pattern.DOTALL);
Matcher m = p.matcher(html);
if (m.find()){
    String keyFromRE = m.group(1);
    System.out.println("keyStr (via RegEx) = "+keyFromRE);  
}
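
If you need the same trick for other keys, the pattern can be built from the names instead of being hard-coded. The following is only a sketch: `KeyFinder` and `findValue` are hypothetical names, and it assumes, like the pattern above, that the key is the first entry inside its parent object and that the value is a string without escaped quotes.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KeyFinder {

    // Hypothetical helper: finds "parent": { "key": "value" and returns value.
    // Pattern.quote keeps special characters in the names from being read as regex syntax.
    static Optional<String> findValue(String text, String parent, String key) {
        Pattern p = Pattern.compile(
            "\"" + Pattern.quote(parent) + "\"\\s*:\\s*\\{\\s*"
            + "\"" + Pattern.quote(key) + "\"\\s*:\\s*\"([^\"]*)\"");
        Matcher m = p.matcher(text);
        // group(1) is the text between the quotes of the value
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<script>window._sharedData={\"media\": {\n  \"key\": \"some value\",\n}};</script>";
        System.out.println(findValue(html, "media", "key").orElse("(not found)"));
    }
}
```

To look for a different value, change the `parent` and `key` arguments to match the names that actually appear in the page source.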

4 Comments

Thank you so much. Honestly, I want to get "caption" from an Instagram page. Please take a look and tell me the best way to do this. Paste this line into Google Chrome: view-source:instagram.com/p/m7SaJFIhyB
"caption"? I do not understand what info you need to extract. Just modify my approach and you should be fine.
Thank you for the appreciation. The OP seems a bit lost at this :)
Thank you so much! It works. Yes, you are right, I was lost! But I have a tiny problem: I don't know how to write a pattern. Can you recommend a good learning resource?
