0

I have came to a part where I dont' managed to regex a json value from a html source.

The html source looks like:

<script data-csp-hash=""> window.__webpack_public_path__='https://renderer-assets.typeform.com/';
window.__webpack_nonce__='3088edaa602c001b5f6e1f31e3179422';
window.rendererAssets='["https://renderer-assets.typeform.com/vendors~libphonenumber~submission.c94d30638908af997673.js","https://renderer-assets.typeform.com/country-data.526012987a7e72182726.js","https://renderer-assets.typeform.com/form-container.98c74a2ac320736bdb16.js","https://renderer-assets.typeform.com/renderer.8282fd35106b77e43e2f.js","https://renderer-assets.typeform.com/submission.5d9a15e294b33a20ea2e.js","https://renderer-assets.typeform.com/vendors~form-container.b5fb128466f604baadba.js","https://renderer-assets.typeform.com/vendors~video.aa830e76dcc8735c9936.js","https://renderer-assets.typeform.com/video.45eca666f47b245e8fdb.js"]';
window.rendererData= {
    rootDomNode: 'root',
    form:     {
            "id":"Z3PvTW",
            "title":Testing",
            "welcome_screens":[ {
                "ref":"a13820db-af60-40eb-823d-86cf0f20299b",
                "title":"Yessir!",
                "properties": {
                    "show_button": true, "button_text": "Start"
                }
            }
            ],
            "thankyou_screens":[ {
                "ref":"default_tys",
                "title":"Done! Your information was sent perfectly.",
                "properties": {
                    "show_button": false, "share_icons": false
                }
            }
            ],
            "fields":[ {
                "id":"kxWycKljdtBq",
                "title":"FIRST NAME",
                "ref":"27f403f7-8c5b-4e18-b19d-1501e8f137ee",
                "validations": {
                    "required": true
                }
                ,
                "type":"short_text"
            }
            ,
            {
                "id":"WEXCnZ7EAFjN",
                "title":"LAST NAME",
                "ref":"a6bf6d83-ee37-4870-b6c5-779822290cde",
                "validations": {
                    "required": true
                }
                ,
                "type":"short_text"
            }
            ,
            {
                "id":"ButwoV1bTge5",
                "title":"EMAIL ADDRESS",
                "ref":"8860a4cf-71ec-4bfa-a2c7-934fd405f200",
                "properties": {
                    "description": "Note for stackoverflow!"
                }
                ,
                "validations": {
                    "required": true
                }
                ,
                "type":"email"
            }
            ],
            "_links": {
                "display": "link.com"
            }
        }
    ,
    messages: {
        "a11y.file-upload.remove":"Remove uploaded file",
    }
    ,
    trackingInfo: {
        "segmentKey": "9at6spGDYXelHDdz4r0cP73b3wV1f0ri", "accountId": 12587347, "accountLimitName": "Essentials", "userId": 12586030
    }
    ,
    stripe: null,
    showBranding: true,
    accessScheduling: {
        "closeScreenData": {
            "title":"This typeform isn't accepting new responses",
            "description":"",
            "brandingMottoText":"How you ask is everything",
            "brandingButtonText":"Create a *typeform*",
            "attachment": {}
            ,
            "textColor": "#3D3D3D", "showBranding": true, "brandingButtonColor": "#000000", "buttonRedirectLink": "https:\u002F\u002Fwww.typeform.com\u002Fsignup?utm_campaign=undefined&utm_source=typeform.com-12587347-Essentials&utm_medium=typeform&utm_content=typeform-closescreen&utm_term=EN"
        }
    }
    ,
    featureFlags: {
        "always-inject-new-relic": false, "beta-testers": false, "sb-3671-inline-submit-flow": "out-of-experiment", "sb-3671-new-submit-flow": false
    }
}

;
window.rendererTheme= {
    color: '#3D3D3D',
    backgroundColor: {
        red: '255', green: '255', blue: '255'
    }
}

;

and what I want to scrape is the form value json which is this part:

{
            "id":"Z3PvTW",
            "title":Testing",
            "welcome_screens":[ {
                "ref":"a13820db-af60-40eb-823d-86cf0f20299b",
                "title":"Yessir!",
                "properties": {
                    "show_button": true, "button_text": "Start"
                }
            }
            ],
            "thankyou_screens":[ {
                "ref":"default_tys",
                "title":"Done! Your information was sent perfectly.",
                "properties": {
                    "show_button": false, "share_icons": false
                }
            }
            ],
            "fields":[ {
                "id":"kxWycKljdtBq",
                "title":"FIRST NAME",
                "ref":"27f403f7-8c5b-4e18-b19d-1501e8f137ee",
                "validations": {
                    "required": true
                }
                ,
                "type":"short_text"
            }
            ,
            {
                "id":"WEXCnZ7EAFjN",
                "title":"LAST NAME",
                "ref":"a6bf6d83-ee37-4870-b6c5-779822290cde",
                "validations": {
                    "required": true
                }
                ,
                "type":"short_text"
            }
            ,
            {
                "id":"ButwoV1bTge5",
                "title":"EMAIL ADDRESS",
                "ref":"8860a4cf-71ec-4bfa-a2c7-934fd405f200",
                "properties": {
                    "description": "Note for stackoverflow!"
                }
                ,
                "validations": {
                    "required": true
                }
                ,
                "type":"email"
            }
            ],
            "_links": {
                "display": "link.com"
            }
        }

I have been able to almost scrape it by using this

(?sm)^\s*form:\s*{(.*?)\n}$ #Not quite sure if this would work in Python however.

https://regex101.com/r/zTJQ0A/3

However my problem is that it continues to scrape whats after the form value like messages, trackingInfo, stripe and so on and I just want to be able to get the form json and nothing else.

How can I be able to only get the regex for form: json value?

2
  • Hey 👋 I am curious to hear about your use case and why you are trying to scrape this part of the form. Typeform has an API to retrieve the form definition directly. On api.typeform.com/forms/:form_id Commented Jan 29, 2020 at 16:34
  • @NicolasGrenié Oh wow I did not know about it and this would actually help me 1000x easier. Thank you so much! Well the reason is that if you convert it to a special json parser, you will be able to send requests to that typeform so my idea was to make a script that can write for me and submit the typeform :) If you know what I mean? Commented Jan 29, 2020 at 16:41

1 Answer 1

1

You can try this method:

data = '''....'''
data = re.findall("form\:[\S\s]*messages",data)[0]
data = re.sub("^form\:","",data)
data = re.sub("\,\n|\smessages","",data)
print data
Sign up to request clarification or add additional context in comments.

2 Comments

It seems to do the form correct but however it prints out also the values after the form json but skips the message json. in this case it prints out the form and trackingInfo, stripe, showBranding window.rendererThemeand so on. So basically if there is a way to stop data = re.sub("\,\n|\smessages","",data) and use everything before messages value then it probably solve the issue
Then try data = re.sub("\,\n|\smessages[\s\S]*","",data)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.