2

I'm trying to solve a problem which I have done it with PHP, not sure how to do that in Python.

In the following three Rows, we like to match based on these two patterns:

  • only vine.co and twitter.com URLs (other domains should be ignored)

  • only URLs before commas , (last URL in each Row should be ignored)

Input

Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1

The output would be an array in Python (which this output is based on PHP):

array(3) {
  [0]=>
  string(30) "https://vine.co/v/5W2Dg3XPX7a
"
  [1]=>
  string(64) "https://twitter.com/dog_rates/status/836677758902222849/photo/1
"
  [2]=>
  string(63) "https://twitter.com/dog_rates/status/835264098648616962/photo/1"
}

PHP Code:

$input = 'Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1';

$array = preg_split('/Row\s\d:\s/s', $input);

$output = array();
foreach ($array as $key => $value) {
    if (strlen($value) > 1) {
        $URL_arrays = explode(',', $value);
        foreach ($URL_arrays as $key => $value) {
            if ($key = sizeof($URL_arrays) - 1) {
                unset($URL_arrays[sizeof($URL_arrays) - 1]);
            } else {
                $match = preg_match('/twitter\.com|vine\.co/s', $value);
                if ($match) {
                    array_push($output, $value);
                }
            }
        }
    }
}

var_dump($output);

This question is based on this RegEx problem, which you may answer either of which.

0

2 Answers 2

1

You can use this regex to capture all URLs having vine.com or twitter.com domain which have a comma just after the URL,

https:\/\/(?:www\.)?(?:vine\.co|twitter\.com)[^,\s]*(?=,)

As you wanted, the key point is this positive look ahead (?=,) which ensures, your URL is followed by a comma immediately after the URL.

Regex Demo

Python code extracting URLs using re.findall

import re

s = '''Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1'''

print(re.findall(r'https:\/\/(?:www\.)?(?:vine\.co|twitter\.com)[^,\s]*(?=,)', s))

Outputs,

['https://vine.co/v/5W2Dg3XPX7a', 'https://twitter.com/dog_rates/status/836677758902222849/photo/1', 'https://twitter.com/dog_rates/status/835264098648616962/photo/1']
Sign up to request clarification or add additional context in comments.

Comments

0

Because you don't need to hold duplicates, I would suggest to use a set instead of array (but order changes):

{url for x in s.split('\n') for url in x.split(': ')[1].split(',')  if 'vine.co' in url or 'twitter.co' in url}

Code:

s = '''Row 1: https://vine.co/v/5W2Dg3XPX7a,https://vine.co/v/5W2Dg3XPX7a
Row 2: https://twitter.com/dog_rates/status/836677758902222849/photo/1,https://twitter.com/dog_rates/status/836677758902222849/photo/1
Row 3: https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1'''

print({url for x in s.split('\n') for url in x.split(': ')[1].split(',')  if 'vine.co' in url or 'twitter.co' in url})

# {'https://twitter.com/dog_rates/status/835264098648616962/photo/1', 
#  'https://twitter.com/dog_rates/status/836677758902222849/photo/1',
#  'https://vine.co/v/5W2Dg3XPX7a'}

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.