
My database has a table of landing pages, and I want to classify whether the traffic comes from the SEO channel. When the landing page exactly matches the pattern /countrycode/index.aspx, it should be marked as 'SEO'; otherwise it should be marked as 'Non-SEO'.

The result should look something like this:

    landing_page                            channel
    /en/index.aspx                          SEO
    /de/index.aspx                          SEO
    /es/features/mobile-apps/index.aspx     Non-SEO
    /ja/products/product01123               Non-SEO

To do this, I wrote a regex expression in Redshift like this:

   SELECT 
      landing_page, 
      CASE 
        WHEN 
          regexp_substr(landing_page, '/\/[a-z]{2,4}\/index.aspx')  IS NULL
        THEN 'Non-SEO' ELSE 'SEO'
      END channel 
   FROM 
      marketing_table

I tested it in a regex tester, and it works perfectly there. However, when I run it in Redshift, the outcome is as follows:

    landing_page                            channel
    /en/index.aspx                          SEO
    /de/index.aspx                          SEO
    /es/features/mobile-apps/index.aspx     SEO
    /ja/products/product01123               SEO
    /download/testing                       Non-SEO

That is, everything between the / and /index.aspx seems to be accepted, whereas I need an exact match. How can I fix this?

Many thanks for your help!

Update: Sorry for the late update. The problem is still not solved. The most confusing part is that the same landing page is classified as SEO for some traffic and as Non-SEO for other traffic, e.g.

    landing_page                            channel
    /en/index.aspx                          SEO
    /en/index.aspx                          Non-SEO

We also tried other approaches, for example checking the length of the string instead of using a regex: len(landing_page) in (12,13,14,15,16). Does anyone have any idea what is going on?

  • Did my answer help? Note that you should really use $ to match the end of the string and ^ to match the start of the string. Commented Jun 19, 2018 at 7:24

2 Answers

You should use

'/[a-z]{2,4}/index[.]aspx'

Here, the / is removed from the start and [.] is used to match a literal dot. Regexes in Amazon Redshift are not written with delimiters, so you do not need to wrap the whole pattern in / characters, and you do not need to escape /, since it is not a special regex metacharacter.
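
To illustrate the literal-dot point, here is a small sketch using Redshift's regexp_count. The input '/en/indexXaspx' is a made-up string, not taken from the question. With an unescaped ., the X is accepted in place of the dot, so the first column should count one match and the second none.

```sql
-- '/en/indexXaspx' is a hypothetical input used only to show the difference.
SELECT
    -- An unescaped . matches any single character, including the X.
    regexp_count('/en/indexXaspx', '/[a-z]{2,4}/index.aspx')   AS unescaped_dot,
    -- [.] matches only a literal dot, so this input is rejected.
    regexp_count('/en/indexXaspx', '/[a-z]{2,4}/index[.]aspx') AS literal_dot;
```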


The answer above by @Wiktor-Stribiżew is almost right, but it is missing the start-of-line and end-of-line anchors. Consider the following input URL:

/es/features/en/index.aspx

According to the OP, this should not be classified as SEO, but with the regexp '/[a-z]{2,4}/index[.]aspx' it will be. The correct regexp to use is '^/[a-z]{2,4}/index[.]aspx$':

    select regexp_substr('/es/features/en/index.aspx','/[a-z]{2,4}/index[.]aspx');
    >>> /en/index.aspx
    select regexp_substr('/es/features/en/index.aspx','^/[a-z]{2,4}/index[.]aspx$');
    >>> null
    select regexp_substr('/en/index.aspx','^/[a-z]{2,4}/index[.]aspx$');
    >>> /en/index.aspx
    select regexp_substr('/es/features/mobile-apps/index.aspx','^/[a-z]{2,4}/index[.]aspx$');
    >>> null
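
Putting the anchored pattern back into the query from the question might look like the sketch below. It assumes the table and column names from the question (marketing_table, landing_page) and swaps regexp_substr for Redshift's POSIX ~ operator, which is my substitution rather than something the question used. It is one possible formulation, not the only one.

```sql
-- Sketch: classify landing pages with the anchored pattern.
-- marketing_table and landing_page are the names used in the question.
SELECT
    landing_page,
    CASE
        -- ~ tests whether the string matches the regex; ^ and $ pin the
        -- pattern to the whole string, so extra path segments do not match.
        WHEN landing_page ~ '^/[a-z]{2,4}/index[.]aspx$' THEN 'SEO'
        ELSE 'Non-SEO'
    END AS channel
FROM marketing_table;
```

Using a boolean match in the WHEN clause avoids depending on what regexp_substr returns when nothing matches. Rows where landing_page is NULL fall through to the ELSE branch and are labelled 'Non-SEO'.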
