SQL Regex substr function in amazon redshift

Question

In my database there is a table of landing pages, and I want to classify whether the traffic comes from the SEO channel. When the landing page exactly matches the pattern /countrycode/index.aspx, it should be labeled 'SEO'; otherwise it should be labeled 'Non-SEO'.

The result should look something like this:

    landing_page                               channel 
   /en/index.aspx                               SEO 
   /de/index.aspx                               SEO 
   /es/features/mobile-apps/index.aspx         Non-SEO 
   /ja/products/product01123                   Non-SEO

To do this, I wrote a regex expression in Redshift like this:

   SELECT 
      landing_page, 
      CASE 
        WHEN 
          regexp_substr(landing_page, '/\/[a-z]{2,4}\/index.aspx')  IS NULL
        THEN 'Non-SEO' ELSE 'SEO'
      END channel 
   FROM 
      marketing_table

I tested it in a regex tester and it works perfectly for me. However, when I run it in Redshift, the outcome is as follows:

    landing_page                               channel 
   /en/index.aspx                               SEO 
   /de/index.aspx                               SEO 
   /es/features/mobile-apps/index.aspx          SEO 
   /ja/products/product01123                    SEO
   /download/testing                           NON-SEO

That means every string between the / and /index.aspx is accepted, whereas I need an exact match. Any suggestions on how I can fix it?

Many thanks for your help!

Update: Sorry for the late update. The problem is still not solved. The most confusing point is that, for the same landing page in different traffic, some rows are classified as SEO and some are not, e.g.:

        landing_page                               channel 
   /en/index.aspx                               SEO 
   /en/index.aspx                               Non-SEO

We tried different methods, e.g. not using a regex but the length of the string, for example len(landing_page) in (12,13,14,15,16). Does anyone have any idea what is going on?

Did my answer help? Note that you should really use $ to match the end of the string and ^ to match the start of the string.
– Wiktor Stribiżew, Commented Jun 19, 2018 at 7:24

Wiktor Stribiżew · Accepted Answer · 2018-06-18 10:23:02Z


You should use

'/[a-z]{2,4}/index[.]aspx'

Here, the / is removed from the start and [.] is used to match a literal dot. Since the regex in Amazon Redshift does not use regex delimiters, you do not need to "wrap" the whole pattern with / chars and you do not need to escape / since they are not special regex metachars.
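
Applied to the query from the question, the change could look like the sketch below. Two things in it are assumptions rather than part of this answer: the ^ and $ anchors (suggested in the comment under the question) and the NULLIF wrapper, which treats an empty-string result the same as NULL in case regexp_substr returns '' rather than NULL when nothing matches. That behavior is worth confirming in the Redshift documentation.

```sql
-- Sketch only: marketing_table and landing_page are the names used in the question.
SELECT
    landing_page,
    CASE
        -- NULLIF(x, '') turns an empty string into NULL, so a single IS NULL
        -- test covers both possible "no match" results (assumption, see above).
        WHEN NULLIF(REGEXP_SUBSTR(landing_page, '^/[a-z]{2,4}/index[.]aspx$'), '') IS NULL
        THEN 'Non-SEO'
        ELSE 'SEO'
    END AS channel
FROM marketing_table;
```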


Elle · Accepted Answer · 2018-06-18 19:12:07Z

The answer above by @Wiktor-Stribiżew is almost right, but it is missing the start- and end-of-line anchors. Consider the following input URL:

/es/features/en/index.aspx

According to the OP, this should not be classified as SEO, but with the regexp '/[a-z]{2,4}/index[.]aspx' it will be. The correct regexp to use is '^/[a-z]{2,4}/index[.]aspx$':

select regexp_substr('/es/features/en/index.aspx','/[a-z]{2,4}/index[.]aspx');
>>> /en/index.aspx
select regexp_substr('/es/features/en/index.aspx','^/[a-z]{2,4}/index[.]aspx$');
>>> null
select regexp_substr('/en/index.aspx','^/[a-z]{2,4}/index[.]aspx$');
>>> /en/index.aspx
select regexp_substr('/es/features/mobile-apps/index.aspx','^/[a-z]{2,4}/index[.]aspx$');
>>> null
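
To check the anchored pattern against several URLs at once, a self-contained query along these lines may help. The sample rows are the landing pages from the question plus the counterexample above; the CTE name sample_pages is made up for illustration. The sketch uses the POSIX match operator ~ instead of regexp_substr, so the test is a plain boolean and does not depend on what regexp_substr returns for a non-match.

```sql
-- Hypothetical inline sample data; replace sample_pages with the real table.
WITH sample_pages AS (
    SELECT '/en/index.aspx' AS landing_page
    UNION ALL SELECT '/de/index.aspx'
    UNION ALL SELECT '/es/features/mobile-apps/index.aspx'
    UNION ALL SELECT '/ja/products/product01123'
    UNION ALL SELECT '/es/features/en/index.aspx'
)
SELECT
    landing_page,
    CASE
        -- ^ and $ force the whole value to match, not just a substring.
        WHEN landing_page ~ '^/[a-z]{2,4}/index[.]aspx$' THEN 'SEO'
        ELSE 'Non-SEO'
    END AS channel
FROM sample_pages;
```

Only /en/index.aspx and /de/index.aspx should come back as SEO. If identical-looking pages still get different labels on real data, comparing LEN(landing_page) between them may reveal trailing whitespace or other invisible characters, which the $ anchor would reject.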
