My database has a table of landing pages, and I want to classify whether the traffic comes from the SEO channel. When the landing page exactly matches the pattern /countrycode/index.aspx, it should be marked as 'SEO'; otherwise it should be marked as 'Non-SEO'.
The result should look something like this:
landing_page channel
/en/index.aspx SEO
/de/index.aspx SEO
/es/features/mobile-apps/index.aspx Non-SEO
/ja/products/product01123 Non-SEO
To do this, I wrote a regex expression in Redshift like this:
SELECT
    landing_page,
    CASE
        WHEN regexp_substr(landing_page, '/\/[a-z]{2,4}\/index.aspx') IS NULL
        THEN 'Non-SEO' ELSE 'SEO'
    END channel
FROM
    marketing_table
I tested it in a regex tester and it works perfectly for me. However, when I apply it in Redshift, the outcome is as follows:
landing_page channel
/en/index.aspx SEO
/de/index.aspx SEO
/es/features/mobile-apps/index.aspx SEO
/ja/products/product01123 SEO
/download/testing NON-SEO
That means any string between the / and /index.aspx is accepted, whereas what I need is an exact match. Does anyone have a suggestion for how I can fix it?
Many thanks for your help!
Update: Sorry for the late update, everyone. The problem is still not solved. The most confusing point is that the same landing page in different traffic rows is sometimes marked as SEO and sometimes not, e.g.:
landing_page channel
/en/index.aspx SEO
/en/index.aspx Non-SEO
We tried different methods, e.g. not using a regex but the length of the string, for example len(landing_page) in (12,13,14,15,16). Does anyone have any idea about this?
Use $ to match the end of the string and ^ to match the start of the string.
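As a minimal sketch of that suggestion, the query from the question could be anchored at both ends as shown below. It reuses the landing_page column and marketing_table from the question. Two things here are my own assumptions rather than part of the original query: the leading / left over from the regex tester's delimiters is dropped, and the dot is written as [.] so that no backslash escaping is needed inside the string literal. The ~ operator is Redshift's POSIX regex match.

```sql
SELECT
    landing_page,
    CASE
        -- ^ and $ force the whole value to match, not just a substring.
        -- [.] matches a literal dot without relying on backslash escapes.
        WHEN landing_page ~ '^/[a-z]{2,4}/index[.]aspx$' THEN 'SEO'
        ELSE 'Non-SEO'
    END AS channel
FROM
    marketing_table;
```

With this pattern, /en/index.aspx should be labelled SEO, while /es/features/mobile-apps/index.aspx and /ja/products/product01123 should fall through to Non-SEO. If identical-looking rows still get different labels, it may be worth checking the values for trailing whitespace or other hidden characters, although that is only a guess.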