
I want to select rows whose URLs contain one or more Chinese characters. I wrote a SQL query that uses REGEXP to do this, but it fails because "/" also matches the pattern.

The query is

SELECT "/" REGEXP '.*[^\x0f-\xff].*'

and Sequel Pro returns 1.

However, when I test the same regexp on an online regex-testing website, it returns 0.

Why does the same regexp behave differently on that website and in Sequel Pro? If the website applies some extra processing to the pattern, how can I get the same behavior in Sequel Pro?

    Well, you know: "The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multibyte safe and may produce unexpected results with multibyte character sets.". You appear to be using some single-byte encoding though but you don't consider it relevant enough to mention which one so I guess that's the root problem: a byte, without context, can mean anything. Commented Jan 6, 2016 at 11:14

1 Answer
SELECT ...
    WHERE HEX(str) REGEXP '^(..)*E[3456789ABCD]';

will check for a variety of CJK characters. (This assumes str is CHARACTER SET utf8 or utf8mb4.) This may include Japanese and Korean characters, too.
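To see why the `^(..)*` prefix matters, here is a small sketch that runs the pattern against hand-written hex strings instead of a real column. The sample values are only an illustration and are not from the question: `E4B8AD` is the UTF-8 encoding of 中, `2F` is "/", and `4E3870` is the ASCII string `N8p`.

```sql
-- The hex strings are typed out by hand, so the results do not
-- depend on the connection character set.
SELECT 'E4B8AD' REGEXP '^(..)*E[3456789ABCD]' AS cjk_char,        -- 1: lead byte E4
       '2F'     REGEXP '^(..)*E[3456789ABCD]' AS slash,           -- 0: no byte in E3..ED
       '4E3870' REGEXP '^(..)*E[3456789ABCD]' AS ascii_aligned,   -- 0: bytes are 4E, 38, 70
       '4E3870' REGEXP 'E[3456789ABCD]'       AS ascii_unaligned; -- 1: "E3" straddles two bytes
```

Without the prefix the pattern can match across a byte boundary, which is why the last column is a false positive.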

I'm digging around for the 'extension' characters; seems like they begin with F0.

EDIT

Well, it turns out that Chinese is all over the place:

REGEXP
'^(..)*(E2B[AB]|E380|E387|E38[89AB]|E38[CDEF]|E[34][9AB][0-9A-F]|E[456789]B[89ABCDEF]|EFA[456789AB]|EFB[89]|F0A[0123456789A][89][0-9A-F]|F0A[AB]9C|F0AB[9A][DEF0]|F0A[BC][AB][0-9A-F]|F0AFA[012345678])'
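As for why the pattern in the question matched "/" at all, one likely explanation (an assumption; I have not checked it against the asker's setup) is that MySQL string literals do not treat `\x` as a hex escape. Under the default `sql_mode`, a backslash before an unrecognized character is simply dropped, so the regexp engine receives the character class `[^x0f-xff]` rather than a byte range. A regex-testing website that does understand `\x0f` and `\xff` would build a different class, which would account for the different result. A minimal sketch:

```sql
-- Assumes the default sql_mode (NO_BACKSLASH_ESCAPES is not set).
SELECT '\x0f-\xff';              -- the text that actually reaches REGEXP: x0f-xff
SELECT '/' REGEXP '[^x0f-xff]';  -- 1: "/" is not x, 0, f, or in the range f-x
SELECT 'g' REGEXP '[^x0f-xff]';  -- 0: "g" falls inside the range f-x
```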

1 Comment

I think the revised regexp isn't quite right; maybe I will work on it later.
