
I'm trying to match and extract all the table names and columns from any given MySQL query.

The identifiers in the given query are unquoted (no backticks). According to the MySQL documentation, the naming rules are:

Permitted characters in unquoted identifiers:

ASCII: [0-9,a-z,A-Z$_] (basic Latin letters, digits 0-9, dollar, underscore)

Extended: U+0080 .. U+FFFF

For a test case I'm using this query:

SELECT  users.id , users.first_name ,users.last_name,  roles.role,avatars.img_name,timezone.gmt_offset
FROM  users 
LEFT JOIN roles ON  users.role = roles.id 
LEFT JOIN  avatars ON users.avatar=avatars.id 
LEFT JOIN country ON users.country=country.country_code 
LEFT JOIN  timezone ON users.timezone = timezone.id 
WHERE (users.id >=2 AND  users.id <=4 ) OR (roles.role LIKE  'us%')
OR (roles.role =  'user(complex.sit )' && (timezone.gmt_offset >=7200
OR  users.last_name ='tryme'))
LIMIT 0 , 30

My Regex so far:

%[ .(),]?([a-z0-9_$]{2,})[ .(),]?(?!AND|OR|LIKE|SELECT|JOIN|ON)%i

I'm planning to capture the group and replace it with the match wrapped in backticks. The problem is that I can't filter out the reserved words, which are being matched too (SELECT, JOIN, ...). I have tried adding a negative lookahead, but it doesn't work.

The second problem is with string values such as = 'user(complex.sit )' in the example: I don't want the two words inside it (complex, sit) to be matched.

Any suggestions?

2 Answers

Yet another question of the form, "How do I manipulate some little program in language XXX using Regexps (in host language YYY)?" The underlying question is, "How do I parse a program in language XXX using regexps?" The correct answer is almost always, DON'T.

You are embarking on a path of tears. Regexps are not designed to parse any but the most trivial, limited languages. You may find a regexp that meets your immediate needs, but when another requirement comes in, you hit a brick wall. The regexps grow longer and longer and become impossible to understand, much less maintain.

To parse a language, use a parser. At this point in time, it's not really an exaggeration to say that there are parsers available for virtually all languages for virtually all platforms.

I don't know what language/platform you are working in, so I won't suggest any specific parsers, but a query for "JavaScript SQL parser" brought up this thing right away: https://www.npmjs.com/package/simple-sql-parser. Just meant as an example.
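Short of a full parser, even a tiny tokenizer avoids both problems from the question. The following is a minimal sketch in Python, not a real SQL parser: the keyword set is a hypothetical, deliberately incomplete list (a real one would come from MySQL's reserved-word list), and the function name `quote_identifiers` is made up for illustration. It consumes string literals and already-backticked names as whole tokens, so nothing inside them is touched, and it decides per word whether to quote.

```python
import re

# Assumption: a hypothetical, incomplete keyword list for illustration only.
KEYWORDS = {"SELECT", "FROM", "LEFT", "JOIN", "ON", "WHERE",
            "AND", "OR", "LIKE", "LIMIT"}

TOKEN = re.compile(
    r"""
    (?P<string>'(?:[^'\\]|\\.|'')*')          # single-quoted string literal
  | (?P<quoted>`[^`]*`)                       # identifier that is already backticked
  | (?P<word>[0-9A-Za-z_$\u0080-\uffff]+)     # run of unquoted-identifier characters
    """,
    re.VERBOSE,
)


def quote_identifiers(sql):
    """Wrap bare identifiers in backticks, leaving literals and keywords alone."""
    def repl(match):
        word = match.group("word")
        if word is None:
            # String literal or already-quoted identifier: copy through unchanged.
            return match.group(0)
        if word.isdigit() or word.upper() in KEYWORDS:
            # Plain integers and (known) keywords are not identifiers.
            return word
        return "`" + word + "`"

    return TOKEN.sub(repl, sql)


print(quote_identifiers(
    "SELECT users.id FROM users WHERE roles.role = 'user(complex.sit )'"
))
```

Because the string alternative is tried first, `complex` and `sit` are swallowed as part of the literal and never reach the identifier branch. The sketch still has obvious gaps (function names, aliases, double-quoted strings, comments, non-integer numbers), which is exactly where a real parser earns its keep.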


2 Comments

Hi, thank you. I totally agree, but I'm not parsing a language here. I have finished my new (PHP) Database class, which handles composing advanced queries; this is just something I was wondering about, whether it's possible. The main purpose is not validating the queries but handling the naming rules for unquoted and quoted identifiers.
It is totally valid to parse anything using regular expressions if the use case justifies it. Using a bloated parser for simple tasks is not the right way to go.

Use the g (global) modifier with the expression:

%[\s.(),]+?([a-z\d_$]{2,})[\s.(),]*?(?:AND|OR|LIKE|SELECT|JOIN|ON|)%g
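For what it's worth, a negative lookahead only filters a word if it is tested at the start of that word. Placed after the capture group, it inspects whatever follows the match, so the word itself is never rejected. Here is a small sketch in Python's `re` (an assumption; the thread does not say which engine is used), with the patterns simplified from the ones above and the names `trailing` and `leading` invented for the comparison:

```python
import re

C = r"[a-z0-9_$]"                              # one identifier character
RESERVED = r"(?:AND|OR|LIKE|SELECT|JOIN|ON)"   # same short list as in the question

# Lookahead AFTER the word: it checks the text following the match,
# so JOIN, ON and OR are still captured.
trailing = re.compile(r"(" + C + r"{2,})(?!AND|OR|LIKE|SELECT|JOIN|ON)", re.I)

# Lookahead BEFORE the word: reject the position if a reserved word starts
# here and ends here (the inner (?!C) keeps "orders" from being rejected
# just because it begins with "or"). The lookbehind stops matches from
# starting in the middle of a word.
leading = re.compile(
    r"(?<!" + C + r")(?!" + RESERVED + r"(?!" + C + r"))(" + C + r"{2,})",
    re.I,
)

sql = "JOIN roles ON users.role = roles.id OR orders.id"
print(trailing.findall(sql))
print(leading.findall(sql))
```

This only addresses the reserved-word filtering. Words inside quoted values such as 'user(complex.sit )' would still match, because the pattern has no notion of being inside a string literal.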

5 Comments

I played with that before I posted. It removes the 'on' from timezone, and more. I don't know why the results are so odd; most of the table names are not matched.
When I use %g with the expression, I get this result: regex101.com/r/dY3bG4/1
The g is for global, no? The result you got is similar to what I have on my computer, but complex and sit are still matched. I need to avoid them since they are values, not table names or columns.
Yes, exactly: all names that can be backticked.
You will also notice that when you add the i modifier or A-Z, it stops working and the words are no longer escaped.
