11

I have a text file containing a large number of queries. I want to get all the distinct tables used across all the queries in the file. A table name can come after FROM or JOIN. How can I extract them with a regex match? Can anyone suggest a regular expression to get the matches?

7 Answers 7

11

It depends on structure of your file. Try to use this:

(?<=from|join)(\s+\w+\b)

Also turn on the Multiline option if you have not split the file into an array (or something similar) of single-line strings. Also try turning on the IgnoreCase option.
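For the "distinct tables in the entire file" part of the question, here is a minimal Python sketch of the same lookbehind idea. The function name `distinct_tables` and the sample queries are made up for illustration, and lowercasing the names assumes table names are case-insensitive in your database. Since `\s+` already matches newlines, the whole file content can be passed in as one string.

```python
import re

# Same lookbehind as above; the capture group leaves out the leading whitespace.
TABLE_AFTER_KEYWORD = re.compile(r"(?<=from|join)\s+(\w+)\b", re.IGNORECASE)

def distinct_tables(sql_text):
    """Return the sorted, de-duplicated names that follow FROM or JOIN."""
    # Assumption: names are compared case-insensitively, hence lower()
    return sorted({name.lower() for name in TABLE_AFTER_KEYWORD.findall(sql_text)})

# Hypothetical sample input standing in for the contents of the query file
queries = """
SELECT * FROM orders o JOIN customers c ON o.cid = c.id;
select * from Orders where id = 1;
"""
print(distinct_tables(queries))  # ['customers', 'orders']
```

Like the pattern it is based on, this only sees plain `\w+` names, so schema-qualified or quoted names need a wider capture group.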


7 Comments

I think you have to move the \s+ into the positive lookbehind
-1 Regex is not the correct way to solve this problem. As tdammers states, a SQL parser of some description is required to effectively solve this problem.
@El Ronnoco I asked for a solution using Regex, and hence he has provided one. I just want a quick and dirty solution for this, and I got it.
To support schema-qualified names, I added a few symbols: (?<=from|join)(\s+\w+\.+\w+\b)
I think it's not gonna work if there is a commented select in your string.
8

I'd use:

Regex r = new Regex(@"(from|join)\s+(?<table>\S+)", RegexOptions.IgnoreCase);

Once you have the Match object "m", you'll have the table name with

m.Groups["table"].Value

example:

string line = @"select * from tb_name join tb_name2 ON a=b WHERE x=y";
Regex r = new Regex(@"(from|join)\s+(?<table>\S+)",
         RegexOptions.IgnoreCase|RegexOptions.Compiled);

Match m = r.Match(line);
while (m.Success) {
   Console.WriteLine (m.Groups["table"].Value);
   m = m.NextMatch();
}

It will print tb_name and tb_name2, one per line.
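One thing to watch with `\S+` is that it also captures punctuation attached to the name, such as a trailing `;`. A rough Python sketch of the same named-group approach with that trimmed off is below. The helper name `tables_in` and the set of stripped characters are assumptions made for illustration, not part of the answer above.

```python
import re

# Python spells the named group (?P<table>...) instead of (?<table>...).
PATTERN = re.compile(r"(from|join)\s+(?P<table>\S+)", re.IGNORECASE)

def tables_in(line):
    """Yield each token after FROM/JOIN with trailing punctuation trimmed."""
    for m in PATTERN.finditer(line):
        # \S+ also swallows characters such as ';', ',' or ')', so strip them
        yield m.group("table").rstrip(";,)")

line = "select * from tb_name join tb_name2 ON a=b WHERE x=y"
print(list(tables_in(line)))                    # ['tb_name', 'tb_name2']
print(list(tables_in("delete from tb_name;")))  # ['tb_name']
```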

Comments

1

Something like this maybe:

/(from|join)\s+(\w*\.)*(?<tablename>\w+)/

It won't match escaped table names though, and you need to make the regex evaluation case-insensitive.
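As a rough sketch of how the escaping gap could be narrowed, the Python version below accepts backquoted, double-quoted and square-bracketed identifier parts, and adds a `\b` so the keyword is not matched inside a longer word. The names `PART`, `TABLE` and `table_names` are hypothetical, and unlike the pattern above the group here captures the whole qualified name rather than only its last part.

```python
import re

# One identifier part: `backquoted`, "double-quoted", [bracketed], or a bare word.
PART = r'(?:`[^`]+`|"[^"]+"|\[[^\]]+\]|\w+)'

# \b stops the keyword from matching inside a longer word;
# (?:PART\.)* allows optional schema/database qualifiers before the table.
TABLE = re.compile(
    r"\b(?:from|join)\s+(?P<tablename>(?:" + PART + r"\.)*" + PART + r")",
    re.IGNORECASE,
)

def table_names(sql):
    """Return every (possibly qualified, possibly quoted) name after FROM/JOIN."""
    return [m.group("tablename") for m in TABLE.finditer(sql)]

sql = ('select * from [dbo].[orders] o '
       'join `shop`.`items` i on 1=1 '
       'join "public"."users" u on 1=1')
print(table_names(sql))
# ['[dbo].[orders]', '`shop`.`items`', '"public"."users"']
print(table_names("select foobarfrom blah"))  # []
```

This is still a regex heuristic: it does not know about comments, string literals or subqueries.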

4 Comments

Sorry. It is not even returning one match.
Absolutely wrong. You can't find table names with such a regex: it will also match the words join and from, not only table names.
I haven't tested it, but it should match each occurrence of 'from' or 'join', followed by at least one whitespace, and then one or more identifiers separated with dots. There is one error though; it should start with a start-of-word assertion, otherwise it will also match things like foobarfrom blah. And it doesn't take escaping into account, because that's a DBMS-specific thing - MySQL uses backquotes, PostgreSQL uses double quotes, T-SQL uses square brackets.
Oh, and obviously, regexes are not a reliable way for doing this anyway. If you want reliable, you need a full-blown SQL parser.
1

Solutions that can help you.

1. Extract table names with alias from an SQL statement with Regex

Regular expression

/(from|join|into)\s+([`]\w+.*[`] *\w+|(\[)\w+.*(\]) *\w+|\w*\.*\w+ *\w+)/g

2. Extract table names from an SQL statement with Regex

Regular expression

/(from|join|into)\s+([`]\w+.+\w+\s*[`]|(\[)\w+.+\w+\s*(\])|\w+\s*\.+\s*\w*|\w+\b)/g

Test string

-------------------------------------------------------
    select * into [dbo].[table_temp] 
    from [dbo].[table_a] a inner join dbo.table_b b ...
    join table_c c on ...
    from dbo.table_d d ...
    from `dbo`.`table_e` e ...
    from table_f f ...
-------------------------------------------------------

Generated Code for C#

using System;
using System.Collections.Generic; // List<string>, IEnumerable<string>
using System.Text.RegularExpressions;

public static class QueryExtension
    {
        public static List<string> GetTables(this string query)
        {
            List<string> tables = new List<string>();
            string pattern = @"(from|join|into)\s+([`]\w+.+\w+\s*[`]|(\[)\w+.+\w+\s*(\])|\w+\s*\.+\s*\w*|\w+\b)";            
            
            foreach (Match m in Regex.Matches(query, pattern))
            {                
                string name = m.Groups[2].Value;                                
                tables.Add(name);
            }

            return tables;
        }
        public static string Join(this IEnumerable<string> values, string separator) {
            return string.Join(separator, values);
        }
    }

How to use it.

string input = @"select * into [dbo].[table_temp] 
from [dbo].[table_a] a inner join dbo.table_b b ...
join table_c c on ...
from dbo.table_d d ...
from `dbo`.`table_e` e ...
from table_f f ...";

Console.WriteLine(input.GetTables().Join("\n"));

Output

[dbo].[table_temp]
[dbo].[table_a]
dbo.table_b
table_c
dbo.table_d
`dbo`.`table_e`
table_f
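To try pattern 2 on other inputs, it can be ported to Python as in the sketch below (the function name `get_tables` is made up here). One behaviour worth knowing about: the `.+` inside the bracket and backtick alternatives is greedy, so when two bracketed names sit on the same line the match runs to the last `]` on that line.

```python
import re

# Pattern 2 from above, unchanged.
PATTERN = re.compile(
    r"(from|join|into)\s+([`]\w+.+\w+\s*[`]|(\[)\w+.+\w+\s*(\])|\w+\s*\.+\s*\w*|\w+\b)"
)

def get_tables(query):
    """Return group 2 of every match, as the C# version does."""
    return [m.group(2) for m in PATTERN.finditer(query)]

# Line taken from the test string above
print(get_tables("from [dbo].[table_a] a inner join dbo.table_b b ..."))
# ['[dbo].[table_a]', 'dbo.table_b']

# Two bracketed names on one line: the greedy .+ runs to the last ']'
print(get_tables("from [dbo].[a] x join [dbo].[b] y"))
# ['[dbo].[a] x join [dbo].[b]']
```

If that case matters for your files, restricting what may appear between the brackets (for example `[^\]]+` instead of `.+`) is one possible way to tighten it.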

3. Extract column names from an SQL statement with Regex

Regular expression

/(\w*\.*\w+|`\w*.*\w`|(\[)\w*.*(\]))+(,|\s+,|\s+FROM|\s+from)/g

Test string

-------------------------------------------------------
SELECT  
    [a].[column_1],  
    `b`.`column_2`,  
    c.column_3,  
    col4 as column_4,  
    col5 as `column_5`,   
    col6 as [column_6],
    column_7,
    a.col8 column_8,    
    (select max(column_x) from table_d where column_y in ('1','2','3')) as column_9    
from table_a a
inner join table_b b on ...
inner join table_c c on ...
-------------------------------------------------------

Generated code for C#

public static class QueryExtension
    {        
        public static List<string> GetColumns(this string query)
        {
            List<string> columns = new List<string>();
            string pattern = @"(\w*\.*\w+|`\w*.*\w`|(\[)\w*.*(\]))+(,|\s+,|\s+FROM|\s+from)";

            foreach (Match m in Regex.Matches(query, pattern))
            {
                string name = m.Groups[1].Value;
                columns.Add(name);
            }

            return columns;
        }
        public static string Join(this IEnumerable<string> values, string separator) {
            return string.Join(separator, values);
        }
    }

How to use it

string input1 = @"SELECT  
    [a].[column_1],  
    `b`.`column_2`,  
    c.column_3,  
    col4 as column_4,  
    col5 as `column_5`,   
    col6 as [column_6],
    column_7,
    a.col8 column_8,    
    (select max(column_x) from table_d where column_y in ('1','2','3')) as column_9    
from table_a a
inner join table_b b on ...
inner join table_c c on ...
";

 Console.WriteLine(input1.GetColumns().Join("\n"));

Output

[a].[column_1]
`b`.`column_2`
c.column_3
column_4
`column_5`
[column_6]
column_7
column_8
column_9

2 Comments

I was looking for some ideas for myself, but I did not find them as expected, so I have to leave something. I hope you enjoy it.
I don't get the [`] thing.
1

I am working on the same problem and came across your question. My solution is in Python, but you might find it useful and may want to port my code to C#.

The idea is to first remove any comments and then extract table names by matching against the FROM and JOIN clauses using regular expressions.

My approach:

  1. Remove Comments: Remove both multiline (/* */) and single-line (--) comments to avoid false positives.
  2. Extract Tables: Use regular expressions to locate table names following FROM and JOIN. My solution also considers nested queries and subqueries, capturing tables at multiple levels.

import re

def remove_comments(sql):
    """
    Removes multiline (/* ... */) and single-line (--) comments from the SQL string.
    Comment markers inside single-quoted string literals are left untouched.
    """
    # One pass over the text: alternative 1 captures a string literal ('' is an
    # escaped quote), alternatives 2 and 3 match the two comment styles.
    # DOTALL lets . span newlines inside /* ... */.
    pattern = re.compile(r"('(?:''|[^'])*')|/\*.*?\*/|--[^\n]*", re.DOTALL)
    # Keep string literals as they are; replace comments with nothing
    sql_no_comments = pattern.sub(lambda m: m.group(1) or '', sql)
    return sql_no_comments

def extract_tables(sql):
    """
    Extracts all table names from an SQL string, including those from subqueries.
    Comments are removed first.
    """
    # First, remove comments
    sql = remove_comments(sql)
    
    tables = []

    def parse_segment(segment):
        # Regex to capture a FROM or JOIN clause up to a possible terminating SQL keyword
        from_pattern = re.compile(
            r"(?i)\b(?:FROM|JOIN)\b\s+((?:(?!\b(?:WHERE|GROUP\s+BY|ORDER\s+BY|HAVING|JOIN|UNION|INTERSECT|EXCEPT)\b).)+)",
            re.DOTALL
        )
        for match in from_pattern.finditer(segment):
            clause = match.group(1)
            # Within the FROM clause: Extract all table names, even if separated by commas
            table_pattern = re.compile(r"(?i)(?:^|,)\s*(?:(?:\w+\.)?(\w+))(?:@\w+)?\b")
            found = table_pattern.findall(clause)
            tables.extend(found)

    def recursive_parse(s):
        i = 0
        current_segment = ""
        while i < len(s):
            if s[i] == '(':
                if current_segment:
                    parse_segment(current_segment)
                    current_segment = ""
                depth = 1
                j = i + 1
                while j < len(s) and depth > 0:
                    if s[j] == '(':
                        depth += 1
                    elif s[j] == ')':
                        depth -= 1
                    j += 1
                inner = s[i+1:j-1]
                recursive_parse(inner)
                i = j
            else:
                current_segment += s[i]
                i += 1
        if current_segment:
            parse_segment(current_segment)

    recursive_parse(sql)
    return tables

# Example SQL that contains comments and subqueries:
sql_string = """
    /* Multiline comment at the beginning
       that should be removed
       SELECT * FROM should_not_be_detected */
    SELECT 
        a.col1, 
        b.col2,
        CONCAT('This is a string with -- not a comment', a.col3) AS combined_col
    FROM schema1.table1 AS a, -- Single-line comment after the first table
         table2 AS b,
         ( 
             SELECT 
                 c.col1, 
                 c.col2 
             FROM schema2.table3 c 
             WHERE c.col4 = 'A value with /* not a comment */'
         ) AS subquery1,
         ( -- Subquery with nested function
             SELECT 
                 d.col1, 
                 COUNT(d.col2) 
             FROM (
                 SELECT * FROM table4 WHERE dummycol = 'Test -- should remain'
             ) d
             GROUP BY d.col1
         ) AS subquery2,
         (SELECT * FROM table5) AS subquery3
    WHERE a.id = b.id
      AND a.col1 IN (SELECT col FROM table6 WHERE col LIKE '%--notComment%')
      AND a.col2 = FUNCTION_CALL(a.col2, (SELECT MAX(value) FROM table7));
      
    -- Comment at the end that should also be removed
"""

clean_sql = remove_comments(sql_string)
print("SQL without comments:\n", clean_sql)
tables_found = extract_tables(sql_string)
tables_found = set(tables_found)
print("Found tables:", tables_found)

One advantage is that it can detect multiple table names listed after FROM and separated by commas, which happens quite often in the files I work with.

For a C# equivalent you could similarly:

  • Remove SQL comments using C# regex.
  • Use a regex similar to the one above with Regex.Matches to extract table names.

Hope it helps you, or others who might come across your question.

Comments

0

You can try this, but it doesn't work for all types of queries:

    public void Main()
    {
        // Requires: using System.IO; using System.Text.RegularExpressions;

        // Read the whole query file into a single string
        string text = File.ReadAllText(@"D:\ssis\queryfile.txt");

        // Remove square brackets so that [dbo].[table] becomes dbo.table
        string text1 = text.Replace("[", string.Empty).Replace("]", string.Empty);

        // Regex for extracting the table name after FROM and JOIN
        Regex r = new Regex(@"(?<=from|join)\s+(?<table>\S+)", RegexOptions.IgnoreCase | RegexOptions.Compiled);

        // Open the output file once (append mode) and write every match to it
        using (StreamWriter wr = new StreamWriter(@"D:\ssis\writefile.txt", true))
        {
            Match m = r.Match(text1);
            while (m.Success)
            {
                // Replace the . with " ," to get comma-separated values
                string text2 = m.Groups["table"].Value.Replace(".", " ,");
                wr.WriteLine(text2);
                m = m.NextMatch();
            }
        }
    }

Comments

-1
(from|join)\s(\w+)

2 Comments

Nope. That will match either just "from" or "join table_name", but not "from table_name". The pipe splits the entire pattern, not just the first part.
Hmm. So can I wrap from|join in parentheses or something?
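The precedence point raised in these comments can be checked with a short Python sketch. Without parentheses the alternatives are `from` and `join\s(\w+)`; with them, the alternation is confined to the keyword. The non-capturing group `(?:...)` is a choice made here so that group 1 is the table name; a capturing `(from|join)` works too, with the table moving to group 2.

```python
import re

text = "select * from t1 join t2 on t1.id = t2.id"

# Ungrouped: the alternatives are "from" and "join\s(\w+)"
ungrouped = re.compile(r"from|join\s(\w+)")
# Grouped: the alternation is confined to the keyword
grouped = re.compile(r"(?:from|join)\s(\w+)")

print([m.group(0) for m in ungrouped.finditer(text)])  # ['from', 'join t2']
print([m.group(1) for m in grouped.finditer(text)])    # ['t1', 't2']
```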
