I have a text file containing a large number of queries. I want to get all the distinct tables used across all the queries in the file. A table name can come after FROM or JOIN. How can I extract them with a regex match? Can anyone suggest a regular expression that finds them?
7 Answers
It depends on the structure of your file. Try this:
(?<=from|join)(\s+\w+\b)
Also turn on the Multiline option if you have not split the file into an array (or something similar) of single-line strings. Also try turning on the IgnoreCase option.
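As a rough illustration of the pattern above, here is a Python sketch (the answers in this thread use .NET, so the port is an assumption). It treats the queries as one string, relies on `\s+` to cross line breaks, and collects the names into a set so that each table is reported once. The helper name `distinct_tables` and the sample queries are made up for the example.

```python
import re

# Same idea as the pattern above: a look-behind for FROM/JOIN, then the next word.
# Python look-behinds must be fixed width; "from" and "join" are both four characters.
TABLE_AFTER_KEYWORD = re.compile(r"(?<=from|join)\s+(\w+)\b", re.IGNORECASE)

def distinct_tables(sql_text):
    """Return the set of words that directly follow FROM or JOIN (hypothetical helper)."""
    return {m.group(1) for m in TABLE_AFTER_KEYWORD.finditer(sql_text)}

# Made-up sample: mixed case, and one table name on the line after its keyword.
queries = """SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id;
select name from
    customers;"""

print(sorted(distinct_tables(queries)))
```

Like the original pattern, this only picks up plain word-character names, so schema prefixes and quoted identifiers need a wider pattern.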
I'd use:
Regex r = new Regex(@"(from|join)\s+(?<table>\S+)", RegexOptions.IgnoreCase);
Once you have the Match object "m", you can get the table name with
m.Groups["table"].Value
Example:
string line = @"select * from tb_name join tb_name2 ON a=b WHERE x=y";
Regex r = new Regex(@"(from|join)\s+(?<table>\S+)",
    RegexOptions.IgnoreCase|RegexOptions.Compiled);
Match m = r.Match(line);
while (m.Success) {
    Console.WriteLine (m.Groups["table"].Value);
    m = m.NextMatch();
}
It will print:
tb_name
tb_name2
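One thing to watch with `\S+`: it takes everything up to the next whitespace, so a trailing `;` ends up in the captured name, and `from (select ...` yields `(select`. Below is a hedged Python sketch of the same named-group approach, with the capture narrowed to word characters and dots. The `strict` pattern and the sample line are assumptions for illustration, not part of the answer above.

```python
import re

# (?P<table>...) is Python's spelling of the .NET named group (?<table>...).
loose = re.compile(r"\b(?:from|join)\s+(?P<table>\S+)", re.IGNORECASE)
# Narrower capture: optional schema prefixes, then a plain identifier.
strict = re.compile(r"\b(?:from|join)\s+(?P<table>(?:\w+\.)*\w+)", re.IGNORECASE)

# Made-up sample with a trailing semicolon and a subquery after FROM.
line = "select * from tb_name join dbo.tb_name2; select 1 from (select 2) x"

loose_names = [m.group("table") for m in loose.finditer(line)]
strict_names = [m.group("table") for m in strict.finditer(line)]

print(loose_names)   # includes punctuation picked up by \S+
print(strict_names)  # identifiers only; the subquery is skipped
```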
Something like this maybe:
/(from|join)\s+(\w*\.)*(?<tablename>\w+)/
It won't match escaped table names though, and you need to make the regex evaluation case-insensitive.
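To make those two caveats concrete, here is a Python sketch that ignores case, adds a word boundary before the keyword, and also accepts identifiers quoted with backquotes, double quotes, or square brackets. The `IDENT` fragment and the helper name `table_names` are assumptions for illustration. Quoting rules differ between database systems, and quoted names that contain an escaped quote character are not handled.

```python
import re

# One identifier: `name`, "name", [name], or a bare word.
IDENT = r'(?:`[^`]+`|"[^"]+"|\[[^\]]+\]|\w+)'
# \b stops matches inside longer words; dotted prefixes (db.schema.) are optional.
TABLE_RE = re.compile(
    rf"\b(?:from|join)\s+(?:{IDENT}\s*\.\s*)*(?P<tablename>{IDENT})",
    re.IGNORECASE,
)

def table_names(sql):
    """Return the last part of each dotted name, with any quoting stripped."""
    return [m.group("tablename").strip('`"[]') for m in TABLE_RE.finditer(sql)]

# Made-up sample covering the three quoting styles.
sql = ('SELECT * FROM [dbo].[Order Items] oi '
       'JOIN `shop`.`users` u ON 1=1 '
       'JOIN "public"."logs" l ON 1=1')
print(table_names(sql))
```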
4 Comments
foobarfrom blah. And it doesn't take escaping into account, because that's a DBMS-specific thing: MySQL uses backquotes, PostgreSQL uses double quotes, T-SQL uses square brackets.
Solutions that can help you.
1. Extract table names with alias from an SQL statement with Regex
Regular expression
/(from|join|into)\s+([`]\w+.*[`] *\w+|(\[)\w+.*(\]) *\w+|\w*\.*\w+ *\w+)/g
2. Extract table names from an SQL statement with Regex
Regular expression
/(from|join|into)\s+([`]\w+.+\w+\s*[`]|(\[)\w+.+\w+\s*(\])|\w+\s*\.+\s*\w*|\w+\b)/g
Test string
-------------------------------------------------------
select * into [dbo].[table_temp]
from [dbo].[table_a] a inner join dbo.table_b b ...
join table_c c on ...
from dbo.table_d d ...
from `dbo`.`table_e` e ...
from table_f f ...
-------------------------------------------------------
Generated Code for C#
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
public static class QueryExtension
{
    public static List<string> GetTables(this string query)
    {
        List<string> tables = new List<string>();
        string pattern = @"(from|join|into)\s+([`]\w+.+\w+\s*[`]|(\[)\w+.+\w+\s*(\])|\w+\s*\.+\s*\w*|\w+\b)";
        foreach (Match m in Regex.Matches(query, pattern))
        {
            // Group 2 holds the table name that follows from/join/into.
            string name = m.Groups[2].Value;
            tables.Add(name);
        }
        return tables;
    }
    public static string Join(this IEnumerable<string> values, string separator) {
        return string.Join(separator, values);
    }
}
How to use it.
string input = @"select * into [dbo].[table_temp]
from [dbo].[table_a] a inner join dbo.table_b b ...
join table_c c on ...
from dbo.table_d d ...
from `dbo`.`table_e` e ...
from table_f f ...";
Console.WriteLine(input.GetTables().Join("\n"));
Output
[dbo].[table_temp]
[dbo].[table_a]
dbo.table_b
table_c
dbo.table_d
`dbo`.`table_e`
table_f
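A list like the one above can contain the same table more than once, and the same table may appear both quoted and unquoted. Since the question asks for distinct tables, one possible follow-up step is to normalize each name and drop repeats. The Python sketch below assumes the names have already been extracted and that identifiers compare case-insensitively, which is not true for every database. `normalize` and `distinct` are hypothetical helpers, and a quoted name that itself contains a dot would be split incorrectly.

```python
def normalize(name):
    """Strip [ ], backquotes and double quotes from each dotted part, then lowercase."""
    parts = [p.strip().strip('[]`"') for p in name.split(".")]
    return ".".join(parts).lower()

def distinct(names):
    """Return normalized names in first-seen order, without duplicates."""
    seen = {}
    for name in names:
        seen.setdefault(normalize(name), None)  # dicts preserve insertion order
    return list(seen)

# Made-up input: the same table written three different ways, plus a repeat.
extracted = ["[dbo].[table_a]", "dbo.table_b", "table_c", "`dbo`.`Table_A`", "dbo.table_b"]
print(distinct(extracted))
```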
3. Extract column names from an SQL statement with Regex
Regular expression
/(\w*\.*\w+|`\w*.*\w`|(\[)\w*.*(\]))+(,|\s+,|\s+FROM|\s+from)/g
Test string
-------------------------------------------------------
SELECT
[a].[column_1],
`b`.`column_2`,
c.column_3,
col4 as column_4,
col5 as `column_5`,
col6 as [column_6],
column_7,
a.col8 column_8,
(select max(column_x) from table_d where column_y in ('1','2','3')) as column_9
from table_a a
inner join table_b b on ...
inner join table_c c on ...
-------------------------------------------------------
Generated code for C#
public static class QueryExtension
{
    public static List<string> GetColumns(this string query)
    {
        List<string> columns = new List<string>();
        string pattern = @"(\w*\.*\w+|`\w*.*\w`|(\[)\w*.*(\]))+(,|\s+,|\s+FROM|\s+from)";
        foreach (Match m in Regex.Matches(query, pattern))
        {
            string name = m.Groups[1].Value;
            columns.Add(name);
        }
        return columns;
    }
    public static string Join(this IEnumerable<string> values, string separator) {
        return string.Join(separator, values);
    }
}
How to use it
string input1 = @"SELECT
[a].[column_1],
`b`.`column_2`,
c.column_3,
col4 as column_4,
col5 as `column_5`,
col6 as [column_6],
column_7,
a.col8 column_8,
(select max(column_x) from table_d where column_y in ('1','2','3')) as column_9
from table_a a
inner join table_b b on ...
inner join table_c c on ...
";
Console.WriteLine(input1.GetColumns().Join("\n"));
Output
[a].[column_1]
`b`.`column_2`
c.column_3
column_4
`column_5`
[column_6]
column_7
column_8
column_9
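A regex has trouble with commas nested inside subqueries and function calls. An alternative, sketched here in Python, is to split the select list only at top-level commas and then take the last token of each item as its output name. This is a different technique from the regex above, not a port of it. It assumes the text between SELECT and the outer FROM has already been isolated, and that aliases never contain spaces. `split_top_level` and `output_name` are hypothetical helpers.

```python
def split_top_level(text, sep=","):
    """Split on sep, ignoring separators inside parentheses or single-quoted strings."""
    parts, depth, in_string, current = [], 0, False, []
    for ch in text:
        if ch == "'":
            in_string = not in_string
        elif not in_string:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif ch == sep and depth == 0:
                # Top-level separator: close the current item and start a new one.
                parts.append("".join(current).strip())
                current = []
                continue
        current.append(ch)
    parts.append("".join(current).strip())
    return parts

def output_name(item):
    """Last whitespace-separated token: the alias if present, else the column itself."""
    return item.split()[-1]

# A shortened version of the select list from the test string above.
select_list = """[a].[column_1], c.column_3, col4 as column_4, a.col8 column_8,
(select max(column_x) from table_d where column_y in ('1','2','3')) as column_9"""

print([output_name(item) for item in split_top_level(select_list)])
```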
I am working on the same problem and came across your question. My solution is in Python, however you might find it useful and might want to transfer my code to C#.
The idea is to first remove any comments and then extract table names by matching against the FROM and JOIN clauses using regular expressions.
My approach:
- Remove Comments: Remove both multiline (/* */) and single-line (--) comments to avoid false positives.
- Extract Tables: Use regular expressions to locate table names following FROM and JOIN. My solution also considers nested queries and subqueries, capturing tables at multiple levels.
import re

def remove_comments(sql):
    """
    Removes multiline (/* ... */) and single-line (--) comments from the SQL string.
    """
    # Remove multiline comments (DOTALL: . also matches newline characters)
    sql_no_multiline = re.sub(r'/\*.*?\*/', '', sql, flags=re.DOTALL)
    # Remove single-line comments (MULTILINE: ^ and $ match beginning and end of line)
    sql_no_comments = re.sub(r'--.*?$', '', sql_no_multiline, flags=re.MULTILINE)
    return sql_no_comments

def extract_tables(sql):
    """
    Extracts all table names from an SQL string, including those from subqueries.
    Comments are removed first.
    """
    # First, remove comments
    sql = remove_comments(sql)
    tables = []

    def parse_segment(segment):
        # Regex to capture the FROM clause up to a possible terminating SQL keyword
        from_pattern = re.compile(
            r"(?i)\bFROM\b\s+((?:(?!\b(?:WHERE|GROUP\s+BY|ORDER\s+BY|HAVING|JOIN|UNION|INTERSECT|EXCEPT)\b).)+)",
            re.DOTALL
        )
        for match in from_pattern.finditer(segment):
            clause = match.group(1)
            # Within the FROM clause: Extract all table names, even if separated by commas
            table_pattern = re.compile(r"(?i)(?:^|,)\s*(?:(?:\w+\.)?(\w+))(?:@\w+)?\b")
            found = table_pattern.findall(clause)
            tables.extend(found)
        # The FROM pattern stops at JOIN, so also collect the table named directly after each JOIN
        join_pattern = re.compile(r"(?i)\bJOIN\b\s+(?:\w+\.)?(\w+)")
        tables.extend(join_pattern.findall(segment))

    def recursive_parse(s):
        i = 0
        current_segment = ""
        while i < len(s):
            if s[i] == '(':
                # Parse the text collected so far, then descend into the parenthesized part
                if current_segment:
                    parse_segment(current_segment)
                    current_segment = ""
                depth = 1
                j = i + 1
                # Find the matching closing parenthesis
                while j < len(s) and depth > 0:
                    if s[j] == '(':
                        depth += 1
                    elif s[j] == ')':
                        depth -= 1
                    j += 1
                inner = s[i+1:j-1]
                recursive_parse(inner)
                i = j
            else:
                current_segment += s[i]
                i += 1
        if current_segment:
            parse_segment(current_segment)

    recursive_parse(sql)
    return tables

# Example SQL that contains comments and subqueries:
sql_string = """
/* Multiline comment at the beginning
   that should be removed
   SELECT * FROM should_not_be_detected */
SELECT
    a.col1,
    b.col2,
    CONCAT('This is a string with -- not a comment', a.col3) AS combined_col
FROM schema1.table1 AS a, -- Single-line comment after the first table
    table2 AS b,
    (
        SELECT
            c.col1,
            c.col2
        FROM schema2.table3 c
        WHERE c.col4 = 'A value with /* not a comment */'
    ) AS subquery1,
    ( -- Subquery with nested function
        SELECT
            d.col1,
            COUNT(d.col2)
        FROM (
            SELECT * FROM table4 WHERE dummycol = 'Test -- should remain'
        ) d
        GROUP BY d.col1
    ) AS subquery2,
    (SELECT * FROM table5) AS subquery3
WHERE a.id = b.id
    AND a.col1 IN (SELECT col FROM table6 WHERE col LIKE '%--notComment%')
    AND a.col2 = FUNCTION_CALL(a.col2, (SELECT MAX(value) FROM table7));
-- Comment at the end that should also be removed
"""
clean_sql = remove_comments(sql_string)
print("SQL without comments:\n", clean_sql)
tables_found = extract_tables(sql_string)
tables_found = set(tables_found)
print("Found tables:", tables_found)
One advantage is that it can detect multiple table names listed after FROM and separated by commas, which happens quite often in the files I work with.
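One caveat with `remove_comments` above: it also strips `--` and `/* */` sequences that occur inside string literals, such as the `'Test -- should remain'` value in the example. A possible refinement, sketched below, matches single-quoted literals and comments in one pass and only drops the comments. It assumes standard SQL quoting, where a quote inside a literal is written as `''`. The name `remove_comments_keep_strings` is made up for this example.

```python
import re

# Alternation order matters: a literal is consumed whole before any comment
# marker inside it can be seen.
_TOKEN = re.compile(
    r"""('(?:[^']|'')*')    # group 1: single-quoted string literal
      | /\*.*?\*/           # block comment
      | --[^\n]*            # line comment, up to end of line
    """,
    re.DOTALL | re.VERBOSE,
)

def remove_comments_keep_strings(sql):
    """Drop SQL comments but leave string literals untouched."""
    # Keep literals as-is; replace each comment with a space so tokens stay separated.
    return _TOKEN.sub(lambda m: m.group(1) if m.group(1) is not None else " ", sql)

# Made-up sample: comment markers inside literals, plus two real comments.
sample = "SELECT 'a -- b', '/* c */' FROM t1 -- trailing\n/* block\n FROM t2 */ JOIN t3"
print(remove_comments_keep_strings(sample))
```

The result can be passed on to the table-extraction step in place of the output of `remove_comments`.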
For a C# equivalent you could similarly:
- Remove SQL comments using C# regex.
- Use a regex similar to the one above with Regex.Matches to extract table names.
Hope it helps you, or others who might come across your question.
You can try this, but it doesn't work for all types of query:
public void Main()
{
    string text;
    // Read the whole query file into a single string.
    using (StreamReader sr = new StreamReader(@"D:\ssis\queryfile.txt"))
    {
        text = sr.ReadToEnd();
    }
    // Remove square brackets so that [dbo].[table] becomes dbo.table.
    string text1 = text.Replace("[", string.Empty).Replace("]", string.Empty);
    // Regex for extracting the table name after from and join.
    Regex r = new Regex(@"(?<=from|join)\s+(?<table>\S+)", RegexOptions.IgnoreCase | RegexOptions.Compiled);
    // Open the output file once (append mode) instead of once per match.
    using (StreamWriter wr = new StreamWriter(@"D:\ssis\writefile.txt", true))
    {
        for (Match m = r.Match(text1); m.Success; m = m.NextMatch())
        {
            // Use the named group so the leading whitespace is not written out.
            string v = m.Groups["table"].Value;
            // Replace the . with comma-separated values.
            wr.WriteLine(v.Replace(".", " ,"));
        }
    }
}