2

In the following code I need to get table names as tokens using SQL reserved words as separators. The tokens should contain the table name or the table name followed by a suffix.

For example, given

table1 t1 inner join table2 t2 outer join table3

The code should return three tokens:

Token 1: table1 t1
Token 2: table2 t2 
Token 3: table3

This code instead uses the first reserved word as token, without discarding any other following reserved words:

    String str = "table1 t1 inner join table2 t2 outer join table3";
    String [] tokens = sql2.split("\\son\\s|\\sinner\\s|\\souter\\s|\\sjoin\\s");
    
    for (int i = 0; i<tokens.length; i++)
         System.out.println("Token "+(i+1)+":"+tokens[i]);

This returns:

Token 1:table1 t1
Token 2:join table2 t2 
Token 3:join table3

What is the problem and how to make this work?

6
  • This is probably going to be more complex than it seems. Regular expressions are not going to be powerful enough for the job. What should happen if you see ...inner join table1 on table1.value = table2.value and table1.value = 'inner value'. At any rate, you're not going to want to split, because you don't want every word that isn't a reserved word, you just want the word that follows it. This might require a library, but I'm afraid I don't have on to recommend to you. Commented Jul 31, 2020 at 21:02
  • You are right, anything that is not a table, including things such as table1.column1 should be omitted Commented Jul 31, 2020 at 21:06
  • stackoverflow.com/questions/16616034/… Commented Jul 31, 2020 at 21:06
  • Using the JDBC result set didn’t work for me Commented Jul 31, 2020 at 21:07
  • 1
    It is a lot harder than you think. I used ANTLR4 a few years back to analyse static SQL from a DB2 catalog, and you need a proper SQL grammar to generate a parser. After which, it isn't too bad... Commented Jul 31, 2020 at 21:17

4 Answers 4

2

This works for the general case for a series of joined tables:

String[] tableNames = str.split(" (?=inner|outer|left|join|right|cross|full).*?join ");
Sign up to request clarification or add additional context in comments.

2 Comments

Wow! This is much simpler than my solution and yet produces the result correctly!
@Stephen @mypetition first I will remove any table1.column1 and as, then I will run this regex.
1

You can use (?:\w+\s+){1,2}(?=inner|outer join)|(?<=inner join)(?:\s+\w+){1,2}|(?<=outer join)(?:\s+\w+){1,2} as the regex.

Demo:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        // Test strings
        String[] arr = { "table1 t1 inner join table2 t2 outer join table3",
                "table1 t1 inner join table2 t2 outer join table3 t3 inner join table2" };
        Pattern pattern = Pattern.compile(
                "(?:\\w+\\s+){1,2}(?=inner|outer join)|(?<=inner join)(?:\\s+\\w+){1,2}|(?<=outer join)(?:\\s+\\w+){1,2}");
        for (String s : arr) {
            System.out.println("Processing: " + s);
            Matcher matcher = pattern.matcher(s);
            while (matcher.find()) {
                System.out.println(matcher.group().trim());
            }
        }
    }
}

Output:

Processing: table1 t1 inner join table2 t2 outer join table3
table1 t1
table2 t2
table3
Processing: table1 t1 inner join table2 t2 outer join table3 t3 inner join table2
table1 t1
table2 t2
table3 t3
table2

2 Comments

Fails in this case: regex101.com/r/YfJah6/1
@mypetlion - You've made a mistake in your string. As per the requirement, join table3 should be join table3 t3. I've extended the regex from my original answer to cover your string (once you correct it) as well.
0

I found this question that is essentially the same as yours.

You can mostly solve this problem by going through a regex nightmare, or you can try to use an external library like Zql which actually parses the SQL statement for you.

Comments

0

I've tested your pattern (unescaped, it's \son\s|\sinner\s|\souter\s|\sjoin\s) against this test string: table1 t1 inner join table2 t2 outer join table3 on regex101.com and the only match I got is for inner and outer. So since you're splitting the string by these tokens, you get your result.

Perhaps this can help you for your specific case. I have went for a regex approach, instead of splitting the data.

public class PatternChecker {

    public static void main(String[] args) {
        String str = "table1 t1 inner join table2 t2 outer join table3";
        Pattern p = Pattern.compile("(table[0-9]+( [a-zA-Z0-9]+ )?)");
        Matcher m = p.matcher(str);

        while(m.find()) {
            System.out.println(m.group(0));
        }
    }

}

Later edit

The split pattern \\son\\s|\\sinner\\s|\\souter\\s|\\sjoin\\s did not work because of the mandatory whitespaces used. For instance, you are searching for *on* or *inner* or *outer* or *join* (whitespaces are marked with an asterisk). The whitespaces are part of the keywords you're splitting with. *join* could not be matched since its left-side whitespace was already picked up by the *outer* and *inner* right-side whitespace matches.

Going back to the split solution, one fix would be to mark the left-side whitespace of join as optional via the ? quantifier; this would be the new pattern: \\son\\s|\\sinner\\s|\\souter\\s|\\s?join\\s. This yields some empty tokens that can be filtered out

Another idea would be to consider aggregations using join (i.e. inner join, outer join) as full search criteria, which would lead to \\son\\s|\\sinner join\\s|\\souter join\\s. No empty tokens are generated.

public class PatternChecker {

    public static void main(String[] args) {
        String str = "employee t1 inner join department t2 outer join job join table4 history on a=b";

        String[] tokens = str.split("\\son\\s|\\sinner join\\s|\\souter join\\s|\\sjoin\\s");

        for(String token : tokens) {
            System.out.println(token);
        }

        // Output
        // employee t1
        // department t2
        // job
        // table4 history
        // a=b

    }

}

Note that, since you're also including on, you can filter out all the matched tokens containing the equals symbol.

For a generic fix, you would need to isolate the string contained between from and where and apply the idea above.


6 Comments

Wouldn't this only work if the table name starts with the word table? I think OP is trying to determine what words are table identifiers based on their context in the SQL statement.
@Badr Of course, that is why I have specified that this works for the specific case OP was talking about. I will try to think of a solution covering any table name and update my answer.
Fails on this case: regex101.com/r/Sr6GzI/1
@xsylver - You've done a commendable job. Do not get demotivated by the d o w n v o t e s. Keep it up!
@Arvind Thank you for the kind words :)
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.