1

I need to parse some information from a SQL DDL statement using regex. The SQL statement looks like this:

CREATE TABLE default.table1 (DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT)
USING parquet
OPTIONS (
  serialization.format '1'
)
PARTITIONED BY (DATA2, DATA3)

I need to parse it in Python and pull out the columns named in the PARTITIONED BY clause. I've figured out a regex to achieve it after removing newline characters but I can't get it to work if there are newline chars in there. here is some demo code:

import re
def print_partition_columns_if_found(ddl_string):
    regex = r'CREATE +?(TEMPORARY +)?TABLE *(?P<db>.*?\.)?(?P<table>.*?)\((?P<col>.*?)\).*?USING +([^\s]+)( +OPTIONS *\([^)]+\))?( *PARTITIONED BY \((?P<pcol>.*?)\))?'
    match = re.search(regex, ddl_string, re.MULTILINE | re.DOTALL)
    if match.group("pcol"):
        print match.group("pcol").strip()
    else:
        print 'did not find any pcols in {matches}'.format(matches=match.groups())        


ddl_string1 = """
CREATE TABLE default.table1 (DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT)
USING parquet OPTIONS (serialization.format '1') PARTITIONED BY (DATA2, DATA3)"""
print_partition_columns_if_found(ddl_string1)

print "--------"

ddl_string2 = """
CREATE TABLE default.table1 (DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT)
USING parquet
OPTIONS (
  serialization.format '1'
)
PARTITIONED BY (DATA2, DATA3)
"""
print_partition_columns_if_found(ddl_string2)

That returns:

DATA2, DATA3
--------
did not find any pcols in (None, 'default.', 'table1 ', 'DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT', 'parquet', None, None, None)

Any regex experts care to help me out?

1 Answer 1

3

Lets check python sqlparse documentation: Documentation - getting started

>>> import sqlparse
>>> ddl_string2 = """
... CREATE TABLE default.table1 (DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT)
... USING parquet
... OPTIONS (
...   serialization.format '1'
... )
... PARTITIONED BY (DATA2, DATA3)
... """
>>> ddl_string1 = """
... CREATE TABLE default.table1 (DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT)
... USING parquet OPTIONS (serialization.format '1') PARTITIONED BY (DATA2, DATA3)"""
>>> def print_partition_columns_if_found(sql):
...     parse = sqlparse.parse(sql)
...     data = next(item for item in reversed(parse[0].tokens) if item.ttype is None)[1]
...     print(data)
...
>>> print_partition_columns_if_found(ddl_string1)
DATA2, DATA3
>>> print_partition_columns_if_found(ddl_string2)
DATA2, DATA3
>>>
Sign up to request clarification or add additional context in comments.

1 Comment

Interesting, I'll take a deeper look. Thanks.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.