I need to parse some information from a SQL DDL statement using regex. The SQL statement looks like this:
CREATE TABLE default.table1 (DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT)
USING parquet
OPTIONS (
serialization.format '1'
)
PARTITIONED BY (DATA2, DATA3)
I need to parse it in Python and pull out the columns named in the PARTITIONED BY clause. I've figured out a regex to achieve it after removing newline characters but I can't get it to work if there are newline chars in there. here is some demo code:
import re
def print_partition_columns_if_found(ddl_string):
regex = r'CREATE +?(TEMPORARY +)?TABLE *(?P<db>.*?\.)?(?P<table>.*?)\((?P<col>.*?)\).*?USING +([^\s]+)( +OPTIONS *\([^)]+\))?( *PARTITIONED BY \((?P<pcol>.*?)\))?'
match = re.search(regex, ddl_string, re.MULTILINE | re.DOTALL)
if match.group("pcol"):
print match.group("pcol").strip()
else:
print 'did not find any pcols in {matches}'.format(matches=match.groups())
ddl_string1 = """
CREATE TABLE default.table1 (DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT)
USING parquet OPTIONS (serialization.format '1') PARTITIONED BY (DATA2, DATA3)"""
print_partition_columns_if_found(ddl_string1)
print "--------"
ddl_string2 = """
CREATE TABLE default.table1 (DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT)
USING parquet
OPTIONS (
serialization.format '1'
)
PARTITIONED BY (DATA2, DATA3)
"""
print_partition_columns_if_found(ddl_string2)
That returns:
DATA2, DATA3
--------
did not find any pcols in (None, 'default.', 'table1 ', 'DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT', 'parquet', None, None, None)
Any regex experts care to help me out?