I have a dataframe with about 200 million rows. An example of the dataframe looks like this:
    date        query
    29-03-2019  SELECT * FROM table WHERE ..
    30-03-2019  SELECT * FROM ... JOIN ... ON ...WHERE ..
    ....        ....
    20-05-2019  SELECT ...
I have a function that extracts the table name(s) and attribute name(s) from the dataframe above and appends them to a new dataframe.
    import re

    import sqlparse
    from sqlparse.tokens import Keyword, DML

    # isSelect and getWord are helper functions that are not shown in this post.

    def getTableName(sql):
        def getTableKey(parsed):
            findFrom = False
            wordKey = ['FROM', 'JOIN', 'LEFT JOIN', 'INNER JOIN', 'RIGHT JOIN', 'OUTER JOIN', 'FULL JOIN']
            for word in parsed.tokens:
                # descend into grouped tokens (parentheses, identifier lists, ...)
                if word.is_group:
                    for f in getTableKey(word):
                        yield f
                if findFrom:
                    if isSelect(word):
                        # sub-select: look for table names inside it
                        for f in getTableKey(word):
                            yield f
                    elif word.ttype is Keyword:
                        # a keyword ends the current list of table names
                        findFrom = False
                    else:
                        yield word
                if word.ttype is Keyword and word.value.upper() in wordKey:
                    findFrom = True

        tableName = []
        query = sqlparse.parse(sql)
        for word in query:
            if word.get_type() != 'UNKNOWN':
                stream = getTableKey(word)
                table = set(getWord(stream))
                for item in table:
                    # drop the leading qualifier, e.g. "schema.table" -> "table"
                    tabl = re.sub(r'^.+?(?<=[.])', '', item)
                    tableName.append(tabl)
        return tableName
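
To show what the `re.sub` call at the end of `getTableName` does to a qualified name, here is a small standalone sketch. The helper name `strip_qualifier` and the sample table names are made up for illustration; only the pattern itself is taken from the code above.

```python
import re

def strip_qualifier(name):
    # Same pattern as in getTableName: lazily match from the start of the
    # string up to and including the first dot, then remove that prefix.
    return re.sub(r'^.+?(?<=[.])', '', name)

print(strip_qualifier('schema.orders'))     # orders
print(strip_qualifier('orders'))            # orders (no dot, nothing removed)
print(strip_qualifier('db.schema.orders'))  # schema.orders
```

Because the match is lazy and anchored at the start, only the first qualifier is removed, so a name with two qualifiers keeps the second one.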
The function to get the attributes is just like `getTableName`; the only difference is the `wordKey` list.
The function that processes the dataframe looks like this:
    import pandas as pd

    def getTableAttribute(dataFrame, queryCol, date):
        tableName = []
        attributeName = []
        df = pd.DataFrame()
        for row in dataFrame[queryCol]:
            table = getTableName(row)
            tableJoin = getJoinTable(row)
            attribute = getAttribute(row)
            # append into list
            tableName.append(table + tableJoin)
            attributeName.append(attribute)
        df = dataFrame[[date]].copy()
        df['tableName'] = tableName
        df['attributeName'] = attributeName
        print('Done')
        return df
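
To make the row-by-row flow concrete without needing pandas or sqlparse, here is a minimal sketch of the same loop over plain `(date, query)` pairs. The functions `fake_tables`, `fake_join_tables` and `fake_attributes` are hypothetical stand-ins for `getTableName`, `getJoinTable` and `getAttribute`; they use naive regexes only so the example runs on its own, and the sample queries are invented.

```python
import re

# Hypothetical stand-ins for getTableName / getJoinTable / getAttribute.
# They are NOT the real sqlparse-based functions, only simple placeholders.
def fake_tables(sql):
    return re.findall(r'\bFROM\s+(\w+)', sql, flags=re.IGNORECASE)

def fake_join_tables(sql):
    return re.findall(r'\bJOIN\s+(\w+)', sql, flags=re.IGNORECASE)

def fake_attributes(sql):
    match = re.search(r'\bSELECT\s+(.*?)\s+FROM\b', sql, flags=re.IGNORECASE)
    return [a.strip() for a in match.group(1).split(',')] if match else []

def collect(rows):
    """rows: iterable of (date, query) pairs; returns one dict per row."""
    result = []
    for date, query in rows:
        result.append({
            'date': date,
            # same concatenation as table + tableJoin in getTableAttribute
            'tableName': fake_tables(query) + fake_join_tables(query),
            'attributeName': fake_attributes(query),
        })
    return result

rows = [
    ('29-03-2019', 'SELECT a, b FROM t1 WHERE a > 1'),
    ('30-03-2019', 'SELECT x FROM t1 JOIN t2 ON t1.id = t2.id WHERE x = 0'),
]
out = collect(rows)
for item in out:
    print(item)
```

Each input row produces exactly one output row, and the extractors are called once per row, which is the part of the real function that does the work for every one of the 200 million queries.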
The result of the function is like this:
    date        tableName  attributeName
    29-03-2019  tableN     attributeM
    30-03-2019  tableA     attributeB
    ....        ...        ...
    20-05-2019  tableF     attributeG
As this is my first attempt, I would like feedback on what I've tried, because my code runs slowly on large files.