I need to query 200+ tables in a database. Using a `spark.sql(f"select ...")` statement I get a `col(0)` column (because the query returns specific information about the column I've retrieved) and the result of the calculation for a particular table, like this:
| col(0) |
|---|
| 1 |
My goal is to have one CSV file with the table name and the result of the calculation for each table:
| Table name | Count |
|---|---|
| accounting | 3 |
| sales | 1 |
So far, the main part of my code is:
```python
list_tables = ['accounting', 'sales', ...]

for table in list_tables:
    df = spark.sql(
        f"""select distinct errors as counts from {database}.{table} where errors is not null""")
    df.repartition(1).write.mode("append").option("header", "true").csv(f"s3:.......")
    rename_part_file(dir, output, newdir)
```
I'm fairly new to PySpark and the structures involved. So far I'm confused, because I've heard that iterating over dataframes isn't the best idea.

With the code above I get only one CSV containing the most recent record, not the results for all the processed tables from my `list_tables`. I'm stuck: I don't know whether it's possible to pack all of it into one dataframe, or whether I should union the dataframes.
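If union is the route, is something like this roughly what I should be doing? (A minimal sketch of what I have in mind, reusing `spark`, `database`, and `list_tables` from above; `table_name` is a column I'd add myself and `s3://...` stands in for my real output path.)

```python
from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql.functions import lit

dfs = []
for table in list_tables:
    per_table = spark.sql(
        f"""select distinct errors as counts
            from {database}.{table}
            where errors is not null"""
    ).withColumn("table_name", lit(table))  # tag each row with its source table
    dfs.append(per_table)

# Combine all per-table results into a single dataframe, matching by column name
result = reduce(DataFrame.unionByName, dfs)

# Write once at the end, instead of appending inside the loop
result.repartition(1).write.mode("overwrite").option("header", "true").csv("s3://...")
```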