I have written the Python function below. It works, but a lot of it looks redundant to me, so I would like to improve it by following best coding guidelines.
Depending on the spark_flag parameter, which is an input to the function and False by default, the function builds either a gcloud command or a plain bash command to execute a Python script.
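For illustration, the two shapes of command I am after are roughly as follows (argument list abbreviated as <args>; cluster, region, and app name are filled in from the inputs):

    # spark_flag=True: submit the script as a Dataproc PySpark job
    gcloud dataproc jobs submit pyspark --cluster=<cluster> --region=<region> --id <app_name> --properties ... /usr/local/airflow/dags/batch_ingestion.py -- <args>

    # spark_flag=False: run the script directly on the host
    nohup /usr/local/airflow/dags/batch_ingestion.py -- <args>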
def build_command(table_info1, table_info2, date_folder, timestamp, spark_flag):
    try:
        run_cmd_str = " nohup /usr/local/airflow/dags/batch_ingestion.py -- "
        if table_info1[0] == 'db2':
            app_name = ("data-pipeline-" + table_info1[0] + "-" + table_info1[5] + "-"
                        + table_info1[6] + "-" + timestamp + table_info2[10])
            if spark_flag:
                cmd_str = "gcloud dataproc jobs submit pyspark --cluster={} --region={} --id {} --properties spark.submit.deployMode=cluster,spark.driver.memory=512m,spark.executor.memory=512m,spark.executor.cores=1,spark.executor.instances=1 --jars /usr/local/airflow/dags/batch_ingestion.py -- ".format(
                    table_info1[10], table_info1[11], app_name)
            else:
                cmd_str = run_cmd_str
        elif table_info1[0] == 'sql_server' or table_info1[0] == 'azure_sql':
            if '.' in table_info1[6]:
                table = table_info1[6].split('.')
                app_name = ("data-pipeline-" + table_info1[0] + "-" + table_info1[5] + "-"
                            + table[0] + "_" + table[1] + "-" + timestamp + table_info2[10])
            else:
                app_name = ("data-pipeline-" + table_info1[0] + "-" + table_info1[5] + "-"
                            + table_info1[6] + "-" + timestamp + table_info2[10])
            if spark_flag:
                cmd_str = "gcloud dataproc jobs submit pyspark --cluster={} --region={} --id {} --properties spark.submit.deployMode=cluster,spark.driver.memory=512m,spark.executor.memory=512m,spark.executor.cores=1,spark.executor.instances=1 --jars /usr/local/airflow/dags/batch_ingestion.py -- ".format(
                    table_info1[10], table_info1[11], app_name)
            else:
                cmd_str = run_cmd_str
        elif table_info1[0] == 'abc_informix' or table_info1[0] == 'def_informix':
            if table_info1[7] != '-1':
                app_name = ("data-pipeline-" + table_info1[0] + "-" + table_info1[5] + "-"
                            + table_info1[6] + "-" + table_info1[7] + "-" + timestamp + table_info2[10])
            elif table_info1[7] == '-1' and table_info1[0] == 'def_informix':
                app_name = ("data-pipeline-" + table_info1[0] + "-" + table_info1[5] + "-"
                            + table_info1[6] + "-" + timestamp + table_info2[10])
            if spark_flag:
                cmd_str = "gcloud dataproc jobs submit pyspark --cluster={} --region={} --id {} --properties spark.submit.deployMode=cluster,spark.driver.memory=512m,spark.executor.memory=512m,spark.executor.cores=1,spark.executor.instances=1 /usr/local/airflow/dags/batch_ingestion.py -- ".format(
                    table_info1[10], table_info1[11], app_name)
            else:
                cmd_str = run_cmd_str
        last_run_dated = str(table_info2[1]).split(None, 1)[0]
        cmd_string = " ".join(
            [cmd_str, table_info1[0], table_info1[5], table_info1[6], table_info1[7], table_info1[1], table_info1[2],
             table_info1[3], table_info1[8], table_info1[4], table_info2[0], last_run_dated, table_info2[2],
             date_folder, table_info2[5], table_info2[6], table_info2[7], table_info2[3], table_info2[4],
             table_info2[9]])
        return cmd_string, app_name
    except Exception as e:
        print(e)
        raise
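For reference, this is the rough direction of deduplication I had in mind; a sketch only, not a final design. The helper names are my own, the positional index meanings are assumed from the function above, and I have used the plain form of the gcloud command throughout, whereas the original passes the script via --jars in the db2 and sql_server branches. Note this sketch also produces an app_name for abc_informix when table_info1[7] == '-1', a case the original leaves unset.

GCLOUD_TEMPLATE = (
    "gcloud dataproc jobs submit pyspark --cluster={cluster} --region={region} "
    "--id {app_name} --properties spark.submit.deployMode=cluster,"
    "spark.driver.memory=512m,spark.executor.memory=512m,"
    "spark.executor.cores=1,spark.executor.instances=1 "
    "/usr/local/airflow/dags/batch_ingestion.py -- "
)
BASH_CMD = " nohup /usr/local/airflow/dags/batch_ingestion.py -- "

def make_app_name(table_info1, table_info2, timestamp):
    # Only the middle parts of the job id vary per source type.
    db_type, schema, table, extra = (table_info1[0], table_info1[5],
                                     table_info1[6], table_info1[7])
    parts = ["data-pipeline", db_type, schema]
    if db_type in ('sql_server', 'azure_sql') and '.' in table:
        parts.append(table.replace('.', '_'))  # schema-qualified table name
    else:
        parts.append(table)
    if db_type in ('abc_informix', 'def_informix') and extra != '-1':
        parts.append(extra)
    return "-".join(parts) + "-" + timestamp + table_info2[10]

def make_cmd_prefix(table_info1, app_name, spark_flag):
    # One template per execution mode instead of one per source type.
    if spark_flag:
        return GCLOUD_TEMPLATE.format(cluster=table_info1[10],
                                      region=table_info1[11],
                                      app_name=app_name)
    return BASH_CMD

With those two helpers, build_command would reduce to computing app_name once, picking the command prefix, and joining the argument list, so the three per-database branches disappear. Does this look like the right approach, or is there a better pattern?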