
I have the following code block:

from jira import JIRA
import pandas as pd

cert_path = 'C:\\cert.crt'

start_date = '2020-10-01'
end_date = '2020-10-31'

a_session = JIRA(server='https://jira.myinstance-A.com', options={'verify': cert_path}, kerberos=True)

b_session = JIRA(server='https://jira.myinstance-B.com', options={'verify': cert_path}, kerberos=True)

c_session = JIRA(server='https://jira.myinstance-C.com', options={'verify': cert_path}, kerberos=True)



query_1 = 'project = "Test Project 1" and issuetype = Incident and resolution = Resolved and updated >= {} and updated <= {}'.format(start_date, end_date)

query_2 = 'project = "Test Project 2" and issuetype = Incident and resolution = Resolved and updated >= {} and updated <= {}'.format(start_date, end_date)

query_3 = 'project = "Test Project 3" and issuetype = Defect and resolution = Resolved and releasedate >= {} and releasedate <= {}'.format(start_date, end_date)

query_4 = 'project = "Test Project 4" and issuetype = Enhancement and resolution = Done and completed >= {} and completed <= {}'.format(start_date, end_date)

block_size = 100
block_num = 0

all_issues = []
while True:
    start = block_num * block_size
    issues = a_session.search_issues(query_1, start, block_size)
    if len(issues) == 0:
        break
    block_num += 1
    for issue in issues:
        all_issues.append(issue)

issues = pd.DataFrame()

for issue in all_issues:
    d = {
        'key' : issue.key,
        'type' : issue.fields.type,
        'creator' : issue.fields.creator,
        'resolution' : issue.fields.resolution
    }

    issues = issues.append(d, ignore_index=True)

This code runs fine and allows me to:

  1. retrieve data associated with only query_1 (which connects to a_session)
  2. save that data into a Pandas dataframe

Now, I would like to be able to:

a. retrieve the data associated with query_2 (which also connects to a_session) and save it to the issues dataframe

b. retrieve the data associated with query_3 (which connects to b_session) and save it to the issues dataframe

c. retrieve the data associated with query_4 (which connects to c_session) and save it to the issues dataframe

Notice that the structure of query_3 and query_4 is different from that of query_1 and query_2 (the field names are different, among other things).

I could write one GIANT script (which would probably work). But, I'm sure there is a more elegant way of approaching this (perhaps with a nested loop).

What's the best way of adapting this code block so that it handles cases a, b, and c above?

Any help would be much appreciated by this Python novice! Thanks in advance!



UPDATE 1:

I used the (very elegant) solution suggested by @Nick ODell. The code runs fine, but for whatever reason, I get a None result. I spent the past few hours trying to debug this and my leading theory is that the field names are not passed (as they are in d in the original code block I posted).

I tried to amend the get_all_issues function as follows:

def get_all_issues(session, query):
    start = 0
    all_issues = []
    while True:
        issues = session.search_issues(query, start, block_size)
        if len(issues) == 0:
            # No more issues
            break
        start += len(issues)
        for issue in issues:
            all_issues.append(issue)

    issues = pd.DataFrame

    for issue in all_issues:
        d = {
            'key' : issue.key,
            'type' : issue.fields.type,
            'creator' : issue.fields.creator,
            'resolution' : issue.fields.resolution
             }

    issues = issues.append(d, ignore_index=True)

But, now there is an error message saying:

ValueError: All objects passed were None.

How would we amend the get_all_issues() function so that we can nest the following for loop and pass in the field names, as follows?

for issue in all_issues:
    d = {
        'key' : issue.key,
        'type' : issue.fields.type,
        'creator' : issue.fields.creator,
        'resolution' : issue.fields.resolution
    }

    issues = issues.append(d, ignore_index=True)


UPDATE 2:

Instead of using pd.json_normalize(issues), I used pd.DataFrame(issues) and added a dictionary of field names. The following code works **because all fields exist in a_session, b_session, and c_session**:

def get_all_issues(session, query):

    block_size = 50
    block_num = 0
    
    start = 0
    all_issues = []
    while True:
        issues = session.search_issues(query, start, block_size)
        if len(issues) == 0:
            # No more issues
            break
        start += len(issues)
        for issue in issues:
            all_issues.append(issue)

    issues = pd.DataFrame(issues)

    for issue in all_issues:
        d = {
            'key' : issue.key,
            'type' : issue.fields.type,
            'creator' : issue.fields.creator,
            'resolution' : issue.fields.resolution
             }

        issues = issues.append(d, ignore_index=True)

    return issues

Then, I added 3 new custom fields to the dictionary:

    for issue in all_issues:
        d = {
            'key' : issue.key,
            'type' : issue.fields.type,
            'creator' : issue.fields.creator,
            'resolution' : issue.fields.resolution,
            'system_change' : issue.fields.customfield_123,
            'system_resources' : issue.fields.customfield_456,
            'system_backup' : issue.fields.customfield_789
             }

Custom field 123 exists in a_session and b_session, but not in c_session. Custom field 456 exists only in c_session. And, custom field 789 exists in b_session and c_session.

Doing so results in the following error: AttributeError: type object 'PropertyHolder' has no attribute 'customfield_123'.

Can anyone suggest an elegant solution to handle this? (i.e. the ability to have a dictionary with any number of fields, and the code 'understands' which fields relate to a given session) Thanks!
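In case it helps to see what I'm imagining: one idea I'm experimenting with is Python's getattr() with a default, which returns None instead of raising AttributeError when a field is missing. A minimal sketch (the SimpleNamespace is a made-up stand-in for issue.fields, with customfield_456 deliberately absent):

```python
from types import SimpleNamespace

# Made-up stand-in for issue.fields; customfield_456 is deliberately missing,
# mimicking a custom field that doesn't exist on a given Jira instance
fields = SimpleNamespace(customfield_123='changed', customfield_789='nightly')

# getattr with a default of None never raises AttributeError
d = {
    'system_change': getattr(fields, 'customfield_123', None),
    'system_resources': getattr(fields, 'customfield_456', None),
    'system_backup': getattr(fields, 'customfield_789', None),
}
print(d)
# {'system_change': 'changed', 'system_resources': None, 'system_backup': 'nightly'}
```

The missing field would then just come through as None (NaN once it reaches the dataframe) rather than crashing.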

2 Answers


Here's how I would approach your problem. I don't have a Jira instance to test against, so this code is untested.

First, define a function to fetch all issues from a given session for a given query:

def get_all_issues(session, query):
    start = 0
    all_issues = []
    while True:
        issues = session.search_issues(query, start, block_size)
        if len(issues) == 0:
            # No more issues
            break
        start += len(issues)
        for issue in issues:
            all_issues.append(issue)
    # Flatten JSON
    # This idea is from
    # https://levelup.gitconnected.com/jira-api-with-python-and-pandas-c1226fd41219
    return pd.json_normalize(issues)

Next, I would make a list of queries, and the corresponding backend.

queries = [
    (a_session, query_1),
    (a_session, query_2),
    (b_session, query_3),
    (c_session, query_4),
]

Next, I would loop over each pair of session and query, calling the function I just defined, and save the dataframe I get each time.

dataframes = []

for session, query in queries:
    dataframe = get_all_issues(session, query)
    dataframes.append(dataframe)

Now, the field names for each of these dataframes won't be the same. However, Pandas is tolerant of this: if a column is present in one dataframe but not in another, Pandas will fill the missing values with NaN. So, just concatenate the rows from each dataframe together:

all_results = pd.concat(dataframes, ignore_index=True)  # 'all' would shadow the built-in

... and that's it!
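To illustrate that NaN-filling behaviour with plain pandas (no Jira needed; df1 and df2 are toy frames standing in for two query results with partially overlapping columns):

```python
import pandas as pd

# Toy results with partially overlapping columns
df1 = pd.DataFrame({'key': ['A-1'], 'resolution': ['Resolved']})
df2 = pd.DataFrame({'key': ['C-9'], 'system_resources': ['8 GB']})

combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
# Columns absent from one frame are filled with NaN in its rows
```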


5 Comments

For whatever reason, using the approach suggested by @Nick ODell results in an empty set. My leading theory is that the field names are not passed (hence the None result). How would we use pd.DataFrame instead of pd.json_normalize and pass the field names (as I've done above)? Thanks in advance!
@equanimity Not sure. I can't actually test this code, so it's hard for me to say what might be wrong. Could you give me a copy of the contents of the issues variable? e.g. print(repr(issues)) (Assuming that you're allowed to post it.)
when I print(repr(issues)), I get: NameError: name 'issues' is not defined. I have the code working, but without using pd.json_normalize(issues). Now, the problem I'm facing is that some fields exist in certain sessions, but not in others. I'll post another update above.
I see you changed issues to all_issues in your version of the code. Therefore, can you post the contents of print(repr(all_issues))?
when I print(repr(all_issues)), I get: NameError: name 'all_issues' is not defined.

Instead of making separate variables for the queries, put them in a list:

queries = [
    'query 1 here...',
    'query 2 here...',
]

And then iterate over the list:

for query in queries:
    process(query)

1 Comment

How would I deal with the fact that 3 of the 4 queries reference different <x>_session variables?
