
I have an Oracle database that I cannot add new tables to, so in Django I've created a SQLite database basically just to sync items from the Oracle database to SQLite.

Currently there are about 0.5 million items in the Oracle database.

All of the primary keys in the Oracle database are incremental; however, there's no guarantee. Sometimes network hiccups break the connection between Django and the Oracle database, and I would miss some values when synchronizing.

Hence, I came up with a model inside Django:

class sequential_missing(models.Model):
    # Django does not support composite primary keys, so the
    # (database, row) pair is enforced with unique_together
    # instead of two primary_key fields.
    database = models.CharField(max_length=200)
    row = models.IntegerField()

    class Meta:
        unique_together = ('database', 'row')

Basically, each record marks a row number that is genuinely empty on the Oracle side. By comparing the candidate gaps found in the SQLite database against this table, I can tell which missing sequential numbers are actually empty in Oracle, and so avoid re-checking ALL of the missing values.
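The comparison this model enables can be sketched with plain sets (the IDs below are hypothetical stand-ins, no ORM involved): the first set difference finds the candidate gaps, and subtracting the rows recorded as empty leaves only the IDs that still need checking against Oracle.

```python
# Hypothetical data standing in for the two databases.
oracle_pks = {1, 2, 3, 5, 8, 9}   # primary keys actually present in Oracle
known_empty = {4, 7}              # rows recorded in sequential_missing
max_value = 9

# Candidate gaps: every ID from 1..max_value not present in Oracle.
candidate_gaps = set(range(1, max_value + 1)) - oracle_pks   # {4, 6, 7}

# Drop the gaps already known to be empty; only these need re-checking.
still_to_check = candidate_gaps - known_empty                # {6}
```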

The whole function is as follows:

def checkMissing(maxValue, databaseObjects, databaseName):
    missingValues = []

    #############SECTION 1##########################
    print "Database:" + databaseName
    print "Checking for Missing Sequential Numbers"
    set_of_pk_values = set(databaseObjects.objects.all().values_list('pk', flat=True))
    set_one_to_max_value = set(xrange(1, maxValue+1))
    missingValues = set_one_to_max_value.difference(set_of_pk_values)
    #############SECTION 1##########################

    #Even though missingValues could be enough, the problem is that not even Oracle can
    #guarantee the auto-incremented number is sequential, hence we look up the values
    #we thought were missing and remove them from missingValues, which should be faster
    #than checking all of them in the Oracle database

    #############SECTION 2##########################
    print "Checking for numbers that are empty, Current Size:" + str(len(missingValues))
    emptyRow = []
    for idx, val in enumerate(missingValues):
        found = False
        for items in sequential_missing.objects.all():
            if(items.row == val and items.database == databaseName):
                found = True
                #print "Database:" + str(items.row) + ", Same as Empty Row:" + str(val)
        if(found == True):
            emptyRow.append(val)
    #############SECTION 2##########################

    #############SECTION 3##########################
    print "Removing empty numbers, Current Size:" + str(len(missingValues)) + ", Empty Row:" + str(len(emptyRow))
    missingValuesCompared = []
    for idx, val in enumerate(missingValues):
        found = False
        for items in emptyRow:
            if(val == items):
                found = True
                #print "Empty Row:" + str(items) + ", same as Missing Values:" + str(val)
        if(found == False):
            missingValuesCompared.append(val)

    print "New Size:" + str(len(missingValuesCompared))
    return missingValuesCompared
    #############SECTION 3##########################

The code is split into 3 sections:

  1. Figures out which sequential values are missing

  2. Checks those values against the model, to see if any of them match a recorded empty row

  3. Creates a new array that excludes the rows found in section 2.

The problem is that section 2 takes a long time, O(n^2) at best, because for every candidate value it iterates through the whole sequential_missing table to check whether the row is genuinely empty.

Is there a faster way to do this, whilst consuming minimal memory?
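One way to collapse sections 2 and 3 into a single linear pass is to put the recorded rows into a set once and filter with O(1) membership tests. A minimal sketch, with the ORM query replaced by a plain collection (`recorded_rows` stands in for a `values_list('row', flat=True)` result):

```python
def remove_known_empty(missing_values, recorded_rows):
    """Drop values already recorded as empty; one pass, O(1) lookups."""
    recorded = set(recorded_rows)
    return sorted(v for v in missing_values if v not in recorded)

# e.g. remove_known_empty({4, 6, 7, 10}, [4, 7]) -> [6, 10]
```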

Edit:

Using row__in is much better:

setItem = list(missingValues)
print "Items in setItem:" + str(len(setItem))

chunkSize = 500  #sqlite limits how many values can go into a single IN clause
emptyRowAppend = []
for start in xrange(0, len(setItem), chunkSize):
    chunk = setItem[start:start + chunkSize]
    emptyRow = sequential_missing.objects.filter(database=databaseName, row__in=chunk)
    for items in emptyRow:
        emptyRowAppend.append(items.row)


print "Removing empty numbers, Empty Row Size:" + str(len(emptyRowAppend)) + ", Missing Value Size:" + str(len(missingValues))
emptyRowSet = set(emptyRowAppend)
missingValuesCompared = [val for val in missingValues if val not in emptyRowSet]

1 Answer

You can replace this code

emptyRow = []
for idx, val in enumerate(missingValues):
    found = False
    for items in sequential_missing.objects.all():
        if(items.row == val and items.database == databaseName):
            found = True
            #print "Database:" + str(items.row) + ", Same as Empty Row:" + str(val)
    if(found == True):
        emptyRow.append(val)

with

emptyRow = sequential_missing.objects.filter(database=databaseName, row__in=missingValues)

so that you issue a single query to the database. However, this concatenates all of missingValues into the SQL string for the query, which may become very long. You should try it and see if it is viable.

Otherwise, you should order both missingValues and the sequential_missing rows by value, so that you can match them in linear time. Something like:

missingValues = sorted(missingValues)  #it's a set, so sort it into a list first
emptyRow = []
val_index = 0
for item in sequential_missing.objects.all().order_by('row'):
    while val_index < len(missingValues) and item.row > missingValues[val_index]:
        val_index += 1
    if val_index < len(missingValues) and item.row == missingValues[val_index]:
        emptyRow.append(item.row)
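The merge logic above can be checked outside the ORM. A sketch with plain sorted lists standing in for the ordered queryset and the sorted missing values (`matching_rows` is a hypothetical name, not part of Django):

```python
def matching_rows(sorted_rows, sorted_missing):
    """Two-pointer walk over two sorted lists; linear in their total length."""
    matches = []
    i = 0
    for row in sorted_rows:
        # Advance past missing values smaller than the current row.
        while i < len(sorted_missing) and row > sorted_missing[i]:
            i += 1
        # Bounds check prevents an IndexError when the list is exhausted.
        if i < len(sorted_missing) and row == sorted_missing[i]:
            matches.append(row)
    return matches

# e.g. matching_rows([2, 4, 7, 9], [1, 4, 7, 8]) -> [4, 7]
```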

3 Comments

The second solution works fine; however, the speedup isn't amazing. When you say all the values will be concatenated, what do you mean by that?
Every time you issue a query to the database server, a string is composed with the SQL code for the query. In the first solution I propose, that string contains all the values in missingValues, which might result in a huge string. The second solution should be fine if your database has an index on the 'row' column. Check the time it takes to execute the query with/without the order_by clause.
I think the first solution is much better; however, I had to limit the number of missing rows that go into the filter query. With sqlite, the maximum was around 500. But in terms of speed, it was 2 hours vs 10 seconds, literally.
