Pandas + SQLite "cannot use index" error

Question

I am using pandas, sqlite, and sqlalchemy to search a bunch of strings for substrings. This project is inspired by this tutorial.

First, I create a sqlite database with one column of strings. Then I iterate through a separate file of strings and search for those strings in the database.

I have found the process to be slow, so I did some research and found that I needed to build an index on my column. When I followed the instructions provided here in the sqlite shell, everything seemed to work just fine.

However, when I try to make an index in my python script, I get the "cannot use index" error.

import pandas as pd
from sqlalchemy import create_engine # database connection
import datetime as dt



def load_kmer_db(disk_engine, chunk_size, encoding='utf-8'):
    start = dt.datetime.now()
    j = 0
    index_start = 1
    for df in pd.read_csv('fake.kmers.csv', chunksize=chunk_size, iterator=True, encoding=encoding):
        df.index += index_start
        j += 1
        df.to_sql('data', disk_engine.raw_connection(), if_exists='append', index=True, index_label='kmer_index')
        index_start = df.index[-1] + 1


def search_db_for_subsequence(disk_engine, sequence):
    """

    :param disk_engine: Disk engine for database containing query sequences
    :param sequence: Sequence for finding subsequences in the database
    :return: A data frame with the subsequences of sequence
    """
    return pd.read_sql_query("SELECT kmer FROM data INDEXED BY kmer_index WHERE '" + sequence + "' LIKE '%' || kmer || '%'", disk_engine)

if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('kmers', type=str, metavar='<kmer_file.txt>', help='text file with kmers')
    parser.add_argument('reads', type=str, metavar='<reads.fastq>', help='Reads to filter by input kmers')

    # Get the command line arguments.
    args = parser.parse_args()
    kmer_file = args.kmers
    reads_file = args.reads

    # Initialize database with filename 311_8M.db
    disk_engine = create_engine('sqlite:///311_8M.db') # This requires ipython to be installed

    load_kmer_db(disk_engine, 200)

    #****** Try explicitly calling the create index command
    #****** using the sqlite module.
    import sqlite3
    conn = sqlite3.connect('311_8M.db')
    c = conn.cursor()
    c.execute("CREATE INDEX kmer_index ON data(kmer);")

    reads = SeqReader(reads_file)
    count = 0  # number of reads processed so far
    for read in reads.parse_fastq():
        count += 1
        sequence = read[1]
        df = search_db_for_subsequence(
            disk_engine,
            sequence
        )

One can see that I first tried to create an index by passing the proper keyword arguments to the to_sql method. When I did that alone, I got an error stating that the index could not be found. Then I explicitly made the index through the sqlite3 module, which yielded the "cannot use index" error.

So now it appears that I have made my index, but for some reason, I am not able to use it. Why would that be? And how does one create an index using the pandas api instead of having to use the sqlite3 module?

That error message "cannot use index" seems to relate to the pd.read_sql_query() call and not the part where you create the index directly using the sqlite3 module.
– mechanical_meat, Commented Jul 26, 2016 at 1:28
Yes, it appears that I am successfully creating the index, so why is it that I am unable to use it?
– Malonge, Commented Jul 26, 2016 at 1:41
E.g. queries like LIKE '[some term]%' can use an index, but LIKE '%[some term]%' cannot.
– mechanical_meat, Commented Jul 26, 2016 at 1:51
Interesting, after testing in the sqlite shell it appears that you are correct. So I will just have to look into when one can use an index with the LIKE syntax. Thank you.
– Malonge, Commented Jul 26, 2016 at 2:13

mechanical_meat · Accepted Answer · 2016-07-26 02:16:42Z

That error message "cannot use index" seems to relate to the pd.read_sql_query() call and not the part where you create the index directly using the sqlite3 module.

A query of the form some_col LIKE '%[some term]%' cannot use an index on some_col, because the leading wildcard leaves no fixed prefix to look up. A query of the form some_col LIKE '[some term]%', on the other hand, can make use of an index on some_col.
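
To check this yourself, you can ask SQLite for its query plan. The following is a minimal sketch, not the asker's actual setup: it uses a throwaway in-memory database with made-up k-mers, and a hypothetical index name kmer_nocase. One assumption worth flagging: the index is declared COLLATE NOCASE, because with SQLite's default case-insensitive LIKE a plain (BINARY) index is generally not eligible for the prefix optimization. The third query mirrors the shape used in the question, where the column sits inside the pattern rather than on the left-hand side of LIKE.

```python
import sqlite3

# Throwaway in-memory database; the rows are made-up k-mers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (kmer TEXT)")
conn.executemany(
    "INSERT INTO data (kmer) VALUES (?)",
    [("ACGT",), ("TTGA",), ("ACCA",)],
)

# Assumption: a NOCASE index, so the default case-insensitive LIKE
# can be rewritten by SQLite into an index range lookup.
conn.execute("CREATE INDEX kmer_nocase ON data(kmer COLLATE NOCASE)")


def plan(sql):
    """Return the 'detail' text of EXPLAIN QUERY PLAN for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each row.
    return " | ".join(row[-1] for row in rows)


# Fixed prefix: eligible for an index range search.
prefix_plan = plan("SELECT kmer FROM data WHERE kmer LIKE 'AC%'")

# Leading wildcard: every row has to be examined.
substring_plan = plan("SELECT kmer FROM data WHERE kmer LIKE '%AC%'")

# Shape used in the question: the column is part of the pattern,
# so there is nothing for an index lookup to match against.
reversed_plan = plan(
    "SELECT kmer FROM data WHERE 'TTACGTAA' LIKE '%' || kmer || '%'"
)

print("prefix:   ", prefix_plan)
print("substring:", substring_plan)
print("reversed: ", reversed_plan)
```

The prefix plan should start with SEARCH and name the index, while the other two start with SCAN. The exact wording of the plan text differs between SQLite versions, so treat the printed strings as indicative only.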
