I am using pandas, sqlite, and sqlalchemy to search a bunch of strings for substrings. This project is inspired by this tutorial.

First, I create a sqlite database with one column of strings. Then I iterate through a separate file of strings and search for those strings in the database.

I have found the process to be slow, so I did some research and found that I needed to build an index on my column. When I followed the instructions provided here in the sqlite shell, everything seemed to work just fine.

However, when I try to make an index in my python script, I get the "cannot use index" error.

import pandas as pd
from sqlalchemy import create_engine # database connection
import datetime as dt



def load_kmer_db(disk_engine, chunk_size, encoding='utf-8'):
    start = dt.datetime.now()
    j = 0
    index_start = 1
    for df in pd.read_csv('fake.kmers.csv', chunksize=chunk_size, iterator=True, encoding=encoding):
        df.index += index_start
        j += 1
        df.to_sql('data', disk_engine.raw_connection(), if_exists='append', index=True, index_label='kmer_index')
        index_start = df.index[-1] + 1


def search_db_for_subsequence(disk_engine, sequence):
    """

    :param disk_engine: Disk engine for database containing query sequences
    :param sequence: Sequence for finding subsequences in the database
    :return: A data frame with the subsequences of sequence
    """
    return pd.read_sql_query("SELECT kmer FROM data INDEXED BY kmer_index WHERE '" + sequence + "' LIKE '%' || kmer || '%'", disk_engine)

if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('kmers', type=str, metavar='<kmer_file.txt>', help='text file with kmers')
    parser.add_argument('reads', type=str, metavar='<reads.fastq>', help='Reads to filter by input kmers')

    # Get the command line arguments.
    args = parser.parse_args()
    kmer_file = args.kmers
    reads_file = args.reads

    # Initialize database with filename 311_8M.db
    disk_engine = create_engine('sqlite:///311_8M.db') # This requires ipython to be installed

    load_kmer_db(disk_engine, 200)

    #****** Try explicitly calling the create index command
    #****** using the sqlite module.
    import sqlite3
    conn = sqlite3.connect('311_8M.db')
    c = conn.cursor()
    c.execute("CREATE INDEX kmer_index ON data(kmer);")

    reads = SeqReader(reads_file)
    count = 0  # number of reads processed so far
    for read in reads.parse_fastq():
        count += 1
        sequence = read[1]
        df = search_db_for_subsequence(
            disk_engine,
            sequence
        )

One can see that I first tried to create an index by passing the proper keyword arguments to the to_sql method. When I did that alone, I got an error stating that the index could not be found. Then I explicitly made the index through the sqlite3 module, which yielded the "cannot use index" error.

So now it appears that I have made my index, but for some reason, I am not able to use it. Why would that be? And how does one create an index using the pandas api instead of having to use the sqlite3 module?

  • That error message "cannot use index" seems to relate to the pd.read_sql_query() call and not the part where you create the index directly using the sqlite3 module. Commented Jul 26, 2016 at 1:28
  • Yes it appears that I am successfully creating the index, so why is it that I am unable to use it? Commented Jul 26, 2016 at 1:41
  • I think it has to do with your use of LIKE '%[some term]%' Commented Jul 26, 2016 at 1:50
  • E.g. queries like this LIKE '[some term]%' can use an index but LIKE '%[some term]%' cannot. Commented Jul 26, 2016 at 1:51
  • Interesting, after testing in the sqlite shell it appears that you are correct. So I will just have to look into when one can use an index with the LIKE syntax. Thank you. Commented Jul 26, 2016 at 2:13

1 Answer

That error message "cannot use index" seems to relate to the pd.read_sql_query() call and not the part where you create the index directly using the sqlite3 module.

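A query with some_col LIKE '%[some term]%' cannot use an index on some_col to narrow the search, because the leading wildcard leaves no fixed prefix to look up. A query with some_col LIKE '[some term]%', on the other hand, can make use of an index on some_col. In SQLite this also requires the index collation to match the case sensitivity of LIKE: a NOCASE index for the default case-insensitive LIKE, or a BINARY index when case_sensitive_like is on.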
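
One way to check this yourself is to ask SQLite for its query plan. The following is only a minimal sketch using the standard sqlite3 module: the in-memory database, the four sample k-mers, and the helper name plan_for are made up for illustration, and the index is declared COLLATE NOCASE on the assumption that LIKE is left in its default case-insensitive mode. The exact wording of the plan text differs between SQLite versions, so the sketch only looks at whether the plan is a SEARCH (index lookup) or a SCAN (every row visited).

```python
import sqlite3

# Hypothetical in-memory database with a few made-up k-mers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (kmer TEXT)")
conn.executemany(
    "INSERT INTO data (kmer) VALUES (?)",
    [("ACGT",), ("CGTA",), ("GTAC",), ("TACG",)],
)
# Assumption: LIKE is in its default case-insensitive mode, so the index
# is given NOCASE collation to make it eligible for prefix matching.
conn.execute("CREATE INDEX kmer_index ON data(kmer COLLATE NOCASE)")
conn.commit()


def plan_for(where_clause):
    """Return the 'detail' text of EXPLAIN QUERY PLAN for a WHERE clause."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT kmer FROM data WHERE " + where_clause
    ).fetchall()
    # The human-readable description is the fourth column of each row.
    return " | ".join(row[3] for row in rows)


prefix_plan = plan_for("kmer LIKE 'ACG%'")   # fixed prefix, then wildcard
infix_plan = plan_for("kmer LIKE '%CG%'")    # leading wildcard

print("prefix:", prefix_plan)
print("infix: ", infix_plan)
```

With a fixed prefix the plan should start with SEARCH and name kmer_index; with a leading wildcard it should start with SCAN. The query in the question builds its pattern from the kmer column itself ('%' || kmer || '%'), so it falls into the second case no matter which index exists.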
