I am trying to understand Spark SQL concepts and am wondering if I could use Spark SQL as an in-memory database, similar to H2/SQLite?

Once I have processed all the records from 100 files, I could save the data in a tabular format and query the tables for results instead of searching the files. Does this make sense?

Dataset<Row> results = spark.sql("SELECT distinct(name) FROM mylogs");

At runtime, if a user opts to get the distinct names from the table 'mylogs', the query should fetch them from the table (not from the underlying files that the table is derived from).

What I noticed is that Spark SQL scans the files again to get the data, and the user has to wait for the response until it has scanned all 100 files and fetched the data.

Is this a use case for Spark? Is there any better way to achieve this?

  • Use cache/persist on a temp table or view to keep it in memory if its size is not too large and the table is read-only for most of its lifetime. Commented Apr 24, 2018 at 11:21
  • You can, but it will be a very bad choice. It is not a database, and it is definitely not an in-memory database, even if it shares some characteristics with these. Commented Apr 24, 2018 at 12:44

1 Answer

In theory it's doable, and you could use Spark SQL as an in-memory database. I would not be surprised, though, if the data were gone at some point and you had to query the 100 files again.

You could have a configuration where you execute a query over the 100 files and then cache / persist the results to avoid scans.

That is pretty much how Spark Thrift Server works, so you should read the documentation at Running the Thrift JDBC/ODBC server.
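The cache / persist configuration described above could look roughly like the sketch below. It is only an illustration under assumptions: the class name CachedLogs, the helper methods, the JSON input format and the path argument are all hypothetical, and the files are assumed to contain a string field called name.

```java
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CachedLogs {

    // Reads the files once, caches the result and registers it as the view 'mylogs'.
    // Assumption: the input is JSON with a string field 'name'.
    static Dataset<Row> loadAndCache(SparkSession spark, String path) {
        Dataset<Row> logs = spark.read().json(path);
        logs.cache();   // lazy: only marks the Dataset for caching
        logs.count();   // an action; this is the scan that actually fills the cache
        logs.createOrReplaceTempView("mylogs");
        return logs;
    }

    // Later queries against the view can be answered from the cached data.
    static List<String> distinctNames(SparkSession spark) {
        return spark.sql("SELECT DISTINCT name FROM mylogs")
                .as(Encoders.STRING())
                .collectAsList();
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("CachedLogs")
                .master("local[*]")   // assumption: local run, for illustration only
                .getOrCreate();

        loadAndCache(spark, args[0]);   // args[0]: hypothetical directory with the log files
        System.out.println(distinctNames(spark));

        spark.stop();
    }
}
```

Note that cache() and persist() are lazy, so nothing is kept in memory until an action such as count() has run. The cache also lives only as long as the Spark application and can be evicted under memory pressure, in which case Spark recomputes the data from the files.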

3 Comments

javaRDD.persist(StorageLevel.MEMORY_ONLY()); - even after using this, it still scans the files at runtime. That's when I posted this question.
Show the whole sequence of commands you are executing. How do you know that the scan is done at runtime? Edit your question and add necessary details. Thanks!
Sorry, I am unable to replicate it now. After posting the question I made a few changes to the code: the order of execution changed and I removed lazy loading. I guess one of those changes did the trick. But javaRDD.persist(StorageLevel.MEMORY_ONLY()); has been there since the beginning.
