How to use Spark SQL as in-memory database?

Question

I am trying to understand Spark SQL concepts and am wondering if I could use Spark SQL as an in-memory database, similar to H2/SQLite?

Once I have processed all the records from the 100 files, I could save the data in a tabular format and query the tables for results instead of searching the files. Does this make any sense?

Dataset<Row> results = spark.sql("SELECT distinct(name) FROM mylogs");

At runtime if a user opts to get distinct names from a table 'mylogs', it should fetch from tables (not from the underlying files that the tables are derived from).

What I noticed is that Spark SQL scans the files again to get the data, and the user has to wait for a response until it has scanned all 100 files and fetched the data.

Is this a use case for Spark? Is there any better way to achieve this?

Use cache/persist on the temp table or view to keep it in memory, if its size is not too large and the table is read-only for most of its lifetime.
– Nachiket Kate, Commented Apr 24, 2018 at 11:21
You can, but it will be a very bad choice. It is not a database, and it is definitely not an in-memory database, even if it shares some characteristics with these.
– zero323, Commented Apr 24, 2018 at 12:44

Jacek Laskowski · Accepted Answer · 2018-04-24 13:36:00Z

2

In theory it's doable, and you could use Spark SQL as an in-memory database. However, I would not be surprised if the data were gone at some point and you had to query the 100 files again.

You could have a configuration where you execute a query over the 100 files and then cache / persist the results to avoid scans.

That is pretty much how Spark Thrift Server works, so you should read the documentation section "Running the Thrift JDBC/ODBC server".
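The cache/persist approach described above could look roughly like the sketch below. It is only an illustration under assumptions: the class name `CachedLogsExample`, the method `distinctNames`, and the small in-memory list standing in for the processed 100 files are hypothetical, and only the view name `mylogs` and the query come from the question. As far as I know, `spark.catalog().cacheTable(...)` is lazy, so the sketch runs one action to fill the cache before serving queries.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CachedLogsExample {

    // Registers the given names as the temp view "mylogs", caches the view,
    // and returns the distinct names read back through Spark SQL.
    static List<String> distinctNames(SparkSession spark, List<String> names) {
        // Assumption: this small dataset stands in for the result of
        // processing the 100 files; the real job would build it from them.
        Dataset<Row> logs = spark.createDataset(names, Encoders.STRING()).toDF("name");

        logs.createOrReplaceTempView("mylogs");

        // Mark the view for in-memory caching. Nothing is stored yet.
        spark.catalog().cacheTable("mylogs");

        // Run one action so the cache is actually populated.
        spark.sql("SELECT COUNT(*) FROM mylogs").collectAsList();

        // Later queries against "mylogs" can be served from the cached data,
        // as long as Spark has not evicted it.
        Dataset<Row> results = spark.sql("SELECT DISTINCT name FROM mylogs");
        return results.as(Encoders.STRING()).collectAsList();
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("CachedLogsExample")
                .master("local[*]") // local mode, for illustration only
                .getOrCreate();

        List<String> names = distinctNames(spark, Arrays.asList("alice", "bob", "alice"));
        System.out.println(names);

        spark.stop();
    }
}
```

The cache lives only as long as the `SparkSession` and can be dropped under memory pressure, in which case Spark recomputes the data from its source. That matches the caveat above about the data being gone at some point.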


3 Comments

Molay Over a year ago

Even after adding javaRDD.persist(StorageLevel.MEMORY_ONLY()); it is still scanning the files at runtime. That's when I posted this question.

Jacek Laskowski Over a year ago

Show the whole sequence of commands you are executing. How do you know that the scan is done at runtime? Edit your question and add necessary details. Thanks!

Molay Over a year ago

Sorry, I am unable to reproduce it now. After posting the question I made a few changes to the code: the order of execution changed and I removed lazy loading. I guess one of those changes did the trick. But javaRDD.persist(StorageLevel.MEMORY_ONLY()); has been there since the beginning.
