-1

I have a large dataset that I want to import into Databricks to do some analytics using Scala. The dataset is available at this link: https://drive.google.com/open?id=1g4YYALk3nArN8bX2uFS70IpbdSf_Efqj

I want to import this dataset so that the document ID is in the first column and the rest of the text data is in the second column.

But when I import the data using the following code, every line ends up in a single column named value:

val df = spark.read.text("FileStore/tables/plot_summaries.txt")

df.select("value").show()


Can anyone help me import this properly? Any help would be highly appreciated. Thank you.


2 Answers

4

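Read the file as CSV with a tab separator. The text reader always puts each whole line into a single value column and does not split on sep.

spark.read.option("sep", "\t").csv("FileStore/tables/plot_summaries.txt")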

3

Your data is tab-separated, so you need to specify the delimiter explicitly.

scala> import org.apache.spark.sql.types._
scala> val schema = new StructType().add("DocumentID", LongType, true).add("Description", StringType, true)

scala> val df = spark.read.format("csv").option("delimiter", "\t").schema(schema).load("/plot_summaries.txt")

scala> df.show(10)
+----------+--------------------+
|DocumentID|         Description|
+----------+--------------------+
|  23890098|Shlykov, a hard-w...|
|  31186339|The nation of Pan...|
|  20663735|Poovalli Induchoo...|
|   2231378|The Lemon Drop Ki...|
|    595909|Seventh-day Adven...|
|   5272176|The president is ...|
|   1952976|{{plot}} The film...|
|  24225279|The story begins ...|
|   2462689|Infuriated at bei...|
|  20532852|A line of people ...|
+----------+--------------------+
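If you would rather keep the line-based reader from the question, here is a rough sketch of another way to get the same two columns: read each line as a string and split it at the first tab only. This may be worth trying if some summaries contain double quotes, since the CSV reader treats `"` as a quote character by default. The object name `PlotSummaries`, the helper `parseLine`, and the choice to skip malformed lines are my own assumptions, not something from the question.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

object PlotSummaries {
  // Split one raw line at the first tab only, so any later tabs stay in the text.
  // Assumption: lines without a tab or without a numeric ID are skipped (None).
  def parseLine(line: String): Option[(Long, String)] =
    line.split("\t", 2) match {
      case Array(id, text) => Try(id.trim.toLong).toOption.map(n => (n, text))
      case _               => None
    }

  def main(args: Array[String]): Unit = {
    // In a Databricks notebook `spark` already exists; getOrCreate() reuses it.
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    val df = spark.read
      .textFile("FileStore/tables/plot_summaries.txt") // path taken from the question
      .flatMap(line => parseLine(line).toList)         // drop lines that did not parse
      .toDF("DocumentID", "Description")

    df.show(10)
  }
}
```

In a notebook you can paste just `parseLine` and the `val df = ...` expression, since `spark` and its implicits are already available there.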

1 Comment

Can you help and suggest how to handle this? stackoverflow.com/questions/62036791/…
