I have a .csv file with a few columns, and I wish to skip 3 (or 'n' in general) lines when importing this file into a dataframe using the spark.read.csv() function. My .csv file looks like this -
ID;Name;Revenue
Identifier;Customer Name;Euros
cust_ID;cust_name;€
ID132;XYZ Ltd;2825
ID150;ABC Ltd;1849
In plain Python with pandas, this is simple using the skiprows=n option of the read_csv() function, like -
import pandas as pd
df=pd.read_csv('filename.csv',sep=';',skiprows=3)  # since we wish to skip the top 3 lines
With PySpark, I am importing this .csv file as follows -
df=spark.read.csv("filename.csv",sep=';')
This imports the file as -
ID |Name |Revenue
Identifier |Customer Name|Euros
cust_ID |cust_name |€
ID132 |XYZ Ltd |2825
ID150      |ABC Ltd      |1849
This is not correct, because I wish to ignore the first three lines. I can't use the option header=True because that only excludes the first line. One could use the comment= option, but for that the lines need to start with a particular character, and that is not the case with my file. I could not find anything relevant in the documentation. Is there any way this can be accomplished?
…RDD and then convert the RDD into a DF. Still, with his answer, I need to investigate whether the order of rows read from the .csv file is respected or not. 2. Apropos your comment that Spark has a long way to go: maybe we have such an arrangement by design. In a DF we can't really be sure which partition a row lands in. That's why we do not have ix/.loc in Spark, as opposed to pandas. I am no expert