I have a .csv file with a few columns, and I wish to skip 3 (or 'n' in general) lines when importing this file into a dataframe using the spark.read.csv() function. My .csv file looks like this -
ID;Name;Revenue
Identifier;Customer Name;Euros
cust_ID;cust_name;€
ID132;XYZ Ltd;2825
ID150;ABC Ltd;1849
In plain Python with pandas, this is simple using the skiprows=n option of the read_csv() function, like -
import pandas as pd
df=pd.read_csv('filename.csv',sep=';',skiprows=3)  # since we wish to skip the top 3 lines
With PySpark, I am importing this .csv file as follows -
df=spark.read.csv("filename.csv",sep=';')
This imports the file as -
ID |Name |Revenue
Identifier |Customer Name|Euros
cust_ID |cust_name |€
ID132 |XYZ Ltd |2825
ID150      |ABC Ltd      |1849
This is not correct, because I wish to ignore the first three lines. I can't use the option header=True because that only excludes the first line. One could use the comment= option, but for that the lines need to start with a particular character, and that is not the case with my file. I could not find anything relevant in the documentation. Is there any way this can be accomplished?
…RDD and then convert the RDD into a DF. Still, with his answer, I need to investigate whether the order of rows read from the .csv file is respected or not. 2. Apropos your comment that Spark has a long way to go: maybe we have such an arrangement by design. In a DF we can't really be sure which partition a row lands in. That's why we do not have ix/.loc in Spark, as opposed to pandas. I am no expert