Unit test pyspark code using python

Question

I have script in pyspark like below. I want to unit test a function in this script.

def rename_chars(column_name):
    chars = ((' ', '_&'), ('.', '_$'))
    new_cols = reduce(lambda a, kv: a.replace(*kv), chars, column_name)
    return new_cols


def column_names(df):
    changed_col_names = df.schema.names
    for cols in changed_col_names:
        df = df.withColumnRenamed(cols, rename_chars(cols))
    return df

I wrote a unittest like below to test the function.

But I don't know how to submit the unittest. I have done spark-submit which doesn't do anything.

import unittest
from my_script import column_names

from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
conf = SparkConf()
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)

cols = ['ID', 'NAME', 'last.name', 'abc test']
val = [(1, 'Sam', 'SMITH', 'eng'), (2, 'RAM', 'Reddy', 'turbine')]
df = sqlContext.createDataFrame(val, cols)


class RenameColumnNames(unittest.TestCase):
    def test_column_names(self):
        df1 = column_names(df)
        result = df1.schema.names
        expected = ['ID', 'NAME', 'last_$name', 'abc_&test']
        self.assertEqual(result, expected)

How can I integrate this script to work as a unittest

what can I run this on a node where I have pyspark installed?

The unittest issue seems to be resolved on local machine, how to use pip anaconda to create a virtualenv on the server is a different topic, you might create a different thread to make the installation, test and developing on the server, — Gang
– Gang, Commented Mar 23, 2018 at 0:41

Dave Voyles · Accepted Answer · 2020-04-14 17:23:49Z

8

Pyspark Unittests guide

1.You need to download Spark distribution from site and unpack it. Or if you already have a working distribution of Spark and Python just install pyspark: pip install pyspark

2.Set system variables like this if needed:

export SPARK_HOME="/home/eugene/spark-1.6.0-bin-hadoop2.6"
export PYTHONPATH="$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH"
export PATH="SPARK_HOME/bin:$PATH"

I added this in .profile in my home directory. If you already have an working distribution of Spark this variables may be set.

3.Additionally you may need to setup:

PYSPARK_SUBMIT_ARGS="--jars path/to/hive/jars/jar.jar,path/to/other/jars/jar.jar --conf spark.driver.userClassPathFirst=true --master local[*] pyspark-shell"
PYSPARK_PYTHON="/home/eugene/anaconda3/envs/ste/bin/python3"

Python and jars? Yes. Pyspark uses py4j to communicate with java part of Spark. And if you want to solve more complicated situation like run Kafka server with tests in Python or use TestHiveContext from Scala like in the example you should specify jars. I did it through Idea run configuration environment variables.

4.And you could to use pyspark/tests.py, pyspark/streaming/tests.py, pyspark/sql/tests.py, pyspark/ml/tests.py, pyspark/mllib/tests.pyscripts wich contain various TestCase classes and examples for testing pyspark apps. In your case you could do (example from pyspark/sql/tests.py):

class HiveContextSQLTests(ReusedPySparkTestCase):

    @classmethod
    def setUpClass(cls):
        ReusedPySparkTestCase.setUpClass()
        cls.tempdir = tempfile.NamedTemporaryFile(delete=False)
        try:
            cls.sc._jvm.org.apache.hadoop.hive.conf.HiveConf()
        except py4j.protocol.Py4JError:
            cls.tearDownClass()
            raise unittest.SkipTest("Hive is not available")
        except TypeError:
            cls.tearDownClass()
            raise unittest.SkipTest("Hive is not available")
        os.unlink(cls.tempdir.name)
        _scala_HiveContext =\
            cls.sc._jvm.org.apache.spark.sql.hive.test.TestHiveContext(cls.sc._jsc.sc())
        cls.sqlCtx = HiveContext(cls.sc, _scala_HiveContext)
        cls.testData = [Row(key=i, value=str(i)) for i in range(100)]
        cls.df = cls.sc.parallelize(cls.testData).toDF()

    @classmethod
    def tearDownClass(cls):
        ReusedPySparkTestCase.tearDownClass()
        shutil.rmtree(cls.tempdir.name, ignore_errors=True)

but you need to specify --jars with Hive libs in PYSPARK_SUBMIT_ARGS as described earlier

or without Hive:

class SQLContextTests(ReusedPySparkTestCase):
    def test_get_or_create(self):
        sqlCtx = SQLContext.getOrCreate(self.sc)
        self.assertTrue(SQLContext.getOrCreate(self.sc) is sqlCtx)

As I know if pyspark have been installed through pip, you haven't tests.py described in example. In this case just download the distribution from Spark site and copy code examples.

Now you could run your TestCase as a normal: python -m unittest test.py

update: Since spark 2.3 using of HiveContext and SqlContext is deprecated. You could use SparkSession Hive API.

edited Apr 14, 2020 at 17:23

Dave Voyles

4,6158 gold badges35 silver badges45 bronze badges

answered Mar 22, 2018 at 6:47

Eugene Lopatkin

2,8471 gold badge27 silver badges36 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

User12345 Over a year ago

For me the problem is I cannot install spark in my edgenode. I have native python installed in my edgenode and cloudera provided anaconda to use for pyspark I want to run the unittest on the edge node using spark provided by cloudera.

Gang Over a year ago

just curious, what is the jar file related here?

Eugene Lopatkin Over a year ago

@Gang I've described in answer.

Eugene Lopatkin Over a year ago

@user9367133 as ksindi said, just install pyspark in your anaconda distribution and run tests trough python -m unittest script.py

Powers · Accepted Answer · 2020-06-13 17:28:10Z

Here's a lightweight way to test your function. You don't need to download Spark to run PySpark tests like the accepted answer outlines. Downloading Spark is an option, but it's not necessary. Here's the test:

import pysparktestingexample.stackoverflow as SO
from chispa import assert_df_equality
import pyspark.sql.functions as F

def test_column_names(spark):
    source_data = [
        ("jose", "oak", "switch")
    ]
    source_df = spark.createDataFrame(source_data, ["some first name", "some.tree.type", "a gaming.system"])

    actual_df = SO.column_names(source_df)

    expected_data = [
        ("jose", "oak", "switch")
    ]
    expected_df = spark.createDataFrame(expected_data, ["some_&first_&name", "some_$tree_$type", "a_&gaming_$system"])

    assert_df_equality(actual_df, expected_df)

The SparkSession used by the test is defined in the tests/conftest.py file:

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope='session')
def spark():
    return SparkSession.builder \
      .master("local") \
      .appName("chispa") \
      .getOrCreate()

The test uses the assert_df_equality function defined in the chispa library.

Here's your code and the test in a GitHub repo.

pytest is generally preferred in the Python community over unittest. This blog post explains how to test PySpark programs and ironically has a modify_column_names function that'd let you rename these columns more elegantly.

def modify_column_names(df, fun):
    for col_name in df.columns:
        df = df.withColumnRenamed(col_name, fun(col_name))
    return df

Kamil Sindi · Accepted Answer · 2018-03-22 13:45:44Z

3

Here's one way to do it. In the CLI call:

python -m unittest my_unit_test_script.py

Code

import functools
import unittest

from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext


def rename_chars(column_name):
    chars = ((' ', '_&'), ('.', '_$'))
    new_cols = functools.reduce(lambda a, kv: a.replace(*kv), chars, column_name)
    return new_cols


def column_names(df):
    changed_col_names = df.schema.names
    for cols in changed_col_names:
        df = df.withColumnRenamed(cols, rename_chars(cols))
    return df


class RenameColumnNames(unittest.TestCase):
    def setUp(self):
        conf = SparkConf()
        sc = SparkContext(conf=conf)
        self.sqlContext = HiveContext(sc)

    def test_column_names(self):
        cols = ['ID', 'NAME', 'last.name', 'abc test']
        val = [(1, 'Sam', 'SMITH', 'eng'), (2, 'RAM', 'Reddy', 'turbine')]
        df = self.sqlContext.createDataFrame(val, cols)
        result = df.schema.names
        expected = ['ID', 'NAME', 'last_$name', 'abc_&test']
        self.assertEqual(result, expected)

answered Mar 22, 2018 at 13:45

Kamil Sindi

23k19 gold badges101 silver badges122 bronze badges

6 Comments

Eugene Lopatkin Over a year ago

It doesn't work: Expected :['ID', 'NAME', 'last.name', 'abc test'] Actual :['ID', 'NAME', 'last_$name', 'abc_&test']

Kamil Sindi Over a year ago

@EugeneLopatkin the question is not about fixing the unittests correctness :-)

User12345 Over a year ago

@ksindi Your solution work if I have pyspark and hadoop installed in my local machine. But for me the problem is if I run this script on edge node then the job fails with No module error pyspark as in my python I don't have pyspark installed. How can I use the existing Hadoop environment and cloudera given anaconda to run the unittest

Kamil Sindi Over a year ago

@user9367133 you can pip install pyspark as of spark 2.3. maybe add it as a dependency?

User12345 Over a year ago

@ksindi if I install using pip will it get necessary binaries

|

Jorge Leitao · Accepted Answer · 2019-02-22 16:49:03Z

Assuming you have pyspark installed (e.g. pip install pyspark on a venv), you can use the class below for unit testing it in unittest:

import unittest
import pyspark


class PySparkTestCase(unittest.TestCase):

    @classmethod
    def setUpClass(cls):
        conf = pyspark.SparkConf().setMaster("local[*]").setAppName("testing")
        cls.sc = pyspark.SparkContext(conf=conf)
        cls.spark = pyspark.SQLContext(cls.sc)

    @classmethod
    def tearDownClass(cls):
        cls.sc.stop()

Example:

class SimpleTestCase(PySparkTestCase):

    def test_with_rdd(self):
        test_input = [
            ' hello spark ',
            ' hello again spark spark'
        ]

        input_rdd = self.sc.parallelize(test_input, 1)

        from operator import add

        results = input_rdd.flatMap(lambda x: x.split()).map(lambda x: (x, 1)).reduceByKey(add).collect()
        self.assertEqual(results, [('hello', 2), ('spark', 3), ('again', 1)])

    def test_with_df(self):
        df = self.spark.createDataFrame(data=[[1, 'a'], [2, 'b']], 
                                        schema=['c1', 'c2'])
        self.assertEqual(df.count(), 2)

Note that this creates a context per class. Use setUp instead of setUpClass to get a context per test. This typically adds a lot of overhead time on the execution of the tests, as creating a new spark context is currently expensive.

Collectives™ on Stack Overflow

Unit test pyspark code using python

4 Answers 4

4 Comments

Comments

6 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

4 Comments

Comments

6 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related