I have a business requirement to use Spark via SageMaker Processing, and I need to run distributed code that uses pandas, numpy, gensim and sklearn. I generated a .zip file of all the installed packages, uploaded it to S3, and passed the file as a parameter to my PySpark processor's run method. However, Spark does not recognize gensim and sklearn. I can load pandas and numpy without any issue, but for gensim and sklearn I get the errors:
No module named 'sklearn'
No module named 'gensim'
I set this up in 4 steps:
- Step1 (preparing our requirements.txt file):
numpy==1.23.5
pandas==1.5.3
scikit-learn==1.2.2
pyarrow==11.0.0
gensim==4.3.1
- Step2 (installing and packaging the dependencies), run in a terminal inside the SageMaker notebook instance:
cd /home/ec2-user/SageMaker/python_dependencies
pip install -r requirements.txt -t ./packages
zip -r dependencies.zip ./packages
aws s3 cp dependencies.zip s3://data-science/code/dependencies/
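One detail worth inspecting in that archive: `zip -r dependencies.zip ./packages` nests everything under a top-level `packages/` directory, and PySpark only makes importable what sits at the zip root. A minimal sketch (using only the stdlib `zipfile` module, with a toy in-memory archive standing in for the real one) showing how to check the top-level entries:

```python
import io
import zipfile

# Build a toy archive laid out the same way `zip -r dependencies.zip ./packages`
# would lay it out: every entry is nested under a "packages/" directory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("packages/sklearn/__init__.py", "")
    zf.writestr("packages/gensim/__init__.py", "")

# List the names at the root of the archive.
with zipfile.ZipFile(buf) as zf:
    top_level = sorted({name.split("/")[0] for name in zf.namelist()})
print(top_level)  # → ['packages']
```

Running the same check against the real dependencies.zip shows whether `sklearn` and `gensim` sit at the zip root or one level deeper.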
- Step3 (Updating our processor code):
# Define the PySparkProcessor:
from sagemaker.spark.processing import PySparkProcessor

spark_processor = PySparkProcessor(
    base_job_name="sm-spark",
    framework_version="3.1",
    role=role,
    instance_count=default_instance_count,  # Adjust the instance count as needed
    instance_type=default_instance,         # Adjust the instance type as needed
    max_runtime_in_seconds=1200,
)
# Setting input bucket:
input_bucket = 'data-science'
# Define the number of records wanted:
number = "100" # Change this as needed
# Run the Spark job:
spark_processor.run(
    submit_app="process.py",
    arguments=[input_bucket, number],
    submit_py_files=["s3://data-science/code/dependencies/dependencies.zip"],
    spark_event_logs_s3_uri="s3://data-science/spark_event_logs",
    logs=False,
)
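The values passed via arguments= arrive in process.py as positional sys.argv entries. A minimal sketch of how they could be read there (the helper name parse_args is illustrative, not part of my actual script):

```python
import sys

def parse_args(argv):
    """Mirror of arguments=[input_bucket, number] passed to run()."""
    input_bucket = argv[1]        # e.g. "data-science"
    number = int(argv[2])         # e.g. 100 (passed as the string "100")
    return input_bucket, number

if __name__ == "__main__":
    bucket, n = parse_args(sys.argv)
    print(bucket, n)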
- Step4 (our process.py file):
# Initialize the Spark session:
from pyspark import SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark Processing Job") \
    .getOrCreate()
print("Spark session initialized.")

# Point the Spark context at the .py/.zip file with the dependencies:
sc = SparkContext.getOrCreate()
sc.addPyFile(local_dependencies_path)  # local path to dependencies.zip
print("Py file added")
# Import all Python dependencies:
try:
    import pandas as pd
    print(f"Pandas version: {pd.__version__}")
    import numpy as np
    print(f"Numpy version: {np.__version__}")
    from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
    print("Sklearn metrics loaded")
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    print("Gensim models loaded")
except ImportError as e:
    # Log the error and terminate the job
    print(f"Dependency loading error: {e}")
    raise SystemExit(f"Job terminated due to missing dependencies: {e}")
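To debug this, it might help to log which zip archives actually made it onto the interpreter path before the imports run, since addPyFile is supposed to put the archive there. A small sketch that could go in the except branch:

```python
import sys

# Print every sys.path entry that points at a zip archive; after
# sc.addPyFile(...) the dependencies zip should appear in this list.
zip_entries = [p for p in sys.path if p.endswith(".zip")]
print(zip_entries)
```

If dependencies.zip shows up here but the imports still fail, the problem is the layout inside the archive rather than the distribution step.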