
I have a business requirement to run Spark with SageMaker Processing, and the distributed code needs pandas, numpy, gensim and sklearn. I generated a .zip file of all the installed packages, uploaded it to S3, and passed the file as an input parameter to the run method of my PySpark processor. Spark picks up pandas and numpy without any issue, but it does not acknowledge gensim and sklearn; for those two I get:

No module named sklearn
No module named gensim

Here is what I tried, in four steps:

  • Step 1 (preparing our requirements.txt file):
numpy==1.23.5
pandas==1.5.3
scikit-learn==1.2.2
pyarrow==11.0.0
gensim==4.3.1
  • Step 2 (installing and packaging the dependencies, from the terminal of the SageMaker notebook instance):
cd /home/ec2-user/SageMaker/python_dependencies

pip install -r requirements.txt -t ./packages"

zip -r dependencies.zip ./packages

aws s3 cp dependencies.zip s3://data-science/code/dependencies/
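As a sanity check on the packaging step, a small snippet like the one below can list what actually ends up inside the archive (the local path is an assumption based on the commands above). It shows whether the packages sit at the zip root (e.g. sklearn/...) or under a packages/ prefix:

import zipfile

# Hypothetical local path to the archive built in Step 2
zip_path = "/home/ec2-user/SageMaker/python_dependencies/dependencies.zip"

with zipfile.ZipFile(zip_path) as zf:
    # Print the first entries to see whether modules sit at the zip root
    # or under a "packages/" prefix
    for name in zf.namelist()[:20]:
        print(name)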
  • Step 3 (updating the processor code):
from sagemaker.spark.processing import PySparkProcessor

# Define the PySparkProcessor:
spark_processor = PySparkProcessor(
    base_job_name="sm-spark",
    framework_version="3.1",
    role=role,
    instance_count=default_instance_count,  # Adjust the instance count as needed
    instance_type=default_instance,  # Adjust the instance type as needed
    max_runtime_in_seconds=1200)


# Setting input bucket:
input_bucket = "data-science"

# Define the number of records wanted:
number = "100"  # Change this as needed

# Run the Spark job:
spark_processor.run(
    submit_app="process.py",
    arguments=[input_bucket, number],
    submit_py_files=["s3://data-science/code/dependencies/dependencies.zip"],
    spark_event_logs_s3_uri="s3://data-science/spark_event_logs",
    logs=False,)
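For completeness, process.py reads the values passed via arguments=[input_bucket, number] positionally. A minimal sketch of what I mean (the variable names here are mine):

import sys

# Hypothetical argument handling; positions assume the order used in
# arguments=[input_bucket, number] above
input_bucket = sys.argv[1]
number_of_records = int(sys.argv[2])
print(f"Input bucket: {input_bucket}, records requested: {number_of_records}")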
  • Step 4 (our process.py file):
    from pyspark import SparkContext
    from pyspark.sql import SparkSession

    # Initialize Spark session:
    spark = SparkSession.builder \
        .appName("Spark Processing Job") \
        .getOrCreate()
    
    print("Spark session initialized with optimized configuration.")
    
    # setting the spark context to pick the py file for dependencies
    sc = SparkContext.getOrCreate()
    sc.addPyFile(local_dependencies_path)
    print("Py file added")

    
    # Import all python dependencies:
    try:
        import pandas as pd
        print(f"Pandas version: {pd.__version__}")
        import numpy as np
        print(f"Numpy version: {np.__version__}")
        from sklearn.metrics.pairwise import cosine_similarity
        from sklearn.metrics.pairwise import euclidean_distances
        print("Sklearn metrics loaded")
        from gensim.models.doc2vec import Doc2Vec, TaggedDocument
        print("Gensim models loaded")       
    except ImportError as e:
        # Log the error and terminate the job
        print(f"Dependency loading error: {e}")
        raise SystemExit(f"Job terminated due to missing dependencies: {e}")
