1 vote
1 answer
35 views

DataprocSparkSession package in Python error: "RuntimeError: Error while creating Dataproc Session"

I am using the below code to create a Dataproc Spark session to run a job: from google.cloud.dataproc_spark_connect import DataprocSparkSession from google.cloud.dataproc_v1 import Session session = Session(...
Siddiq Syed
0 votes
0 answers
50 views

What connector can be used for Google Cloud Pub/Sub with Cloud Dataproc (Spark 3.5.x)?

I am using Spark 3.5.x and would like to use the readStream() API to read structured streaming data using Java. I don't see any Pub/Sub connector available. I couldn't try Pub/Sub Lite because it is deprecated ...
Sunil • 441
0 votes
1 answer
63 views

ModuleNotFoundError in GCP after trying to submit a job

New to GCP, I am trying to submit a job in Dataproc with a .py file and an attached pythonproject.zip file (it is a project), but I am getting the error below: ModuleNotFoundError: No module ...
SofiaNiki
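A frequent cause of this ModuleNotFoundError is that the zipped package never reaches the executors' PYTHONPATH. A minimal sketch, assuming the zip lives in a GCS bucket (the bucket path and package name below are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Ship the project zip to the driver and every executor and put it on
    # sys.path; equivalent to --py-files on gcloud dataproc jobs submit pyspark.
    spark.sparkContext.addPyFile("gs://my-bucket/pythonproject.zip")

    import pythonproject  # hypothetical top-level package inside the zip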
3 votes
1 answer
112 views

Facing BigQuery write failure after upgrading Spark and Dataproc: "Schema mismatch: corresponding field path to Parquet column has 0 repeated fields"

We are currently migrating from Spark 2.4 to Spark 3.5 (and Dataproc 1 to 2), and our workflows are failing with the following error: Caused by: com.google.cloud.spark.bigquery.repackaged....
Anshul Dubey
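This Parquet repeated-field mismatch usually comes from the connector's indirect (Parquet-based) write path. A hedged sketch of two common workarounds, using documented spark-bigquery-connector options (the table and bucket names are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.table("source_view")  # placeholder for the failing DataFrame

    (df.write.format("bigquery")
       .option("writeMethod", "direct")  # skip the Parquet intermediate files
       # Or stay on the indirect path but avoid Parquet:
       # .option("writeMethod", "indirect")
       # .option("intermediateFormat", "avro")
       # .option("temporaryGcsBucket", "my-temp-bucket")
       .mode("append")
       .save("my_dataset.my_table"))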
1 vote
0 answers
45 views

Google Cloud Dataproc Cluster Creation Fails: "Failed to validate permissions for default service account"

Despite the Default Compute Engine Service Account having the necessary roles and being explicitly specified in my cluster creation command, I am still encountering the "Failed to validate ...
Lê Văn Đức
2 votes
1 answer
135 views

Spark memory error in thread spark-listener-group-eventLog

I have a pyspark application which is using Graphframes to compute connected components on a DataFrame. The edges DataFrame I generate has 2.7M records. When I run the code it is slow, but slowly ...
Jesus Diaz Rivero
1 vote
0 answers
74 views

Out of memory for a smaller dataset

I have a PySpark job reading an input data volume of just ~50-55 GB of Parquet data from a Delta table. The job uses n2-highmem-4 GCP VMs and 1-15 workers with autoscaling. Each worker VM is of type n2-highmem-...
user16798185
1 vote
2 answers
83 views

How do you run Python Hadoop Jobs on Dataproc?

I am trying to run my Python code for a Hadoop job on Dataproc. I have a mapper.py and a reducer.py file. I am running this command in the terminal: gcloud dataproc jobs submit hadoop \ --cluster=my-...
The Beast • 163
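For Python mapper/reducer scripts, Hadoop streaming is the usual route. A sketch using the google-cloud-dataproc client library (project, region, cluster, and GCS paths are placeholders; the streaming jar path is its standard Dataproc location):

    from google.cloud import dataproc_v1

    # The regional endpoint must match the cluster's region.
    client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
    )

    job = {
        "placement": {"cluster_name": "my-cluster"},
        "hadoop_job": {
            # Hadoop streaming turns arbitrary scripts into map/reduce tasks.
            "main_jar_file_uri": "file:///usr/lib/hadoop/hadoop-streaming.jar",
            "args": [
                "-files", "gs://my-bucket/mapper.py,gs://my-bucket/reducer.py",
                "-mapper", "python3 mapper.py",
                "-reducer", "python3 reducer.py",
                "-input", "gs://my-bucket/input/",
                "-output", "gs://my-bucket/output/",
            ],
        },
    }

    operation = client.submit_job_as_operation(
        request={"project_id": "my-project", "region": "us-central1", "job": job}
    )
    print(operation.result().driver_output_resource_uri)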
2 votes
0 answers
54 views

Why does Spark raise an IOException while running an aggregation on a streaming DataFrame in Dataproc 2.2?

I am trying to migrate a job running on Dataproc 2.1 images (Spark 3.3, Python 3.10) to Dataproc 2.2 images (Spark 3.5, Python 3.11). However, I encounter an error on one of my queries. After further ...
AlexisBRENON • 3,169
1 vote
0 answers
34 views

Partial records being read in PySpark through Dataproc

I have a Google Dataproc job that reads a CSV file from Google Cloud Storage with the following headers: Content-Type: application/octet-stream, Content-Encoding: gzip, FileName: gs://...
Bob • 383
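Objects uploaded with Content-Encoding: gzip are decompressed in transit by GCS (decompressive transcoding), which can confuse readers that rely on reported sizes and offsets. A hedged first diagnostic, using the google-cloud-storage client (bucket and object names are placeholders):

    from google.cloud import storage

    client = storage.Client()
    blob = client.bucket("my-bucket").get_blob("path/to/file.csv")

    # If content_encoding is 'gzip', GCS transcodes the object on download;
    # re-uploading as plain .csv.gz with no Content-Encoding avoids that.
    print(blob.content_type, blob.content_encoding)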
2 votes
1 answer
112 views

Default behavior of Spark 3.5.1 when writing NUMERIC/BIGNUMERIC to BigQuery

As per the documentation for Spark 3.5.1 with the latest spark-bigquery-connector, Spark Decimal(38,0) should be written as NUMERIC in BigQuery: https://github.com/GoogleCloudDataproc/spark-bigquery-connector?...
Abhilash
1 vote
2 answers
134 views

How to pass arguments from GCP Workflows into Dataproc

I'm using GCP Workflows to define steps for a data engineering project. The workflow's input consists of multiple parameters, which are provided through the Workflows API. I defined a GCP ...
54m • 777
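On the Dataproc side, whatever the workflow puts in the job's args list arrives as ordinary command-line arguments to the driver script, so standard argparse works. A minimal sketch of the receiving end (the parameter names are illustrative):

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--input-path")
    parser.add_argument("--run-date")
    args = parser.parse_args()

    print(args.input_path, args.run_date)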
1 vote
0 answers
104 views

error "google.api_core.exceptions.InvalidArgument: 400 Cluster name is required" while trying to use airflow DataprocSubmitJobOperator

I'm trying to use the Dataproc submit job operator from Airflow (https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_api/airflow/providers/google/cloud/operators/dataproc/index....
Abhijit Aravind
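This 400 usually means the job dict passed to the operator is missing placement.cluster_name. A hedged sketch of a minimal job definition (project, region, cluster, and script URI are placeholders):

    from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

    PYSPARK_JOB = {
        "reference": {"project_id": "my-project"},
        "placement": {"cluster_name": "my-cluster"},  # omitting this yields "Cluster name is required"
        "pyspark_job": {"main_python_file_uri": "gs://my-bucket/job.py"},
    }

    submit_job = DataprocSubmitJobOperator(
        task_id="submit_pyspark_job",
        project_id="my-project",
        region="us-central1",
        job=PYSPARK_JOB,
    )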
3 votes
2 answers
391 views

GoogleHadoopOutputStream: hflush(): No-op due to rate limit: increase in Class A operations for GCS bucket

We are running Spark ingestion jobs which process multiple files in batches. We read CSV or TSV files in batches, create a DataFrame, and do some transformations before loading it into BigQuery ...
Vikrant Singh Rana
1 vote
1 answer
236 views

Dataproc PySpark Job Fails with BigQuery Connector Issue - java.util.ServiceConfigurationError

I'm trying to run a PySpark job on Google Cloud Dataproc that reads data from BigQuery, processes it, and writes it back. However, the job keeps failing with the following error: java.util....
Shima K • 156
1 vote
0 answers
41 views

Dataproc Serverless Failure on Rerun

I have written a Spark job to read from a Kafka topic, do some processing, and dump the data in Avro format to GCS. I am deploying this Java application on Dataproc Serverless using the TriggerOnce mode, so ...
Ravi Jain • 138
1 vote
1 answer
84 views

java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver while running JAR on google Dataproc cluster

I have the driver dependency in the pom.xml and I am using the Maven Shade plugin to create an uber JAR. I do see the driver dependency correctly listed in the JAR file. The JAR runs fine in IntelliJ, but on ...
xOneOne • 81
1 vote
0 answers
41 views

Dataproc Hive Job - OutOfMemoryError: Java heap space

I have a Dataproc cluster where we run an INSERT OVERWRITE query through the Hive CLI, which fails with OutOfMemoryError: Java heap space. We adjusted memory configurations for reducers and Tez tasks, ...
Parmeet Singh
1 vote
0 answers
80 views

Google Cloud Data Fusion streaming pipeline and Spark jobs with empty rows

I have a Google Cloud Data Fusion streaming pipeline that receives data from Google Pub/Sub. Micro-batching is performed every 5 seconds. Since data doesn’t always arrive consistently, I see many ...
alexanoid • 26.1k
1 vote
3 answers
633 views

Caused by: java.lang.IllegalStateException: This connector was made for Scala null, it was not meant to run on Scala 2.12

The code that raises the error is as follows: String strJsonContent = SessionContext.getSparkSession().read().json(filePath).toJSON().first(); And I'm using Maven to build the package without ...
Delevin Zhong
1 vote
1 answer
227 views

Creating Dataproc Cluster with public-ip-address using DataprocCreateClusterOperator in Airflow

I am trying to create a Dataproc Cluster in my GCP project within an Airflow DAG using the DataprocCreateClusterOperator. I am using the ClusterGenerator to generate the config for the cluster. ...
Mads • 105
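ClusterGenerator exposes an internal_ip_only flag; setting it to False should give the nodes external IPs. A hedged sketch (project, zone, and machine sizes are placeholders):

    from airflow.providers.google.cloud.operators.dataproc import (
        ClusterGenerator,
        DataprocCreateClusterOperator,
    )

    cluster_config = ClusterGenerator(
        project_id="my-project",
        zone="us-central1-a",
        master_machine_type="n2-standard-4",
        worker_machine_type="n2-standard-4",
        num_workers=2,
        internal_ip_only=False,  # False => nodes get public IP addresses
    ).make()

    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id="my-project",
        region="us-central1",
        cluster_name="my-cluster",
        cluster_config=cluster_config,
    )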
1 vote
1 answer
316 views

How can I debug an InactiveRpcError with status INVALID_ARGUMENT when submitting a serverless batch job?

When submitting a Dataproc Serverless batch request, we have been getting errors like: grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with: status = StatusCode....
Jacob Promisel
2 votes
1 answer
294 views

Spark: The deserializer is not supported: need a(n) "ARRAY" field but got "MAP<STRING, STRING>"

Recently we migrated to Dataproc image version 2.2, along with Scala 2.12.18 and Spark 3.5. package test import org.apache.spark.sql.SparkSession import test.Model._ ...
Vikrant Singh Rana
2 votes
0 answers
276 views

UDF method with return type parameter is deprecated since Spark 3.0.0: how to return a struct type from a UDF function?

We are upgrading the GCP Dataproc cluster to the 2.2-debian12 image, with Spark 3.5.0 and Scala 2.12.18. With these versions, one major change is that the udf method with a return type parameter is ...
chanchal ahuja
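One replacement pattern is to declare the struct as an explicit schema and pass it as the UDF's return type. The question is about Scala, but the same pattern, shown here as a PySpark sketch for illustration, carries over:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.getOrCreate()

    # The struct the UDF returns, declared as a schema object.
    person_schema = StructType([
        StructField("name", StringType()),
        StructField("age", IntegerType()),
    ])

    @F.udf(returnType=person_schema)
    def make_person(raw):
        name, age = raw.split(":")
        return (name, int(age))  # a tuple maps onto the declared struct

    df = spark.createDataFrame([("alice:30",)], ["raw"])
    df.select(make_person("raw").alias("person")).show()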
2 votes
2 answers
488 views

Error upgrading Dataproc Serverless version from 2.1 to 2.2

I have changed the version of a Dataproc Serverless batch from 2.1 to 2.2, and now when I run it I get the following error: Exception in thread "main" java.util.ServiceConfigurationError: org....
Chaos • 21
2 votes
0 answers
141 views

How to reduce GCS A and B operations in a Spark Structured Streaming pipeline in Dataproc?

I'm running a data pipeline where an on-premise NiFi flow writes JSON files in streaming fashion to a GCS bucket. I have 5 tables, each with its own path, generating around 140k objects per day. The bucket ...
Puredepatata
1 vote
1 answer
316 views

Accessing BigQuery Dataset from a Different GCP Project Using PySpark on Dataproc

I am working with BigQuery, Dataproc, Workflows, and Cloud Storage in Google Cloud using Python. I have two GCP projects: gcp-project1: contains the BigQuery dataset gcp-project1.my_dataset.my_table ...
Henry Xiloj Herrera
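The spark-bigquery-connector can read across projects: the table option takes a fully qualified table ID, and parentProject sets the project the API calls are billed against. A hedged sketch reusing the question's project IDs:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = (
        spark.read.format("bigquery")
        .option("table", "gcp-project1.my_dataset.my_table")  # data lives here
        .option("parentProject", "gcp-project2")              # API/billing project
        .load()
    )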
1 vote
0 answers
48 views

How do I correctly configure the staging and temp buckets in DataprocCreateClusterOperator?

I'm trying to find out how to set the temp and staging buckets in the Dataproc operator. I've searched all over the internet and didn't find a good answer. import pendulum from datetime import timedelta ...
GuilhermeMP
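In the raw cluster_config dict accepted by DataprocCreateClusterOperator, the staging bucket is the top-level config_bucket field and the temp bucket is temp_bucket. A hedged sketch (bucket, project, and machine names are placeholders):

    from airflow.providers.google.cloud.operators.dataproc import DataprocCreateClusterOperator

    CLUSTER_CONFIG = {
        "config_bucket": "my-staging-bucket",  # Dataproc staging bucket
        "temp_bucket": "my-temp-bucket",       # Dataproc temp bucket
        "master_config": {"num_instances": 1, "machine_type_uri": "n2-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n2-standard-4"},
    }

    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id="my-project",
        region="us-central1",
        cluster_name="my-cluster",
        cluster_config=CLUSTER_CONFIG,
    )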
2 votes
1 answer
73 views

Spark on Dataproc: Slow Data Insertion into BigQuery for Large Datasets (~30M Records)

I have a Scala Spark job running on Google Cloud Dataproc that sources and writes data to Google BigQuery (BQ) tables. The code works fine for smaller datasets, but when processing larger volumes (e.g....
Sekar Ramu
1 vote
1 answer
395 views

PySpark performance problems when writing to a table in BigQuery

I am new to the world of PySpark and am experiencing serious performance problems when writing data from a DataFrame to a table in BigQuery. I have tried everything I have read: recommendations, using ...
aleretgub
2 votes
0 answers
105 views

Delta compatibility problem with PySpark in Dataproc GKE env

I've created a Dataproc cluster on GKE with a custom image with PySpark 3.5.0, but I can't get it to work with Delta. The custom image Dockerfile is this: FROM us-central1-docker.pkg.dev/cloud-...
Pedro • 21
2 votes
0 answers
28 views

How can I install html5lib on a Dataproc cluster?

I have a Dataproc pipeline with which I do web scraping and store data in GCP. The task setup is something like this: create_dataproc_cluster = DataprocCreateClusterOperator( task_id='...
Sara • 75
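Dataproc publishes a pip-install initialization action that installs packages at cluster-creation time from a PIP_PACKAGES metadata key. A hedged sketch of wiring it into the operator (project, region, and cluster name are placeholders):

    from airflow.providers.google.cloud.operators.dataproc import DataprocCreateClusterOperator

    CLUSTER_CONFIG = {
        "gce_cluster_config": {
            "metadata": {"PIP_PACKAGES": "html5lib beautifulsoup4"},
        },
        "initialization_actions": [{
            "executable_file": (
                "gs://goog-dataproc-initialization-actions-us-central1/"
                "python/pip-install.sh"
            ),
        }],
    }

    create_dataproc_cluster = DataprocCreateClusterOperator(
        task_id="create_dataproc_cluster",
        project_id="my-project",
        region="us-central1",
        cluster_name="scraper-cluster",
        cluster_config=CLUSTER_CONFIG,
    )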
6 votes
0 answers
3k views

spark_catalog requires a single-part namespace in dbt python incremental model

Description: Using the dbt functionality that allows one to create a python model, I created a model that reads from some BigQuery table, performs some calculations and writes back to BigQuery. It ...
Carlos Veríssimo
2 votes
1 answer
548 views

How to resolve: java.lang.NoSuchMethodError: 'scala.collection.Seq org.apache.spark.sql.types.StructType.toAttributes()'

Running a simple ETL PySpark job on Dataproc 2.2 with the job property spark.jars.packages set to io.delta:delta-core_2.12:2.4.0. Other settings are left at their defaults. I have the following config: conf = ( ...
dbkoop • 101
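That NoSuchMethodError is the classic Delta/Spark version mismatch: delta-core 2.4.x targets Spark 3.4, while Dataproc 2.2 ships Spark 3.5, where Delta moved to the delta-spark 3.x artifacts. A hedged sketch of a matching config (the exact 3.x patch version is an assumption to pin appropriately):

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = (
        SparkConf()
        # Delta 3.x matches Spark 3.5; delta-core 2.4.0 does not.
        .set("spark.jars.packages", "io.delta:delta-spark_2.12:3.2.0")
        .set("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .set("spark.sql.catalog.spark_catalog",
             "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )

    spark = SparkSession.builder.config(conf=conf).getOrCreate()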
1 vote
0 answers
31 views

Create an external Hive table for multiline JSON

I am trying to create a Hive table for the given multiline JSON, but the actual result does not match the expected result. Sample JSON file: { "name": "Adil Abro", "...
yac • 11
1 vote
1 answer
77 views

Not able to set the Spark log properties programmatically via Python

I was trying to suppress the Spark logging by specifying my own log4j.properties file: gcloud dataproc jobs submit spark \ --cluster test-dataproc-cluster \ --region europe-north1 \ --files gs://...
Vikrant Singh Rana
3 votes
0 answers
146 views

Bigtable read and write using Dataproc with Compute Engine results in "Key not found"

I am experimenting with reading and writing data in Cloud Bigtable using Dataproc on Compute Engine and a PySpark job with the spark-bigtable-connector. I got an example from the spark-bigtable repo and ...
Suga Raj • 591
1 vote
1 answer
108 views

Not able to connect Flink to Kafka

Getting the error below while connecting Flink to Kafka: Caused by: java.lang.NoClassDefFoundError: org/apache/kafka/clients/admin/AdminClient. I am using Flink 1.17 and flink-sql-connector-kafka-1....
Om Prakash
2 votes
0 answers
42 views

How to measure total execution time against Bigtable from Spark in Scala

I am attempting to measure the total execution time from Spark to Bigtable. However, when I wrap the following code around the Bigtable-related function, it consistently shows only 0.5 seconds, ...
Kuengaer
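A constant ~0.5 s usually means only query planning was timed: Spark transformations are lazy, so no Bigtable I/O happens until an action runs. A sketch of the distinction (in PySpark for brevity; the same holds in Scala):

    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10_000_000)

    start = time.monotonic()
    filtered = df.filter("id % 2 == 0")  # lazy: builds a plan, does no work
    print(f"transformation: {time.monotonic() - start:.3f}s")

    start = time.monotonic()
    filtered.count()                     # action: execution happens here
    print(f"action: {time.monotonic() - start:.3f}s")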
1 vote
1 answer
100 views

Composer v2.6.6: job completes successfully, but task exited with Negsignal.SIGKILL

We have Composer 2.6.6 (Airflow 2.5.3) and a job, VANI-UEBA3, which runs on Dataproc Serverless Batches ... the job runs through fine (as shown in the Dataproc Serverless UI), but the Composer UI ...
Karan Alang • 1,101
1 vote
1 answer
281 views

Trigger rule "one_success" doesn't work for "DataprocCreateClusterOperator"

I have a case where one of my operators, a DataprocCreateClusterOperator, never triggers, as if "all_success" were still set for it. It runs fine if it's the very first task, but I don't ...
Aleksander Lipka
2 votes
0 answers
121 views

Disabling YARN logs on a Dataproc cluster

I am trying to reduce the Class A operations on a GCS bucket configured to store YARN and Spark history logs; this is costing us a lot. I disabled Spark logs by editing the spark-defaults....
Vikrant Singh Rana
1 vote
0 answers
90 views

Selecting a Dataproc Cluster Size with autoscaling ON

I am new to GCP and probably have a very basic question. We are running our PySpark jobs on an ephemeral Dataproc cluster with the autoscaling property on. In our code we have used ...
Kaushik Ghosh
1 vote
0 answers
154 views

Lifecycle policies on Dataproc staging bucket subfolders

We have a Dataproc cluster staging bucket where all the Spark job logs are stored: eu-digi-pipe-dataproc-stage/google-cloud-dataproc-metainfo/d0decf20-21fd-4536-bbc4-5a4f829e49bf/jobs/google-...
Vikrant Singh Rana
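GCS lifecycle rules cannot target a subfolder by path alone, but rule conditions support prefix matching. A hedged sketch with the google-cloud-storage client (the 30-day age is arbitrary, and matches_prefix support depends on the installed client version, so treat both as assumptions):

    from google.cloud import storage

    client = storage.Client()
    bucket = client.get_bucket("eu-digi-pipe-dataproc-stage")

    # Delete objects under the metainfo prefix once they are 30 days old.
    bucket.add_lifecycle_delete_rule(
        age=30,
        matches_prefix=["google-cloud-dataproc-metainfo/"],
    )
    bucket.patch()  # persist the updated lifecycle rules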
1 vote
0 answers
39 views

How to parse a huge number of JSON files efficiently in PySpark

I'm trying to parse around 100 GB of small JSON files using PySpark. The files are stored in a Google Cloud Storage bucket and come compressed: *.jsonl.gz. How can I do this efficiently?
Aleksander Lipka
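Spark decompresses gzip transparently, and since gzip is not splittable, each .jsonl.gz file becomes exactly one task, so a large number of small files parallelizes naturally. A minimal sketch (bucket paths and the partition count are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # spark.read.json handles *.gz on the fly; one file => one task.
    df = spark.read.json("gs://my-bucket/data/*.jsonl.gz")

    # Rebalance after reading, since per-file partitions are small and skewed,
    # and consider persisting to a splittable format for downstream jobs.
    df.repartition(512).write.parquet("gs://my-bucket/parsed/")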
1 vote
0 answers
164 views

BigQuery WRITE_TRUNCATE option doesn't work in the Spark BigQuery connector

I'm trying to overwrite a BigQuery table using the WRITE_TRUNCATE option with the Spark BigQuery connector. I have verified that the target table is updated, as the Last modified timestamp changes ...
lang • 51
2 votes
0 answers
197 views

Dataproc Serverless Batch Spark Web UI Not Displaying Executors Logs

I want to run a Spark job using the JDBCToBigQuery template in Dataproc Batch on GCP. The job runs successfully, but there are no executor logs (stdout, stderr) on the Executors tab in the Web UI like ...
lang • 51
1 vote
0 answers
79 views

java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary

I am trying to read multiple Parquet files from GCS using a Dataproc Spark job: df = sc.read.option("mergeSchema", "true").parquet(remote_path). The above code throws an error saying: ...
ak1234 • 221
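This PlainValuesDictionary error typically surfaces when mergeSchema unifies files whose physical column types disagree (for example float in some files, double in others) and the vectorized Parquet reader hits the mismatch. A hedged workaround sketch; it trades speed for tolerance, and aligning the source schemas is the real fix:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The non-vectorized reader is slower but more forgiving of
    # per-file type differences surfaced by mergeSchema.
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

    df = spark.read.option("mergeSchema", "true").parquet("gs://my-bucket/data/")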
1 vote
1 answer
142 views

Build a CI/CD pipeline for Airflow and Spark

Using GCP Dataproc and Airflow (Composer), I am able to deploy the PySpark script and get a successful result. After this, I'm unsure about artifact preparation and building a CI/CD pipeline. Which tool should I ...
Ajith Lakshmanan
1 vote
0 answers
194 views

Why does my Spark Structured Streaming query writing into a Delta table fail with java.io.IOException: Error listing gs://bucket/_delta_log/

I have a Spark Structured Streaming query which reads from a Kafka topic and writes the data into a Delta table. This query had been running fine for a few months with no issues, but since last week ...
Isira Wijayarathne
