1,673 questions
1 vote · 1 answer · 35 views
DataprocSparkSession package in Python error - "RuntimeError: Error while creating Dataproc Session"
I am using the code below to create a Dataproc Spark session to run a job:
from google.cloud.dataproc_spark_connect import DataprocSparkSession
from google.cloud.dataproc_v1 import Session
session = Session(...
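For comparison, a minimal sketch of a full session setup with this package; the environment-variable configuration and the dataprocSessionConfig builder method are assumptions based on the package README and may differ by version, and all project, region, and subnet values are placeholders:

import os
from google.cloud.dataproc_spark_connect import DataprocSparkSession
from google.cloud.dataproc_v1 import Session

# Placeholders: substitute your own project, region, and subnet.
os.environ["GOOGLE_CLOUD_PROJECT"] = "my-project"
os.environ["GOOGLE_CLOUD_REGION"] = "us-central1"

session_config = Session()
session_config.environment_config.execution_config.subnetwork_uri = (
    "projects/my-project/regions/us-central1/subnetworks/default"
)

spark = (
    DataprocSparkSession.builder
    .dataprocSessionConfig(session_config)  # assumed builder method name
    .getOrCreate()
)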
0 votes · 0 answers · 50 views
What connector can be used for Google Cloud Pub/Sub along with Cloud Dataproc (Spark 3.5.x)?
I am using Spark 3.5.x and would like to use the readStream() API for structured streaming in Java.
I don't see any Pub/Sub connector available. Couldn't try Pub/Sub Lite because it is deprecated ...
0 votes · 1 answer · 63 views
ModuleNotFoundError in GCP after trying to submit a job
New to GCP; I am trying to submit a job in Dataproc with a .py file and an attached pythonproject.zip file (it is a project), but I am getting the error below: ModuleNotFoundError: No module ...
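A hedged sketch of a submission that makes the zip importable, using the google-cloud-dataproc Python client; python_file_uris is the --py-files equivalent, and the project, bucket, and cluster names are placeholders:

from google.cloud import dataproc_v1

client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)
job = {
    "placement": {"cluster_name": "my-cluster"},
    "pyspark_job": {
        "main_python_file_uri": "gs://my-bucket/main.py",
        # Ships the project zip with the job and puts it on the PYTHONPATH;
        # the zip must contain the package at its top level to be importable.
        "python_file_uris": ["gs://my-bucket/pythonproject.zip"],
    },
}
client.submit_job(project_id="my-project", region="us-central1", job=job)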
3 votes · 1 answer · 112 views
Facing BigQuery write failure after upgrading Spark and Dataproc: "Schema mismatch: corresponding field path to Parquet column has 0 repeated fields"
We are currently migrating from Spark 2.4 to Spark 3.5 (and Dataproc 1 to 2), and our workflows are failing with the following error:
Caused by: com.google.cloud.spark.bigquery.repackaged....
1 vote · 0 answers · 45 views
Google Cloud Dataproc Cluster Creation Fails: "Failed to validate permissions for default service account"
Despite the Default Compute Engine Service Account having the necessary roles and being explicitly specified in my cluster creation command, I am still encountering the "Failed to validate ...
2 votes · 1 answer · 135 views
Spark memory error in thread spark-listener-group-eventLog
I have a PySpark application which uses GraphFrames to compute connected components on a DataFrame.
The edges DataFrame I generate has 2.7M records.
When I run the code it is slow, but slowly ...
1 vote · 0 answers · 74 views
Out of memory for a smaller dataset
I have a PySpark job reading an input volume of just ~50-55 GB of Parquet data from a Delta table. The job uses n2-highmem-4 GCP VMs and 1-15 workers with autoscaling. Each worker VM is of type n2-highmem-...
1 vote · 2 answers · 83 views
How do you run Python Hadoop jobs on Dataproc?
I am trying to run my Python code for a Hadoop job on Dataproc. I have a mapper.py and a reducer.py file. I am running this command in the terminal:
gcloud dataproc jobs submit hadoop \
--cluster=my-...
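For mapper.py/reducer.py scripts, the job has to go through Hadoop streaming rather than a plain main class. A sketch using the google-cloud-dataproc client, assuming the streaming jar path documented for Dataproc images; all bucket and cluster names are placeholders:

from google.cloud import dataproc_v1

client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)
job = {
    "placement": {"cluster_name": "my-cluster"},
    "hadoop_job": {
        # Streaming jar shipped with the Dataproc image; path may vary by image.
        "main_jar_file_uri": "file:///usr/lib/hadoop-mapreduce/hadoop-streaming.jar",
        "args": [
            "-files", "gs://my-bucket/mapper.py,gs://my-bucket/reducer.py",
            "-mapper", "mapper.py",
            "-reducer", "reducer.py",
            "-input", "gs://my-bucket/input/",
            "-output", "gs://my-bucket/output/",
        ],
    },
}
client.submit_job(project_id="my-project", region="us-central1", job=job)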
2 votes · 0 answers · 54 views
Why does Spark raise an IOException while running an aggregation on a streaming DataFrame in Dataproc 2.2?
I am trying to migrate a job running on Dataproc 2.1 images (Spark 3.3, Python 3.10) to Dataproc 2.2 images (Spark 3.5, Python 3.11).
However, I encounter an error on one of my queries. After further ...
1 vote · 0 answers · 34 views
Partial records being read in PySpark through Dataproc
I have a Google Dataproc job that reads a CSV file from Google Cloud Storage with the following headers:
Content-type : application/octet-stream
Content-encoding : gzip
FileName: gs://...
2 votes · 1 answer · 112 views
Default behavior of Spark 3.5.1 when writing Numeric/Bignumeric to BigQuery
As per the documentation for Spark 3.5.1 with the latest spark-bigquery-connector, Spark Decimal(38,0) should be written as NUMERIC in BigQuery.
https://github.com/GoogleCloudDataproc/spark-bigquery-connector?...
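Context that may help here: BigQuery NUMERIC is effectively DECIMAL(38,9), so it holds at most 29 integer digits, and Spark Decimal(38,0) cannot fit it. A hedged PySpark sketch of making the target type explicit by casting before the write instead of relying on the default mapping; table and column names are placeholders:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("decimal-to-bq").getOrCreate()
df = spark.read.parquet("gs://my-bucket/input/")

# Cast to a shape NUMERIC can actually hold (at most 29 integer digits).
df = df.withColumn("amount", F.col("amount").cast("decimal(29,0)"))

(
    df.write.format("bigquery")
    .option("writeMethod", "direct")
    .option("table", "my-project.my_dataset.my_table")
    .mode("append")
    .save()
)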
1 vote · 2 answers · 134 views
How to pass arguments from GCP Workflows into Dataproc
I'm using GCP Workflows to define steps for a data engineering project. The input of the workflow consists of multiple parameters which are provided through the Workflows API.
I defined a GCP ...
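Whatever the Workflows step puts in the Dataproc job's args list arrives as plain argv in the driver, so the PySpark side can stay ordinary argument parsing. A minimal sketch; the flag name and value are illustrative:

import argparse

# Set by the Workflows step in the job body, e.g.:
# "pyspark_job": {
#     "main_python_file_uri": "gs://my-bucket/main.py",
#     "args": ["--run-date", "2024-01-01"],
# }
parser = argparse.ArgumentParser()
parser.add_argument("--run-date", required=True)
args = parser.parse_args()
print(f"processing {args.run_date}")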
1 vote · 0 answers · 104 views
Error "google.api_core.exceptions.InvalidArgument: 400 Cluster name is required" while trying to use the Airflow DataprocSubmitJobOperator
I'm trying to use the Dataproc submit job operator from Airflow (https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_api/airflow/providers/google/cloud/operators/dataproc/index....
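This error usually means the job body is missing placement.cluster_name. A minimal sketch of a job dict that satisfies the operator; project, region, cluster, and file names are placeholders:

from airflow.providers.google.cloud.operators.dataproc import (
    DataprocSubmitJobOperator,
)

JOB = {
    "reference": {"project_id": "my-project"},
    "placement": {"cluster_name": "my-cluster"},  # the field the 400 complains about
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/main.py"},
}

submit_job = DataprocSubmitJobOperator(
    task_id="submit_job",
    project_id="my-project",
    region="us-central1",
    job=JOB,
)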
3 votes · 2 answers · 391 views
GoogleHadoopOutputStream: hflush(): No-op due to rate limit: increase in Class A operations for GCS bucket
We are running our Spark ingestion jobs, which process multiple files in batches.
We read CSV or TSV files in batches, create a DataFrame, and do some transformations before loading it into BigQuery ...
1 vote · 1 answer · 236 views
Dataproc PySpark Job Fails with BigQuery Connector Issue - java.util.ServiceConfigurationError
I'm trying to run a PySpark job on Google Cloud Dataproc that reads data from BigQuery, processes it, and writes it back. However, the job keeps failing with the following error:
java.util....
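java.util.ServiceConfigurationError in this context is often two copies of the connector on the classpath, the image's built-in jar plus one added at submit time. A hedged sketch of pinning exactly one version through the job properties via the Python client; the connector version shown is illustrative:

from google.cloud import dataproc_v1

client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)
job = {
    "placement": {"cluster_name": "my-cluster"},
    "pyspark_job": {
        "main_python_file_uri": "gs://my-bucket/job.py",
        # Pin a single connector version and remove any other copy
        # (e.g. a jar passed via --jars or baked into the cluster).
        "properties": {
            "spark.jars.packages": "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.36.1",
        },
    },
}
client.submit_job(project_id="my-project", region="us-central1", job=job)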
1 vote · 0 answers · 41 views
Dataproc Serverless Failure on Rerun
I have written a Spark job to read from a Kafka topic, do some processing, and dump the data in Avro format to GCS.
I am deploying this Java application on Dataproc Serverless using the Trigger.Once mode, so ...
1 vote · 1 answer · 84 views
java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver while running JAR on Google Dataproc cluster
I have the driver dependency in the POM.xml and I am using the Maven Shade plugin to create an uber JAR. I do see the driver dependency correctly listed in the JAR file. The JAR runs fine in IntelliJ but on ...
1 vote · 0 answers · 41 views
Dataproc Hive Job - OutOfMemoryError: Java heap space
I have a Dataproc cluster; we are running an INSERT OVERWRITE query through the Hive CLI which fails with OutOfMemoryError: Java heap space.
We adjusted memory configurations for reducers and Tez tasks, ...
1 vote · 0 answers · 80 views
Google Cloud Data Fusion streaming pipeline and Spark jobs with empty rows
I have a Google Cloud Data Fusion streaming pipeline that receives data from Google Pub/Sub. Micro-batching is performed every 5 seconds. Since data doesn’t always arrive consistently, I see many ...
1 vote · 3 answers · 633 views
Caused by: java.lang.IllegalStateException: This connector was made for Scala null, it was not meant to run on Scala 2.12
The code that triggers the error is as follows:
String strJsonContent = SessionContext
.getSparkSession()
.read()
.json(filePath)
.toJSON()
.first();
And I'm using Maven to build the package without ...
1 vote · 1 answer · 227 views
Creating Dataproc Cluster with public-ip-address using DataprocCreateClusterOperator in Airflow
I am trying to create a Dataproc Cluster in my GCP project within an Airflow DAG using the DataprocCreateClusterOperator. I am using the ClusterGenerator to generate the config for the cluster. ...
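A minimal sketch of the relevant ClusterGenerator flag: internal_ip_only=False requests external (public) IPs for the cluster VMs. Parameter availability may vary by provider version, and all names are placeholders:

from airflow.providers.google.cloud.operators.dataproc import (
    ClusterGenerator,
    DataprocCreateClusterOperator,
)

cluster_config = ClusterGenerator(
    project_id="my-project",
    num_workers=2,
    master_machine_type="n2-standard-4",
    worker_machine_type="n2-standard-4",
    internal_ip_only=False,  # give the VMs public IP addresses
).make()

create_cluster = DataprocCreateClusterOperator(
    task_id="create_cluster",
    project_id="my-project",
    region="us-central1",
    cluster_name="my-cluster",
    cluster_config=cluster_config,
)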
1 vote · 1 answer · 316 views
How can I debug an InactiveRpcError with status INVALID_ARGUMENT when submitting a serverless batch job?
When submitting a Dataproc Serverless batch request, we have been getting errors like:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode....
2 votes · 1 answer · 294 views
Spark: The deserializer is not supported: need a(n) "ARRAY" field but got "MAP<STRING, STRING>"
We recently migrated to the Dataproc 2.2 image, with Scala 2.12.18 and Spark 3.5.
package test
import org.apache.spark.sql.SparkSession
import test.Model._
...
2 votes · 0 answers · 276 views
UDF method with return type parameter is deprecated since Spark 3.0.0. How to return a struct type from a UDF function?
We are upgrading the GCP Dataproc cluster to the 2.2-debian12 image, with Spark version 3.5.0 and Scala version 2.12.18. With these versions, one major change is that the udf method with a return type parameter is ...
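The question is about Scala (where the usual fix is a typed udf over a case class, letting Spark infer the schema), but for reference here is a PySpark sketch of the non-deprecated pattern: declare the struct as the UDF's returnType. All names are illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

result_schema = StructType([
    StructField("label", StringType()),
    StructField("score", IntegerType()),
])

@udf(returnType=result_schema)
def classify(value: int):
    # Return a tuple matching the declared struct fields.
    return ("high" if value > 10 else "low", value)

df = spark.range(20).withColumn("result", classify("id"))
df.select("id", "result.label", "result.score").show()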
2 votes · 2 answers · 488 views
Error upgrading Dataproc Serverless version from 2.1 to 2.2
I have changed the runtime version of a Dataproc Serverless batch from 2.1 to 2.2, and now when I run it I get the following error:
Exception in thread "main" java.util.ServiceConfigurationError: org....
2 votes · 0 answers · 141 views
How to reduce GCS A and B operations in a Spark Structured Streaming pipeline in Dataproc?
I'm running a data pipeline where an on-premises NiFi flow streams JSON files into a GCS bucket. I have 5 tables, each with their own path, generating around 140k objects per day. The bucket ...
1 vote · 1 answer · 316 views
Accessing BigQuery Dataset from a Different GCP Project Using PySpark on Dataproc
I am working with BigQuery, Dataproc, Workflows, and Cloud Storage in Google Cloud using Python.
I have two GCP projects:
gcp-project1: contains the BigQuery dataset gcp-project1.my_dataset.my_table
...
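A minimal sketch of the usual cross-project pattern with the spark-bigquery-connector: the table option carries the project that owns the data, while parentProject sets the project used for billing and job execution. The second project name below is a placeholder, since it is elided above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cross-project-bq").getOrCreate()

df = (
    spark.read.format("bigquery")
    .option("parentProject", "my-billing-project")  # placeholder project
    .option("table", "gcp-project1.my_dataset.my_table")
    .load()
)
df.show()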
1 vote · 0 answers · 48 views
How do I correctly configure the staging and temp buckets in DataprocCreateClusterOperator?
I'm trying to find out how to set the temp and staging buckets in the Dataproc operator. I've searched all over the internet and didn't find a good answer.
import pendulum
from datetime import timedelta
...
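A hedged sketch, assuming the ClusterConfig fields config_bucket (the staging bucket) and temp_bucket can be set directly in the cluster_config dict passed to the operator; bucket, machine, and project names are placeholders:

from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
)

CLUSTER_CONFIG = {
    "config_bucket": "my-staging-bucket",  # staging bucket
    "temp_bucket": "my-temp-bucket",
    "master_config": {"num_instances": 1, "machine_type_uri": "n2-standard-4"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n2-standard-4"},
}

create_cluster = DataprocCreateClusterOperator(
    task_id="create_cluster",
    project_id="my-project",
    region="us-central1",
    cluster_name="my-cluster",
    cluster_config=CLUSTER_CONFIG,
)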
2 votes · 1 answer · 73 views
Spark on Dataproc: Slow Data Insertion into BigQuery for Large Datasets (~30M Records)
I have a Scala Spark job running on Google Cloud Dataproc that sources and writes data to Google BigQuery (BQ) tables. The code works fine for smaller datasets, but when processing larger volumes (e.g....
1 vote · 1 answer · 395 views
PySpark performance problems when writing to a table in BigQuery
I am new to the world of PySpark and am experiencing serious performance problems when writing data from a DataFrame to a table in BigQuery. I have tried everything I have read: recommendations, using ...
2 votes · 0 answers · 105 views
Delta compatibility problem with PySpark in Dataproc GKE environment
I've created a Dataproc cluster using GKE and a custom image with PySpark 3.5.0, but I can't get it to work with Delta.
The custom image Dockerfile is this:
FROM us-central1-docker.pkg.dev/cloud-...
2 votes · 0 answers · 28 views
How can I install html5lib on a Dataproc cluster?
I have a Dataproc pipeline with which I do web scraping and store data in GCP.
The task setup is something like this:
create_dataproc_cluster = DataprocCreateClusterOperator(
task_id='...
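A common pattern for this is the pip-install initialization action plus PIP_PACKAGES metadata. A sketch under that assumption; the regional init-actions bucket path follows the documented convention but should be verified for your region:

from airflow.providers.google.cloud.operators.dataproc import (
    ClusterGenerator,
    DataprocCreateClusterOperator,
)

cluster_config = ClusterGenerator(
    project_id="my-project",
    num_workers=2,
    init_actions_uris=[
        "gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh"
    ],
    metadata={"PIP_PACKAGES": "html5lib"},  # space-separated for multiple packages
).make()

create_dataproc_cluster = DataprocCreateClusterOperator(
    task_id="create_dataproc_cluster",
    project_id="my-project",
    region="us-central1",
    cluster_name="scraping-cluster",
    cluster_config=cluster_config,
)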
6 votes · 0 answers · 3k views
spark_catalog requires a single-part namespace in dbt python incremental model
Description:
Using the dbt functionality that allows one to create a python model, I created a model that reads from some BigQuery table, performs some calculations and writes back to BigQuery.
It ...
2 votes · 1 answer · 548 views
How to resolve: java.lang.NoSuchMethodError: 'scala.collection.Seq org.apache.spark.sql.types.StructType.toAttributes()'
Running a simple ETL PySpark job on Dataproc 2.2 with the job property spark.jars.packages set to io.delta:delta-core_2.12:2.4.0. Other settings are left at their defaults. I have the following config:
conf = (
...
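This NoSuchMethodError is the classic Spark/Delta version mismatch: delta-core 2.4.0 targets Spark 3.4, while Dataproc 2.2 ships Spark 3.5, which pairs with the renamed delta-spark 3.x artifacts. A hedged sketch of corrected job properties; the exact 3.x version is illustrative:

# Corrected Dataproc job properties (version illustrative):
DELTA_JOB_PROPERTIES = {
    "spark.jars.packages": "io.delta:delta-spark_2.12:3.2.0",
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}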
1 vote · 0 answers · 31 views
Create external Hive table for multiline JSON
I am trying to create a Hive table for the given multiline JSON, but the actual result does not match the expected result.
Sample JSON file:
{
"name": "Adil Abro",
"...
1 vote · 1 answer · 77 views
Not able to set the Spark log properties programmatically via Python
I was trying to suppress Spark logging by specifying my own log4j.properties file:
gcloud dataproc jobs submit spark \
--cluster test-dataproc-cluster \
--region europe-north1 \
--files gs://...
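If shipping a log4j.properties file proves awkward, one programmatic fallback is to lower the log level after the session starts; this does not silence driver startup logging, only what follows:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quiet-job").getOrCreate()
spark.sparkContext.setLogLevel("WARN")  # TRACE/DEBUG/INFO/WARN/ERROR/FATAL/OFF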
3 votes · 0 answers · 146 views
Bigtable read and write using Dataproc on Compute Engine results in "Key not found"
I am experimenting with reading and writing data in Cloud Bigtable using Dataproc on Compute Engine and a PySpark job with the spark-bigtable-connector. I got an example from the spark-bigtable repo and ...
1 vote · 1 answer · 108 views
Not able to connect Flink to Kafka
Getting the error below while connecting Flink to Kafka:
Caused by: java.lang.NoClassDefFoundError: org/apache/kafka/clients/admin/AdminClient
I am using Flink 1.17 with
flink-sql-connector-kafka-1....
2 votes · 0 answers · 42 views
How to evaluate total execution/connection time in Bigtable from Spark in Scala
I am attempting to measure the total execution time from Spark to Bigtable. However, when I wrap the following code around the Bigtable-related function, it consistently shows only 0.5 seconds, ...
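A likely explanation for the constant ~0.5 seconds: Spark reads are lazy, so timing only the read measures plan construction, not I/O; the clock has to wrap an action that forces the scan. A Python sketch of the idea (the question is in Scala, but the behavior is identical); the connector format and option names are placeholders:

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

start = time.monotonic()
df = (
    spark.read.format("bigtable")  # placeholder: your connector's format name
    .option("catalog", "...")       # placeholder connector options
    .load()
)                                   # lazy: nothing has been read yet
row_count = df.count()              # the action triggers the actual scan
print(f"read {row_count} rows in {time.monotonic() - start:.1f}s")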
1 vote · 1 answer · 100 views
Composer v2.6.6 - job completing successfully, task exited with Negsignal.SIGKILL
We have Composer 2.6.6 (Airflow 2.5.3) and a job, VANI-UEBA3, which runs on Dataproc Serverless batches ... the job runs through fine (as shown in the Dataproc Serverless UI),
but the Composer UI ...
1 vote · 1 answer · 281 views
Trigger rule "one_success" doesn't work for "DataprocCreateClusterOperator"
I have a case where one of my operators, a DataprocCreateClusterOperator, never triggers, as if "all_success" were still set for it. It runs fine if it's the very first task, but I don't ...
2 votes · 0 answers · 121 views
Disabling YARN logs - Dataproc cluster
I am trying to reduce the Class A operations on a GCS bucket which is configured to store YARN and Spark history logs.
This is costing us a lot. I disabled Spark logs by editing the spark-defaults....
1 vote · 0 answers · 90 views
Selecting a Dataproc cluster size with autoscaling on
I am new to GCP and probably have a very basic question. We are running our PySpark jobs in a Dataproc ephemeral cluster with the autoscaling property on. In our code we have used ...
1 vote · 0 answers · 154 views
Life cycle policies on Dataproc staging bucket subfolders
We have a Dataproc cluster staging bucket in which all the Spark job logs are stored:
eu-digi-pipe-dataproc-stage/google-cloud-dataproc-metainfo/d0decf20-21fd-4536-bbc4-5a4f829e49bf/jobs/google-...
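GCS lifecycle rules are bucket-level, but a prefix condition can scope them to the Dataproc metainfo subfolder. A hedged sketch with the google-cloud-storage client, assuming a library version recent enough to support matches_prefix; the age is illustrative:

from google.cloud import storage

client = storage.Client(project="my-project")  # placeholder project
bucket = client.get_bucket("eu-digi-pipe-dataproc-stage")
bucket.add_lifecycle_delete_rule(
    age=30,  # delete objects older than 30 days
    matches_prefix=["google-cloud-dataproc-metainfo/"],
)
bucket.patch()  # push the updated lifecycle configuration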
1 vote · 0 answers · 39 views
How to parse a huge number of JSON files efficiently in PySpark
I'm trying to parse around 100 GB of small JSON files using PySpark. The files are stored in a Google Cloud Storage bucket and they come zipped: *.jsonl.gz
How can I do it efficiently?
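A minimal sketch of the usual approach: read the files directly as compressed JSON Lines, then repartition, since gzip files are not splittable and each one otherwise becomes a single task. Paths and the partition count are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jsonl-parse").getOrCreate()

df = (
    spark.read
    .json("gs://my-bucket/raw/*.jsonl.gz")  # decompressed one task per file
    .repartition(512)                        # fan out after the narrow read
)
df.write.mode("overwrite").parquet("gs://my-bucket/parsed/")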
1 vote · 0 answers · 164 views
BigQuery WRITE_TRUNCATE option doesn't work in the Spark BigQuery connector
I'm trying to overwrite a BigQuery table using the WRITE_TRUNCATE option with the Spark BigQuery connector. I have verified that the target table is updated, as the Last modified timestamp changes ...
2 votes · 0 answers · 197 views
Dataproc Serverless Batch Spark Web UI not displaying executor logs
I want to run a Spark job using the JDBCToBigQuery template in Dataproc Batch on GCP.
The job runs successfully, but there are no executor logs (stdout, stderr) on the Executors tab in the Web UI like ...
1 vote · 0 answers · 79 views
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary
I am trying to read multiple Parquet files from GCS using a Dataproc Spark job:
df = sc.read.option("mergeSchema", "true").parquet(remote_path)
The above code throws an error saying: ...
1 vote · 1 answer · 142 views
Build CI/CD pipeline for Airflow and Spark
Using GCP Dataproc and Airflow (Composer), I am able to deploy the PySpark script and get successful results. After this, I'm not sure about artifact preparation and building a CI/CD pipeline. Which tool should I ...
1 vote · 0 answers · 194 views
Why does my Spark Structured Streaming query writing into a Delta table fail with java.io.IOException: Error listing gs://bucket/_delta_log/?
I have a Spark Structured Streaming query which reads from a Kafka topic and writes the data into a Delta table. The query ran fine for a few months with no issues. But since last week ...