1,673 questions
1 vote · 1 answer · 35 views
DataprocSparkSession package in Python error - "RuntimeError: Error while creating Dataproc Session"
I am using the code below to create a Dataproc Spark session to run a job:
from google.cloud.dataproc_spark_connect import DataprocSparkSession
from google.cloud.dataproc_v1 import Session
session = Session(...
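For comparison, a minimal sketch of a full session setup with this package; the environment-variable configuration and the dataprocSessionConfig builder method are assumptions based on the package README and may differ by version, and all project, region, and subnet values are placeholders:

import os
from google.cloud.dataproc_spark_connect import DataprocSparkSession
from google.cloud.dataproc_v1 import Session

# Placeholders: substitute your own project, region, and subnet.
os.environ["GOOGLE_CLOUD_PROJECT"] = "my-project"
os.environ["GOOGLE_CLOUD_REGION"] = "us-central1"

session_config = Session()
session_config.environment_config.execution_config.subnetwork_uri = (
    "projects/my-project/regions/us-central1/subnetworks/default"
)

spark = (
    DataprocSparkSession.builder
    .dataprocSessionConfig(session_config)  # assumed builder method name
    .getOrCreate()
)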
0 votes · 0 answers · 50 views
What connector can be used for Google Cloud Pub/Sub along with Cloud Dataproc (Spark 3.5.x)?
I am using Spark 3.5.x and would like to use the readStream() API for structured streaming in Java.
I don't see any Pub/Sub connector available. Couldn't try Pub/Sub Lite because it is deprecated ...
0 votes · 1 answer · 63 views
ModuleNotFoundError in GCP after trying to submit a job
New to GCP; I am trying to submit a job in Dataproc with a .py file and an attached pythonproject.zip file (it is a project), but I am getting the error below: ModuleNotFoundError: No module ...
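A hedged sketch of a submission that makes the zip importable, using the google-cloud-dataproc Python client; python_file_uris is the --py-files equivalent, and the project, bucket, and cluster names are placeholders:

from google.cloud import dataproc_v1

client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)
job = {
    "placement": {"cluster_name": "my-cluster"},
    "pyspark_job": {
        "main_python_file_uri": "gs://my-bucket/main.py",
        # Ships the project zip with the job and puts it on the PYTHONPATH;
        # the zip must contain the package at its top level to be importable.
        "python_file_uris": ["gs://my-bucket/pythonproject.zip"],
    },
}
client.submit_job(project_id="my-project", region="us-central1", job=job)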
3 votes · 1 answer · 112 views
Facing BigQuery write failure after upgrading Spark and Dataproc: "Schema mismatch: corresponding field path to Parquet column has 0 repeated fields"
We are currently migrating from Spark 2.4 to Spark 3.5 (and Dataproc 1 to 2), and our workflows are failing with the following error:
Caused by: com.google.cloud.spark.bigquery.repackaged....
1 vote · 0 answers · 45 views
Google Cloud Dataproc Cluster Creation Fails: "Failed to validate permissions for default service account"
Despite the Default Compute Engine Service Account having the necessary roles and being explicitly specified in my cluster creation command, I am still encountering the "Failed to validate ...
2 votes · 1 answer · 135 views
Spark memory error in thread spark-listener-group-eventLog
I have a PySpark application which uses GraphFrames to compute connected components on a DataFrame.
The edges DataFrame I generate has 2.7M records.
When I run the code it is slow, but slowly ...
1 vote · 0 answers · 74 views
Out of memory for a smaller dataset
I have a PySpark job reading an input volume of just ~50-55 GB of Parquet data from a Delta table. The job uses n2-highmem-4 GCP VMs and 1-15 workers with autoscaling. Each worker VM is of type n2-highmem-...
1 vote · 2 answers · 83 views
How do you run Python Hadoop jobs on Dataproc?
I am trying to run my Python code for a Hadoop job on Dataproc. I have a mapper.py and a reducer.py file. I am running this command in the terminal:
gcloud dataproc jobs submit hadoop \
--cluster=my-...
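For mapper.py/reducer.py scripts, the job has to go through Hadoop streaming rather than a plain main class. A sketch using the google-cloud-dataproc client, assuming the streaming jar path documented for Dataproc images; all bucket and cluster names are placeholders:

from google.cloud import dataproc_v1

client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)
job = {
    "placement": {"cluster_name": "my-cluster"},
    "hadoop_job": {
        # Streaming jar shipped with the Dataproc image; path may vary by image.
        "main_jar_file_uri": "file:///usr/lib/hadoop-mapreduce/hadoop-streaming.jar",
        "args": [
            "-files", "gs://my-bucket/mapper.py,gs://my-bucket/reducer.py",
            "-mapper", "mapper.py",
            "-reducer", "reducer.py",
            "-input", "gs://my-bucket/input/",
            "-output", "gs://my-bucket/output/",
        ],
    },
}
client.submit_job(project_id="my-project", region="us-central1", job=job)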
2 votes · 0 answers · 54 views
Why does Spark raise an IOException while running an aggregation on a streaming DataFrame in Dataproc 2.2?
I am trying to migrate a job running on Dataproc 2.1 images (Spark 3.3, Python 3.10) to Dataproc 2.2 images (Spark 3.5, Python 3.11).
However, I encounter an error on one of my queries. After further ...
1 vote · 0 answers · 34 views
Partial records being read in PySpark through Dataproc
I have a Google Dataproc job that reads a CSV file from Google Cloud Storage with the following headers:
Content-type : application/octet-stream
Content-encoding : gzip
FileName: gs://...
2 votes · 1 answer · 112 views
Default behavior of Spark 3.5.1 when writing Numeric/Bignumeric to BigQuery
As per the documentation for Spark 3.5.1 with the latest spark-bigquery-connector, Spark Decimal(38,0) should be written as NUMERIC in BigQuery.
https://github.com/GoogleCloudDataproc/spark-bigquery-connector?...
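Context that may help here: BigQuery NUMERIC is effectively DECIMAL(38,9), so it holds at most 29 integer digits, and Spark Decimal(38,0) cannot fit it. A hedged PySpark sketch of making the target type explicit by casting before the write instead of relying on the default mapping; table and column names are placeholders:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("decimal-to-bq").getOrCreate()
df = spark.read.parquet("gs://my-bucket/input/")

# Cast to a shape NUMERIC can actually hold (at most 29 integer digits).
df = df.withColumn("amount", F.col("amount").cast("decimal(29,0)"))

(
    df.write.format("bigquery")
    .option("writeMethod", "direct")
    .option("table", "my-project.my_dataset.my_table")
    .mode("append")
    .save()
)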
1 vote · 2 answers · 134 views
How to pass arguments from GCP Workflows into Dataproc
I'm using GCP Workflows to define steps for a data engineering project. The input of the workflow consists of multiple parameters which are provided through the Workflows API.
I defined a GCP ...
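Whatever the Workflows step puts in the Dataproc job's args list arrives as plain argv in the driver, so the PySpark side can stay ordinary argument parsing. A minimal sketch; the flag name and value are illustrative:

import argparse

# Set by the Workflows step in the job body, e.g.:
# "pyspark_job": {
#     "main_python_file_uri": "gs://my-bucket/main.py",
#     "args": ["--run-date", "2024-01-01"],
# }
parser = argparse.ArgumentParser()
parser.add_argument("--run-date", required=True)
args = parser.parse_args()
print(f"processing {args.run_date}")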
1 vote · 0 answers · 104 views
Error "google.api_core.exceptions.InvalidArgument: 400 Cluster name is required" while trying to use the Airflow DataprocSubmitJobOperator
I'm trying to use the Dataproc submit job operator from Airflow (https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_api/airflow/providers/google/cloud/operators/dataproc/index....
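This error usually means the job body is missing placement.cluster_name. A minimal sketch of a job dict that satisfies the operator; project, region, cluster, and file names are placeholders:

from airflow.providers.google.cloud.operators.dataproc import (
    DataprocSubmitJobOperator,
)

JOB = {
    "reference": {"project_id": "my-project"},
    "placement": {"cluster_name": "my-cluster"},  # the field the 400 complains about
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/main.py"},
}

submit_job = DataprocSubmitJobOperator(
    task_id="submit_job",
    project_id="my-project",
    region="us-central1",
    job=JOB,
)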
3 votes · 2 answers · 391 views
GoogleHadoopOutputStream: hflush(): No-op due to rate limit: increase in Class A operations for GCS bucket
We are running our Spark ingestion jobs, which process multiple files in batches.
We read CSV or TSV files in batches, create a DataFrame, and do some transformations before loading it into BigQuery ...
1 vote · 1 answer · 236 views
Dataproc PySpark Job Fails with BigQuery Connector Issue - java.util.ServiceConfigurationError
I'm trying to run a PySpark job on Google Cloud Dataproc that reads data from BigQuery, processes it, and writes it back. However, the job keeps failing with the following error:
java.util....
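java.util.ServiceConfigurationError in this context is often two copies of the connector on the classpath, the image's built-in jar plus one added at submit time. A hedged sketch of pinning exactly one version through the job properties via the Python client; the connector version shown is illustrative:

from google.cloud import dataproc_v1

client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)
job = {
    "placement": {"cluster_name": "my-cluster"},
    "pyspark_job": {
        "main_python_file_uri": "gs://my-bucket/job.py",
        # Pin a single connector version and remove any other copy
        # (e.g. a jar passed via --jars or baked into the cluster).
        "properties": {
            "spark.jars.packages": "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.36.1",
        },
    },
}
client.submit_job(project_id="my-project", region="us-central1", job=job)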
1 vote · 0 answers · 41 views
Dataproc Serverless Failure on Rerun
I have written a Spark job to read from a Kafka topic, do some processing, and dump the data in Avro format to GCS.
I am deploying this Java application on Dataproc Serverless using the Trigger.Once mode, so ...
1 vote · 1 answer · 84 views
java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver while running JAR on Google Dataproc cluster
I have the driver dependency in the POM.xml and I am using the Maven Shade plugin to create an uber JAR. I do see the driver dependency correctly listed in the JAR file. The JAR runs fine in IntelliJ but on ...
1 vote · 0 answers · 41 views
Dataproc Hive Job - OutOfMemoryError: Java heap space
I have a Dataproc cluster; we are running an INSERT OVERWRITE query through the Hive CLI which fails with OutOfMemoryError: Java heap space.
We adjusted memory configurations for reducers and Tez tasks, ...
1 vote · 0 answers · 80 views
Google Cloud Data Fusion streaming pipeline and Spark jobs with empty rows
I have a Google Cloud Data Fusion streaming pipeline that receives data from Google Pub/Sub. Micro-batching is performed every 5 seconds. Since data doesn’t always arrive consistently, I see many ...
1 vote · 3 answers · 633 views
Caused by: java.lang.IllegalStateException: This connector was made for Scala null, it was not meant to run on Scala 2.12
The code that triggers the error is as follows:
String strJsonContent = SessionContext
.getSparkSession()
.read()
.json(filePath)
.toJSON()
.first();
And I'm using Maven to build the package without ...
1 vote · 1 answer · 227 views
Creating Dataproc Cluster with public-ip-address using DataprocCreateClusterOperator in Airflow
I am trying to create a Dataproc Cluster in my GCP project within an Airflow DAG using the DataprocCreateClusterOperator. I am using the ClusterGenerator to generate the config for the cluster. ...
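A minimal sketch of the relevant ClusterGenerator flag: internal_ip_only=False requests external (public) IPs for the cluster VMs. Parameter availability may vary by provider version, and all names are placeholders:

from airflow.providers.google.cloud.operators.dataproc import (
    ClusterGenerator,
    DataprocCreateClusterOperator,
)

cluster_config = ClusterGenerator(
    project_id="my-project",
    num_workers=2,
    master_machine_type="n2-standard-4",
    worker_machine_type="n2-standard-4",
    internal_ip_only=False,  # give the VMs public IP addresses
).make()

create_cluster = DataprocCreateClusterOperator(
    task_id="create_cluster",
    project_id="my-project",
    region="us-central1",
    cluster_name="my-cluster",
    cluster_config=cluster_config,
)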
1 vote · 1 answer · 316 views
How can I debug an InactiveRpcError with status INVALID_ARGUMENT when submitting a serverless batch job?
When submitting a Dataproc Serverless batch request, we have been getting errors like:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode....
2 votes · 1 answer · 294 views
Spark: The deserializer is not supported: need a(n) "ARRAY" field but got "MAP<STRING, STRING>"
We recently migrated to the Dataproc 2.2 image, with Scala 2.12.18 and Spark 3.5.
package test
import org.apache.spark.sql.SparkSession
import test.Model._
...
2 votes · 0 answers · 276 views
UDF method with return type parameter is deprecated since Spark 3.0.0. How to return a struct type from a UDF function?
We are upgrading the GCP Dataproc cluster to the 2.2-debian12 image, with Spark version 3.5.0 and Scala version 2.12.18. With these versions, one major change is that the udf method with a return type parameter is ...
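The question is about Scala (where the usual fix is a typed udf over a case class, letting Spark infer the schema), but for reference here is a PySpark sketch of the non-deprecated pattern: declare the struct as the UDF's returnType. All names are illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

result_schema = StructType([
    StructField("label", StringType()),
    StructField("score", IntegerType()),
])

@udf(returnType=result_schema)
def classify(value: int):
    # Return a tuple matching the declared struct fields.
    return ("high" if value > 10 else "low", value)

df = spark.range(20).withColumn("result", classify("id"))
df.select("id", "result.label", "result.score").show()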
2 votes · 2 answers · 488 views
Error upgrading Dataproc Serverless version from 2.1 to 2.2
I have changed the runtime version of a Dataproc Serverless batch from 2.1 to 2.2, and now when I run it I get the following error:
Exception in thread "main" java.util.ServiceConfigurationError: org....
2 votes · 0 answers · 141 views
How to reduce GCS A and B operations in a Spark Structured Streaming pipeline in Dataproc?
I'm running a data pipeline where an on-premises NiFi flow streams JSON files into a GCS bucket. I have 5 tables, each with their own path, generating around 140k objects per day. The bucket ...
1 vote · 1 answer · 316 views
Accessing BigQuery Dataset from a Different GCP Project Using PySpark on Dataproc
I am working with BigQuery, Dataproc, Workflows, and Cloud Storage in Google Cloud using Python.
I have two GCP projects:
gcp-project1: contains the BigQuery dataset gcp-project1.my_dataset.my_table
...
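A minimal sketch of the usual cross-project pattern with the spark-bigquery-connector: the table option carries the project that owns the data, while parentProject sets the project used for billing and job execution. The second project name below is a placeholder, since it is elided above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cross-project-bq").getOrCreate()

df = (
    spark.read.format("bigquery")
    .option("parentProject", "my-billing-project")  # placeholder project
    .option("table", "gcp-project1.my_dataset.my_table")
    .load()
)
df.show()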
1 vote · 0 answers · 48 views
How do I correctly configure the staging and temp buckets in DataprocCreateClusterOperator?
I'm trying to find out how to set the temp and staging buckets in the Dataproc operator. I've searched all over the internet and didn't find a good answer.
import pendulum
from datetime import timedelta
...
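A hedged sketch, assuming the ClusterConfig fields config_bucket (the staging bucket) and temp_bucket can be set directly in the cluster_config dict passed to the operator; bucket, machine, and project names are placeholders:

from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
)

CLUSTER_CONFIG = {
    "config_bucket": "my-staging-bucket",  # staging bucket
    "temp_bucket": "my-temp-bucket",
    "master_config": {"num_instances": 1, "machine_type_uri": "n2-standard-4"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n2-standard-4"},
}

create_cluster = DataprocCreateClusterOperator(
    task_id="create_cluster",
    project_id="my-project",
    region="us-central1",
    cluster_name="my-cluster",
    cluster_config=CLUSTER_CONFIG,
)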
2 votes · 1 answer · 73 views
Spark on Dataproc: Slow Data Insertion into BigQuery for Large Datasets (~30M Records)
I have a Scala Spark job running on Google Cloud Dataproc that sources and writes data to Google BigQuery (BQ) tables. The code works fine for smaller datasets, but when processing larger volumes (e.g....
1 vote · 1 answer · 395 views
PySpark performance problems when writing to a table in BigQuery
I am new to the world of PySpark and am experiencing serious performance problems when writing data from a DataFrame to a table in BigQuery. I have tried everything I have read: recommendations, using ...
2 votes · 0 answers · 105 views
Delta compatibility problem with PySpark in Dataproc GKE environment
I've created a Dataproc cluster using GKE and a custom image with PySpark 3.5.0, but I can't get it to work with Delta.
The custom image Dockerfile is this:
FROM us-central1-docker.pkg.dev/cloud-...
2 votes · 0 answers · 28 views
How can I install html5lib on a Dataproc cluster?
I have a Dataproc pipeline with which I do web scraping and store data in GCP.
The task setup is something like this:
create_dataproc_cluster = DataprocCreateClusterOperator(
task_id='...
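A common pattern for this is the pip-install initialization action plus PIP_PACKAGES metadata. A sketch under that assumption; the regional init-actions bucket path follows the documented convention but should be verified for your region:

from airflow.providers.google.cloud.operators.dataproc import (
    ClusterGenerator,
    DataprocCreateClusterOperator,
)

cluster_config = ClusterGenerator(
    project_id="my-project",
    num_workers=2,
    init_actions_uris=[
        "gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh"
    ],
    metadata={"PIP_PACKAGES": "html5lib"},  # space-separated for multiple packages
).make()

create_dataproc_cluster = DataprocCreateClusterOperator(
    task_id="create_dataproc_cluster",
    project_id="my-project",
    region="us-central1",
    cluster_name="scraping-cluster",
    cluster_config=cluster_config,
)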
6 votes · 0 answers · 3k views
spark_catalog requires a single-part namespace in dbt python incremental model
Description:
Using the dbt functionality that allows one to create a python model, I created a model that reads from some BigQuery table, performs some calculations and writes back to BigQuery.
It ...
2 votes · 1 answer · 548 views
How to resolve: java.lang.NoSuchMethodError: 'scala.collection.Seq org.apache.spark.sql.types.StructType.toAttributes()'
Running a simple ETL PySpark job on Dataproc 2.2 with the job property spark.jars.packages set to io.delta:delta-core_2.12:2.4.0. Other settings are left at their defaults. I have the following config:
conf = (
...
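This NoSuchMethodError is the classic Spark/Delta version mismatch: delta-core 2.4.0 targets Spark 3.4, while Dataproc 2.2 ships Spark 3.5, which pairs with the renamed delta-spark 3.x artifacts. A hedged sketch of corrected job properties; the exact 3.x version is illustrative:

# Corrected Dataproc job properties (version illustrative):
DELTA_JOB_PROPERTIES = {
    "spark.jars.packages": "io.delta:delta-spark_2.12:3.2.0",
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}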
1 vote · 0 answers · 31 views
Create external Hive table for multiline JSON
I am trying to create a Hive table for the given multiline JSON, but the actual result does not match the expected result.
Sample JSON file:
{
"name": "Adil Abro",
"...
1 vote · 1 answer · 77 views
Not able to set the Spark log properties programmatically via Python
I was trying to suppress Spark logging by specifying my own log4j.properties file:
gcloud dataproc jobs submit spark \
--cluster test-dataproc-cluster \
--region europe-north1 \
--files gs://...
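If shipping a log4j.properties file proves awkward, one programmatic fallback is to lower the log level after the session starts; this does not silence driver startup logging, only what follows:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quiet-job").getOrCreate()
spark.sparkContext.setLogLevel("WARN")  # TRACE/DEBUG/INFO/WARN/ERROR/FATAL/OFF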
3 votes · 0 answers · 146 views
Bigtable read and write using Dataproc on Compute Engine results in "Key not found"
I am experimenting with reading and writing data in Cloud Bigtable using Dataproc on Compute Engine and a PySpark job with the spark-bigtable-connector. I got an example from the spark-bigtable repo and ...
1 vote · 1 answer · 108 views
Not able to connect Flink to Kafka
Getting the error below while connecting Flink to Kafka:
Caused by: java.lang.NoClassDefFoundError: org/apache/kafka/clients/admin/AdminClient
I am using Flink 1.17 with
flink-sql-connector-kafka-1....
2 votes · 0 answers · 42 views
How to evaluate total execution/connection time in Bigtable from Spark in Scala
I am attempting to measure the total execution time from Spark to Bigtable. However, when I wrap the following code around the Bigtable-related function, it consistently shows only 0.5 seconds, ...
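A likely explanation for the constant ~0.5 seconds: Spark reads are lazy, so timing only the read measures plan construction, not I/O; the clock has to wrap an action that forces the scan. A Python sketch of the idea (the question is in Scala, but the behavior is identical); the connector format and option names are placeholders:

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

start = time.monotonic()
df = (
    spark.read.format("bigtable")  # placeholder: your connector's format name
    .option("catalog", "...")       # placeholder connector options
    .load()
)                                   # lazy: nothing has been read yet
row_count = df.count()              # the action triggers the actual scan
print(f"read {row_count} rows in {time.monotonic() - start:.1f}s")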
1 vote · 1 answer · 100 views
Composer v2.6.6 - job completing successfully, task exited with Negsignal.SIGKILL
We have Composer 2.6.6 (Airflow 2.5.3) and a job, VANI-UEBA3, which runs on Dataproc Serverless batches ... the job runs through fine (as shown in the Dataproc Serverless UI),
but the Composer UI ...
1 vote · 1 answer · 281 views
Trigger rule "one_success" doesn't work for "DataprocCreateClusterOperator"
I have a case where one of my operators, a DataprocCreateClusterOperator, never triggers, as if "all_success" were still set for it. It runs fine if it's the very first task, but I don't ...
2 votes · 0 answers · 121 views
Disabling YARN logs - Dataproc cluster
I am trying to reduce the Class A operations on a GCS bucket which is configured to store YARN and Spark history logs.
This is costing us a lot. I disabled Spark logs by editing the spark-defaults....
1 vote · 0 answers · 90 views
Selecting a Dataproc cluster size with autoscaling on
I am new to GCP and probably have a very basic question. We are running our PySpark jobs in a Dataproc ephemeral cluster with the autoscaling property on. In our code we have used ...
1 vote · 0 answers · 154 views
Life cycle policies on Dataproc staging bucket subfolders
We have a Dataproc cluster staging bucket in which all the Spark job logs are stored:
eu-digi-pipe-dataproc-stage/google-cloud-dataproc-metainfo/d0decf20-21fd-4536-bbc4-5a4f829e49bf/jobs/google-...
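GCS lifecycle rules are bucket-level, but a prefix condition can scope them to the Dataproc metainfo subfolder. A hedged sketch with the google-cloud-storage client, assuming a library version recent enough to support matches_prefix; the age is illustrative:

from google.cloud import storage

client = storage.Client(project="my-project")  # placeholder project
bucket = client.get_bucket("eu-digi-pipe-dataproc-stage")
bucket.add_lifecycle_delete_rule(
    age=30,  # delete objects older than 30 days
    matches_prefix=["google-cloud-dataproc-metainfo/"],
)
bucket.patch()  # push the updated lifecycle configuration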
1 vote · 0 answers · 39 views
How to parse a huge number of JSON files efficiently in PySpark
I'm trying to parse around 100 GB of small JSON files using PySpark. The files are stored in a Google Cloud Storage bucket and they come zipped: *.jsonl.gz
How can I do it efficiently?
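A minimal sketch of the usual approach: read the files directly as compressed JSON Lines, then repartition, since gzip files are not splittable and each one otherwise becomes a single task. Paths and the partition count are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jsonl-parse").getOrCreate()

df = (
    spark.read
    .json("gs://my-bucket/raw/*.jsonl.gz")  # decompressed one task per file
    .repartition(512)                        # fan out after the narrow read
)
df.write.mode("overwrite").parquet("gs://my-bucket/parsed/")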
1 vote · 0 answers · 164 views
BigQuery WRITE_TRUNCATE option doesn't work in the Spark BigQuery connector
I'm trying to overwrite a BigQuery table using the WRITE_TRUNCATE option with the Spark BigQuery connector. I have verified that the target table is updated, as the Last modified timestamp changes ...
2 votes · 0 answers · 197 views
Dataproc Serverless Batch Spark Web UI not displaying executor logs
I want to run a Spark job using the JDBCToBigQuery template in Dataproc Batch on GCP.
The job runs successfully, but there are no executor logs (stdout, stderr) on the Executors tab in the Web UI like ...
1 vote · 0 answers · 79 views
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary
I am trying to read multiple Parquet files from GCS using a Dataproc Spark job:
df = sc.read.option("mergeSchema", "true").parquet(remote_path)
The above code throws an error saying: ...
1 vote · 1 answer · 142 views
Build CI/CD pipeline for Airflow and Spark
Using GCP Dataproc and Airflow (Composer), I am able to deploy the PySpark script and get successful results. After this, I'm not sure about artifact preparation and building a CI/CD pipeline. Which tool should I ...
1 vote · 0 answers · 194 views
Why does my Spark Structured Streaming query writing into a Delta table fail with java.io.IOException: Error listing gs://bucket/_delta_log/?
I have a Spark Structured Streaming query which reads from a Kafka topic and writes the data into a Delta table. The query ran fine for a few months with no issues. But since last week ...