
Questions tagged [google-cloud-dataflow]

Google Cloud Dataflow is a fully managed cloud service for creating and evaluating data processing pipelines at scale. Dataflow pipelines are based on the Apache Beam programming model and can operate in both batch and streaming modes. Cloud Dataflow is part of the Google Cloud Platform.

0 votes
0 answers
5 views

Snowflake vs BigQuery – How to Choose the Right Cloud Platform

When deciding between Snowflake and BigQuery for your cloud data warehousing needs, it's essential to consider factors such as performance, scalability, and pricing. Snowflake offers a multi-cluster ...
Onix Net
-1 votes
0 answers
22 views

How can I capture exceptions thrown by a Java application within Airflow's on_failure_callback?

I am using Airflow to run KubernetesPodOperator tasks that run a Java application image. The Java application uses Beam Dataflow. My custom Airflow operators inherit from KubernetesPodOperator. I am trying to ...
Maria Dorohin
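
The exception that triggers on_failure_callback is available in the callback context, while the Java stack trace itself ends up in the pod logs. A minimal sketch, assuming a custom operator inheriting KubernetesPodOperator (the callback name notify_failure is hypothetical, and the import path varies by provider version):

```python
# Hypothetical failure callback. The exception Airflow passes here is the
# operator-level error (e.g. an AirflowException for a failed pod), not the
# Java exception itself -- that lives in the pod logs.
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

def notify_failure(context):
    exc = context.get("exception")
    ti = context["task_instance"]
    print(f"Task {ti.task_id} failed with: {exc!r}")

class MyPodOperator(KubernetesPodOperator):
    def __init__(self, **kwargs):
        # Stream pod (Java) logs into the Airflow task log so the
        # stack trace is recoverable after a failure.
        kwargs.setdefault("get_logs", True)
        kwargs.setdefault("on_failure_callback", notify_failure)
        super().__init__(**kwargs)
```
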
0 votes
0 answers
28 views

Apache Beam -> BigQuery: Storage Write API doesn't respect Primary Key

I have a BigQuery table created using the following DDL: CREATE TABLE mytable ( id STRING, source STRING, PRIMARY KEY (id) NOT ENFORCED ); As you can see, id is set as the table's primary key. ...
SeeBeeOss
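
BigQuery primary keys are declared NOT ENFORCED, so the Storage Write API appends duplicate ids rather than upserting. A common workaround is landing rows in a staging table and merging on the key; a minimal sketch with the google-cloud-bigquery client (the staging table name is hypothetical):

```python
# Deduplicate by merging a hypothetical staging table into the target,
# keyed on the (unenforced) primary key column.
from google.cloud import bigquery

client = bigquery.Client()
merge_sql = """
MERGE `mydataset.mytable` AS t
USING `mydataset.mytable_staging` AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET source = s.source
WHEN NOT MATCHED THEN
  INSERT (id, source) VALUES (s.id, s.source)
"""
client.query(merge_sql).result()  # blocks until the MERGE completes
```
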
0 votes
0 answers
63 views

Azure Data Factory logging values of data flow (metrics) make no sense

I have a data flow which takes Parquet from ADLS. The data enters through a source activity, then comes a derived column, then an aggregation (count) which joins back (join activity/inner) to the ...
Sann
0 votes
1 answer
42 views

How can I efficiently insert more than 1 million records into Firestore?

Description: I am working on a project where I need to insert more than 1 million records into Google Firestore. Currently, my approach is not efficient enough and the process is extremely slow. I am ...
frfernandezdev
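
For bulk loads at this scale, the Python client's BulkWriter batches writes and parallelizes commits with automatic retries, which is usually far faster than writing documents one at a time. A minimal sketch, assuming a hypothetical iterable of record dicts and a collection named records:

```python
# Bulk-insert via Firestore's BulkWriter; collection name and the
# `my_million_records` iterable are hypothetical placeholders.
from google.cloud import firestore

db = firestore.Client()
writer = db.bulk_writer()

for i, record in enumerate(my_million_records):
    ref = db.collection("records").document(str(i))
    writer.create(ref, record)  # queued, batched, and retried internally

writer.close()  # flushes remaining writes and waits for completion
```
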
0 votes
2 answers
50 views

Apache Beam streaming process with time-based windows

I have a Dataflow pipeline that reads messages from Kafka, processes them, and inserts them into BigQuery. I want the processing / BigQuery insertion to happen in time-based batches, so that on ...
asafal
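
A minimal sketch of one way to do this in the Python SDK: window the stream into fixed intervals so the BigQuery load happens once per window. The broker, topic, parse step, and table are hypothetical placeholders:

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.transforms.window import FixedWindows

with beam.Pipeline(options=opts) as p:  # `opts` assumed to set --streaming
    (p
     | ReadFromKafka(consumer_config={"bootstrap.servers": "broker:9092"},
                     topics=["events"])
     | beam.Map(parse_message)            # hypothetical decode/parse step
     | beam.WindowInto(FixedWindows(60))  # one batch per 60-second window
     | beam.io.WriteToBigQuery(
           "project:dataset.table",
           method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
           triggering_frequency=60))      # run a load job once per window
```
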
-1 votes
0 answers
27 views

Apache Beam pipeline | Kafka to Dataproc

I have created a Beam pipeline to read data from a Kafka topic and then insert it into Hive tables in a Dataproc cluster. I consumed the data and converted it to an HCatRecord as below: p....
Dulanga Heshan
0 votes
1 answer
62 views

Azure Synapse Dataflow - Unable to use a parameter in Pre SQL scripts

In my Sink I'm running a Pre SQL Script to delete data after a certain date, but I'm having trouble using parameters in the script. Here is my date parameter variable inside the dataflow, and below is my ...
noobatexcel
1 vote
0 answers
53 views

Transfer/stream data/CSV files from POS (point of sale) to GCS buckets and then to BigQuery

I am working on a project where I have to transfer/stream the data/CSV files from an on-prem POS to GCS buckets, where the data will first be saved to a BigQuery external table and then moved to other ...
Codegator
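
A minimal sketch of the GCS-to-external-table half in Python; the upload can run on a schedule from the POS host. Bucket, dataset, and file paths are hypothetical:

```python
# 1) Upload a CSV from the on-prem machine to GCS, then 2) define a
# BigQuery external table over the bucket prefix. All names hypothetical.
from google.cloud import bigquery, storage

storage.Client().bucket("pos-landing").blob("sales/2024-01-01.csv") \
    .upload_from_filename("/pos/export/2024-01-01.csv")

client = bigquery.Client()
config = bigquery.ExternalConfig("CSV")
config.source_uris = ["gs://pos-landing/sales/*.csv"]
config.autodetect = True  # infer the schema from the files

table = bigquery.Table("myproject.pos.sales_external")
table.external_data_configuration = config
client.create_table(table, exists_ok=True)
```
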
0 votes
0 answers
39 views

Dataflow Job Fails with Cannot create PoolableConnectionFactory and PERMISSION_DENIED Errors

I'm working on a personal data migration project where I need to transfer multiple tables from the Microsoft SQL Server AdventureWorks2019 database to Google BigQuery using the Dataflow SQL Server to BQ ...
Cjizzle
-1 votes
1 answer
46 views

GCP Dataflow job connecting to a GCP VM instance running an HTTP server

I have a GCP Dataflow application and I am planning to invoke REST commands against a FastAPI server running on one of my VM hosts. I am planning for my VM not to expose an external IP address (as I don't ...
user1068378
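
If the Dataflow workers run in the same VPC as the VM with internal IPs only, they can reach the FastAPI server on its internal address without either side exposing an external IP. A minimal sketch of the worker networking options (network and subnetwork names are hypothetical):

```python
# Launch workers with internal IPs only, in the same VPC as the target VM.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    network="my-vpc",
    subnetwork="regions/us-central1/subnetworks/my-subnet",
    use_public_ips=False,  # equivalent to --no_use_public_ips
)
# Inside a DoFn, the server is then reachable via the VM's internal IP
# or internal DNS name, e.g. http://10.128.0.5:8000/ (hypothetical).
```
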
0 votes
2 answers
92 views

Install Artifact Registry Python package from Dockerfile with Cloud Build

I have a Python package located in my Artifact Registry repository. My Dataflow Flex Template is packaged within a Docker image built with the following command: gcloud builds submit --tag $CONTAINER_IMAGE ....
Grégoire Borel
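
One pattern that works here is installing the Artifact Registry keyring backend so pip can authenticate with the Cloud Build service account's credentials, then pointing pip at the repository's index. A minimal Dockerfile sketch (project, repo, region, and package name are all hypothetical):

```dockerfile
FROM gcr.io/dataflow-templates-base/python3-template-launcher-base

# Keyring backend that lets pip authenticate to Artifact Registry using the
# environment's service-account credentials (Cloud Build's, at build time).
RUN pip install keyring keyrings.google-artifactregistry-auth

# Hypothetical repository and package names.
RUN pip install \
    --extra-index-url https://us-central1-python.pkg.dev/my-project/my-repo/simple/ \
    mypkg
```
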
0 votes
0 answers
20 views

Apache Beam Parallel Shared State

Consider an Apache Beam pipeline with 2 parallel transformations: # Transformation 1 p | read_from_pubsub_subscription_1 | save_current_state | write_to_pubsub # Transformation 2 p | ...
Yak O'Poe
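
Beam transforms cannot share mutable in-memory state across branches; the supported route is per-key state via the state API. A minimal sketch (element shapes and the "current state" semantics are hypothetical):

```python
import apache_beam as beam
from apache_beam.coders import PickleCoder
from apache_beam.transforms.userstate import ReadModifyWriteStateSpec

class SaveCurrentState(beam.DoFn):
    # State is scoped per key and window, so input must be keyed.
    CURRENT = ReadModifyWriteStateSpec("current", PickleCoder())

    def process(self, element, current=beam.DoFn.StateParam(CURRENT)):
        key, value = element    # hypothetical (key, value) elements
        current.write(value)    # last value wins for this key
        yield element
```

Since state is not visible across separate transforms, both Pub/Sub branches would need to Flatten into one keyed PCollection before this DoFn.
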
0 votes
1 answer
33 views

Cleaning up staging buckets for a GCP Dataflow run via Flex Template

I am creating GCP Dataflow jobs via Flex Templates, using Cloud Build to generate the templates. This results in brand-new buckets being created every single time, e.g. I have a dataflow-staging-us-...
user1068378
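
The auto-created dataflow-staging-* buckets appear when no staging/temp location is pinned at launch; passing explicit gs:// paths makes every run reuse one bucket you control. A minimal sketch of the options (bucket name hypothetical):

```python
# Pin staging/temp to a single bucket so Dataflow stops auto-creating
# dataflow-staging-* buckets on every run.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    staging_location="gs://my-dataflow-artifacts/staging",
    temp_location="gs://my-dataflow-artifacts/temp",
)
```
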
0 votes
0 answers
15 views

Why is the Beam AfterCount trigger behaving differently? Can anyone explain the output?

I am learning apache-beam triggers. I have written Apache Beam code which has a 30-second fixed window, an AfterCount trigger of 3, and accumulation_mode set to trigger.AccumulationMode.ACCUMULATING. ...
Amar Jeet
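
A detail that often explains surprising AfterCount output: on its own, AfterCount(3) fires at most once per window, and ACCUMULATING mode re-emits earlier elements on each firing. Wrapping the trigger in Repeatedly makes it fire on every 3rd element. A minimal sketch of the setup described (events is a hypothetical unbounded PCollection):

```python
import apache_beam as beam
from apache_beam.transforms.trigger import AccumulationMode, AfterCount, Repeatedly
from apache_beam.transforms.window import FixedWindows

windowed = (events
    | beam.WindowInto(
          FixedWindows(30),                   # 30-second fixed windows
          trigger=Repeatedly(AfterCount(3)),  # fire on every 3rd element
          accumulation_mode=AccumulationMode.ACCUMULATING))
```
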
