Questions tagged [apache-hudi]

Apache Hudi is a transactional data lake platform focused on batch and event processing, with ACID support. Use this tag for questions specific to Apache Hudi. Do not use it for general data lake or Delta Lake questions.

0 votes
0 answers
38 views

Apache Hudi: Ingesting protobuf data from Kafka

I am exploring using Apache Hudi HoodieStreamer to ingest protobuf messages from Kafka into Hudi. Despite a lot of attempts I have hit a roadblock. I get an exception while the HoodieStreamer tries ...
Gaurav's user avatar
  • 73
0 votes
0 answers
35 views

Unable to sync non-partitioned Hudi table with BigQuery

I'm trying to write my structured streaming data to Apache Hudi in a non-partitioned table and then sync it with BigQuery. But even though it is a new table and I've set no partitioning ...
Vinayak Gupta's user avatar
0 votes
1 answer
57 views

Issues while writing xml data to hudi table in azure synapse notebook

I've successfully read blob data (XML) from a container in an Azure Synapse notebook and displayed the dataframe df as needed. However, while writing it as a Hudi table in Azure Data Lake Storage Gen2 I've ...
Vishal Patwardhan's user avatar
0 votes
0 answers
16 views

How to detect and create an alarm for a Hudi job failure using hoodie metrics via Prometheus

Problem: When using multi delta streamer for Kafka ingestion of many tables, the job still succeeds even if ingestion fails for one of the tables. There is no way to check for success/failure for a particular ...
Roobal Jindal's user avatar
1 vote
3 answers
90 views

Unexplained s3 slowdowns when ingesting data to hudi tables using spark/python Glue jobs

I'm using AWS Glue Spark/Python jobs to ingest data into Hudi tables in an S3 bucket. I'm hitting major S3 slowdown issues, well beyond what seems reasonable, but I'm unable to pin down the root cause. ...
Aamit's user avatar
  • 211
1 vote
0 answers
67 views

Spark-Hudi: Unable to perform Hard delete using Pyspark on HUDI table from AWS Glue

I am trying to perform a hard delete on a Hudi table, but I am unable to delete the data in the table. My setup is pretty straightforward: I use a normal Glue job to create the Hudi tables and use ...
Yashaswi Dondapati's user avatar
0 votes
0 answers
26 views

Apache Hudi - MOR | Getting same number of records in table and table_rt after each run

We are running an ingestion job on AWS-Glue using Pyspark which reads data from the source and writes it in HUDI | MOR. The HUDI configurations that we are using are as follows: "hoodie.table....
Harsh Kumar's user avatar
0 votes
0 answers
14 views

Spark session unable to handle multiple APIs hitting at the same time

I have an API named getVisualier. When I hit it multiple times within milliseconds, I get no response, but when I hit the same API once, using Replay XHR, it is working ...
Shikhar Malviya's user avatar
0 votes
0 answers
34 views

Spark application running on AWS EMR throws error "this.fileSystem" is null while writing to hudi

We have a Spark application running on AWS EMR that computes results, but while writing to Hudi it throws the error below. We are also not sure whether it happens while writing to Hudi or while executing at the end (but ...
Gokul S's user avatar
  • 23
1 vote
1 answer
48 views

Is it possible to specifically handle Hudi exceptions in Pyspark

I am reading Hudi tables from S3, and sometimes the bucket or prefix may be empty and org.apache.hudi.exception.TableNotFoundException is thrown. Is there a way for me to import and handle these ...
lollerskates's user avatar
  • 1,114
0 votes
1 answer
44 views

Unsupported options found for 'hudi'

I'm testing Apache Hudi with the Flink SQL Client on a YARN cluster. When I try to create a Hudi catalog (as described), I get an error telling me that the hive.conf.dir and mode options are not ...
Niko's user avatar
  • 603
0 votes
0 answers
70 views

How to print hudi logs in aws emr serverless application

I have created an EMR Serverless application to run a Hudi Spark job, but neither the driver nor the executor logs contain any Hudi-related entries. I tried setting applicationProperties of the EMR Serverless app ...
Roobal Jindal's user avatar
0 votes
1 answer
47 views

"hoodie.parquet.max.file.size" and "hoodie.parquet.small.file.limit" Properties Are Being Ignored

I want my hoodie file size to be between small=50MB and max=100MB. The following configs are being used as map options for upsert: val hudiOptions = Map[String, String]( HoodieWriteConfig....
Amit Kumar's user avatar
1 vote
0 answers
101 views

pySpark hudi table partial updating with org.apache.hudi.common.model.PartialUpdateAvroPayload not working

I have two tables in S3: tableA with columns id, col1, col2 and col3, and tableB with columns id, col4 and col5. I want to write this data into another S3 location in Hudi format as tableC with columns id, col1, ...
JanakaRao's user avatar
  • 151
0 votes
1 answer
212 views

Using Minio, how to authenticate amazon s3 endpoint in java

So I have a Java app, java -jar utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig onetable.yaml, and I want it to connect to Minio: export AWS_ACCESS_KEY_ID=admin export AWS_SECRET_ACCESS_KEY=password ...
Albert T. Wong's user avatar
