Questions tagged [apache-hudi]
Apache Hudi is a transactional data lake platform with a focus on batch and event processing (with ACID support). Use this tag for questions specific to Apache Hudi. Do not use this tag for general data lake or Delta Lake questions.
apache-hudi
190
questions
0
votes
0
answers
38
views
Apache Hudi: Ingesting protobuf data from Kafka
I am exploring using Apache Hudi's HoodieStreamer to ingest protobuf messages from Kafka into Hudi.
Despite many attempts, I have hit a roadblock.
I get an exception while HoodieStreamer tries ...
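For context, the protobuf path in HoodieStreamer is typically wired up with `--source-class org.apache.hudi.utilities.sources.ProtoKafkaSource` and `--schemaprovider-class org.apache.hudi.utilities.schema.ProtoClassBasedSchemaProvider`. A minimal props-file sketch follows; the topic and class names are examples, and the exact property prefixes vary between Hudi releases, so check the docs for your version:

```properties
# Sketch of streamer properties for a protobuf Kafka source (names are examples)
bootstrap.servers=localhost:9092
auto.offset.reset=earliest
hoodie.deltastreamer.source.kafka.topic=my_topic
# fully-qualified generated protobuf class, which must be on the job classpath
hoodie.deltastreamer.schemaprovider.proto.class.name=com.example.MyMessage
```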
0
votes
0
answers
35
views
Unable to sync non-partitioned Hudi table with BigQuery
I'm trying to write my structured streaming data to Apache Hudi as a non-partitioned table and then sync it with BigQuery. But even though it is a new table and I've set no partitioning ...
0
votes
1
answer
57
views
Issues while writing xml data to hudi table in azure synapse notebook
I've successfully read blob data (XML) from a container in an Azure Synapse notebook and displayed the dataframe df as needed; however, while writing it as a Hudi table to Azure Data Lake Storage Gen2, I've ...
0
votes
0
answers
16
views
How to detect and create an alarm for a Hudi job failure using hoodie metrics via Prometheus
Problem: While using the multi delta streamer for Kafka ingestion across many tables, if one table's ingestion fails, the job still succeeds. There is no way to check success/failure for a particular ...
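One workaround for this kind of silent per-table failure is to alert on the absence of per-table commit metrics rather than on the job's exit code. A sketch of a Prometheus alerting rule, assuming the table pushes a commit metric through a Pushgateway (the metric name here is hypothetical; substitute whatever name your setup actually emits):

```yaml
# Sketch: fire when a table has produced no commit metric for 30 minutes
groups:
  - name: hudi-ingestion
    rules:
      - alert: HudiTableIngestionStalled
        expr: absent_over_time(my_table_commit_totalRecordsWritten[30m])
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "No Hudi commit metrics from my_table in 30m"
```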
1
vote
3
answers
90
views
Unexplained s3 slowdowns when ingesting data to hudi tables using spark/python Glue jobs
I'm using AWS Glue Spark/Python jobs to ingest data into Hudi tables in an S3 bucket. I'm hitting major S3 slowdown issues that go well beyond the reasonable, but I'm unable to pin down the root cause.
...
1
vote
0
answers
67
views
Spark-Hudi: Unable to perform Hard delete using Pyspark on HUDI table from AWS Glue
I am trying to perform a hard delete operation on a HUDI table, but I'm unable to delete the data in the table. My setup is pretty straightforward: I use a normal Glue job to create the Hudi tables and use ...
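For reference, a hard delete in Hudi is an ordinary write whose operation is set to `delete`, applied to a DataFrame containing the record keys to remove. A minimal sketch (table, key, and precombine field names are hypothetical):

```python
# Sketch of a hard delete from an AWS Glue / PySpark job (names are examples).
def hudi_delete_options(table_name, record_key, precombine_field):
    """Build the write options Hudi expects for a hard delete."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        # "delete" tells Hudi to remove the matching record keys entirely
        "hoodie.datasource.write.operation": "delete",
    }

# In the Glue job, pass a DataFrame containing only the keys to delete:
# (keys_df.write.format("hudi")
#     .options(**hudi_delete_options("my_table", "id", "updated_at"))
#     .mode("append")
#     .save("s3://bucket/path/my_table"))
```

Note that deletes are written in `append` mode, same as upserts; a common pitfall is writing them in `overwrite` mode, which replaces the table instead.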
0
votes
0
answers
26
views
Apache Hudi - MOR | Getting same number of records in table and table_rt after each run
We are running an ingestion job on AWS Glue using PySpark which reads data from the source and writes it as HUDI | MOR.
The HUDI configurations that we are using are as follows:
"hoodie.table....
0
votes
0
answers
14
views
spark session unable to handle multiple APIs hitting at the same time
I have an API named getVisualier. When I hit it multiple times within milliseconds I get no response, but when I hit the same API singly, using Replay XHR, it works ...
0
votes
0
answers
34
views
Spark application running on AWS EMR throws error "this.fileSystem" is null while writing to hudi
We have a Spark application running on AWS EMR which computes results, but while writing to Hudi it throws the error below. I'm also not sure whether it fails while writing to Hudi or while executing at the end (but ...
1
vote
1
answer
48
views
Is it possible to specifically handle Hudi exceptions in Pyspark
I am reading Hudi tables from S3 and sometimes the bucket or prefix may be empty, so org.apache.hudi.exception.TableNotFoundException is thrown. Is there a way for me to import and handle these ...
0
votes
1
answer
44
views
Unsupported options found for 'hudi'
I'm testing Apache Hudi with the Flink SQL Client on a YARN cluster. When I try to create a Hudi catalog (as described), I'm facing an error telling me that the hive.conf.dir and mode options are not ...
0
votes
0
answers
70
views
How to print hudi logs in aws emr serverless application
I have created an EMR Serverless application to run a Hudi Spark job, but neither the driver nor the executor logs contain anything related to Hudi.
I tried setting applicationProperties of the EMR Serverless app ...
0
votes
1
answer
47
views
"hoodie.parquet.max.file.size" and "hoodie.parquet.small.file.limit" Property is Being Ignored
I want my hoodie file size to be between small=50MB and max=100MB.
The following configs are being used as map options for upsert:
val hudiOptions = Map[String, String](
HoodieWriteConfig....
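As background for this pair of settings: `hoodie.parquet.max.file.size` caps the size of a parquet file a write may produce, while `hoodie.parquet.small.file.limit` marks files below that size as candidates to receive new inserts during upsert bin-packing. Both take byte values as strings; a common reason they appear ignored is passing megabyte numbers instead of bytes. A minimal sketch of the 50MB/100MB targets from the question:

```python
# Sketch of the sizing options from the question, expressed in bytes.
MB = 1024 * 1024

sizing_options = {
    # files below this size are candidates to receive new inserts on upsert
    "hoodie.parquet.small.file.limit": str(50 * MB),
    # target ceiling for a parquet file produced by a write
    "hoodie.parquet.max.file.size": str(100 * MB),
}
```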
1
vote
0
answers
101
views
pySpark hudi table partial updating with org.apache.hudi.common.model.PartialUpdateAvroPayload not working
I have two tables in S3: tableA with columns id, col1, col2 and col3, and tableB with columns id, col4 and col5.
I want to write this data into another S3 location in Hudi format as tableC with columns id, col1, ...
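For context, `org.apache.hudi.common.model.PartialUpdateAvroPayload` merges an incoming record into the existing one, keeping the existing value where the incoming column is null, so both sources must be written against a schema containing all of tableC's (nullable) columns. A sketch of the write options involved (table and field names are hypothetical):

```python
# Sketch: options enabling partial updates, where null columns in the incoming
# batch keep the value already stored in the table (names are examples).
partial_update_options = {
    "hoodie.table.name": "tableC",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.payload.class":
        "org.apache.hudi.common.model.PartialUpdateAvroPayload",
}
```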
0
votes
1
answer
212
views
Using Minio, how to authenticate amazon s3 endpoint in java
So I have a Java app
java -jar utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig onetable.yaml
I want it to connect to Minio
export AWS_ACCESS_KEY_ID=admin
export AWS_SECRET_ACCESS_KEY=password
...
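When the S3 access goes through Hadoop's s3a connector (as Hudi's utilities typically do), the AWS environment variables alone are often not enough; the endpoint and path-style access usually need to be set in the Hadoop configuration as well. A core-site.xml sketch, assuming a local MinIO (endpoint and credentials are examples):

```xml
<!-- core-site.xml sketch: point s3a at a local MinIO -->
<property><name>fs.s3a.endpoint</name><value>http://localhost:9000</value></property>
<property><name>fs.s3a.access.key</name><value>admin</value></property>
<property><name>fs.s3a.secret.key</name><value>password</value></property>
<!-- MinIO serves buckets at the path level, not as virtual hosts -->
<property><name>fs.s3a.path.style.access</name><value>true</value></property>
```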