Tags: linkedin/openhouse
Fix github build failure (#137)

## Summary
The build failed because of the missing `khttp` library required by the `springdoc-openapi` Gradle plugin. `khttp` was removed from the maven2 repository back in 2022, so it is unclear why previous builds succeeded (GitHub may cache some Gradle dependencies). The `springdoc-openapi` plugin itself dropped its `khttp` dependency in `1.5.0`; this PR upgrades the plugin to `1.6.0`, the latest version that Gradle 6 supports. More details in springdoc/springdoc-openapi-gradle-plugin#92.

## Changes
- [x] Bug Fixes

## Testing Done
- [x] No tests added or updated.
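The upgrade described above amounts to a one-line version bump in the build script. A minimal sketch, assuming the plugin is applied via the `plugins` block (the surrounding block shape is illustrative; only the plugin id and the `1.6.0` version come from the PR):

```groovy
// build.gradle — bump springdoc-openapi plugin to a version that no longer needs khttp
plugins {
    id 'org.springdoc.openapi-gradle-plugin' version '1.6.0' // latest version supported by Gradle 6
}
```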
Fix tables service tests due to removal of getAllTables api (#135)

## Summary
#127 did not clean up all of the tests related to the `getAllTables` API (it is unclear why that build succeeded). This PR removes those tests so the GitHub build passes.

## Changes
- [x] Tests

## Testing Done
- [x] Updated existing tests to reflect the changes made.
Make snapshots expiration job leaner (#131)

## Summary
The snapshots expiration job skips removing files, since we want to localize file removal to a single job. However, it still traversed the file tree, which is expensive and unnecessary. With this change, the job updates the snapshots list in the table metadata without traversing the file tree.

## Changes
- [x] Performance Improvements

## Testing Done
- [x] No tests added or updated.

No change in the job's effect is expected; this is purely an optimization.
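The described behavior maps onto Iceberg's public `ExpireSnapshots` API. A minimal sketch, assuming an already-loaded `Table` (the method wiring is illustrative, not the project's actual job code; `cleanExpiredFiles` is the Iceberg API that disables the file-tree walk):

```java
import org.apache.iceberg.Table;

// Sketch: expire old snapshots from table metadata only, leaving all
// data/manifest files untouched so no file-tree traversal is needed.
// File removal stays localized to the separate deletion job.
void expireSnapshotsMetadataOnly(Table table, long olderThanMillis) {
  table
      .expireSnapshots()
      .expireOlderThan(olderThanMillis)
      .cleanExpiredFiles(false) // skip file deletion and the expensive reachability scan
      .commit();
}
```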
Enabling access to swagger-ui using client without auth (#130)

## Summary
Problem: swagger-ui returns 401 when requesting api-docs via the browser.

<img width="989" alt="image" src="https://github.com/linkedin/openhouse/assets/25903091/7f0c2f35-676f-4999-b11d-c9601c0969a7">

Solution: expose swagger-ui to unauthenticated access, just like the existing non-UI endpoint `/v3/api-docs`.

Swagger-ui is useful for browsing API configuration, e.g. which client headers a given API endpoint requires.

## Changes
- [x] Documentation

## Testing Done
- [x] Manually Tested on local docker setup.
- [x] No tests added or updated.

I manually ran tests by configuring infra/recipes/docker-compose/oh-only/docker-compose.yml to also start the jobs REST service, then:

```bash
➜ JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_282-msft.jdk/Contents/Home ./gradlew clean build -x test -x javadoc
➜ docker compose -f infra/recipes/docker-compose/oh-only/docker-compose.yml down --rmi all
➜ docker compose -f infra/recipes/docker-compose/oh-only/docker-compose.yml up
```

then queried each endpoint:

tables
<img width="680" alt="image" src="https://github.com/linkedin/openhouse/assets/25903091/b5411c96-0f52-4010-96fc-ce4db8e34ac5">
housetables
<img width="680" alt="image" src="https://github.com/linkedin/openhouse/assets/25903091/6a2dda20-f501-4298-be5a-4919ed4a1075">
jobs
<img width="680" alt="image" src="https://github.com/linkedin/openhouse/assets/25903091/eb7a6cb8-1a28-4d23-b353-e2c6e0ca54a3">

> No tests added or updated. Please explain why.

When attempting to add unit tests (e.g. with MockMvcBuilder), the swagger endpoint returns 404 for our services. I believe this is because the service unit tests use a mocked version of the controller, and none of our controllers specify swagger; swagger is configured at Spring application start, so it would require a different testing method than a mocked controller. I tried a local tomcat/h2 server but still got 404s. I would be open to continuing to try unit tests given some pointers on testing a Spring Boot application server with swagger configured.
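A minimal sketch of what exempting the UI looks like in a Spring Security configuration, assuming springdoc's default paths; the surrounding configurer class and the exact matcher list are illustrative, not the project's actual security config:

```java
// Illustrative rule: open swagger-ui paths alongside the already
// unauthenticated /v3/api-docs endpoint; everything else stays authenticated.
http.authorizeRequests()
    .antMatchers("/v3/api-docs/**", "/swagger-ui/**", "/swagger-ui.html").permitAll()
    .anyRequest().authenticated();
```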
Data layout optimization (strategy generation). Part 2: data source for statistics/query logs (#109)

## Summary
This is part 2 of a new feature: the data layout optimization (DLO) library for strategy generation. It adds the data source interface and implementation, and builds on top of #108. Three components will eventually be added:

1. A DLO library with primitives for generating data layout optimization strategies.
2. An app that generates strategies for all tables.
3. Scheduling of the app.

## Changes
- [x] New Features

## Testing Done
- [x] Added new tests for the changes made.

## Additional Information
- [x] Large PR broken into smaller PRs, and PR plan linked in the description.
Data layout optimization (strategy generation). Part 1: strategy class with persistence (#108)

## Summary
This is part 1 of a new feature: the data layout optimization (DLO) library for strategy generation. It adds the strategy class and persistence utilities, and refactors the existing compaction app to use the library config. Three components will eventually be added:

1. A DLO library with primitives for generating data layout optimization strategies.
2. An app that generates strategies for all tables.
3. Scheduling of the app.

## Changes
- [x] New Features
- [x] Refactoring
- [x] Documentation

## Testing Done
- [x] Added new tests for the changes made.

## Additional Information
- [x] Large PR broken into smaller PRs, and PR plan linked in the description.
[PR4/5]: Add S3FileIO (#125)

## Summary
This is the fourth PR in a series adding support for S3 storage. The OpenHouse catalog currently supports HDFS as the storage backend; the end goal of this effort is to integrate S3 so that the storage backend can be configured as S3 or HDFS based on storage type. The work is split across five PRs:

1. Add the S3 storage type and `S3StorageClient`.
2. Add a base class for `StorageClient` and move common logic (such as property validation) there to avoid code duplication.
3. Add the `S3Storage` implementation that uses `S3StorageClient`.
4. Add support for using `S3FileIO` for the S3 storage type.
5. Add a recipe for end-to-end testing in docker.

This PR addresses item 4 by adding `S3FileIO`. Sushant has already done item 5, so this marks the completion of the S3 integration.

## Changes
- [x] New Features

## Testing Done
- [x] Manually Tested on local docker setup.

Tested by running the oh-s3-spark recipe in docker.

1. Run docker:

```
$ docker compose up -d
[+] Running 16/16
 ✔ Network oh-s3-spark_default            Created
 ✔ Container local.spark-master           Started
 ✔ Container oh-s3-spark-prometheus-1     Started
 ✔ Container local.mysql                  Started
 ✔ Container local.minioS3                Started
 ✔ Container local.opa                    Started
 ✔ Container local.spark-livy             Started
 ✔ Container local.spark-worker-a         Started
 ✔ Container local.minioClient            Started
 ✔ Container local.openhouse-housetables  Started
 ✔ Container local.openhouse-jobs         Started
 ✔ Container local.openhouse-tables       Started
 ! opa, spark-master, spark-worker-a, spark-livy: the requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
$ docker exec -it local.spark-master /bin/bash
```

<img width="1120" alt="Screenshot 2024-06-13 at 3 47 32 PM" src="https://github.com/linkedin/openhouse/assets/114708561/6e827de9-19ac-46a6-9786-cf2b5403ace7">

2. Log in to MinIO.

<img width="1402" alt="Screenshot 2024-06-13 at 3 48 36 PM" src="https://github.com/linkedin/openhouse/assets/114708561/4c49ad7d-82b7-45c2-b444-b2548ddfb43f">
<img width="1466" alt="Screenshot 2024-06-13 at 3 48 59 PM" src="https://github.com/linkedin/openhouse/assets/114708561/aafb55dc-d49e-45f9-9c31-60b6d88bbc4f">

3. Run spark-shell:

```
openhouse@cff3c38358c5:/opt/spark$ export AWS_REGION=us-east-1
openhouse@cff3c38358c5:/opt/spark$ bin/spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:1.2.0,software.amazon.awssdk:bundle:2.20.18,software.amazon.awssdk:url-connection-client:2.20.18 \
  --jars openhouse-spark-runtime_2.12-*-all.jar \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,com.linkedin.openhouse.spark.extensions.OpenhouseSparkSessionExtensions \
  --conf spark.sql.catalog.openhouse=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.openhouse.catalog-impl=com.linkedin.openhouse.spark.OpenHouseCatalog \
  --conf spark.sql.catalog.openhouse.metrics-reporter-impl=com.linkedin.openhouse.javaclient.OpenHouseMetricsReporter \
  --conf spark.sql.catalog.openhouse.uri=http://openhouse-tables:8080 \
  --conf spark.sql.catalog.openhouse.auth-token=$(cat /var/config/$(whoami).token) \
  --conf spark.sql.catalog.openhouse.cluster=LocalS3Cluster \
  --conf spark.sql.catalog.openhouse.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
  --conf spark.sql.catalog.openhouse.s3.endpoint=http://minioS3:9000 \
  --conf spark.sql.catalog.openhouse.s3.access-key-id=admin \
  --conf spark.sql.catalog.openhouse.s3.secret-access-key=password \
  --conf spark.sql.catalog.openhouse.s3.path-style-access=true
```

Ivy resolves all 10 modules from central (iceberg-spark-runtime-3.1_2.12;1.2.0; the awssdk bundle, url-connection-client, utils, annotations, http-client-spi, and metrics-spi at 2.20.18; eventstream;1.0.1; reactive-streams;1.0.3; slf4j-api;1.7.30) and the shell starts:

```
Spark context Web UI available at http://cff3c38358c5:4040
Spark context available as 'sc' (master = local[*], app id = local-1718326322630).
Spark session available as 'spark'.
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_232)

scala>
```

The bucket is empty at this point:

<img width="1467" alt="Screenshot 2024-06-13 at 5 50 19 PM" src="https://github.com/linkedin/openhouse/assets/114708561/8c4f2576-f867-4561-b680-9d340b9237a3">

4. Create table:

```
scala> spark.sql("CREATE TABLE openhouse.db.tb (ts timestamp, col1 string, col2 string) PARTITIONED BY (days(ts))").show()
++
||
++
++

scala> spark.sql("DESCRIBE TABLE openhouse.db.tb").show()
+--------------+---------+-------+
|      col_name|data_type|comment|
+--------------+---------+-------+
|            ts|timestamp|       |
|          col1|   string|       |
|          col2|   string|       |
|              |         |       |
|# Partitioning|         |       |
|        Part 0| days(ts)|       |
+--------------+---------+-------+
```

<img width="1463" alt="Screenshot 2024-06-13 at 5 53 55 PM" src="https://github.com/linkedin/openhouse/assets/114708561/a7cc8882-5099-43c5-b73b-6f779620a7fc">
<img width="1463" alt="Screenshot 2024-06-13 at 5 54 11 PM" src="https://github.com/linkedin/openhouse/assets/114708561/9f1d8665-1f9b-49a3-be3b-495561ecf3ce">
<img width="1452" alt="Screenshot 2024-06-13 at 5 54 22 PM" src="https://github.com/linkedin/openhouse/assets/114708561/efeea33e-f27a-44e7-ab95-a390e86e6893">

5. Add data:

```
scala> spark.sql("INSERT INTO TABLE openhouse.db.tb VALUES (current_timestamp(), 'val1', 'val2')")
res2: org.apache.spark.sql.DataFrame = []

scala> spark.sql("INSERT INTO TABLE openhouse.db.tb VALUES (date_sub(CAST(current_timestamp() as DATE), 30), 'val1', 'val2')")
res3: org.apache.spark.sql.DataFrame = []

scala> spark.sql("INSERT INTO TABLE openhouse.db.tb VALUES (date_sub(CAST(current_timestamp() as DATE), 60), 'val1', 'val2')")
res4: org.apache.spark.sql.DataFrame = []

scala> spark.sql("SELECT * FROM openhouse.db.tb").show()
+--------------------+----+----+
|                  ts|col1|col2|
+--------------------+----+----+
| 2024-05-15 00:00:00|val1|val2|
| 2024-04-15 00:00:00|val1|val2|
|2024-06-14 00:55:...|val1|val2|
+--------------------+----+----+

scala> spark.sql("SHOW TABLES IN openhouse.db").show()
+---------+---------+
|namespace|tableName|
+---------+---------+
|       db|       tb|
+---------+---------+
```

<img width="1460" alt="Screenshot 2024-06-13 at 5 56 15 PM" src="https://github.com/linkedin/openhouse/assets/114708561/593f731a-fda8-49d1-86fd-f34d40e4cab8">

Test using the table service. Create table:

```
$ curl "${curlArgs[@]}" -XPOST http://localhost:8000/v1/databases/d3/tables/ \
  --data-raw '{
    "tableId": "t1",
    "databaseId": "d3",
    "baseTableVersion": "INITIAL_VERSION",
    "clusterId": "LocalS3Cluster",
    "schema": "{\"type\": \"struct\", \"fields\": [{\"id\": 1,\"required\": true,\"name\": \"id\",\"type\": \"string\"},{\"id\": 2,\"required\": true,\"name\": \"name\",\"type\": \"string\"},{\"id\": 3,\"required\": true,\"name\": \"ts\",\"type\": \"timestamp\"}]}",
    "timePartitioning": {
      "columnName": "ts",
      "granularity": "HOUR"
    },
    "clustering": [
      {
        "columnName": "name"
      }
    ],
    "tableProperties": {
      "key": "value"
    }
  }' | json_pp
{
   "clusterId" : "LocalS3Cluster",
   "clustering" : [
      {
         "columnName" : "name",
         "transform" : null
      }
   ],
   "creationTime" : 1718327830198,
   "databaseId" : "d3",
   "lastModifiedTime" : 1718327830198,
   "policies" : null,
   "schema" : "{\"type\":\"struct\",\"schema-id\":0,\"fields\":[{\"id\":1,\"name\":\"id\",\"required\":true,\"type\":\"string\"},{\"id\":2,\"name\":\"name\",\"required\":true,\"type\":\"string\"},{\"id\":3,\"name\":\"ts\",\"required\":true,\"type\":\"timestamp\"}]}",
   "tableCreator" : "DUMMY_ANONYMOUS_USER",
   "tableId" : "t1",
   "tableLocation" : "s3://openhouse-bucket/d3/t1-394d8186-143f-482a-b5e5-e6aa6e382556/00000-8e498f9d-153e-412f-bb0e-476cfcab926d.metadata.json",
   "tableProperties" : {
      "key" : "value",
      "openhouse.clusterId" : "LocalS3Cluster",
      "openhouse.creationTime" : "1718327830198",
      "openhouse.databaseId" : "d3",
      "openhouse.lastModifiedTime" : "1718327830198",
      "openhouse.tableCreator" : "DUMMY_ANONYMOUS_USER",
      "openhouse.tableId" : "t1",
      "openhouse.tableLocation" : "s3://openhouse-bucket/d3/t1-394d8186-143f-482a-b5e5-e6aa6e382556/00000-8e498f9d-153e-412f-bb0e-476cfcab926d.metadata.json",
      "openhouse.tableType" : "PRIMARY_TABLE",
      "openhouse.tableUUID" : "394d8186-143f-482a-b5e5-e6aa6e382556",
      "openhouse.tableUri" : "LocalS3Cluster.d3.t1",
      "openhouse.tableVersion" : "INITIAL_VERSION",
      "policies" : "",
      "write.format.default" : "orc",
      "write.metadata.delete-after-commit.enabled" : "true",
      "write.metadata.previous-versions-max" : "28"
   },
   "tableType" : "PRIMARY_TABLE",
   "tableUUID" : "394d8186-143f-482a-b5e5-e6aa6e382556",
   "tableUri" : "LocalS3Cluster.d3.t1",
   "tableVersion" : "INITIAL_VERSION",
   "timePartitioning" : {
      "columnName" : "ts",
      "granularity" : "HOUR"
   }
}
```

<img width="1465" alt="Screenshot 2024-06-13 at 6 17 46 PM" src="https://github.com/linkedin/openhouse/assets/114708561/7f0f2e8a-dfb0-405e-a42c-9936ff6ed2bc">

Read the table back, which returns the same payload as above:

```
$ curl "${curlArgs[@]}" -XGET http://localhost:8000/v1/databases/d3/tables/t1 | json_pp
```

List tables, which returns the same payload wrapped in `"results" : [ ... ]`:

```
$ curl "${curlArgs[@]}" -XGET http://localhost:8000/v1/databases/d3/tables/ | json_pp
```

Delete table:

```
$ curl "${curlArgs[@]}" -XDELETE http://localhost:8000/v1/databases/d3/tables/t1
```

Validate that the table is deleted.

## Additional Information
- [x] Large PR broken into smaller PRs, and PR plan linked in the description.
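The spark-shell confs above rely on Iceberg's pluggable `FileIO` mechanism: the catalog instantiates whatever class the `io-impl` property names and hands it the remaining `s3.*` properties. A minimal sketch of that resolution, assuming iceberg-aws and the AWS SDK are on the classpath (the wrapper class is illustrative; `CatalogUtil.loadFileIO` and `S3FileIO` are real Iceberg classes, the property values are the ones from the recipe):

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.CatalogUtil;
import org.apache.iceberg.io.FileIO;

// Sketch: resolve the FileIO named by io-impl, as the catalog does internally.
// Pointing it at S3FileIO plus the s3.* properties switches storage to S3/MinIO.
public class S3FileIoSketch {
  public static FileIO load() {
    Map<String, String> props = new HashMap<>();
    props.put("s3.endpoint", "http://minioS3:9000"); // MinIO endpoint from the recipe
    props.put("s3.access-key-id", "admin");
    props.put("s3.secret-access-key", "password");
    props.put("s3.path-style-access", "true");
    return CatalogUtil.loadFileIO(
        "org.apache.iceberg.aws.s3.S3FileIO", props, null /* no Hadoop conf needed for S3 */);
  }
}
```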
Exclude metrics-core lib pulled in by hadoop yarn node manager (#126) ## Summary <!--- HINT: Replace #nnn with corresponding Issue number, if you are fixing an existing issue --> Hadoop 2.10.0 (i.e. the `hadoop-yarn` lib) pulls in a very old version (`3.0.1`) of the `com.codahale.metrics:metrics-core` lib. This lib is bundled into the tables.jar and jobs.jar fat jars. New methods such as [gauge](https://www.javadoc.io/doc/io.dropwizard.metrics/metrics-core/3.2.0/com/codahale/metrics/MetricRegistry.html#gauge-java.lang.String-com.codahale.metrics.MetricRegistry.MetricSupplier-) were added to `metrics-core` starting with version `3.2.0`. So when tables.jar or jobs.jar coexists with a higher version of `metrics-core` on the classpath, and some code uses the newer `MetricRegistry` APIs (such as `gauge`), the call fails with a method-not-found error. Hence, this change excludes the `metrics-core` lib: it is not used in the OSS codebase, and a higher version can always be pinned if needed.

```
| | | +--- org.apache.hadoop:hadoop-yarn-server-nodemanager:2.10.0
| | | |    +--- org.apache.hadoop:hadoop-yarn-common:2.10.0 (*)
| | | |    +--- org.apache.hadoop:hadoop-yarn-api:2.10.0 (*)
| | | |    +--- org.apache.hadoop:hadoop-yarn-registry:2.10.0 (*)
| | | |    +--- javax.xml.bind:jaxb-api:2.2.2 (*)
| | | |    +--- org.codehaus.jettison:jettison:1.1
| | | |    +--- commons-lang:commons-lang:2.6
| | | |    +--- javax.servlet:servlet-api:2.5
| | | |    +--- commons-codec:commons-codec:1.4 -> 1.9
| | | |    +--- com.sun.jersey:jersey-core:1.9
| | | |    +--- com.sun.jersey:jersey-client:1.9 (*)
| | | |    +--- org.mortbay.jetty:jetty-util:6.1.26
| | | |    +--- com.google.guava:guava:11.0.2 -> 31.1-jre (*)
| | | |    +--- commons-logging:commons-logging:1.1.3 -> 1.2
| | | |    +--- org.slf4j:slf4j-api:1.7.25 -> 1.7.36
| | | |    +--- com.google.protobuf:protobuf-java:2.5.0
| | | |    +--- com.codahale.metrics:metrics-core:3.0.1
```

Method not found error:

```
com.codahale.metrics.MetricRegistry.gauge(Ljava/lang/String;Lcom/codahale/metrics/MetricRegistry$MetricSupplier;)Lcom/codahale/metrics/Gauge;
```

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests
- [x] Lib exclusion

For all the boxes checked, please include additional details of the changes made in this pull request.

## Testing Done

<!--- Check any relevant boxes with "x" -->

- [x] Manually Tested on local docker setup. Please include commands ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in production. Please explain.

`./gradlew clean build` passed. Tested using the local docker setup.

Create table:

```
anath1@anath1-mn1 oh-hadoop-spark % curl "${curlArgs[@]}" -XPOST http://localhost:8000/v1/databases/d3/tables/ \
  --data-raw '{
    "tableId": "t1",
    "databaseId": "d3",
    "baseTableVersion": "INITIAL_VERSION",
    "clusterId": "LocalHadoopCluster",
    "schema": "{\"type\": \"struct\", \"fields\": [{\"id\": 1,\"required\": true,\"name\": \"id\",\"type\": \"string\"},{\"id\": 2,\"required\": true,\"name\": \"name\",\"type\": \"string\"},{\"id\": 3,\"required\": true,\"name\": \"ts\",\"type\": \"timestamp\"}]}",
    "timePartitioning": { "columnName": "ts", "granularity": "HOUR" },
    "clustering": [ { "columnName": "name" } ],
    "tableProperties": { "key": "value" }
  }'
{"tableId":"t1","databaseId":"d3","clusterId":"LocalHadoopCluster","tableUri":"LocalHadoopCluster.d3.t1","tableUUID":"12b090ff-0dce-487f-8e74-5d18c55c68da","tableLocation":"hdfs://namenode:9000/data/openhouse/d3/t1-12b090ff-0dce-487f-8e74-5d18c55c68da/00000-9a3a852b-26d7-43f6-8340-c0687466c3f5.metadata.json","tableVersion":"INITIAL_VERSION","tableCreator":"DUMMY_ANONYMOUS_USER","schema":"{\"type\":\"struct\",\"schema-id\":0,\"fields\":[{\"id\":1,\"name\":\"id\",\"required\":true,\"type\":\"string\"},{\"id\":2,\"name\":\"name\",\"required\":true,\"type\":\"string\"},{\"id\":3,\"name\":\"ts\",\"required\":true,\"type\":\"timestamp\"}]}","lastModifiedTime":1718344489040,"creationTime":1718344489040,"tableProperties":{"policies":"","write.metadata.delete-after-commit.enabled":"true","openhouse.tableId":"t1","openhouse.clusterId":"LocalHadoopCluster","openhouse.lastModifiedTime":"1718344489040","openhouse.tableVersion":"INITIAL_VERSION","openhouse.creationTime":"1718344489040","openhouse.tableUri":"LocalHadoopCluster.d3.t1","write.format.default":"orc","write.metadata.previous-versions-max":"28","openhouse.databaseId":"d3","openhouse.tableType":"PRIMARY_TABLE","openhouse.tableLocation":"/data/openhouse/d3/t1-12b090ff-0dce-487f-8e74-5d18c55c68da/00000-9a3a852b-26d7-43f6-8340-c0687466c3f5.metadata.json","openhouse.tableUUID":"12b090ff-0dce-487f-8e74-5d18c55c68da","key":"value","openhouse.tableCreator":"DUMMY_ANONYMOUS_USER"},"timePartitioning":{"columnName":"ts","granularity":"HOUR"},"clustering":[{"columnName":"name","transform":null}],"policies":null,"tableType":"PRIMARY_TABLE"}
```

List table:

```
anath1@anath1-mn1 oh-hadoop-spark % curl "${curlArgs[@]}" -XGET http://localhost:8000/v1/databases/d3/tables/
{"results":[{"tableId":"t1","databaseId":"d3","clusterId":"LocalHadoopCluster","tableUri":"LocalHadoopCluster.d3.t1","tableUUID":"12b090ff-0dce-487f-8e74-5d18c55c68da","tableLocation":"hdfs://namenode:9000/data/openhouse/d3/t1-12b090ff-0dce-487f-8e74-5d18c55c68da/00000-9a3a852b-26d7-43f6-8340-c0687466c3f5.metadata.json","tableVersion":"INITIAL_VERSION","tableCreator":"DUMMY_ANONYMOUS_USER","schema":"{\"type\":\"struct\",\"schema-id\":0,\"fields\":[{\"id\":1,\"name\":\"id\",\"required\":true,\"type\":\"string\"},{\"id\":2,\"name\":\"name\",\"required\":true,\"type\":\"string\"},{\"id\":3,\"name\":\"ts\",\"required\":true,\"type\":\"timestamp\"}]}","lastModifiedTime":1718344489040,"creationTime":1718344489040,"tableProperties":{"policies":"","write.metadata.delete-after-commit.enabled":"true","openhouse.tableId":"t1","openhouse.clusterId":"LocalHadoopCluster","openhouse.lastModifiedTime":"1718344489040","openhouse.tableVersion":"INITIAL_VERSION","openhouse.creationTime":"1718344489040","openhouse.tableUri":"LocalHadoopCluster.d3.t1","write.format.default":"orc","write.metadata.previous-versions-max":"28","openhouse.databaseId":"d3","openhouse.tableType":"PRIMARY_TABLE","openhouse.tableLocation":"/data/openhouse/d3/t1-12b090ff-0dce-487f-8e74-5d18c55c68da/00000-9a3a852b-26d7-43f6-8340-c0687466c3f5.metadata.json","openhouse.tableUUID":"12b090ff-0dce-487f-8e74-5d18c55c68da","key":"value","openhouse.tableCreator":"DUMMY_ANONYMOUS_USER"},"timePartitioning":{"columnName":"ts","granularity":"HOUR"},"clustering":[{"columnName":"name","transform":null}],"policies":null,"tableType":"PRIMARY_TABLE"}]}
```

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.
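The transitive exclusion described in #126 can be expressed in a Gradle build script roughly as follows. This is a minimal sketch, not the exact change from the PR; the configuration scope and the choice of excluding globally versus per-dependency in the actual OpenHouse build may differ:

```groovy
// Sketch: drop the stale metrics-core 3.0.1 that hadoop-yarn pulls in
// transitively, so it is not bundled into the fat jars.
// Applying the exclude to all configurations is one option; it can also
// be scoped to the single hadoop-yarn dependency instead.
configurations.all {
    exclude group: 'com.codahale.metrics', module: 'metrics-core'
}
```

Excluding the module (rather than forcing a newer version) matches the PR's reasoning: the lib is unused in the OSS codebase, and consumers remain free to pin whatever `metrics-core` version they need on their own classpath.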