
Tags: linkedin/openhouse


v0.5.87

[1/5] Set up basic Azure sandbox environment, providers, etc. (#136)

v0.5.86

Fix github build failure (#137)

## Summary

The build failed because of the missing `khttp` library required by the
`springdoc-openapi` plugin. This library has been removed from the maven2
repo since 2022, so it is unclear why the previous build succeeded (perhaps
GitHub caches some Gradle libraries). The `springdoc-openapi` plugin also
removed its usage of `khttp` in `1.5.0`. This PR upgrades the plugin to
`1.6.0`, the latest version that Gradle 6 supports.

More details in this issue:
springdoc/springdoc-openapi-gradle-plugin#92.
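
As a quick local sanity check of the bump, something along these lines should
work (a sketch only; `generateOpenApiDocs` is the task registered by the
springdoc-openapi Gradle plugin, and the flags mirror the build commands used
elsewhere in these notes):

```bash
# Rebuild with the upgraded plugin, then regenerate the OpenAPI docs.
./gradlew clean build
./gradlew generateOpenApiDocs
```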

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [x] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [x] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.

v0.5.85

Fix tables service tests due to removal of getAllTables api (#135)

## Summary

The earlier [PR](#127) did not clean up all the tests related to the
getAllTables API (it is unclear why that build succeeded). This PR removes
those tests so that the GitHub build passes.

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [x] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [x] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.

v0.5.84

Make snapshots expiration job leaner (#131)

## Summary
The snapshots expiration job skips removing files, since we wanted to
localize file removal to a single job. The job still traverses the file
tree, however, which is expensive and unnecessary. With this change, it
updates the snapshot list in the metadata without traversing the file
tree.
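
For context, a minimal sketch (not this PR's code) of the Iceberg primitive
this relies on: `ExpireSnapshots.cleanExpiredFiles(false)` rewrites the
snapshot list in table metadata without deleting any files. The spark-shell
setup and table name are illustrative, borrowed from the v0.5.80 transcript
below:

```
scala> import org.apache.iceberg.spark.Spark3Util

scala> val table = Spark3Util.loadIcebergTable(spark, "openhouse.db.tb")

scala> // metadata-only expiry: drop old snapshots, leave file removal to the dedicated deletion job
scala> table.expireSnapshots().expireOlderThan(System.currentTimeMillis() - 3L * 24 * 3600 * 1000).cleanExpiredFiles(false).commit()
```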

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [x] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [x] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

No change in the job's effect is expected; this is purely an optimization.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.

v0.5.83

Enabling access to swagger-ui using client without auth (#130)

## Summary
Problem: Swagger UI returns 401 when requesting api-docs via the browser
<img width="989" alt="image"
src="https://github.com/linkedin/openhouse/assets/25903091/7f0c2f35-676f-4999-b11d-c9601c0969a7">

Solution: expose Swagger UI to unauthenticated access (just like the
existing, non-UI endpoint `/v3/api-docs`).
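
A quick way to confirm the behavior (a sketch; the port and the tables
endpoint follow the docker recipes used elsewhere in these notes, and
springdoc serves the UI at `/swagger-ui/index.html`):

```bash
# Both should now return 200 without an Authorization header.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v3/api-docs
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/swagger-ui/index.html
```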

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [x] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

Swagger UI is useful for browsing API configuration, for example to see
which client headers are required for a given API endpoint.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [x] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [x] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

I tested manually by configuring
infra/recipes/docker-compose/oh-only/docker-compose.yml to also start
the jobs REST service, then:
```bash
➜ JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_282-msft.jdk/Contents/Home ./gradlew clean build -x test -x javadoc
➜ docker compose -f infra/recipes/docker-compose/oh-only/docker-compose.yml down --rmi all
➜ docker compose -f infra/recipes/docker-compose/oh-only/docker-compose.yml up
```

then querying each endpoint:
tables
<img width="680" alt="image"
src="https://github.com/linkedin/openhouse/assets/25903091/b5411c96-0f52-4010-96fc-ce4db8e34ac5">
housetables
<img width="680" alt="image"
src="https://github.com/linkedin/openhouse/assets/25903091/6a2dda20-f501-4298-be5a-4919ed4a1075">
jobs
<img width="680" alt="image"
src="https://github.com/linkedin/openhouse/assets/25903091/eb7a6cb8-1a28-4d23-b353-e2c6e0ca54a3">

> No tests added or updated. Please explain why. If unsure, please
> feel free to ask for help.

When attempting to add unit tests, e.g. using MockMvcBuilder, the
swagger endpoint returns 404 for our services. I believe this is because
the service unit tests use a mocked version of the controller, but none
of our controllers specify swagger. Swagger is configured at Spring
application start, so it would require a different method of testing
than using a mocked controller. I also tried using a local Tomcat/H2
server but still got 404s.

I would be open to continuing to work on unit tests, given some pointers
in the right direction for testing a Spring Boot application server with
Swagger configured.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.

v0.5.82

Data layout optimization (strategy generation). Part 2: data source for statistics/query logs (#109)

## Summary

This is part 2 of a new feature: data layout optimization library,
strategy generation.
Added a data source interface and implementation. This PR builds on top of
#108.

The following 3 components will be added eventually:
1) DLO library that has primitives for generating data layout
optimization strategies
2) App that generates strategies for all tables
3) Scheduling of the app

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [x] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [x] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [x] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.

v0.5.81

Data layout optimization (strategy generation). Part 1: strategy class with persistence (#108)

## Summary
This is part 1 of a new feature: data layout optimization library,
strategy generation.
Added strategy class and persistence utilities. Refactored existing
compaction app to use the library config.

The following 3 components will be added eventually:
1) DLO library that has primitives for generating data layout
optimization strategies
2) App that generates strategies for all tables
3) Scheduling of the app


## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [x] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [x] Refactoring
- [x] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [x] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [x] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.

v0.5.80

[PR4/5]: Add S3FileIO (#125)

## Summary
This is the fourth PR in a sequence of PRs to add support for S3
storage.

The OpenHouse catalog currently supports HDFS as the storage backend. The
end goal of this effort is to add S3 integration so that the storage
backend can be configured as S3 or HDFS based on the storage type.

The entire work is done via a series of PRs:
1. Add S3 Storage type and S3StorageClient.
2. Add base class for StorageClient and move common logic like
validation of properties there to avoid code duplication.
3. Add S3Storage implementation that uses S3StorageClient.
4. Add support for using S3FileIO for S3 storage type.
5. Add a recipe for end-to-end testing in docker.

This PR addresses 4 by adding S3FileIO.

Sushant has already done 5, so this marks the completion of the S3
integration.
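
The S3-specific wiring boils down to a handful of catalog properties,
distilled from the full spark-shell invocation in the testing transcript
below (the endpoint and credentials are the local MinIO defaults used by
the recipe):

```
spark.sql.catalog.openhouse.io-impl=org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.openhouse.s3.endpoint=http://minioS3:9000
spark.sql.catalog.openhouse.s3.access-key-id=admin
spark.sql.catalog.openhouse.s3.secret-access-key=password
spark.sql.catalog.openhouse.s3.path-style-access=true
```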

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [x] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
Tested by running the oh-s3-spark recipe in docker.

1. Run docker:

```
$ docker compose up -d
[+] Building 0.0s (0/0) docker:desktop-linux
[+] Running 16/16
✔ Network oh-s3-spark_default Cre... 0.0s
✔ Container local.spark-master St... 0.2s
✔ Container oh-s3-spark-prometheus-1 Started 0.2s
✔ Container local.mysql Started 0.2s
✔ Container local.minioS3 Started 0.2s
✔ Container local.opa Started 0.2s
! opa The requested image's platform (linux/amd64) does not match the
detected host platform (linux/arm64/v8) and no specific platform was
requested 0.0s
! spark-master The requested image's platform (linux/amd64) does not
match the detected host platform (linux/arm64/v8) and no specific
platform was requested 0.0s
✔ Container local.spark-livy Star... 0.1s
✔ Container local.spark-worker-a Started 0.1s
✔ Container local.minioClient Sta... 0.0s
✔ Container local.openhouse-housetables Started 0.0s
! spark-worker-a The requested image's platform (linux/amd64) does not
match the detected host platform (linux/arm64/v8) and no specific
platform was requested 0.0s
! spark-livy The requested image's platform (linux/amd64) does not match
the detected host platform (linux/arm64/v8) and no specific platform was
requested 0.0s
✔ Container local.openhouse-jobs Started 0.0s
✔ Container local.openhouse-tables Started 0.0s
lajain-mn2:oh-s3-spark lajain$ docker exec -it local.spark-master /bin/bash
```

<img width="1120" alt="Screenshot 2024-06-13 at 3 47 32 PM"
src="https://github.com/linkedin/openhouse/assets/114708561/6e827de9-19ac-46a6-9786-cf2b5403ace7">

2. Log in to MinIO.
<img width="1402" alt="Screenshot 2024-06-13 at 3 48 36 PM"
src="https://github.com/linkedin/openhouse/assets/114708561/4c49ad7d-82b7-45c2-b444-b2548ddfb43f">

<img width="1466" alt="Screenshot 2024-06-13 at 3 48 59 PM"
src="https://github.com/linkedin/openhouse/assets/114708561/aafb55dc-d49e-45f9-9c31-60b6d88bbc4f">

3. Run spark shell:
```
openhouse@cff3c38358c5:/opt/spark$ export AWS_REGION=us-east-1
openhouse@cff3c38358c5:/opt/spark$ bin/spark-shell --packages
org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:1.2.0,software.amazon.awssdk:bundle:2.20.18,software.amazon.awssdk:url-connection-client:2.20.18
\
>   --jars openhouse-spark-runtime_2.12-*-all.jar  \
> --conf
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,com.linkedin.openhouse.spark.extensions.OpenhouseSparkSessionExtensions
\
> --conf
spark.sql.catalog.openhouse=org.apache.iceberg.spark.SparkCatalog \
> --conf
spark.sql.catalog.openhouse.catalog-impl=com.linkedin.openhouse.spark.OpenHouseCatalog
\
> --conf
spark.sql.catalog.openhouse.metrics-reporter-impl=com.linkedin.openhouse.javaclient.OpenHouseMetricsReporter
\
> --conf spark.sql.catalog.openhouse.uri=http://openhouse-tables:8080 \
> --conf spark.sql.catalog.openhouse.auth-token=$(cat
/var/config/$(whoami).token) \
>   --conf spark.sql.catalog.openhouse.cluster=LocalS3Cluster  \
> --conf
spark.sql.catalog.openhouse.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>   --conf spark.sql.catalog.openhouse.s3.endpoint=http://minioS3:9000 \
>   --conf spark.sql.catalog.openhouse.s3.access-key-id=admin \
>   --conf spark.sql.catalog.openhouse.s3.secret-access-key=password \
>   --conf spark.sql.catalog.openhouse.s3.path-style-access=true
:: loading settings :: url =
jar:file:/opt/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/openhouse/.ivy2/cache
The jars for the packages stored in: /home/openhouse/.ivy2/jars
org.apache.iceberg#iceberg-spark-runtime-3.1_2.12 added as a dependency
software.amazon.awssdk#bundle added as a dependency
software.amazon.awssdk#url-connection-client added as a dependency
:: resolving dependencies ::
org.apache.spark#spark-submit-parent-9d1e3e8a-c713-44f0-93b3-bd5029daa8e2;1.0
	confs: [default]
found org.apache.iceberg#iceberg-spark-runtime-3.1_2.12;1.2.0 in central

	found software.amazon.awssdk#bundle;2.20.18 in central
	found software.amazon.eventstream#eventstream;1.0.1 in central
	found software.amazon.awssdk#url-connection-client;2.20.18 in central
	found software.amazon.awssdk#utils;2.20.18 in central
	found org.reactivestreams#reactive-streams;1.0.3 in central
	found software.amazon.awssdk#annotations;2.20.18 in central
	found org.slf4j#slf4j-api;1.7.30 in central
	found software.amazon.awssdk#http-client-spi;2.20.18 in central
	found software.amazon.awssdk#metrics-spi;2.20.18 in central
downloading
https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.1_2.12/1.2.0/iceberg-spark-runtime-3.1_2.12-1.2.0.jar
...
[SUCCESSFUL ]
org.apache.iceberg#iceberg-spark-runtime-3.1_2.12;1.2.0!iceberg-spark-runtime-3.1_2.12.jar
(966ms)
downloading
https://repo1.maven.org/maven2/software/amazon/awssdk/bundle/2.20.18/bundle-2.20.18.jar
...
	[SUCCESSFUL ] software.amazon.awssdk#bundle;2.20.18!bundle.jar (8889ms)
downloading
https://repo1.maven.org/maven2/software/amazon/awssdk/url-connection-client/2.20.18/url-connection-client-2.20.18.jar
...
[SUCCESSFUL ]
software.amazon.awssdk#url-connection-client;2.20.18!url-connection-client.jar
(66ms)
downloading
https://repo1.maven.org/maven2/software/amazon/eventstream/eventstream/1.0.1/eventstream-1.0.1.jar
...
[SUCCESSFUL ]
software.amazon.eventstream#eventstream;1.0.1!eventstream.jar (59ms)
downloading
https://repo1.maven.org/maven2/software/amazon/awssdk/utils/2.20.18/utils-2.20.18.jar
...
	[SUCCESSFUL ] software.amazon.awssdk#utils;2.20.18!utils.jar (66ms)
downloading
https://repo1.maven.org/maven2/software/amazon/awssdk/annotations/2.20.18/annotations-2.20.18.jar
...
[SUCCESSFUL ] software.amazon.awssdk#annotations;2.20.18!annotations.jar
(62ms)
downloading
https://repo1.maven.org/maven2/software/amazon/awssdk/http-client-spi/2.20.18/http-client-spi-2.20.18.jar
...
[SUCCESSFUL ]
software.amazon.awssdk#http-client-spi;2.20.18!http-client-spi.jar
(61ms)
downloading
https://repo1.maven.org/maven2/org/reactivestreams/reactive-streams/1.0.3/reactive-streams-1.0.3.jar
...
[SUCCESSFUL ]
org.reactivestreams#reactive-streams;1.0.3!reactive-streams.jar (58ms)
downloading
https://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.7.30/slf4j-api-1.7.30.jar
...
	[SUCCESSFUL ] org.slf4j#slf4j-api;1.7.30!slf4j-api.jar (59ms)
downloading
https://repo1.maven.org/maven2/software/amazon/awssdk/metrics-spi/2.20.18/metrics-spi-2.20.18.jar
...
[SUCCESSFUL ] software.amazon.awssdk#metrics-spi;2.20.18!metrics-spi.jar
(56ms)
:: resolution report :: resolve 153033ms :: artifacts dl 10382ms
	:: modules in use:
org.apache.iceberg#iceberg-spark-runtime-3.1_2.12;1.2.0 from central in
[default]
	org.reactivestreams#reactive-streams;1.0.3 from central in [default]
	org.slf4j#slf4j-api;1.7.30 from central in [default]
	software.amazon.awssdk#annotations;2.20.18 from central in [default]
	software.amazon.awssdk#bundle;2.20.18 from central in [default]
software.amazon.awssdk#http-client-spi;2.20.18 from central in [default]
	software.amazon.awssdk#metrics-spi;2.20.18 from central in [default]
software.amazon.awssdk#url-connection-client;2.20.18 from central in
[default]
	software.amazon.awssdk#utils;2.20.18 from central in [default]
	software.amazon.eventstream#eventstream;1.0.1 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   10  |   10  |   10  |   0   ||   10  |   10  |
	---------------------------------------------------------------------
:: retrieving ::
org.apache.spark#spark-submit-parent-9d1e3e8a-c713-44f0-93b3-bd5029daa8e2
	confs: [default]
	10 artifacts copied, 0 already retrieved (480145kB/974ms)
2024-06-14 00:51:50,318 WARN util.NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java classes
where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
setLogLevel(newLevel).
Spark context Web UI available at http://cff3c38358c5:4040
Spark context available as 'sc' (master = local[*], app id =
local-1718326322630).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/
         
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_232)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
```

The bucket right now is empty:
<img width="1467" alt="Screenshot 2024-06-13 at 5 50 19 PM"
src="https://github.com/linkedin/openhouse/assets/114708561/8c4f2576-f867-4561-b680-9d340b9237a3">

4. Create table:
scala> spark.sql("CREATE TABLE openhouse.db.tb (ts timestamp, col1
string, col2 string) PARTITIONED BY (days(ts))").show()
++
||
++
++


scala> spark.sql("DESCRIBE TABLE openhouse.db.tb").show()
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for
further details.
+--------------+---------+-------+
|      col_name|data_type|comment|
+--------------+---------+-------+
|            ts|timestamp|       |
|          col1|   string|       |
|          col2|   string|       |
|              |         |       |
|# Partitioning|         |       |
|        Part 0| days(ts)|       |
+--------------+---------+-------+
```

<img width="1463" alt="Screenshot 2024-06-13 at 5 53 55 PM"
src="https://github.com/linkedin/openhouse/assets/114708561/a7cc8882-5099-43c5-b73b-6f779620a7fc">

<img width="1463" alt="Screenshot 2024-06-13 at 5 54 11 PM"
src="https://github.com/linkedin/openhouse/assets/114708561/9f1d8665-1f9b-49a3-be3b-495561ecf3ce">

<img width="1452" alt="Screenshot 2024-06-13 at 5 54 22 PM"
src="https://github.com/linkedin/openhouse/assets/114708561/efeea33e-f27a-44e7-ab95-a390e86e6893">

5. Add data:

```
scala> spark.sql("INSERT INTO TABLE openhouse.db.tb VALUES
(current_timestamp(), 'val1', 'val2')")
res2: org.apache.spark.sql.DataFrame = []

scala> spark.sql("INSERT INTO TABLE openhouse.db.tb VALUES
(date_sub(CAST(current_timestamp() as DATE), 30), 'val1', 'val2')")
res3: org.apache.spark.sql.DataFrame = []

scala> spark.sql("INSERT INTO TABLE openhouse.db.tb VALUES
(date_sub(CAST(current_timestamp() as DATE), 60), 'val1', 'val2')")
res4: org.apache.spark.sql.DataFrame = []

scala> spark.sql("SELECT * FROM openhouse.db.tb").show()
+--------------------+----+----+
|                  ts|col1|col2|
+--------------------+----+----+
| 2024-05-15 00:00:00|val1|val2|
| 2024-04-15 00:00:00|val1|val2|
|2024-06-14 00:55:...|val1|val2|
+--------------------+----+----+


scala> spark.sql("SHOW TABLES IN openhouse.db").show()
+---------+---------+
|namespace|tableName|
+---------+---------+
|       db|       tb|
+---------+---------+
```

<img width="1460" alt="Screenshot 2024-06-13 at 5 56 15 PM"
src="https://github.com/linkedin/openhouse/assets/114708561/593f731a-fda8-49d1-86fd-f34d40e4cab8">

Test using the table service:
```
$ curl "${curlArgs[@]}" -XPOST
http://localhost:8000/v1/databases/d3/tables/ \
> --data-raw '{
>   "tableId": "t1",
>   "databaseId": "d3",
>   "baseTableVersion": "INITIAL_VERSION",
>   "clusterId": "LocalS3Cluster",
> "schema": "{\"type\": \"struct\", \"fields\": [{\"id\":
1,\"required\": true,\"name\": \"id\",\"type\": \"string\"},{\"id\":
2,\"required\": true,\"name\": \"name\",\"type\": \"string\"},{\"id\":
3,\"required\": true,\"name\": \"ts\",\"type\": \"timestamp\"}]}",
>   "timePartitioning": {
>     "columnName": "ts",
>     "granularity": "HOUR"
>   },
>   "clustering": [
>     {
>       "columnName": "name"
>     }
>   ],
>   "tableProperties": {
>     "key": "value"
>   }
> }' | json_pp
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 2146 0 1576 100 570 4184 1513 --:--:-- --:--:-- --:--:-- 5692
{
   "clusterId" : "LocalS3Cluster",
   "clustering" : [
      {
         "columnName" : "name",
         "transform" : null
      }
   ],
   "creationTime" : 1718327830198,
   "databaseId" : "d3",
   "lastModifiedTime" : 1718327830198,
   "policies" : null,
"schema" :
"{\"type\":\"struct\",\"schema-id\":0,\"fields\":[{\"id\":1,\"name\":\"id\",\"required\":true,\"type\":\"string\"},{\"id\":2,\"name\":\"name\",\"required\":true,\"type\":\"string\"},{\"id\":3,\"name\":\"ts\",\"required\":true,\"type\":\"timestamp\"}]}",
   "tableCreator" : "DUMMY_ANONYMOUS_USER",
   "tableId" : "t1",
"tableLocation" :
"s3://openhouse-bucket/d3/t1-394d8186-143f-482a-b5e5-e6aa6e382556/00000-8e498f9d-153e-412f-bb0e-476cfcab926d.metadata.json",
   "tableProperties" : {
      "key" : "value",
      "openhouse.clusterId" : "LocalS3Cluster",
      "openhouse.creationTime" : "1718327830198",
      "openhouse.databaseId" : "d3",
      "openhouse.lastModifiedTime" : "1718327830198",
      "openhouse.tableCreator" : "DUMMY_ANONYMOUS_USER",
      "openhouse.tableId" : "t1",
"openhouse.tableLocation" :
"s3://openhouse-bucket/d3/t1-394d8186-143f-482a-b5e5-e6aa6e382556/00000-8e498f9d-153e-412f-bb0e-476cfcab926d.metadata.json",
      "openhouse.tableType" : "PRIMARY_TABLE",
      "openhouse.tableUUID" : "394d8186-143f-482a-b5e5-e6aa6e382556",
      "openhouse.tableUri" : "LocalS3Cluster.d3.t1",
      "openhouse.tableVersion" : "INITIAL_VERSION",
      "policies" : "",
      "write.format.default" : "orc",
      "write.metadata.delete-after-commit.enabled" : "true",
      "write.metadata.previous-versions-max" : "28"
   },
   "tableType" : "PRIMARY_TABLE",
   "tableUUID" : "394d8186-143f-482a-b5e5-e6aa6e382556",
   "tableUri" : "LocalS3Cluster.d3.t1",
   "tableVersion" : "INITIAL_VERSION",
   "timePartitioning" : {
      "columnName" : "ts",
      "granularity" : "HOUR"
   }
}
```

<img width="1465" alt="Screenshot 2024-06-13 at 6 17 46 PM"
src="https://github.com/linkedin/openhouse/assets/114708561/7f0f2e8a-dfb0-405e-a42c-9936ff6ed2bc">

$ curl "${curlArgs[@]}" -XGET
http://localhost:8000/v1/databases/d3/tables/t1 | json_pp
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1576 0 1576 0 0 9477 0 --:--:-- --:--:-- --:--:-- 9493
{
   "clusterId" : "LocalS3Cluster",
   "clustering" : [
      {
         "columnName" : "name",
         "transform" : null
      }
   ],
   "creationTime" : 1718327830198,
   "databaseId" : "d3",
   "lastModifiedTime" : 1718327830198,
   "policies" : null,
"schema" :
"{\"type\":\"struct\",\"schema-id\":0,\"fields\":[{\"id\":1,\"name\":\"id\",\"required\":true,\"type\":\"string\"},{\"id\":2,\"name\":\"name\",\"required\":true,\"type\":\"string\"},{\"id\":3,\"name\":\"ts\",\"required\":true,\"type\":\"timestamp\"}]}",
   "tableCreator" : "DUMMY_ANONYMOUS_USER",
   "tableId" : "t1",
"tableLocation" :
"s3://openhouse-bucket/d3/t1-394d8186-143f-482a-b5e5-e6aa6e382556/00000-8e498f9d-153e-412f-bb0e-476cfcab926d.metadata.json",
   "tableProperties" : {
      "key" : "value",
      "openhouse.clusterId" : "LocalS3Cluster",
      "openhouse.creationTime" : "1718327830198",
      "openhouse.databaseId" : "d3",
      "openhouse.lastModifiedTime" : "1718327830198",
      "openhouse.tableCreator" : "DUMMY_ANONYMOUS_USER",
      "openhouse.tableId" : "t1",
"openhouse.tableLocation" :
"s3://openhouse-bucket/d3/t1-394d8186-143f-482a-b5e5-e6aa6e382556/00000-8e498f9d-153e-412f-bb0e-476cfcab926d.metadata.json",
      "openhouse.tableType" : "PRIMARY_TABLE",
      "openhouse.tableUUID" : "394d8186-143f-482a-b5e5-e6aa6e382556",
      "openhouse.tableUri" : "LocalS3Cluster.d3.t1",
      "openhouse.tableVersion" : "INITIAL_VERSION",
      "policies" : "",
      "write.format.default" : "orc",
      "write.metadata.delete-after-commit.enabled" : "true",
      "write.metadata.previous-versions-max" : "28"
   },
   "tableType" : "PRIMARY_TABLE",
   "tableUUID" : "394d8186-143f-482a-b5e5-e6aa6e382556",
   "tableUri" : "LocalS3Cluster.d3.t1",
   "tableVersion" : "INITIAL_VERSION",
   "timePartitioning" : {
      "columnName" : "ts",
      "granularity" : "HOUR"
   }
}
```

$ curl "${curlArgs[@]}" -XGET
http://localhost:8000/v1/databases/d3/tables/ | json_pp
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1590 0 1590 0 0 22490 0 --:--:-- --:--:-- --:--:-- 22714
{
   "results" : [
      {
         "clusterId" : "LocalS3Cluster",
         "clustering" : [
            {
               "columnName" : "name",
               "transform" : null
            }
         ],
         "creationTime" : 1718327830198,
         "databaseId" : "d3",
         "lastModifiedTime" : 1718327830198,
         "policies" : null,
"schema" :
"{\"type\":\"struct\",\"schema-id\":0,\"fields\":[{\"id\":1,\"name\":\"id\",\"required\":true,\"type\":\"string\"},{\"id\":2,\"name\":\"name\",\"required\":true,\"type\":\"string\"},{\"id\":3,\"name\":\"ts\",\"required\":true,\"type\":\"timestamp\"}]}",
         "tableCreator" : "DUMMY_ANONYMOUS_USER",
         "tableId" : "t1",
"tableLocation" :
"s3://openhouse-bucket/d3/t1-394d8186-143f-482a-b5e5-e6aa6e382556/00000-8e498f9d-153e-412f-bb0e-476cfcab926d.metadata.json",
         "tableProperties" : {
            "key" : "value",
            "openhouse.clusterId" : "LocalS3Cluster",
            "openhouse.creationTime" : "1718327830198",
            "openhouse.databaseId" : "d3",
            "openhouse.lastModifiedTime" : "1718327830198",
            "openhouse.tableCreator" : "DUMMY_ANONYMOUS_USER",
            "openhouse.tableId" : "t1",
"openhouse.tableLocation" :
"s3://openhouse-bucket/d3/t1-394d8186-143f-482a-b5e5-e6aa6e382556/00000-8e498f9d-153e-412f-bb0e-476cfcab926d.metadata.json",
            "openhouse.tableType" : "PRIMARY_TABLE",
"openhouse.tableUUID" : "394d8186-143f-482a-b5e5-e6aa6e382556",
            "openhouse.tableUri" : "LocalS3Cluster.d3.t1",
            "openhouse.tableVersion" : "INITIAL_VERSION",
            "policies" : "",
            "write.format.default" : "orc",
            "write.metadata.delete-after-commit.enabled" : "true",
            "write.metadata.previous-versions-max" : "28"
         },
         "tableType" : "PRIMARY_TABLE",
         "tableUUID" : "394d8186-143f-482a-b5e5-e6aa6e382556",
         "tableUri" : "LocalS3Cluster.d3.t1",
         "tableVersion" : "INITIAL_VERSION",
         "timePartitioning" : {
            "columnName" : "ts",
            "granularity" : "HOUR"
         }
      }
   ]
}
```

Delete table:
$ curl "${curlArgs[@]}" -XDELETE
http://localhost:8000/v1/databases/d3/tables/t1

Validate that the table is deleted:
![Uploading Screenshot 2024-06-13 at 6.21.03 PM.png…]()


- [x] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [x] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.

v0.5.79

Exclude metrics-core lib pulled in by hadoop yarn node manager (#126)

## Summary

<!--- HINT: Replace #nnn with corresponding Issue number, if you are
fixing an existing issue -->
Hadoop 2.10.0 (i.e., the `hadoop-yarn` lib) pulls in a very old version
(`3.0.1`) of the `com.codahale.metrics:metrics-core` lib, which gets
bundled into the tables.jar and jobs.jar fat jars. New methods such as
[gauge](https://www.javadoc.io/doc/io.dropwizard.metrics/metrics-core/3.2.0/com/codahale/metrics/MetricRegistry.html#gauge-java.lang.String-com.codahale.metrics.MetricRegistry.MetricSupplier-)
were added to `metrics-core` starting with version `3.2.0`. So when
tables.jar or jobs.jar coexists with a higher version of `metrics-core` on
the classpath and some code uses the new MetricRegistry APIs (such as
`gauge`), the result is a method-not-found error. Hence, this PR excludes
the `metrics-core` lib: it is not used in the OSS codebase, and we can
always pin a higher version if needed.

```
|    |    |    +--- org.apache.hadoop:hadoop-yarn-server-nodemanager:2.10.0
|    |    |    |    +--- org.apache.hadoop:hadoop-yarn-common:2.10.0 (*)
|    |    |    |    +--- org.apache.hadoop:hadoop-yarn-api:2.10.0 (*)
|    |    |    |    +--- org.apache.hadoop:hadoop-yarn-registry:2.10.0 (*)
|    |    |    |    +--- javax.xml.bind:jaxb-api:2.2.2 (*)
|    |    |    |    +--- org.codehaus.jettison:jettison:1.1
|    |    |    |    +--- commons-lang:commons-lang:2.6
|    |    |    |    +--- javax.servlet:servlet-api:2.5
|    |    |    |    +--- commons-codec:commons-codec:1.4 -> 1.9
|    |    |    |    +--- com.sun.jersey:jersey-core:1.9
|    |    |    |    +--- com.sun.jersey:jersey-client:1.9 (*)
|    |    |    |    +--- org.mortbay.jetty:jetty-util:6.1.26
|    |    |    |    +--- com.google.guava:guava:11.0.2 -> 31.1-jre (*)
|    |    |    |    +--- commons-logging:commons-logging:1.1.3 -> 1.2
|    |    |    |    +--- org.slf4j:slf4j-api:1.7.25 -> 1.7.36
|    |    |    |    +--- com.google.protobuf:protobuf-java:2.5.0
|    |    |    |    +--- com.codahale.metrics:metrics-core:3.0.1

```

Method not found error: 
```
com.codahale.metrics.MetricRegistry.gauge(Ljava/lang/String;Lcom/codahale/metrics/MetricRegistry$MetricSupplier;)Lcom/codahale/metrics/Gauge;
```
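
To confirm the exclusion took effect, a dependency report like the following
should come back empty (a sketch; the module path is illustrative):

```bash
# Standard Gradle dependency report, filtered for the excluded lib.
./gradlew :services:tables:dependencies --configuration runtimeClasspath | grep -i "metrics-core"
```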

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests
- [x] Lib exclusion

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [x] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

`./gradlew clean build` passed. 

Tested using local docker

Create table:
```
anath1@anath1-mn1 oh-hadoop-spark % curl "${curlArgs[@]}" -XPOST http://localhost:8000/v1/databases/d3/tables/ \
--data-raw '{
  "tableId": "t1",
  "databaseId": "d3",
  "baseTableVersion": "INITIAL_VERSION",
  "clusterId": "LocalHadoopCluster",
  "schema": "{\"type\": \"struct\", \"fields\": [{\"id\": 1,\"required\": true,\"name\": \"id\",\"type\": \"string\"},{\"id\": 2,\"required\": true,\"name\": \"name\",\"type\": \"string\"},{\"id\": 3,\"required\": true,\"name\": \"ts\",\"type\": \"timestamp\"}]}",
  "timePartitioning": {
    "columnName": "ts",
    "granularity": "HOUR"
  },
  "clustering": [
    {
      "columnName": "name"
    }
  ],
  "tableProperties": {
    "key": "value"
  }
}'

{"tableId":"t1","databaseId":"d3","clusterId":"LocalHadoopCluster","tableUri":"LocalHadoopCluster.d3.t1","tableUUID":"12b090ff-0dce-487f-8e74-5d18c55c68da","tableLocation":"hdfs://namenode:9000/data/openhouse/d3/t1-12b090ff-0dce-487f-8e74-5d18c55c68da/00000-9a3a852b-26d7-43f6-8340-c0687466c3f5.metadata.json","tableVersion":"INITIAL_VERSION","tableCreator":"DUMMY_ANONYMOUS_USER","schema":"{\"type\":\"struct\",\"schema-id\":0,\"fields\":[{\"id\":1,\"name\":\"id\",\"required\":true,\"type\":\"string\"},{\"id\":2,\"name\":\"name\",\"required\":true,\"type\":\"string\"},{\"id\":3,\"name\":\"ts\",\"required\":true,\"type\":\"timestamp\"}]}","lastModifiedTime":1718344489040,"creationTime":1718344489040,"tableProperties":{"policies":"","write.metadata.delete-after-commit.enabled":"true","openhouse.tableId":"t1","openhouse.clusterId":"LocalHadoopCluster","openhouse.lastModifiedTime":"1718344489040","openhouse.tableVersion":"INITIAL_VERSION","openhouse.creationTime":"1718344489040","openhouse.tableUri":"LocalHadoopCluster.d3.t1","write.format.default":"orc","write.metadata.previous-versions-max":"28","openhouse.databaseId":"d3","openhouse.tableType":"PRIMARY_TABLE","openhouse.tableLocation":"/data/openhouse/d3/t1-12b090ff-0dce-487f-8e74-5d18c55c68da/00000-9a3a852b-26d7-43f6-8340-c0687466c3f5.metadata.json","openhouse.tableUUID":"12b090ff-0dce-487f-8e74-5d18c55c68da","key":"value","openhouse.tableCreator":"DUMMY_ANONYMOUS_USER"},"timePartitioning":{"columnName":"ts","granularity":"HOUR"},"clustering":[{"columnName":"name","transform":null}],"policies":null,"tableType":"PRIMARY_TABLE"}
```
List table:
```
anath1@anath1-mn1 oh-hadoop-spark % curl "${curlArgs[@]}" -XGET http://localhost:8000/v1/databases/d3/tables/
{"results":[{"tableId":"t1","databaseId":"d3","clusterId":"LocalHadoopCluster","tableUri":"LocalHadoopCluster.d3.t1","tableUUID":"12b090ff-0dce-487f-8e74-5d18c55c68da","tableLocation":"hdfs://namenode:9000/data/openhouse/d3/t1-12b090ff-0dce-487f-8e74-5d18c55c68da/00000-9a3a852b-26d7-43f6-8340-c0687466c3f5.metadata.json","tableVersion":"INITIAL_VERSION","tableCreator":"DUMMY_ANONYMOUS_USER","schema":"{\"type\":\"struct\",\"schema-id\":0,\"fields\":[{\"id\":1,\"name\":\"id\",\"required\":true,\"type\":\"string\"},{\"id\":2,\"name\":\"name\",\"required\":true,\"type\":\"string\"},{\"id\":3,\"name\":\"ts\",\"required\":true,\"type\":\"timestamp\"}]}","lastModifiedTime":1718344489040,"creationTime":1718344489040,"tableProperties":{"policies":"","write.metadata.delete-after-commit.enabled":"true","openhouse.tableId":"t1","openhouse.clusterId":"LocalHadoopCluster","openhouse.lastModifiedTime":"1718344489040","openhouse.tableVersion":"INITIAL_VERSION","openhouse.creationTime":"1718344489040","openhouse.tableUri":"LocalHadoopCluster.d3.t1","write.format.default":"orc","write.metadata.previous-versions-max":"28","openhouse.databaseId":"d3","openhouse.tableType":"PRIMARY_TABLE","openhouse.tableLocation":"/data/openhouse/d3/t1-12b090ff-0dce-487f-8e74-5d18c55c68da/00000-9a3a852b-26d7-43f6-8340-c0687466c3f5.metadata.json","openhouse.tableUUID":"12b090ff-0dce-487f-8e74-5d18c55c68da","key":"value","openhouse.tableCreator":"DUMMY_ANONYMOUS_USER"},"timePartitioning":{"columnName":"ts","granularity":"HOUR"},"clustering":[{"columnName":"name","transform":null}],"policies":null,"tableType":"PRIMARY_TABLE"}]}
```

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.

v0.5.78

Add S3 docker setup for OpenHouse (#123)
