Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] - Support pagination for PPL and SQL query #656

Closed
2 tasks
penghuo opened this issue Jun 23, 2022 · 10 comments
Closed
2 tasks

[FEATURE] - Support pagination for PPL and SQL query #656

penghuo opened this issue Jun 23, 2022 · 10 comments
Labels
enhancement New feature or request pagination Pagination feature, ref #656

Comments

@penghuo
Copy link
Collaborator

penghuo commented Jun 23, 2022

Is your feature request related to a problem?

What solution would you like?

  • add pagination support for operator which will unblock ML plugin use case.
  • add cursor support for V2 engine.
@penghuo penghuo added enhancement New feature or request untriaged labels Jun 23, 2022
@seankao-az
Copy link
Collaborator

seankao-az commented Jul 7, 2022

[Draft] Design Doc

1 Overview

Support pagination requests for PPL and SQL query

2 Problem Statements

The V2 engine limits the response size of a query. This is a blocker for ML plugin to pull more data by using Piped Processing Language (PPL). While the V1 SQL engine supports pagination using cursor scrolling, PPL uses the V2 engine explicitly. Therefore, we need to migrate cursor support to the V2 engine.

The legacy engine took a stateless approach: the cursor encodes the necessary information for re-constructing the query: https://github.com/opensearch-project/sql/blob/main/docs/dev/Pagination.md
This worked well for SQL, but as PPL allows for extra calculations on the data rows returned from index scanning, we might not be able to rebuild context on the fly.

3 Requirements

3.1 Use Cases

3.1.1 Scrolling through web interface

SQL cursor
Sample requests:

POST /_plugin/_sql
{
  "fetch_size": 5,
  "query": "SELECT firstname, lastname FROM accounts WHERE age > 20 ORDER BY state"
}

{
  "schema": [...],
  "cursor": "eyJhIjp7fSwicyI6IkRYRjFaWEo1UVc1a1JtVjBZMmdCQUFBQUFBQUFBQU1XZWpkdFRFRkZUMlpTZEZkeFdsWnJkRlZoYnpaeVVRPT0iLCJjIjpbeyJuYW1lIjoiZmlyc3RuYW1lIiwidHlwZSI6InRleHQifSx7Im5hbWUiOiJsYXN0bmFtZSIsInR5cGUiOiJ0ZXh0In1dLCJmIjo1LCJpIjoiYWNjb3VudHMiLCJsIjo5NTF9",
  "total": 956,
  "datarows": [...],
  "size": 5,
  "status": 200
}

POST /_plugin/_sql
{
  "cursor": "eyJhIjp7fSwicyI6IkRYRjFaWEo1UVc1a1JtVjBZMmdCQUFBQUFBQUFBQU1XZWpkdFRFRkZUMlpTZEZkeFdsWnJkRlZoYnpaeVVRPT0iLCJjIjpbeyJuYW1lIjoiZmlyc3RuYW1lIiwidHlwZSI6InRleHQifSx7Im5hbWUiOiJsYXN0bmFtZSIsInR5cGUiOiJ0ZXh0In1dLCJmIjo1LCJpIjoiYWNjb3VudHMiLCJsIjo5NTF9"
}

{
  "schema": [...],
  "cursor": <next cursor>,
  "total": 956,
  "datarows": [...],
  "size": 5,
  "status": 200
}
POST /_plugin/_ppl
{
  "fetch_size": 5,
  "query": "source=accounts | where age > 20 | sort state | fields firstname,lastname"
}

{
  "schema": [...],
  "cursor": <first cursor>,
  "total": 956,
  "datarows": [...],
  "size": 5,
  "status": 200
}

POST /_plugin/_ppl
{
  "cursor": <first cursor>
}

{
  "schema": [...],
  "cursor": <next cursor>,
  "total": 956,
  "datarows": [...],
  "size": 5,
  "status": 200
}

3.1.2 Interacting with ML command in PPL to load full result set

Sample request:

POST /_plugin/_ppl
{
  "query": "source=iris_data | fields sepal_length_in_cm, sepal_width_in_cm, petal_length_in_cm, petal_width_in_cm | kmeans 3"
}

Users do not need to add the fetch_size parameter. The ml command implicitly invokes search and scroll to fetch all dataset.

3.1.3 Extending query size limit

#703
This allows for querying and processing large data sets. With this, the 3.1.2 use case can be resolved to some extent.
Sample request:

POST /_plugin/_ppl
{
  "query": "source=iris_data | head 100000 | fields sepal_length_in_cm, sepal_width_in_cm, petal_length_in_cm, petal_width_in_cm | kmeans 3"
}

3.2 Functional Requirements

  • Scrolling support for V2 query backend engine
  • Scrolling support for SQL and PPL requests

4 Measure of Success

5 Design

5.1 Background

5.1.1 Current execution flow

sql-migration-Page-1

Query request
POST /_plugin/_sql
{
  "query": "SELECT * FROM table"
}

execution path: (1a) → (2a) → (3)

Scroll query request
POST /_plugin/_sql
{
  "fetch_size": 5,
  "query": "SELECT * FROM table"
}

execution path: (1a) → (2b) → (3) → legacy OpenSearch request

Scroll cursor request
POST /_plugin/_sql
{
  "cursor": <cursor>
}

execution path: (1b) → legacy cursor executor

5.2 Design considerations

5.2.1 Physical plan generation

Physical plan for index scan does not solely depend on the logical plan for the query anymore. The presence/absence of fetch_size parameter will decide whether to make a regular query request or a scroll request to OpenSearch. Therefore, the request handler should pass this info down to the bottom where physical plan is generated.

Option 1

Let the Planner generate different physical plans for query and scroll requests.

  • plan(logical plan, is not scroll) => physical index scan with query request
  • plan(logical plan, is scroll) => physical index scan with scroll request
Option 2

Let the Analyzer generate different logical plans for query and scroll requests.

  • analyze(unresolved plan, is not scroll) => logical index scan for query request
  • analyze(unresolved plan, is scroll) => logical index scan for scroll request

5.2.2 Scrolling interaction with limit/offset

Limit

#703

Offset

OpenSearch doesn't allow scroll requests to have offset != 0.
To bypass this restriction, we can set the offset of a scroll request to be 0 and skip the first few results for the user before returning. However, it is suggested that search_after be used if we need to page through more than index.max_result_window hits.

5.2.3 Physical plan retention

Extra calculations are required for some queries, such as the parse, eval, dedup commands in ppl.
When handling pagination for source=index | other commands, the scroll request for OpenSearch handles only the first part source=index, but not the rest | other commands. We should create context and keep it alive for each scroll request.

Context lookup

Where does such context live? In local memory? Shared storage?

5.2.4 Backward compatibility with legacy cursor format

https://github.com/opensearch-project/sql/blob/main/legacy/src/main/java/org/opensearch/sql/legacy/cursor/DefaultCursor.java
This cursor was designed for stateless requests.

5.3 Design Overview

5.3.1 Journey of Request

Request Journey

5.3.2 Backend: OpenSearch Request

Interface to the OpenSearch engine

  • OpenSearchQueryRequest: for regular query request
  • OpenSearchScrollRequest: for scroll request
    • Stateful to maintain scroll ID between calls to client search method

5.3.3 Backend: Physical Plan Generation

We add a new request builder to the OpenSearchIndexScan plan.
Currently, physical plan depends solely on the (optimized) logical plan. However, the same logical plans can lead to different physical plans because:

  • A regular query request and a scroll query request generate the exact same logical plan, but they need different physical plans to invoke different OpenSearch requests.
  • The presence/absence of ML commands can determine how we want to fetch data.
  • Extend size_limit setting in query engine to support unlimited index query. #703 other commands could also require scrolling depending on the parameter passed in

We add a PlanContext component to solve the problem. The PlanContext can be set during request handling and query analyzing. When building the physical plan, the planner uses it to decide which index scan mode to use.

5.3.4 Frontend: Request Handling

If the request is a scroll query request, set PlanContext so that this information is passed to the planner later. This same logic will apply to both SQL and PPL request handling.

5.3.5 Frontend: Response Formatting

5.3.6 Cursor generation and mapping

@vamsi-amazon
Copy link
Member

In the second cursor request can there be a scenario with some extra calculations along with push down. how do you retain entire physical plan?

@seankao-az seankao-az mentioned this issue Jul 15, 2022
6 tasks
@seankao-az
Copy link
Collaborator

seankao-az commented Jul 25, 2022

In the second cursor request can there be a scenario with some extra calculations along with push down. how do you retain entire physical plan?

Thanks for pointing this out. I've come up with a design idea that takes this into consideration. It also avoids many of the problems faced in the PoC PR #693

  • The design in poc: scroll query request #693 uses the the scroll ID returned from OpenSearchScrollRequest as the cursor response for the sql/ppl request. New design hides it and exposes a new cursor we generated, so we no longer need to break PhysicalPlan interface to get the scroll ID.
  • retains the physical plan for extra calculations
@penghuo
Copy link
Collaborator Author

penghuo commented Nov 22, 2022

relate to #947

@rrlamichhane
Copy link

What's the status of this?

@Yury-Fridlyand
Copy link
Collaborator

WIP, you can track it in Bit-Quill#226

@Yury-Fridlyand
Copy link
Collaborator

OpenSearch.SQL.pagination.phase.1.demo.mp4
@Yury-Fridlyand
Copy link
Collaborator

Pagination.with.WHERE.clause.mp4
@Yury-Fridlyand
Copy link
Collaborator

@Yury-Fridlyand
Copy link
Collaborator

Superseded by #1759

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request pagination Pagination feature, ref #656
6 participants