
Cloud Storage - Implement method to check existence of multiple objects in a single operation #2337

Open
bduclaux opened this issue Sep 19, 2019 · 4 comments
Labels
api: storage (Issues related to the Cloud Storage API.)
type: feature request ('Nice-to-have' improvement, new feature or different behavior or design.)

Comments

@bduclaux

Hello

We are using the PHP cloud storage library, and we are facing a performance issue when checking the existence of multiple objects in a storage bucket.
Currently, the only way to implement such a check is with a loop such as:

$names = ["file1", "file2", ..., "fileN"];
foreach ($names as $name) {
    $object = $bucket->object($name);
    if (!$object->exists()) {
        // ...
    }
}

This triggers a REST API call for each of the objects, which is slow. We usually have around 10 object names per loop, so each loop takes around 0.4s.
Since we do this many times, we have a performance issue.

It would be great to have a method in the Bucket class to check multiple object names at once, with a single request to the Cloud Storage back-end APIs (we are not sure such a method exists in the back-end API).

Thanks!

@dwsupplee dwsupplee added the "api: storage" and "type: feature request" labels Sep 19, 2019
@andrewinc

@dwsupplee, I am working on this and would like to coordinate the design of the solution.
$object->exists() relies on the absence of an exception from $connection->getObject(...), which I believe is the longest operation.

  1. I propose a solution based on $bucket->objects(): list the bucket once, then find the requested names in a loop over the listing. This solution requires only one request.
  2. A more complicated solution: add a new method to Google\Cloud\Storage\Connection\ConnectionInterface and implement it in Google\Cloud\Storage\Connection\Rest.
  3. A bad solution: cram the above code into a separate method of the Bucket class. This gives no gain in speed.

I myself would choose No. 1
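A minimal sketch of approach No. 1, split so that the name-matching half is pure PHP. The helper name `checkNamesAgainstListing` is hypothetical (not part of the library); with the real client, a single `$bucket->objects()` call would feed it.

```php
<?php
// Pure-PHP half of approach No. 1: given the object names obtained from
// one listing request, report which of the requested names exist.
function checkNamesAgainstListing(array $existingNames, array $requested): array
{
    $existing = array_fill_keys($existingNames, true);
    $result = [];
    foreach ($requested as $name) {
        $result[$name] = isset($existing[$name]);
    }
    return $result;
}

// With the real library, the listing half would look roughly like:
// $existingNames = [];
// foreach ($bucket->objects() as $object) {  // one request (plus paging)
//     $existingNames[] = $object->name();
// }
// $exists = checkNamesAgainstListing($existingNames, ['file1', 'file2']);
```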

As for the method of the Bucket class that @bduclaux requested, there is a question about how to handle the result.

  1. For example, you can return a list of only the existing names:
    $bucket->objectsExists(["file1","file2",...,"fileN"]); // ["file1",...,"fileN"]

  2. You can return an associative array keyed by the original list:
    $bucket->objectsExists(["file1","file2",...,"fileN"]); // ["file1"=>true,"file2"=>false,...,"fileN"=>true]

  3. A variant is possible where an associative array is passed by reference; the method then returns true if all names from the list are found:

    $names = ["file1"=>null,"file2"=>null,...,"fileN"=>null];
    $result = $bucket->objectsExists($names);
    // $result = false; (true if all exist) $names = ["file1"=>true,"file2"=>false,...,"fileN"=>true];

  4. There is also an option with a callback function to process each name. The method can again return true if all names from the list are found. But this solution is not quite in the style of PHP (more like Node.js):

    $result = $bucket->objectsExists(["file1","file2",...,"fileN"], function ($name, $exists) {...});
    // $result = false; (true if all exist)

I myself would choose No. 1
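The by-reference variant (No. 3) can be sketched in pure PHP as well. The function name and signature below are hypothetical, and the list of existing names would in practice come from the single listing request discussed above:

```php
<?php
// Hypothetical sketch of return variant No. 3: fill the passed-by-reference
// map with a per-name result and return true only when every name exists.
function objectsExistByRef(array &$names, array $existingNames): bool
{
    $existing = array_fill_keys($existingNames, true);
    $all = true;
    foreach (array_keys($names) as $name) {
        $names[$name] = isset($existing[$name]);
        $all = $all && $names[$name];
    }
    return $all;
}
```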

@bduclaux
Author

bduclaux commented Sep 30, 2019

Hi @andrewinc, thanks!
Please also take into account the cost of the queries to the API. Class A operations are more expensive than class B operations (see https://cloud.google.com/storage/pricing ). Getting the full list of objects might take a while for large buckets, unless you specify a prefix.

@andrewinc

Yes, @bduclaux, you're right. $bucket->objects() is charged as a class A operation, i.e. it costs 10 times more than $object->exists(), and is therefore only worthwhile when checking 10 or more objects, as in your case:

We usually have around 10 object names per loop

As for the large list, you are also right that it can be shortened by specifying a prefix if desired. Of course, the prefix option already provided by $bucket->objects(['prefix' => ...]) needs to stay available.
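One way to pick that prefix automatically (an illustrative helper, not something the library provides) is the longest common prefix of the requested names, which bounds how much of the bucket a single listing has to scan:

```php
<?php
// Longest common prefix of the requested names; passing the result as
// $bucket->objects(['prefix' => $prefix]) narrows the listing request.
function commonPrefix(array $names): string
{
    if ($names === []) {
        return '';
    }
    $prefix = $names[0];
    foreach ($names as $name) {
        // Shrink the candidate prefix until it matches this name too.
        while ($prefix !== '' && strncmp($name, $prefix, strlen($prefix)) !== 0) {
            $prefix = substr($prefix, 0, -1);
        }
    }
    return $prefix;
}
```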

I would like to know what @dwsupplee will say about this. Maybe he will suggest a different solution.

@dwsupplee
Contributor

dwsupplee commented Oct 1, 2019

@andrewinc thanks so much for taking the time to put some thoughts together on this, and thank you @bduclaux for the feature request. We'd definitely love to add support for something like this.

We've been laying the groundwork for exposing asynchronous network requests for some time now. This should allow us to expose something which looks like the following:

use GuzzleHttp\Promise;
use Google\Cloud\Storage\StorageClient;

$bucket = (new StorageClient())->bucket('my-bucket');
$promises = [];
$objectNames = ['a.txt', 'b.txt', 'c.txt'];

foreach ($objectNames as $objectName) {
    $promises[] = $bucket->object($objectName)
        ->existsAsync()
        ->then(function ($exists) use ($objectName) {
            echo $objectName . ': ' . ($exists ? 'true' : 'false') . PHP_EOL;
        });
}

Promise\unwrap($promises);

We've done this as a "one-off" over on StorageObject::downloadAsStreamAsync, with the plan being to expose the rest of the async method counterparts across the Storage library as part of our 2.0 version bump (we don't have a clear ETA for this at the moment).

Another option would be to expose the batch API through our storage client; this would allow combining up to 100 requests into a single API request. It looks like there is some work in progress to define a plan for how we can expose this across languages. I'll check in and see where that progress is at, but will note it could require breaking changes to the library as well.

I prefer these approaches over the list-objects implementation because I'm apprehensive of edge-case scenarios like the following:

I have 100,000 objects in my bucket, and I want to check that "a.txt" and "z.txt" exist. "a.txt" happens to be object 1 of 100,000 returned, while "z.txt" is object 100,000. The maximum number of results returned by a single RPC to list objects is 1,000, meaning I'd have to page through 100 times to reach "z.txt". The end cost is ~100 RPCs to check for two objects.

@vishwarajanand vishwarajanand self-assigned this Feb 26, 2024
@vishwarajanand vishwarajanand removed their assignment Jun 18, 2024