
Should keys/values/entries be async iterators? #6

Closed
domenic opened this issue Feb 12, 2018 · 35 comments

Labels
api A proposal for changing the API surface

Comments

@domenic
Collaborator

domenic commented Feb 12, 2018

Right now they return promises for arrays. This can be convenient; once you await them, they're just normal arrays.

However, an async iterator would map better to the underlying IndexedDB technology (cursors which advance one by one), and would work a lot better for cases with many entries.

Should we make these return async iterators?

An alternate design would be to have them return promises-for-arrays, but also have keysIter/valuesIter/entriesIter as async-iteration versions.

domenic added the api label Feb 12, 2018
@jakearchibald

Is there a proposal or something that makes it easy to turn an async iterator into an array? Array.asyncFrom() or Promise.asyncAll() or something.

@domenic
Collaborator Author

domenic commented Feb 13, 2018

Not at the moment. I'd be hesitant to propose such a thing because it encourages writing code that defeats the purpose of streaming. It's better to have people do the per-item processing in a streaming, async fashion using for-await-of.

@jakearchibald

In that case I'd have keys/values/entries return a promise for an array, so they can use getAll under the hood, which I understand is faster than cursors.

Storage objects could implement async iterators for cursor-like behaviour:

for await (const [key, value] of storage) {
  // …
}

@domenic
Collaborator Author

domenic commented Feb 14, 2018

Oh, I didn't know about getAll. And very neat idea on just having Symbol.asyncIterator, instead of dedicated APIs.

@inexorabletash
Member

One wrinkle with getAll(): there is getAll() and getAllKeys(), but no way to do both in a single call. Fortunately, with IDB's transaction semantics, if you issue both in sequence you're guaranteed to get consistent results, e.g.

const r1 = store.getAllKeys();
const r2 = store.getAll();
r2.onsuccess = e => {
  // if r2 succeeds then r1 must have succeeded already.
  const [keys, values] = [r1.result, r2.result];
  ...
};

@inexorabletash
Member

inexorabletash commented Feb 26, 2018

As an aside, in case it's not clear: you won't be able to directly map cursors to async iterators, for the reasons explored in https://github.com/inexorabletash/indexeddb-promises - the transaction is only kept alive during iteration until a non-IDB task slips into the microtask processing, at which point it auto-closes.

So if you want async iteration you'll want to snapshot (e.g. via getAll()) and implement the async-ness in userspace.
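For concreteness, here is a minimal sketch of that snapshot-then-iterate approach (not from the thread; db is assumed to be an open IDBDatabase, and the store name is a parameter):

async function* entriesSnapshot(db, storeName) {
  // Snapshot keys and values within a single readonly transaction.
  const store = db.transaction(storeName, 'readonly').objectStore(storeName);
  const keysReq = store.getAllKeys();
  const valuesReq = store.getAll();
  const [keys, values] = await new Promise((resolve, reject) => {
    // If valuesReq succeeds, keysReq (issued first) has already succeeded.
    valuesReq.onsuccess = () => resolve([keysReq.result, valuesReq.result]);
    valuesReq.onerror = () => reject(valuesReq.error);
  });
  // The transaction may auto-close now; iteration runs over the snapshot.
  for (let i = 0; i < keys.length; i++) {
    yield [keys[i], values[i]];
  }
}

Consumers would then write for await (const [key, value] of entriesSnapshot(db, 'kv')) { … } as usual.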

@jakearchibald

Ah, yeah, in which case there isn't really any point in using async iterators.

@domenic
Collaborator Author

domenic commented Feb 28, 2018

Agreed. I guess that sadly closes this in the negative, at least until IDB gets the necessary primitives.

domenic closed this as completed Feb 28, 2018
domenic added a commit that referenced this issue Feb 28, 2018
Per #6 (comment) we can just wait for the second request.
@pwnall

pwnall commented Oct 17, 2018

If this is still on the table, I beg to differ -- I think we should do async iteration, because it gives us more room to evolve the implementation.

Specifically, async iteration gives us the option to protect against excessive memory consumption and OOMs if the database grows a lot. Today, we can use getAll with a count and buffer some results (say, 1k - 10k) to offset the cost of having a transaction per read. If IndexedDB evolves better primitives, we can take advantage of them in future revisions.
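As a rough illustration of that batching strategy (a hypothetical sketch, not part of the proposal: one readonly transaction per batch, with an open lower-bound range resuming where the last batch ended):

async function* batchedEntries(db, storeName, batchSize = 1000) {
  let lastKey;
  while (true) {
    const store = db.transaction(storeName, 'readonly').objectStore(storeName);
    // After the first batch, resume just past the last key we saw.
    const range = lastKey === undefined
        ? undefined
        : IDBKeyRange.lowerBound(lastKey, /*open=*/true);
    const keysReq = store.getAllKeys(range, batchSize);
    const valuesReq = store.getAll(range, batchSize);
    const [keys, values] = await new Promise((resolve, reject) => {
      valuesReq.onsuccess = () => resolve([keysReq.result, valuesReq.result]);
      valuesReq.onerror = () => reject(valuesReq.error);
    });
    if (keys.length === 0) return;
    for (let i = 0; i < keys.length; i++) yield [keys[i], values[i]];
    lastKey = keys[keys.length - 1];
  }
}

At most batchSize keys and values are held in memory at a time, and a future IndexedDB primitive could replace the per-batch transactions without changing the API shape.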

@domenic
Collaborator Author

domenic commented Oct 17, 2018

Thanks for chiming in. I see the point and, with a more future-facing look, tend to agree. Let's reopen as a first step.

What should we do about naming? In all cases we should add [Symbol.asyncIterator](), but what about the keys/values/entries functions? A few options (not exhaustive):

  • Only supply [Symbol.asyncIterator](), which gives [key, value] pairs, alongside promise-for-array keys()/values()/entries()
  • Make keys()/values()/entries() return async iterators, and have separate keysArray()/etc. functions. (I'm not a big fan of this because it mismatches other async maps on the platform.)
  • Make keys()/values()/entries() return promises-for-arrays, and have separate keysIter()/etc. functions

domenic reopened this Oct 17, 2018
@asutherland

Why expose any promise-for-array variations? In w3c/ServiceWorker#1066 we saw a case where the naive "return everything in a single array" design for cache.keys() would break things once a lot of stuff was stored in a Cache. Exposing promise-for-array for symmetry with existing Map metaphors when we know it becomes a foot-gun as storage usage grows seems counterproductive.

Along these lines, async iterators give browsers the ability to help consumers of the API avoid tying up the main thread by breaking up the results into batches. API consumers will get what they want and what browsers want as part of the API.

And the IndexedDB escape-hatch allows those who really want everything all at once to fall back to IndexedDB and getAll/getAllKeys.

@domenic
Collaborator Author

domenic commented Dec 19, 2018

If we think other async maps on the platform have made a mistake, then departing from them and having keys()/values()/entries() return an async iterator makes sense to me. And yeah, I guess the more conservative thing is to omit the promise-for-array variants. I'll work on that spec change now.

@domenic
Collaborator Author

domenic commented Dec 20, 2018

I've put up a rough draft, so far only for keys(), at https://github.com/domenic/async-local-storage/compare/async-iterators?diff=split . I'd welcome comments on the specification style, which is fairly unusual. (Probably some of it will migrate to Web IDL.)

On the substantive side, one thing that came up is the tension between flexibility and predictability. In particular, what happens if you modify the database during iteration? Concretely, what should this do?

storage.set(1, "1!");
storage.set(2, "2!");

let i = 3;
for await (const k of storage.keys()) {
  console.log(k);
  storage.set(i, `${i}!`);
  ++i;
}

If you use the getAllKeys() strategy, this will log 1, 2, then stop. If you use a batching strategy with batch size 1, this will loop forever. If you use a batching strategy with batch size 2, it depends on whether you do your check-for-more-batches after grabbing the batch (loop will exit) or before a new batch is requested (loop will run forever). If you use a batching strategy with size 100, it will log 1, 2, then stop.

This seems pretty bad for predictability. Are we OK with saying that the results are undetermined between browsers in this way? Or should we lock this down more?

For example, we could mandate a set batch size. Or, we could mandate using getAllKeys() for now and hope that in the future if IndexedDB gets nice primitives then all browsers can switch at once, while keeping the same API shape. Or, we could completely get rid of keys()/values()/entries() from v1.

Or, maybe there's something cleverer that I haven't thought of.

@asutherland

I hope the IndexedDB transaction model already covers this. keys() gets a "readonly" transaction that it uses for all retrievals. The set() call creates a "readwrite" transaction whose mutations will never be seen by the keys() "readonly" transaction per https://w3c.github.io/IndexedDB/#transaction-lifetime-concept. So the bits about "and perform a new transaction each time to retrieve the next key" would need to be removed.

I see a few wrinkles that need to be addressed with this, though:

  1. The polyfill has to deal with the realities of self-closing transactions. So it needs to at least keep getting one more result value per call, or the transaction closes and semantics break. This limits how lazy the polyfill can be.
  2. The spec would want to reasonably guarantee that code that calls storage.keys() but indefinitely defers processing the async iterator does not cause storage.set('foo', 1); storage.get('foo'); to indefinitely stall. Implementation-wise, this translates to "You must not keep the readonly transaction backing the async iterator alive if your transactions implementation isn't clever enough to keep things going; instead you need to slurp the data into memory." (Some implementations, like Firefox's IndexedDB-on-SQLite-WAL can potentially just rely on the semantics enabled by SQLite's WAL to let "readonly" transactions be long-lived.)
  3. The spec would want to decide and explicitly state how an indefinitely deferred keys() consumption is addressed. This is necessary because an implementation doing something clever with long-lived "readonly" transactions will at some point still need to close out that transaction so WAL check-pointing can compact the database and stop the entire disk from filling up. I think the options are:
    • The iterator is allowed to throw if you don't consume it in a timely fashion. This is hard because the obvious definition of "timely fashion" here is the IndexedDB self-closing transaction timeline, which is arguably unfriendly. (But does simplify the polyfill!) And anything longer than that is likely to be surprising to users or a nightmare for cross-browser consistency.
    • The implementation is required to consume the iterator's values into memory when it would close the underlying "readonly" transaction so that the values can be provided when eventually desired. And if that would cause an OOM, it causes an OOM. I like this one.

@domenic
Collaborator Author

domenic commented Dec 20, 2018

Ah, so you'd use a single transaction, just potentially multiple key-retrieval calls? And for now that would mean you'd have to do all your key retrieval calls within one event loop turn, since IndexedDB transactions always auto-close right now. (And thus, probably you'd just do a single key retrieval call.)

But later, if IndexedDB gets better primitives for non-auto-closing, async local storage could layer on top of that, and the only difference would be the timing of the results, not their contents.

Am I understanding correctly?

@inexorabletash
Member

On vacation, so keeping this short: an alternative is iterating in user-space across transactions. Store the last key returned, and use an open-lower-bound range to get the next key/value, e.g.

const range = IDBKeyRange.lowerBound(last_key, /*open=*/true);
// getKey()/get() are issued on the object store, within one transaction:
const rk = store.getKey(range);
const rv = store.get(range);
rv.onsuccess = e => {
   last_key = rk.result;
   if (last_key === undefined) { end_of_data(); }
   else { async_yieldify(rk.result, rv.result); }
};

The semantics roughly match what happens to a cursor that's iterating within a read/write transaction where the data changes in IDB today.

@asutherland

@domenic Not quite as bad as one event loop turn. In order to keep the transaction active, you need to dispatch another request in the same event loop turn in which the last outstanding request succeeds/errors. So you would fetch batch N+1 once batch N comes back. This just means the polyfill's batch filling would be driven by how fast IDB can deliver records, not how fast the keys() caller wants to consume them.
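A minimal sketch of that request-chaining constraint (names and batch size are illustrative; the consumer-side buffering is elided):

function pumpKeyBatches(store, onBatch, lastKey) {
  const range = lastKey === undefined
      ? undefined
      : IDBKeyRange.lowerBound(lastKey, /*open=*/true);
  const request = store.getAllKeys(range, 100);
  request.onsuccess = () => {
    const keys = request.result;
    if (keys.length === 0) return; // done; the transaction can now auto-close
    onBatch(keys); // buffer batch N for the (possibly slower) consumer
    // Issuing batch N+1 synchronously inside this success handler is what
    // keeps the readonly transaction active.
    pumpKeyBatches(store, onBatch, keys[keys.length - 1]);
  };
}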

Re: open-lower-bound range, that seems quite reasonable to me too. Since the values in ConvenienceStore are structured-cloned rather than strings, if a consumer wants to atomically store two related values, they can just stick them in an object { foo, bar } keyed "fooBar" rather than storing them in separate keys "foo" and "bar". This would be consistent with the ordering of get("foo") followed by get("bar"), which would happen in two transactions. And I guess that also makes it clear that this is the right way to handle it. If a caller wants a snapshot, they can use IndexedDB on the store.

@domenic
Collaborator Author

domenic commented Jan 2, 2019

@inexorabletash's idea seems good to me. I'll work on prototyping that in Blink, and writing a bunch of tests for it, then I'll come back here with a spec patch.

@domenic
Collaborator Author

domenic commented Jan 3, 2019

Update: as far as I can tell, the only way to get the first key in IDB is to use cursors; get() or getKey() will not allow a null range argument, and there appears to be no way to construct an unbounded range using the IDBKeyRange factory methods. So, this will need a slightly different strategy, unless I'm missing something (which is quite possible).

@vweevers

vweevers commented Jan 3, 2019

and there appears to be no way to construct an unbounded range using the IDBKeyRange factory methods

How about IDBKeyRange.lowerBound(-Infinity)?

@asutherland

objectStore.getAll(null, 1) and objectStore.getAllKeys(null, 1) should work for values and keys, respectively. (Leveraging their query, count signatures.)
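For example, fetching the first key would look like this (a sketch; store is assumed to be an IDBObjectStore):

const firstKeyReq = store.getAllKeys(null, 1);
firstKeyReq.onsuccess = () => {
  const firstKey = firstKeyReq.result[0]; // undefined if the store is empty
  // …
};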

@domenic
Collaborator Author

domenic commented Jan 3, 2019

objectStore.getAll(null, 1) and objectStore.getAllKeys(null, 1) should work for values and keys, respectively. (Leveraging their query, count signatures.)

Yeah, that's what I ended up doing; I had to have separate getFirstKey/getFirstKeyValuePair and getKey/getKeyValuePair functions, which was a bit icky, but it all got abstracted over.

How about IDBKeyRange.lowerBound(-Infinity)?

That might be even better, thanks! I'll try it once I get the tests passing against the current version of my branch.

@inexorabletash
Member

Would pushing additional primitives into IndexedDB to make things easier here be reasonable? I don't believe it's a goal for this API to be polyfillable on any existing engines, is it?

re: IDBKeyRange.lowerBound(-Infinity) - see some discussions in w3c/IndexedDB#76 about potentially adding keys that sort before -Infinity. That said, we can evolve this API's definition and IDB in parallel, so it's not a problem as long as we remember to update both specs (or merge them as was suggested at TPAC by @asutherland maybe?)

Adding something like IDBKeyRange.unbounded() seems feasible - the spec already has the concept. Alternately, we could just change get() and getKey() to allow null, which already means this for most queries; it's just explicitly forbidden for those calls. IIRC it was just an early decision to throw in that case to prevent developer confusion. Apart from tests, it should be web-compatible to relax that (chase down "null disallowed flag" mentions in the spec).

Also, adding getKeyAndValue() has been requested before, to avoid the API/resource overhead of a cursor. I'd be down with adding that too.

@domenic
Collaborator Author

domenic commented Jan 7, 2019

Would pushing additional primitives into IndexedDB to make things easier here be reasonable?

For sure, although there's a question of how far we go: e.g. we could go as far as adding fully async-iterator-friendly extensible transactions, or better cursors, or even async-iterator cursors. My initial preference is to build on top of what's specced at this time, and co-evolve to simplify the layered ALS spec on top of any future IDB additions.

I don't believe it's a goal for this API to be polyfillable on any existing engines, is it?

Correct, just layered on top of the "theoretical web platform" provided by the stable-ish portion of the specification ecosystem.

merge them as was suggested at TPAC

Yeah, that's still the plan; need to move this to WICG, then when it's ready to graduate, merge into IDB.

Adding something like IDBKeyRange.unbounded() seems feasible - the spec already has the concept. Alternately, we could just change get() and getKey() to allow null which already means this for most queries, it's just explicitly forbidden for those calls. IIRC it was just an early decision to throw in that case to prevent developer confusion. Apart from tests, it should be web-compatible to relax that (chase down null disallowed flag mentions in the spec).

Also, adding getKeyAndValue() has been requested before, to avoid the API/resource overhead of a cursor. I'd be down with adding that too.

These all seem quite nice to me, but I'm not sure I'm the right person to drive the spec/test/implementation updates; as tempting as it is, I think just using what we have right now is the right path for my time investment. But, I'm happy to be convinced otherwise. Thoughts?

chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Jan 11, 2019
Following the spec discussion at WICG/kv-storage#6. This CL is used for prototyping and tests.

Change-Id: If5262a32d42eb1c7abaad4fb2a029ce44e94834e
@tabatkins

I can't tell from this discussion whether the impl strategy for async iterators would allow you to easily iterate over the "original" version of the db, and mutate it in the loop without seeing the changes. I think it's important for usability to make that case easy, even if it's not the easiest thing to do; having to drop down to IDB and understand the specifics of the transaction semantics (per @inexorabletash's example in #6 (comment)) is... less than good. (Or, equivalent in meaning but very different implementation-wise, do a read-only loop, building up all desired mutations in a side-list, then do a write-only loop, writing all the mutations to the db. Also v awkward.)

For eager iterators this is easy; for(const val of [...iter()]) {...} works just fine. It would be unfortunate if this was substantially more difficult for async iterators.

I might just be asking for Promise.asyncAll(), a la Jake's earlier request, and thus wouldn't actually require this API to do anything in particular. "You shouldn't do that" isn't a valid response to this request, I think; you shouldn't generally eagerly pull all the data out of an eager iterator, either, but it's sometimes useful.

@inexorabletash
Member

There's also...

navigator.locks.request('kvstorage', async lock => {
  for await (const [k,v] of kvstorage.entries()) {
    ...
  }
});

@tabatkins

Ah, that's a slightly easier way to iterate the db so I can synthesize it into a Promise.all()-compatible form?

@domenic
Collaborator Author

domenic commented Jan 14, 2019

I agree there's generally a missing language-level primitive for converting an async iterator into a promise for an array. I'll start noodling on that. I did suggest maybe having something built-in for convenience in #6 (comment), but reactions seemed negative to making this "antipattern" too easy.

In the meantime, it is a pretty simple wrapper to write:

async function collectAsyncIterator(ai) {
  const result = [];
  for await (const x of ai) {
    result.push(x);
  }
  return result;
}
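With the proposed API shape, usage would then look like, e.g.:

const allEntries = await collectAsyncIterator(storage.entries());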

I don't really understand @inexorabletash's suggestion (what work is the navigator.locks lock doing?) or @tabatkins's response :)

@tabatkins

@inexorabletash's suggestion was a simpler way to iterate the indexeddb directly; presumably with the assumption of a "kvstorage" lock that the kvStorage library also uses, so you can be sure you've consumed the whole thing without intervening edits.

But yeah, I guess the converter between async-iterable and Promise<Array> is pretty trivial.

@inexorabletash
Member

Apologies for being vague. I was attempting to address:

...easily iterate over the "original" version of the db, and mutate it in the loop without seeing the changes...

... and somewhat mis-parsed as wanting to be able to iterate without seeing changes coming from another context. On re-reading, I don't think that's what @tabatkins was getting at. Attempting to restate his use case using different terminology: iterate over a snapshot, while making changes during iteration that aren't reflected in the iteration.

My suggestion just uses locks to introduce a transaction so that the iteration isn't polluted by changes from another context (assuming all contexts wrap access by grabbing a lock), which is a solution to a different problem.

In general, the "operate on a snapshot while allowing mutations" is something databases provide (leveldb, sqlite, etc), although is expensive and not guaranteed, so e.g. IndexedDB doesn't guarantee it. In IDB, if you have a read-only transaction, you don't get to mutate. If you have a read-write transaction, you observe your own writes even prior to the transaction commit.

@vweevers

In general, the "operate on a snapshot while allowing mutations" is something databases provide (leveldb, sqlite, etc), although is expensive [..]

At the risk of derailing this thread (is there a better place to discuss this?): asynchronously reading from a snapshot would be very nice to have. Might not be that expensive in Chrome at least, seeing as it uses LevelDB under the hood (and LevelDB snapshots are lightweight).

My suggestion just uses locks to introduce a transaction so that the iteration isn't polluted by changes from another context

@inexorabletash Relying on Web Locks would only work for small data sets, right? E.g. you wouldn't want to lock everything while slowly reading a million rows.

IndexedDB doesn't guarantee it.

@inexorabletash Can you expand on that? I was under the assumption that IndexedDB does offer "snapshot guarantees" on a read-only transaction. And (level-js) unit tests confirm this, in Chrome, FF and Opera (IE and Edge occasionally fail).

@inexorabletash
Member

(Apologies for the further derail. Happy to take this up in an issue on the IDB repo if more discussion is needed - https://github.com/w3c/IndexedDB)

Relying on Web Locks would only work for small data sets, right?

Probably not a great application architecture, no. (aside: Web Locks do support shared/exclusive)

I was under the assumption that IndexedDB does offer "snapshot guarantees" on a read-only transaction.

The spec details when transactions can be started; the rule is that RO transactions should see consistent data, and it clarifies that this can be done either by having RO transactions block R/W transactions, or by having RO transactions snapshot so that subsequent R/W transactions can start. Currently, only Chrome snapshots -- in other browsers, R/W transactions are blocked until RO transactions finish -- but we're changing that as part of some architectural rework of the implementation.

Restating: the IDB spec guarantees that no changes are observed within a RO transaction, but allows that to be provided with or without snapshots.

@inexorabletash
Member

inexorabletash commented Jan 15, 2019

Also...

asynchronously reading from a snapshot would be very nice to have. Might not be that expensive in Chrome at least, seeing as it uses LevelDB under the hood (and LevelDB snapshots are lightweight).

IMHO, this is better tackled with even lower level primitives than kv-storage, e.g. allowing LevelDB in the browser via a WASM-friendly storage layer, so you literally run LevelDB and can tune it for your app's needs.

@domenic
Collaborator Author

domenic commented Jan 15, 2019

Update: I have an implementation I'm happy with, and a boatload of tests, over at https://chromium-review.googlesource.com/c/chromium/src/+/1396453 . It has pretty nice behavior if you try to set values while iterating and such.

If anyone has more test suggestions let me know (here or there). I'll get back to working on a spec patch tomorrow, using that as the basis.

@domenic
Collaborator Author

domenic commented Jan 16, 2019

First draft up at #47. I realized I have no tests for the error cases yet, so writing those, and potentially fixing the spec if it handles them badly, is the next step.
