Trying to understand the npm-plugin #151
Comments
Reading through this and the comments makes me think such a data structure and flow could be:

// A module@version string where the version is fully resolved, like "preact@10.5.4":
type NormalizedModuleSpecifier = string;

// The single in-memory representation of a (version of a) module:
type ModuleCache = Map<NormalizedModuleSpecifier, ModuleCacheEntry>;

interface ModuleCacheEntry {
  // if true, streaming has completed and .files is exhaustive:
  complete: boolean;
  // already-resolved files, keyed by normalized path:
  files: Map<string, string>;
  // a new class that handles the tar streaming:
  stream: TarStreamer;
}

declare class TarStreamer {
  // returns or creates a Promise entry in .pendingFiles for the given path:
  waitForFile(path: string): Promise<string>;
  // like whenFiles, can contain promises for files that end up not existing:
  private pendingFiles: Map<string, Promise<string>>;
}

Resolution of a file checks in …
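As a concrete sketch of the pendingFiles idea (the shape and hook names here are my guess at what the interface above implies, not the actual implementation), the deferred-promise bookkeeping could look like this:

```javascript
// Hypothetical sketch: hand out one shared Promise per requested path,
// resolve it when the matching tar entry arrives, and reject any
// leftover waiters once the stream has finished.
class TarStreamer {
  constructor() {
    // path -> { promise, resolve, reject }
    this.pendingFiles = new Map();
  }

  // Returns (or creates) the pending Promise for `path`; duplicate
  // requests for the same path share the same Promise object.
  waitForFile(path) {
    let entry = this.pendingFiles.get(path);
    if (!entry) {
      entry = {};
      entry.promise = new Promise((resolve, reject) => {
        entry.resolve = resolve;
        entry.reject = reject;
      });
      this.pendingFiles.set(path, entry);
    }
    return entry.promise;
  }

  // Would be called by the tar parser for each streamed entry:
  onFile(path, contents) {
    const entry = this.pendingFiles.get(path);
    if (entry) {
      entry.resolve(contents);
      this.pendingFiles.delete(path);
    }
  }

  // Would be called once the whole tarball has been parsed:
  onEnd() {
    for (const entry of this.pendingFiles.values()) {
      entry.reject(new Error('Not found'));
    }
    this.pendingFiles.clear();
  }
}
```

onFile/onEnd are invented hook names standing in for the tar-parsing side; the point is only the Map-of-deferreds shape.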
@developit Love where this is going! The proposed interface is a huge improvement in readability for me 👍 The only thing I'm wondering is what we gain by caching a file map for each module vs one big Map like our current one. Same for TarStreamer: I'm not sure if we'd need one per module. Need to experiment with that. Fully on board with all the other stuff 👍
The file map within each module is important because we need it in order to handle requests for modules that haven't yet been streamed. Probably clearer in the example below, but basically when we get a request for … I ended up writing most of an implementation in a comment for some reason:

class NpmResolver {
constructor({ modulesDir = './node_modules', plugins = [] }) {
this.modulesDir = modulesDir;
// versioned module state and in-memory cache
this.moduleCache = new Map();
// holds registry metadata (ie, "packuments")
this.metaCache = new Map();
// plugins transform all files from tarballs before they are used/stored:
this.plugins = plugins;
}
async resolve({ module, version, path }) {
path = normalizePath(path);
version = await this.resolveVersion(module, version);
const record = this.getModuleRecord(module, version);
return record.getFile(path);
}
async resolveVersion(module, version) {
let meta = this.metaCache.get(module);
if (!meta) {
// get packument for `module` from registry:
meta = await getPackageMeta(module);
this.metaCache.set(module, meta);
}
// resolve tags/aliases to versions ("latest"-->"10.5.4")
if (meta['dist-tags'] && version in meta['dist-tags']) return meta['dist-tags'][version];
if (version in meta.versions) return version;
return semverMaxSatisfying(Object.keys(meta.versions), version);
}
getModuleRecord(module, version) {
const specifier = `${module}@${version}`;
let record = this.moduleCache.get(specifier);
if (!record) {
// could potentially skip this, since it's technically possible to generate the tarballUrl:
const meta = this.metaCache.get(module).versions[version];
record = new ModuleRecord(module, version, meta, this.transformFile.bind(this));
this.moduleCache.set(specifier, record);
}
return record;
}
transformFile(file) {
for (const plugin of this.plugins) {
if (!plugin.transform) continue;
const out = plugin.transform(file.data, file.name);
if (out) file.data = out;
}
// asynchronously write the file to disk too
writeNpmFile(file.module, file.name, file.data);
}
}
class ModuleRecord {
constructor(module, version, meta, transformFile) {
this.module = module;
this.version = version;
this.completed = false;
this.files = new Map();
this.stream = new TarStreamer(module, version, meta.dist.tarball);
this.stream.transformFile = transformFile;
this.stream.done.then(() => {
this.completed = true;
this.stream = null; // we can get rid of all streaming state now
});
}
getFile(path) {
const file = this.files.get(path);
if (file != null) return file;
if (this.completed) throw Error(`${this.module}@${this.version}/${path} not found.`);
// still streaming, return pending file promise:
return this.stream.waitForFile(path);
}
}
class TarStreamer {
constructor(module, version, tarballUrl) {
this.module = module;
this.version = version;
this.tarballUrl = tarballUrl;
this.done = this.stream();
this.pendingFiles = new Map();
this.transformFile = () => {};
}
async stream() {
// same as getTarFiles() basically
const { tarballUrl, module, version } = this;
const tarStream = await getStream(tarballUrl);
await parseTarball(tarStream, async (name, stream) => {
let data = await streamToString(stream);
const file = { name, data, stream, module, version };
await this.transformFile(file);
const pending = this.pendingFiles.get(file.name);
if (pending) pending.resolve(file.data);
});
this.pendingFiles.forEach(pending => pending.reject('Not found'));
this.pendingFiles.clear();
}
}
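The dist-tag / exact / range fallthrough in resolveVersion above can be exercised on its own. A runnable sketch, where semverMaxSatisfying is replaced by a deliberately naive stand-in (the real code would use the semver package):

```javascript
// Naive stand-in for semver's maxSatisfying: treats "^X.y.z" as
// "highest published version with major X". Good enough for a demo only.
function semverMaxSatisfying(versions, range) {
  const major = range.replace(/^\^/, '').split('.')[0];
  return versions
    .filter(v => v.split('.')[0] === major)
    .sort((a, b) => b.localeCompare(a, undefined, { numeric: true }))[0];
}

// Same fallthrough as resolveVersion above, minus the packument fetch:
function resolveVersion(meta, version) {
  // resolve tags/aliases to concrete versions ("latest" --> "10.5.4"):
  if (meta['dist-tags'] && version in meta['dist-tags']) return meta['dist-tags'][version];
  // exact published version:
  if (version in meta.versions) return version;
  // otherwise treat it as a range:
  return semverMaxSatisfying(Object.keys(meta.versions), version);
}
```

Note the argument order: the misplaced parenthesis in `Object.keys(meta.versions, version)` from the original snippet would silently drop the range argument.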
@developit Currently noodling on your snippet a bit. I like the structure in yours a lot and tried to come up with an imperative alternative to the class-based approach to be able to compare the two. The main drawback I see with classes is that they encapsulate state in small entities. This works fine for now, but in my experience additional features usually require connecting these states across boundaries, at which point one is forced to remodel the whole house again rather than adding a function here and there. A pattern that my brother introduced me to whilst working on our e2e-test runner is to keep all state in one plain object and pass it to every function:

const State = {
cache: new Map(),
resolved: new Set(),
}
// Most functions receive that as the first argument.
function doSomething(state, arg1, arg2) {
// Here I can access any state
}

It may seem very unglamorous (and it probably is), but that "dumbness" has served us extremely well. No matter what new feature we added, we haven't run into cases where we needed to remodel the architecture due to lack of access to stateful stuff. Instead everything becomes a function, which is very easy to swap out or add new code to. This pattern lends itself to adapting to changes extremely easily and quickly. We still put boundaries on stateful stuff, so we didn't just have one global object though. In our case such a boundary seems to be the registry.

So I'd like to throw in another contender. The following code is based on your snippet and tries to rewrite it in an imperative style. Note that the code could be noticeably shorter than the class approach, but I typed out the full implementation for comparison:

// Shared registry state that is passed to each function
export interface RegistryState {
modulesDir: string;
cache: Map<FullSpec, string>;
metaCache: Map<FullSpec, any>;
completed: Set<ModuleSpec>;
pendingModules: Map<ModuleSpec, Map<Path, { promise, resolve, reject }>>;
plugins: Plugin[];
}
async function resolveModule(registry: RegistryState, info: Info) {
const { cache, completed, metaCache } = registry;
let { module, version, path } = info;
version = await resolveVersion(registry, module, version);
path = normalizePath(path);
// Memory cache
const cached = cache.get(`${module}@${version} :: ${path}`);
if (cached) {
return cached;
}
// If we did load the tarball, but the file is not in cache,
// we requested a file that doesn't exist.
if (completed.has(`${module}@${version}`)) {
throw new Error(`${module}@${version}/${path} not found.`);
}
return waitForFile(registry, { module, version, path }, async () => {
const tarball = metaCache.get(module).versions[version].dist.tarball;
streamTarball(registry, { module, version, path }, tarball);
});
}
async function resolveVersion(registry, module, version) {
  let meta = registry.metaCache.get(module);
  if (!meta) {
    // get packument for `module` from registry:
    meta = await getPackageMeta(module);
    registry.metaCache.set(module, meta);
  }
  // resolve tags/aliases to versions ("latest"-->"10.5.4")
  if (meta['dist-tags'] && version in meta['dist-tags']) return meta['dist-tags'][version];
  if (version in meta.versions) return version;
  return semverMaxSatisfying(Object.keys(meta.versions), version);
}
// We could inline this into `resolveModule()` if we wanted to. Thought it would be clearer
// to contain the Promise loading logic here, as it's a bit unusual although necessary.
async function waitForFile(registry: RegistryState, { module, version, path }, callback) {
  const { pendingModules } = registry;
  let pending = pendingModules.get(`${module}@${version}`);
  if (pending) {
    const existing = pending.get(path);
    if (existing) return existing.promise;
    return addPending(pending, path);
  }
  // First time the module is requested. Populate the pending cache:
  pending = new Map();
  pendingModules.set(`${module}@${version}`, pending);
  const promise = addPending(pending, path);
  callback();
  return promise;
}

// Create a deferred entry. Note that referencing `promise` inside its own
// `new Promise` executor would throw, since the executor runs synchronously
// before the const is initialized, so the resolvers are captured first:
function addPending(pending, path) {
  let resolve, reject;
  const promise = new Promise((res, rej) => {
    resolve = res;
    reject = rej;
  });
  pending.set(path, { promise, resolve, reject });
  return promise;
}
async function streamTarball(registry: RegistryState, { module, version, path }, url: string) {
const { pendingModules, completed, cache } = registry;
const pending = pendingModules.get(`${module}@${version}`);
const tarStream = await getStream(url);
await parseTarball(tarStream, async (name, stream) => {
let data = await streamToString(stream);
for (const plugin of registry.plugins) {
if (!plugin.transform) continue;
const out = plugin.transform(data, name);
if (out) data = out;
}
// Asynchronously write the file to disk, but await for completion
// so that we don't notify our Promise listeners too early.
await writeNpmFile(module, name, data);
// cache every streamed file under its own name, not just the requested path:
cache.set(`${module}@${version} :: ${name}`, data);
const promise = pending.get(name);
if (promise) promise.resolve(data);
});
// Yay, parsing is done! Mark module as being completed and reject remaining listeners
completed.add(`${module}@${version}`);
pending.forEach(wait => wait.reject('Not found'));
// No need to keep listeners in memory, future calls will bail out on `completed`
pendingModules.delete(`${module}@${version}`);
}
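To see the whole pending/completed lifecycle run end to end, here's a trimmed, self-contained sketch of the state-first version, with the registry fetch and tar parsing replaced by an in-memory fixture (createRegistry, deferred, and the fixture shape are invented for this demo):

```javascript
// State-first sketch: every function takes the shared registry state as
// its first argument; no class instances hold private state.
function createRegistry() {
  return {
    cache: new Map(),          // "module@version :: path" -> file contents
    completed: new Set(),      // "module@version" specs that finished streaming
    pendingModules: new Map(), // spec -> Map(path -> deferred)
  };
}

function deferred() {
  const d = {};
  d.promise = new Promise((resolve, reject) => {
    d.resolve = resolve;
    d.reject = reject;
  });
  return d;
}

async function resolveModule(registry, { module, version, path }, fixture) {
  const spec = `${module}@${version}`;
  const cached = registry.cache.get(`${spec} :: ${path}`);
  if (cached) return cached;
  // streaming already finished and the file never showed up:
  if (registry.completed.has(spec)) throw new Error(`${spec}/${path} not found.`);
  // first request for this module: start "streaming" it:
  if (!registry.pendingModules.has(spec)) {
    registry.pendingModules.set(spec, new Map());
    streamTarball(registry, spec, fixture); // fire and forget
  }
  const pending = registry.pendingModules.get(spec);
  let wait = pending.get(path);
  if (!wait) {
    wait = deferred();
    pending.set(path, wait);
  }
  return wait.promise;
}

// `fixture` stands in for the parsed tarball: an object of path -> contents.
async function streamTarball(registry, spec, fixture) {
  const pending = registry.pendingModules.get(spec);
  for (const [name, data] of Object.entries(fixture)) {
    await Promise.resolve(); // simulate async arrival of each tar entry
    registry.cache.set(`${spec} :: ${name}`, data);
    const wait = pending.get(name);
    if (wait) {
      wait.resolve(data);
      pending.delete(name);
    }
  }
  registry.completed.add(spec);
  pending.forEach(wait => wait.reject(new Error('Not found')));
  registry.pendingModules.delete(spec);
}
```

The demo keeps the same three pieces of shared state as RegistryState (cache, completed, pendingModules) so the control flow maps one-to-one onto the snippet above.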
I'm good with the game-world-style approach. The only thing in your code sample I find awkward is the callback passed to waitForFile. Based on the logic, it seems like we could just do that before calling waitForFile if there is no existing key in pendingModules:

// The first time module is requested, create a cache entry and start streaming it:
if (!pendingModules.has(`${module}@${version}`)) {
pendingModules.set(`${module}@${version}`, new Map());
const tarball = metaCache.get(module).dist.tarball;
streamTarball(registry, { module, version, path }, tarball);
}
// now waitForFile can assume there's an entry in pendingModules:
return waitForFile(registry, { module, version, path });
}
async function waitForFile(registry: RegistryState, { module, version, path }) {
const { pendingModules } = registry;
const pending = pendingModules.get(`${module}@${version}`);
let wait = pending.get(path);
if (wait) return wait.promise;
wait = {};
wait.promise = new Promise((resolve, reject) => {
wait.resolve = resolve;
wait.reject = reject;
});
pending.set(path, wait);
return wait.promise;
}

Also, I think I mentioned it in the PR thread, but there shouldn't be any reason to await while writing files to disk, since nothing ever reads them back from disk.
I've now spent a couple of hours going through the code of our npm-plugin, but I still don't have a good grasp on the control flow. On a high level it's easy to understand what it's supposed to do, but the devil is in the details. So this issue is more of a notebook for me on my journey to figure out what is going on.

I noticed that we have multiple caches that do the same thing: resolve a module request to the actual file content. The shape of those caches is:
Those caches only differ in that the presence of a module specifier signals that an attempt to fetch it from the network/local npm cache was already made.
There are multiple resolution strategies in the current code, among them resolving from the local node_modules folder.

The tricky bit is that we receive cache requests in parallel. The code currently accounts for that via a whens abstraction, which is only meant to resolve once an item has been resolved. To ensure that parallel requests don't trigger multiple fetches of the same tarball, the code uses a memo function so every caller awaits the same Promise object as the initiator.

Improvement ideas
- Rename whenFiles to something more descriptive like loadFile. Same for whens, which could be named pending or something.
- Is the whenFiles abstraction necessary, or could we reuse a plain Promise with multiple listeners here?
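The memo idea described above can be sketched as a tiny key-based deduplicator (a generic sketch of the concept, not the plugin's actual memo helper):

```javascript
// Memoize an async function by key so that parallel callers for the same
// key all await the single in-flight Promise instead of each triggering
// their own fetch of the same tarball.
function memo(fn) {
  const inflight = new Map();
  return key => {
    let promise = inflight.get(key);
    if (!promise) {
      promise = fn(key);
      inflight.set(key, promise);
    }
    return promise;
  };
}
```

Wrapping e.g. a fetchTarball(spec) function once means two concurrent requests for "preact@10.5.4" share one network call. A real version would probably also delete the entry on rejection so a failed fetch can be retried.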