Trying to understand the npm-plugin #151

Open
marvinhagemeister opened this issue Oct 8, 2020 · 5 comments · May be fixed by #874

marvinhagemeister commented Oct 8, 2020

I've now spent a couple of hours going through the code of our npm-plugin, but I still don't have a good grasp on the control flow. On a high level it's easy to understand what it's supposed to do, but the devil is in the details. So this issue is more of a notebook for me on my journey to figure out what is going on.

I noticed that we have multiple caches that do the same thing: resolve a module request to the actual file content. Their shape is:

Map<ModuleSpecifier, Map<FilePath, FileContent>>

Those caches only differ in that the presence of a module specifier signals that an attempt to fetch it from the network/local npm cache was already made.
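
Concretely, one of those caches might look like this (module name and contents are hypothetical, purely for illustration):

// Hypothetical snapshot of one such cache:
const CACHE = new Map([
  ['preact@10.5.4', new Map([
    ['package.json', '{"name":"preact","version":"10.5.4"}'],
    ['dist/preact.module.js', '/* bundled source */'],
  ])],
]);

// The presence of the outer key alone signals "fetch already attempted",
// even while the inner file map may still be filling up:
CACHE.has('preact@10.5.4'); // true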

There are multiple resolution strategies in the current code:

  1. Check in memory cache
  2. Check if it is available in local node_modules folder
  3. Check if the tarball is available in the global npm/yarn cache folder (not implemented yet)
  4. Retrieve tarball from registry
  5. Bundle the retrieved package and store the results in the cache from step 1.

The tricky bit is that we receive cache requests in parallel. The current code accounts for that via a whens abstraction: a pending entry that is only meant to be resolved once the requested item is available.

To ensure that parallel requests don't trigger multiple fetches of the same tarball the code uses a memo function to return the same Promise object as the initiator to await on.
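
As a rough sketch of that memoization (the names here are stand-ins, not the plugin's actual identifiers):

// Concurrent callers for the same specifier share one in-flight Promise
// instead of each triggering its own tarball fetch.
const inFlight = new Map();

function fetchOnce(specifier, fetchTarball) {
  let promise = inFlight.get(specifier);
  if (!promise) {
    // Drop the entry once settled so a failed fetch can be retried:
    promise = fetchTarball(specifier).finally(() => inFlight.delete(specifier));
    inFlight.set(specifier, promise);
  }
  return promise;
}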

Improvement ideas

  • Ideally we'd have only one file cache and different strategies for putting stuff into it. That would hopefully simplify retrieval and make the caching logic more explicit.
  • Rename whenFiles to something more descriptive like loadFile. Same for whens which could be named pending or something.
  • Is the whenFiles abstraction necessary or could we reuse a plain Promise with multiple listeners here?
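
On the last point: a plain Promise already supports any number of listeners, so the open question is really only about settling it from the outside. A minimal illustration (assuming an async context):

// One Promise, many listeners: every awaiter observes the same settlement.
const pending = new Promise(resolve => setTimeout(() => resolve('contents'), 10));

const [a, b] = await Promise.all([pending, pending]);
console.log(a === b); // true

What a bare Promise lacks is a way to resolve or reject it after creation, which is presumably what the whenFiles bookkeeping adds on top.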

developit commented Oct 8, 2020

Reading through this and the comments makes me think such a data structure and flow could be:

// A module@version string where the version is fully resolved, like "preact@10.5.4":
type NormalizedModuleSpecifier = string;
type Path = string;

// the single in-memory representation of a (version of a) module:
type ModuleCache = Map<NormalizedModuleSpecifier, ModuleCacheEntry>;

interface ModuleCacheEntry {
  // if true, streaming has completed and .files is exhaustive:
  complete: boolean;
  // already-resolved files:
  files: Map<Path, string>;
  // a new class that handles the tar streaming:
  stream: TarStreamer;
}

declare class TarStreamer {
  // like whenFiles, can contain promises for files that end up not existing:
  private pendingFiles: Map<Path, Promise<string>>;
  // returns or creates a Promise entry in .pendingFiles for the given path:
  waitForFile(path: Path): Promise<string>;
}

Resolution of a file checks files.get(path) and returns any matching entry. If there's no match and .complete is true, resolution fails (the file does not exist). If .complete is false, streaming is still in progress, so resolution calls this.stream.waitForFile(path), which returns a Promise.

In stream.waitForFile(), if there's already a Promise for the given path in .pendingFiles, it is returned. Otherwise, a new Promise is created and added to .pendingFiles. As the untar stream unpacks each file in the tarball, it checks for a matching Promise in .pendingFiles and resolves it with the file contents. At the end of streaming, any unresolved Promises in .pendingFiles are rejected with a "file does not exist" error.
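
A minimal sketch of that bookkeeping, using a "deferred" (a Promise whose resolve/reject handles are kept around) so the untar loop can settle promises from the outside; all names here are illustrative:

// A deferred exposes its resolve/reject handles for external settlement.
function createDeferred() {
  let resolve, reject;
  const promise = new Promise((res, rej) => {
    resolve = res;
    reject = rej;
  });
  return { promise, resolve, reject };
}

class TarStreamerSketch {
  pendingFiles = new Map();

  // Returns (or creates) the pending Promise for a path:
  waitForFile(path) {
    let entry = this.pendingFiles.get(path);
    if (!entry) {
      entry = createDeferred();
      this.pendingFiles.set(path, entry);
    }
    return entry.promise;
  }

  // Called by the untar loop for each extracted file:
  onFile(path, contents) {
    const entry = this.pendingFiles.get(path);
    if (entry) {
      entry.resolve(contents);
      this.pendingFiles.delete(path);
    }
  }

  // Called once the stream ends; everything still pending doesn't exist:
  onEnd() {
    for (const entry of this.pendingFiles.values()) {
      entry.reject(new Error('file does not exist'));
    }
    this.pendingFiles.clear();
  }
}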


marvinhagemeister commented Oct 8, 2020

@developit Love where this is going! The proposed interface is a huge improvement in readability for me 👍

The only thing I'm wondering is what we gain by caching a file map for each module vs. one big Map like our current DISK_CACHE where keys are module@version :: path. To me it seems like having this per module kind of duplicates that logic, but I may be missing something.

Same for TarStreamer: I'm not sure if we'd need one per module. I need to experiment with that. Fully on board with all the other stuff 👍


developit commented Oct 8, 2020

The file map within each module is important because we need it in order to handle requests for modules that haven't yet been streamed. Probably clearer in the example below, but basically when we get a request for preact@10.5.4/hooks/dist/hooks.mjs, if the "preact" tarball is already being downloaded we need a way to register a callback for both the success and error cases. An event would be really difficult for that, because in the error case (a file is requested that doesn't end up being in the tarball), nothing really knows to fire an event for that "file".

I ended up writing most of an implementation in a comment for some reason:

class NpmResolver {
  constructor({ modulesDir = './node_modules', plugins = [] }) {
    this.modulesDir = modulesDir;
    // versioned module state and in-memory cache
    this.moduleCache = new Map();
    // holds registry metadata (ie, "packuments")
    this.metaCache = new Map();
    // plugins transform all files from tarballs before they are used/stored:
    this.plugins = plugins;
  }

  async resolve({ module, version, path }) {
    path = normalizePath(path);
    version = await this.resolveVersion(module, version);
    const record = this.getModuleRecord(module, version);
    return record.getFile(path);
  }

  async resolveVersion(module, version) {
    let meta = this.metaCache.get(module);
    if (!meta) {
      // get packument for `module` from registry:
      meta = await getPackageMeta(module);
      this.metaCache.set(module, meta);
    }
    // resolve tags/aliases to versions ("latest"-->"10.5.4")
    if (meta['dist-tags'] && version in meta['dist-tags']) return meta['dist-tags'][version];
    if (version in meta.versions) return version;
    return semverMaxSatisfying(Object.keys(meta.versions), version);
  }

  getModuleRecord(module, version) {
    const specifier = `${module}@${version}`;
    let record = this.moduleCache.get(specifier);
    if (!record) {
      // could potentially skip this, since it's technically possible to generate the tarballUrl:
      const meta = this.metaCache.get(module).versions[version];
      record = new ModuleRecord(module, version, meta, this.transformFile.bind(this));
      this.moduleCache.set(specifier, record);
    }
    return record;
  }

  transformFile(file) {
    for (const plugin of this.plugins) {
      if (!plugin.transform) continue;
      const out = plugin.transform(file.data, file.name);
      if (out) file.data = out;
    }
    // asynchronously write the file to disk too
    writeNpmFile(file.module, file.name, file.data);
  }
}

class ModuleRecord {
  constructor(module, version, meta, transformFile) {
    this.module = module;
    this.version = version;
    this.completed = false;
    this.files = new Map();
    this.stream = new TarStreamer(module, version, meta.dist.tarball);
    this.stream.transformFile = transformFile;
    this.stream.done.then(() => {
      this.completed = true;
      this.stream = null; // we can get rid of all streaming state now
    });
  }
  getFile(path) {
    const file = this.files.get(path);
    if (file != null) return file;
    if (this.completed) throw Error(`${this.module}@${this.version}/${path} not found.`);
    // still streaming, return pending file promise:
    return this.stream.waitForFile(path);
  }
}

class TarStreamer {
  constructor(module, version, tarballUrl) {
    this.module = module;
    this.version = version;
    this.tarballUrl = tarballUrl;
    this.pendingFiles = new Map();
    this.transformFile = () => {};
    // initialize the state above before kicking off the (async) stream:
    this.done = this.stream();
  }
  async stream() {
    // same as getTarFiles() basically
    const { tarballUrl, module, version } = this;
    const tarStream = await getStream(tarballUrl);
    await parseTarball(tarStream, async (name, stream) => {
      let data = await streamToString(stream);
      const file = { name, data, stream, module, version };
      await this.transformFile(file);
      const pending = this.pendingFiles.get(file.name);
      if (pending) pending.resolve(file.data);
    });
    this.pendingFiles.forEach(pending => pending.reject(new Error('Not found')));
    this.pendingFiles.clear();
  }
}

marvinhagemeister commented Oct 9, 2020

@developit Currently noodling on your snippet a bit. I like the structure in yours a lot and tried to come up with an imperative alternative to the class to be able to compare the different approaches.

The main drawback I see with classes is that they encapsulate state into small entities. This works fine for now, but in my experience additional features usually require connecting that state across boundaries, at which point one is forced to remodel the whole house rather than add a function here and there.

A pattern that my brother introduced me to whilst working on our e2e-test runner pentf is ruthlessly writing very imperative functions and passing an object with stateful stuff as the first parameter.

const State = {
  cache: new Map(),
  resolved: new Set(),
}

// Most functions receive that as the first argument.
function doSomething(state, arg1, arg2) {
  // Here I can access any state
}

It may seem very unglamorous (and it probably is), but that "dumbness" has served us extremely well. No matter what new feature we added, we haven't run into cases where we needed to remodel the architecture due to lack of access to stateful stuff. Instead everything becomes a function, which is very easy to swap out or extend. This pattern adapts to changes extremely easily and quickly.

We still put boundaries around stateful stuff, though, so we didn't just have one global object. In our case such a boundary seems to be the npm-plugin itself. I see no reason why we should split it up further and limit access within it.

So I'd like to throw in another contender. The following code is based on your snippet and tries to rewrite it with an imperative style.

Note that the code would be noticeably shorter than the class approach, but I typed out the full implementation for waitForFile().

// Shared registry state that is passed to each function
export interface RegistryState {
	modulesDir: string;
	cache: Map<FullSpec, string>;
	metaCache: Map<FullSpec, any>;
	completed: Set<ModuleSpec>;
	pendingModules: Map<ModuleSpec, Map<Path, { promise: Promise<string>; resolve: Function; reject: Function }>>;
	plugins: Plugin[];
}

async function resolveModule(registry: RegistryState, info: Info) {
	const { cache, completed, metaCache } = registry;
	let { module, version, path } = info;

	version = await resolveVersion(registry, module, version);
	path = normalizePath(path);

	// Memory cache
	const cached = cache.get(`${module}@${version} :: ${path}`);
	if (cached) {
		return cached;
	}

	// If we did load the tarball but the file is not in the cache,
	// we requested a file that doesn't exist.
	if (completed.has(`${module}@${version}`)) {
		throw new Error(`${module}@${version}/${path} not found.`);
	}

	return waitForFile(registry, { module, version, path }, async () => {
		const tarball = metaCache.get(module).dist.tarball;
		streamTarball(registry, { module, version, path }, tarball);
	});
}

async function resolveVersion(registry, module, version) {
	let meta = registry.metaCache.get(module);
	if (!meta) {
		// get packument for `module` from registry:
		meta = await getPackageMeta(module);
		registry.metaCache.set(module, meta);
	}
	// resolve tags/aliases to versions ("latest"-->"10.5.4")
	if (meta['dist-tags'] && version in meta['dist-tags']) return meta['dist-tags'][version];
	if (version in meta.versions) return version;
	return semverMaxSatisfying(Object.keys(meta.versions), version);
}

// We could inline this into `resolveModule()` if we wanted to. Thought it would be clearer
// to contain the Promise loading logic here, as it's a bit unusual although necessary.
async function waitForFile(registry: RegistryState, { module, version, path }, callback) {
	const { pendingModules } = registry;
	let pending = pendingModules.get(`${module}@${version}`);
	if (pending) {
		let wait = pending.get(path);
		if (wait) return wait.promise;
		// create the promise first, then register it (referencing `promise`
		// inside its own executor would hit the temporal dead zone):
		let resolve, reject;
		const promise = new Promise((res, rej) => {
			resolve = res;
			reject = rej;
		});
		pending.set(path, { promise, resolve, reject });
		return promise;
	}

	// First time module is requested. Populate cache
	pending = new Map();
	pendingModules.set(`${module}@${version}`, pending);

	let resolve, reject;
	const promise = new Promise((res, rej) => {
		resolve = res;
		reject = rej;
	});
	pending.set(path, { promise, resolve, reject });

	callback();
	return promise;
}

async function streamTarball(registry: RegistryState, { module, version, path }, url: string) {
	const { pendingModules, completed, cache } = registry;

	const pending = pendingModules.get(`${module}@${version}`);
	const tarStream = await getStream(url);
	await parseTarball(tarStream, async (name, stream) => {
		let data = await streamToString(stream);

		for (const plugin of registry.plugins) {
			if (!plugin.transform) continue;
			const out = plugin.transform(data, name);
			if (out) data = out;
		}

		// Write the file to disk and await completion so that we don't
		// notify our Promise listeners too early.
		await writeNpmFile(module, name, data);

		cache.set(`${module}@${version} :: ${name}`, data);
		const promise = pending.get(name);
		if (promise) promise.resolve(data);
	});

	// Yay, parsing is done! Mark module as being completed and reject remaining listeners
	completed.add(`${module}@${version}`);
	pending.forEach(entry => entry.reject(new Error('Not found')));
	// No need to keep listeners in memory, future calls will bail out on `completed`
	pendingModules.delete(`${module}@${version}`);
}
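
For comparison, a hypothetical call site, assuming the functions above and a freshly initialized state object (inside an async context):

// Hypothetical usage:
const registry = {
	modulesDir: './node_modules',
	cache: new Map(),
	metaCache: new Map(),
	completed: new Set(),
	pendingModules: new Map(),
	plugins: [],
};

// Concurrent requests for the same module share one tarball stream:
const [hooks, compat] = await Promise.all([
	resolveModule(registry, { module: 'preact', version: 'latest', path: 'hooks/dist/hooks.mjs' }),
	resolveModule(registry, { module: 'preact', version: 'latest', path: 'compat/dist/compat.mjs' }),
]);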

developit commented Oct 9, 2020

I'm good with the game world style approach. The only thing in your code sample I find awkward is the callback to waitForFile. Based on the logic, it seems like we could just do that before calling waitForFile if there is no existing key in pendingModules:

	// The first time module is requested, create a cache entry and start streaming it:
	if (!pendingModules.has(`${module}@${version}`)) {
		pendingModules.set(`${module}@${version}`, new Map());
		const tarball = metaCache.get(module).dist.tarball;
		streamTarball(registry, { module, version, path }, tarball);
	}

	// now waitForFile can assume there's an entry in pendingModules:
	return waitForFile(registry, { module, version, path });
}

async function waitForFile(registry: RegistryState, { module, version, path }) {
	const { pendingModules } = registry;
	const pending = pendingModules.get(`${module}@${version}`);
	let wait = pending.get(path);
	if (wait) return wait.promise;
	wait = {};
	wait.promise = new Promise((resolve, reject) => {
		wait.resolve = resolve;
		wait.reject = reject;
	});
	pending.set(path, wait);
	return wait.promise;
}

Also, I think I mentioned it in the PR comments, but there shouldn't be any reason to wait while writing files to disk, since nothing ever reads from the disk.
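
A sketch of what the fire-and-forget write in streamTarball could look like, with just a catch so a failed write doesn't become an unhandled rejection:

		// Start the disk write but don't await it; nothing reads it back.
		writeNpmFile(module, name, data).catch(err => {
			console.error(`Failed to persist ${module}/${name}:`, err);
		});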
