Mercator Crawler

Provides a URL frontier, a metadata fetcher, and a URL deduper, and is very plug-and-play friendly 😄.

To run the default Mercator Crawler with no options (this fetches metadata and provides a Readability-like function that extracts the main content or article body):

import { Mercator } from "mercator-crawler";

(async () => {
	const mercator = new Mercator();

	// Do not await seedURL here. Its promise can only be awaited after you have called runToCompletion or iterated through all the data sent back.
	mercator.seedURL("https://www.wsj.com/articles/magnus-carlsen-ian-nepomniachtchi-world-chess-championship-computer-analysis-11639003641").then(x => {
		console.log(x);
	});

	await mercator.runToCompletion();
})();

Example 2:

import { Mercator } from "mercator-crawler";

(async () => {
	const mercator = new Mercator();

	// sendURL can be awaited directly, because it automatically runs to completion.
	const {articleBody, metadata} = await mercator.sendURL("https://www.wsj.com/articles/magnus-carlsen-ian-nepomniachtchi-world-chess-championship-computer-analysis-11639003641");
	
	console.log(articleBody);
	console.log(metadata);
})();

URL Frontier

A URL frontier's job is to provide preference (deciding which URLs get crawled first) and politeness (not requesting pages from the same host too frequently).

Currently there is very little preference built in (you can provide your own through the MercatorSettings).
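
To make preference and politeness concrete, here is a minimal sketch of a frontier that combines the two. It is not how mercator-crawler implements its frontier: PoliteFrontier, add, next, and delayMs are hypothetical names invented for this illustration, and the caller passes in the current time so the behaviour is easy to follow.

```typescript
// Hypothetical sketch: none of these names come from mercator-crawler.
type Entry = { url: string; priority: number };

class PoliteFrontier {
	private queue: Entry[] = [];
	// Time (in ms) of the most recent fetch handed out for each host.
	private lastFetch = new Map<string, number>();
	private delayMs: number;

	constructor(delayMs: number) {
		this.delayMs = delayMs;
	}

	// Preference: a higher priority number means "crawl sooner".
	add(url: string, priority = 0): void {
		this.queue.push({ url, priority });
		// Keep the queue in descending priority order.
		this.queue.sort((a, b) => b.priority - a.priority);
	}

	// Politeness: return the highest-priority URL whose host has not been
	// fetched within the last delayMs, or undefined if every host must wait.
	next(now: number): string | undefined {
		const entry = this.queue.find(({ url }) => {
			const last = this.lastFetch.get(new URL(url).host);
			return last === undefined || now - last >= this.delayMs;
		});
		if (entry === undefined) return undefined;
		this.queue = this.queue.filter((e) => e !== entry);
		this.lastFetch.set(new URL(entry.url).host, now);
		return entry.url;
	}
}
```

With a 1000 ms delay, two queued URLs on the same host are handed out at least one second apart, while a URL on a different host can be returned in between even if its priority is lower.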

Metadata fetcher

Fetches general information about a given URL.

URL Deduper

This isn't the technical term, but it basically stops duplicate URLs from entering the URL frontier at the same time.
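
One way to picture this is a set of URLs that are currently in the frontier, checked before a new URL is admitted. The sketch below is an assumption about the general technique, not mercator-crawler's actual code: InFlightDeduper, tryAdd, and release are hypothetical names, and the normalization step (dropping the fragment) is just one reasonable choice.

```typescript
// Hypothetical sketch: these names are illustrative, not mercator-crawler's API.
class InFlightDeduper {
	private inFlight = new Set<string>();

	// Normalize so trivially different spellings compare equal: the URL
	// parser lowercases the scheme and host, and the fragment is dropped
	// because it is never sent to the server.
	private static normalize(rawUrl: string): string {
		const url = new URL(rawUrl);
		url.hash = "";
		return url.toString();
	}

	// Returns true if the URL was admitted, or false if an equivalent URL
	// is already in the frontier.
	tryAdd(rawUrl: string): boolean {
		const key = InFlightDeduper.normalize(rawUrl);
		if (this.inFlight.has(key)) return false;
		this.inFlight.add(key);
		return true;
	}

	// Call once the URL has been processed so it may be queued again later.
	release(rawUrl: string): void {
		this.inFlight.delete(InFlightDeduper.normalize(rawUrl));
	}
}
```

Because entries are released after processing, this only prevents duplicates that are in the frontier at the same time; remembering every URL ever seen would need a separate, persistent set.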

Resources

Video on web crawling (follows a similar architecture to the IR book): https://www.youtube.com/watch?v=BKZxZwUgL3Y

Single chapter on the URL frontier: https://nlp.stanford.edu/IR-book/html/htmledition/the-url-frontier-1.html

Book on Information Retrieval (look at the 19th and 20th chapters, "Web search basics" and "Web crawling and indexes"): https://nlp.stanford.edu/IR-book/

johnsonjo4531/mercator-crawler
