fetchfox v0.0.35

Getting started

Install the package and playwright:

npm i fetchfox
npx playwright install-deps
npx playwright install

Then use it. Here is the callback style:

import { fox } from 'fetchfox';

const results = await fox
  .init('https://pokemondb.net/pokedex/national')
  .extract({ name: 'Pokemon name', number: 'Pokemon number' })
  .limit(3)
  .run(null, (delta) => { console.log(delta.item) });
  
for (const result of results) {
  console.log('Item:', result.item);
}

If you prefer, you can use the streaming style:

import { fox } from 'fetchfox';

const stream = fox
  .init('https://pokemondb.net/pokedex/national')
  .extract({ name: 'Pokemon name', number: 'Pokemon number' })
  .stream();

for await (const delta of stream) {
  console.log(delta.item);
}
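
The two styles share the same pipeline steps, so you can mix and match them. As a sketch, assuming .limit() composes with .stream() the same way it does with .run(), you can cap a streamed scrape like this:

import { fox } from 'fetchfox';

// A sketch: stop after the first 3 items. This assumes .limit() behaves
// the same in a streamed pipeline as it does in the callback example above.
const stream = fox
  .init('https://pokemondb.net/pokedex/national')
  .extract({ name: 'Pokemon name', number: 'Pokemon number' })
  .limit(3)
  .stream();

for await (const delta of stream) {
  console.log(delta.item);
}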

Read on below for instructions on how to configure your API key and AI model.

Enter your API key

You'll need to provide an API key for the AI provider you're using, such as OpenAI. There are a few ways to do this.

The easiest option is to set the OPENAI_API_KEY environment variable. This will get picked up by the FetchFox library, and all AI calls will go through that key. To use this option, run your code like this:

OPENAI_API_KEY=sk-your-key node index.js

Alternatively, you can pass in your API key in code, like this:

import { fox } from 'fetchfox';

const results = await fox
  .config({ ai: { model: 'openai:gpt-4o-mini', apiKey: 'sk-your-key' }})
  .run(`https://news.ycombinator.com/news find links to comments, get basic data, export to out.jsonl`);

This will use OpenAI's gpt-4o-mini model and the API key you specify. You can also pass in models from other providers, like this:

const results = await fox
  .config({ ai: { model: 'anthropic:claude-3-5-sonnet-20240620', apiKey: 'your-anthropic-key' }})
  .run(`https://news.ycombinator.com/news find links to comments, get basic data, export to out.jsonl`);

Choose the AI model that best suits your needs.

Start prompting

The easiest approach is to use a single prompt, as in the example below.

import { fox } from 'fetchfox';

const results = await fox.run(
  `https://news.ycombinator.com/news find links to comments, get basic data, export to out.jsonl`);

For more control, you can specify the steps individually, as shown below.

import { fox } from 'fetchfox';

const results = await fox
  .init('https://news.ycombinator.com/news')
  .crawl('find links to the comment pages')
  .extract('get the following data: article name, top comment text, top commenter username')
  .schema({ articleName: '', commentText: '', username: '' })
  .export('out.jsonl');
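
The .jsonl extension indicates JSON Lines output: one JSON object per line of out.jsonl. As a minimal sketch, you can read the exported file back in plain Node like this:

import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';

// Read out.jsonl back, parsing one JSON object per non-empty line.
const rl = createInterface({ input: createReadStream('out.jsonl') });

for await (const line of rl) {
  if (line.trim()) {
    console.log(JSON.parse(line));
  }
}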

You can chain steps to do more complicated scrapes. The example below does the following:

  1. Start on the GitHub page for the bitcoin project
  2. Find 10 commits
  3. Get data about them, including lines of code changed
  4. Filter for only the ones that changed at least 10 lines of code
  5. Get the authors of those commits, and find the repos those authors commit to

This scrape will take some time, so there is an option to output incremental results.

import { fox } from 'fetchfox';

const f = fox
  .config({ diskCache: '/tmp/fetchfox_cache' })
  .init('https://github.com/bitcoin/bitcoin/commits/master')
  .crawl('find urls of commits, limit: 10')
  .extract('get commit hash, author, and loc changed')
  .filter('commits that changed at least 10 lines')
  .crawl('get urls of the authors of those commits')
  .extract('get username and repos they commit to')
  .schema({ username: 'username', repos: ['array of repos'] });

const results = await f.run(null, ({ delta, index }) => {
  console.log(`Got incremental result on step ${index}:`, delta);
});

The library is modular, and you can use the components individually.

import { Crawler, SinglePromptExtractor } from 'fetchfox';

const ai = 'openai:gpt-4o-mini';
const crawler = new Crawler({ ai });
const extractor = new SinglePromptExtractor({ ai });

const url = 'https://news.ycombinator.com';
const questions = [
  'what is the article title?',
  'how many points does this submission have? only number',
  'how many comments does this submission have? only number',
  'when was this article submitted? convert to YYYY-MM-DD HH:mm{am/pm} format',
];

for await (const link of crawler.stream(url, 'comment links')) {
  console.log('Extract from:', link.url);
  for await (const item of extractor.stream(link.url, questions)) {
    console.log(item);
  }
}

Choosing the right AI model

FetchFox lets you swap in a variety of different AI providers and models. You can check the src/ai/... directory for the list of currently supported providers.

By default, FetchFox uses OpenAI's gpt-4o-mini model. We've found this model to provide a good tradeoff between cost, runtime, and accuracy. You can read more about benchmarking on our blog.
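
For example, to try a different model while keeping your API key in the environment, you can reuse the .config() call shown earlier. A minimal sketch, assuming the apiKey field can be omitted when OPENAI_API_KEY is set:

import { fox } from 'fetchfox';

// A sketch: select a different OpenAI model via .config(). The API key
// is assumed to be picked up from the OPENAI_API_KEY environment variable.
const results = await fox
  .config({ ai: { model: 'openai:gpt-4o' } })
  .init('https://pokemondb.net/pokedex/national')
  .extract({ name: 'Pokemon name', number: 'Pokemon number' })
  .limit(3)
  .run();

for (const result of results) {
  console.log('Item:', result.item);
}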

CLI

You can also use the command line tool. Install it globally:

npm install -g fetchfox

And then run the extract command:

fetchfox extract https://www.npmjs.com/package/@tinyhttp/cookie \
  'what is the package name?,what is the version number?,who is the main author?'

Or use npx instead:

npx fetchfox extract https://www.npmjs.com/package/@tinyhttp/cookie \
  'what is the package name?,what is the version number?,who is the main author?'
