2.0.0-alpha.11 • Published 2 years ago

@unblocked-web/specifications v2.0.0-alpha.11

License
MIT

Unblocked Specification

The Unblocked Specification defines a generic protocol to create an automated browser "Agent" and allow for "Plugins" that can control the agent at every point in the stack as it navigates and interacts with web pages. The goal is to allow Plugins to be written in a generic manner which can allow Automated Engines to avoid being "blocked" while extracting public data.

Why is this needed?

There are amazing tools available like Puppeteer and Playwright to control automated web browsers. These tools allow for coding interactions with websites. However... as they're currently built, they can be detected by websites.

Headless Chrome is initialized with different services and features than headed Chrome (not to mention differences between Chromium and Chrome). These differences can be detected along the whole spectrum of a web browser session, from TLS to HTTP to the DOM. For a detailed analysis of these differences, check out Double Agent.

To scrape website data, scrapers also need to be able to rotate user attributes like User Agent, IP Address, Language, Geolocation, and even lower level attributes like the WebGL settings and Canvas output.

This Specification defines a series of "hooks" that allow for reliably controlling these settings.

NOTE: Many settings are available within the regular Devtools Specification, but the browser must be appropriately "paused" at each step or the settings will not be injected in time.

Plugins

Plugins are defined as a collection of "hooks" and an EmulationProfile to coordinate among many plugins.

Emulation Profile

An EmulationProfile is a set of configurations that the Agent and Plugins will coordinate to emulate in an automated browser. During initialization, each "installed" plugin will be passed in the EmulationProfile with a chance to help define the attributes of the scraping session. A plugin might provide a UserAgent and BrowserEngine, or might have logic to set WebGL settings from data files.

Configurations include:

  • userAgentOption IUserAgentOption. An object to be provided by a participating plugin that represents a UserAgent that can be emulated.
  • browserEngine IBrowserEngine. Metadata about the Browser executable and launch arguments that should be used to launch the underlying browser process (eg, Chrome 98).
  • deviceProfile IDeviceProfile. Settings relevant to the hardware to be emulated, including Media devices and Graphics card settings.
  • options IEmulationOptions. Options to configure user and browser settings. These are passed on from a Client program. These same settings are applied to the Profile itself. A plugin can opt to modify these if needed, or set them with defaults.
  • customEmulatorConfig object. Settings to be passed to individual Plugins. The @unblocked-web/default-browser-emulator uses a custom userAgentSelector syntax, which is an example of this property.
  • logger IBoundLog. Optional logger instance to use for output.

  • dnsOverTlsProvider object. Configure the host and port to use for DNS over TLS. This feature replicates the Chrome feature that is used if the host DNS provider supports DNS over TLS or DNS over HTTPS. A null value will disable this feature.

    • host string. The DNS provider host address. Google=8.8.8.8, Cloudflare=1.1.1.1, Quad9=9.9.9.9.
    • servername string. The DNS provider tls servername. Google=dns.google, Cloudflare=cloudflare-dns.com, Quad9=dns.quad9.net.
  • geolocation IGeolocation. Overrides the geolocation of the user.
    • latitude number. Latitude between -90 and 90.
    • longitude number. Longitude between -180 and 180.
    • accuracy number. Non-negative accuracy value. Defaults to random number 40-50.
  • timezoneId string. Overrides the host timezone. A list of valid ids is available at unicode.org.
  • locale string. Overrides the host language settings (eg, en-US). Locale will affect the navigator.language value and the Accept-Language request header value, as well as number and date formatting rules.
  • viewport IViewport. Sets the emulated screen size, window position in the screen, inner/outer width and height.
    • width number. The page width in pixels (minimum 0, maximum 10000000).
    • height number. The page height in pixels (minimum 0, maximum 10000000).
    • deviceScaleFactor number defaults to 1. Specify device scale factor (can be thought of as dpr).
    • screenWidth? number. The optional screen width in pixels (minimum 0, maximum 10000000).
    • screenHeight? number. The optional screen height in pixels (minimum 0, maximum 10000000).
    • positionX? number. Optional override browser X position on screen in pixels (minimum 0, maximum 10000000).
    • positionY? number. Optional override browser Y position on screen in pixels (minimum 0, maximum 10000000).
  • upstreamProxyUrl string. A socks5 or http proxy url (and optional auth) to use for all HTTP requests in this session. The optional "auth" should be included in the UserInfo section of the url, eg: http://username:password@proxy.com:80.
  • upstreamProxyIpMask object. Optional settings to mask the Public IP Address of a host machine when using a proxy. This is used by the default BrowserEmulator to mask WebRTC IPs.
    • ipLookupService string. The URL of an http based IpLookupService. Defaults to ipify.org.
    • proxyIp string. The optional IP address of your proxy, if known ahead of time.
    • publicIp string. The optional IP address of your host machine, if known ahead of time.
  • showChrome boolean. A boolean whether to show the Chrome browser window.
  • disableDevtools boolean. Do not automatically show devtools when showChrome is enabled.
  • disableIncognito boolean. Disable the use of an incognito context.
  • disableMitm boolean. Disable the use of a man-in-the-middle server. This stops the ability to mimic the TLS signature of a headed Chrome version.
  • noChromeSandbox boolean. A boolean to disable the Chrome Sandbox requirement on Linux.
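
As a rough illustration of how a client might fill in some of these options, here is a minimal sketch. The `...Sketch` interfaces are simplified stand-ins written for this example (they are not the package's actual type definitions), and `validateOptions` is a hypothetical helper that only checks the numeric ranges stated in the list above.

```typescript
// Simplified stand-ins for the option shapes listed above. The real
// interfaces in the specification package carry more fields.
interface IGeolocationSketch {
  latitude: number;
  longitude: number;
  accuracy?: number;
}

interface IViewportSketch {
  width: number;
  height: number;
  deviceScaleFactor?: number;
}

interface IEmulationOptionsSketch {
  dnsOverTlsProvider?: { host: string; servername: string } | null;
  geolocation?: IGeolocationSketch;
  timezoneId?: string;
  locale?: string;
  viewport?: IViewportSketch;
  upstreamProxyUrl?: string;
  showChrome?: boolean;
}

// Hypothetical helper: checks the numeric ranges stated in the list above.
function validateOptions(options: IEmulationOptionsSketch): string[] {
  const errors: string[] = [];
  const { geolocation, viewport } = options;
  if (geolocation) {
    if (Math.abs(geolocation.latitude) > 90) errors.push('latitude must be between -90 and 90');
    if (Math.abs(geolocation.longitude) > 180) errors.push('longitude must be between -180 and 180');
    if (geolocation.accuracy !== undefined && geolocation.accuracy < 0) {
      errors.push('accuracy must be non-negative');
    }
  }
  if (viewport) {
    for (const key of ['width', 'height'] as const) {
      const value = viewport[key];
      if (value < 0 || value > 10000000) errors.push(`${key} must be between 0 and 10000000`);
    }
  }
  return errors;
}

// Example values only; the proxy URL is the placeholder used in the list above.
const exampleOptions: IEmulationOptionsSketch = {
  dnsOverTlsProvider: { host: '1.1.1.1', servername: 'cloudflare-dns.com' },
  geolocation: { latitude: 51.5072, longitude: -0.1276 },
  timezoneId: 'Europe/London',
  locale: 'en-GB',
  viewport: { width: 1280, height: 720, deviceScaleFactor: 1 },
  upstreamProxyUrl: 'http://username:password@proxy.com:80',
  showChrome: false,
};

console.log(validateOptions(exampleOptions));
```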

Plugin Creation

An individual plugin should implement the specification defined at IUnblockedPlugin. Any desired hooks should be added as functions to a class.

class MyFirstPlugin implements IUnblockedPlugin {
  async onNewPage(page) {
    // do something
  }
}

A plugin can optionally participate in a scrape and set Emulation Profile attributes by adding a static class function called shouldActivate.

class MyFirstPlugin implements IUnblockedPlugin {
  static shouldActivate(profile: IEmulationProfile): boolean {
    // 1. A plugin can set properties.
    if (!profile.browserEngine) profile.browserEngine = getMySuperEngine();
    // 2. A plugin can set defaults
    if (!profile.locale) profile.locale = 'en-GB';
    // 3. A plugin can choose to participate in this session.
    return doISupportTheUserAgent(profile.userAgentOption);
    // NOTE: A plugin should likely not change profile settings if it does not participate.
  }
}

Lifecycle

Plugins are created with an Agent and thrown away when the Agent is closed.

Plugin Coordination

One or more Plugins are added to a single IUnblockedPlugins manager that will be expected to follow a short specification. An implementor will need to be able to call a class level shouldActivate function on each Plugin class.

  1. Each registered Plugin with a static method called shouldActivate must be called in the order Plugins are registered. The same EmulationProfile object must be passed into each call. If a Plugin responds with false, it should not be used for the given session. If no method exists, it should always be activated.
  2. An instance of each participating Plugin will be constructed with the EmulationProfile object.
  3. Only a single instance of playInteractions will be allowed. It should be the last implementation provided.
  4. Only a single instance of addDomOverride will be used. It should be the first implementation that indicates it can run the override by returning true.
  5. The Plugin will last for the duration of an Agent session, and should be disposed afterwards.
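
The coordination rules above could be sketched roughly as follows. This is not the package's own IUnblockedPlugins implementation: the `...Sketch` interfaces, the three example plugin classes, and the function names are all hypothetical and exist only to show activation order (steps 1 and 2), the "last playInteractions wins" rule (step 3), and the "first addDomOverride returning true wins" rule (step 4).

```typescript
interface IEmulationProfileSketch {
  locale?: string;
  userAgentOption?: { browserName: string };
}

interface IPluginSketch {
  playInteractions?(): void;
  addDomOverride?(runOn: 'page' | 'worker', script: string): boolean;
}

interface IPluginClassSketch {
  new (profile: IEmulationProfileSketch): IPluginSketch;
  shouldActivate?(profile: IEmulationProfileSketch): boolean;
}

// Steps 1 and 2: ask each class in registration order, sharing one profile object.
function activatePlugins(
  pluginClasses: IPluginClassSketch[],
  profile: IEmulationProfileSketch,
): IPluginSketch[] {
  const active: IPluginSketch[] = [];
  for (const PluginClass of pluginClasses) {
    // A class without shouldActivate is always activated.
    const wantsIn = PluginClass.shouldActivate ? PluginClass.shouldActivate(profile) : true;
    if (wantsIn) active.push(new PluginClass(profile));
  }
  return active;
}

// Step 3: only the last playInteractions implementation is used.
function findInteractionPlayer(plugins: IPluginSketch[]): IPluginSketch | undefined {
  for (let i = plugins.length - 1; i >= 0; i -= 1) {
    if (plugins[i].playInteractions) return plugins[i];
  }
  return undefined;
}

// Step 4: the first plugin whose addDomOverride returns true handles the override.
function addDomOverride(plugins: IPluginSketch[], runOn: 'page' | 'worker', script: string): boolean {
  for (const plugin of plugins) {
    if (plugin.addDomOverride && plugin.addDomOverride(runOn, script)) return true;
  }
  return false;
}

class LocaleDefaultsPlugin implements IPluginSketch {
  static shouldActivate(profile: IEmulationProfileSketch): boolean {
    if (!profile.locale) profile.locale = 'en-GB';
    return true;
  }
  constructor(readonly profile: IEmulationProfileSketch) {}
  addDomOverride(): boolean {
    return false; // declines, so a later plugin gets the chance
  }
}

class FirefoxOnlyPlugin implements IPluginSketch {
  static shouldActivate(profile: IEmulationProfileSketch): boolean {
    return profile.userAgentOption !== undefined && profile.userAgentOption.browserName === 'Firefox';
  }
  playInteractions(): void {}
}

class FallbackPlugin implements IPluginSketch {
  playInteractions(): void {}
  addDomOverride(): boolean {
    return true;
  }
}

const sharedProfile: IEmulationProfileSketch = { userAgentOption: { browserName: 'Chrome' } };
const activePlugins = activatePlugins(
  [LocaleDefaultsPlugin, FirefoxOnlyPlugin, FallbackPlugin],
  sharedProfile,
);
```

With the Chrome profile used here, FirefoxOnlyPlugin opts out, so FallbackPlugin ends up as both the interaction player and the plugin that runs DOM overrides.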

Agent

The /agent folder of this specification defines all of the hooks that are expected by an Agent in order to intercept and adjust it to remain unblocked.

  • /agent/hooks: This folder has interfaces describing all of the "hook" points an Agent is expected to expose
  • /agent/browser: The browser-related interfaces, like a Browser, BrowserContext (incognito Window), Page, Frame, etc
  • /agent/net: The network stack, including taps into lower level protocols
  • /agent/interact: An interaction specification, allowing for grouping interaction steps.

NOTE: This set of interfaces was initially extracted from the SecretAgent project (https://github.com/unblocked-web/secret-agent). As such, it has too broad a spec. It should be whittled down over time.

To reach the goal of emulating a human using a regular browser, the following "hooks" must be provided by an implementor:

Browser

Browser level hooks are called at a Browser level.

onNewBrowser(browser, launchArgs)

Called anytime a new Browser will be launched. The hooking method (eg, BrowserEmulator) can manipulate the browser.engine.launchArguments to control Chrome launch arguments. A list can be found here.

  • browser: IBrowser a Browser instance. Do not manipulate beyond launchArguments unless you really know what you're doing.
  • launchArgs: IBrowserLaunchArgs arguments provided by a user or set in the environment that an emulator should use to appropriately set the launchArguments.

NOTE: a new browser might be reused by an implementor, so you should not assume this method will be called one-to-one with your scraper sessions.
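
A minimal sketch of an onNewBrowser hook follows. The two interfaces are simplified stand-ins, and the showChrome and noChromeSandbox fields on the launch args are an assumption borrowed from the options list earlier in this document. Because a browser might be reused, the sketch adds each flag at most once.

```typescript
// Simplified stand-ins; only the fields this sketch touches are modeled.
interface IBrowserSketch {
  engine: { launchArguments: string[] };
}

interface IBrowserLaunchArgsSketch {
  showChrome?: boolean; // assumed field name
  noChromeSandbox?: boolean; // assumed field name
}

// Avoid duplicate flags if the hook runs more than once for a reused browser.
function addOnce(args: string[], flag: string): void {
  if (args.indexOf(flag) === -1) args.push(flag);
}

class LaunchArgsPlugin {
  onNewBrowser(browser: IBrowserSketch, launchArgs: IBrowserLaunchArgsSketch): void {
    const args = browser.engine.launchArguments;
    if (launchArgs.noChromeSandbox) addOnce(args, '--no-sandbox');
    if (!launchArgs.showChrome) addOnce(args, '--headless');
  }
}

const exampleBrowser: IBrowserSketch = { engine: { launchArguments: [] } };
const launchPlugin = new LaunchArgsPlugin();
launchPlugin.onNewBrowser(exampleBrowser, { noChromeSandbox: true });
launchPlugin.onNewBrowser(exampleBrowser, { noChromeSandbox: true }); // second call adds nothing
```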

onNewBrowserContext(context)

Called anytime a new BrowserContext has been created. A BrowserContext is the equivalent of a Chrome Incognito Window. This "hook" Promise will be resolved before any Pages are created in the BrowserContext. This is a mechanism to isolate the User Storage and Cookies for a scraping session.

  • context: IBrowserContext* a BrowserContext instance that has just been opened.

onDevtoolsPanelAttached(devtoolsSession)

Called anytime a new Devtools Window is opened anywhere in the Browser.

A DevtoolsSession object can send and receive any Devtools Protocol APIs and Events supported by the given Browser.

devtoolsSession: IDevtoolsSession a DevtoolsSession instance connected to the Devtools Panel.

NOTE: this only happens when a browser is launched into a Headed mode.

BrowserContext

These hooks are called on an individual BrowserContext.

addDomOverride(runOn, script, args, callback?)

Add a custom DOM override to the plugin. The function will be run only by the first Plugin that returns true.

  • runOn page | worker Where to run this script.
  • script string A script to be run in the page. It will be provided with access to Proxy utilities (currently matching the Unblocked Default Browser Emulator _proxyUtils).
  • args { callbackName?: string } & any Arguments to provide to the script. callbackName should be used to specify the name of a callback that will be injected onto the page. It's recommended to immediately delete this function to prevent it being detected by a webpage.
  • callback? (data: string, frame: IFrame) => any An optional callback to inject onto the page. If args.callbackName is provided, the callback will be injected under that name and made available to the script through that argument.

onNewPage(page)

Called anytime a new Page will be opened. The hooking Method can perform Devtools API calls using page.devtoolsSession.

An implementor is expected to pause the Page and allow all Devtools API calls and Page scripts to be registered before the Page will render. This is likely done by instructing Chrome to pause all new pages in the debugger by default. The debugger will be resumed ONLY after all initialization API calls are sent.

page: IPage the created page paused waiting for the debugger. NOTE: you should not expect to get responses to Devtools APIs before the debugger has resumed.

onNewWorker(worker)

Called anytime a new Service, Shared or Web Worker will be created. The hooking method can perform Devtools API calls using worker.devtoolsSession.

The Worker will be paused until hook methods are completed.

worker: IWorker the created worker.

onDevtoolsPanelAttached(devtoolsSession)

Called anytime a new Devtools Window is opened for a Page in this BrowserContext.

devtoolsSession: IDevtoolsSession the connected DevtoolsSession instance.

NOTE: this only happens when a browser is launched into a Headed mode.

onDevtoolsPanelDetached(devtoolsSession)

Called anytime a Devtools Window is closed for a Page in this BrowserContext.

devtoolsSession: IDevtoolsSession the disconnected DevtoolsSession instance.

Interact

Interaction hooks allow an emulator to control a series of interaction steps.

playInteractions(interactions, runFn, helper)

This hook allows a caller to manipulate the directed interaction commands to add an appearance of user interaction.

For instance, a scraper might provide instructions:

[
  [
    { command: 'scroll', mousePosition: [0, 1050] },
    { command: 'click', mousePosition: [150, 150] },
  ],
];

An interaction hook could add timeouts and appear more human by breaking a scroll into smaller chunks.

runFn({ command: 'scroll', mousePosition: [0, 500] });
runFn({ command: 'move', mousePosition: [0, 500] });
wait(100);
runFn({ command: 'scroll', mousePosition: [0, 1050] });
runFn({ command: 'move', mousePosition: [0, 1050] });
wait(100);
runFn({ command: 'click', mousePosition: [150, 150], delayMillis: 25 });

  • interactions: IInteractionGroup[]. A group of steps that are used to control the browser. Steps are things like Click on an Element, Move the Mouse to Coordinates, etc.
  • runFn: function(interaction: IInteractionStep). A provided function that will perform the final interaction with the webpage.
  • helper: IInteractionsHelper. A series of utility functions to calculate points and DOM Node locations.
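
One way to approach the chunking idea is to first turn the requested groups into a flat plan of steps and waits, and then replay that plan through runFn. The sketch below covers only the planning half. The `IInteractionStepSketch` type, the `planHumanLikeSteps` name, the 550px chunk size, the 100ms pause and the 25ms default delay are all assumptions chosen for illustration.

```typescript
// Simplified stand-in for an interaction step; only three commands are modeled.
interface IInteractionStepSketch {
  command: 'scroll' | 'move' | 'click';
  mousePosition: [number, number];
  delayMillis?: number;
}

type PlannedAction = { run: IInteractionStepSketch } | { waitMillis: number };

// Breaks each scroll into roughly equal chunks, moving the mouse along with
// each chunk and pausing afterwards. Other commands pass through with a
// small default delay.
function planHumanLikeSteps(
  groups: IInteractionStepSketch[][],
  maxScrollChunk = 550,
): PlannedAction[] {
  const plan: PlannedAction[] = [];
  let scrollY = 0; // assumes the page starts scrolled to the top
  for (const group of groups) {
    for (const step of group) {
      if (step.command !== 'scroll') {
        plan.push({ run: { ...step, delayMillis: step.delayMillis ?? 25 } });
        continue;
      }
      const [x, targetY] = step.mousePosition;
      const distance = targetY - scrollY;
      const chunks = Math.max(1, Math.ceil(Math.abs(distance) / maxScrollChunk));
      for (let i = 1; i <= chunks; i += 1) {
        const y = Math.round(scrollY + (distance * i) / chunks);
        plan.push({ run: { command: 'scroll', mousePosition: [x, y] } });
        plan.push({ run: { command: 'move', mousePosition: [x, y] } });
        plan.push({ waitMillis: 100 });
      }
      scrollY = targetY;
    }
  }
  return plan;
}

const humanPlan = planHumanLikeSteps([
  [
    { command: 'scroll', mousePosition: [0, 1050] },
    { command: 'click', mousePosition: [150, 150] },
  ],
]);
```

A playInteractions hook built on this would walk the plan in order, calling runFn for each `run` entry and sleeping for each `waitMillis` entry. Keeping the plan as plain data makes the pacing logic easy to inspect without a browser.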

beforeEachInteractionStep(step, isMouseCommand)

A callback run before each interaction step.

  • step: IInteractionStep. The step being performed: things like Click on an Element, Move the Mouse to Coordinates, etc.
  • isMouseCommand: boolean. Is this a mouse interaction step?

afterInteractionGroups()

A callback run after all interaction groups from a single playInteractions have completed.

adjustStartingMousePoint(point, helper)

A callback allowing an implementor to adjust the initial mouse position that will be visible to the webpage.

  • point: IPoint. The x,y coordinates to adjust.
  • helper: IInteractionsHelper. A series of utility functions to calculate points and DOM Node locations.

Network

Network hooks allow an emulator to control settings and configurations along the TCP -> TLS -> HTTP/2 stack.

onDnsConfiguration(settings)

Change the DNS over TLS configuration for a session. This will be called once during setup of a BrowserContext.

Chrome browsers will use the DNS over TLS configuration of your DNS host if it's supported (eg, Cloudflare, Google DNS, Quad9, etc). This setting can help mimic that usage.

Hook methods can manipulate the settings object to control the way the network stack will look up DNS requests.

  • settings: IDnsSettings. DNS Settings that can be configured.
    • dnsOverTlsConnection tls.ConnectionOptions. TLS settings used to connect to the desired DNS Over TLS provider. Usually just a host and port.
    • useUpstreamProxy boolean. Whether to dial DNS requests over the upstreamProxy (if configured). This setting determines if DNS is resolved from the host machine location or the remote location of the proxy endpoint.
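
A hook for this might look something like the sketch below. `IDnsSettingsSketch` is a simplified stand-in for IDnsSettings, the class name is hypothetical, and the host and servername values are the ones listed for dnsOverTlsProvider earlier in this document. Port 853 is the standard DNS over TLS port.

```typescript
// Simplified stand-in for IDnsSettings; the connection object models only
// the few tls.ConnectionOptions fields used here.
interface IDnsSettingsSketch {
  dnsOverTlsConnection?: { host: string; port?: number; servername?: string };
  useUpstreamProxy?: boolean;
}

const DNS_PROVIDERS = {
  Google: { host: '8.8.8.8', servername: 'dns.google' },
  Cloudflare: { host: '1.1.1.1', servername: 'cloudflare-dns.com' },
  Quad9: { host: '9.9.9.9', servername: 'dns.quad9.net' },
};

class DnsOverTlsPlugin {
  constructor(
    private readonly provider: keyof typeof DNS_PROVIDERS,
    private readonly hasUpstreamProxy: boolean,
  ) {}

  onDnsConfiguration(settings: IDnsSettingsSketch): void {
    const { host, servername } = DNS_PROVIDERS[this.provider];
    settings.dnsOverTlsConnection = { host, servername, port: 853 };
    // Resolve DNS from the proxy's location when a proxy is in use.
    settings.useUpstreamProxy = this.hasUpstreamProxy;
  }
}

const dnsSettings: IDnsSettingsSketch = {};
new DnsOverTlsPlugin('Quad9', true).onDnsConfiguration(dnsSettings);
```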

onTcpConfiguration(settings)

Change TCP settings for all Sockets created to serve webpage requests. This configuration will be called once during setup of a BrowserContext.

Different Operating Systems exhibit unique TCP characteristics. These can be used to identify when a browser says it's running on Windows 8, but its TCP behavior indicates it's actually running on Linux.

  • settings: ITcpSettings. TCP Settings that can be configured.
    • tcpWindowSize number. Set the "WindowSize" used in TCP (max number of bytes that can be sent before an ACK must be received). NOTE: some operating systems use sliding windows. So this will just be a starting point.
    • tcpTtl number. Set the "TTL" of TCP packets.
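
As a sketch of how an emulator might pick these values, the example below keys them off the operating system being emulated. The interface is a simplified stand-in for ITcpSettings, and the class and table names are hypothetical. The TTL values follow the commonly cited OS defaults (128 for Windows, 64 for macOS and Linux); the window sizes are illustrative placeholders that should be replaced with values measured from real traffic.

```typescript
interface ITcpSettingsSketch {
  tcpWindowSize?: number;
  tcpTtl?: number;
}

type OsName = 'Windows' | 'Mac OS' | 'Linux';

// Illustrative values only. Window sizes in particular vary by OS version.
const TCP_DEFAULTS: Record<OsName, Required<ITcpSettingsSketch>> = {
  Windows: { tcpTtl: 128, tcpWindowSize: 64240 },
  'Mac OS': { tcpTtl: 64, tcpWindowSize: 65535 },
  Linux: { tcpTtl: 64, tcpWindowSize: 5840 },
};

class TcpFingerprintPlugin {
  constructor(private readonly emulatedOs: OsName) {}

  onTcpConfiguration(settings: ITcpSettingsSketch): void {
    const defaults = TCP_DEFAULTS[this.emulatedOs];
    settings.tcpTtl = defaults.tcpTtl;
    settings.tcpWindowSize = defaults.tcpWindowSize;
  }
}

const tcpSettings: ITcpSettingsSketch = {};
new TcpFingerprintPlugin('Windows').onTcpConfiguration(tcpSettings);
```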

onTlsConfiguration(settings)

Change TLS settings for all secure Sockets created to serve webpage requests. This configuration will be called once per BrowserContext.

Different Browsers (and sometimes versions) will present a specific order and set of values for TLS ClientHello Ciphers, Extensions, Padding and other attributes. Because these values do not change for a specific version of a Browser, they're an easy way to pick up when a request says it's Chrome 97, but is actually coming from Node.js.

  • settings: ITlsSettings. TLS Settings that can be configured.
    • tlsClientHelloId string. A ClientHelloId that will be mimicked. This currently maps to uTLS values.
    • socketsPerOrigin number. The number of sockets to allocate before re-use for each Origin. This should mimic the source Browser settings.

onHttpAgentInitialized(agent)

Callback hook called after the network stack has been initialized. This configuration will be called once per BrowserContext.

This function can be useful to do any post setup lookup (eg, to determine the public IP allocated by a proxy URL).

  • agent: IHttpSocketAgent. The agent that has been initialized. This object will expose a method to initialize a new Socket (ie, to dial an IP lookup service).

onHttp2SessionConnect(request, settings)

Callback to manipulate the HTTP2 settings used to initialize a conversation.

Browsers and versions send specific HTTP2 settings that remain true across all operating systems and clean installations.

  • request: IHttpResourceLoadDetails. The request being made.
  • settings: IHttp2ConnectSettings. Settings that can be adjusted.
    • localWindowSize number. The HTTP2 initial window size to use.
    • settings http2.Settings. A Node.js http2 module Settings object. It can be manipulated to change the settings sent to create an HTTP2 connection.
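
A hook of this kind could be sketched as below. The interfaces are simplified stand-ins: `IHttp2SettingsSketch` mirrors a few field names from the Node.js http2 Settings object so the example needs no imports. The numeric values are illustrative placeholders loosely modeled on values often reported for Chrome; they are an assumption and should be verified against captured traffic for the browser version being emulated.

```typescript
// A few fields from the Node.js http2 Settings object, redeclared locally.
interface IHttp2SettingsSketch {
  headerTableSize?: number;
  enablePush?: boolean;
  maxConcurrentStreams?: number;
  initialWindowSize?: number;
  maxHeaderListSize?: number;
}

interface IHttp2ConnectSettingsSketch {
  localWindowSize: number;
  settings: IHttp2SettingsSketch;
}

class Http2FingerprintPlugin {
  // The request is unused in this sketch; a fuller plugin might vary settings per origin.
  onHttp2SessionConnect(_request: unknown, connect: IHttp2ConnectSettingsSketch): void {
    // Placeholder values, not authoritative for any browser version.
    connect.localWindowSize = 15728640;
    // Mutate the provided object in place so the caller sees the changes.
    connect.settings.headerTableSize = 65536;
    connect.settings.maxConcurrentStreams = 1000;
    connect.settings.initialWindowSize = 6291456;
    connect.settings.maxHeaderListSize = 262144;
  }
}

const originalSettings: IHttp2SettingsSketch = { enablePush: true };
const connectSettings: IHttp2ConnectSettingsSketch = {
  localWindowSize: 65535,
  settings: originalSettings,
};
new Http2FingerprintPlugin().onHttp2SessionConnect(null, connectSettings);
```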

beforeHttpRequest(request)

Callback before each HTTP request. This hook provides the opportunity to manipulate or bypass each request before it's sent on to the destination URL.

Browsers and versions send specific HTTP header values and order that are consistent by Resource Type, Origin, Cookie status, and more. An emulator should ensure headers are correct before a request is sent.

  • request: IHttpResourceLoadDetails. The request being made. Details listed below are relevant to headers.
    • url: URL. The full destination URL.
    • isServerHttp2: boolean. Is this an HTTP2 request (the headers are different for HTTP/1 and 2).
    • method: string. The http method.
    • requestHeaders: IncomingHeaders. The headers that should be manipulated.
    • resourceType: IResourceType. The type of resource being requested.
    • originType: OriginType. The type of origin (none,same-origin,same-site,cross-site).
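
To make the header-ordering idea concrete, here is a minimal sketch that reorders request headers to follow a preferred sequence and leaves unknown headers at the end in their original order. The `IRequestSketch` interface models only two of the fields above, and the two order lists are illustrative placeholders, not the order any particular browser sends.

```typescript
type HeaderMap = Record<string, string | string[]>;

// Simplified stand-in for the request details listed above.
interface IRequestSketch {
  isServerHttp2: boolean;
  requestHeaders: HeaderMap;
}

// Placeholder orders; a real emulator would load these from per-browser data.
const HTTP1_ORDER = ['host', 'connection', 'user-agent', 'accept', 'accept-encoding', 'accept-language', 'cookie'];
const HTTP2_ORDER = [':method', ':authority', ':scheme', ':path', 'user-agent', 'accept', 'accept-encoding', 'accept-language', 'cookie'];

// Returns a new object whose keys follow preferredOrder (case-insensitive).
// Headers not in the list keep their relative order and go last.
function orderHeaders(headers: HeaderMap, preferredOrder: string[]): HeaderMap {
  const rank = (name: string): number => {
    const index = preferredOrder.indexOf(name.toLowerCase());
    return index === -1 ? preferredOrder.length : index;
  };
  const ordered: HeaderMap = {};
  Object.keys(headers)
    .map((name, position) => ({ name, position }))
    .sort((a, b) => rank(a.name) - rank(b.name) || a.position - b.position)
    .forEach(({ name }) => {
      ordered[name] = headers[name];
    });
  return ordered;
}

class HeaderOrderPlugin {
  beforeHttpRequest(request: IRequestSketch): void {
    const order = request.isServerHttp2 ? HTTP2_ORDER : HTTP1_ORDER;
    request.requestHeaders = orderHeaders(request.requestHeaders, order);
  }
}

const exampleRequest: IRequestSketch = {
  isServerHttp2: false,
  requestHeaders: {
    Cookie: 'a=1',
    'X-Custom': '1',
    Accept: '*/*',
    Host: 'example.org',
    'User-Agent': 'ua',
  },
};
new HeaderOrderPlugin().beforeHttpRequest(exampleRequest);
```

This relies on JavaScript objects preserving insertion order for string keys, which is why the function builds a fresh object instead of sorting in place.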

beforeHttpResponse(resource)

Callback before sending an HTTP response to the Browser. This can be used to track cookies on the response, or to implement a caching layer (eg, by tracking cache headers and sending them on the HTTP request, then intercepting a 304 response and sending a 200 + body instead).

websiteHasFirstPartyInteraction(url)

Callback after a Domain has had a First-Party User Interaction.

Some Browsers have implemented rules that Cookies cannot be set for a Domain until a user has explicitly loaded that site (it can also impact things like referer headers). This was put in place to prevent the technique of redirecting a user through an ad tracking network as a way to set tracking cookies. To properly simulate cookies and headers, this method will help identify when a browser considers a Domain to have received first party interaction.

  • url: URL. The page that has been interacted with.
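
A plugin might use this hook to keep a record of which sites have been interacted with, and consult that record when deciding how to treat cookies. The sketch below is a hypothetical illustration: the class and the `canSetCookie` method are not part of the specification, and matching on the exact hostname is a simplification of however a browser defines a "Domain".

```typescript
class FirstPartyTracker {
  // Plain object used as a set of hostnames.
  private readonly interactedHosts: Record<string, true> = {};

  // Accepts anything with a hostname, which a URL instance satisfies.
  websiteHasFirstPartyInteraction(url: { hostname: string }): void {
    this.interactedHosts[url.hostname] = true;
  }

  // Hypothetical helper: would a browser with this rule accept cookies for the host?
  canSetCookie(hostname: string): boolean {
    return this.interactedHosts[hostname] === true;
  }
}

const tracker = new FirstPartyTracker();
tracker.websiteHasFirstPartyInteraction({ hostname: 'example.org' });
```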