1.0.11 • Published 9 months ago

mirror-sites v1.0.11

License
MIT

mirror-sites

Introduction

Mirror a website, or download only its resources, by simulating browser access

Usage

  1. Install: npm install -g mirror-sites
  2. Run the global command: mirror-sites
  3. Given the appropriate arguments, it downloads all network resources of the site

Arguments

-u(required): site url

The URL of the site to be mirrored, usually the homepage of the site, for example https://vuejs.org/

-remote(recommended): download third-party site resources

Also download the resources that the site references from third-party hosts, such as CDNs

For example, <script src="https://cdn.usefathom.com/script.js" /> is downloaded to /cdn.usefathom.com/script.js, and the page code is rewritten to <script src="/cdn.usefathom.com/script.js" />. For more rewriting rules, see -source

-p(recommended): using your own Chrome

By default, the first time you use mirror-sites, a Chromium browser is downloaded automatically and used for mirroring

If you would rather not download a browser and want to use your own Chrome instead, start it from the command line with remote debugging enabled (adjust the browser path to your system and installation directory)

"C:\Program Files (x86)\Google\Chrome\Application\Chrome.exe" --remote-debugging-port=9222

Start the browser as in the example above, then pass the port given to --remote-debugging-port (9222 here) as the value of this argument

For pages that require a login, log in manually in that browser first, then run the mirror again
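
Before running with -p, it can help to confirm that Chrome is listening on the debugging port. The sketch below is not part of mirror-sites. It assumes Node.js 18 or later (for the built-in fetch) and relies on the /json/version endpoint that Chrome serves when started with --remote-debugging-port. The function names debuggerVersionUrl and probeChrome are hypothetical.

```javascript
// Hypothetical helper, not part of mirror-sites.
// Builds the URL of Chrome's DevTools version endpoint for a given port.
function debuggerVersionUrl(port) {
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new RangeError(`invalid port: ${port}`);
  }
  return `http://127.0.0.1:${port}/json/version`;
}

// Resolves to the browser description string if Chrome answers, otherwise null.
// fetchImpl can be injected so the function can be exercised without a browser.
async function probeChrome(port, fetchImpl) {
  try {
    const doFetch = fetchImpl ?? fetch; // built-in fetch (Node.js 18+)
    const response = await doFetch(debuggerVersionUrl(port));
    if (!response.ok) return null;
    const info = await response.json();
    return info.Browser ?? null;
  } catch {
    return null; // for example, connection refused: nothing listens on the port
  }
}

// Example call, after starting Chrome with --remote-debugging-port=9222:
// probeChrome(9222).then((browser) => console.log(browser ?? "not reachable"));
```

If the probe returns null, Chrome was probably started without the flag or on a different port.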

-new(take a look): use when stuck

When using -p, the program first looks for a tab in your browser that already has the site open. If it hangs at "Configuring mirror operations...", it was unable to control that tab for an unknown reason. Use -new to open a new tab for the operation instead

-load(take a look): continue downloading after interruption

Mirroring can be interrupted unexpectedly or get stuck. By default, the next run starts from scratch, and resources that were already downloaded are downloaded again

Using -load lets the mirroring continue from where it was last interrupted

Whether or not -load is used, the archive file mirror-sites-record.json is generated in the output directory

The archive file format is as follows

  • failed: Files that failed to download; you can retry these manually
  • download: Files that were downloaded successfully
  • links: The links crawled from the site, each with a number representing its state
    • 0: Not yet mirrored
    • 1: Mirrored successfully
    • 2: Mirror failed; the link may be an API endpoint that eventually redirects
    • 3: Ignored; the address was crawled but not mirrored
    • -1: Third-party site, not mirrored
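
The fields above can be summarized with a few lines of JavaScript. This is only a sketch: the exact JSON layout is not documented here, so it assumes that failed and download are arrays of URLs and that links is an object mapping each URL to its state number. The names LINK_STATES and summarizeRecord are hypothetical.

```javascript
// Hypothetical sketch: the JSON layout below is an assumption.
// Assumed shape: { failed: string[], download: string[], links: { [url]: state } }
const LINK_STATES = {
  "0": "pending",
  "1": "mirrored",
  "2": "failed",
  "3": "ignored",
  "-1": "third-party",
};

// Counts links per state and collects the files worth retrying by hand.
function summarizeRecord(record) {
  const counts = {};
  for (const state of Object.values(record.links ?? {})) {
    const label = LINK_STATES[String(state)] ?? "unknown";
    counts[label] = (counts[label] ?? 0) + 1;
  }
  return {
    downloaded: (record.download ?? []).length,
    retry: [...(record.failed ?? [])],
    links: counts,
  };
}

// Made-up sample data using the assumed shape.
const exampleRecord = {
  failed: ["https://example.com/big.mp4"],
  download: ["https://example.com/", "https://example.com/css/index.css"],
  links: {
    "https://example.com/": 1,
    "https://example.com/about": 0,
    "https://example.com/api/login": 2,
    "https://cdn.example.net/lib.js": -1,
  },
};

console.log(summarizeRecord(exampleRecord));
```

For a real run, the record could be loaded with JSON.parse(fs.readFileSync(file, "utf8")) and passed to the same function, once the actual layout has been checked.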

-m: restrict url

A regular expression for the URLs within the site that should be mirrored. It is matched against the path part of each URL, and URLs that do not match are ignored. This is useful for multilingual sites whose URLs contain a language segment (such as /en/index.html), because it avoids collecting the same page once per language. To collect only the English pages, enter "-m \/en.*"
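
To see which URLs an expression would keep, the filtering can be approximated in JavaScript. This sketch assumes the expression is applied to the URL path with an ordinary, unanchored RegExp test. That is an assumption about the tool's behavior, and matchesRestriction is a hypothetical name.

```javascript
// Hypothetical sketch of the filtering described above; not the tool's code.
// Assumption: the pattern is tested, unanchored, against the URL's path.
function matchesRestriction(url, pattern) {
  const { pathname } = new URL(url);
  return new RegExp(pattern).test(pathname);
}

const restriction = "\\/en.*"; // the same expression as -m \/en.*
const candidates = [
  "https://example.com/en/index.html",
  "https://example.com/zh/index.html",
  "https://example.com/en/guide/intro.html",
];

// Keeps only the URLs whose path matches the expression.
console.log(candidates.filter((url) => matchesRestriction(url, restriction)));
```

Under this assumption, \/en.* would also match a path such as /entry.html. A stricter expression such as ^\/en\/ avoids that.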

-o: output directory

Defaults to the directory in which the command is executed, plus the site's domain name
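
As a rough illustration of that default, the following sketch joins the current working directory with the host name taken from the site URL. Whether the tool uses exactly the host name as the folder name is an assumption, and defaultOutputDir is a hypothetical name.

```javascript
const path = require("path");

// Hypothetical sketch: working directory + site domain name.
// Assumption: the folder name is the URL's host name, unchanged.
function defaultOutputDir(siteUrl, cwd = process.cwd()) {
  return path.join(cwd, new URL(siteUrl).hostname);
}

console.log(defaultOutputDir("https://vuejs.org/"));
```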

-t: download resource types

The types of resources to download. If omitted, the entire website is mirrored, and the downloaded resources are stored in folders that follow their download URLs, for example

/css/index.css
/js/index.js
/images
  /icons
    /icon.svg
  /logo
    /logo.png

All supported types are as follows

  • xhr: Includes XHR and fetch requests, typically JSON data returned by an API
  • document: HTML Page
  • stylesheet: CSS file
  • script: JS file
  • image: Image
  • media: Audio/Video
  • font: Font

Separate multiple types with '|'. For example, to download only images and audio/video, enter image|media

When types are specified, the downloaded resources are stored in folders named after their type, such as

/image
  /a.jpg
  /b.png
  /c.gif
/media
  /audio
    /a.mp3
    /b.wav
  /video
    /a.mp4
    /b.mov
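
The -t value described above is a '|'-separated list, which could be parsed and applied roughly as follows. This is a sketch of the idea, not the tool's code: parseTypes and shouldDownload are hypothetical names, and rejecting unknown types with an error is an assumption.

```javascript
// Hypothetical sketch; names and error handling are assumptions.
const SUPPORTED_TYPES = ["xhr", "document", "stylesheet", "script", "image", "media", "font"];

// Parses a -t value such as "image|media" into a Set of resource types.
// An empty or missing value means "no restriction" and returns null.
function parseTypes(value) {
  if (!value) return null;
  const types = value.split("|").map((t) => t.trim()).filter(Boolean);
  const unknown = types.filter((t) => !SUPPORTED_TYPES.includes(t));
  if (unknown.length > 0) {
    throw new Error(`unsupported resource type(s): ${unknown.join(", ")}`);
  }
  return new Set(types);
}

// Decides whether a resource of the given type should be downloaded.
function shouldDownload(resourceType, allowedTypes) {
  return allowedTypes === null || allowedTypes.has(resourceType);
}

const allowedTypes = parseTypes("image|media");
console.log(shouldDownload("image", allowedTypes));  // true
console.log(shouldDownload("script", allowedTypes)); // false
```

Because '|' is the pipe operator in most shells, the value usually needs to be quoted on the command line.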

-source: source code

With -source, the downloaded HTML keeps the original source code, and full URLs are not replaced with relative paths. This is useful when the site uses a framework such as Vue.js and its source code contains custom components

Benefits of not using -source

Full URLs in links are rewritten to relative paths

Suppose the mirrored website contains the following code

<script src="https://cdn.usefathom.com/script.js" />
<a href="https://vuejs.org/api/application.html#createapp" />

It will be converted to

<script src="/cdn.usefathom.com/script.js" />
<a href="/api/application.html#createapp" />

Without this rewriting, two problems arise:

  • A tag

Clicking an A tag jumps back to the source site, so after mirroring you would usually have to change every A tag link to a relative path by hand. If the source website already uses relative paths, no change is needed

  • Resources

Without rewriting, the page keeps loading its resources from the source site. If the source site changes a URL, or the content behind it, your mirrored site may break or produce errors

Therefore, it is generally necessary to combine this with -remote, which downloads third-party site resources locally (/cdn.usefathom.com/script.js)
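
The rewriting rule shown in the examples above can be sketched as a small function. It reproduces only the two documented cases (same-site URLs keep their path; third-party URLs get the host name as the first path segment). The tool's real rules may cover more cases, and toLocalPath is a hypothetical name.

```javascript
// Hypothetical sketch of the rewriting rule; not the tool's implementation.
function toLocalPath(resourceUrl, siteUrl) {
  const resource = new URL(resourceUrl, siteUrl); // also resolves relative URLs
  const site = new URL(siteUrl);
  const rest = resource.pathname + resource.search + resource.hash;
  // Same host: keep only the path. Other host: prefix the path with the host.
  return resource.host === site.host ? rest : `/${resource.host}${rest}`;
}

console.log(toLocalPath("https://cdn.usefathom.com/script.js", "https://vuejs.org/"));
// prints "/cdn.usefathom.com/script.js"
console.log(toLocalPath("https://vuejs.org/api/application.html#createapp", "https://vuejs.org/"));
// prints "/api/application.html#createapp"
```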

The benefits of using -source

Suppose the website is written in plain JavaScript without a build step, with code as follows

<script src="https://unpkg.com/vue@3/dist/vue.global.js"></script>

<div id="app">
  <h1>-source</h1>
  <my-component />
</div>

<script>
  const { createApp, ref } = Vue
  const MyComponent = {
    data() {
      return {
        count: 0,
      }
    },
    template: `<div @click="count++">count is {{ count }}</div>`,
  }

  const app = createApp({})
  app.component('my-component', MyComponent)
  app.mount('#app')
</script>

You can see that the source code contains a custom component, <my-component />

Without -source, the resource URL is replaced with /unpkg.com/vue@3/dist/vue.global.js, and the saved webpage code is the rendered page rather than the original source

In that case, <my-component /> becomes <div>count is 0</div> and the click event is lost, so the mirrored page no longer behaves like the original

Using -source to keep the original source code avoids this

You need to weigh manually replacing the URLs (with -source) against manually restoring the custom components (without -source)

If you know of a better solution that handles both cases, please open an issue for the author

1.0.11

9 months ago

1.0.1

10 months ago

1.0.0

10 months ago