mirror-sites v1.0.11
mirror-sites
Introduction
Simulate browser access to mirror sites or download resources
Usage
- installation:
npm install -g mirror-sites
- global command:
mirror-sites
- run with the appropriate arguments, the command downloads all network resources of the site
Arguments
-u(required): site url
The URL of the site to be mirrored, usually the homepage of the site, for example https://vuejs.org/
-remote(recommended): download third-party site resources
Whether to also download resources that the site references from third-party sites, such as CDNs
For example, the script referenced by <script src="https://cdn.usefathom.com/script.js" /> is downloaded to /cdn.usefathom.com/script.js, and the page code is rewritten to <script src="/cdn.usefathom.com/script.js" />. For more rewriting rules, see -source
-p(recommended): using your own Chrome
By default, the first time you use mirror-sites, the Chromium browser is downloaded automatically for mirroring operations
If you don't want to download a browser and would rather use your own Chrome for mirroring, start it from the command line with a remote debugging port (fill in the browser path according to your system and installation directory)
"C:\Program Files (x86)\Google\Chrome\Application\Chrome.exe" --remote-debugging-port=9222
After opening the browser as in the example above, set this parameter to the --remote-debugging-port value, 9222 in this example
For pages that require login, you can log in manually in that browser and then run the mirror again
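Before passing the port to -p, it can help to confirm that Chrome is actually listening on it. The sketch below queries Chrome's standard DevTools HTTP endpoint, /json/version. The function names are hypothetical, the host 127.0.0.1 is an assumption, and nothing here claims to be how mirror-sites itself connects.

```javascript
// Build the DevTools HTTP endpoint for a remote debugging port.
// The default host (127.0.0.1) is an assumption for a locally started Chrome.
function devtoolsVersionUrl(port, host = '127.0.0.1') {
  const n = Number(port);
  if (!Number.isInteger(n) || n < 1 || n > 65535) {
    throw new RangeError(`invalid port: ${port}`);
  }
  return `http://${host}:${n}/json/version`;
}

// Resolve to the browser's WebSocket debugger URL, or null if nothing answers.
// Uses the global fetch, available in Node 18 and later.
async function findDebuggableChrome(port) {
  try {
    const res = await fetch(devtoolsVersionUrl(port));
    if (!res.ok) return null;
    const info = await res.json();
    return info.webSocketDebuggerUrl ?? null;
  } catch {
    return null; // connection refused: no browser is listening on this port
  }
}
```

Calling findDebuggableChrome(9222) after starting Chrome as shown above should resolve to a ws:// URL; a null result suggests the browser was started without --remote-debugging-port or on a different port.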
-new(take a look): use when stuck
When using -p, the program first looks for a tab in your browser that already has the site open. If the program gets stuck at "Configuring mirror operations...", it is unable to operate your tab for unknown reasons. Use -new to open a new tab for the operation
-load(take a look): continue downloading after interruption
The mirroring process may be interrupted unexpectedly or get stuck. By default, the next run starts from scratch, and resources that were already downloaded are downloaded again
Using -load allows mirroring to continue from where it was last interrupted
Whether or not -load is used, the archive is generated at directory/mirror-sites-record.json
The archive file format is as follows
- failed: Files that failed to download; you can retry them manually
- download: Successfully downloaded files
- links: The links crawled from the site, each with a number representing its state
  - 0: Not yet mirrored
  - 1: Mirror successful
  - 2: Mirror failed; the link may be an API endpoint that eventually redirects
  - 3: Ignored; the address was crawled but not mirrored
  - -1: Third-party site, not mirrored
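As a rough illustration of how the archive could be inspected, the sketch below summarizes a record object. The exact JSON shape is not documented here, so it assumes that failed and download are arrays and that links maps each URL to its state number; the sample data and the summarizeRecord helper are hypothetical.

```javascript
// State numbers as described in the list above.
const LINK_STATES = {
  '0': 'not yet mirrored',
  '1': 'mirror successful',
  '2': 'mirror failed',
  '3': 'ignored',
  '-1': 'third-party site',
};

// Count failed files, downloaded files, and links per state.
// Assumes `links` is an object of URL -> state number (not confirmed by the docs).
function summarizeRecord(record) {
  const counts = {};
  for (const state of Object.values(record.links ?? {})) {
    const label = LINK_STATES[String(state)] ?? `unknown (${state})`;
    counts[label] = (counts[label] ?? 0) + 1;
  }
  return {
    failed: (record.failed ?? []).length,
    downloaded: (record.download ?? []).length,
    links: counts,
  };
}

// Hypothetical sample standing in for a parsed mirror-sites-record.json.
const sample = {
  failed: ['https://example.com/a.png'],
  download: ['https://example.com/', 'https://example.com/app.js'],
  links: {
    'https://example.com/': 1,
    'https://example.com/about': 0,
    'https://cdn.example.net/x.js': -1,
  },
};
console.log(summarizeRecord(sample));
```

With a real archive, the sample object would be replaced by the parsed contents of the JSON file, after checking that its shape matches these assumptions.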
-m: restrict url
A regular expression for the URLs to mirror within the site. URLs whose path part does not match the expression are ignored. This is very useful for filtering multilingual URLs that contain language information (such as /en/index.html), to avoid collecting the same page more than once. If you only want to collect English pages, enter "-m \/en.*"
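To preview which URLs a given expression would keep, the matching described above can be approximated as follows. This is a sketch that assumes the expression is tested, unanchored, against the URL's pathname; the shouldMirror helper is hypothetical and the tool's actual matching may differ.

```javascript
// Return true when the path part of `url` matches `pattern`.
// Assumption: the pattern is applied to the pathname only, without anchoring.
function shouldMirror(url, pattern) {
  const { pathname } = new URL(url);
  return new RegExp(pattern).test(pathname);
}

const pattern = '\\/en.*'; // the value passed as -m \/en.*

console.log(shouldMirror('https://example.com/en/index.html', pattern)); // true
console.log(shouldMirror('https://example.com/zh/index.html', pattern)); // false
console.log(shouldMirror('https://example.com/entry.html', pattern));    // true
```

The last case shows that an unanchored \/en.* also matches paths such as /entry.html; under the same assumption, a stricter expression like ^\/en\/ would limit matches to the /en/ directory.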
-o: output directory
Defaults to the directory where the command is executed, plus the site's domain name
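The default can be pictured with the small sketch below. It assumes that "domain name" means the URL's hostname and that the two parts are joined as a subdirectory; the defaultOutputDir helper is hypothetical.

```javascript
// Compute the assumed default output directory: <working directory>/<hostname>.
function defaultOutputDir(siteUrl, cwd = process.cwd()) {
  const { hostname } = new URL(siteUrl);
  const base = cwd.replace(/[\\/]+$/, ''); // drop any trailing separator
  return `${base}/${hostname}`;
}

console.log(defaultOutputDir('https://vuejs.org/', '/home/me/work'));
// /home/me/work/vuejs.org
```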
-t: download resource types
The types of resources to download. If this is omitted, the entire website is mirrored, and downloaded resources are stored in folders created according to their download address, for example
/css/index.css
/js/index.js
/images
  /icons
    /icon.svg
  /logo
    /logo.png
All supported types are as follows
- xhr: Includes XHR and fetch requests, typically JSON data returned by an API
- document: HTML Page
- stylesheet: CSS file
- script: JS file
- image: Image
- media: Audio/Video
- font: Font
Multiple types are separated by '|'. For example, if you only want to download images and audio/video, enter image|media
In that case, the downloaded resources are stored in folders created according to type, such as
/image
  /a.jpg
  /b.png
  /c.gif
/media
  /audio
    /a.mp3
    /b.wav
  /video
    /a.mp4
    /b.mov
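The '|'-separated value can be checked against the supported types listed above with a few lines of code. This is only an illustrative sketch; parseTypes is a hypothetical helper, and how mirror-sites validates the value internally is not documented here.

```javascript
// The resource types listed above.
const SUPPORTED_TYPES = ['xhr', 'document', 'stylesheet', 'script', 'image', 'media', 'font'];

// Parse a value such as "image|media" into a set of types.
// An empty or missing value is treated as "all types" (mirror everything).
function parseTypes(value) {
  if (!value) return new Set(SUPPORTED_TYPES);
  const types = value.split('|').map((t) => t.trim()).filter(Boolean);
  const unknown = types.filter((t) => !SUPPORTED_TYPES.includes(t));
  if (unknown.length > 0) {
    throw new Error(`unsupported type(s): ${unknown.join(', ')}`);
  }
  return new Set(types);
}

console.log([...parseTypes('image|media')]); // [ 'image', 'media' ]
```

In most shells an unquoted | starts a pipeline, so the value generally needs quoting on the command line, for example "image|media".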
-source: source code
The downloaded HTML keeps the original source code, and full URLs are not replaced with relative addresses. This is useful when a framework such as Vue.js is used and the source code contains custom components
Benefits of not using -source
Full URLs in links are rewritten to relative paths
Suppose the mirrored website contains the following code
<script src="https://cdn.usefathom.com/script.js" />
<a href="https://vuejs.org/api/application.html#createapp" />
It will be converted to
<script src="/cdn.usefathom.com/script.js" />
<a href="/api/application.html#createapp" />
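The conversion in this example can be expressed as a small function. This is a sketch inferred from the two examples only: same-site URLs keep their path, and third-party URLs get the host prepended as the first path segment. The toLocalPath helper is hypothetical, and the handling of query strings is an assumption.

```javascript
// Map a full URL to the local path implied by the examples above.
function toLocalPath(resourceUrl, siteUrl) {
  const resource = new URL(resourceUrl, siteUrl);
  const site = new URL(siteUrl);
  // Keeping the query string and fragment is an assumption.
  const rest = resource.pathname + resource.search + resource.hash;
  // Same site: plain relative path. Third party: /<host>/<path>.
  return resource.host === site.host ? rest : `/${resource.host}${rest}`;
}

const site = 'https://vuejs.org/';
console.log(toLocalPath('https://cdn.usefathom.com/script.js', site));
// /cdn.usefathom.com/script.js
console.log(toLocalPath('https://vuejs.org/api/application.html#createapp', site));
// /api/application.html#createapp
```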
- A tag
If links are not rewritten, clicking an A tag jumps back to the source site. You would then usually have to change all A tag links to relative paths by hand after mirroring the website. If the source website already uses relative paths, there is no need to modify them
- Resources
If links are not rewritten, resources are also loaded from the source site. If the source site changes a URL or the content behind it, your site will become unusable or encounter errors
Therefore, it is generally necessary to combine this with -remote to download third-party site resources locally (/cdn.usefathom.com/script.js)
The benefits of using -source
Suppose the website is written in plain JavaScript without a build step, with code as follows
<script src="https://unpkg.com/vue@3/dist/vue.global.js"></script>
<div id="app">
  <h1>-source</h1>
  <my-component />
</div>
<script>
  const { createApp, ref } = Vue
  const MyComponent = {
    data() {
      return {
        count: 0,
      }
    },
    template: `<div @click="count++">count is {{ count }}</div>`,
  }
  const app = createApp({})
  app.component('my-component', MyComponent);
  app.mount('#app')
</script>
You can see that the source code contains custom components such as <my-component />
When -source is not used, the script address is replaced with /unpkg.com/vue@3/dist/vue.global.js and the saved webpage code is the rendered page
At this point, <my-component /> will have become <div>count is 0</div> and the click event will be lost, so the mirror result will be very poor
In this case, using -source to keep the original source code is very useful
You need to weigh manually replacing URLs against manually restoring custom components
If there is a better solution that is compatible with both, please open an issue for the author