EnjinScraper

Scrapes an Enjin site via the Enjin API.

For support, please join the support Discord: https://discord.gg/2SfGAMskWt.

Usage

Warning: If you have two-factor authentication (2FA) enabled on your Enjin account, you must either disable it or create a temporary account without 2FA to run this tool!

To scrape all data the tool can scrape, the account must be a sitewide admin or owner account.

EnjinScraper will now do its best even if you can't provide an API key or a site admin account. At minimum, a regular site account is still needed. There are of course still some limits to this, but I've done my best to include as much as I can. A few minor things are still missing when it runs in this mode: forum images do not seem to be retrieved properly, and there is not yet a way to get application questions. These things are normally handled by the admin panel, but users trying to archive sites they don't own obviously don't have this luxury. Best of luck to everyone!

To use this mode, you will have to manually provide the module IDs of the forum, news, and wiki modules you wish to scrape. The module ID is the first number found in the URL when on any page of one of these modules. For example, in https://www.megacrafting.com/forum/m/4627724/viewthread/9364148-mejinxx-application the forum module ID is 4627724. Forum module IDs should be specified as an array of strings under manualForumModuleIDs.
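
As a rough illustration of the rule above, the following sketch pulls the first purely numeric path segment out of a module URL and places it in a config fragment. The helper name `extractModuleID` is hypothetical and not part of EnjinScraper; it only assumes that the module ID is the first all-digit path segment, as in the example URL.

```typescript
// Hypothetical helper, not part of EnjinScraper: returns the first purely
// numeric path segment of an Enjin module URL, or null if there is none.
function extractModuleID(url: string): string | null {
    // Drop the scheme and host so digits in the domain are never matched.
    const path = url.replace(/^[a-z][a-z0-9+.-]*:\/\/[^/]*/i, "");
    for (const segment of path.split("/")) {
        if (/^\d+$/.test(segment)) {
            return segment;
        }
    }
    return null;
}

const threadURL =
    "https://www.megacrafting.com/forum/m/4627724/viewthread/9364148-mejinxx-application";

// Fragment to merge into config.json; IDs are kept as strings, not numbers.
const forumID = extractModuleID(threadURL);
const configFragment = {
    manualForumModuleIDs: forumID === null ? [] : [forumID],
};

console.log(JSON.stringify(configFragment));
// prints {"manualForumModuleIDs":["4627724"]}
```

The thread ID (9364148) is skipped because its path segment also contains non-digit characters.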

Note: If you've already scraped with a site admin account and API key, this update does not provide any additional data.

Quick Run With NPX

Windows

Run the following in PowerShell:

mkdir EnjinScraper
cd EnjinScraper
winget install -e --id OpenJS.NodeJS
npx enjinscraper

Note that if rerunning later, you may need to use npx enjinscraper@latest to force use of the latest version.

Configuration

Obtaining an API key

Per Enjin's instructions:

To enable your API, visit your admin panel / settings / API area. The content on this page includes your base API URL, your secret API key, and the API mode. Ensure that the API mode is set to "Public".

Configuring the `config.json`

Optionally, create a config.json file in the root directory of the project. Otherwise, you will be prompted for required values on first run. The file should look like this, but with comments omitted:

{
    "apiKey": "someapiKey", // Required
    "domain": "www.example.com", // Required
    "email": "someemail@email.com", // Required
    "password": "somepassword", // Required
    "adminMode": true,
    "excludeHTMLModuleIDs": [
        "1000001",
        "1000002"
    ],
    "excludeForumModuleIDs": [],
    "excludeNewsModuleIDs": [],
    "excludeTicketModuleIDs": [],
    "excludedWikiModuleIDs": [],
    "manualForumModuleIDs": [],
    "manualNewsModuleIDs": [],
    "manualTicketModuleIDs": [],
    "manualWikiModuleIDs": [],
    "manualUserIDs": [],
    "disabledModules": {
        "html": false,
        "forums": {
            "postIPs": true
        },
        "galleries": false,
        "news": false,
        "wikis": false,
        "tickets": false,
        "applications": false,
        "comments": false,
        "users": {
            "ips": false,
            "tags": false,
            "fullinfo": true,
            "characters": true,
            "games": true,
            "photos": true,
            "wall": true,
            "yourFriends": true
        },
        "files": {
            "s3": false,
            "wiki": false,
            "avatars": true,
            "profileCovers": true,
            "gameBoxes": true,
            "userAlbums": true
        }
    },
    "retrySeconds": 5, // Setting to 0 will retry instantly
    "retryTimes": 5, // Setting to 0 will disable retries; setting to -1 will retry indefinitely
    "debug": true,
    "disableSSL": false,
    "overrideScrapeProgress": false // Setting this to true will ignore the scraper's internal progress tracking and scrape all non-disabled modules
}
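
To make the meaning of retrySeconds and retryTimes concrete, here is a minimal sketch of a retry loop with those semantics. It is not the scraper's actual implementation: `withRetries` and `RetryOptions` are hypothetical names, and the sketch assumes retryTimes counts retries made after the first attempt.

```typescript
// Illustrative only: one way the documented retry settings could behave.
// `withRetries` and `RetryOptions` are hypothetical, not EnjinScraper code.
interface RetryOptions {
    retrySeconds: number; // 0 retries instantly
    retryTimes: number;   // 0 disables retries, -1 retries indefinitely
}

async function withRetries<T>(
    task: () => Promise<T>,
    { retrySeconds, retryTimes }: RetryOptions
): Promise<T> {
    let retriesUsed = 0;
    for (;;) {
        try {
            return await task();
        } catch (error) {
            // Assumption: retryTimes counts retries after the first attempt.
            const mayRetry = retryTimes === -1 || retriesUsed < retryTimes;
            if (!mayRetry) {
                throw error;
            }
            retriesUsed++;
            if (retrySeconds > 0) {
                // Wait before the next attempt.
                await new Promise<void>((resolve) =>
                    setTimeout(resolve, retrySeconds * 1000)
                );
            }
        }
    }
}
```

Under these assumptions, the values shown in the config above (retrySeconds 5, retryTimes 5) would mean up to six attempts in total, with a five second pause between them.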

Note that data stemming from user profiles is disabled by default, as this can majorly extend the time needed to scrape sites with large member counts. You can of course change this in disabledModules.users.

You should use an account with the greatest possible permissions, as that will increase the amount of content that can be scraped. Given that, this tool is unfortunately most practical for those with backend access to the site to be scraped. When running with an admin account and API key, there is no need to enter module IDs, as the scraper will automatically gather info about all modules on the site.
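
Since the keys under disabledModules are "disabled" flags, setting one to false turns that part of the scrape on. The sketch below shows that idea on a trimmed-down config object; `enableAllUserData` and the `ScraperConfig` type are hypothetical and only model the fields shown in the example config.

```typescript
// Hypothetical types covering only the parts of config.json used here.
interface ScraperConfig {
    disabledModules: {
        users: Record<string, boolean>;
        [moduleName: string]: unknown;
    };
    [key: string]: unknown;
}

// Returns a copy of the config in which every flag under
// disabledModules.users is false, i.e. no user data is disabled.
function enableAllUserData(config: ScraperConfig): ScraperConfig {
    const users: Record<string, boolean> = {};
    for (const key of Object.keys(config.disabledModules.users)) {
        users[key] = false;
    }
    return {
        ...config,
        disabledModules: { ...config.disabledModules, users },
    };
}

// Trimmed-down version of the example config.
const defaults: ScraperConfig = {
    domain: "www.example.com",
    disabledModules: {
        html: false,
        users: {
            ips: false,
            tags: false,
            fullinfo: true,
            characters: true,
            games: true,
            photos: true,
            wall: true,
            yourFriends: true,
        },
    },
};

const updated = enableAllUserData(defaults);
console.log(JSON.stringify(updated.disabledModules.users));
```

Expect a noticeably longer scrape on sites with many members once these flags are false.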

Running Manually

git clone https://github.com/Kas-tle/EnjinScraper.git
cd EnjinScraper
yarn
npx ts-node index.ts

Outputs

The scraper will output an SQLite file at target/site.sqlite in the root directory of the project. For a more detailed database schema, see OUTPUTS.md. The database will contain the following tables:

scrapers: Contains information about what steps have been completed to gracefully resume scraping if needed.
module_categories: Enumerates the different categories modules can fall into
modules: Contains information about modules
presets: Contains information about presets, essentially a list of individual modules
pages: Contains information about modules in the context of the page they reside on
site_data: A table that stores various information about a website
html_modules: Contains the HTML, JavaScript, and CSS of HTML modules
forum_modules: Contains information about the forum modules that were scraped
forums: Contains information about the forums scraped from the forum modules
threads: Contains information about the threads scraped from the forums
posts: Contains information about the posts scraped from the forums
gallery_albums: Contains information about albums in a gallery, including their titles, descriptions, and images
gallery_images: Contains information about images in a gallery, including their titles, descriptions, and associated albums
gallery_tags: Contains information about tags in a gallery, including their locations and associated images and albums
wiki_pages: Contains information about pages in a wiki, including their content, access control settings, and metadata
wiki_revisions: Contains information about revisions to pages in a wiki, including their content, access control settings, and metadata
wiki_likes: Contains information about users who have liked pages in a wiki
wiki_categories: Contains information about categories in a wiki, including their titles and thumbnails
wiki_uploads: Contains information about uploaded files in a wiki
news_articles: Contains information about news articles scraped from the news modules
ticket_modules: Contains information about ticket modules
tickets: Contains information about tickets scraped from the ticket modules
ticket_replies: Contains information about replies made to support tickets
applications: Contains basic information about applications
application_sections: Contains sections from applications
application_questions: Contains questions from applications
application_responses: Contains individual responses for applications
comments: Contains information about comments on applications, news articles, wiki pages, and gallery images
users: Contains information about users
user_profiles: Contains information about user profiles, including their personal information, gaming IDs, and social media handles
user_games: Contains information about the games that a user has added to their profile
user_characters: Contains information about the characters that a user has added to their profile
user_albums: Contains information about the albums that a user has created
user_images: Contains information about the images that a user has uploaded
user_wall_posts: Contains information about wall posts made by users
user_wall_comments: Contains information about comments made on wall posts by users
user_wall_comment_likes: Contains information about users who have liked comments on wall posts
user_wall_post_likes: Contains information about users who have liked wall posts

All files scraped will be stored in the target/files directory, in the same directory as the config.json file. The directory structure simply follows the URL with the https:// prefix removed. For example, if the file is at https://www.example.com/somedir/file.png, it will be stored at target/files/www.example.com/somedir/file.png.
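
The mapping from URL to local path can be sketched as follows. `localPathForURL` is a hypothetical helper written for this README, not a function exported by the scraper, and it only covers the simple case described above.

```typescript
// Hypothetical helper mirroring the layout described above: strip the
// scheme and place the rest of the URL under the files directory.
function localPathForURL(url: string, filesDir: string = "target/files"): string {
    const withoutScheme = url.replace(/^https?:\/\//i, "");
    return `${filesDir}/${withoutScheme}`;
}

console.log(localPathForURL("https://www.example.com/somedir/file.png"));
// prints target/files/www.example.com/somedir/file.png
```

This can be handy for locating the local copy of a file whose URL appears in a database column.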

Files that are stored in Enjin's Amazon S3 instance for your site will be automatically downloaded and stored in the target/files directory. The files will be stored in the same directory structure as they are on the S3 instance. All information about these files will be stored in the s3_files table in the database. Examples of modules that store files here include galleries, forums, applications, tickets, and news posts.

Files from wiki pages will generally be found under target/files/s3.amazonaws.com/files.enjin.com/${siteID}/modules/wiki/${wikiPresetID}/file.png.

User avatars are also scraped, combining the URLs found in user_profiles.avatar, user_wall_comments.avatar, and user_wall_post_likes.avatar. These will generally be found under assets-cloud.enjin.com/users/${userID}/avatar/full.${fileID}.png. Note that these URLs are generally stored in the database with the size medium, but only the full-size image is downloaded.

Profile cover images come from user_profiles.cover_image and are found at either assets-cloud.enjin.com/users/${userID}/cover/${fileID}.png if the user has uploaded their own cover image, or resources.enjin.com/${resourceLocator}/themes/${version}/image/profile/cover/${category}/${fileName}.jpg if the user is using an Enjin-provided cover image.

Game boxes are the images displayed for games a user has on their profile. They are found in assets-cloud.enjin.com/gameboxes/${gameID}/boxmedium.jpg.

Lastly, user album images from user_images.url_original can be found in either s3.amazonaws.com/assets.enjin.com/users/${userID}/pics/original/${fileName} or assets.enjin.com/wall_embed_images/${fileName}.