Eliza Scraper

A powerful scraping and RAG (Retrieval-Augmented Generation) toolkit for collecting, processing, and serving blockchain ecosystem data. Built with Bun, TypeScript, and modern vector search capabilities.

Features

  • 🤖 Automated Twitter data scraping with engagement filtering
  • 📊 Real-time token price and market data tracking via CoinGecko
  • 📑 Documentation site scraping with caching
  • 🔄 Configurable cron jobs for automated data collection
  • 🔍 Vector search with time-decay relevance scoring
  • 🚀 REST API with Swagger documentation

Architecture

RAG Pipeline for Tweet Processing

graph TD
    A[Twitter] --> B[Apify Scraper]
    B --> C[Filters]
    C --> D[Text Processing]
    D --> E[OpenAI Embedding]
    E --> F[Qdrant Vector DB]
    F --> G[Search API]
    
    subgraph "filters"
    C --> H[Min Replies]
    C --> I[Min Retweets]
    C --> J[Search Tags]
    C --> K[Kaito Most Mindshare handles]
    end
    
    subgraph "Vector Search"
    F --> L[Semantic Search]
    end 
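
The core of this pipeline is the embedding-and-upsert step. Below is a minimal sketch of that step, not the package's internal code: it assumes the openai and @qdrant/js-client-rest npm packages, the text-embedding-3-small model, and the QDRANT_URL and QDRANT_COLLECTION_NAME settings from the Quick Start.

import OpenAI from "openai";
import { QdrantClient } from "@qdrant/js-client-rest";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const qdrant = new QdrantClient({ url: process.env.QDRANT_URL });

// Embed a filtered tweet's text and store the vector alongside its metadata.
// Qdrant point IDs must be unsigned integers or UUIDs, hence the numeric id.
async function indexTweet(tweet: { id: number; fullText: string; createdAt: string }) {
  const embedding = await openai.embeddings.create({
    model: "text-embedding-3-small", // assumed model; the package may use a different one
    input: tweet.fullText,
  });

  await qdrant.upsert(process.env.QDRANT_COLLECTION_NAME!, {
    wait: true,
    points: [
      {
        id: tweet.id,
        vector: embedding.data[0].embedding,
        payload: { fullText: tweet.fullText, createdAt: tweet.createdAt },
      },
    ],
  });
}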

System Architecture

graph TB
    CLI[CLI Interface] --> Services
    API[REST API] --> Services
    
    subgraph "Services"
    Tweets[Tweet Service]
    Tokens[Token Service]
    Docs[Site Service]
    end
    
    subgraph "Storage"
    Qdrant[(Qdrant Vector DB)]
    Postgres[(PostgreSQL)]
    end
    
    Services --> Storage

Scraper Quick Start

  1. Install globally:
npm install -g eliza-scraper
  2. Create a .env file:
# Required APIs
APIFY_API_KEY=your_apify_key
COINGECKO_API_KEY=your_coingecko_key
OPENAI_API_KEY=your_openai_key

# Database
DATABASE_URL=postgresql://user:password@localhost:5432/token_scraper

# Vector DB
QDRANT_URL=http://localhost:6333
QDRANT_COLLECTION_NAME=twitter_embeddings

# Optional
TWITTER_SEARCH_TAGS=berachain,bera
APP_TOPICS=berachain launch,berachain token
  3. Start the required services:
docker-compose up -d
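
The repository is expected to provide a docker-compose.yml for Qdrant and PostgreSQL. If you need to write your own, a minimal file matching the DATABASE_URL and QDRANT_URL defaults above could look like this (the image tags are assumptions):

services:
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: password
      POSTGRES_DB: token_scraper
    ports:
      - "5432:5432"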

CLI Commands

Start API Server

elizascraper server --port <number>

Tweet Scraping Service

elizascraper tweets-cron [options]

Options:
--port Port number (default: 3000)
--tags Comma-separated search tags; Twitter advanced-search strings are also accepted (see https://github.com/igorbrigadir/twitter-advanced-search)
--start Start date (YYYY-MM-DD)
--end End date (YYYY-MM-DD)
--min-replies Minimum replies threshold (default: 4)
--retweets Minimum retweets threshold (default: 2)
--max-items Maximum tweets to fetch (default: 200)
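
For example, to scrape a week of Berachain tweets with the default engagement thresholds:

elizascraper tweets-cron --tags berachain,bera --start 2024-01-01 --end 2024-01-07 --max-items 200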

Disclaimer: avoid running the cron job over a very small window. Using a start-to-end range of less than 3 days while fetching fewer than 50 tweets will quickly get your Apify key banned.

To help prevent bans, an identical search query run more than once within the same hour is skipped.

Tracking Latest Token Data

elizascraper token-cron [options]

Options:
-p, --port <number> Port to listen on (default: 3008)
-i, --interval <number> Interval in minutes (default: 60)
--currency <string> Currency to track (default: 'usd')
--cg_category <string> CoinGecko category (default: 'berachain-ecosystem')
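
For example, to refresh Berachain ecosystem prices every 30 minutes:

elizascraper token-cron --interval 30 --cg_category berachain-ecosystem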

Documentation Scraping Service

elizascraper site-cron [options]

Options:
-p, --port <number> Port to listen on (default: 3010)
-i, --interval <number> Interval in minutes (default: 60)
-l, --link <string> URL to scrape (default: 'https://docs.berachain.com')
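
For example, to re-scrape the Berachain docs every two hours:

elizascraper site-cron --link https://docs.berachain.com --interval 120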

API Documentation

Tweets API

GET /tweets/search

Search for tweets based on semantic similarity.

Query Parameters:

  • searchString (required): Text to search for semantically similar tweets

Response Schema:

{
  status: "success" | "error",
  results: {
    tweet: {
      id: string,
      text: string,
      fullText: string,
      author: {
        name: string,
        userName: string,
        profilePicture: string
      },
      createdAt: string,
      retweetCount: number,
      replyCount: number,
      likeCount: number
    },
    combinedScore: number,
    originalScore: number,
    date: string
  }[]
}
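
A minimal TypeScript client call (the server's port is configurable, so this sketch assumes port 3000):

const res = await fetch(
  "http://localhost:3000/tweets/search?searchString=" +
    encodeURIComponent("berachain token launch")
);
const body = await res.json();

if (body.status === "success") {
  for (const r of body.results) {
    console.log(r.tweet.author.userName, r.combinedScore, r.tweet.fullText);
  }
}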

GET /tweets/random

Get random tweets grouped by predefined topics (see the APP_TOPICS environment variable).

Response Schema:

{
  topic: string,
  tweet: {
    tweet: {
      id: string,
      text: string,
      fullText: string,
      author: {
        name: string,
        userName: string,
        profilePicture: string
      },
      createdAt: string,
      retweetCount: number,
      replyCount: number,
      likeCount: number
    },
    combinedScore: number,
    originalScore: number,
    date: string
  }[]
}

Tokens API

GET /tokens

Get token market data.

Query Parameters:

  • symbol (optional): Token symbol to filter results

Response Schema:

{
  status: "success" | "error",
  results: {
    id: string,
    symbol: string,
    name: string,
    image: string,
    currentPrice: number,
    marketCap: number,
    marketCapRank: number,
    fullyDilutedValuation: number,
    totalVolume: number,
    high24h: number,
    low24h: number,
    priceChange24h: number,
    priceChangePercentage24h: number,
    marketCapChange24h: number,
    marketCapChangePercentage24h: number,
    circulatingSupply: number,
    totalSupply: number,
    maxSupply: number,
    ath: number,
    athChangePercentage: number,
    athDate: string,
    atl: number,
    atlChangePercentage: number,
    atlDate: string,
    lastUpdated: string
  }
}
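
For example, querying a single token by symbol (bera is used purely as an illustration; the same port assumption as above applies):

const res = await fetch("http://localhost:3000/tokens?symbol=bera");
const body = await res.json();

if (body.status === "success") {
  console.log(body.results.name, body.results.currentPrice);
}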

Documentation API

GET /docs/scrape

Scrape and cache documentation from a given URL.

Query Parameters:

  • url (required): URL of the documentation site to scrape

Response Schema:

{
  status: "success" | "error",
  data: {
    title: string,
    last_updated: string,
    total_sections: number,
    sections: {
      topic: string,
      source_url: string,
      overview: string,
      subsections: {
        title: string,
        content: string
      }[]
    }[]
  },
  source: "cache" | "fresh"
} | {
  status: "error",
  message: string
}

Cache Behavior:

  • Documentation is cached for 7 days
  • Returns cached version if available and not expired
  • Automatically refreshes cache if data is older than 7 days
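
A sketch of calling the endpoint and distinguishing cached from fresh results (same local-port assumption as above):

const res = await fetch(
  "http://localhost:3000/docs/scrape?url=" +
    encodeURIComponent("https://docs.berachain.com")
);
const body = await res.json();

if (body.status === "success") {
  console.log(`${body.data.total_sections} sections (source: ${body.source})`);
}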

Error Handling

All routes use a standardized error response format:

{
  status: "error",
  code: string,
  message: string
}
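
Clients can narrow on this shape with a simple type guard; the field names follow the schema above:

interface ApiError {
  status: "error";
  code: string;
  message: string;
}

function isApiError(body: unknown): body is ApiError {
  return (
    typeof body === "object" &&
    body !== null &&
    (body as { status?: string }).status === "error"
  );
}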

Contributing

Contributions are welcome! Please read our contributing guidelines before submitting pull requests.

License

MIT