fuata v0.0.2

Who Should I Follow?

Have you ever wanted to know who the best people to follow online are, and why?

  • Who posts interesting content and who doesn't?
  • Who is "trending" and why?

I wonder this all the time. So I'm building fuata (working title) to scratch my own itch.

Data Model

I expect to store a couple of hundred (million) records in the database.

  • fullName
  • @username
  • dateJoined

Following

  • following @username
  • dateFollowed
  • dateUnfollowed

The same fields apply to followers.

Example data structure (nested objects are easier for data updates):

{
  "followers": {
    "u1" : ["timestamp"],
    "u2" : ["timestamp"]
  },
  "following": {
    "u3" : ["timestamp"],
    "u2" : ["timestamp", "timestamp2", "timestamp3"]
  }
}

This can be stored as a basic flat-file database where github-username.json would be the file.

The keys here are:

  • u: username (of the person the user is following or being followed by)
  • timestamp: the start date when the person first started following/being followed, and an end date when the person stopped following
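A minimal sketch of why the nested-object shape makes updates cheap: recording a new follow event is just a push onto that username's timestamp array. (`recordFollower` is a hypothetical helper name, not part of fuata.)

```javascript
// Sketch: update the nested follower structure described above.
// Appends a timestamp to the array for that username, creating the
// entry if this is the first time we've seen them.
function recordFollower(record, username, timestamp) {
  if (!record.followers[username]) {
    record.followers[username] = [];
  }
  record.followers[username].push(timestamp);
  return record;
}

var record = { followers: {}, following: {} };
recordFollower(record, 'u1', '2014-11-01T00:00:00Z');
recordFollower(record, 'u1', '2014-12-01T00:00:00Z'); // re-followed later
console.log(record.followers.u1.length); // 2
```

The same helper works for the `following` side by swapping which sub-object is passed in.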

In addition to creating a file per user, we should maintain an index of all the users we are tracking. The easiest way is a newline-separated list.

But... in the interest of being able to run this on Heroku (where you don't have persistent access to the filesystem, so no flat-file DB!) I'm going to use LevelDB for this. >> Use Files on DigitalOcean!

Tests

Check:

  • GitHub.com is accessible
  • a known GitHub user exists
  • known GH user has non-zero number of followers
  • known GH user is following a non-zero number of people

Scrape following/followers page:

  • Scrape first page
  • Check for 'next' page
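The scrape loop above might look like this, with a stubbed `fetchPage` standing in for the real scraper (all names here are hypothetical):

```javascript
// Sketch: collect followers across pages until there is no 'next' page.
// fetchPage is a stand-in for the real scraper/API call: it returns the
// items on a page plus the next page number (or null on the last page).
function crawlAllPages(fetchPage) {
  var results = [];
  var page = 1;
  while (page !== null) {
    var res = fetchPage(page);
    results = results.concat(res.items);
    page = res.next; // null when there is no 'next' link
  }
  return results;
}

// Stubbed two-page response for illustration:
function fakeFetchPage(page) {
  var pages = {
    1: { items: ['u1', 'u2'], next: 2 },
    2: { items: ['u3'], next: null }
  };
  return pages[page];
}

console.log(crawlAllPages(fakeFetchPage)); // [ 'u1', 'u2', 'u3' ]
```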

New user?

  • If a user doesn't exist in the database, create it.
  • Set lastupdated to now.

Read data from db/disk so we can update it.

  • If a user has previously been crawled there will be a record in db
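A sketch of that new-user / existing-user branch, with a plain in-memory map standing in for the database (hypothetical names):

```javascript
// Sketch: create the record if the user is new, otherwise reuse the
// stored one; either way stamp lastUpdated so the crawler can skip
// users it has already visited recently.
function upsertUser(db, username, now) {
  var record = db[username];
  if (!record) {
    record = { followers: {}, following: {}, lastUpdated: null };
    db[username] = record;
  }
  record.lastUpdated = now;
  return record;
}

var db = {};
upsertUser(db, 'alanshaw', '2014-11-01T00:00:00Z');
console.log(db.alanshaw.lastUpdated); // '2014-11-01T00:00:00Z'
```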

Backup the Data

LevelDB stores its data on the local filesystem, and Heroku's filesystem is ephemeral, so the data would be lost on every dyno restart. It makes sense to either pay for persistent storage or use files on a host with a durable disk!

Quantify Data Load

Each time we crawl a user's profile we add about 5 kB (on average) to the file.

(Sample so far: 73 files totalling 340 kB, ≈ 4.7 kB each.)

So crawling the full list of GitHub users (5 million) once would require 5,000,000 × 5 kB = 25 GB!

We might need to find a more efficient way of storing the data. SQL?
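The arithmetic above as a quick sanity check (the 5 million users and 5 kB-per-profile figures are the assumptions from this section):

```javascript
// Sketch: total storage for one full crawl, at ~5 kB per profile.
function storageEstimateGB(users, kbPerUser) {
  return (users * kbPerUser) / (1000 * 1000); // kB -> GB (decimal units)
}

console.log(storageEstimateGB(5000000, 5)); // 25
```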

Simple UI

  • Upload sketch

FAQ?

Q: How is this different from Klout?

A: Klout tries to calculate your social "influence". That's interesting, but useless for tracking makers.

Research

Must read up on http://en.wikipedia.org/wiki/Inverted_index so I understand how to use https://www.npmjs.org/package/level-inverted-index

Useful Links

GitHub Stats API

Example:

curl -v https://api.github.com/users/pgte/followers
[
  {
    "login": "methodmissing",
    "id": 379,
    "avatar_url": "https://avatars.githubusercontent.com/u/379?v=2",
    "gravatar_id": "",
    "url": "https://api.github.com/users/methodmissing",
    "html_url": "https://github.com/methodmissing",
    "followers_url": "https://api.github.com/users/methodmissing/followers",
    "following_url": "https://api.github.com/users/methodmissing/following{/other_user}",
    "gists_url": "https://api.github.com/users/methodmissing/gists{/gist_id}",
    "starred_url": "https://api.github.com/users/methodmissing/starred{/owner}{/repo}",
    "subscriptions_url": "https://api.github.com/users/methodmissing/subscriptions",
    "organizations_url": "https://api.github.com/users/methodmissing/orgs",
    "repos_url": "https://api.github.com/users/methodmissing/repos",
    "events_url": "https://api.github.com/users/methodmissing/events{/privacy}",
    "received_events_url": "https://api.github.com/users/methodmissing/received_events",
    "type": "User",
    "site_admin": false
  },

etc...]

Issues with using the GitHub API:

  • The API only returns 30 results per query.
  • X-RateLimit-Limit: 60. Unauthenticated clients can only make 60 requests per hour, i.e. 1,440 queries per day (60 per hour x 24 hours). That sounds ample on the surface, but if we assume the average person has at least 2 pages worth of followers (30+), on a single instance/server we can only track 720 people. Not really enough to do any sort of trend analysis. :disappointed: If we are tracking people with hundreds of followers (and growing fast), e.g. 300+ followers, fetching the complete follower list takes about 10 requests each, so the number of users we can track drops to 1440 / 10 = 144 people. We burn through 1,440 requests pretty quickly.
  • There's no guarantee which order the followers will be in (e.g. most recent first?)
  • Results are cached, so they are not real-time like they are on the web (seems daft, but it's true). Ideally GitHub would have a streaming API, but sadly it is built on Ruby on Rails, which is "RESTful" (not real-time).
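The rate-limit arithmetic in the list above can be sketched as follows (60 requests/hour and the page counts are the assumptions from this section):

```javascript
// Sketch: how many users can one server track per day, given the
// unauthenticated limit of 60 requests/hour and N pages per user?
function trackableUsersPerDay(requestsPerHour, pagesPerUser) {
  var requestsPerDay = requestsPerHour * 24; // 1440
  return Math.floor(requestsPerDay / pagesPerUser);
}

console.log(trackableUsersPerDay(60, 2));  // 720 (users with ~2 pages of followers)
console.log(trackableUsersPerDay(60, 10)); // 144 (users with ~300 followers)
```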

But...

Once we know who we should be following, we can use

curl -v https://api.github.com/users/pgte/following/visionmedia

Interesting Facts

  • GitHub has 3.4 Million users
  • yet the most followed person, Linus Torvalds, only has 19k followers (so it's a highly distributed network)

Profile Data to Scrape

example github profile

Interesting bits of info have blue squares drawn around them.

Basic Profile Details for TJ:

{
  followercount: 11000,
  stared: 1000,
  followingcount: 147,
  worksfor: 'Segment.io',
  location: 'Victoria, BC, Canada',
  fullname: 'TJ Holowaychuk',
  email: 'tj@vision-media.ca',
  url: 'http://tjholowaychuk.com',
  joined: '2008-09-18T22:37:28Z',
  avatar: 'https://avatars2.githubusercontent.com/u/25254?v=2&s=460',
  contribs: 3217,
  longest: 43,
  current: 0
}

Tasks

  • Add lastmodified checker for DB (avoid crawling more than once a day) >> db.lastUpdated
  • Save List of Users to DB
  • Check Max Crawler Concurrency
  • Experiment with Child Processes?
  • Record Profile (basics) History

Crawler Example:

var C = require('./src/index.js');

var user = 'alanshaw';

// Crawl a single user's profile; the callback gets (err, profile).
C.crawlUser(user, function (err, profile) {
  if (err) { return console.error(err); }
  console.log(profile);
});

Objective 1

  • Track who the best people to follow are
  • Track if I am already following a person

Twitter Followers