fuata v0.0.2
Who Should I Follow?
Have you ever wanted to know who the best people to follow online are, and why?
- Who posts interesting content and who doesn't?
- Who is "trending" and why?
I wonder about this all the time. So I'm building fuata (working title) to scratch my own itch.
Data Model
I expect to store a couple of hundred (million) records in the database.
- fullName
- @username
- dateJoined
Following
- following @username
- dateFollowed
- dateUnfollowed
The same fields apply to followers.
Example data structure: (nested Objects are easier for data updates)
{
"followers": {
"u1" : ["timestamp"],
"u2" : ["timestamp"]
},
"following": {
"u3" : ["timestamp"],
"u2" : ["timestamp", "timestamp2", "timestamp3"]
}
}
This can be stored as a basic flat-file database, where github-username.json is the file for each user.
The keys here are:
- u: username (of the person the user is following or being followed by)
- timestamp: the start date when the person first started following/being followed, then the end date when the person stopped following
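A minimal sketch of how a fresh crawl could be merged into this structure. It assumes the timestamps in each array alternate start, end, start, ... so that an odd-length array means the relationship is currently active. That reading, and the helper names `isActive` and `applySnapshot`, are illustrative assumptions rather than part of fuata.

```javascript
var hasOwn = Object.prototype.hasOwnProperty;

// Assumption: timestamps alternate start, end, start, ...
// so an odd number of entries means "currently following".
function isActive(timestamps) {
  return Array.isArray(timestamps) && timestamps.length % 2 === 1;
}

// Hypothetical helper: merge a freshly crawled list of usernames into
// record[kind], where kind is "followers" or "following".
function applySnapshot(record, kind, currentUsernames, timestamp) {
  if (!record[kind]) { record[kind] = {}; }
  var list = record[kind];
  var current = new Set(currentUsernames);

  // Anyone in the crawl who is not active gets a new start date.
  current.forEach(function (u) {
    if (!hasOwn.call(list, u)) { list[u] = []; }
    if (!isActive(list[u])) { list[u].push(timestamp); }
  });

  // Anyone active in the record but missing from the crawl gets an end date.
  Object.keys(list).forEach(function (u) {
    if (!current.has(u) && isActive(list[u])) { list[u].push(timestamp); }
  });

  return record;
}
```

With this scheme a user who follows, unfollows, and follows again ends up with three timestamps, like "u2" in the example above.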
In addition to creating a file per user, we should maintain an index of all the users we are tracking. The easiest way is to have a newline-separated list.
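As a rough sketch, the newline-separated index could be handled with two small string helpers. The names `parseIndex` and `addToIndex` are hypothetical, and reading or writing the actual file (for example with `fs.readFileSync`) is left out.

```javascript
// Turn the index text into an array of usernames, ignoring blank lines.
function parseIndex(text) {
  return text
    .split('\n')
    .map(function (line) { return line.trim(); })
    .filter(function (line) { return line.length > 0; });
}

// Return new index text with `username` appended, unless it is already listed.
function addToIndex(text, username) {
  var users = parseIndex(text);
  if (users.indexOf(username) === -1) {
    users.push(username);
  }
  return users.join('\n') + '\n';
}
```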
But... in the interest of being able to run this on Heroku
(where the filesystem is ephemeral, so no flat-file-db!)
I'm going to use LevelDB for this. >> Use Files on DigitalOcean!
Tests
Check:
- GitHub.com is accessible
- a known GitHub user exists
- known GH user has non-zero number of followers
- known GH user is following a non-zero number of people
Scrape following/followers page:
- Scrape first page
- Check for 'next' page
New user?
- If a user doesn't exist in the database, create it.
- Set the lastupdated time to now.
Read data from db/disk so we can update it.
- If a user has previously been crawled there will be a record in db
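The "new user" and "read existing record" checks above might look something like the sketch below. A plain `Map` stands in for the real store (LevelDB or files), and `loadOrCreateUser` plus the shape of the empty record are assumptions for illustration.

```javascript
// `db` is any Map-like store with get/set; `now` is a timestamp in
// milliseconds (e.g. Date.now()).
function loadOrCreateUser(db, username, now) {
  var record = db.get(username);
  var isNew = record === undefined;
  if (isNew) {
    // New user: create an empty record and stamp it with the current time.
    record = { followers: {}, following: {}, lastUpdated: now };
    db.set(username, record);
  }
  // Previously crawled users come back untouched so they can be updated.
  return { record: record, isNew: isNew };
}
```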
Backup the Data
LevelDB stores its data on the local disk, and on Heroku that disk is ephemeral, so it makes sense to either pay for persistence or use files!
### Quantify Data Load
Each time we crawl a user's profile we add 5 KB (on average) of data to the file,
so crawling the full list of GitHub users (5 million) once would require 5,000,000 * 5 KB = 25 GB!
We might need to find a more efficient way of storing the data. SQL?
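The estimate can be written as a small calculation, which makes it easy to try other numbers. It uses decimal units (1 GB = 1,000,000 KB), and the once-a-day-for-a-year figure is only an illustrative assumption.

```javascript
var KB_PER_GB = 1000 * 1000; // decimal units: 1 GB = 1,000,000 KB

// Storage needed for `userCount` users, each crawled `crawlsPerUser` times,
// with `kbPerCrawl` KB added per crawl.
function estimateStorageGB(userCount, kbPerCrawl, crawlsPerUser) {
  return (userCount * kbPerCrawl * crawlsPerUser) / KB_PER_GB;
}

console.log(estimateStorageGB(5000000, 5, 1));   // one full crawl: 25
console.log(estimateStorageGB(5000000, 5, 365)); // daily for a year: 9125
```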
Simple UI
- Upload sketch
FAQ?
Q: How is this different from Klout?
A: Klout tries to calculate your social "influence". That's interesting but useless for tracking makers.
Research
- Must read up about http://en.wikipedia.org/wiki/Inverted_index so I understand how to use: https://www.npmjs.org/package/level-inverted-index
- GitHub stats (node module): https://github.com/apiengine/ghstats (no tests or recent work/activity, but interesting functionality)
- Hard drive reliability stats: https://www.backblaze.com/blog/hard-drive-reliability-update-september-2014 (useful when selecting which drives to use in the storage array; the clear winner is the Hitachi 3TB)
- RAID explained in layman's terms: http://uk.pcmag.com/storage-devices-reviews/7917/feature/raid-levels-explained
- RAID Calculator: https://www.synology.com/en-global/support/RAID_calculator (if you don't already know how much space you get)
- SQLite limits: https://www.sqlite.org/limits.html
Useful Links
- Summary of Most Active GitHub users: http://git.io/top
- Intro to web-scraping with cheerio: https://www.digitalocean.com/community/tutorials/how-to-use-node-js-request-and-cheerio-to-set-up-simple-web-scraping
- GitHub background info: http://en.wikipedia.org/wiki/GitHub
GitHub Stats API
- GitHub Stats API: https://developer.github.com/v3/repos/statistics/
- GitHub Followers API: https://developer.github.com/v3/users/followers/
Example:
curl -v https://api.github.com/users/pgte/followers
[
{
"login": "methodmissing",
"id": 379,
"avatar_url": "https://avatars.githubusercontent.com/u/379?v=2",
"gravatar_id": "",
"url": "https://api.github.com/users/methodmissing",
"html_url": "https://github.com/methodmissing",
"followers_url": "https://api.github.com/users/methodmissing/followers",
"following_url": "https://api.github.com/users/methodmissing/following{/other_user}",
"gists_url": "https://api.github.com/users/methodmissing/gists{/gist_id}",
"starred_url": "https://api.github.com/users/methodmissing/starred{/owner}{/repo}",
"subscriptions_url": "https://api.github.com/users/methodmissing/subscriptions",
"organizations_url": "https://api.github.com/users/methodmissing/orgs",
"repos_url": "https://api.github.com/users/methodmissing/repos",
"events_url": "https://api.github.com/users/methodmissing/events{/privacy}",
"received_events_url": "https://api.github.com/users/methodmissing/received_events",
"type": "User",
"site_admin": false
},
etc...]
Issues with using the GitHub API:
- The API only returns 30 results per page by default.
- X-RateLimit-Limit: 60 (unauthenticated clients can only make 60 requests per hour). 1,440 requests per day (60 per hour x 24 hours) sounds ample on the surface. But if we assume the average person has at least 2 pages of followers (more than 30), a single instance/server can only track 720 people, which is not really enough for any sort of trend analysis. :disappointed: If we are tracking people with hundreds of followers (and growing fast), e.g. 300 followers, it takes 10 requests to fetch the complete list, so the number of users we can track comes down to 1440 / 10 = 144 people. We burn through 1,440 requests pretty quickly.
- There's no guarantee which order the followers will be in (e.g. most recent first?)
- Results are cached, so they are not real-time like they are on the web. (Seems daft, but it's true.) Ideally GitHub would have a streaming API, but sadly GitHub is built on Ruby on Rails and its API is "RESTful" (request/response, not real-time).
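The rate-limit arithmetic above can be sketched as a short calculation. The function names are hypothetical; the numbers are the ones from the list (1,440 requests per day, 30 results per page).

```javascript
// Requests needed to page through one user's full follower list.
function requestsPerUser(followerCount, perPage) {
  return Math.max(1, Math.ceil(followerCount / perPage));
}

// How many users a single instance can fully crawl per day.
function trackableUsers(requestsPerDay, followerCount, perPage) {
  return Math.floor(requestsPerDay / requestsPerUser(followerCount, perPage));
}

console.log(trackableUsers(1440, 60, 30));  // 2 pages each: 720
console.log(trackableUsers(1440, 300, 30)); // 10 pages each: 144
```

Note that this only counts follower pages; crawling the "following" list as well would roughly halve these figures.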
But...
Once we know who we should be following, we can use
- https://developer.github.com/v3/users/followers/#follow-a-user
- https://developer.github.com/v3/users/followers/#check-if-one-user-follows-another e.g:
curl -v https://api.github.com/users/pgte/following/visionmedia
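A rough sketch of the same check from Node. The v3 endpoint answers with status 204 when the first user follows the second and 404 when they do not, so the response body can be ignored. The helper names and the `'fuata'` User-Agent value are assumptions for illustration.

```javascript
var https = require('https');

// Map the endpoint's status code to a boolean.
function followsFromStatus(statusCode) {
  if (statusCode === 204) { return true; }   // user follows target
  if (statusCode === 404) { return false; }  // user does not follow target
  throw new Error('Unexpected status code: ' + statusCode);
}

// Hypothetical wrapper: calls back with (err, true|false).
function checkFollows(user, target, callback) {
  var options = {
    hostname: 'api.github.com',
    path: '/users/' + encodeURIComponent(user) +
          '/following/' + encodeURIComponent(target),
    headers: { 'User-Agent': 'fuata' } // the GitHub API requires a User-Agent
  };
  https.get(options, function (res) {
    res.resume(); // discard the body; only the status code matters
    var result;
    try {
      result = followsFromStatus(res.statusCode);
    } catch (e) {
      return callback(e);
    }
    callback(null, result);
  }).on('error', callback);
}
```

Each call still counts against the hourly rate limit, so it is best reserved for users we have already decided are worth following.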
Interesting Facts
- GitHub has 3.4 Million users
- Yet the most followed person, Linus Torvalds, only has 19k followers (so it's a highly distributed network)
Profile Data to Scrape
Interesting bits of info have blue squares drawn around them.
Basic Profile Details for TJ:
followercount: 11000,
stared: 1000,
followingcount: 147,
worksfor: 'Segment.io',
location: 'Victoria, BC, Canada',
fullname: 'TJ Holowaychuk',
email: 'tj@vision-media.ca',
url: 'http://tjholowaychuk.com',
joined: '2008-09-18T22:37:28Z',
avatar: 'https://avatars2.githubusercontent.com/u/25254?v=2&s=460',
contribs: 3217,
longest: 43,
current: 0
Tasks
- Add lastmodified checker for DB (avoid crawling more than once a day) >> db.lastUpdated
- Save List of Users to DB
- Check Max Crawler Concurrency
- Experiment with Child Processes?
- Record Profile (basics) History
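The first task in the list (the lastmodified checker) could be as simple as the sketch below. It assumes `lastUpdated` is stored on the user's record as a millisecond timestamp; `shouldCrawl` is a hypothetical name.

```javascript
var ONE_DAY_MS = 24 * 60 * 60 * 1000;

// True if the user has never been crawled, or was last crawled
// at least a day before `now` (both in milliseconds).
function shouldCrawl(record, now) {
  if (!record || typeof record.lastUpdated !== 'number') {
    return true;
  }
  return now - record.lastUpdated >= ONE_DAY_MS;
}
```

In practice this would be called with `Date.now()` before queueing a user for crawling.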
Crawler Example:
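var C = require('./src/index.js'); // the crawler module
var user = 'alanshaw';             // GitHub username to crawl
C.crawlUser(user, function (err, profile) {
  if (err) { return console.error(err); } // stop if the crawl failed
  console.log(profile); // basic profile details for the user
});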
Objective 1
- Track who the best people to follow are
- Track if I am already following a person
Twitter Followers