Dynamically Injecting Scraped GitHub Data in a Google Spreadsheet


September 30, 2016


At the new job, I started piecing together a Google Spreadsheet tracking our open-source software across various languages, frameworks, and example applications. One of the things we wanted to track for each repository was the count of open issues and pull requests we had at GitHub. It turns out Google makes this pretty easy, but you have to scrape the repo page first.

The code

I wrote a scraper in Python using Flask, Requests and BeautifulSoup, hosted the project on Heroku, and it works a little like this:

# app.py
from flask import Flask, jsonify
import requests
from bs4 import BeautifulSoup

app = Flask(__name__)

# listen to a URL like /github.com/GetStream/stream-python
# return a JSON block with both the count of issues and pull_requests
@app.route('/<string:site>/<string:account>/<string:repo>')
def scraper_nometadata(site, account, repo):
    return scraper(site, account, repo, None)

# listen to a URL like /github.com/GetStream/stream-js/issues or
# /github.com/GetStream/stream-node-orm/pull_requests
@app.route('/<string:site>/<string:account>/<string:repo>/<string:metadata>')
def scraper(site, account, repo, metadata):
    url = 'https://{site}/{account}/{repo}'.format(
        site=site,
        account=account,
        repo=repo,
    )

    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    # look for a CSS class called 'counter', it's only used three
    # times on github.com for counting issues, pull requests and
    # projects
    # I didn't test this on bitbucket or other sites yet
    counters = soup.select(".counter")
    counter_position = None

    if metadata:
        if metadata == 'issues':
            counter_position = 0
        elif metadata == 'pull_requests':
            counter_position = 1
        # only answer for metadata values we recognize; anything
        # else falls through to the full JSON response below
        if counter_position is not None:
            return jsonify(counters[counter_position].contents[0])

    return jsonify({
        'issues': counters[0].contents[0],
        'pull_requests': counters[1].contents[0],
    })
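The `.select(".counter")` call is doing the real work here. Here's a minimal, self-contained sketch of that step, using stand-in HTML (not GitHub's actual markup) to show the positional assumption the scraper makes: issues first, pull requests second.

```python
# sketch: extracting ".counter" badges with BeautifulSoup
# the HTML below is a stand-in for a fetched repo page
from bs4 import BeautifulSoup

html = """
<nav>
  <a href="/GetStream/stream-python/issues">Issues <span class="counter">12</span></a>
  <a href="/GetStream/stream-python/pulls">Pull requests <span class="counter">3</span></a>
  <a href="/GetStream/stream-python/projects">Projects <span class="counter">0</span></a>
</nav>
"""

soup = BeautifulSoup(html, "html.parser")
# select() returns matching tags in document order
counters = soup.select(".counter")

print({'issues': counters[0].contents[0],
       'pull_requests': counters[1].contents[0]})
# → {'issues': '12', 'pull_requests': '3'}
```

If GitHub ever reorders those badges, the positional lookup breaks silently, so this is the fragile part of the scraper to keep an eye on.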

The code took minutes to write. I spent about three times as long figuring out how to launch a Python app on Heroku, so here are some tips:

  1. You’ll need to upload a requirements.txt with your project. I recommend including gunicorn.

    # requirements.txt
    flask
    requests
    bs4
    gunicorn
    
  2. To launch on Heroku, you’ll need a file called “Procfile” to tell Heroku how to launch your application.

    # Procfile
    web: gunicorn app:app
    
  3. Before you push code to Heroku, you’ll need to run this at the command line:

    # launch a web instance for your app
    $ heroku ps:scale web=1
    

Always check the output of heroku logs (or heroku logs --tail to follow them live) if something doesn’t work.

Loading the Data Dynamically in a Google Spreadsheet

Once your Heroku service is launched, you can add this data to your Google Spreadsheet using the IMPORTDATA() function. It takes a URL as a parameter, and updates the data about once per hour.

There are ways to load the data more often, but that’s not something we need to worry about at Stream right now. The formula in your spreadsheet cell will look something like this:

=IMPORTDATA("http://your-app.herokuapp.com/github.com/GetStream/stream-rails")
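Since the service also exposes the per-metric routes (the /issues and /pull_requests endpoints shown earlier), you can pull a single number straight into a cell; the your-app.herokuapp.com hostname here is a placeholder for your own app:

```
=IMPORTDATA("http://your-app.herokuapp.com/github.com/GetStream/stream-js/issues")
=IMPORTDATA("http://your-app.herokuapp.com/github.com/GetStream/stream-node-orm/pull_requests")
```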