We here at Crunchbase are astonished at how much activity there is on Crunchbase, both on the web application and on the API. It is a good problem to have, but the increased traffic means that our poor server is increasingly unable to handle it all. Accordingly, we’re taking this opportunity to regroup and upgrade not only our hardware but also our Rails stack. Speaking of our stack, we welcome suggestions about our production environment from you, our faithful users. We have a lot in store for Crunchbase in the months and years to come, and getting our production environment right is key to the things we want to do (like continuing to expose it through our open API). Thank you for understanding; we’re aiming to have a permanent solution for API availability next week.

What kind of comments on your production environment are you looking for, exactly? Can you elaborate on how your systems are set up and configured?
A few off-the-cuff suggestions:
–Start producing dumps of the Crunchbase API data, which people can use as an alternative to one-off requests. Wikipedia, Wikia, and others do this.
–Put a proxy server, e.g. nginx, into your environment, and then serve most requests through it via memcached. So, for example, if someone requests a Crunchbase page that has not been updated since it was last cached, nginx would serve the page directly from the cache, skipping Rails (or whatever backend you use) altogether. When a page is updated, you expire its key (its URI) from the cache.
–Not sure how you are serving your images, but hopefully those don’t hit your backend either. If you use nginx, you can have it serve them directly. Or you could offload them to S3.
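A minimal sketch of the cache-by-URI idea in the second suggestion, assuming an in-memory dictionary as a stand-in for memcached and a hypothetical `render` callable as a stand-in for the backend. The names `PageCache`, `get`, and `expire` are illustrative only and are not part of nginx, memcached, or Rails.

```python
from typing import Callable, Dict


class PageCache:
    """In-memory stand-in for memcached: rendered pages keyed by URI."""

    def __init__(self, render: Callable[[str], str]) -> None:
        self._render = render            # hypothetical backend call (the slow path)
        self._store: Dict[str, str] = {}
        self.backend_hits = 0            # counts how often the backend was needed

    def get(self, uri: str) -> str:
        """Serve from cache when possible; fall through to the backend otherwise."""
        if uri not in self._store:
            self.backend_hits += 1
            self._store[uri] = self._render(uri)
        return self._store[uri]

    def expire(self, uri: str) -> None:
        """Call this when the page behind `uri` is updated."""
        self._store.pop(uri, None)


# Usage: two reads cost one backend hit; an update forces one re-render.
cache = PageCache(lambda uri: f"<html>{uri}</html>")
cache.get("/company/example")
cache.get("/company/example")
cache.expire("/company/example")
cache.get("/company/example")
```

In a real deployment the proxy would do the cache lookup itself, so a cached request would never reach the application process at all.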
1. This is our long-term solution.
2. We do something like this, and we do use nginx, but I’d also like to use memcached. Good suggestion.
3. We serve images from the filesystem, so no backend hit.
We are moving forward with nginx/Ruby Enterprise Edition/Phusion Passenger, and we plan to do some kind of throttling on the API, and provide an amended database dump to alleviate API load.
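The reply above does not say what kind of throttling is planned. One common approach is a token bucket per API client, sketched here under that assumption. The class name, the `rate` and `capacity` parameters, and the injectable `clock` are all hypothetical choices made for illustration.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(
        self,
        rate: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock              # injectable so the limiter can be tested
        self._tokens = float(capacity)   # start with a full bucket
        self._last = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it should be rejected."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never exceeding capacity.
        self._tokens = min(float(self.capacity), self._tokens + elapsed * self.rate)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

A server would keep one bucket per API key or client address and answer with an HTTP 429 status whenever `allow()` returns False.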
Agree with Gabriel
For larger clients and heavier loads on the API system, complete system dumps with a regular feed of deltas would help. This would also address concerns a developer could have about going all the way to Crunchbase to serve a page.
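As a rough illustration of the dump-plus-deltas idea, the sketch below assumes each dump is a mapping from a record key (for example a company permalink) to that record's data. The delta format with `added`, `changed`, and `removed` entries is an assumption made for this example, not an existing Crunchbase format.

```python
from typing import Any, Dict

Snapshot = Dict[str, Dict[str, Any]]


def compute_delta(old: Snapshot, new: Snapshot) -> Dict[str, Any]:
    """Describe what changed between two full dumps."""
    return {
        "added": {k: new[k] for k in new.keys() - old.keys()},
        "changed": {k: new[k] for k in new.keys() & old.keys() if new[k] != old[k]},
        "removed": sorted(old.keys() - new.keys()),
    }


def apply_delta(snapshot: Snapshot, delta: Dict[str, Any]) -> Snapshot:
    """Bring a locally stored dump up to date without downloading it again."""
    removed = set(delta["removed"])
    updated = {k: v for k, v in snapshot.items() if k not in removed}
    updated.update(delta["added"])
    updated.update(delta["changed"])
    return updated
```

A client would download the full dump once, then apply each published delta in order, so it could serve pages from its own copy of the data.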
The Crunchbase API is under load partly because the API design requires that some queries return huge chunks of data.
Our experience of using Crunchbase is that it is missing key things that most other APIs implement. Crunchbase might in fact be under less load if it implemented these features and functions.
Here are a few things that Crunchbase really needs:
1: gzip-compressed HTTP responses - without this, Crunchbase returns huge quantities of data, which is likely to be placing significant load on your systems. Getting a list of companies returns megabytes of text data, which would be much smaller if it were compressed.
2: sorted responses - it would be nice if the API were able to return, for example, a list of companies sorted by company name or any other relevant column. Currently, if you request a list of companies, they come back in an order that is not alphabetical, and there is no way to ask for sorting.
3: paging of data using COUNT and OFFSET - currently, getting a list of companies from Crunchbase requires getting a list of EVERY company in your database, all 24,000 or whatever it is. Again, this might be a contributing factor. We’d rather be able to get COUNT results at a time, and I’m sure your systems would prefer to return only COUNT results rather than thousands. Pretty much every other API out there implements some form of paging.
Implementing the above will reduce load on your systems.
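The three requests above can be sketched together as follows. This is only an illustration under stated assumptions: the function names, the `sort_by` parameter, and the record fields are hypothetical, and `count` and `offset` simply mirror the COUNT and OFFSET wording in the list.

```python
import gzip
import json
from typing import Any, Dict, List


def list_companies(
    companies: List[Dict[str, Any]],
    sort_by: str = "name",
    count: int = 50,
    offset: int = 0,
) -> List[Dict[str, Any]]:
    """Return one page of companies, ordered by the requested column."""
    ordered = sorted(companies, key=lambda company: company[sort_by])
    return ordered[offset:offset + count]


def gzip_json(payload: Any) -> bytes:
    """Serialize a response body to JSON and gzip it for transfer."""
    return gzip.compress(json.dumps(payload).encode("utf-8"))


# Usage: the second page of 2 results, sorted by name, then compressed.
companies = [{"name": n} for n in ["Zeta", "Alpha", "Mid", "Beta", "Gamma"]]
page = list_companies(companies, sort_by="name", count=2, offset=2)
body = gzip_json(page)
```

A real server would send the compressed body only to clients whose request includes `Accept-Encoding: gzip`, and would mark the response with `Content-Encoding: gzip`.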
Thanks
If I were implementing an API for Crunchbase, I’d probably use a database that is able to return query responses in XML form (all modern databases do this). Ideally the database would also be able to return responses to SQL queries in JSON form (can any current database do this?).
You could then have an extremely simple set of PHP pages that translate Crunchbase API requests into database API requests and then fling the result back to the client, without having to do any processing or modification of the result at all.
This would be extremely simple and extremely fast. There would be almost no complex transformation of data required.
Ideally your API would return data in both XML and JSON form. Again, choosing the right database will make this easy or even trivial.
At risk of starting a flamewar, I think Rails probably isn’t a good choice for implementing a high-performance API. Rails is designed for building generic web applications. For an API you need simplicity and performance - put some very fast application server (PHP, .NET, Java, hell even fastcgi/C++) in front of the database server and pump out tens of thousands of results per second. This would also be MUCH simpler to implement than trying to compensate for Rails performance problems by using memcached etc. Rails isn’t really a fit for implementing a simple API.
When you use Rails for APIs you end up with performance problems so severe that you have to take the API offline.
It’s hard to believe that such a small database and simple API could need more than a single server even to handle 10,000 requests per second.
How many requests per second was it getting before you had to take it offline?
Duh!
I’ve just realised that the API offline notification was from July and it is now September. Perhaps the most recent post on the blog should not be “Crunchbase offline”.
So how come our Crunchbase queries are returning 404s?
Hi CB,
Please let us know when the CB API will be back; it’s a great resource.
Thanks,
Gary