API Scraping News

These are the news items I've curated in my monitoring of the API space that have some relevance to the API definition conversation and that I wanted to include in my research. I'm using all of these links to better understand how the space is testing their APIs, going beyond just monitoring and understanding the details of each request and response.

No More Scraping Of Banking Data In Europe According to PSD2, Only APIs

Part of my partnership with centers around me investing more time into studying the banking industry, starting with the rollout of PSD2 in Europe next month. I’ll be working through each aspect of the regulations for the banking industry when it comes to APIs, but I wanted to highlight a recent change regarding scraping that is pretty monumental. In a recent press release, the European Commission further clarified its guidance for third party payment services providers (TPPs) on whether or not they can still scrape data from banks, instead of using the APIs being mandated by the commission.

Here is the section from the press release specifically addressing “what data can TPPs access and use via screen scraping”:

PSD2 prohibits TPPs from accessing any other data from the customer payment account beyond those explicitly authorised by the customer. Customers will have to agree on the access, use and processing of these data. With these new rules, it will no longer be allowed to access the customer’s data through the use of the techniques of “screen scraping”. Screen scraping means accessing the data through the customer interface with the use of the customer’s security credentials. Through screen scraping, TPPs can access customer data without any further identification vis-à-vis the banks. Banks will have to put in place a communication channel that allows TPPs to access the data that they need in accordance with PSD2. The channel will also be used to enable banks and TPPs to identify each other when accessing these data. It will also allow them to communicate through secure messaging at all times. Banks may establish this communication channel by adapting their customer online banking interface. They may also create a new dedicated interface that will include all necessary information for the relevant payment service providers. The RTS specifies the contingency safeguards that banks shall put in place if they decide to develop a dedicated interface. This will ensure fair competition and business continuity for TPPs.

Banks will have to provide APIs for aggregators to access data. Aggregators will not be allowed to scrape, and are being forced to use the APIs. While there will be a roll-out period, and I’m sure there will still be bad actors on both sides of the equation, it is the groundwork for a much more sensible and secure approach to providing access to banking customers' data–cleaning up the current mess. It is an important step for the banking sector, as well as a significant precedent for the API space when it comes to requiring API access to users' data, allowing them to take advantage of valuable 3rd party services.

I’m seeing hints of similar language out of the CFPB regarding banking in the United States, but we are still years behind this kind of regulation. While I would like to be optimistic that the EU regulations will have an impact on the US when it comes to banks who do business overseas, I’m not holding my breath. Where I’m going to be placing bets is on forward thinking banks like Capital One leading the charge when it comes to access to customer data via APIs because it makes sense, not because the government is mandating it. I’m not a big fan of government dictating that industries do APIs, I’m more about companies doing APIs because they make sense.

Challenges When Aggregating Data Published Across Many Years

My partner in crime is working on a large data aggregation project regarding ed-tech funding. She is publishing data to Google Sheets, and I’m helping her develop Jekyll templates she can fork and expand using Github when it comes to publishing and telling stories around this data across her network of sites. Like API Evangelist, Hack Education runs as a network of Github repositories, with a common template across them–we call the overlap between API Evangelist, Contrafabulists.

One of the smaller projects she is working on as part of her ed-tech funding research involves pulling the grants made by the Gates Foundation since the 1990s. It is similar to my story a couple of weeks ago about my friend David Kernohan, who wanted to pull data from multiple sources and aggregate it into a single, workable project. Audrey is looking to pull data from a single source, but because the data spans almost 20 years–it ends up being a lot like aggregating data from across multiple sources.

A couple of the challenges she is facing trying to gather the data, and aggregate as a common dataset are:

  • PDF - The enemy of any open data advocate is the PDF, and a portion of her research data is only available in PDF format, which translates into a good deal of manual work.
  • Search - Other portions of the data are available via the web, but obfuscated behind search forms requiring many different searches to occur, with paginated results to navigate.
  • Scraping - The lack of APIs, CSV, XML, and other machine readable results raises the bar when it comes to aggregating and normalizing data across many years, making scraping a consideration, but because of PDFs, and obfuscated HTML pages behind a search, even scraping will have a significant cost.
  • Format - Even once you’ve aggregated data from across the many sources, there is a challenge with it being in different formats. Some years are broken down by topic, while others are geographically based. All of this requires a significant amount of overhead to normalize and bring into focus.
  • Manual - Ultimately Audrey has a lot of work ahead of her, manually pulling PDFs and performing searches, then copying and pasting data locally. Then she’ll have to roll up her sleeves to normalize all the data she has aggregated into a single, coherent vision of where the foundation has put its money.

Data research takes time, and is tedious, mind numbing work. I encounter many projects like hers where I have to make a decision between scraping or manually aggregating and normalizing data–each project will have its own pros and cons. I wish I could help, but it sounds like it will end up being a significant amount of manual labor to establish a coherent set of data in Google Sheets. Once she is done, though, she has all the tools in place to publish as YAML to Github, and get to work telling stories around the data across her work using Jekyll and Liquid. I’m also helping her make sure she has a JSON representation of each of her data projects, allowing others to build on top of her hard work.
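As a rough sketch of that normalization step, assuming each batch of data has been exported as CSV and that different years label the same fields differently, something like the following could map every row onto one schema and emit JSON. The column names, aliases, and sample rows here are made up for illustration and are not taken from the actual grants data.

```python
import csv
import io
import json

# Hypothetical aliases: each year's export may label the same field differently.
COLUMN_ALIASES = {
    "grantee": ["grantee", "recipient", "organization"],
    "amount": ["amount", "award", "total"],
    "year": ["year", "fiscal_year"],
}


def normalize_row(row):
    """Map one raw row onto the common schema, whatever its column names were."""
    lowered = {key.strip().lower(): value.strip() for key, value in row.items()}
    record = {}
    for field, aliases in COLUMN_ALIASES.items():
        # Take the first alias that is present in this row, or None if missing.
        record[field] = next((lowered[a] for a in aliases if a in lowered), None)
    if record["amount"] is not None:
        # Strip currency formatting such as "$1,200" before converting.
        record["amount"] = int(record["amount"].replace("$", "").replace(",", ""))
    return record


def csv_to_records(csv_text):
    """Parse CSV text and return a list of normalized records."""
    return [normalize_row(row) for row in csv.DictReader(io.StringIO(csv_text))]


# Two made-up exports with different layouts, standing in for different years.
early_years = 'Recipient,Award,Fiscal_Year\nExample School,"$1,200",1999\n'
later_years = "grantee,amount,year\nExample College,5000,2015\n"

records = csv_to_records(early_years) + csv_to_records(later_years)
print(json.dumps(records, indent=2))
```

The alias table is the part that grows with each new layout you encounter; the rest stays the same, which keeps the manual work focused on understanding the data rather than rewriting the script.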

I wish all companies, organizations, institutions, and agencies would think about how they publish their data publicly. It’s easy to think that data stewards will have ill intentions when it comes to publishing data in a variety of formats like they do, but more likely it is just a change of stewardship when it comes to managing and publishing the data. Different folks will have different visions of what sharing data on the web needs to look like, and have different tools available to them, and without a clear strategy you’ll end up with a mosaic of published data over the years. Which is why I’m telling her story. I am hoping to possibly influence one or two data stewards, or would-be data stewards when it comes to the importance of pausing for a moment and thinking through your strategy for standardizing how you store and publish your data online.

Bringing The API Deployment Landscape Into Focus

I am finally getting the time to invest more into the rest of my API industry guides, which involves deep dives into core areas of my research like API definitions, design, and now deployment. The outline for my API deployment research has begun to come into focus and looks like it will rival my API management research in size.

With this release, I am looking to help onboard some of my less technical readers with API deployment. Not the technical details, but the big picture, so I wanted to start with some simple questions, to help prime the discussion around API deployment.

  • Where? - Where are APIs being deployed? On-premise, and in the cloud. Traditional website hosting, and even containerized and serverless API deployment.
  • How? - What technologies are being used to deploy APIs? From using spreadsheets, document and file stores, or the central database. Also thinking smaller with microservices, containers, and serverless.
  • Who? - Who will be doing the deployment? Of course, IT and developers groups will be leading the charge, but increasingly business users are leveraging new solutions to play a significant role in how APIs are deployed.

The Role Of API Definitions

While not every deployment will be auto-generated using an API definition like OpenAPI, API definitions are increasingly playing a lead role as the contract that doesn’t just deploy an API, but sets the stage for API documentation, testing, monitoring, and a number of other stops along the API lifecycle. I want to make sure to point out in my API deployment research that API definitions aren’t just overlapping with deploying APIs, they are essential to connect API deployments with the rest of the API lifecycle.
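To make the contract idea a little more concrete, here is a minimal sketch of one definition feeding two stops along the lifecycle. The definition is a hand-written, hypothetical OpenAPI-style fragment (the /grants paths and summaries are invented), and the helper functions are illustrative only; they are not part of any OpenAPI tooling.

```python
# A tiny, hypothetical OpenAPI-style definition expressed as a Python dict.
definition = {
    "openapi": "3.0.0",
    "info": {"title": "Example API", "version": "1.0.0"},
    "paths": {
        "/grants": {
            "get": {"summary": "List grants"},
            "post": {"summary": "Add a grant"},
        },
        "/grants/{id}": {
            "get": {"summary": "Get one grant"},
        },
    },
}


def list_operations(spec):
    """Return (METHOD, path, summary) for every operation in the definition."""
    operations = []
    for path, methods in spec["paths"].items():
        for method, details in methods.items():
            operations.append((method.upper(), path, details.get("summary", "")))
    return operations


def render_docs(spec):
    """Documentation stop: a plain-text listing of what the API offers."""
    lines = [f"{spec['info']['title']} {spec['info']['version']}"]
    lines += [f"{m} {p} - {s}" for m, p, s in list_operations(spec)]
    return "\n".join(lines)


def monitoring_targets(spec):
    """Testing and monitoring stop: the read-only endpoints worth polling."""
    return [path for method, path, _ in list_operations(spec) if method == "GET"]


print(render_docs(definition))
print(monitoring_targets(definition))
```

Both outputs come from the same dictionary, so a change to the definition shows up in the documentation and the monitoring list at the same time.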

Using Open Source Frameworks

Early on in this research guide I am focusing on the most common way for developers to deploy an API, using an open source API framework. This is how I deploy my APIs, and there are an increasing number of open source API frameworks available out there, in a variety of programming languages. In this round I am taking the time to highlight at least six separate frameworks in the top programming languages where I am seeing sustained deployment of APIs using a framework. I don’t take a stance on any single API framework, but I do keep an eye on which ones are still active, and enjoying usage by developers.

Deployment In The Cloud

After frameworks, I am making sure to highlight some of the leading approaches to deploying APIs in the cloud, going beyond just a server and framework, and leveraging the next generation of API deployment service providers. I want to make sure that both developers and business users know that there are a growing number of service providers who are willing to assist with deployment, and with some of them, no coding is even necessary. While I still like hand-rolling my APIs using my preferred framework, when it comes to some simpler, more utility APIs, I prefer offloading the heavy lifting to a cloud service, saving me the time of getting my hands dirty.

Essential Ingredients for Deployment

Whether in the cloud, on-premise, or even on device and even the network, there are some essential ingredients to deploying APIs. In my API deployment guide I wanted to make sure and spend some time focusing on the essential ingredients every API provider will have to think about.

  • Compute - The base ingredient for any API, providing the compute under the hood. Whether it's bare metal, cloud instances, or serverless, you will need a consistent compute strategy to deploy APIs at any scale.
  • Storage - Next, I want to make sure my readers are thinking about a comprehensive storage strategy that spans all API operations, and hopefully multiple locations and providers.
  • DNS - Then I spend some time focusing on the frontline of API deployment–DNS. In today's online environment DNS is more than just addressing for APIs, it is also security.
  • Encryption - I also make sure encryption is baked in to all API deployment by default, both in transit and in storage.

Some Of The Motivations Behind Deploying APIs

In previous API deployment guides I usually just listed the services, tools, and other resources I had been aggregating as part of my monitoring of the API space. Slowly I have begun to organize these into a variety of buckets that help speak to many of the motivations I encounter when it comes to deploying APIs. While not a perfect way to look at API deployment, it helps me think about the many reasons people are deploying APIs, craft a narrative, and provide a guide for others to follow that is potentially aligned with their own motivations.

  • Geographic - Thinking about the increasing pressure to deploy APIs in specific geographic regions, leveraging the expansion of the leading cloud providers.
  • Virtualization - Considering the fact that not all APIs are meant for production and there is a lot to be learned when it comes to mocking and virtualizing APIs.
  • Data - Looking at the simplest of Create, Read, Update, and Delete (CRUD) APIs, and how data is being made more accessible by deploying APIs.
  • Database - Also looking at how APIs are being deployed from relational, NoSQL, and other data sources–providing the most common way for APIs to be deployed.
  • Spreadsheet - I wanted to make sure not to overlook the ability to deploy APIs directly from a spreadsheet, putting APIs within reach of business users.
  • Search - Looking at how document and content stores are being indexed and made searchable, browsable, and accessible using APIs.
  • Scraping - Another often overlooked way of deploying an API, from the scraped content of other sites–an approach that is alive and well.
  • Proxy - Evolving beyond early gateways, using a proxy is still a valid way to deploy an API from existing services.
  • Rogue - I also wanted to think more about some of the rogue API deployments I’ve seen out there, where passionate developers reverse engineer mobile apps to deploy a rogue API.
  • Microservices - Microservices has provided an interesting motivation for deploying APIs–one that potentially can provide small, very useful and focused API deployments.
  • Containers - One of the evolutions in compute that has helped drive the microservices conversation is the containerization of everything, something that complements the world of APIs very well.
  • Serverless - Augmenting the microservices and container conversation, serverless is motivating many to think differently about how APIs are being deployed.
  • Real Time - Thinking briefly about real time approaches to APIs, something I will be expanding on in future releases, and thinking more about HTTP/2 and evented approaches to API deployment.
  • Devices - Considering how APIs are being deployed on device, when it comes to Internet of Things, industrial deployments, as well as even at the network level.
  • Marketplaces - Thinking about the role API marketplaces like Mashape (now RapidAPI) play in the decision to deploy APIs, and how other cloud providers like AWS, Google, and Azure will play in this discussion.
  • Webhooks - Thinking of API deployment as a two way street. Adding webhooks into the discussion and making sure we are thinking about how webhooks can alleviate the load on APIs, and push data and content to external locations.
  • Orchestration - Considering the impact of continuous integration and deployment on API deployment specifically, and looking at it through the lens of the API lifecycle.

I feel like API deployment is still all over the place. The mandate for API management was much better articulated by API service providers like Mashery, 3Scale, and Apigee. Nobody has taken the lead when it comes to API deployment. Service providers like DreamFactory and Restlet have kicked ass when it comes to not just API management, but making sure API deployment was also part of the puzzle. Newer API service providers like Tyk are also pushing the envelope, but I still don’t have the number of API deployment providers I’d like, when it comes to referring my readers. It isn’t a coincidence that DreamFactory, Restlet, and Tyk are API Evangelist partners, it is because they have the services I want to be able to recommend to my readers.

This is the first time I have felt like my API deployment research has been in any sort of focus. I carved this layer of my research out of my API management research some years ago, but I really couldn’t articulate it very well beyond just open source frameworks, and the emerging cloud service providers. After I publish this edition of my API deployment guide I’m going to spend some time in the 17 areas of my research listed above. All these areas are heavily focused on API deployment, but I also think they are all worth looking at individually, so that I can better understand where they also intersect with other areas like management, testing, monitoring, security, and other stops along the API lifecycle.

I Am Using Kimono Labs To Fill In The Gap For Companies Who Do Not Have An RSS Feed For Their Blog

I am tracking on around 2500 companies who are doing interesting things in the API space. Out of these companies about 1000 of them have blogs, which for me is a pretty important signal. About 1/4 of these companies with blogs, do not have an RSS feed, which in 2014 seems a little odd to me, but maybe I'm an old timer.

I believe that a blog is one of the most important signals any API provider can put out, right alongside Twitter and Github. Historically I have depended on the Twitter account for these RSS-less blogs, but now I'm taking a different path, and using Kimono Labs to fill in the gap.

Using the Kimono Labs Chrome extension, I just visit the main blog page for one of these companies, and select the title of the first blog post. Kimono gives me the ability to name this field, which I usually just call “title”. Since the title is a link, Kimono also associates the link to the blog post with each title. You could also highlight the summary text, but I don’t need this, as I have a secondary process that runs and pulls the full content of a blog post, as well as the timestamp and author, and takes a screenshot.

Within a couple of seconds, using Kimono Labs, I now have an API for each company's blog, assuming the role RSS would normally play. When I have time each week, I will generate an API for the most important companies I'm tracking on. Maybe someday I will close the entire blog syndication gap for the companies I track on, but for now it's nice to be able to tackle the most high value companies, and know that I have a viable solution to the problem with Kimono Labs.
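For anyone who would rather script the same idea, here is a minimal sketch of pulling title and link pairs from a blog index page using only the Python standard library. This is not how Kimono works internally; it simply assumes, for illustration, that each post title is a link inside an h2 element, and the sample HTML and example.com URLs are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TitleLinkParser(HTMLParser):
    """Collect the text and href of every link found inside an <h2> element."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.posts = []
        self._in_heading = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_heading = True
        elif tag == "a" and self._in_heading:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside a title link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.posts.append({
                "title": "".join(self._text).strip(),
                # Resolve relative links against the blog's base URL.
                "link": urljoin(self.base_url, self._href),
            })
            self._href = None
        elif tag == "h2":
            self._in_heading = False


# Made-up blog index page; a real one would be fetched over HTTP first.
sample_html = """
<html><body>
  <h2><a href="/blog/first-post">First Post</a></h2>
  <p>Summary text we do not need. <a href="/about">About</a></p>
  <h2><a href="https://example.com/blog/second-post">Second Post</a></h2>
</body></html>
"""

parser = TitleLinkParser("https://example.com/blog/")
parser.feed(sample_html)
posts = parser.posts
print(posts)
```

The selector assumption (titles live in h2 links) is the fragile part, and it would need adjusting per site, which is exactly the work a point-and-click tool takes off your hands.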

Legitimizing Scraping As A Data Source For APIs

Harvesting or scraping content and data from other public web sources has been something many do, but few will talk publicly about. While scraping does infringe on copyright in some situations, in many other situations, it is quickly becoming a legitimate way to acquire content or data, for use behind an API.

There is a lot of content available online, where the current stewards do not have the control, resources, or interest in making content available in a machine readable format. In these scenarios, for many developers, if you want the content, you just write a scrape script, and liberate it from the site, to be used as you wish.

In the last year, we have seen a new breed of API service providers emerge, who assist users in deploying APIs from data and content that is liberated through harvesting or scraping. For the first time, I'm seeing enough of these new tools and services that I'm going to break scraping out into its own area, and make it part of this API deployment white paper.

Building Blocks Of API Deployment

As I continue my research into the world of API deployment, I'm trying to distill the services and tooling I come across down into what I consider to be a common set of building blocks. My goal with identifying API deployment building blocks is to provide a simple list of what the moving parts are, that enable API providers to successfully deploy their services.

Some of these building blocks overlap with other core areas of my research like design, and management, but I hope this list captures the basic building blocks of what anyone needs to know, to be able to follow the world of API deployment. While this post is meant for a wider audience, beyond just developers, I think it provides a good reminder for developers as well, and can help things come into focus. (I know it does for me!)

Also there is some overlap between some of these building blocks, like API Gateway and API Proxy, both doing very similar things, but labeled differently. Identifying building blocks can be very difficult for me, and I'm constantly shifting definitions around until I find a comfortable fit--so some of these will evolve, especially with the speed at which things are moving in 2014.

CSV to API - Text files that contain comma-separated values, or CSVs, are one of the quickest ways to convert existing data to an API. Each row of a CSV can be imported and converted to a record in a database, and a RESTful interface that represents the data stored in the CSV can then be generated easily. CSV to API can be very messy depending on the quality of the data in the CSV, but can be a quick way to breathe new life into old catalogs of data lying around on servers or even desktops. The easiest way to deal with a CSV is to import it directly into a database, then generate the API from the database, but the process can also be done at the time of API creation.
Database to API - Database to API is definitely the quickest way to generate an API. If you have valuable data, generally in 2013, it will reside in a Microsoft, MySQL, PostgreSQL or other common database platform. Connecting to a database and generating a CRUD (create, read, update and delete) API on existing data makes sense for a lot of reasons. This is the quickest way to open up product catalogs, public directories, blogs, calendars or any other commonly stored data. APIs are rapidly replacing database connections. When bundled with common API management techniques, APIs can allow for much more versatile and secure access that can be made public and shared outside the firewall.
Framework - There is no reason to hand-craft an API from scratch these days. There are numerous frameworks out there that are designed for rapidly deploying web APIs. Deploying APIs using a framework is only an option when you have the necessary technical and developer talent to be able to understand the setup of the environment and follow the design patterns of each framework. When it comes to planning the deployment of an API using a framework, it is best to select one of the common frameworks written in the preferred language of the available developer and IT resources. Frameworks can be used to deploy data APIs from CSVs and databases, content from documents or custom code resources that allow access to more complex objects.
API Gateway - API gateways are enterprise quality solutions that are designed to expose API resources. Gateways are meant to provide a complete solution for exposing internal systems and connecting with external platforms. API gateways are often used to proxy and mediate existing API deployments, but may also provide solutions for connecting to other internal systems like databases, FTP, messaging and other common resources. While many public APIs are exposed using frameworks, most enterprise APIs are deployed via API gateways--supporting much larger deployments.
API Proxy - API proxies are commonplace for taking an existing API interface and running it through an intermediary, which allows for translations, transformations and other added services on top of the API. An API proxy does not deploy an API, but can take existing resources like SOAP and XML-RPC and transform them into more common RESTful APIs with JSON formats. Proxies provide other functions such as service composition, rate limiting, filtering and securing of API endpoints. API gateways are the preferred approach for the enterprise, and the companies that provide services support larger API deployments.
API Connector - Contrary to an API proxy, there are API solutions that are proxyless, simply allowing an API to connect or plug in to the advanced API resources. While proxies work in many situations, allowing APIs to be mediated and transformed into required interfaces, API connectors may be preferred in situations where data should not be routed through proxy machines. API connector solutions only connect to existing API implementations, and are easily integrated with existing API frameworks as well as web servers like Nginx.
Hosting - Hosting is all about where you are going to park your API. Usual deployments are on-premise within your company or data center, in a public cloud like Amazon Web Services, or a hybrid of the two. Most of the existing service providers in the space support all types of hosting, but some companies who have the required technical talent host their own API platforms. With HTTP being the transport that modern web APIs put to use, sharing the same infrastructure as web sites, hosting APIs does not take any additional skills or resources if you already have a web site or application hosting environment.
API Versioning - There are many different approaches to managing different versions of web APIs. When embarking on API deployment you will have to make a decision about how each endpoint will be versioned and maintained. Each API service provider offers versioning solutions, but generally it is handled within the API URI or passed as an HTTP header. Versioning is an inevitable part of the API life-cycle and is better integrated by design, as opposed to waiting until you are forced to make an evolution in your API interface.
Documentation - API documentation is an essential building block for all API endpoints. Quality, up to date documentation is essential for on-boarding developers and ensuring they successfully integrate with an API. Documentation needs to be derived from quality API designs, kept up to date and made accessible to developers via a portal. There are several tools available for automatically generating documentation and even what is called interactive documentation, which allows developers to make live calls against an API while exploring the documentation. API documentation is part of every API deployment.
Code Samples - Second to documentation, code samples in a variety of programming languages are essential to a successful API integration. With quality API design, generating samples that can be used across multiple API resources is possible. Many of the emerging API service providers and the same tools that generate API documentation from JSON definitions can also auto generate code samples that can be used by developers. Generation of code samples in a variety of programming languages is a requirement during API deployment.
Scraping - Harvesting or scraping of data from an existing website, content or data source. While we all would like content and data sources to be machine readable, sometimes you just have to get your hands dirty and scrape it. I don't support scraping of content in all scenarios and business sectors, but in the right situations scraping can provide a perfectly acceptable content or data source for deploying an API.
Container - The new virtualization movement, led by Docker, and supported by Amazon, Google, Red Hat, Microsoft, and many more, is providing new ways to package up APIs, and deploy them as small, modular, virtualized containers.
Github - Github provides a simple, but powerful way to support API deployment, allowing for publishing of a developer portal, documentation, code libraries, TOS, and all your supporting API business building blocks that are necessary for an API effort. At a minimum Github should be used to manage public code libraries, and engage with API consumers using Github's social features.
Terms of Use / Service - Terms of Use provide a legal framework for developers to operate within. They set the stage for the business development relationships that will occur within an API ecosystem. TOS should protect the API owner's company, assets and brand, but should also provide assurances for developers who are building businesses on top of an API. Make sure an API's TOS passes inspection with the lawyers, but also strikes a healthy balance within the ecosystem and fosters innovation.
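The CSV to API and Database to API building blocks above can be sketched in a few lines. This is a minimal illustration using Python's built-in csv and sqlite3 modules: it imports a CSV into an in-memory database, then answers read requests as JSON. The function names (load_csv, handle_get), the hospitals table, and the sample rows are all hypothetical, and a real deployment would sit behind a web framework and handle the create, update and delete side as well.

```python
import csv
import io
import json
import sqlite3


def load_csv(conn, table, csv_text):
    """Create a table from the CSV header and insert each row as a record.

    Assumes trusted input: table and column names are interpolated into SQL,
    and the CSV has no column already named "id".
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" (id INTEGER PRIMARY KEY, {columns})')
    names = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        f'INSERT INTO "{table}" ({names}) VALUES ({placeholders})', reader
    )


def handle_get(conn, table, record_id=None):
    """Return (status, JSON body) for GET /table or GET /table/<id>."""
    if record_id is None:
        rows = conn.execute(f'SELECT * FROM "{table}" ORDER BY id').fetchall()
        return 200, json.dumps([dict(row) for row in rows])
    row = conn.execute(
        f'SELECT * FROM "{table}" WHERE id = ?', (record_id,)
    ).fetchone()
    if row is None:
        return 404, json.dumps({"error": "not found"})
    return 200, json.dumps(dict(row))


# Made-up CSV standing in for an old catalog of data lying around on a server.
sample_csv = "name,city\nExample Hospital,Springfield\nSample Clinic,Shelbyville\n"

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets each row be converted to a dict
load_csv(conn, "hospitals", sample_csv)

print(handle_get(conn, "hospitals"))
print(handle_get(conn, "hospitals", 2))
print(handle_get(conn, "hospitals", 99))
```

Storing every column as text keeps the import forgiving of messy data, at the cost of pushing type cleanup to a later step.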

If there are any features, services or tools you depend on when deploying your APIs, please let me know at @kinlane. I'm not trying to create an exhaustive list, I just want to get an idea of what is available across the providers, and where the gaps potentially are.

I feel like I'm finally getting a handle on the building blocks for API design, deployment, and management, and understanding the overlap in the different areas. I will revisit my design and management building blocks, and evolve my ideas of what my perfect API editor would look like, and how this fits in with API management infrastructure from 3Scale, and even API integration.

Disclosure: 3Scale is an API Evangelist partner.

The Black, White And Gray of Web Scraping

There are many reasons for wanting to scrape data or content from a public website. I think these reasons can be easily represented as different shades of gray, the darker the grey being considered less legal, and the lighter the grey more legal you could consider it. You with me?

An example of darker grey would be scraping classified ad listings from craigslist for use on your own site. An example of lighter grey could be pulling a listing of veterans hospitals from the Department of Veterans Affairs website for use in a mobile app that supports veterans. One is corporate owned data, and the other is public data. The motives for wanting either set of data would potentially be radically different, and the restrictions on each set of data would be different as well.

Many opponents of scraping don't see the shades of grey, they just see people taking data and content that isn't theirs. Proponents of scraping will have an array of opinions ranging from, if it is on the web, it should be available to everyone, to people who only would scrape openly licensed or public data, and stay away from anything proprietary.

Scraping of data is never a black and white issue. I’m not blindly supporting scraping in any situation, but I'm a proponent of sensible approaches to harvesting of valuable information, development of open source tools, as well as services that assist users in scraping.

The Role Of Scraping In API Deployment

Scraping has been something I’ve done since I first started working on the web. Sometimes you just need some data or a piece of content that isn't available in a machine readable format, and the only way to get it is to scrape it off a web page.

Scraping is widespread, but something very few individuals or companies will admit to doing. Just like writing scripts for pulling data from APIs, I write a lot of scripts that pull content and data from websites and RSS feeds. Even though I tend to write my own scripts for scraping, I’ve been closely watching the new breed of scraping tools like Scraperwiki:


ScraperWiki is a web-based platform for collaboratively building programs to extract and analyze public (online) data, in a wiki-like fashion. "Scraper" refers to screen scrapers, programs that extract data from websites. "Wiki" means that any user with programming experience can create or edit such programs for extracting new data, or for analyzing existing datasets. The main use of the website is providing a place for programmers and journalists to collaborate on analyzing public data

I was first attracted to Scraperwiki as a way to harvest Tweets, and further interested by their web and PDF extraction tools. Scraperwiki has already been around for a while, founded back in 2010, and recently there is a new wave of scraping tools that have emerged:

Importio turns the web into a database, releasing the vast potential of data trapped in websites. Allowing you to identify a website, select the data and treat it as a table in your database. In effect transform the data into a row and column format. You can then add more websites to your data set, the same as adding more rows and query in real-time to access the data.


Kimono is a way to turn websites into structured APIs from your browser in seconds. You don’t need to write any code or install any software to extract data with Kimono. The easiest way to use Kimono is to add our bookmarklet to your browser’s bookmark bar. Then go to the website you want to get data from and click the bookmarklet. Select the data you want and Kimono does the rest.

Kimono and the other new services provide scraping tools that let anyone, even non-developers, scrape content from web pages, and they also allow you to deploy an API from that content. While it is easy to deploy APIs using data and content from the other scraping providers I track on, the new breed of scraping services focuses on API deployment as the end goal.

At API Strategy & Practice in Amsterdam, the final panel of the event was called “toward 1 million APIs”, and scraping came up as one possible way that we will get to APIs at this scale. Sometimes the stewards or owners of data just don’t have the resources to deploy APIs, and the only way to deploy an API will be to scrape the data and content and publish it as a web API--either internally or externally by a 3rd party.

I have a research site set up to keep track of scraping news I come across, as well as any companies and tools I discover. Beyond ScraperWiki, Kimono and the other services above, I’m watching these additional scraping services:


AlchemyAPI

The product of over 50 person years of engineering effort, AlchemyAPI is a text mining platform providing the most comprehensive set of semantic analysis capabilities in the natural language processing field. Used over 3 billion times every month, AlchemyAPI enables customers to perform large-scale social media monitoring, target advertisements more effectively, track influencers and sentiment within the media, automate content aggregation and recommendation, make more accurate stock trading decisions, enhance business and government intelligence systems, and create smarter applications and services.

Common Crawl

Common Crawl is a non-profit foundation dedicated to providing an open repository of web crawl data that can be accessed and analyzed by everyone. Common Crawl Foundation is a California 501(c)(3) registered non-profit founded by Gil Elbaz with the goal of democratizing access to web information by producing and maintaining an open repository of web crawl data that is universally accessible and analyzable.


Convextra

Convextra allows you to collect valuable data from the internet and represents it in easy-to-use CSV format for further utilization.


Page Munch

Page Munch is a simple API that allows you to turn webpages into rich, structured JSON. Easily extract photos, videos, event, author and other metadata from any page on the internet in milliseconds.


PromptCloud

PromptCloud operates on a “Data as a Service” (DaaS) model and deals with large-scale data crawl and extraction, using cutting-edge technologies and cloud computing solutions (Nutch, Hadoop, Lucene, Cassandra, etc). Its proprietary software employs machine learning techniques to extract meaningful information from the web in the desired format. These data could be from reviews, blogs, product catalogs, social sites, travel data—basically anything and everything on the WWW. It’s a customized solution rather than simply a mass-data crawler, so you only get the data you wish to see. The solution provides both deep crawl and refresh crawl of the web pages in a structured format.


Scrapinghub

Scrapinghub is a company that provides web crawling solutions, including a platform for running crawlers, a tool for building scrapers visually, data feed providers (DaaS) and a consulting team to help startups and enterprises build and maintain their web crawling infrastructures.

Screen Scraper

Copying text from a web page. Clicking links. Entering data into forms and submitting. Iterating through search results pages. Downloading files (PDF, MS Word, images, etc.).

Web Scrape Master

Scrape the web without writing code for it, to create value from the sea of data being published over the web. Data is currency. Web Scrape Master provides a very simple API for retrieving scraped data.

If you know of scraping services I don't have listed, or scraping tools that aren't included in my research, please let me know. I think scraping services that get it right will continue to play a vital role in API deployment and getting us to 1M APIs.

16 Areas Of My Core API Research

When I first started API Evangelist, I wanted to better understand the business of APIs, which really focused on API management. Over the course of four years, the list of companies delivering API management services has expanded with new entrants, and evolved with acquisitions of some of the major players. Visit my API management research site for more news, companies, tools and analysis from this part of API operations.

API Management

In 2011, people would always ask me which API management company would help you with deployment. For a while, the answer was none of them, but this soon changed with new players emerging, and an expansion of services from existing providers. To help me understand the expanding API lifecycle I started two new separate research areas:

API Design
API Deployment

Once you design and deploy your API, and you get a management plan in place, you have to start looking at how you are going to make money and get the word out. In an effort to better understand the business of APIs, I set up research sites for researching the monetization and evangelizing of APIs:

API Evangelism
API Monetization

As the API space has matured, I started seeing that I would have to pay better attention to not just providing APIs, but also better track on the consumption of APIs. With this in mind I started looking into how APIs are discovered and what services and tools developers are using when integrating with APIs:

API Discovery
API Integration

While shifting my gaze to what developers are up to when building applications using APIs, I couldn’t help but notice the rapidly expanding world of backend as a service, often called BaaS:

Backend as a Service (BaaS)

As I watch the space, I carefully tag interesting news, companies, services and tools that have an API focus, or influence the API sector in any way. Four areas I think are the fastest growing, and hold the most potential are:


In 2013, I also saw a conversation grow around an approach to designing APIs called hypermedia. Last year things moved beyond just academic discussion around designing hypermedia APIs, to actual rubber meeting the road with companies deploying well designed hypermedia APIs. In January I decided that it was time to regularly track on what is going on in the hypermedia conversation:

Hypermedia APIs

After that there are three areas that come up regularly in monitoring of the space, and pieces of the API puzzle that I don’t just think are important to API providers and consumers, they are areas I actively engage in:

Single Page Apps (SPA)

There are other research projects I have going on, but these areas reflect the core of my API monitoring and research. They each live as separate Github repositories, accessible through Github pages. I publish news, companies, tools, and my analysis on a regular basis. I use them for my own education, and I hope you can find them useful as well.


API Providers Guide - API Deployment

Prepared By Kin Lane

July 2014

Table of Contents

  • Overview of API Deployment
  • API Deployment Building Blocks
  • Tools for API Deployment
  • Cloud API Deployment Platforms
  • Using API Gateways for Deployment
  • Legitimizing Scraping As A Data Source For APIs
  • From Deployment to Management

Web Harvesting to API with

I had a demo of a new data extraction service today. The service allows you to harvest or scrape data from websites and then output it in machine readable formats like JSON. This is very similar to Needlebase, a popular scraping tool that was acquired and then shut down by Google early in 2012, except I’d say it represents a simpler, yet at the same time more sophisticated, approach to harvesting and publishing web data than Needlebase.

Using the service, you can target the web pages where the content you wish to harvest resides, define the rows of data, label them and associate them with columns in the table where the system will ultimately put your data, then extract the data, complete with the querying, filtering, pagination and other aspects of browsing the web you will need to get at all the data you desire.
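The row and column idea described above can be sketched in a few lines of Python. This is only an illustration using the standard library, not how the service itself works: the `TableRowParser` class, the `extract_records` function, the column labels and the sample HTML are all made up for the example.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # Header rows only contain <th> cells, so they end up empty here.
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_records(html, columns):
    """Return one dict per table row, keyed by the column labels you chose."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    # Skip rows whose cell count does not match the labels (e.g. spacer rows).
    return [dict(zip(columns, row)) for row in parser.rows if len(row) == len(columns)]


# Hypothetical page fragment, standing in for a real web page.
SAMPLE_HTML = """
<table>
  <tr><th>Team</th><th>Wins</th></tr>
  <tr><td>Lions</td><td>10</td></tr>
  <tr><td>Bears</td><td>7</td></tr>
</table>
"""

records = extract_records(SAMPLE_HTML, ["team", "wins"])
print(records)
```

Real pages are rarely this tidy, so a production scraper would also need to deal with pagination, nested tables and malformed markup, which this sketch ignores.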

After defining the data that will be extracted and how it will be stored, you can stop and use the data as is, or you can set up a more ongoing, real-time connection with the data you are harvesting. Using connectors, you can pull the data regularly, identify when it changes, merge from multiple sources and remix data as needed.
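One simple way to identify when harvested data changes between pulls is to compare a hash of each pull. The sketch below assumes the harvested data is a list of plain dictionaries. The `fingerprint` and `has_changed` names are hypothetical and say nothing about how the service's connectors are actually built.

```python
import hashlib
import json


def fingerprint(records):
    """Hash the records in a stable form so equal data gives an equal digest."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_changed(previous_digest, records):
    """Return (changed, new_digest) for the latest pull."""
    digest = fingerprint(records)
    return digest != previous_digest, digest


# Three pulls of the same hypothetical source: first seen, unchanged, updated.
first_pull = [{"team": "Lions", "wins": "10"}]
changed_first, digest = has_changed(None, first_pull)
changed_second, digest = has_changed(digest, [{"wins": "10", "team": "Lions"}])
changed_third, digest = has_changed(digest, [{"team": "Lions", "wins": "11"}])
print(changed_first, changed_second, changed_third)
```

Sorting the keys before hashing means a change in field order alone is not reported as a change in the data.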

Put The Data To Work
Using the service, you can immediately extract the data you need and get to work, or establish an ongoing connection with your sources of data. You can use it via the web app, or manage and access it via the API--giving you full control over your web harvesting connections and the resulting data.

When getting to work, you have the option to build your own connectors or explore a marketplace of existing data connectors, tailored to pull from some common sources like the Guardian or ESPN. The marketplace of connectors is a huge opportunity for data consumers as well as data scraping junkies (like me) to put their talents to use building unique and desirable data harvesting scripts.

I’ve written about database to API services like EmergentOne and SlashDB. I would put this service into the Harvest to API or ScrAPI category--allowing you to deploy APIs and machine readable datasets from any publicly available data, even if you aren’t a programmer.

I think ScrAPI services and tools will play an important role in the API economy. While data will almost always originate from a database, oftentimes you can’t navigate existing IT bottlenecks to properly connect and deploy an API from that data source. Sometimes problem owners will have to circumvent existing IT infrastructure and harvest the data where it is published on the open web, taking it upon themselves to generate the API or machine readable formats needed for the last mile of mobile and big data apps that will ultimately consume and depend on this data.

Can You Build an API Using Scraped Data?

I’ve come across several APIs lately that rely on getting their data from web scraping.  These API owners have written scripts that spider page(s) on a regular schedule, pull content, parse it and clean it up for storage in a database.

Once they’ve scraped or harvested this data from the web page source, they then get to work transforming and normalizing it for access via a RESTful API. With an API they can now offer a standard interface that reliably pulls data from a database and delivers it in XML or JSON to any developer.
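The clean up, store and deliver steps described above might look something like the following. It is a minimal sketch that assumes an in-memory SQLite database and a made-up `listings` table with `id`, `title` and `price` fields. The HTTP layer of a real RESTful API is left out, so `listings_as_json` only builds the JSON body an endpoint would return.

```python
import json
import sqlite3


def normalize(raw):
    """Clean one scraped record: collapse whitespace and turn '$1,200' into 1200.0."""
    return {
        "id": raw["id"].strip(),
        "title": " ".join(raw["title"].split()),
        "price": float(raw["price"].replace("$", "").replace(",", "")),
    }


def store_listings(conn, listings):
    """Save normalized records, replacing rows stored by an earlier scrape."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings (id TEXT PRIMARY KEY, title TEXT, price REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO listings (id, title, price) VALUES (:id, :title, :price)",
        listings,
    )
    conn.commit()


def listings_as_json(conn, max_price=None):
    """Build the JSON body an API endpoint could return, read from the database."""
    query = "SELECT id, title, price FROM listings"
    params = ()
    if max_price is not None:
        query += " WHERE price <= ?"
        params = (max_price,)
    query += " ORDER BY id"
    rows = conn.execute(query, params).fetchall()
    return json.dumps([{"id": r[0], "title": r[1], "price": r[2]} for r in rows])


# Hypothetical scraped records, as messy as they tend to come off a page.
scraped = [
    {"id": "a1", "title": "  Blue   bike ", "price": "$1,200"},
    {"id": "a2", "title": "Desk", "price": "$85.50"},
]

conn = sqlite3.connect(":memory:")
store_listings(conn, [normalize(record) for record in scraped])
payload = listings_as_json(conn, max_price=100)
print(payload)
```

Because the API reads from the database rather than from the website, consumers keep getting a consistent response even while the scraper is being repaired.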

These API owners are still at the mercy of the website source that is being scraped.  What if the website changes something?  

This is definitely a real problem, and scraping is not the most reliable data source.  But if you think about it, APIs can go up and down in availability, making them pretty difficult to work with too. As long as you have failure built into your data acquisition process and hopefully have a plan B, the API can offer a consistent interface for your data.
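Building failure into data acquisition can be as simple as keeping the last good copy around. The sketch below assumes the scraper is any function that returns a list of records. The `acquire` function, the cache file name and the `stale` flag are hypothetical names chosen for the example, and serving a cached copy is just one possible plan B.

```python
import json
import os
import tempfile
import time


def acquire(fetch, cache_path, now=time.time):
    """Try a fresh scrape; on any failure fall back to the last good copy."""
    try:
        records = fetch()
        if not records:
            # An empty result usually means the page layout changed.
            raise ValueError("scrape returned no records")
    except Exception as error:
        try:
            with open(cache_path, encoding="utf-8") as handle:
                cached = json.load(handle)
        except FileNotFoundError:
            raise RuntimeError("scrape failed and no cached copy exists") from error
        return {"records": cached["records"], "fetched_at": cached["fetched_at"], "stale": True}
    snapshot = {"records": records, "fetched_at": now()}
    with open(cache_path, "w", encoding="utf-8") as handle:
        json.dump(snapshot, handle)
    return {**snapshot, "stale": False}


def working_scraper():
    return [{"title": "Desk"}]


def broken_scraper():
    raise ConnectionError("site layout changed")


with tempfile.TemporaryDirectory() as folder:
    cache = os.path.join(folder, "last_good.json")
    fresh = acquire(working_scraper, cache)      # succeeds and refreshes the cache
    fallback = acquire(broken_scraper, cache)    # fails and serves the cached copy

print(fresh["stale"], fallback["stale"])
```

Returning the `stale` flag and the original fetch time lets the API tell its consumers how old the data is instead of silently serving outdated records.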

Web scraping is not an ideal foundation for building a business around, but if the value of the data outweighs the work involved with obtaining, cleaning up, normalizing and delivering the data--then it's worth it. And this type of approach to data APIs will continue.

There is far more data available via web pages today than there is from web APIs. This discrepancy leaves a lot of opportunities for web scrapers to build businesses around data you might own and make available online, but haven’t made time to open, control and monitor access to via an API.

The opportunity to build businesses around scraping will continue in any sector until dependable APIs are deployed that provide reasonable pricing, as well as clear incentive and revenue models for developers to build their businesses on.

Open APIs Give Content Providers More Control

I'm organizing my code libraries tonight. These are PHP, JavaScript, Regular Expressions, SQL, and other tools I use for different purposes.

One such purpose is harvesting and scraping. I have an extensive library of PHP code I've used in the last 5 years to pull web pages, parse tables, submit forms, and what not.

As I'm organizing these snippets of code into Snippely, I'm thinking about all the effort I've put into getting content.

I've harvested government data, Craigslist postings, real estate listings, and a wide variety of news, products, and geo-data.

If I need some data, I much prefer using an API, but if I have a need and there is data available on a web page...I just harvest it.

If a content provider does not have an open API, I view them much differently than if they do. I see them as a source of content, there is no real relationship. When a content provider has an open API, I will use their API. If the API offers enough value, I will pay to use it, and establish a relationship with the content provider.

Content providers will have far more control over their content by providing an open API, even if the content is also available on their site. This control will allow owners to track usage and even monetize it.

I would much prefer to pull data and other content from an API rather than scrape or harvest it.

Setting Data Free with Scraping

I was just reading Setting Government Data Free with ScraperWiki from ProgrammableWeb. It led me to start playing with ScraperWiki:
  • Scraper: a computer program that copies structured information from webpages into a database
  • ScraperWiki: a website where people can write and repair public web scrapers and invent uses for the data
With 20 years of experience with databases, I have a love of data and tools that make data more accessible. Scraping is an important tool in the liberation of data from web-based sources.

The data is often restricted by a lack of skills or resources on the part of the owning party. They just don't have the time or the understanding of how to publish the data so it is easily accessed and consumed. Other times they may purposely make it difficult to access.

I have a pretty mature set of scraping scripts that allow me to pull web pages, consume, iterate and parse the content. I then store it as XML files on Amazon S3, relational tables in Amazon RDS or key-value pairs in Amazon SimpleDB.
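As an illustration of the XML storage step, here is a minimal Python sketch that turns parsed records into an XML document using the standard library. The `records_to_xml` function, the tag names and the sample record are made up for the example. It assumes every field name is a valid XML tag name, and it leaves the upload to Amazon S3 out entirely.

```python
import xml.etree.ElementTree as ET


def records_to_xml(records, root_tag="records", item_tag="record"):
    """Serialize a list of flat dicts as one XML document string."""
    root = ET.Element(root_tag)
    for record in records:
        item = ET.SubElement(root, item_tag)
        for field, value in record.items():
            # ElementTree escapes characters like & and < when serializing.
            ET.SubElement(item, field).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical record harvested from a government web page.
xml_text = records_to_xml([{"agency": "Parks & Rec", "budget": 1200}])
print(xml_text)
```

The resulting string can be written to a file or handed to whatever storage client you use.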

I like what ScraperWiki is doing with not only democratizing data, but democratizing the tools and places to store the data. There is a lot of work to be done in liberating data for government, corporate and non-profit groups. We need all the people, tools, and standard processes we can get.

Transforming Text Into Knowledge API

I had a chance to play with the AlchemyAPI I came across today. AlchemyAPI is a semantic tagging and text mining Application Programming Interface (API).

I have about 10K web pages I want to extract top keywords and key phrases from. I want meaning extracted from the words on each page.

AlchemyAPI provides nine methods:
  • Named Entity Extraction - Identifies people, companies, organizations, cities, geographic features and other entities within content provided.
  • Topic Categorization - Applies a categorization for the content provided.
  • Language Detection - Provides language detection for the content provided.
  • Concept Tagging - Tagging of the content provided.
  • Keyword Extraction - Provides topic / keyword / tag extraction for the content provided.
  • Text Extraction / Web Page Cleaning - Provides mechanism to extract the page text from web pages.
  • Structured Content Scraping - Ability to mine structured data from web pages.
  • Microformats Parsing / Extraction - Extraction of hCard, adr, geo, and rel formatted content from any web page.
  • RSS / ATOM Feed Detection - Provides RSS / ATOM feed detection in any web pages.
I'm only using the keyword extraction and named entity extraction for what I am doing. The whole API provides some great tools to quickly harvest, scrape and process content from the open Internet.

Their API is extremely easy to use and you can be up and running in about 10 minutes harvesting and processing pages.
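To give a feel for what keyword extraction means at its very simplest, here is a naive word frequency sketch in Python. It is not AlchemyAPI's method and does not call their API. The `top_keywords` function and the small stop word list are assumptions made for the example, and a real service does far more, such as entity recognition and phrase detection.

```python
import re
from collections import Counter

# A tiny, hand-picked stop word list; a real one would be much longer.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "into", "is", "it", "of", "on", "or", "that", "the", "to", "with",
}


def top_keywords(text, limit=5):
    """Return the most frequent non stop words in the text, most common first."""
    words = re.findall(r"[a-z][a-z'-]+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return [word for word, _ in counts.most_common(limit)]


sample = "Scraping data from the web. Scraping tools turn web data into APIs. Data matters."
keywords = top_keywords(sample, limit=3)
print(keywords)
```

Run over 10K pages, even a crude count like this shows why handing the job to a dedicated text mining API is attractive.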

If you think there is a link I should have listed here feel free to tweet it at me, or submit as a Github issue. Even though I do this full time, I'm still a one person show, and I miss quite a bit, and depend on my network to help me know what is going on.