{"Scraping & APIs"}

I Am Using Kimono Labs To Fill In The Gap For Companies Who Do Not Have An RSS Feed For Their Blog

I am tracking on around 2,500 companies who are doing interesting things in the API space. Out of these companies, about 1,000 of them have blogs, which for me is a pretty important signal. About 1/4 of these companies with blogs do not have an RSS feed, which in 2014 seems a little odd to me, but maybe I'm an old timer.

I believe that a blog is one of the most important signals any API provider can put out, right alongside Twitter and Github. Historically I have depended on the Twitter account for these RSS-less blogs, but now I'm taking a different path, and using Kimono Labs to fill in the gap.

Using the Kimono Labs Chrome extension, I just visit the main blog page for one of these companies, and select the title of the first blog post. Kimono gives me the ability to name this field, which I usually just call “title”. Since the title is a link, Kimono also associates the link to the blog post with each title. You could also highlight the summary text, but I don’t need this, as I have secondary processes that run and pull the full content of each blog post, as well as the timestamp and author, and take a screenshot.

Within a couple of seconds, using Kimono Labs, I now have an API for each company's blog, assuming the role RSS would normally play. When I have time each week, I will generate an API for the most important companies I'm tracking on. Maybe someday I will close the entire blog syndication gap for the companies I track on, but for now it's nice to be able to tackle the most high-value companies, and know that I have a viable solution to the problem with Kimono Labs.
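
To give a sense of how one of these generated endpoints gets used in place of RSS, here is a minimal Python sketch of polling it. The endpoint pattern, the "collection1" response shape, and the API ID and key are all assumptions based on how Kimono structured its APIs at the time, so adjust them to whatever your generated API actually returns.

```python
# A minimal sketch of polling a Kimono-generated blog API in place of RSS.
# The endpoint pattern and response shape below are assumptions about how
# Kimono Labs structured its APIs; the ID and key are placeholders.
import requests

KIMONO_API_ID = "YOUR_API_ID"    # hypothetical placeholder
KIMONO_API_KEY = "YOUR_API_KEY"  # hypothetical placeholder

url = "https://www.kimonolabs.com/api/{0}".format(KIMONO_API_ID)
response = requests.get(url, params={"apikey": KIMONO_API_KEY}, timeout=30)
response.raise_for_status()

# Assumed shape: {"results": {"collection1": [{"title": {"text": ..., "href": ...}}]}}
for item in response.json().get("results", {}).get("collection1", []):
    title = item.get("title", {})
    print(title.get("text"), title.get("href"))
```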

See The Full Blog Post


Legitimizing Scraping As A Data Source For APIs

Harvesting or scraping content and data from other public web sources is something many do, but few will talk about publicly. While scraping does infringe on copyright in some situations, in many other situations it is quickly becoming a legitimate way to acquire content or data for use behind an API.

There is a lot of content available online where the current stewards do not have the control, resources, or interest to make it available in a machine readable format. In these scenarios, for many developers, if you want the content, you just write a scraping script and liberate it from the site, to be used as you wish.

In the last year, we have seen a new breed of API service providers emerge who assist users in deploying APIs from data and content that is liberated through harvesting or scraping. For the first time, I'm seeing enough of these new tools and services that I'm going to break scraping out into its own area, and make it part of this API deployment white paper.

See The Full Blog Post


Building Blocks Of API Deployment

As I continue my research into the world of API deployment, I'm trying to distill the services and tooling I come across down into what I consider to be a common set of building blocks. My goal with identifying API deployment building blocks is to provide a simple list of the moving parts that enable API providers to successfully deploy their services.

Some of these building blocks overlap with other core areas of my research, like design and management, but I hope this list captures the basic building blocks of what anyone needs to know to be able to follow the world of API deployment. While this post is meant for a wider audience beyond just developers, I think it provides a good reminder for developers as well, and can help things come into focus. (I know it does for me!)

Also, there is some overlap between some of these building blocks, like API Gateway and API Proxy, both doing very similar things, but labeled differently. Identifying building blocks can be very difficult for me, and I'm constantly shifting definitions around until I find a comfortable fit--so some of these will evolve, especially with the speed at which things are moving in 2014.

CSV to API - Text files that contain comma separated values, or CSVs, are one of the quickest ways to convert existing data to an API. Each row of a CSV can be imported and converted to a record in a database, and used to easily generate a RESTful interface that represents the data stored in the CSV. CSV to API can be very messy depending on the quality of the data in the CSV, but can be a quick way to breathe new life into old catalogs of data lying around on servers or even desktops. The easiest way to deal with CSV is to import it directly into a database, then generate the API from the database, but the process can also be done at the time of API creation. (See the sketch after this list.)
Database to API - Database to API is definitely the quickest way to generate an API. If you have valuable data, generally in 2014 it will reside in a Microsoft SQL Server, MySQL, PostgreSQL or other common database platform. Connecting to a database and generating a CRUD (create, read, update and delete) API on existing data makes sense for a lot of reasons. This is the quickest way to open up product catalogs, public directories, blogs, calendars or any other commonly stored data. APIs are rapidly replacing direct database connections; when bundled with common API management techniques, APIs can allow for much more versatile and secure access that can be made public and shared outside the firewall.
Framework - There is no reason to hand-craft an API from scratch these days. There are numerous frameworks out there that are designed for rapidly deploying web APIs. Deploying APIs using a framework is only an option when you have the necessary technical and developer talent to understand the setup of the environment and follow the design patterns of each framework. When it comes to planning the deployment of an API using a framework, it is best to select one of the common frameworks written in the preferred language of the available developer and IT resources. Frameworks can be used to deploy data APIs from CSVs and databases, content from documents, or custom code resources that allow access to more complex objects.
API Gateway - API gateways are enterprise quality solutions that are designed to expose API resources. Gateways are meant to provide a complete solution for exposing internal systems and connecting with external platforms. API gateways are often used to proxy and mediate existing API deployments, but may also provide solutions for connecting to other internal systems like databases, FTP, messaging and other common resources. While many public APIs are exposed using frameworks, most enterprise APIs are deployed via API gateways--supporting much larger deployments.
API Proxy - API proxies are commonplace for taking an existing API interface and running it through an intermediary, which allows for translations, transformations and other added services on top of the API. An API proxy does not deploy an API, but can take existing resources like SOAP and XML-RPC and transform them into more common RESTful APIs with JSON formats. Proxies provide other functions such as service composition, rate limiting, filtering and securing of API endpoints. API gateways are the preferred approach for the enterprise, and the companies that provide these services support larger API deployments.
API Connector - Contrary to an API proxy, there are API solutions that are proxyless, simply allowing an API to connect or plug in to the advanced API resources. While proxies work in many situations, allowing APIs to be mediated and transformed into required interfaces, API connectors may be preferred in situations where data should not be routed through proxy machines. API connector solutions only connect to existing API implementations and are easily integrated with existing API frameworks as well as web servers like Nginx.
Hosting - Hosting is all about where you are going to park your API. Usual deployments are on-premise within your company or data center, in a public cloud like Amazon Web Services, or a hybrid of the two. Most of the existing service providers in the space support all types of hosting, but some companies who have the required technical talent host their own API platforms. With HTTP being the transport that modern web APIs put to use, sharing the same infrastructure as websites, hosting APIs does not take any additional skills or resources if you already have a website or application hosting environment.
API Versioning - There are many different approaches to managing different versions of web APIs. When embarking on API deployment you will have to make a decision about how each endpoint will be versioned and maintained. Each API service provider offers versioning solutions, but generally it is handled within the API URI or passed as an HTTP header. Versioning is an inevitable part of the API life-cycle, and it is better to integrate it by design, as opposed to waiting until you are forced to make an evolution in your API interface.
Documentation - API documentation is an essential building block for all API endpoints. Quality, up to date documentation is essential for on-boarding developers and ensuring they successfully integrate with an API. Documentation needs to be derived from quality API designs, kept up to date, and made accessible to developers via a portal. There are several tools available for automatically generating documentation, and even what is called interactive documentation, which allows developers to make live calls against an API while exploring the documentation. API documentation is part of every API deployment.
Code Samples - Second to documentation, code samples in a variety of programming languages are essential to a successful API integration. With quality API design, generating samples that can be used across multiple API resources is possible. Many of the emerging API service providers, and the same tools that generate API documentation from JSON definitions, can also auto-generate code samples that can be used by developers. Generation of code samples in a variety of programming languages is a requirement during API deployment.
Scraping - Harvesting or scraping of data from an existing website, content or data source. While we all would like content and data sources to be machine readable, sometimes you just have to get your hands dirty and scrape it. I don't support scraping of content in all scenarios and business sectors, but in the right situations scraping can provide a perfectly acceptable content or data source for deploying an API.
Container - The new virtualization movement, led by Docker, and supported by Amazon, Google, Red Hat, Microsoft, and many more, is providing new ways to package up APIs and deploy them as small, modular, virtualized containers.
Github - Github provides a simple, but powerful way to support API deployment, allowing for publishing of a developer portal, documentation, code libraries, TOS, and all the supporting API business building blocks that are necessary for any API effort. At a minimum, Github should be used to manage public code libraries, and to engage with API consumers using Github's social features.
Terms of Use / Service - Terms of Use provide a legal framework for developers to operate within. They set the stage for the business development relationships that will occur within an API ecosystem. TOS should protect the API owner's company, assets and brand, but should also provide assurances for developers who are building businesses on top of an API. Make sure an API's TOS passes inspection with the lawyers, but also strikes a healthy balance within the ecosystem and fosters innovation.
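
As a concrete illustration of the CSV to API building block referenced above (and of URI-based versioning), here is a minimal Python sketch that imports a CSV into SQLite and exposes it through a simple /v1/ endpoint using Flask. The file name, column names, and route are illustrative assumptions, not a standard.

```python
# A minimal sketch of the CSV-to-API pattern: import a CSV into SQLite, then
# expose the rows through a simple versioned (/v1/) JSON endpoint with Flask.
# The file name, column names, and route are illustrative assumptions.
import csv
import sqlite3
from flask import Flask, jsonify

DB = "catalog.db"

def load_csv(path="products.csv"):
    # Assumes a header row of: id,name,price
    with sqlite3.connect(DB) as conn:
        conn.execute("DROP TABLE IF EXISTS products")
        conn.execute("CREATE TABLE products (id TEXT, name TEXT, price TEXT)")
        with open(path, newline="") as handle:
            rows = [(r["id"], r["name"], r["price"]) for r in csv.DictReader(handle)]
        conn.executemany("INSERT INTO products VALUES (?, ?, ?)", rows)

app = Flask(__name__)

@app.route("/v1/products")
def products():
    with sqlite3.connect(DB) as conn:
        conn.row_factory = sqlite3.Row
        rows = conn.execute("SELECT id, name, price FROM products").fetchall()
    return jsonify(products=[dict(row) for row in rows])

if __name__ == "__main__":
    load_csv()
    app.run()
```

The same shape works for the database to API building block--skip the CSV import and point the queries at an existing table.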

If there are any features, services or tools you depend on when deploying your APIs, please let me know at @kinlane. I'm not trying to create an exhaustive list, I just want to get an idea of what is available across providers, and where the gaps potentially are.

I feel like I'm finally getting a handle on the building blocks for API design, deployment, and management, and understanding the overlap between the different areas. I will revisit my design and management building blocks, and evolve my ideas of what my perfect API editor would look like, and how this fits in with API management infrastructure from 3Scale, and even API integration.

Disclosure: 3Scale is an API Evangelist partner.

See The Full Blog Post


The Black, White And Gray of Web Scraping

There are many reasons for wanting to scrape data or content from a public website. I think these reasons can be easily represented as different shades of gray: the darker the gray, the less legal you could consider it, and the lighter the gray, the more legal. You with me?

An example of darker gray would be scraping classified ad listings from craigslist for use on your own site, where an example of lighter gray could be pulling a listing of veterans hospitals from the Department of Veterans Affairs website for use in a mobile app that supports veterans. One is corporate owned data, and the other is public data. The motives for wanting either set of data would potentially be radically different, and the restrictions on each set of data would be different as well.

Many opponents of scraping don't see the shades of gray, they just see people taking data and content that isn't theirs. Proponents of scraping will have an array of opinions, ranging from those who feel that if it is on the web, it should be available to everyone, to people who would only scrape openly licensed or public data, and stay away from anything proprietary.

Scraping of data is never a black and white issue. I'm not blindly supporting scraping in every situation, but I am a proponent of sensible approaches to harvesting valuable information, the development of open source tools, as well as services that assist users in scraping.

See The Full Blog Post


The Role Of Scraping In API Deployment

Scraping has been something I’ve done since I first started working on the web. Sometimes you just need some data or a piece of content that isn't available in a machine readable format, and the only way to get it is to scrape it off a web page.

Scraping is widespread, but something very few individuals or companies will admit to doing. Just like writing scripts for pulling data from APIs, I write a lot of scripts that pull content and data from websites and RSS feeds. Even though I tend to write my own scripts for scraping, I’ve been closely watching the new breed of scraping tools like ScraperWiki:

ScraperWiki

ScraperWiki is a web-based platform for collaboratively building programs to extract and analyze public (online) data, in a wiki-like fashion. "Scraper" refers to screen scrapers, programs that extract data from websites. "Wiki" means that any user with programming experience can create or edit such programs for extracting new data, or for analyzing existing datasets. The main use of the website is providing a place for programmers and journalists to collaborate on analyzing public data.

I was first attracted to ScraperWiki as a way to harvest Tweets, and was further interested by their web and PDF extraction tools. ScraperWiki has already been around for a while, founded back in 2010, and recently a new wave of scraping tools has emerged:

import.io

Import.io turns the web into a database, releasing the vast potential of data trapped in websites. It allows you to identify a website, select the data and treat it as a table in your database, in effect transforming the data into a row and column format. You can then add more websites to your data set, the same as adding more rows, and query in real-time to access the data.

Kimono

Kimono is a way to turn websites into structured APIs from your browser in seconds. You don’t need to write any code or install any software to extract data with Kimono. The easiest way to use Kimono is to add our bookmarklet to your browser’s bookmark bar. Then go to the website you want to get data from and click the bookmarklet. Select the data you want and Kimono does the rest.

Kimono and Import.io provide scraping tools for anyone, even non-developers, to scrape content from web pages, but also allow you to deploy an API from the content. While it is easy to deploy APIs using data and content from the other scraping providers I track on, this new breed of scraping services focuses on API deployment as the end goal.

At API Strategy & Practice in Amsterdam, the final panel of the event was called “toward 1 million APIs”, and scraping came up as one possible way that we will get to APIs at this scale. Sometimes the stewards or owners of data just don’t have the resources to deploy APIs, and the only way to deploy an API will be to scrape data and content and publish as web API--either internally or externally by 3rd party.

I have a research site set up to keep track of scraping news I come across, as well as any companies and tools I discover. Beyond ScraperWiki, Kimono and Import.io, I’m watching these additional scraping services:

Alchemy

The product of over 50 person years of engineering effort, AlchemyAPI is a text mining platform providing the most comprehensive set of semantic analysis capabilities in the natural language processing field. Used over 3 billion times every month, AlchemyAPI enables customers to perform large-scale social media monitoring, target advertisements more effectively, track influencers and sentiment within the media, automate content aggregation and recommendation, make more accurate stock trading decisions, enhance business and government intelligence systems, and create smarter applications and services.

Common Crawl

Common Crawl is a non-profit foundation dedicated to providing an open repository of web crawl data that can be accessed and analyzed by everyone. Common Crawl Foundation is a California 501(c)(3) registered non-profit founded by Gil Elbaz with the goal of democratizing access to web information by producing and maintaining an open repository of web crawl data that is universally accessible and analyzable.

ConvExtra

Convextra allows you to collect valuable data from the internet and represent it in an easy-to-use CSV format for further utilization.

PageMunch

Page Munch is a simple API that allows you to turn webpages into rich, structured JSON. Easily extract photos, videos, event, author and other metadata from any page on the internet in milliseconds.

PromptCloud

PromptCloud operates on a “Data as a Service” (DaaS) model and deals with large-scale data crawl and extraction, using cutting-edge technologies and cloud computing solutions (Nutch, Hadoop, Lucene, Cassandra, etc). Its proprietary software employs machine learning techniques to extract meaningful information from the web in the desired format. These data could be from reviews, blogs, product catalogs, social sites, travel data—basically anything and everything on the WWW. It’s a customized solution over simply being a mass-data crawler, so you only get the data you wish to see. The solution provides both deep crawl and refresh crawl of the web pages in a structured format.

Scrapinghub

Scrapinghub is a company that provides web crawling solutions, including a platform for running crawlers, a tool for building scrapers visually, data feed providers (DaaS) and a consulting team to help startups and enterprises build and maintain their web crawling infrastructures.

Screen Scraper

Screen Scraper automates copying text from a web page, clicking links, entering data into forms and submitting them, iterating through search results pages, and downloading files (PDF, MS Word, images, etc.).

Web Scrape Master

Web Scrape Master lets you scrape the web without writing code for it, to create value from the sea of data being published over the web. Data is currency. Web Scrape Master provides a very simple API for retrieving scraped data.

If you know of scraping services I don't have listed, or scraping tools that aren't included in my research, please let me know. I think the scraping services that get it right will continue to play a vital role in API deployment, and in getting us to 1M APIs.

See The Full Blog Post


16 Areas Of My Core API Research

When I first started API Evangelist, I wanted to better understand the business of APIs, which really focused on API management. Over the course of four years, the list of companies delivering API management services has expanded with new entrants, and evolved with acquisitions of some of the major players. Visit my API management research site for more news, companies, tools and analysis from this part of API operations.

API Management

In 2011, people would always ask me which API management company would help you with deployment. For a while, the answer was none of them, but this soon changed with new players emerging and existing providers expanding their services. To help me understand the expanding API lifecycle, I started two new separate research areas:

API Design
API Deployment

Once you design and deploy your API, and get a management plan in place, you have to start looking at how you are going to make money and get the word out. In an effort to better understand the business of APIs, I set up research sites for the monetization and evangelism of APIs:

API Evangelism
API Monetization

As the API space has matured, I started seeing that I would have to pay better attention to not just providing APIs, but also track on the consumption of APIs. With this in mind, I started looking into how APIs are discovered, and what services and tools developers are using when integrating with APIs:

API Discovery
API Integration

While shifting my gaze to what developers are up to when building applications using APIs, I couldn’t help but notice the rapidly expanding world of backend as a service, often called BaaS:

Backend as a Service (BaaS)

As I watch the space, I carefully tag interesting news, companies, services and tools that have an API focus, or influence the API sector in any way. Four areas that I think are the fastest growing, and hold the most potential, are:

Aggregation
Reciprocity
Real-Time
Voice

In 2013, I also saw a conversation grow around an approach to designing APIs called hypermedia. Last year things moved beyond just academic discussion of designing hypermedia APIs, to actual rubber meeting the road, with companies deploying well designed hypermedia APIs. In January I decided that it was time to regularly track on what is going on in the hypermedia conversation:

Hypermedia APIs

After that, there are three areas that come up regularly in my monitoring of the space, pieces of the API puzzle that I don’t just think are important to API providers and consumers--they are areas I actively engage in:

Embeddable
Scraping
Single Page Apps (SPA)

There are other research projects I have going on, but these areas reflect the core of my API monitoring and research. They each live as separate Github repositories, accessible through Github Pages. I publish news, companies, tools, and my analysis on a regular basis. I use them for my own education, and I hope you find them useful as well.

See The Full Blog Post


Introduction

API Providers Guide - API Deployment

Prepared By Kin Lane

July 2014




Table of Contents

  • Overview of API Deployment
  • API Deployment Building Blocks
  • Tools for API Deployment
  • Cloud API Deployment Platforms
  • Using API Gateways for Deployment
  • Legitimizing Scraping As A Data Source For APIs
  • From Deployment to Management





See The Full Blog Post


Web Harvesting to API with Import.io

I had a demo of a new data extraction service today called Import.io. The service allows you to harvest or scrape data from websites, and then output it in machine readable formats like JSON. It is very similar to Needlebase, a popular scraping tool that was acquired and then shut down by Google early in 2012, except I’d say Import.io represents a simpler, yet at the same time more sophisticated, approach to harvesting and publishing web data than Needlebase.

Extract
Using Import.io you can target web pages where the content resides that you wish to harvest, define the rows of data, label and associate them with columns in the table where the system will ultimately put your data, then extract the data, complete with querying, filtering, pagination and other aspects of browsing the web you will need to get at all the data you desire.

Connect
After defining the data that will be extracted, and how it will be stored, you can stop and use the data as is, or you can set up a more ongoing, real-time connection with the data you are harvesting. Using Import.io connectors you can pull the data regularly, identify when it changes, merge from multiple sources, and remix data as needed.

Put The Data To Work
Using Import.io you can immediately extract the data you need and get to work, or establish an ongoing connection with your sources of data, using it via the Import.io web app, or managing and accessing it via the Import.io API--giving you full control over your web harvesting connections, and the resulting data.

When getting to work using Import.io, you have the option to build your own connectors or explore a marketplace of existing data connectors, tailored to pull from some common sources like the Guardian or ESPN. The Import.io marketplace of connectors is a huge opportunity for data consumers, as well as data scraping junkies (like me), to put their talents to use building unique and desirable data harvesting scripts.

I’ve written about database-to-API services like EmergentOne and SlashDB; I would put Import.io into the harvest-to-API, or ScrAPI, category--allowing you to deploy APIs and machine readable datasets from any publicly available data, even if you aren’t a programmer.

I think ScrAPI services and tools will play an important role in the API economy. While data will almost always originate from a database, oftentimes you can’t navigate existing IT bottlenecks to properly connect and deploy an API from that data source. Sometimes problem owners will have to circumvent existing IT infrastructure and harvest where the data is published on the open web, taking it upon themselves to generate the necessary API or machine readable formats that will be needed for the last mile of mobile and big data apps that will ultimately consume and depend on this data.

See The Full Blog Post


Can You Build an API Using Scraped Data?

I’ve come across several APIs lately that rely on getting their data from web scraping. These API owners have written scripts that spider pages on a regular schedule, pull the content, parse it, and clean it up for storage in a database.

Once they’ve scraped or harvested this data from the web page source, they get to work transforming and normalizing it for access via a RESTful API. With an API they can now offer a standard interface that reliably pulls data from a database and delivers it in XML or JSON to any developer.

These API owners are still at the mercy of the website source that is being scraped.  What if the website changes something?  

This is definitely a real problem, and scraping is not the most reliable data source. But if you think about it, APIs can go up and down in availability too, making them pretty difficult to work with as well. As long as you have failure built into your data acquisition process, and hopefully a plan B, the API can offer a consistent interface for your data.
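
As a sketch of what building failure into that acquisition process could look like, here is a hypothetical refresh job in Python that scrapes a source page, validates what it extracted, and keeps serving the last good snapshot when the page changes or goes down. The URL, CSS selector, and file name are placeholders, not references to any real site.

```python
# A hypothetical refresh job: scrape, validate, and keep serving the last good
# snapshot if the source page changes or goes down. The URL, selector, and
# file name are placeholders.
import json
import requests
from bs4 import BeautifulSoup

SOURCE_URL = "https://example.com/listings"  # placeholder source page
SNAPSHOT = "listings.json"                   # last known-good data the API serves

def scrape():
    html = requests.get(SOURCE_URL, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    items = [{"title": a.get_text(strip=True), "url": a["href"]}
             for a in soup.select(".listing a[href]")]
    if not items:
        raise ValueError("page structure changed, nothing extracted")
    return items

def refresh():
    try:
        items = scrape()
    except Exception as err:
        # Plan B: leave the previous snapshot in place and flag the failure.
        print("scrape failed ({0}); serving last good snapshot".format(err))
        return
    with open(SNAPSHOT, "w") as handle:
        json.dump(items, handle)

if __name__ == "__main__":
    refresh()
```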

Web scraping is not an ideal foundation for building a business on, but if the value of the data outweighs the work involved with obtaining, cleaning up, normalizing and delivering it--then it's worth it. And this type of approach to data APIs will continue.

There is far more data available via web pages today than there is from web APIs. This discrepancy leaves a lot of opportunities for web scrapers to build businesses around data you might own, and make available online, but haven't made time to open up, control and monitor access to via an API.

The opportunity to build businesses around scraping will continue in any sector until dependable APIs are deployed that provide reasonable pricing, as well as clear incentive and revenue models for developers to build their businesses on.

See The Full Blog Post


Open APIs Give Content Providers More Control

I'm organizing my code libraries tonight. These are PHP, JavaScript, Regular Expressions, SQL, and other tools I use for different purposes.

One such purpose is harvesting and scraping. I have an extensive library of PHP code I've used in the last 5 years to pull web pages, parse tables, submit forms, and whatnot.

As I'm organizing these snippets of code into Snippely, I'm thinking about all the effort I've put into getting content.

I've harvested government data, craigslist postings, real estate listings, and a wide variety of news, products, and geo-data.

If I need some data, I much prefer using an API, but if I have a need and there is data available on a web page...I just harvest it.

If a content provider does not have an open API, I view them much differently than if they do. I see them as a source of content; there is no real relationship. When a content provider has an open API, I will use their API. If the API offers enough value, I will pay to use it, and establish a relationship with the content provider.

Content providers will have far more control over their content by providing an open API, even if the content is also available on their site. This control will allow owners to track usage and even monetize it.

I would much prefer to pull data and other content from an API rather than scrape or harvest it.

See The Full Blog Post


Setting Data Free with Scraping

I was just reading Setting Government Data Free with ScraperWiki from ProgrammableWeb. It led me to start playing with ScraperWiki:
  • Scraper: a computer program that copies structured information from webpages into a database
  • ScraperWiki: a website where people can write and repair public web scrapers and invent uses for the data
With 20 years of experience with databases, I have a love of data, and of tools that make data more accessible. Scraping is an important tool in the liberation of data from web-based sources.

The data is often restricted by a lack of skills or resources on the part of the owning party. They just don't have the time, or the understanding of how to publish the data so it is easily accessed and consumed. Other times they may purposely make it difficult to access.

I have a pretty mature set of scraping scripts that allow me to pull web pages, then consume, iterate over and parse the content. I then store the results as XML files on Amazon S3, as relational tables in Amazon RDS, or as key-value pairs in Amazon SimpleDB.
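
As a rough modern equivalent of that pipeline, the sketch below pulls a page, wraps it in a small XML document, and stores it in S3. It uses boto3 and BeautifulSoup rather than the tooling I was using at the time, and the bucket and key names are placeholders.

```python
# A rough modern equivalent of the scrape-and-store workflow described above:
# wrap scraped content in a small XML document and push it to Amazon S3.
# Uses boto3 and BeautifulSoup; the bucket and key names are placeholders.
import xml.etree.ElementTree as ET

import boto3
import requests
from bs4 import BeautifulSoup

def page_to_xml(url):
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    root = ET.Element("page", url=url)
    ET.SubElement(root, "title").text = soup.title.get_text(strip=True) if soup.title else ""
    ET.SubElement(root, "body").text = soup.get_text(" ", strip=True)
    return ET.tostring(root, encoding="unicode")

def store(url, bucket="my-harvest-bucket", key="pages/example.xml"):  # placeholders
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=key,
        Body=page_to_xml(url).encode("utf-8"),
        ContentType="application/xml",
    )

if __name__ == "__main__":
    store("https://example.com/")
```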

I like what ScraperWiki is doing with not only democratizing data, but democratizing the tools and places to store the data. There is a lot of work to be done in liberating data for government, corporate and non-profit groups. We need all the people, tools, and standard processes we can get.

See The Full Blog Post


Transforming Text Into Knowledge API

I had a chance to play with AlchemyAPI, which I came across today. AlchemyAPI is a semantic tagging and text mining Application Programming Interface (API).

I have about 10K web pages I want to extract top keywords and key phrases from. I want meaning extracted from the words on each page.

AlchemyAPI provides nine methods:
  • Named Entity Extraction - Identifies people, companies, organizations, cities, geographic features and other entities within content provided.
  • Topic Categorization - Applies a categorization for the content provided.
  • Language Detection - Provides language detection for the content provided.
  • Concept Tagging - Tagging of the content provided.
  • Keyword Extraction - Provides topic / keyword / tag extraction for the content provided.
  • Text Extraction / Web Page Cleaning - Provides mechanism to extract the page text from web pages.
  • Structured Content Scraping - Ability to mine structured data from web pages.
  • Microformats Parsing / Extraction - Extraction of hCard, adr, geo, and rel formatted content from any web page.
  • RSS / ATOM Feed Detection - Provides RSS / ATOM feed detection in any web pages.
I'm only using the keyword extraction and named entity extraction for what I am doing. The whole API provides some great tools to quickly harvest, scrape and process content from the open Internet.

Their API is extremely easy to use, and you can be up and running in about 10 minutes, harvesting and processing pages.
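
To give a sense of how simple a call was, here is a rough Python sketch of a keyword extraction request. The endpoint path, parameter names, and response fields are reconstructed from memory of the AlchemyAPI of that era, so treat them as assumptions rather than current documentation.

```python
# A rough sketch of an AlchemyAPI keyword extraction call. The endpoint path,
# parameters, and response fields are assumptions reconstructed from memory of
# the service as it existed then; the API key and article URL are placeholders.
import requests

API_KEY = "YOUR_ALCHEMY_KEY"  # placeholder
ENDPOINT = "http://access.alchemyapi.com/calls/url/URLGetRankedKeywords"  # assumed path

params = {
    "apikey": API_KEY,
    "url": "https://example.com/some-article",  # placeholder page to analyze
    "outputMode": "json",
}
data = requests.get(ENDPOINT, params=params, timeout=30).json()
for keyword in data.get("keywords", []):
    print(keyword.get("text"), keyword.get("relevance"))
```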

See The Full Blog Post