API Scraping Organizations

These are the organizations I come across in my research that are doing interesting things in the API space. They could be companies, institutions, government agencies, or any other type of organizational entity. My goal is to aggregate them here so I can stay in tune with what they are up to and how it impacts the API space.

Embedly

Embedly provides a platform and suite of tools to make embedding and previewing links simple. Embedly helps publishers and consumers manage embed codes from more than 100 websites and APIs, including YouTube, Flickr, Ustream, Picasa, Hulu, Twitpic, Quantcast, and CrunchBase. It automatically converts links from these sources into embedded media on the fly.
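
As a rough illustration, here is a minimal sketch of fetching embed data for a single link through Embedly's oEmbed endpoint using Python's requests library. The API key is a placeholder, and the response fields shown are common oEmbed ones, so treat the details as assumptions rather than a definitive integration.

```python
import requests

# Placeholder key; a real one comes from an Embedly account.
EMBEDLY_KEY = "YOUR_EMBEDLY_API_KEY"

resp = requests.get(
    "https://api.embed.ly/1/oembed",
    params={
        "key": EMBEDLY_KEY,
        "url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    },
)
resp.raise_for_status()
oembed = resp.json()

# A typical oEmbed response carries the provider, a title, and
# ready-to-paste embed HTML for rich media types.
print(oembed.get("provider_name"), "-", oembed.get("title"))
print(oembed.get("html"))
```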

Saplo

Saplo uses innovative semantic technologies to analyze text in a way that mimics how humans read and evaluate it. Saplo helps organizations extract and refine valuable information hidden in large text collections. Saplo offers five services: Entity Tagging, Topic Tags, Related & Similar Articles, Contextual Recognition, and Sentiment Analysis.

TextRazor

The service provides analysis of selected text passages to identify named entities and statements of fact with disambiguation to distinguish similar text strings. It applies machine learning algorithms and natural language processing to connect a text sample with a knowledge base and identify known elements and their relationships. API methods support submission of a text sample to be parsed. 
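
To make that concrete, here is a hedged sketch of submitting a text sample to TextRazor's REST endpoint for entity extraction. The endpoint, header, and field names reflect my reading of their documentation, and the key is a placeholder.

```python
import requests

# Placeholder key from a TextRazor account.
TEXTRAZOR_KEY = "YOUR_TEXTRAZOR_API_KEY"

resp = requests.post(
    "https://api.textrazor.com",
    headers={"x-textrazor-key": TEXTRAZOR_KEY},
    data={
        "text": "Barcelona beat Real Madrid at the Camp Nou on Sunday.",
        "extractors": "entities",
    },
)
resp.raise_for_status()

# Each entity carries a disambiguated ID plus a confidence score,
# which is how similar surface strings get told apart.
for entity in resp.json()["response"].get("entities", []):
    print(entity.get("entityId"), entity.get("confidenceScore"))
```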

Apifier

In late 2014, we needed a web scraper for one of our consulting projects but couldn't find anything suitable, so we decided to build a better scraper, and it turned out people really liked it. A few months later, the project was selected, along with 32 others from 6,500 applications, for the inaugural Y Combinator Fellowship program in August 2015. Apifier launched publicly in October 2015.

ScraperWiki

ScraperWiki is a web-based platform for collaboratively building programs to extract and analyze public online data, in a wiki-like fashion. "Scraper" refers to screen scrapers, programs that extract data from websites; "wiki" means that any user with programming experience can create or edit such programs to extract new data or analyze existing datasets. The main use of the website is to provide a place for programmers and journalists to collaborate on analyzing public data.

CommonCrawl

Common Crawl is a California 501(c)(3) registered non-profit foundation, founded by Gil Elbaz, with the goal of democratizing access to web information by producing and maintaining an open repository of web crawl data that is universally accessible and analyzable.
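
For a sense of how the repository is accessed, here is a small sketch that queries the Common Crawl CDX index for captures of a domain. The crawl ID below is just an example and should be swapped for a current one listed at https://index.commoncrawl.org/.

```python
import json
import requests

# Example crawl ID; pick a current one from https://index.commoncrawl.org/
CRAWL_ID = "CC-MAIN-2017-04"

resp = requests.get(
    "https://index.commoncrawl.org/%s-index" % CRAWL_ID,
    params={"url": "example.com/*", "output": "json"},
)
resp.raise_for_status()

# The index returns one JSON record per capture, each pointing at the
# WARC file, offset, and length needed to pull just that page out of
# the public archive.
for line in resp.text.splitlines()[:5]:
    record = json.loads(line)
    print(record["url"], record["filename"], record["offset"], record["length"])
```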

Scrapinghub

Scrapinghub is a company that provides web crawling solutions, including a platform for running crawlers, a tool for building scrapers visually, data feeds as a service (DaaS), and a consulting team that helps startups and enterprises build and maintain their web crawling infrastructure.

PromptCloud

PromptCloud operates on a "Data as a Service" (DaaS) model and deals with large-scale data crawling and extraction, using cutting-edge technologies and cloud computing solutions (Nutch, Hadoop, Lucene, Cassandra, etc.). Its proprietary software employs machine learning techniques to extract meaningful information from the web in the desired format. The data can come from reviews, blogs, product catalogs, social sites, travel data, and basically anything else on the web. It is a customized solution rather than a mass-data crawler, so you only get the data you wish to see. The solution provides both deep crawls and refresh crawls of web pages in a structured format.

ConvExtra

ConvExtra allows you to collect valuable data from the internet and presents it in an easy-to-use CSV format for further use.

Screen Scraper

Screen Scraper automates common web tasks: copying text from a web page, clicking links, entering data into forms and submitting them, iterating through search results pages, and downloading files (PDF, MS Word, images, etc.).

Mozenda

Web data extraction and mashups are easy with Mozenda. We're industry leaders in screen scraping and data integration.

HPE Haven OnDemand

HPE Haven OnDemand is a platform for building cognitive computing solutions using text analysis, speech recognition, image analysis, indexing and search APIs. Simply put, developers and businesses use APIs to add advanced capabilities such as natural language processing, machine learning, and predictive analytics to their applications.

Aylien

AYLIEN Text Analysis API is a package of Natural Language Processing, Information Retrieval and Machine Learning tools for extracting meaning and insight from textual and visual content with ease. At AYLIEN, we’re harnessing the potential of your data. Whether you're a news organization, a developer, a savvy marketer or an academic, you'll soon see what a dose of AYLIEN intelligence can do for you. Our text API allows you to monitor the sentiment of your brand, analyze documents or summarize and classify large amounts of text.
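
As an illustration only, here is a hedged sketch of calling the Text Analysis API's sentiment endpoint over REST. The endpoint and header names are my assumptions from AYLIEN's documentation, and the credentials are placeholders.

```python
import requests

# Placeholder credentials from an AYLIEN account.
AYLIEN_APP_ID = "YOUR_APP_ID"
AYLIEN_APP_KEY = "YOUR_APP_KEY"

resp = requests.get(
    "https://api.aylien.com/api/v1/sentiment",
    headers={
        "X-AYLIEN-TextAPI-Application-ID": AYLIEN_APP_ID,
        "X-AYLIEN-TextAPI-Application-Key": AYLIEN_APP_KEY,
    },
    params={"text": "The new release is fast, stable, and a joy to use."},
)
resp.raise_for_status()
result = resp.json()

# The endpoint returns a polarity label with a confidence score.
print(result.get("polarity"), result.get("polarity_confidence"))
```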

Scrapy

Scrapy is an open source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way. It is a high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages, for purposes ranging from data mining to monitoring and automated testing.
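
A minimal spider shows the framework's shape: you declare start URLs and a parse callback, and Scrapy handles scheduling, fetching, and throttling. The site and selectors below are illustrative (quotes.toscrape.com is a public demo site).

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # CSS selectors pull structured fields out of each page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination; Scrapy deduplicates and schedules requests.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Saved as quotes_spider.py, this runs with `scrapy runspider quotes_spider.py -o quotes.json` and writes the extracted items out as JSON.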

ParseHub

Turn dynamic websites into APIs. You can extract data from anywhere; ParseHub works with single-page apps, multi-page apps, and just about any other modern web technology. ParseHub can handle JavaScript, AJAX, cookies, sessions, and redirects. You can easily fill in forms, loop through dropdowns, log in to websites, click on interactive maps, and even deal with infinite scrolling.

WrapAPI

Build an API on top of any website; turn any website into a parameterized API. Build, share, and use APIs made from webpages. Use WrapAPI to scrape sites, build better UIs, and automate online tasks.

Apache Nutch

Nutch is a mature, production-ready web crawler. Nutch 1.x enables fine-grained configuration, relying on Apache Hadoop™ data structures, which are great for batch processing.

Dandelion API

Context intelligence: from text to actionable data. Extract meaning from unstructured text and put it in context with a simple API. Thanks to its revolutionary technology, Dandelion API works well even on short and malformed texts in English, French, German, Italian, and Portuguese.
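
For example, here is a hedged sketch of the entity extraction ("nex") endpoint. The path and parameter names follow my reading of the Dandelion docs, and the token is a placeholder.

```python
import requests

# Placeholder token from a Dandelion account.
DANDELION_TOKEN = "YOUR_DANDELION_TOKEN"

resp = requests.get(
    "https://api.dandelion.eu/datatxt/nex/v1",
    params={
        "text": "The Mona Lisa is a painting by Leonardo da Vinci.",
        "token": DANDELION_TOKEN,
    },
)
resp.raise_for_status()

# Each annotation links a span of the input text to a knowledge-base
# entry, with a confidence score for the match.
for ann in resp.json().get("annotations", []):
    print(ann["spot"], "->", ann["uri"], ann["confidence"])
```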

Moz

Moz is a software as a service (SaaS) company based in Seattle, Washington, U.S.A., that sells inbound marketing and marketing analytics software subscriptions. It was founded by Rand Fishkin and Gillian Muessig in 2004 as a consulting firm and shifted to software development in 2008. The company hosts a website which includes an online community of more than one million globally based digital marketers and marketing related tools.

ScrapeLogo

ScrapeLogo was conceived and developed by Maintop Businesses, originally for internal purposes only. It was coded as an independent service for several of Maintop's B2B projects. When requests from other companies multiplied, a private beta version was launched as well. We are now looking for the first beta testers who would like to show company logos on their websites and help us improve the quality and precision of our algorithm.

import.io

Import.io turns the web into a database, releasing the vast potential of data trapped in websites. It allows you to identify a website, select the data, and treat it as a table in your database, in effect transforming the data into a row and column format. You can then add more websites to your data set, the same as adding more rows, and query it in real time to access the data.

Diffbot

Diffbot provides a set of APIs that enable developers to easily use web data in their own applications. Diffbot analyzes documents much like a human would, using visual properties to determine how the parts of the page fit together. The algorithm uses statistical techniques to automatically and reliably determine the structural organization of a page, independent of layout and the language of the text.
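
Here is a minimal sketch of calling Diffbot's Article API. The token and article URL are placeholders, and the fields printed are the common ones returned for article pages, so treat the specifics as assumptions.

```python
import requests

# Placeholder token from a Diffbot account.
DIFFBOT_TOKEN = "YOUR_DIFFBOT_TOKEN"

resp = requests.get(
    "https://api.diffbot.com/v3/article",
    params={"token": DIFFBOT_TOKEN, "url": "https://example.com/some-article"},
)
resp.raise_for_status()

# Diffbot returns the page as structured objects rather than raw HTML.
for obj in resp.json().get("objects", []):
    print(obj.get("title"))
    print(obj.get("author"))
    print((obj.get("text") or "")[:200])
```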

AlchemyAPI

The product of over 50 person years of engineering effort, AlchemyAPI is a text mining platform providing the most comprehensive set of semantic analysis capabilities in the natural language processing field. Used over 3 billion times every month, AlchemyAPI enables customers to perform large-scale social media monitoring, target advertisements more effectively, track influencers and sentiment within the media, automate content aggregation and recommendation, make more accurate stock trading decisions, enhance business and government intelligence systems, and create smarter applications and services.

Bitext

Bitext delivers the most precise and granular text analytics solution on the market, with an accuracy rate above 90%. We are computational linguists first. Our technology really understands sentence structure and its different layers of meaning, so it always produces the richest results.

If you think there is an organization I should have listed here, feel free to tweet it at me or submit it as a GitHub issue. Even though I do this full time, I'm still a one-person show; I miss quite a bit and depend on my network to help me know what is going on.