
Scraping & APIs Tooling

These are the open source scraping-related tools that I am tracking as part of my research. Each can be used as part of the scraping process.

aduana

Frontera backend to guide a crawl using PageRank, HITS or other ranking algorithms based on the link


aile

Automatic Item List Extraction


Apache Tika

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.


aylien_newsapi_csharp

AYLIEN's officially supported .NET (C#) client library for accessing News API


aylien_newsapi_go

AYLIEN's officially supported Go client library for accessing News API


aylien_newsapi_java

AYLIEN's officially supported Java client library for accessing News API


aylien_newsapi_nodejs

AYLIEN's officially supported Node.js client library for accessing News API


aylien_newsapi_php

AYLIEN's officially supported PHP client library for accessing News API


aylien_newsapi_python

AYLIEN's officially supported Python client library for accessing News API


aylien_newsapi_ruby

AYLIEN's officially supported Ruby client library for accessing News API


aylien_textapi_csharp

AYLIEN's officially supported .NET (C#) client library for accessing Text API


aylien_textapi_go

AYLIEN's officially supported Go client library for accessing Text API


aylien_textapi_java

AYLIEN's officially supported Java client library for accessing Text API


aylien_textapi_nodejs

AYLIEN's officially supported Node.js client library for accessing Text API


aylien_textapi_php

AYLIEN's officially supported PHP client library for accessing Text API


aylien_textapi_python

AYLIEN's officially supported Python client library for accessing Text API


aylien_textapi_ruby

AYLIEN's officially supported Ruby client library for accessing Text API


blacksheepwall

blacksheepwall is a hostname reconnaissance tool written in Go.


client-js

Import.io JavaScript client library


client-js-mini

A mini version of the import.io JavaScript client library


dateparser

Python parser for human-readable dates


diffbot-bash-client

A Diffbot API client for Bash


diffbot-c-client

A Diffbot API client for C


diffbot-clojure-client

Clojure interface to the Diffbot API http://www.diffbot.com/


diffbot-coffeescript-client

Diffbot API for node.js (CoffeeScript)


diffbot-cpp-client

A Diffbot API client for C++


diffbot-csharp-client

A Diffbot API client for C#


diffbot-dart-client

Diffbot API client for Dart


diffbot-delphi-client

A Diffbot API client for Delphi


diffbot-erlang-client

A Diffbot API client for Erlang


diffbot-excel-client

Using Diffbot from Microsoft Excel


diffbot-go-client

A Diffbot API client for Go


diffbot-google-apps-client

A Diffbot API Demo from a Google Apps Spreadsheet


diffbot-groovy-client

A Diffbot API client for Groovy


diffbot-haskell-client

Simple client for the Diffbot API


diffbot-js-client

A Diffbot API client for JavaScript


diffbot-lua-client

A Diffbot API client for Lua


diffbot-matlab-client

A Diffbot API client for MATLAB


diffbot-objc-client

A Diffbot API client for Objective-C


diffbot-octave-client

A Diffbot API client for Octave


diffbot-php-client

PHP interface for the Diffbot API


diffbot-plsql-client

Diffbot API client for PL/SQL


diffbot-powershell-client

A Diffbot API client for PowerShell


diffbot-prolog-client

A Diffbot API client for Prolog


diffbot-python-client

Python Diffbot API Client


diffbot-r-client

A Diffbot client library for the R language.


diffbot-ruby-client

Official Diffbot Ruby API Client


diffbot-rust-client

A Rust client library for the Diffbot API


diffbot-scala-client

A Diffbot API client for Scala


diffbot_rapidminer

A Diffbot client for RapidMiner 6.1 or above to analyze web pages.


extruct

Extract embedded metadata from HTML markup


frontera

A scalable frontier for web crawlers


havenondemand-angular

Angular.js IDOL OnDemand client


havenondemand-asp.net

Haven OnDemand client library for ASP.NET


havenondemand-go

Client library for Go to call Haven OnDemand APIs


havenondemand-ios-swift

Haven OnDemand library for iOS - Swift. Release V1.0


havenondemand-java

Haven OnDemand client library for Java


havenondemand-node

Node.js Client library to call Haven OnDemand APIs


havenondemand-php

Haven OnDemand library for PHP


havenondemand-python

Python client library for Haven OnDemand


havenondemand-r

Client library for R to call Haven OnDemand APIs


havenondemand-ruby

Ruby Gem to help call Haven OnDemand APIs.


havenondemand-salesforce

Client library for Salesforce to call Haven OnDemand APIs


headline_analysis

Analyzing news headlines for fun and profit


instarecon

Automated digital reconnaissance


irobot

robots.txt file inspection


knock

Knock Subdomain Scan


loginform

Fill HTML login forms automatically


magic-summary-tool

ScraperWiki tool to summarise stuff about any table of data


mdr

A Python library to detect and extract listing data from an HTML page.


pagemunch-node

A node.js module for the PageMunch web crawler API


pagemunch-objc

An Objective-C SDK for the PageMunch web crawler API


pagemunch-php

A PHP library for the PageMunch web crawler API


pagemunch-ruby

A Ruby gem for the PageMunch web crawler API


page_finder

Find which links on a web page are pagination links


pdfminer

Python PDF Parser


PhantomJS

PhantomJS is a headless WebKit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG.


phantomjs

Scriptable Headless WebKit


Portia

Portia is a tool for visually scraping websites without any programming knowledge. Just annotate web pages with a point-and-click editor to indicate what data you want to extract, and Portia will learn how to scrape similar pages from the site. Portia has a web-based UI served by a Twisted server, so you can install it on almost any modern platform.


portia

Visual scraping for Scrapy


PowerShellAylien

PowerShell Module to interact with AYLIEN Text Analysis API - a package consisting of eight differen


python-scrapinghub

A client interface for Scrapinghub's API


reppy

Modern robots.txt Parser for Python


saplo4java-2.0

Java Library for Saplo Text Analysis API 2.0


saplo4matlab-2.0

MATLAB Library for Saplo Text Analysis API 2.0


saplo4php-2.0

PHP Library for Saplo Text Analysis API 2.0


saplo4python-2.0

Python Library for Saplo Text Analysis API 2.0


scaws

Extensions for using Scrapy on Amazon AWS


scmongo

MongoDB extensions for Scrapy


scrapely

A pure-python HTML screen-scraping library


Scraper - Google Chrome Extension

Scraper is a simple data mining extension for Google Chrome™ that is useful for online research when you need to quickly analyze data in spreadsheet form.


scraperwiki-php

Non-packaged PHP libraries that ScraperWiki Classic uses


scraperwiki-python

ScraperWiki Python library for scraping and saving data


scraperwiki-ruby

ScraperWiki Ruby library for scraping and saving data


Scrapy

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.


scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.


scrapylib

Collection of Scrapy utilities (extensions, middlewares, pipelines, etc)


scrapyrt

Scrapy realtime


shub

Scrapinghub Command Line Client


splash

Lightweight, scriptable browser as a service with an HTTP API


spreadsheet-upload-tool

A ScraperWiki tool for uploading structured data from a CSV or Excel spreadsheet


subbrute

A DNS meta-query spider that enumerates DNS records and subdomains.


textrazor-java

Java SDK for the TextRazor Text Analytics API


textrazor-php

PHP SDK for the TextRazor Text Analytics API


textrazor-python

Python SDK for the TextRazor Text Analytics API


url-py

URL Transformation, Sanitization


If there is an open source tool that should be listed here, submit it as a GitHub issue, and I will consider adding it.