This software extracts the main text content from HTML pages. Many webmasters have pages whose content needs updating, but do not want to find and extract the new content by hand. A content extractor such as this one automates that step.

Choose any site and send a GET request to Dragon extractor; it returns the page's main text content, much like the boilerpipe extractor.

Dragon extractor

Usage

The default settings usually work fairly well, but you can adjust the extraction parameters to suit your needs. First, you can choose among several extraction strategies. Second, you can choose among several output modes. These options are specified using additional GET parameters:

Simply send a request to http://allextract.appspot.com/parse?url=http://someurl?&extractor=ArticleExtractor&output=htmlFragment
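The request URL above can also be assembled programmatically. The following is a minimal sketch in Python, not part of the service itself: the helper name `build_parse_url` is hypothetical, and it assumes the service accepts a percent-encoded `url` parameter, as is conventional for GET parameters that contain another URL.

```python
from urllib.parse import urlencode

# Endpoint taken from the example request above.
SERVICE_ENDPOINT = "http://allextract.appspot.com/parse"


def build_parse_url(page_url, extractor=None, output=None):
    """Return a request URL for the service (hypothetical helper).

    Optional parameters are left out when None, so the service
    falls back to its documented defaults.
    """
    params = {"url": page_url}
    if extractor is not None:
        params["extractor"] = extractor
    if output is not None:
        params["output"] = output
    # urlencode percent-encodes the target URL, so its own '?' and '&'
    # are not mistaken for parameters of the service request.
    return SERVICE_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_parse_url("http://example.com/a?b=1",
                          extractor="ArticleExtractor",
                          output="htmlFragment"))
```

Encoding the target URL matters mainly when it carries its own query string; for a plain URL without `?` or `&` the unencoded form shown above is usually read the same way.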

To change the extraction strategy, add the extractor parameter, with one of the following values:

Strategy	Description
ArticleExtractor	(default) Uses ArticleExtractor: a full-text extractor tuned towards news articles. In that scenario it achieves higher accuracy than DefaultExtractor.
DefaultExtractor	Uses DefaultExtractor: a quite generic full-text extractor, but usually not as good as ArticleExtractor.
LargestContentExtractor	Uses LargestContentExtractor: like DefaultExtractor, but keeps only the largest content block. Good for non-article texts with only one main content block.
KeepEverythingExtractor	Uses KeepEverythingExtractor: treats everything as "content". Useful for tracking down SAX parsing errors.
CanolaExtractor	Uses CanolaExtractor: a full-text extractor.

To change the output format, add the output parameter, with one of the following values:

Output Format	Description
html	(default). Output the whole HTML document and highlight the extracted main content
htmlFragment	Output only those HTML fragments that are regarded as main content
text	Output the extracted main content as plain text
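The two tables above fix the accepted values for `extractor` and `output`, so a client can check them before sending anything. The sketch below is one way to do that in Python; the function names `make_query` and `fetch_main_content` are hypothetical, and the code assumes the endpoint from the usage example is reachable and returns the result directly in the response body.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Values copied from the strategy and output format tables above.
EXTRACTORS = {
    "ArticleExtractor",
    "DefaultExtractor",
    "LargestContentExtractor",
    "KeepEverythingExtractor",
    "CanolaExtractor",
}
OUTPUTS = {"html", "htmlFragment", "text"}


def make_query(page_url, extractor="ArticleExtractor", output="html"):
    """Build the query string, rejecting values the tables do not list."""
    if extractor not in EXTRACTORS:
        raise ValueError("unknown extractor: %r" % extractor)
    if output not in OUTPUTS:
        raise ValueError("unknown output format: %r" % output)
    return urlencode({"url": page_url, "extractor": extractor, "output": output})


def fetch_main_content(page_url, extractor="ArticleExtractor", output="text",
                       timeout=10):
    """Send the GET request and return the response body as a string.

    Assumption: the service answers with the extracted content as the
    response body; error handling is left to the caller.
    """
    query = make_query(page_url, extractor, output)
    with urlopen("http://allextract.appspot.com/parse?" + query,
                 timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Validating on the client side gives a clear error for a misspelled strategy name instead of relying on however the service handles unknown values.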

Dragon extractor is based on the boilerpipe boilerplate extractor.

Boilerplate Detection using Shallow Text Features

Paper

Christian Kohlschütter, Peter Fankhauser and Wolfgang Nejdl,
Boilerplate Detection using Shallow Text Features,
WSDM 2010 -- The Third ACM International Conference on Web Search and Data Mining, New York City, NY, USA.

Download PDF

ABSTRACT. In addition to the actual content Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved, at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal to retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach by straight-forward heuristics, achieving a remarkable accuracy.

WSDM2010 presentation

L3S-GN1 dataset

The data is available online, for free but only for research purposes. Click here to access the dataset (please follow the instructions at the login prompt).

Code

Please check out Boilerpipe, the boilerplate removal library based upon the paper.

The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a website.

The library already provides specific strategies for common tasks (for example, news article extraction) and can easily be extended for individual problem settings. Extraction is very fast (milliseconds), needs only the input document (no global or site-level information is required), and is usually quite accurate.
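To give a feel for what "shallow text features" computed from a single document can look like, here is a toy sketch in Python. It is not boilerpipe's code and not the classifier from the paper: it assumes word count and link density as example features, splits blocks on a small hand-picked set of tags, and uses made-up thresholds purely for illustration.

```python
from html.parser import HTMLParser


class BlockFeatureParser(HTMLParser):
    """Collect (word_count, linked_word_count) for each text block."""

    # Hand-picked block-level tags; an assumption of this sketch.
    BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._words = 0
        self._linked = 0
        self._anchor_depth = 0

    def _flush(self):
        # Close the current block if it contains any words.
        if self._words:
            self.blocks.append((self._words, self._linked))
        self._words = 0
        self._linked = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._anchor_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag == "a":
            self._anchor_depth = max(0, self._anchor_depth - 1)
        elif tag in self.BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        count = len(data.split())
        self._words += count
        if self._anchor_depth:
            self._linked += count

    def close(self):
        super().close()
        self._flush()


def block_features(html):
    """Return word count and link density for every non-empty block."""
    parser = BlockFeatureParser()
    parser.feed(html)
    parser.close()
    return [{"words": w, "link_density": linked / w}
            for w, linked in parser.blocks]


def looks_like_content(features, min_words=10, max_link_density=0.3):
    """Toy rule with illustrative thresholds (not taken from the paper)."""
    return (features["words"] >= min_words
            and features["link_density"] <= max_link_density)
```

Under this toy rule, a short navigation bar made entirely of links is rejected, while a longer paragraph of unlinked text is kept. Only the input document is consulted, which matches the point above that no global or site-level information is required.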

Boilerpipe is a Java library written by Christian Kohlschütter. It is released under the Apache License 2.0.