If you need custom software development based on this content extractor, please email me at smpolshin@gmail.com
This software extracts the main text content from HTML pages. Many webmasters have pages whose content needs regular updating but do not want to find and extract the new content by hand. This content extractor was built for that case.
Choose any site, send a GET request to the dragon extractor, and it returns the page's main text content, much like the boilerpipe extractor.
Usage
The default settings usually work fairly well, but you can adjust the extraction parameters to suit your needs. First, you can choose among several extraction strategies. Second, you can choose among several output modes. These options are specified using additional GET parameters:
Simply send a GET request to http://allextract.appspot.com/parse?url=http://someurl?&extractor=ArticleExtractor&output=htmlFragment
To change the extraction strategy, add the extractor parameter, with one of the following values:
Strategy | Description |
---|---|
ArticleExtractor | (default). Uses ArticleExtractor: A full-text extractor which is tuned towards news articles. In this scenario it achieves higher accuracy than DefaultExtractor. |
DefaultExtractor | Uses DefaultExtractor: A quite generic full-text extractor, but usually not as good as ArticleExtractor. |
LargestContentExtractor | Uses LargestContentExtractor: Like DefaultExtractor, but only keeps the largest content block. Good for non-article style texts with only one main content block. |
KeepEverythingExtractor | Uses KeepEverythingExtractor: Treats everything as "content". Useful to track down SAX parsing errors. |
CanolaExtractor | Uses CanolaExtractor: A full-text extractor. |
To change the output format, add the output parameter, with one of the following values:
Output Format | Description |
---|---|
html | (default). Output the whole HTML document and highlight the extracted main content |
htmlFragment | Output only those HTML fragments that are regarded as main content |
text | Output the extracted main content as plain text |
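The request format described above can be sketched as a small helper. This is a minimal sketch, not part of the service: `build_request_url` and `fetch_main_content` are hypothetical names, the endpoint and the parameter values are copied from the example request and the two tables above, and it assumes the service accepts a percent-encoded `url` parameter and answers in UTF-8.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the example request above.
ENDPOINT = "http://allextract.appspot.com/parse"

# Values listed in the two tables above.
EXTRACTORS = {
    "ArticleExtractor",
    "DefaultExtractor",
    "LargestContentExtractor",
    "KeepEverythingExtractor",
    "CanolaExtractor",
}
OUTPUTS = {"html", "htmlFragment", "text"}


def build_request_url(page_url, extractor="ArticleExtractor", output="html"):
    """Return the GET URL for extracting `page_url` (hypothetical helper)."""
    if extractor not in EXTRACTORS:
        raise ValueError(f"unknown extractor: {extractor!r}")
    if output not in OUTPUTS:
        raise ValueError(f"unknown output format: {output!r}")
    # urlencode percent-escapes the target URL so that its own '?' and '&'
    # are not mistaken for parameters of the extractor request.
    query = urlencode({"url": page_url, "extractor": extractor, "output": output})
    return f"{ENDPOINT}?{query}"


def fetch_main_content(page_url, extractor="ArticleExtractor", output="text", timeout=10):
    """Send the GET request and return the response body as a string."""
    request_url = build_request_url(page_url, extractor, output)
    with urlopen(request_url, timeout=timeout) as response:
        # Assumption: the service answers in UTF-8.
        return response.read().decode("utf-8", errors="replace")
```

For example, `build_request_url("http://someurl", output="htmlFragment")` produces a request equivalent to the one shown above, and `fetch_main_content("http://someurl")` would return the extracted plain text if the service is reachable. Rejecting unknown parameter values locally keeps typos from turning into silent fallbacks on the server side.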
Christian Kohlschütter, Peter Fankhauser and Wolfgang Nejdl,
Boilerplate Detection using Shallow Text Features,
WSDM 2010 -- The Third ACM International Conference on Web Search and Data Mining, New York City, NY, USA.
ABSTRACT. In addition to the actual content Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved, at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal to retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach by straight-forward heuristics, achieving a remarkable accuracy.
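To make the idea of classifying text elements by shallow features concrete, here is a minimal sketch. It is not the paper's classifier or Boilerpipe's implementation: the block-splitting tag list, the two features (word count and link density), and both thresholds are illustrative assumptions, and all names are hypothetical. Scripts and styles are not treated specially.

```python
from html.parser import HTMLParser

# Tags that start a new text block in this sketch (an assumption, not the paper's list).
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "ul", "ol", "table", "tr"}


class BlockFeatureParser(HTMLParser):
    """Split HTML into text blocks and record two shallow features per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished blocks: (text, num_words, link_density)
        self._words = []      # words of the block being built
        self._link_words = 0  # how many of those words sit inside <a>...</a>
        self._in_link = 0     # nesting depth of open <a> tags

    def _flush(self):
        # Close the current block, if it has any words, and start a new one.
        if self._words:
            n = len(self._words)
            self.blocks.append((" ".join(self._words), n, self._link_words / n))
        self._words, self._link_words = [], 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()


def classify_blocks(html, min_words=10, max_link_density=0.33):
    """Label each block 'content' or 'boilerplate' using illustrative thresholds."""
    parser = BlockFeatureParser()
    parser.feed(html)
    parser.close()
    return [
        (text, "content" if n >= min_words and density <= max_link_density else "boilerplate")
        for text, n, density in parser.blocks
    ]
```

On a page with a short navigation bar made of links followed by a long paragraph, this sketch labels the navigation block as boilerplate (few words, high link density) and the paragraph as content. Both features are computed from a single pass over one document, with no site-level information.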
L3S-GN1 dataset
The data is available online, free of charge, but for research purposes only. To access the dataset, follow the instructions at the login prompt.
Please check out Boilerpipe, the boilerplate removal library based upon the paper.
The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a website.
The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings. Extracting content is very fast (milliseconds), needs only the input document (no global or site-level information is required), and is usually quite accurate.
Boilerpipe is a Java library written by Christian Kohlschütter. It is released under the Apache License 2.0.