Aspects


Boilerpipe vs Goose



Boilerpipe

The boilerpipe library for Java provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
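To make the idea of separating main content from clutter concrete, here is a minimal Python sketch that uses only the standard library. It is not boilerpipe's actual algorithm or API. It is a simplified heuristic that keeps text blocks with enough words and a low share of link text. The names `BlockCollector` and `extract_main_text` and the thresholds `min_words` and `max_link_density` are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries in this sketch (an illustrative choice).
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "article", "section"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts the words that sit inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (total_words, link_words, text)
        self._words = []        # words of the block currently being built
        self._link_words = 0
        self._in_link = 0
        self._skip = 0

    def flush(self):
        # Close the current block, if it has any text, and start a new one.
        if self._words:
            self.blocks.append(
                (len(self._words), self._link_words, " ".join(self._words))
            )
        self._words, self._link_words = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)


def extract_main_text(html, min_words=10, max_link_density=0.3):
    """Keep blocks that are long enough and are not mostly link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    kept = [
        text
        for total, links, text in parser.blocks
        if total >= min_words and links / total <= max_link_density
    ]
    return "\n".join(kept)
```

Called on a page whose navigation bar consists of short, link-only blocks, `extract_main_text(html)` would drop the navigation and return the longer paragraphs. Real extractors use richer features, so treat the thresholds here as starting points only.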

Goose

The Goose library is, according to its website, an HTML content / article extractor written in Scala.



Others

Example

Goose extracts more fields (published date, author, main image in the article, and a few more) than boilerpipe (title, content).

from question  

Extract content from a page

2. The readability library: content quality is passable; slower on average than Goose but faster than boilerpipe.

from question  

Identifying large bodies of text via BeautifulSoup or other python based extractors

Data comes from Stack Exchange with CC-BY-SA-4.0