The boilerpipe library for Java provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
The Goose library is, according to its website, an HTML content/article extractor written in Scala.
The readability library produces passable content; it is slower on average than Goose but faster than boilerpipe.
Identifying large bodies of text via BeautifulSoup or other Python-based extractors
Goose exposes more fields than boilerpipe: published date, author, main image in the article, and a few more, whereas boilerpipe returns title and content.
Extract content from a page
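
To make the idea of identifying a large body of text concrete, here is a minimal sketch. It assumes a simple heuristic: group the paragraphs (`<p>`) by their enclosing element and keep the group with the most text. This is not how Goose, boilerpipe, or readability work internally. It also uses Python's standard `html.parser` instead of BeautifulSoup so that it runs without third-party packages. The names `ParagraphGrouper` and `largest_text_body` are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose text is never readable content.
SKIP_TAGS = {"script", "style"}


class ParagraphGrouper(HTMLParser):
    """Groups the text of <p> elements by their enclosing container."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._open = []     # stack of (tag, container_id) for open containers
        self._inline = []   # tags opened inside the current paragraph
        self._next_id = 0
        self._in_p = False
        self._buf = []
        self._skip = 0
        self.groups = {}    # container_id -> list of paragraph strings

    def _flush_paragraph(self):
        # Normalise whitespace and file the paragraph under its container.
        text = " ".join("".join(self._buf).split())
        if text:
            parent = self._open[-1][1] if self._open else -1
            self.groups.setdefault(parent, []).append(text)
        self._buf, self._inline, self._in_p = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "p":
            if self._in_p:          # an unclosed <p> ends at the next <p>
                self._flush_paragraph()
            self._in_p = True
        elif self._in_p:
            self._inline.append(tag)
        else:
            self._open.append((tag, self._next_id))
            self._next_id += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "p":
            if self._in_p:
                self._flush_paragraph()
        elif self._in_p and tag in self._inline:
            # Closes an inline element inside the current paragraph.
            idx = len(self._inline) - 1 - self._inline[::-1].index(tag)
            del self._inline[idx:]
        else:
            # Closes a container: end any open paragraph, unwind the stack.
            for i in range(len(self._open) - 1, -1, -1):
                if self._open[i][0] == tag:
                    if self._in_p:
                        self._flush_paragraph()
                    del self._open[i:]
                    break

    def handle_data(self, data):
        if self._in_p and not self._skip:
            self._buf.append(data)

    def finish(self):
        self.close()
        if self._in_p:
            self._flush_paragraph()


def largest_text_body(html):
    """Returns the paragraphs of the container holding the most text."""
    parser = ParagraphGrouper()
    parser.feed(html)
    parser.finish()
    if not parser.groups:
        return ""
    best = max(parser.groups.values(), key=lambda ps: sum(len(p) for p in ps))
    return "\n\n".join(best)


if __name__ == "__main__":
    page = ("<div><p>Home</p></div>"
            "<div><p>A longer paragraph of article text.</p>"
            "<p>And a second one.</p></div>")
    print(largest_text_body(page))
```

A length-based score like this is easy to fool with long navigation lists or comment threads, which is why dedicated extractors combine it with other signals such as link density.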