The boilerpipe library for Java provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
The Goose library is, according to its website, an HTML content / article extractor written in Scala.
Goose extracts more fields (published date, author, main image in the article, and a few more) than boilerpipe, which returns only the title and content.
Extract content from a page
The readability library: the extracted content is passable; it is slower on average than Goose but faster than boilerpipe.
Identifying large bodies of text via BeautifulSoup or other Python-based extractors.
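As a rough illustration of the "large bodies of text" idea, here is a minimal sketch that uses only Python's standard-library `html.parser` as a stand-in for BeautifulSoup, so it runs without third-party packages. The names `TextBlockParser`, `largest_text_blocks`, `SKIP_TAGS`, `BLOCK_TAGS`, and the `min_chars` threshold are hypothetical choices made for this example, and the heuristic (keep the longest block-level text runs, skip scripts and navigation) is far simpler than what boilerpipe or Goose do.

```python
from html.parser import HTMLParser

# Tags whose text is assumed never to be article content.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
# Block-level containers treated as candidate text bodies (an assumption).
BLOCK_TAGS = {"p", "div", "article", "section", "td"}


class TextBlockParser(HTMLParser):
    """Collect the text that appears between block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished text blocks
        self._buffer = []     # text pieces of the block being read
        self._skip_depth = 0  # > 0 while inside a skipped tag

    def _flush(self):
        # Join the buffered pieces and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._flush()
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Ignore text inside skipped tags such as <script> or <nav>.
        if self._skip_depth == 0:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def largest_text_blocks(html, min_chars=80):
    """Return text blocks of at least min_chars characters, longest first."""
    parser = TextBlockParser()
    parser.feed(html)
    parser.close()
    candidates = (b for b in parser.blocks if len(b) >= min_chars)
    return sorted(candidates, key=len, reverse=True)
```

Calling `largest_text_blocks(page_html)` on a fetched page returns the longest text runs first; in practice the first few entries tend to be article paragraphs, while menus and captions fall below the length threshold. The same heuristic could be written with BeautifulSoup by iterating over the block elements and measuring the length of each element's text.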