Instant jsoup How-to

By: Pete Houston

Overview of this book

Many Java libraries can parse HTML content. Jsoup is one of them, and it offers a richer feature set than most of the alternatives. Instant jsoup How-to provides simple, detailed instructions on how to use the Jsoup library to manipulate HTML content to suit your needs. You will learn the basics of data crawling, as well as the core concepts of Jsoup, so that you can make the best use of the library.

Instant jsoup How-to teaches step by step, using real-world, practical problems. You will begin with basic topics, such as getting input from a URL, a file, or a string, and using DOM navigation to search for data. You will then move on to advanced topics, such as using CSS selectors and cleaning dirty HTML data. HTML data is not always safe, so you will also learn how to sanitize untrusted documents to prevent XSS attacks.

Instant jsoup How-to is a book for every Java developer who wants to learn HTML manipulation quickly and effectively. It includes sample source code for you to refer to, with a detailed explanation of every feature of the library.

Transforming HTML elements (Must know)


Basically, an HTML parser does two things: extraction and transformation. The previous recipes covered extraction; this recipe covers transformation, that is, modifying a document.

How to do it...

In this section, I'm going to show you how to use the Jsoup library to modify the following HTML page:

<html>
  <head>
    <title>Section 04: Modify elements' contents</title>
  </head>
  <body>
    <h1>Jsoup: the HTML parser</h1>
  </body>
</html>

The goal is the following result, which contains a few minor changes:

<html>
  <head>
    <title>Section 04: Modify elements' contents</title>
    <meta charset="utf-8" />
  </head>
  <body class=" content">
    <h1>Jsoup: the HTML parser</h1>
    <p align="center">Author: Johnathan Hedley</p>
    <p>It is a very powerful HTML parser! I love it so much...</p>
  </body>
</html>

Perform the following tasks:

  • Add a <meta> tag to <head>

  • Add a <p> tag for body content description

  • Add a <p> tag for body content author

  • Add an attribute to the <p> tag of the author

  • Add the class for the <body> tag

These tasks are implemented as follows:

  1. Add a <meta> tag to <head>.

    Element tagMetaCharset = new Element(Tag.valueOf("meta"), "");
    tagMetaCharset.attr("charset", "utf-8");
    doc.head().appendChild(tagMetaCharset);

  2. Add a <p> tag for body content description.

    Element tagPDescription = new Element(Tag.valueOf("p"), "");
    tagPDescription.text("It is a very powerful HTML parser! I love it so much...");
    doc.body().appendChild(tagPDescription);

  3. Add a <p> tag for body content author.

    tagPDescription.before("<p>Author: Johnathan Hedley</p>");

  4. Add an attribute to the <p> tag of the author.

    Element tagPAuthor = doc.body().select("p:contains(Author)").first();
    tagPAuthor.attr("align", "center");

  5. Add a class for the <body> tag.

    doc.body().addClass("content");
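
The five steps above can be combined into one small program. The following is a minimal sketch rather than the book's own Section04 source: the class name TransformSketch, the transform() helper, and the inlined HTML constant are assumptions made for illustration, and the input page is embedded as a string instead of being read from a file.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Tag;

// Hypothetical wrapper class; only the jsoup calls come from the recipe.
final class TransformSketch {

    // The input page from this recipe, inlined as a string (an assumption).
    static final String HTML =
        "<html><head><title>Section 04: Modify elements' contents</title></head>"
      + "<body><h1>Jsoup: the HTML parser</h1></body></html>";

    static Document transform(String html) {
        Document doc = Jsoup.parse(html);

        // 1. Create <meta charset="utf-8"> and append it to <head>.
        Element tagMetaCharset = new Element(Tag.valueOf("meta"), "");
        tagMetaCharset.attr("charset", "utf-8");
        doc.head().appendChild(tagMetaCharset);

        // 2. Create the description paragraph and append it to <body>.
        Element tagPDescription = new Element(Tag.valueOf("p"), "");
        tagPDescription.text("It is a very powerful HTML parser! I love it so much...");
        doc.body().appendChild(tagPDescription);

        // 3. Insert the author paragraph as the preceding sibling.
        tagPDescription.before("<p>Author: Johnathan Hedley</p>");

        // 4. Find the author paragraph and set its align attribute.
        Element tagPAuthor = doc.body().select("p:contains(Author)").first();
        tagPAuthor.attr("align", "center");

        // 5. Add the class to <body>.
        doc.body().addClass("content");
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(transform(HTML).html());
    }
}
```

The exact whitespace of the printed HTML depends on the Jsoup version and its output settings, so it may not match the target listing character for character.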

The complete example source code for this section is available at \source\Section04.

How it works...

The original page has no <meta> tag, so we need to create a new Element to represent it.

Element tagMetaCharset = new Element(Tag.valueOf("meta"), "");
tagMetaCharset.attr("charset", "utf-8");

The Element constructor takes two parameters: a Tag object and the base URI of the element. The base URI is usually left as an empty string; supply one only when you want to specify where the element belongs, for example so that relative URLs in its attributes can be resolved. Note that the Tag class has no public constructor, so you create a Tag object through the static method Tag.valueOf(String tagName).

In the next line, the attr(String key, String value) method sets an attribute, where key is the name of the attribute and value is its value.
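
To make the role of the base URI more concrete, here is a small sketch that is not part of the recipe's source. It relies on the absUrl(String attributeKey) method, which this recipe does not cover, and the class name BaseUriSketch, the helper absoluteHref(), and the example.com address are placeholders chosen for illustration.

```java
import org.jsoup.nodes.Element;
import org.jsoup.parser.Tag;

// Hypothetical helper class, used only to illustrate the base URI parameter.
final class BaseUriSketch {

    // Builds a detached <a href="/page"> element with the given base URI
    // and returns the absolute form of its href attribute.
    static String absoluteHref(String baseUri) {
        Element link = new Element(Tag.valueOf("a"), baseUri);
        link.attr("href", "/page");
        return link.absUrl("href");
    }

    public static void main(String[] args) {
        System.out.println(absoluteHref("http://example.com/"));
        System.out.println(absoluteHref("").isEmpty());
    }
}
```

With the placeholder base URI, the relative href can be resolved against it. With an empty base URI, absUrl() has nothing to resolve against and returns an empty string, which is harmless for elements such as <meta> or <p> that carry no relative URLs.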

doc.head().appendChild(tagMetaCharset);

Instead of making you look up the <head> or <body> tag, Jsoup provides the head() and body() methods to get these two elements directly, which makes it convenient to append a new child to the <head> tag. If you want to insert the <meta> tag before <title>, use the prependChild() method instead. appendChild() adds the new element as the last child of its parent, while prependChild() adds it as the first child.
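
The difference between the two calls is easiest to see by listing the children afterwards. The sketch below is illustrative only: the class name ChildOrderSketch, the helper headChildTags(), and the extra <link> element are assumptions, not part of the recipe.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Tag;

// Hypothetical demo class comparing prependChild() and appendChild().
final class ChildOrderSketch {

    // Returns the tag names of the <head> children, in document order.
    static List<String> headChildTags() {
        Document doc = Jsoup.parse(
            "<html><head><title>Demo</title></head><body></body></html>");

        // prependChild() puts <meta> before the existing <title>.
        doc.head().prependChild(new Element(Tag.valueOf("meta"), ""));
        // appendChild() puts <link> after the existing <title>.
        doc.head().appendChild(new Element(Tag.valueOf("link"), ""));

        List<String> tags = new ArrayList<>();
        for (Element child : doc.head().children()) {
            tags.add(child.tagName());
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(headChildTags());
    }
}
```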

Element tagPDescription = new Element(Tag.valueOf("p"), "");
tagPDescription.text("It is a very powerful HTML parser! I love it so much...");
doc.body().appendChild(tagPDescription);

The second task uses essentially the same code: create a <p> element, set its text with text(String text), and append it to <body>.

Sometimes, creating objects and adding them to their parents is more work than necessary. Jsoup also lets you go the other way around and add elements directly from an HTML string.

tagPDescription.before("<p>Author: Johnathan Hedley</p>");

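The third task is done by inserting an HTML string directly as a sibling of the description <p> tag. The before(String html) method parses the string and inserts the result immediately before the element it is called on. It is similar to prependChild(Node node), but it inserts a preceding sibling rather than a first child; there is also a before(Node node) overload that takes an existing node.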
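
As a rough illustration of sibling insertion, the sketch below uses before(String html) together with its counterpart after(String html), which this recipe does not use. The class name SiblingSketch, the helper bodyTexts(), and the sample paragraphs are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class for inserting siblings from HTML strings.
final class SiblingSketch {

    // Returns the text of each <body> child, in document order.
    static List<String> bodyTexts() {
        Document doc = Jsoup.parse("<p id=\"desc\">Description</p>");
        Element desc = doc.getElementById("desc");

        // before() inserts the parsed HTML as the preceding sibling,
        // after() inserts it as the following sibling.
        desc.before("<p>Author</p>");
        desc.after("<p>Footer</p>");

        List<String> texts = new ArrayList<>();
        for (Element child : doc.body().children()) {
            texts.add(child.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(bodyTexts());
    }
}
```

Both calls need the element to have a parent already, which is why the recipe appends tagPDescription to <body> before calling before() on it.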

The next task is to add the align="center" attribute to the author <p> tag that we've just added. By now you have learned various ways to navigate to this tag; I chose an easy one: a CSS selector that gets the first <p> tag containing the text Author.

Element tagPAuthor = doc.body().select("p:contains(Author)").first();
tagPAuthor.attr("align", "center");

The first line uses the :contains(text) pseudo-selector to find the matching <p> elements and takes the first one; the second line adds the attribute to it.
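
One detail worth guarding against is that first() returns null when nothing matches the selector. The following sketch wraps the lookup in a null check; the class name ContainsSketch and the helper centerFirstParagraphContaining() are hypothetical, and the search text is assumed not to contain parentheses, since it is concatenated into the selector as-is.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class around the :contains(text) pseudo-selector.
final class ContainsSketch {

    // Centers the first <p> whose text contains the given text.
    // Returns false, and changes nothing, when no paragraph matches.
    static boolean centerFirstParagraphContaining(Document doc, String text) {
        Element p = doc.body().select("p:contains(" + text + ")").first();
        if (p == null) {
            return false;
        }
        p.attr("align", "center");
        return true;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse("<p>Author: Johnathan Hedley</p><p>Other</p>");
        // :contains matching is case-insensitive, so "author" also matches.
        System.out.println(centerFirstParagraphContaining(doc, "author"));
        System.out.println(centerFirstParagraphContaining(doc, "Publisher"));
    }
}
```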

The final task can easily be achieved by using the addClass(String classname) method:

doc.body().addClass("content");

If you try to add a class name that already exists, nothing is added, because Jsoup ensures that a class name appears only once in an element.
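
A quick way to check this behavior is to add the same name more than once and count the class names that end up in the attribute. The class name ClassNameSketch and the helper classCountAfterAdding() in this sketch are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class showing that addClass() ignores duplicates.
final class ClassNameSketch {

    // Adds each name to <body> and returns how many class names
    // the class attribute holds afterwards.
    static int classCountAfterAdding(String... names) {
        Document doc = Jsoup.parse("<html><head></head><body></body></html>");
        for (String name : names) {
            doc.body().addClass(name);
        }
        // Trim first, so stray leading whitespace is not counted as a name.
        String classAttr = doc.body().className().trim();
        return classAttr.isEmpty() ? 0 : classAttr.split("\\s+").length;
    }

    public static void main(String[] args) {
        System.out.println(classCountAfterAdding("content", "content"));
        System.out.println(classCountAfterAdding("content", "main", "content"));
    }
}
```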

There's more...

What you have just seen is only a demonstration of the Jsoup library's capabilities for manipulating the contents of HTML elements through some common methods.

You will find more useful and convenient methods while working with Jsoup through its API reference page.