An HTML parser does two things: extraction and transformation. The previous recipes covered extraction; this recipe covers transformation, that is, modifying the document.
In this section, I'm going to show you how to use the Jsoup library to modify the following HTML page:
<html>
  <head>
    <title>Section 04: Modify elements' contents</title>
  </head>
  <body>
    <h1>Jsoup: the HTML parser</h1>
  </body>
</html>
After a few minor changes, the result should look like this:
<html>
  <head>
    <title>Section 04: Modify elements' contents</title>
    <meta charset="utf-8" />
  </head>
  <body class=" content">
    <h1>Jsoup: the HTML parser</h1>
    <p align="center">Author: Johnathan Hedley</p>
    <p>It is a very powerful HTML parser! I love it so much...</p>
  </body>
</html>
Perform the following tasks:
Add a <meta> tag to <head>
Add a <p> tag for body content description
Add a <p> tag for body content author
Add an attribute to the <p> tag of the author
Add the class for the <body> tag
The previous tasks will be implemented in the following way:
Add a <meta> tag to <head>.
Element tagMetaCharset = new Element(Tag.valueOf("meta"), "");
tagMetaCharset.attr("charset", "utf-8");
doc.head().appendChild(tagMetaCharset);
Add a <p> tag for body content description.
Element tagPDescription = new Element(Tag.valueOf("p"), "");
tagPDescription.text("It is a very powerful HTML parser! I love it so much...");
doc.body().appendChild(tagPDescription);
Add a <p> tag for body content author.
tagPDescription.before("<p>Author: Johnathan Hedley</p>");
Add an attribute to the <p> tag of the author.
Element tagPAuthor = doc.body().select("p:contains(Author)").first();
tagPAuthor.attr("align", "center");
Add a class for the <body> tag.
doc.body().addClass("content");
The complete example source code for this section is available at \source\Section04.
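The five steps above can be put together into one small program. This is a minimal sketch rather than the book's actual source file: the class name ModifyElementsSketch and the modify helper are hypothetical, and the input page is inlined as a string instead of being loaded from disk.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Tag;

// Hypothetical class name; the input HTML is inlined here as an assumption.
class ModifyElementsSketch {

    // The input page from the start of this section.
    static final String HTML =
        "<html><head><title>Section 04: Modify elements' contents</title></head>"
        + "<body><h1>Jsoup: the HTML parser</h1></body></html>";

    static Document modify(String html) {
        Document doc = Jsoup.parse(html);

        // 1. Create <meta charset="utf-8"> and append it to <head>.
        Element tagMetaCharset = new Element(Tag.valueOf("meta"), "");
        tagMetaCharset.attr("charset", "utf-8");
        doc.head().appendChild(tagMetaCharset);

        // 2. Create the description <p> and append it to <body>.
        Element tagPDescription = new Element(Tag.valueOf("p"), "");
        tagPDescription.text("It is a very powerful HTML parser! I love it so much...");
        doc.body().appendChild(tagPDescription);

        // 3. Insert the author <p> as the sibling just before the description.
        tagPDescription.before("<p>Author: Johnathan Hedley</p>");

        // 4. Find the author <p> and set align="center" on it.
        Element tagPAuthor = doc.body().select("p:contains(Author)").first();
        tagPAuthor.attr("align", "center");

        // 5. Add the class to <body>.
        doc.body().addClass("content");
        return doc;
    }

    public static void main(String[] args) {
        // Print the modified document so it can be compared with the target page.
        System.out.println(modify(HTML).html());
    }
}
```

Returning the Document from modify keeps the steps easy to inspect one by one; the exact whitespace of the printed HTML depends on Jsoup's output settings and version.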
As you see, the <meta> tag doesn't exist, so we need to create a new Element that represents the <meta> tag.
Element tagMetaCharset = new Element(Tag.valueOf("meta"), "");
tagMetaCharset.attr("charset", "utf-8");
The Element constructor used here takes two parameters: a Tag object and the base URI of the element. The base URI is usually passed as an empty string when the element is created; supply a real one only when relative URLs inside the element need to be resolved against a known location. One thing worth remembering is that, in the Jsoup version used here, the Tag class has no public constructor, so a Tag object must be obtained through the static method Tag.valueOf(String tagName).
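To make the role of the base URI more concrete, here is a small sketch of how it affects URL resolution. The class name BaseUriSketch, the link helper, and the example.com address are all illustrative assumptions, not part of the recipe's source.

```java
import org.jsoup.nodes.Element;
import org.jsoup.parser.Tag;

// Hypothetical helper class, used only to illustrate the base URI parameter.
class BaseUriSketch {

    // Builds a detached <a href="/docs"> element with the given base URI.
    static Element link(String baseUri) {
        Element a = new Element(Tag.valueOf("a"), baseUri);
        a.attr("href", "/docs");
        return a;
    }

    public static void main(String[] args) {
        // With a base URI, absUrl() can resolve the relative href.
        System.out.println(link("http://example.com/").absUrl("href"));

        // With an empty base URI there is nothing to resolve against,
        // so absUrl() gives back an empty string; attr() still returns "/docs".
        System.out.println(link("").absUrl("href").isEmpty());
        System.out.println(link("").attr("href"));
    }
}
```

For elements such as <meta> or <p> that carry no URLs, the empty string used in this recipe is therefore a reasonable default.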
In the next line, the attr(String key, String value) method is used to set the attribute value, where key is the name of the attribute.
doc.head().appendChild(tagMetaCharset);
Rather than making you look up the <head> or <body> tag yourself, Jsoup provides two methods, doc.head() and doc.body(), that return these elements directly, which makes it very convenient to append a new child to the <head> tag. If you want to insert the <meta> tag before <title>, you can use the prependChild() method instead. A call to appendChild() adds the new element at the end of the parent's child list, while prependChild() adds it as the first child.
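The difference between the two calls is easiest to see by listing the children of <head> afterwards. The following is a sketch under assumptions: ChildOrderSketch and headOrder are hypothetical names, and the one-line input page is made up for the demonstration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Tag;

// Hypothetical class comparing appendChild() with prependChild().
class ChildOrderSketch {

    // Adds a <meta> to <head> and returns the child tag names in document order.
    static String headOrder(boolean prepend) {
        Document doc = Jsoup.parse(
            "<html><head><title>Demo</title></head><body></body></html>");
        Element meta = new Element(Tag.valueOf("meta"), "").attr("charset", "utf-8");

        if (prepend) {
            doc.head().prependChild(meta);   // becomes the first child
        } else {
            doc.head().appendChild(meta);    // becomes the last child
        }

        StringBuilder names = new StringBuilder();
        for (Element child : doc.head().children()) {
            if (names.length() > 0) {
                names.append(",");
            }
            names.append(child.tagName());
        }
        return names.toString();
    }

    public static void main(String[] args) {
        System.out.println(headOrder(false)); // title,meta
        System.out.println(headOrder(true));  // meta,title
    }
}
```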
Element tagPDescription = new Element(Tag.valueOf("p"), "");
tagPDescription.text("It is a very powerful HTML parser! I love it so much...");
doc.body().appendChild(tagPDescription);
The second task follows the same pattern: create a <p> element, set its content with text(String text), and append it to <body>.
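One detail worth knowing when setting content is that text(String) treats its argument as plain text, while html(String) parses it as markup. The sketch below illustrates the difference; TextVsHtmlSketch and its two helpers are hypothetical names introduced only for this example.

```java
import org.jsoup.nodes.Element;
import org.jsoup.parser.Tag;

// Hypothetical class contrasting text(String) with html(String).
class TextVsHtmlSketch {

    // Sets the content as plain text: any markup in s is escaped on output.
    static Element withText(String s) {
        return new Element(Tag.valueOf("p"), "").text(s);
    }

    // Sets the content as HTML: markup in s is parsed into child nodes.
    static Element withHtml(String s) {
        return new Element(Tag.valueOf("p"), "").html(s);
    }

    public static void main(String[] args) {
        String content = "I <b>love</b> it";
        // The <b> tag ends up escaped; the paragraph has no child elements.
        System.out.println(withText(content).outerHtml());
        // The <b> tag becomes a real child element of the paragraph.
        System.out.println(withHtml(content).outerHtml());
    }
}
```

For the description paragraph in this recipe, text(String) is the safer choice because the content is not meant to contain markup.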
Sometimes, you may find it too complicated to create objects and add them to their parents; Jsoup also supports the opposite approach, inserting new elements directly from an HTML string.
tagPDescription.before("<p>Author: Johnathan Hedley</p>");
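The third task is done by inserting an HTML string directly as a sibling of the description <p> tag. The before(String html) overload used here parses the string and inserts the result just before the element; before(Node node) does the same for an existing node. Both are similar to prependChild(Node node), but they insert siblings rather than children.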
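As an illustration of sibling insertion, the sketch below uses before(String html) together with its counterpart after(Node node). The class SiblingInsertSketch, the bodyOrder helper, the id value, and the paragraph texts are assumptions made up for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Tag;

// Hypothetical class showing sibling insertion with before() and after().
class SiblingInsertSketch {

    // Inserts one sibling on each side of a paragraph and returns the
    // texts of <body>'s children in document order, separated by "|".
    static String bodyOrder() {
        Document doc = Jsoup.parse("<body><p id=\"desc\">Description</p></body>");
        Element desc = doc.getElementById("desc");

        // String overload: the HTML is parsed and inserted before desc.
        desc.before("<p>Author</p>");

        // Node overload: an existing element is inserted after desc.
        Element footer = new Element(Tag.valueOf("p"), "").text("Footer");
        desc.after(footer);

        StringBuilder texts = new StringBuilder();
        for (Element child : doc.body().children()) {
            if (texts.length() > 0) {
                texts.append("|");
            }
            texts.append(child.text());
        }
        return texts.toString();
    }

    public static void main(String[] args) {
        System.out.println(bodyOrder()); // Author|Description|Footer
    }
}
```

Both methods need the target element to already have a parent, which is why the recipe appends the description paragraph to <body> before calling before().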
The next task is to add the align="center" attribute to the author <p> tag that we've just added. By now, you have seen various ways to navigate to this tag; I chose an easy one: a CSS selector that gets the first <p> tag containing the text Author.
Element tagPAuthor = doc.body().select("p:contains(Author)").first();
tagPAuthor.attr("align", "center");
The first statement uses the :contains(text) pseudo selector to find the author paragraph, and the second adds the attribute to it.
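Because first() returns null when the selector matches nothing, a slightly more defensive variant may be useful outside a controlled example. This is only a sketch: AuthorSelectorSketch and centerAuthor are hypothetical names, and returning a boolean is one possible design among several.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical class with a null-safe version of the author lookup.
class AuthorSelectorSketch {

    // Sets align="center" on the first <p> whose text contains "Author".
    // Returns false, leaving the document unchanged, when no such <p> exists.
    static boolean centerAuthor(Document doc) {
        // :contains(text) matches on element text and ignores case.
        Element author = doc.body().select("p:contains(Author)").first();
        if (author == null) {
            return false;
        }
        author.attr("align", "center");
        return true;
    }

    public static void main(String[] args) {
        Document withAuthor = Jsoup.parse("<p>Author: someone</p><p>Other text</p>");
        Document withoutAuthor = Jsoup.parse("<p>Nothing relevant here</p>");
        System.out.println(centerAuthor(withAuthor));
        System.out.println(centerAuthor(withoutAuthor));
    }
}
```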
The final task can easily be achieved by using the addClass(String classname) method:
doc.body().addClass("content");
If you try to add a class name that already exists, nothing changes, because Jsoup ensures that a class name appears only once in an element.
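The following sketch shows this behavior, along with the related hasClass and removeClass methods. The class ClassNamesSketch, the demo helper, and the extra class name wide are illustrative assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical class showing that addClass() does not create duplicates.
class ClassNamesSketch {

    // Returns a <body> that started with class="content" and then had
    // "content" added again plus one new class name.
    static Element demo() {
        Document doc = Jsoup.parse("<body class=\"content\"><h1>Title</h1></body>");
        Element body = doc.body();
        body.addClass("content"); // already present, so nothing is added
        body.addClass("wide");    // new name, so it is added
        return body;
    }

    public static void main(String[] args) {
        Element body = demo();
        System.out.println(body.classNames());        // the set of class names
        System.out.println(body.hasClass("content")); // membership check
        body.removeClass("wide");                     // removal is just as direct
        System.out.println(body.classNames());
    }
}
```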