We are one step closer to data crawling techniques, and this recipe is going to give you an idea of how to parse all the URLs within an HTML document. In this task, we are going to parse all the links on http://jsoup.org.
Load the document structure from the page:
Document doc = Jsoup.connect(URL_SOURCE).get();
Select all the URLs in the page:
Elements links = doc.select("a[href]");
Output the results:
for (Element url : links) {
    System.out.println(String.format("* [%s] : %s", url.text(), url.attr("abs:href")));
}
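Putting the steps together, here is a minimal runnable sketch. To keep it self-contained and runnable offline, it parses an inline HTML snippet with Jsoup.parse (passing http://jsoup.org/ as the base URI so abs:href can resolve relative links) instead of fetching the live page with Jsoup.connect; the HTML string is a made-up example, not the real page content:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ListLinks {
    public static void main(String[] args) {
        // Inline HTML stands in for Jsoup.connect(URL_SOURCE).get();
        // the base URI lets abs:href resolve relative links.
        String html = "<a href='/download'>Download</a>"
                    + "<a href='http://example.com'>Example</a>";
        Document doc = Jsoup.parse(html, "http://jsoup.org/");

        // Select every <a> tag that carries an href attribute.
        Elements links = doc.select("a[href]");
        for (Element url : links) {
            System.out.println(String.format("* [%s] : %s",
                    url.text(), url.attr("abs:href")));
        }
    }
}
```

To run it against the live site, replace the Jsoup.parse call with Jsoup.connect("http://jsoup.org/").get().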
The complete example source code for this section is available at \source\Section06.
Up to this point, you should already be familiar with CSS selectors and know how to extract content from a tag/node.
The sample code will select all <a> tags with an href attribute and print the output:
System.out.println(String.format("* [%s] : %s", url.text(), url.attr("abs:href")));
If you simply print the attribute value with url.attr("href"), the output will look exactly like the HTML source, which means some links are relative rather than absolute. The abs:href prefix asks Jsoup to resolve the attribute value against the document's base URI, producing an absolute URL.
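To see the difference between the two forms, here is a small sketch using an inline snippet with a hypothetical relative link (/cookbook/) and http://jsoup.org/ as the base URI:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsHrefDemo {
    public static void main(String[] args) {
        // The second argument is the base URI used for resolution.
        Document doc = Jsoup.parse(
                "<a href='/cookbook/'>Cookbook</a>", "http://jsoup.org/");
        Element link = doc.select("a[href]").first();

        // Raw attribute: exactly as written in the HTML source.
        System.out.println(link.attr("href"));     // /cookbook/
        // Resolved against the base URI.
        System.out.println(link.attr("abs:href")); // http://jsoup.org/cookbook/
    }
}
```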
In HTML, the <a> tag is not the only one that contains a URL; other tags, such as <img>, <script>, and <iframe>, do too. So how do we get their links? If you look at these tags, you can see that they share a common attribute, src. So the task is quite simple: retrieve all tags containing the src attribute:
Elements results = doc.select("[src]");
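The same abs: trick works for src. The following sketch (again using an inline snippet with made-up /logo.png and /app.js paths, and http://jsoup.org/ as the base URI) selects every element carrying a src attribute and prints its tag name alongside the resolved URL:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ListSrcs {
    public static void main(String[] args) {
        Document doc = Jsoup.parse(
                "<img src='/logo.png'><script src='/app.js'></script>",
                "http://jsoup.org/");

        // [src] matches any tag with a src attribute, whatever its name.
        Elements media = doc.select("[src]");
        for (Element el : media) {
            System.out.println(String.format("* <%s> : %s",
                    el.tagName(), el.attr("abs:src")));
        }
    }
}
```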
Note
The following is a very good link-listing example by the Jsoup author:
http://jsoup.org/cookbook/extracting-data/example-list-links