A large amount of data can nowadays be found on the Web. This data is sometimes structured, semi-structured, or even unstructured, so very different techniques are needed to extract it. There are many different ways to extract web data; one of the easiest and handiest is to use an external Java library named JSoup. This recipe uses a number of the methods offered by JSoup to extract web data.
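To give a first feel for the JSoup API before fetching anything from the Web, the following minimal sketch (the HTML string and class name are made-up examples) parses a small piece of HTML held in memory and reads its title and paragraph text; the recipe below applies the same methods to a live web page:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class JsoupQuickLook {
    public static void main(String[] args) {
        // A tiny, made-up HTML snippet parsed directly from a String
        String html = "<html><head><title>Sample</title></head>"
                + "<body><p>Hello, JSoup!</p></body></html>";
        Document doc = Jsoup.parse(html);
        // title() and select(...).text() behave the same way on a parsed
        // String as they do on a document fetched from the Web
        System.out.println(doc.title());             // prints: Sample
        System.out.println(doc.select("p").text());  // prints: Hello, JSoup!
    }
}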
In order to perform this recipe, we will require the jsoup-1.9.2.jar file. Add the JAR file to your Eclipse project as an external library.
First, create a method named extractDataWithJsoup(String url). The parameter is the URL of any web page from which you want to extract data; you pass it in when you call the method:
public void extractDataWithJsoup(String href){
We will call JSoup's connect() method, passing it the URL that we want to connect to (and extract data from), and then chain a few more methods to it. First, we chain the timeout() method, which takes a duration in milliseconds as its parameter. The methods after that set the user-agent name for this connection and specify whether attempts will be made to ignore connection errors. The last method to chain is the get() method, which eventually returns a Document object. We will hold this returned object in doc, a variable of the Document class:
doc = Jsoup.connect(href).timeout(10*1000).userAgent
("Mozilla").ignoreHttpErrors(true).get();
As this code can throw an IOException, we will use a try...catch block as follows:
Document doc = null;
try {
doc = Jsoup.connect(href).timeout(10*1000).userAgent
("Mozilla").ignoreHttpErrors(true).get();
} catch (IOException e) {
//Your exception handling here
}
We are not used to seeing time expressed in milliseconds, so when milliseconds are the time unit in code, it is good practice to write 10*1000 to denote 10 seconds. This enhances the readability of the code.
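If you prefer, the same idea can be captured with a named constant; this is only a readability sketch (the constant and class names are illustrative, not part of the recipe's code):
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class TimeoutExample {
    // 10 * 1000 milliseconds = 10 seconds
    private static final int CONNECT_TIMEOUT_MILLIS = 10 * 1000;
    public Document fetch(String href) throws IOException {
        // Same connection chain as in the recipe, with the timeout named
        return Jsoup.connect(href)
                .timeout(CONNECT_TIMEOUT_MILLIS)
                .userAgent("Mozilla")
                .ignoreHttpErrors(true)
                .get();
    }
}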
The rest of the processing depends on this Document object, so we only proceed if doc is not null. If you want to extract the title of the web page, you can use the title() method as follows:
if(doc != null){
String title = doc.title();
To extract the entire textual content of the web page, we chain the body() method with the text() method of the Document object, as follows:
String text = doc.body().text();
To extract all the hyperlinks on the web page, we use the select() method of the Document object with the "a[href]" parameter. This gives you all the links at once:
Elements links = doc.select("a[href]");
for (Element link : links) {
String linkHref = link.attr("href");
String linkText = link.text();
String linkOuterHtml = link.outerHtml();
String linkInnerHtml = link.html();
System.out.println(linkHref + "\t" + linkText + "\t" +
linkOuterHtml + "\t" + linkInnerHtml);
}
}
}
The complete method, its class, and the driver method are as follows:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class JsoupTesting {
public static void main(String[] args){
JsoupTesting test = new JsoupTesting();
test.extractDataWithJsoup("Website address preceded by http://");
}
public void extractDataWithJsoup(String href){
Document doc = null;
try {
doc = Jsoup.connect(href).timeout(10*1000).userAgent
("Mozilla").ignoreHttpErrors(true).get();
} catch (IOException e) {
//Your exception handling here
}
if(doc != null){
String title = doc.title();
String text = doc.body().text();
Elements links = doc.select("a[href]");
for (Element link : links) {
String linkHref = link.attr("href");
String linkText = link.text();
String linkOuterHtml = link.outerHtml();
String linkInnerHtml = link.html();
System.out.println(linkHref + "\t" + linkText + "\t" +
linkOuterHtml + "\t" + linkInnerHtml);
}
}
}
}
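Because select() accepts any CSS-style selector, the same pattern can be adapted to pull out elements other than hyperlinks. The following sketch, which assumes you substitute a page of your choice (https://example.com is only a placeholder), collects the absolute URLs of all images on a page:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class JsoupImageLinks {
    public static void main(String[] args) throws IOException {
        // https://example.com is only a placeholder; use any page you like
        Document doc = Jsoup.connect("https://example.com")
                .timeout(10 * 1000)
                .userAgent("Mozilla")
                .ignoreHttpErrors(true)
                .get();
        // img[src] selects every <img> element that has a src attribute;
        // the abs: prefix asks JSoup to resolve each URL against the page's base URL
        Elements images = doc.select("img[src]");
        for (Element image : images) {
            System.out.println(image.attr("abs:src"));
        }
    }
}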