Java Data Science Cookbook

By: Rushdi Shams

Overview of this book

If you are looking to build data science models that are good for production, Java has come to the rescue. With the aid of strong libraries such as MLlib, Weka, DL4j, and more, you can efficiently perform all the data science tasks you need to. This unique book provides modern recipes to solve your common and not-so-common data science-related problems. We start with recipes to help you obtain, clean, index, and search data. Then you will learn a variety of techniques to analyze, learn from, and retrieve information from data. You will also understand how to handle big data, learn deeply from data, and visualize data. Finally, you will work through unique recipes that solve your problems while taking data science to production, writing distributed data science applications, and much more - things that will come in handy at work.

Extracting web data from a URL using JSoup


A large amount of data can be found on the Web today. This data may be structured, semi-structured, or unstructured, so different techniques are needed to extract it. One of the easiest and handiest ways to extract web data is to use an external Java library named JSoup. This recipe uses several of the methods that JSoup offers to extract web data.

Getting ready

In order to perform this recipe, we will require the following:

  1. Go to https://jsoup.org/download, and download the jsoup-1.9.2.jar file. Add the JAR file to your Eclipse project as an external library.

  2. If you prefer Maven, follow the instructions on the download page to add JSoup as a dependency of your project instead.
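
For Maven users, the dependency declaration typically looks like the following pom.xml fragment. This is a sketch that assumes the same 1.9.2 version named in step 1; check the download page for the exact snippet and for newer versions.

```xml
<!-- Assumption: version 1.9.2, matching the JAR named in step 1 -->
<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.9.2</version>
</dependency>
```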

How to do it...

  1. Create a method named extractDataWithJsoup(String href). The parameter is the URL of the web page from which we will extract data; you supply it when you call the method:

            public void extractDataWithJsoup(String href){  
    
  2. Call the connect() method with the URL that we want to connect to (and extract data from), and then chain a few more methods to it. First, chain the timeout() method, which takes the timeout in milliseconds as its parameter. The next two methods set the user-agent name for this connection and specify whether HTTP error responses (for example, 404 or 500) are ignored instead of being raised as exceptions. The last method in the chain is get(), which returns a Document object. We hold this returned object in doc, a variable of the Document class:

            doc = Jsoup.connect(href).timeout(10*1000)
                       .userAgent("Mozilla").ignoreHttpErrors(true).get();

  3. As the get() method can throw an IOException, we wrap the call in a try...catch block as follows:

            Document doc = null; 
            try { 
                doc = Jsoup.connect(href).timeout(10*1000)
                           .userAgent("Mozilla").ignoreHttpErrors(true).get(); 
            } catch (IOException e) { 
                // Your exception handling here 
            } 
    

    Tip

    Times expressed in milliseconds are hard to read at a glance. It is therefore good practice to write 10*1000 to denote 10 seconds when the time unit is milliseconds. This makes the code more readable.

  4. A Document object offers a large number of methods. To extract the title of the web page, use the title() method as follows (the if block opened here is closed in step 8):

             if(doc != null){ 
              String title = doc.title(); 
    
  5. To extract only the textual part of the web page, chain the body() method with the text() method of a Document object, as follows:

            String text = doc.body().text();
    
  6. To extract all the hyperlinks in a web page, use the select() method of a Document object with a[href] as its parameter. This gives you all the links at once:

            Elements links = doc.select("a[href]"); 
    
  7. To process the links in a web page individually, iterate over the selected links and read the parts of each link that you need:

            for (Element link : links) { 
                String linkHref = link.attr("href"); 
                String linkText = link.text(); 
                String linkOuterHtml = link.outerHtml(); 
                String linkInnerHtml = link.html();  
                System.out.println(linkHref + "\t" + linkText + "\t" +  
                    linkOuterHtml + "\t" + linkInnerHtml);       
            }  
    
  8. Finally, close the if block and then the method, each with a closing brace:

        } 
        }  
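
The readability tip in step 3 can be taken one step further by naming the unit conversion explicitly. The following is a minimal sketch that uses only the standard library; the class name TimeoutHelper and the method secondsToMillis are hypothetical names chosen for illustration and are not part of JSoup.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper (not part of JSoup): converts a timeout given in
// seconds to the millisecond value that the recipe writes as 10*1000.
class TimeoutHelper {

    // Converts whole seconds to milliseconds as an int.
    // Throws ArithmeticException if the result does not fit in an int.
    static int secondsToMillis(int seconds) {
        if (seconds < 0) {
            throw new IllegalArgumentException("seconds must not be negative");
        }
        return Math.toIntExact(TimeUnit.SECONDS.toMillis(seconds));
    }

    public static void main(String[] args) {
        int timeoutMillis = secondsToMillis(10);
        System.out.println(timeoutMillis); // prints 10000, the same value as 10*1000
    }
}
```

With such a helper on the classpath, the value could be passed wherever the recipe writes 10*1000, for example timeout(TimeoutHelper.secondsToMillis(10)).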

The complete method, its class, and the driver method are as follows:

import java.io.IOException; 
import org.jsoup.Jsoup; 
import org.jsoup.nodes.Document; 
import org.jsoup.nodes.Element; 
import org.jsoup.select.Elements; 
 
public class JsoupTesting { 
   public static void main(String[] args){ 
      JsoupTesting test = new JsoupTesting(); 
      test.extractDataWithJsoup("Website address preceded by http://"); 
   } 
 
   public void extractDataWithJsoup(String href){ 
      Document doc = null; 
      try { 
         doc = Jsoup.connect(href).timeout(10*1000)
                    .userAgent("Mozilla").ignoreHttpErrors(true).get(); 
      } catch (IOException e) { 
         //Your exception handling here 
      } 
      if(doc != null){ 
         String title = doc.title(); 
         String text = doc.body().text(); 
         Elements links = doc.select("a[href]"); 
         for (Element link : links) { 
            String linkHref = link.attr("href"); 
            String linkText = link.text(); 
            String linkOuterHtml = link.outerHtml(); 
            String linkInnerHtml = link.html(); 
            System.out.println(linkHref + "\t" + linkText + "\t" + 
                linkOuterHtml + "\t" + linkInnerHtml); 
         } 
      } 
   } 
}
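
Note that link.attr("href") returns the attribute exactly as it is written in the page, so relative links such as /about come back unresolved. JSoup can resolve them itself through the abs:href attribute key or the absUrl("href") method. To illustrate what that resolution does, here is a minimal standard-library sketch; the class LinkResolver, its resolve method, and the example.com addresses are hypothetical and serve only as a demonstration.

```java
import java.net.URI;

// Hypothetical helper (not part of JSoup): resolves an href value found in
// a page against the URL of that page, using only the standard library.
class LinkResolver {

    // Returns the absolute form of href relative to pageUrl.
    // URI.create throws IllegalArgumentException for malformed input,
    // for example an href that contains unencoded spaces.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Assumed example address, used only for illustration
        String page = "http://example.com/docs/index.html";
        System.out.println(resolve(page, "/about"));      // root-relative link
        System.out.println(resolve(page, "guide.html"));  // link relative to the page
        System.out.println(resolve(page, "http://other.example.org/x")); // already absolute
    }
}
```

The three calls cover the common cases: a root-relative link, a link relative to the current page, and a link that is already absolute and is therefore returned unchanged.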