Sign In Start Free Trial
Account

Add to playlist

Create a Playlist

Modal Close icon
You need to login to use this feature.
  • Book Overview & Buying Java Data Science Cookbook
  • Table Of Contents Toc
  • Feedback & Rating feedback
Java Data Science Cookbook

Java Data Science Cookbook

By : Shams
close
close
Java Data Science Cookbook

Java Data Science Cookbook

By: Shams

Overview of this book

If you are looking to build data science models that are good for production, Java has come to the rescue. With the aid of strong libraries such as MLlib, Weka, DL4j, and more, you can efficiently perform all the data science tasks you need to. This unique book provides modern recipes to solve your common and not-so-common data science-related problems. We start with recipes to help you obtain, clean, index, and search data. Then you will learn a variety of techniques to analyze, learn from, and retrieve information from data. You will also understand how to handle big data, learn deeply from data, and visualize data. Finally, you will work through unique recipes that solve your problems while taking data science to production, writing distributed data science applications, and much more - things that will come in handy at work.
Table of Contents (10 chapters)
close
close

Extracting web data from a URL using JSoup

A large amount of data, nowadays, can be found on the Web. This data is sometimes structured, semi-structured, or even unstructured. Therefore, very different techniques are needed to extract them. There are many different ways to extract web data. One of the easiest and handy ways is to use an external Java library named JSoup. This recipe uses a certain number of methods offered in JSoup to extract web data.

Getting ready

In order to perform this recipe, we will require the following:

  1. Go to https://jsoup.org/download, and download the jsoup-1.9.2.jar file. Add the JAR file to your Eclipse project an external library.
  2. If you are a Maven fan, please follow the instructions on the download page to include the JAR file into your Eclipse project.

How to do it...

  1. Create a method named extractDataWithJsoup(String url). The parameter is the URL of any webpage that you need to call the method. We will be extracting web data from this URL:
            public void extractDataWithJsoup(String href){  
    
  2. Use the connect() method by sending the URL where we want to connect (and extract data). Then, we will be chaining a few more methods with it. First, we will chain the timeout() method that takes milliseconds as parameters. The methods after that define the user-agent name during this connection and whether attempts will be made to ignore connection errors. The next method to chain with the previous two is the get() method that eventually returns a Document object. Therefore, we will be holding this returned object in doc of the Document class:
            doc = 
              Jsoup.connect(href).timeout(10*1000).userAgent
                ("Mozilla").ignoreHttpErrors(true).get();
  3. As this code throws IOException, we will be using a try...catch block as follows:
            Document doc = null; 
            try { 
             doc = Jsoup.connect(href).timeout(10*1000).userAgent
               ("Mozilla").ignoreHttpErrors(true).get(); 
               } catch (IOException e) { 
                  //Your exception handling here 
            } 
    

    Tip

    We are not used to seeing times in milliseconds. Therefore, it is a nice practice to write 10*1000 to denote 10 seconds when millisecond is the time unit in a coding. This enhances readability of the code.

  4. A large number of methods can be found for a Document object. If you want to extract the title of the URL, you can use title method as follows:
             if(doc != null){ 
              String title = doc.title(); 
    
  5. To only extract the textual part of the web page, we can chain the body() method with the text() method of a Document object, as follows:
            String text = doc.body().text();
    
  6. If you want to extract all the hyperlinks in a URL, you can use the select() method of a Document object with the a[href]parameter. This gives you all the links at once:
            Elements links = doc.select("a[href]"); 
    
  7. Perhaps you wanted to process the links in a web page individually? That is easy, too--you need to iterate over all the links to get the individual links:
            for (Element link : links) { 
                String linkHref = link.attr("href"); 
                String linkText = link.text(); 
                String linkOuterHtml = link.outerHtml(); 
                String linkInnerHtml = link.html();  
            System.out.println(linkHref + "t" + linkText + "t"  +  
              linkOuterHtml + "t" + linkInnterHtml);       
            }  
    
  8. Finally, close the if-condition with a brace. Close the method with a brace:
        } 
        }  

The complete method, its class, and the driver method are as follows:

import java.io.IOException; 
import org.jsoup.Jsoup; 
import org.jsoup.nodes.Document; 
import org.jsoup.nodes.Element; 
import org.jsoup.select.Elements; 
 
public class JsoupTesting { 
   public static void main(String[] args){ 
      JsoupTesting test = new JsoupTesting(); 
      test.extractDataWithJsoup("Website address preceded by http://"); 
   } 
 
   public void extractDataWithJsoup(String href){ 
      Document doc = null; 
      try { 
         doc = Jsoup.connect(href).timeout(10*1000).userAgent
             ("Mozilla").ignoreHttpErrors(true).get(); 
      } catch (IOException e) { 
         //Your exception handling here 
      } 
      if(doc != null){ 
         String title = doc.title(); 
         String text = doc.body().text(); 
         Elements links = doc.select("a[href]"); 
         for (Element link : links) { 
            String linkHref = link.attr("href"); 
            String linkText = link.text(); 
            String linkOuterHtml = link.outerHtml(); 
            String linkInnerHtml = link.html(); 
            System.out.println(linkHref + "t" + linkText + "t"  + 
                linkOuterHtml + "t" + linkInnterHtml); 
         } 
      } 
   } 
} 
Visually different images
CONTINUE READING
83
Tech Concepts
36
Programming languages
73
Tech Tools
Icon Unlimited access to the largest independent learning library in tech of over 8,000 expert-authored tech books and videos.
Icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Icon 50+ new titles added per month and exclusive early access to books as they are being written.
Java Data Science Cookbook
notes
bookmark Notes and Bookmarks search Search in title playlist Add to playlist download Download options font-size Font size

Change the font size

margin-width Margin width

Change margin width

day-mode Day/Sepia/Night Modes

Change background colour

Close icon Search
Country selected

Close icon Your notes and bookmarks

Confirmation

Modal Close icon
claim successful

Buy this book with your credits?

Modal Close icon
Are you sure you want to buy this book with one of your credits?
Close
YES, BUY

Submit Your Feedback

Modal Close icon
Modal Close icon
Modal Close icon