Scala Microservices

By: Selvam Palanimalai, Jatin Puri

Overview of this book

In this book, we will learn what it takes to build great applications using microservices, the pitfalls associated with such a design, and the techniques to avoid them.

We learn to build highly performant applications using the Play Framework. You will understand the importance of writing asynchronous and non-blocking code, and how Play leverages this paradigm for higher throughput. The book introduces the Reactive Manifesto and uses the Lagom Framework to implement the suggested paradigms. Lagom teaches us to build applications that are scalable and resilient to failure, and it solves problems faced with microservices, such as the service gateway, service discovery, and inter-service communication. Message passing is used as a means to achieve resilience, and CQRS with Event Sourcing helps us model data for highly interactive applications.

The book also shares effective development processes for large teams, using a good version control workflow, continuous integration, and deployment strategies. We introduce Docker containers and the Kubernetes orchestrator. Finally, we look at the end-to-end deployment of a set of Scala microservices in Kubernetes, with load balancing, service discovery, and rolling deployments.

Business idea

To be able to implement a search engine, you need to decide on the source of data. You decide to implement a search based on the data available on Stack Overflow, GitHub, and LinkedIn.

Data collection

The data has to be scraped from Stack Overflow, GitHub, and LinkedIn. We will need to write a web crawler that crawls all the publicly available data on each source, then parses the HTML pages and extracts the data we are interested in. Of course, we will need specific parsers for Stack Overflow, GitHub, and LinkedIn, as the data we wish to extract is structured differently on each site. For example, GitHub can provide a user's probable location if the developer has mentioned it on their profile.

GitHub also provides an API (https://developer.github.com/v3/) to obtain the desired information, and so do Stack Overflow (https://api.stackexchange.com/docs) and LinkedIn (https://developer.linkedin.com/). It makes a lot of sense to use these APIs, because the information obtained is structured and intact. They are also much easier to maintain than HTML parsers, which can break at any time because they are subject to changes in the page source.
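As a rough illustration of what an API-based collector could look like, the following minimal Scala sketch fetches a public GitHub profile as raw JSON. It uses a simple blocking call via scala.io.Source purely for brevity; a real collector would use an asynchronous HTTP client with authentication and rate-limit handling. The GitHubClient name is an assumption made for this example.

    import scala.io.Source
    import scala.util.Try

    object GitHubClient {
      // Fetch the public profile of a GitHub user as a raw JSON string.
      // https://api.github.com/users/{login} is part of GitHub's v3 REST API.
      def fetchUser(login: String): Try[String] = Try {
        val source = Source.fromURL(s"https://api.github.com/users/$login")
        try source.mkString finally source.close()
      }
    }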

Maybe you wish to use a combination of both, so as not to rely solely on the APIs: the website owners could disable them temporarily without prior notification for numerous reasons, such as a higher load on their servers due to some event, disabling public access to safeguard their own applications, heavy throttling of your IP address, or a temporary IP ban. Some of these services provide a larger rate limit if you purchase it, but some don't. So, crawling cannot be dispensed with entirely.

The data collected individually from the aforementioned sites would be very rich. Stack Overflow can provide us with the following:

  • The list of all users on their website, along with each user's reputation, location, display name, website URL, and more.
  • All the tags on the Stack Exchange platform. This information can be used to generate our database of tags, such as Android, Java, iOS, JavaScript, and many more.
  • A split of the reputation gained by individual users across different tags. At the time of writing, Jon Skeet, who has the highest reputation on Stack Overflow, had 18,286 posts on C#, 10,178 posts on Java, and so on. Reputation on each tag can give us a sense of how knowledgeable a developer is about a particular technology.

The following is the JSON response from the Stack Exchange /users endpoint (documented at https://api.stackexchange.com/docs/users), which returns a list of all users in descending order of reputation:

    {
      "items": [
        {
          "badge_counts": {
            "bronze": 7518,
            "silver": 6603,
            "gold": 493
          },
          "account_id": 11683,
          "is_employee": false,
          "last_modified_date": 1480008647,
          "last_access_date": 1480156016,
          "age": 40,
          "reputation_change_year": 76064,
          "reputation_change_quarter": 12476,
          "reputation_change_month": 5673,
          "reputation_change_week": 1513,
          "reputation_change_day": 178,
          "reputation": 909588,
          "creation_date": 1222430705,
          "user_type": "registered",
          "user_id": 22656,
          "accept_rate": 86,
          "location": "Reading, United Kingdom",
          "website_url": "http://csharpindepth.com",
          "link": "http://stackoverflow.com/users/22656/jon-skeet",
          "profile_image": "https://www.gravatar.com/avatar/6d8ebb117e8d83d74ea95fbdd0f87e13?s=128&d=identicon&r=PG",
          "display_name": "Jon Skeet"
        },
        .....
      ]
    }
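Since the book builds on the Play Framework, one natural way to consume this response is with play-json. The following sketch maps a handful of the fields shown above into a case class; the StackUser name and the subset of fields are illustrative, not a complete model of the API.

    import play.api.libs.json._

    // A trimmed-down model of one entry in the "items" array; the field
    // names match those in the Stack Exchange response shown above.
    case class StackUser(
      user_id: Long,
      display_name: String,
      reputation: Long,
      location: Option[String],
      website_url: Option[String]
    )

    object StackUser {
      implicit val reads: Reads[StackUser] = Json.reads[StackUser]
    }

    object StackExchangeParser {
      // Extract the list of users from a raw API response body.
      def parseUsers(body: String): Seq[StackUser] =
        (Json.parse(body) \ "items").as[Seq[StackUser]]
    }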

In a similar manner, GitHub can provide statistics for each user based on the user's contributions to different repositories via its API. A higher number of commits to a Scala-based repository on GitHub might indicate the user's prowess with Scala. If the contributed repository has a higher number of stars and forks, then a developer's contributions to it earn a higher score, as the repository itself has a higher reputation. For example, there is a strong probability that a person contributing to Spring's source code is actually stronger with Spring than a person with a pet project based on Spring that is not starred by many. It is not a guarantee, but a matter of probability.
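To make the idea concrete, here is one possible scoring heuristic sketched in Scala: each commit is weighted by the reputation of the repository it went into, using log-scaled stars and forks. The weighting itself is an assumption chosen for illustration; any monotonic function of repository popularity would work.

    // One record per (developer, repository) pair for a given technology.
    case class Contribution(repo: String, commits: Int, stars: Int, forks: Int)

    object GitHubScore {
      // Log-scaled weight: a 10,000-star repository counts for more than an
      // unstarred pet project, but not 10,000 times more.
      def repoWeight(stars: Int, forks: Int): Double =
        1.0 + math.log1p(stars) + 0.5 * math.log1p(forks)

      // A developer's score for a technology: commits weighted by repository reputation.
      def score(contributions: Seq[Contribution]): Double =
        contributions.map(c => c.commits * repoWeight(c.stars, c.forks)).sum
    }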

LinkedIn can give us very structured data about a developer's current occupation, location, interests, blog posts, connections, and more.

Apart from the aforementioned sources, it might be a good idea to also build an infrastructure to manually insert and correct data. You could later have a small operations team that can add, update, or delete entries to refine the data.

Once the data is collected, you will need to transform it into some desired format and have persistent storage for it. The data will also have to be processed and indexed so that it is available in-memory, perhaps using Apache Lucene (http://lucene.apache.org/core/), so that we can execute faster queries on it. Data that is readily available in RAM is many times faster to access than data read from disk.
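As a minimal sketch of what such in-memory indexing could look like, the snippet below uses Lucene's Java API from Scala to index developers with a name and a free-text skills field, and to search them. RAMDirectory is used here only for brevity (newer Lucene releases prefer ByteBuffersDirectory), and the field names are illustrative.

    import org.apache.lucene.analysis.standard.StandardAnalyzer
    import org.apache.lucene.document.{Document, Field, TextField}
    import org.apache.lucene.index.{DirectoryReader, IndexWriter, IndexWriterConfig}
    import org.apache.lucene.queryparser.classic.QueryParser
    import org.apache.lucene.search.IndexSearcher
    import org.apache.lucene.store.RAMDirectory

    object DeveloperIndex {
      private val analyzer = new StandardAnalyzer()
      private val directory = new RAMDirectory() // in-memory index

      // Index (name, skills) pairs; closing the writer commits the documents.
      def index(devs: Seq[(String, String)]): Unit = {
        val writer = new IndexWriter(directory, new IndexWriterConfig(analyzer))
        devs.foreach { case (name, skills) =>
          val doc = new Document()
          doc.add(new TextField("name", name, Field.Store.YES))
          doc.add(new TextField("skills", skills, Field.Store.YES))
          writer.addDocument(doc)
        }
        writer.close()
      }

      // Return the names of the top `n` developers matching a skills query.
      def search(query: String, n: Int = 10): Seq[String] = {
        val searcher = new IndexSearcher(DirectoryReader.open(directory))
        val parsed = new QueryParser("skills", analyzer).parse(query)
        searcher.search(parsed, n).scoreDocs.toSeq
          .map(hit => searcher.doc(hit.doc).get("name"))
      }
    }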

Linking users across sites

Now that we have planned how to collect the developer data available on each site, we will also have to build a single global developer database spanning all the websites, comprising developer information such as name, contact information, location, LinkedIn handle, and Stack Overflow handle. This would, of course, be obtained from the data generated by the web crawler or the APIs.

We need the ability to link people. For example, a developer with the handle abcxyz on LinkedIn might be the same person with the same or a different handle on GitHub. If we can associate the different profiles on different websites with a single user, this provides much richer data and leaves us in a better position to rate that particular person.
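A first cut at such linking could be a simple rule-based match, sketched below: two profiles are considered the same developer if they share an email address or, failing that, the same normalized name and location. The Profile fields and the matching rules are assumptions for illustration; a production system would need fuzzier matching and manual review.

    // A profile as extracted from one source site.
    case class Profile(site: String, handle: String,
                       name: String, email: Option[String], location: Option[String])

    object ProfileLinker {
      private def normalize(s: String) = s.trim.toLowerCase

      // Same email, or same normalized name and location, is treated as the same developer.
      def sameDeveloper(a: Profile, b: Profile): Boolean =
        (a.email, b.email) match {
          case (Some(x), Some(y)) => normalize(x) == normalize(y)
          case _ =>
            normalize(a.name) == normalize(b.name) &&
              a.location.isDefined &&
              a.location.map(normalize) == b.location.map(normalize)
        }

      // Group profiles from all sites into candidate developer identities.
      def link(profiles: Seq[Profile]): Seq[Seq[Profile]] =
        profiles.foldLeft(Vector.empty[Vector[Profile]]) { (groups, p) =>
          groups.indexWhere(_.exists(sameDeveloper(_, p))) match {
            case -1 => groups :+ Vector(p)
            case i  => groups.updated(i, groups(i) :+ p)
          }
        }
    }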

Rank developers

We also need the ability to rate developers. This is a difficult problem to solve. We could calculate a rank for each user on each website and then normalize across all the websites. However, we need to be careful about data inconsistencies. For example, a user might have a high score on GitHub but a poor score on Stack Overflow (maybe because he is not very active on Stack Overflow).

Ultimately, we would need a rank of each developer for each specific technology.
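One simple way to sketch this, under the assumption of min-max normalization per site followed by a weighted sum across sites, is shown below. The weights and the treatment of missing profiles (scored as zero) are illustrative choices, not a recommendation.

    object DeveloperRank {
      // Min-max normalize raw scores from one site into [0, 1] so that, say,
      // GitHub-based scores and Stack Overflow reputation become comparable.
      // Assumes a non-empty score map.
      def normalize(scores: Map[String, Double]): Map[String, Double] = {
        val (min, max) = (scores.values.min, scores.values.max)
        scores.map { case (dev, v) =>
          dev -> (if (max == min) 0.5 else (v - min) / (max - min))
        }
      }

      // Combine per-site normalized scores into one rank per developer,
      // treating a missing profile on a site as a score of zero.
      def combine(perSite: Map[String, Map[String, Double]],
                  weights: Map[String, Double]): Map[String, Double] = {
        val normalized = perSite.map { case (site, s) => site -> normalize(s) }
        val developers = normalized.values.flatMap(_.keys).toSet
        developers.map { dev =>
          dev -> normalized.map { case (site, s) =>
            weights.getOrElse(site, 1.0) * s.getOrElse(dev, 0.0)
          }.sum
        }.toMap
      }
    }

Normalizing per site first keeps one site's raw scale from dominating the combined rank, which addresses the inconsistency mentioned earlier, where a developer is strong on GitHub but quiet on Stack Overflow.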

User interaction

Now that our backend is sorted out, we will of course need a fancy but minimal user interface for the HR manager to search with. A query engine will be needed to parse the search queries that users enter. For example, a user might enter Full Stack Engineers in Singapore, so we will need an engine that understands the implication of being Full Stack.

Maybe there is also a need to provide a Domain Specific Language (DSL) that users could use to express complex searches, such as Location in (Singapore, Malaysia) AND Language in (Java, JavaScript) Knowledge of (Spring, Angular).
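A small sketch of what such a DSL could look like, assuming the scala-parser-combinators library, is shown below. The grammar only covers the clauses in the example query above, and the AST names are invented for illustration.

    import scala.util.parsing.combinator.RegexParsers

    // AST for the hypothetical search DSL shown above.
    sealed trait Clause
    case class LocationIn(values: Seq[String])  extends Clause
    case class LanguageIn(values: Seq[String])  extends Clause
    case class KnowledgeOf(values: Seq[String]) extends Clause
    case class Query(clauses: Seq[Clause])

    object SearchDsl extends RegexParsers {
      private def ident: Parser[String] = """[A-Za-z#+.]+""".r
      private def list: Parser[Seq[String]] = "(" ~> repsep(ident, ",") <~ ")"

      private def clause: Parser[Clause] =
        "Location"  ~> "in" ~> list ^^ LocationIn  |
        "Language"  ~> "in" ~> list ^^ LanguageIn  |
        "Knowledge" ~> "of" ~> list ^^ KnowledgeOf

      // Clauses may be separated by an optional AND, as in the example query.
      def query: Parser[Query] = rep1(clause <~ opt("AND")) ^^ Query

      def parse(input: String): Either[String, Query] =
        parseAll(query, input) match {
          case Success(q, _) => Right(q)
          case failure       => Left(failure.toString)
        }
    }

Calling SearchDsl.parse on the example query would then produce a Query with three clauses that the backend can translate into searches against the index.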

The user interface will also need a web view to visualize the responses and to store user preferences, such as a default city or technology, and past searches.

Now that most of the functionality is sorted out on paper, coupled with confidence in the business idea and its potential to change how people search for talent online, confidence is sky-high. The spontaneous belief is that it has all the potential to be the next market disruptor.