
Machine Learning Quick Reference

By : Rahul Kumar

Overview of this book

Machine learning makes it possible to learn about the unknowns and gain hidden insights into your datasets by mastering many tools and techniques. This book guides you to do just that in a very compact manner. After giving a quick overview of what machine learning is all about, Machine Learning Quick Reference jumps right into its core algorithms and demonstrates how they can be applied to real-world scenarios. From model evaluation to performance optimization, this book will introduce you to the best practices in machine learning. Furthermore, you will also look at more advanced aspects, such as training neural networks and working with different kinds of data, including text, time-series, and sequential data. Advanced methods and techniques, such as causal inference and deep Gaussian processes, are also covered. By the end of this book, you will have fast, accurate machine learning models at your fingertips, which you can easily use as a point of reference.

TF-IDF


As we saw, count vectorization has a limitation: a highly frequent word can dominate the representation. The idea, therefore, is to penalize words that occur in most of the documents by assigning them a lower weight, while increasing the weight of words that appear in only a subset of documents. This is the principle upon which TF-IDF works.

TF-IDF is a measure of how important a term is with respect to a document and the entire corpus (collection of documents):

TF-IDF(term) = TF(term) * IDF(term)
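As a minimal sketch, the TF-IDF product can be computed by hand on a toy corpus (the three documents and terms below are made up for illustration; IDF is taken here as log(total documents / documents containing the term), one common variant):

```python
import math

# A toy three-document corpus (made up for illustration)
corpus = [
    "nlp is fun and nlp is useful",
    "machine learning is fun",
    "deep learning extends machine learning",
]
docs = [doc.split() for doc in corpus]

def tf(term, doc):
    # Term frequency: occurrences of the term / total terms in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: log(total docs / docs containing the term)
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "is" appears in two of the three documents, so it receives a lower IDF
# (and hence a lower TF-IDF) than "nlp", which appears in only one.
print(tf_idf("nlp", docs[0], docs))
print(tf_idf("is", docs[0], docs))
```

Both terms occur twice in the first document, so their TF values are equal; the difference in their final weights comes entirely from IDF, which is exactly the penalization of corpus-wide frequent words described above.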

Term frequency (TF) is the frequency of a word in a document relative to the total number of words in that document. For example, if a document contains 1,000 words and we want to find the TF of the word NLP, which appears 50 times in that document, we use the following:

TF(NLP) = 50/1000 = 0.05

 

Hence, we can conclude the following:

TF(term) = (number of times the term appears in the document) / (total number of terms in the document)
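The TF calculation above can be reproduced directly for the hypothetical 1,000-word document in which NLP appears 50 times:

```python
def term_frequency(term_count, total_terms):
    # TF = times the term appears in the document / total terms in the document
    return term_count / total_terms

# The NLP example from the text: 50 occurrences out of 1,000 words
tf_nlp = term_frequency(50, 1000)
print(tf_nlp)  # 0.05
```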

In the preceding example, comprising three documents...