Book Image

Learning Data Mining with Python

Book Image

Learning Data Mining with Python

Overview of this book

Table of Contents (20 chapters)
Learning Data Mining with Python
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Index

Attributing documents to authors


Authorship analysis has a background in stylometry, which is the study of an author's style of writing. The concept is based on the idea that everyone learns language slightly differently, and measuring the nuances in people's writing will enable us to tell them apart using only the content of their writing.

The problem has been historically performed using manual analysis and statistics, which is a good indication that it could be automated with data mining. Modern authorship analysis studies are almost entirely data mining-based, although quite a significant amount of work is still done with more manually driven analysis using linguistic styles.

Authorship analysis has many subproblems, and the main ones are as follows:

  • Authorship profiling: This determines the age, gender, or other traits of the author based on the writing. For example, we can detect the first language of a person speaking English by looking for specific ways in which they speak the language...