Tf-Idf (Term Frequency-Inverse Document Frequency) Vectorization is a statistical method used to evaluate the importance of a word in a document relative to a collection of documents, also known as a corpus. The key idea behind Tf-Idf is to increase the weight of terms that appear frequently in a specific document while reducing the weight of terms that appear in many documents of the corpus. This is achieved through two main components: Term Frequency (TF), which measures how often a term appears in a document, and Inverse Document Frequency (IDF), which measures how rare a term is across the corpus, so that terms found in almost every document contribute little.
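The mathematical formulation is given by:

$$\text{tf-idf}(t, d, D) = \text{tf}(t, d) \cdot \text{idf}(t, D)$$

where

$$\text{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$

and

$$\text{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}$$

Here $f_{t,d}$ is the number of times term $t$ occurs in document $d$, and $N$ is the total number of documents in the corpus $D$.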
By representing each document as a Tf-Idf vector, this method makes documents numerically comparable, which supports text analysis tasks in areas such as information retrieval and natural language processing.
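As an illustration, the weighting can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a reference implementation: TF is taken as a term's count divided by the document length, IDF as the natural logarithm of the number of documents divided by the number of documents containing the term, and the function name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per tokenized document.

    Assumes unsmoothed weights: tf = count / document length,
    idf = ln(number of documents / documents containing the term).
    """
    n_docs = len(corpus)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in corpus for term in set(doc))
    vectors = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy corpus, already split into lowercase tokens.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
vectors = tf_idf(docs)
```

In this toy corpus, "the" occurs in every document, so its IDF is ln(3/3) = 0 and its weight vanishes, while "mat", which occurs in only one document, receives the largest weight in the first document. Library implementations often use smoothed or normalized variants (scikit-learn's `TfidfVectorizer`, for example, smooths the IDF and L2-normalizes each vector by default), so their values will generally differ from this sketch.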