MCS 311
Special Topics in Data Mining (Text Analytics)
Text analytics is the process of analyzing unstructured (text) data to extract meaning from it. The rapid growth of text data in digital form motivates automated analysis of text documents scattered across the WWW in varied languages and formats. Nearly two decades ago the field was perceived as an offshoot of information retrieval, encompassing specialized data mining methods for extracting patterns from text data. Since then it has grown and expanded rapidly through the use of sophisticated NLP techniques and statistical and machine learning methods.
Text
Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, An Introduction to Information Retrieval, Cambridge University Press, 2009.
(Book Page) (Online Reading) (Resources)
Additional References:
1. Introduction to Data Mining: Tan, Steinbach and Vipin Kumar (Pearson Education, 2018)
2. Data Mining: Concepts and Techniques, Han and Kamber (Morgan Kaufmann, 2010)
Course Plan
Week | Lecture (Monday 2-3 PM and Thursday 2-4 PM in Committee Room) | Lab Exercises / Assignment
1 | Boolean Retrieval (Chap 1) | Introduction to Text Mining in Python/R, NLTK
2-3 | The term vocabulary & postings lists (Chap 2) | Extracting text from different types of files (pdf, xml, html, etc.). Tokenizing and creating a dictionary
3-4 | Dictionaries and tolerant retrieval (Chap 3) | Preprocessing text, with a focus on the difference between stemming and lemmatization, different types of stemmers, POS tagging
4-5 | Scores, weights, vector spaces (Chap 6) | Create a word cloud and a tf-idf matrix for the text data given in Assignment 2a
6-7 | Matrix decompositions and latent semantic indexing (Chap 18) | Mid-term exam as per the schedule displayed on the notice board
7-9 | Language models for information retrieval (Chap 12) | Coming soon
8-10 | Text Classification (Chap 13, 14) | Coming soon
11-13 | Text Clustering (Chap 16, 17) | Final Lab Exam
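As a warm-up for the Week 1 topic (Boolean retrieval), the following is a minimal sketch of an inverted index with a two-term AND query. The whitespace tokenizer, the function names, and the three toy documents are illustrative assumptions and are not part of the course material.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: plain whitespace tokenization, no stemming or stop words.
        for token in text.lower().split():
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def intersect(p1, p2):
    """Merge two sorted postings lists, keeping doc IDs present in both."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


def boolean_and(index, term_a, term_b):
    """Answer the query 'term_a AND term_b'; unknown terms match nothing."""
    return intersect(index.get(term_a.lower(), []), index.get(term_b.lower(), []))


# Toy collection (hypothetical data for illustration only).
docs = {
    1: "Brutus killed Caesar",
    2: "Caesar was ambitious",
    3: "Brutus and Caesar met Calpurnia",
}
index = build_inverted_index(docs)
print(boolean_and(index, "Brutus", "Caesar"))  # [1, 3]
```

The merge in `intersect` walks both postings lists once, which is why the lists are kept sorted by document ID.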
Grading Scheme
70 marks for Major Exam
Assignments
All assignments are to be submitted before midnight on the due date. Deadlines are firm. Please mail the assignments to vb.ducs@gmail.com. The subject line should contain MCS-311 ASSN Roll-no (N being the assignment number).
Assignment 1: Write a 1-2 page note on the history of search engines. Submit as a txt file. (Due date: 23 Aug 2018)
Assignment 2a: You are given a folder containing documents (text/pdf files).
Normalize the text and create a similarity matrix using the Jaccard index.
Apply hierarchical clustering. Cut the dendrogram at k and identify clusters of similar documents.
Submit a 1-2 page report describing the observations.
(Due date: 3 Oct 2018)
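The pipeline in Assignment 2a could be sketched roughly as follows. This is a toy illustration under stated assumptions, not a reference solution: normalization here is only lowercasing plus alphabetic tokenization, the clustering is a naive average-linkage procedure that stops merging once k clusters remain (which corresponds to cutting the dendrogram at k clusters), and the four short documents are made up.

```python
import re


def normalize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_matrix(term_sets):
    """Pairwise Jaccard similarities as a list of lists."""
    n = len(term_sets)
    return [[jaccard(term_sets[i], term_sets[j]) for j in range(n)] for i in range(n)]


def agglomerative(sim, k):
    """Naive average-linkage clustering: merge the most similar pair until k remain."""
    clusters = [[i] for i in range(len(sim))]

    def avg_sim(c1, c2):
        return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        pairs = ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters)))
        a, b = max(pairs, key=lambda p: avg_sim(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical documents standing in for the assignment folder.
texts = [
    "The cat sat on the mat",
    "A cat sat on a mat",
    "Stock markets fell sharply today",
    "Markets fell as stock prices dropped",
]
term_sets = [normalize(t) for t in texts]
sim = similarity_matrix(term_sets)
print(agglomerative(sim, k=2))  # [[0, 1], [2, 3]]
```

For a real collection, SciPy's `scipy.cluster.hierarchy` module (`linkage`, `dendrogram`, `fcluster`) is a common choice for building and cutting the dendrogram; the hand-written loop above is only meant to make the merging step visible.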
Assignment 2b: Create a tf-idf matrix of the collection.
Using cosine distance, create a similarity matrix.
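A minimal sketch of the two steps in Assignment 2b is given below. It assumes the documents are already tokenized, and it uses one common weighting variant, raw term frequency times log(N / df) with the natural logarithm; other tf-idf variants exist and the assignment does not say which one is expected. Cosine distance is 1 minus the cosine similarity computed here. The function names and toy documents are illustrative.

```python
import math
from collections import Counter


def tfidf_matrix(tokenized_docs):
    """Return (vocab, rows) where rows[d][t] = tf(t, d) * log(N / df(t))."""
    n_docs = len(tokenized_docs)
    vocab = sorted({t for doc in tokenized_docs for t in doc})
    # Document frequency: number of documents in which each term occurs.
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        rows.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, rows


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical pre-tokenized documents.
tokenized = [
    ["cat", "sat", "mat"],
    ["cat", "sat", "hat"],
    ["stock", "market", "fell"],
]
vocab, rows = tfidf_matrix(tokenized)
sims = [[cosine_similarity(u, v) for v in rows] for u in rows]
for row in sims:
    print([round(x, 3) for x in row])
```

Note that a term occurring in every document gets weight log(1) = 0 under this variant, so it contributes nothing to the similarity.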
Assignment 2c: Perform LSA using a reduced latent space with 4 dimensions.
For each topic, identify the 5 top-weighted terms.
Find the similarity matrix for the documents in the reduced space.
Apply hierarchical clustering. Cut the dendrogram at k and identify clusters of similar documents.
Observe the differences in the clustering results and submit a 1-2 page report describing the observations. Highlight the differences and explain the results.
(Due date: 29 Oct 2018)
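The LSA steps in Assignment 2c can be sketched with a truncated SVD, assuming NumPy is available. The example below is a toy illustration: it uses a hand-made 6-term by 4-document count matrix with k = 2 latent dimensions and 2 top terms per dimension, whereas the assignment asks for 4 dimensions and 5 terms on the tf-idf matrix. Ranking terms by absolute weight and comparing documents by cosine similarity in the reduced space are choices made here, not requirements stated in the assignment.

```python
import numpy as np


def lsa(term_doc, k):
    """Truncated SVD of a (terms x docs) matrix: keep the k largest singular values."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


def top_terms(U_k, vocab, n):
    """For each latent dimension, the n terms with the largest absolute weight."""
    return [
        [vocab[i] for i in np.argsort(-np.abs(U_k[:, j]))[:n]]
        for j in range(U_k.shape[1])
    ]


def doc_similarities(s_k, Vt_k):
    """Cosine similarity between documents represented in the reduced space."""
    docs = (np.diag(s_k) @ Vt_k).T  # one row per document, k columns
    norms = np.linalg.norm(docs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty documents
    unit = docs / norms
    return unit @ unit.T


# Hypothetical term-document counts: rows are terms, columns are documents.
vocab = ["cat", "mat", "sat", "stock", "market", "fell"]
A = np.array([
    [2, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float)

U_k, s_k, Vt_k = lsa(A, k=2)
topics = top_terms(U_k, vocab, n=2)
doc_sims = doc_similarities(s_k, Vt_k)
print(topics)
print(np.round(doc_sims, 3))
```

The resulting `doc_sims` matrix can be fed to the same hierarchical clustering step used in Assignment 2a, which makes it possible to compare the clusters obtained before and after the dimensionality reduction.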
Syllabus for Minor 1: Chap 1, 2, 3 and 6 from the text.