MCS 311  Special Topics in Data Mining (Text Analytics)

Text analytics is the process of analyzing unstructured (text) data to extract meaning from it. The rapid growth of text data in digital form motivates automated analysis of text documents scattered across the WWW in varied languages and formats. About two decades ago the field was perceived as an offshoot of information retrieval, encompassing specialized data mining methods for extracting patterns from text data. Since then it has grown and expanded rapidly through the use of sophisticated NLP techniques and statistical and machine learning methods.

Textbook

Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, An Introduction to Information Retrieval, Cambridge University Press, 2009.


Additional References:

1. Introduction to Data Mining: Tan, Steinbach and Vipin Kumar (Pearson Education, 2018)

2. Data Mining: Concepts and Techniques: Han and Kamber (Morgan Kaufmann, 2010)

 

Course Plan 

Week | Lecture (Monday 2-3 PM and Thursday 2-4 PM in Committee Room) | Lab Exercises / Assignment
1 | Boolean Retrieval (Chap 1) | Introduction to Text Mining in Python/R, NLTK; Assignment 1
2-3 | The term vocabulary & postings lists (Chap 2) | Extracting text from different file types (PDF, XML, HTML, etc.); tokenizing and creating a dictionary
3-4 | Dictionaries and tolerant retrieval (Chap 3) | Preprocessing text, with a focus on the difference between stemming and lemmatization, different types of stemmers, and POS tagging; Assignment 2a
4-5 | Scores, weights, vector spaces (Chap 6) | Create a word cloud and a tf-idf matrix for the text data given in Assignment 2a; Assignment 2b
6-7 | Matrix decompositions and latent semantic indexing (Chap 18) | Mid-term exam as per the schedule displayed on the notice board; Assignment 2c
7-9 | Language models for information retrieval (Chap 12) | Coming soon
8-10 | Text Classification (Chap 13, 14) | Coming soon
11-13 | Text Clustering (Chap 16, 17) | Final Lab Exam

Grading Scheme

30 marks for internal assessment. The breakdown will be announced in class.

70 marks for Major Exam

Assignments

All assignments must be submitted before midnight on the due date. Deadlines are firm. Please mail the assignments to vb.ducs@gmail.com. The subject line should contain "MCS-311 ASSN Roll-no" (N being the assignment number).

Assignment 1: Write a 1-2 page note on the history of search engines. Submit it as a .txt file. (Due date: 23 Aug 2018)

Assignment 2a: You are given a folder containing documents (text/pdf files).

Normalize the text and create a similarity matrix using the Jaccard index.

Apply hierarchical clustering. Cut the dendrogram at k and identify clusters of similar documents.

Submit a 1-2 page report describing your observations. (Due date: 3 Oct 2018)
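The steps of Assignment 2a can be sketched in plain Python as below. This is a minimal illustration, not a reference solution: the four toy documents, the regex-based normalization, the choice of average linkage, and the reading of k as the desired number of clusters are all assumptions made for the example. Stopping the merges when k clusters remain gives the same grouping as cutting the dendrogram so that k clusters result.

```python
import re
from itertools import combinations

def normalize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Size of the intersection divided by size of the union (1.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def similarity_matrix(docs):
    """Pairwise Jaccard similarity between the token sets of all documents."""
    sets = [normalize(d) for d in docs]
    return [[jaccard(x, y) for y in sets] for x in sets]

def average_linkage_clusters(sim, k):
    """Agglomerative clustering: merge the two most similar clusters until k remain."""
    clusters = [[i] for i in range(len(sim))]

    def link(c1, c2):
        # Average pairwise similarity between members of the two clusters.
        return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]   # a < b, so deleting b is safe
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical toy collection standing in for the assignment folder.
docs = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock markets fell sharply today",
    "markets fell as stock prices dropped",
]
jaccard_sim = similarity_matrix(docs)
jaccard_clusters = average_linkage_clusters(jaccard_sim, k=2)
print(jaccard_clusters)
```

For a real collection, a library routine such as SciPy's hierarchical clustering would usually replace the hand-written loop; the loop is kept here only to make each merge step visible.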

Assignment 2b: Create the tf-idf matrix of the collection.

Using cosine similarity, create a similarity matrix.
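A minimal sketch of the two steps in Assignment 2b follows. It assumes the weighting tf x log10(N/df) described in Chap 6 of the textbook, a simple regex tokenizer, and a toy collection of four sentences; none of these choices are prescribed by the assignment, and other tf-idf variants would give different numbers.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_matrix(docs):
    """Return (vocab, rows) with rows[d][t] = tf(t, d) * log10(N / df(t))."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    n = len(docs)
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    rows = [[c[t] * math.log10(n / df[t]) for t in vocab] for c in counts]
    return vocab, rows

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

# Hypothetical toy collection standing in for the assignment folder.
docs = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock markets fell sharply today",
    "markets fell as stock prices dropped",
]
vocab, tfidf_rows = tfidf_matrix(docs)
cosine_sim = [[cosine(u, v) for v in tfidf_rows] for u in tfidf_rows]
for row in cosine_sim:
    print([round(x, 3) for x in row])
```

Documents with no terms in common get similarity 0, and every document has similarity 1 with itself, which is a quick sanity check on the matrix.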

Assignment 2c: Perform LSA using a reduced latent space with 4 dimensions.

For each topic, identify the 5 top-weighted terms.

Find the similarity matrix for the documents in the reduced space.

Apply hierarchical clustering. Cut the dendrogram at k and identify clusters of similar documents.

Observe the differences in the clustering results and submit a 1-2 page report describing your observations. Highlight the differences and explain the results.

(Due date: 29 Oct 2018)
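The LSA steps of Assignment 2c can be sketched with NumPy's SVD as below. The sketch makes several assumptions: NumPy is available, the term-document matrix is a small hand-made example, "top weighted" is read as largest absolute weight in each latent dimension, and documents are represented by the columns of the product of the truncated singular values and right singular vectors. The toy matrix has only 4 documents, so k=2 and 2 terms per topic are used here; the assignment itself asks for 4 dimensions and 5 terms.

```python
import numpy as np

def lsa(term_doc, k):
    """Truncated SVD of a term-document matrix, keeping the k largest singular values."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]

def top_terms(u_k, vocab, n):
    """For each latent dimension, the n terms with the largest absolute weight."""
    return [[vocab[i] for i in np.argsort(-np.abs(u_k[:, j]))[:n]]
            for j in range(u_k.shape[1])]

def doc_similarity(s_k, vt_k):
    """Cosine similarity between documents in the reduced space."""
    doc_vecs = (s_k[:, None] * vt_k).T            # one row per document
    norms = np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    unit = doc_vecs / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T

# Hypothetical term-document counts: rows are terms, columns are documents.
lsa_vocab = ["cat", "mat", "sat", "stock", "markets", "fell"]
term_doc = np.array([
    [2, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float)

u_k, s_k, vt_k = lsa(term_doc, k=2)
lsa_topics = top_terms(u_k, lsa_vocab, n=2)
lsa_sim = doc_similarity(s_k, vt_k)
print(lsa_topics)
print(np.round(lsa_sim, 3))
```

The resulting similarity matrix can be fed to the same hierarchical clustering procedure used in the earlier parts, which makes the clustering results directly comparable.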

Syllabus for Minor 1: Chap 1, 2, 3 and 6 from the text.