2021-Fall-DSC180A02-Capstone: Text Mining and NLP

Undergraduate Class, HDSI, UCSD, 2021

Class Time: Wednesdays, 1 to 1:50 PM Pacific Time. Room: https://ucsd.zoom.us/j/91491702947.

Overview

This capstone section mainly focuses on text mining and natural language processing. We will explore cutting-edge research papers in these areas together and try to replicate some experiments for a deeper, better understanding.

We will mostly hold discussions in a Q&A format instead of traditional lectures. Due to COVID-19, the discussions will take place online over Zoom.

Papers to Read

Mining Quality Phrases from Massive Text Corpora
Jialu Liu*, Jingbo Shang*, Chi Wang, Xiang Ren and Jiawei Han. SIGMOD 2015. [code]
Automated Phrase Mining from Massive Text Corpora
Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R Voss and Jiawei Han. TKDE 2018. [code]
UCPhrase: Unsupervised Context-aware Quality Phrase Tagging
Xiaotao Gu*, Zihan Wang*, Zhenyu Bi, Yu Meng, Liyuan Liu, Jiawei Han and Jingbo Shang. KDD 2021. [arXiv:2105.14078] [code]

Tips

These three papers are highly related. Please read them one by one.
The GitHub README files also provide useful information.
PDFs of these papers can be found online or here.

Schedule

Week	Date	Discussion Focus
1	09/29	General Overview (a short lecture by Jingbo Shang)
2	10/06	Introduction & Motivation
3	10/13	Datasets and Experiment Design
4	10/20	Experimental Results - Analysis
5	10/27	Experimental Results - Replication
6	11/03	Case Studies
7	11/10	Application Brainstorming
8	11/17	Possible Extension
9	11/24	Report Writing Discussion
10	12/01	Elevator Pitch

Discussion Questions

Week 2: Introduction & Motivation

Why do we want to study phrase mining? What’s the advantage of phrases over unigrams?
What’s the major problem when applying SegPhrase to a new corpus? Is any human effort required?
What’s the motivation of AutoPhrase? Compared with SegPhrase, which parts do you believe are novel?
What’s the motivation of UCPhrase? Compared with AutoPhrase and SegPhrase, what are the major innovations in UCPhrase?

Week 3: Datasets and Experiment Design

How many datasets are used in the papers? How many domains and languages are covered?
Why do we want to use such a diverse set of datasets? How is this related to the claims in the papers?
Why do we want to evaluate the results following the pooling strategy? Think about how much human effort is required, if we are not using pooling.
Why does UCPhrase use some evaluation settings that differ from those of AutoPhrase and SegPhrase?

Week 4: Experimental Results - Analysis

Please outline the claims in these three papers.
How can we understand each table and figure? What are the takeaways? One or two sentences per table/figure should be enough.
For each claim, where are the experimental results supporting it?

Week 5: Experimental Results - Replication

Carefully check the README file in the AutoPhrase repo. What is the relation between autophrase.sh and phrasal_segmentation.sh?
Try to run AutoPhrase using the DBLP.5k.txt and DBLP.txt datasets as the input corpus. It should be runnable on your laptop. Let me know if you encounter any issues.
Please eyeball the results from the two runs and try to compare them from the following aspects:
- The number of high-quality phrases (e.g., > 0.5)
- Unigram phrase vs. multi-word phrase
- The top few high-quality phrases (e.g., > 0.9) vs. borderline phrases (e.g., ~0.5)
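
To go beyond eyeballing, the comparison above can be scripted. The following is a minimal sketch, not part of the AutoPhrase repo. It assumes each run produced a ranked phrase list with one `score<TAB>phrase` pair per line (please check the README for the actual output file and format); the function names `load_ranked_phrases` and `summarize`, as well as the 0.45 to 0.55 "borderline" band, are hypothetical choices made for illustration.

```python
import sys
from pathlib import Path


def load_ranked_phrases(path):
    """Read lines of the ASSUMED form '<score>\\t<phrase>' into (score, phrase) pairs."""
    phrases = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        score, phrase = line.split("\t", 1)
        phrases.append((float(score), phrase.strip()))
    return phrases


def summarize(phrases, high=0.5, top=0.9, borderline=(0.45, 0.55), k=5):
    """Collect the three aspects to compare between runs.

    The thresholds mirror the examples in the discussion questions;
    the borderline band is an illustrative assumption.
    """
    high_quality = [(s, p) for s, p in phrases if s > high]
    # A phrase counts as a unigram if it has no internal whitespace.
    unigrams = [p for _, p in high_quality if len(p.split()) == 1]
    ranked = sorted(phrases, reverse=True)  # highest score first
    low, up = borderline
    return {
        "n_high_quality": len(high_quality),
        "n_unigram": len(unigrams),
        "n_multiword": len(high_quality) - len(unigrams),
        "top_examples": [p for s, p in ranked if s > top][:k],
        "borderline_examples": [p for s, p in ranked if low <= s <= up][:k],
    }


if __name__ == "__main__":
    # Pass the ranked-list file of each run as a command-line argument.
    for result_file in sys.argv[1:]:
        print(result_file)
        for key, value in summarize(load_ranked_phrases(result_file)).items():
            print(f"  {key}: {value}")
```

Running the script with the result file of each run as an argument prints the two summaries one after the other, which makes the DBLP.5k.txt vs. DBLP.txt differences easier to discuss.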

Week 6: Case Studies

Why do we need case studies in addition to the quantitative results?
How do case studies further the claims in the papers?
Do you have any interesting findings from either the case studies presented in the papers or the results you got from Week 5?

Week 7: Application Brainstorming

What kinds of applications do you think could benefit from phrase mining? Why?
Try to think broadly for more domains/languages.
Based on your proposed applications, can we apply SegPhrase/AutoPhrase directly?
Do you think any adaptation is necessary? If yes, how? If no, why?

Week 8: Possible Extension

What are the drawbacks of these three papers? Do you see any limitations?
Can we do better in order to address these limitations?

Week 9: Report Writing Discussion

Do you have any questions about the final report writing?
How do we prepare informative figures and tables?
How do we properly cite previous work?
How do we make the proposal look more promising?

Week 10: Elevator Pitch

We will have a timed rehearsal for the elevator pitch.


Colt Jensen