KDD 2022 Tutorial

New Frontiers of Scientific Text Mining: Tasks, Data, and Tools

Xuan Wang, Hongwei Wang, Heng Ji, Jiawei Han

Department of Computer Science, University of Illinois at Urbana-Champaign

Time: Aug 14, 2022, 9am - 12pm ET

Location: Washington DC, USA

Abstract:

Exploring the vast amount of rapidly growing scientific text data is highly beneficial for real-world scientific discovery. However, scientific text mining is particularly challenging due to the lack of specialized domain knowledge in natural language context, complex sentence structures in scientific writing, and multi-modal representations of scientific knowledge. This tutorial presents a comprehensive overview of recent research and development on scientific text mining, focusing on the biomedical and chemistry domains. First, we introduce the motivation and unique challenges of scientific text mining. Then we discuss a set of methods that perform effective scientific information extraction, such as named entity recognition, relation extraction, and event extraction. We also introduce real-world applications such as textual evidence retrieval, scientific topic contrasting for drug discovery, and molecule representation learning for reaction prediction. Finally, we conclude our tutorial by demonstrating, on real-world datasets (COVID-19 and organic chemistry literature), how information can be extracted and retrieved, and how it can assist further scientific discovery. We also discuss emerging research problems and future directions for scientific text mining.

Tutorial Recording:

A recording of our tutorial will be available after the conference.

Slides [Combined]:

  • Introduction [Slides]
  • Part I: Scientific Information Extraction and Analysis [Slides]
  • Part II: Scientific Information Search and Evidence Mining [Slides]
  • Part III: Topic Discovery, Text Classification, and Multi-Dimensional Text Analysis [Slides]
  • Summary and Future Directions [Slides]

Presenters:

Xuan Wang is a Ph.D. candidate in the Computer Science Department, University of Illinois at Urbana-Champaign. Her research focuses on mining and constructing structured knowledge from massive unstructured corpora with minimal human supervision, emphasizing applications to biological and health sciences. She received an M.S. degree in Statistics and an M.S. degree in Biochemistry from the University of Illinois at Urbana-Champaign in 2017 and 2015, respectively, and a B.S. degree in Biological Science from Tsinghua University in 2013. She is the recipient of the YEE Fellowship Award in 2020-2021.
Hongwei Wang is a postdoctoral researcher in the Computer Science Department, University of Illinois Urbana-Champaign. His research interests include machine learning and data mining, particularly graph representation learning mechanisms, algorithms, and their applications in real-world data mining scenarios such as knowledge graphs, recommender systems, social networks, and sentiment analysis. He received a Ph.D. degree from the Department of Computer Science, Shanghai Jiao Tong University in 2018, and a B.E. degree from the ACM Class, Shanghai Jiao Tong University in 2014. He was a postdoctoral researcher in the Computer Science Department, Stanford University, from 2019 to 2021. He was one of the recipients of the 2020 CCF (China Computer Federation) Outstanding Doctoral Dissertation Award and the 2018 Google Ph.D. Fellowship.
Heng Ji is a Professor in the Computer Science Department of the University of Illinois Urbana-Champaign, and an Amazon Scholar. She received her B.A. and M.A. in Computational Linguistics from Tsinghua University, and her M.S. and Ph.D. in Computer Science from New York University. Her research interests focus on Natural Language Processing, especially on Multimedia Multilingual Information Extraction, Knowledge Base Population, and Knowledge-driven Generation. She was selected as a “Young Scientist” and a member of the Global Future Council on the Future of Computing by the World Economic Forum in 2016 and 2017. The awards she has received include the “AI’s 10 to Watch” Award by IEEE Intelligent Systems in 2013, the NSF CAREER Award in 2009, the Google Research Award in 2009 and 2014, the IBM Watson Faculty Award in 2012 and 2014, the Bosch Research Award in 2014-2018, the Amazon AWS Award in 2021, the ACL 2020 Best Demo Paper Award, and the NAACL 2021 Best Demo Paper Award. She has coordinated the NIST TAC Knowledge Base Population task since 2010. She has served as the Program Committee Co-Chair of many conferences, including NAACL-HLT 2018. She was elected secretary of the North American Chapter of the Association for Computational Linguistics (NAACL) for 2020-2021. Additional information is available at her website.
Jiawei Han is the Michael Aiken Chair Professor in the Department of Computer Science, University of Illinois at Urbana-Champaign. His research areas encompass data mining, text mining, data warehousing, and information network analysis, with over 1000 research publications. He is a Fellow of ACM and a Fellow of IEEE, and has received numerous prominent awards, including the ACM SIGKDD Innovation Award (2004) and the IEEE Computer Society W. Wallace McDowell Award (2009). He has delivered 50+ conference tutorials or keynote speeches (e.g., SIGKDD 2017-2021 tutorials and the WSDM 2018 keynote).