Modern Natural Language Processing Techniques for Scientific Web Mining: Tasks, Data, and Tools

Xuan Wang, Hongwei Wang, Heng Ji, Jiawei Han

Department of Computer Science, University of Illinois at Urbana-Champaign

Time: April 26, 2022, 15:45 - 17:15 (CET)

Location: Online, hosted by Lyon, France


This tutorial targets researchers and practitioners who are interested in natural language processing (NLP) technologies for scientific web mining. Exploring the vast amount of rapidly growing scientific literature available on the web is highly beneficial for scientific discovery. However, scientific web mining is particularly challenging due to the lack of specialized domain knowledge in natural language context, complex sentence structures in scientific writing, and multi-modal representations of scientific knowledge.

This tutorial presents a comprehensive overview of recent research and development on using NLP techniques for scientific web mining, focusing on the biomedical and chemistry domains. First, we introduce the motivation and unique challenges of web mining in the scientific domains. Then we discuss a set of methods that perform effective information extraction (named entity recognition, relation extraction, and event extraction) and information retrieval (textual evidence retrieval, cross-modal molecule retrieval, and chemical reaction tracking) from scientific literature, along with their applications to reaction prediction. Finally, we conclude our tutorial by demonstrating, on real-world datasets (COVID-19 and organic chemistry literature), how information can be extracted and retrieved, and how it can assist further exploratory analysis. We also discuss the emerging research problems and future directions of using NLP techniques for scientific web mining.

Tutorial Recording:

A recording of our tutorial will be available after the conference.


  • Introduction [Slides]
  • Scientific Information Extraction and Analysis [Slides]
  • Scientific Information Search and Evidence Mining [Slides]
  • Summary and Future Directions [Slides]


Xuan Wang is a Ph.D. candidate at the Computer Science Department, University of Illinois at Urbana-Champaign. Her research focuses on mining and constructing structured knowledge from massive unstructured corpora with minimum human supervision, emphasizing applications to biological and health sciences. She received an M.S. degree in Statistics and an M.S. degree in Biochemistry from University of Illinois at Urbana-Champaign in 2017 and 2015, respectively, and a B.S. degree in Biological Science from Tsinghua University in 2013. She is the recipient of the YEE Fellowship Award in 2020-2021.
Hongwei Wang is a postdoctoral researcher at the Computer Science Department, University of Illinois Urbana-Champaign. His research interests include machine learning and data mining, particularly graph representation learning mechanisms, algorithms, and their applications in real-world data mining scenarios such as knowledge graphs, recommender systems, social networks, and sentiment analysis. He received his Ph.D. degree from the Department of Computer Science, Shanghai Jiao Tong University in 2018, and his B.E. degree from ACM Class, Shanghai Jiao Tong University in 2014. He was a postdoctoral researcher at the Computer Science Department, Stanford University, from 2019 to 2021. He was one of the recipients of the 2020 CCF (China Computer Federation) Outstanding Doctoral Dissertation Award and the 2018 Google Ph.D. Fellowship.
Heng Ji is a Professor at the Computer Science Department of University of Illinois Urbana-Champaign, and an Amazon Scholar. She received her B.A. and M.A. in Computational Linguistics from Tsinghua University, and her M.S. and Ph.D. in Computer Science from New York University. Her research interests focus on Natural Language Processing, especially on Multimedia Multilingual Information Extraction, Knowledge Base Population and Knowledge-driven Generation. She was selected as “Young Scientist” and a member of the Global Future Council on the Future of Computing by the World Economic Forum in 2016 and 2017. The awards she received include “AI’s 10 to Watch” Award by IEEE Intelligent Systems in 2013, NSF CAREER award in 2009, Google Research Award in 2009 and 2014, IBM Watson Faculty Award in 2012 and 2014, Bosch Research Award in 2014-2018, Amazon AWS Award in 2021, ACL2020 Best Demo Paper Award, and NAACL2021 Best Demo Paper Award. She has coordinated the NIST TAC Knowledge Base Population task since 2010. She has served as the Program Committee Co-Chair of many conferences including NAACL-HLT2018. She was elected secretary of the North American Chapter of the Association for Computational Linguistics (NAACL) for 2020-2021. Additional information is available at her website.
Jiawei Han is Michael Aiken Chair Professor, Department of Computer Science, University of Illinois at Urbana-Champaign. His research areas encompass data mining, text mining, data warehousing, and information network analysis, with over 1000 research publications. He is a Fellow of ACM and a Fellow of IEEE, and has received numerous prominent awards, including the ACM SIGKDD Innovation Award (2004) and the IEEE Computer Society W. Wallace McDowell Award (2009). He has delivered 50+ conference tutorials or keynote speeches (e.g., SIGKDD 2017-2021 tutorials and WSDM 2018 keynote).