A Revival of Data Dependencies for Improving Data Quality
Title: A Revival of Data Dependencies for Improving Data Quality
Speaker: Prof. Wenfei Fan (School of Informatics, University of Edinburgh & Bell Labs, Alcatel-Lucent)
Time: 4:00pm, Wednesday, Sept. 3
Venue: Lecture Room, State Key Lab of Computer Science, Level 3, Building #5, Institute of Software, CAS
Abstract: Recent statistics reveal that 1%-5% of real-world data in enterprises is dirty: inconsistent, inaccurate, incomplete and/or stale. The prevalent use of the Internet has been increasing the risks, on an unprecedented scale, of creating and propagating dirty data. Dirty data is estimated to cost US industry alone billions of dollars a year. There is no reason to believe that the scale of the problem is any different in any other society that is dependent on information technology. This highlights the need for principled approaches to improving data quality.
This talk presents a constraint-based approach to improving data quality. It is based on revisions of functional dependencies and inclusion dependencies, which are used to determine whether the data is clean or not. As opposed to traditional database dependencies, which were developed for improving the quality of schemas, the revised constraints are for improving the quality of the data. Based on the revised constraints, practical techniques have been developed for cleaning dirty data, which effectively reduce human effort and improve data quality.
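As a rough illustration of how a dependency can flag dirty data, the sketch below checks a classical functional dependency (rows that agree on the left-hand attributes must agree on the right-hand ones). It does not implement the revised constraints presented in the talk; the function name `fd_violations`, the attribute names, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return groups of rows that agree on `lhs` but disagree on `rhs`.

    Hypothetical helper: checks a classical functional dependency
    lhs -> rhs over a list of dict-shaped rows.
    """
    # Group rows by their values on the left-hand-side attributes.
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key].append(row)

    # A group violates the dependency if it holds more than one
    # distinct combination of right-hand-side values.
    violations = {}
    for key, members in groups.items():
        rhs_values = {tuple(r[a] for a in rhs) for r in members}
        if len(rhs_values) > 1:
            violations[key] = members
    return violations


# Made-up sample data: the first two rows share country and zip
# but disagree on street, so at least one of them is dirty.
rows = [
    {"country": "UK", "zip": "EH8 9AB", "street": "Crichton St"},
    {"country": "UK", "zip": "EH8 9AB", "street": "Mayfield Rd"},
    {"country": "UK", "zip": "W1B 1JH", "street": "Regent St"},
]

dirty = fd_violations(rows, lhs=["country", "zip"], rhs=["street"])
```

Here `dirty` maps each offending left-hand-side value to the rows that conflict on it. The check only detects the inconsistency; deciding which row to repair is a separate step.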
Bio of the speaker:
Wenfei Fan is the Professor of Web Data Management in the School of Informatics, University of Edinburgh, and a Research Scientist at Bell Laboratories, Alcatel-Lucent. He received his PhD from the University of Pennsylvania, and his MS and BS from Peking University. He is a recipient of the Roger Needham Award in 2008, the Chang Jiang Scholar Award in 2007, the Outstanding Overseas Young Scholar Award in 2003, the Career Award in 2001, the ICDE Best Paper Award in 2007, and the Best Paper of the Year Award from Computer Networks in 2002. His current research interests include data quality, data integration, integrity constraints, distributed query processing, Web services and XML.