Features
- Cover Type: Hard Cover with 224 pages
- Published by: Wiley-Interscience
- Edition: 1st Edition May 9, 2003
- Written in: English
- ISBN 10 Number: 0471268518
- ISBN 13 Number: 978-0471268512
- Book Dimensions: 9.3 x 6.4 x 0.8 inches
- Weighs: 1.2 pounds
Product Review
"provides a uniquely integrated approach for serious data analysts everywhere" (Zentralblatt Math, Vol. 1027, 2004)
"Statisticians not conversant with today's statistical take on DQ should read this book…and be stimulated to do important research in DQ." (Journal of the American Statistical Association, March 2006)
"…uniquely integrates several approaches for data cleaning and exploration…" (Journal of Statistical Computation & Simulation, April 2004)
Product Description
- Written for practitioners of data mining, data cleaning and database management.
- Presents a technical treatment of data quality including process, metrics, tools and algorithms.
- Focuses on developing an evolving modeling strategy through an iterative data exploration loop and incorporation of domain knowledge.
- Addresses methods of detecting, quantifying and correcting data quality issues that can have a significant impact on findings and decisions, using commercially available tools as well as new algorithmic approaches.
- Uses case studies to illustrate applications in real life scenarios.
- Highlights new approaches and methodologies, such as the DataSphere space partitioning and summary based analysis techniques.
Exploratory Data Mining and Data Cleaning will serve as an important reference for serious data analysts who need to analyze large amounts of unfamiliar data, managers of operations databases, and students in undergraduate or graduate level courses dealing with large scale data analysis and data mining.
Reader Reviews
This is the best deep and practical introduction to data cleaning that I have seen. It provides an excellent overview of the practical problems in data cleaning, gives a good intuitive feeling for the core issues of outliers and robust statistics, and surveys a good set of techniques for addressing data cleaning issues in a practical but relatively deep manner. It doesn't try to provide cookbook solutions; instead it points out the complexities and leaves the reader with a toolbox for tackling them. The really interested reader will want to augment the book with some other reading, including (on the practical side) a book or website of tips on how to express robust statistics in SQL (the O'Reilly book on Transact-SQL has good stuff), and (on the more statistical side) a deeper introduction to robust statistics (e.g. Rousseeuw and Leroy's Robust Regression and Outlier Detection). In a future edition it would be nice to see more discussion of time series outliers, as well as an SQL cookbook that will run on commodity databases of modest size (which is the common case in practice, as opposed to the massive telco databases that the authors discuss).
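As a small illustration of the robust-statistics idea the review refers to, here is a minimal Python sketch that flags outliers with the median and the median absolute deviation (MAD) instead of the mean and standard deviation. It is not taken from the book. The function name `mad_outliers`, the sample `readings`, and the 3.5 cutoff (a commonly used convention for the modified z-score) are assumptions chosen for illustration.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`.

    Hypothetical helper for illustration; the 3.5 cutoff is a common
    convention, not a recommendation from the book.
    """
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad == 0:
        # More than half the values are identical, so the MAD gives no
        # usable scale; this sketch simply reports no outliers.
        return []
    # 0.6745 rescales the MAD so the score is comparable to an ordinary
    # z-score when the data are roughly normal.
    return [v for v, d in zip(values, abs_dev) if 0.6745 * d / mad > threshold]


# Made-up sample data with one obviously bad reading.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 55.0]
print(mad_outliers(readings))
```

Because the median and MAD barely move when a single extreme value is added, the bad reading cannot mask itself the way it can when it inflates a mean and standard deviation.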