Features
- Cover Type: Paperback with 328 pages
- Published by: No Starch Press (March 30, 2007)
- Written in: English
- ISBN 10 Number: 1593271204
- ISBN 13 Number: 978-1593271206
- Book Dimensions: 9.1 x 6.9 x 0.9 inches
- Weighs: 1.4 pounds
Reader Reviews
"Webbots, Spiders, and Screen Scrapers" is a solid book for building basic web scraping scripts. Michael Schrenk covers the "should you do this" aspect very well and devotes much of the book to these kinds of topics. For that reason alone I give him major kudos: "just because you CAN do a thing, doesn't mean you SHOULD."
Technically, the book and its examples are very basic and beginner level. All code is procedural, with no reference to object-oriented programming at all. This is fine for a simple project, but building anything larger than a targeted webbot or two is beyond the scope of this book.
I was very dismayed at Mr. Schrenk's opinion of regular expressions: "The use of regular expressions is a parsing language in itself, and most modern programming languages support aspects of regular expressions. In the right hands, regular expressions are also useful for parsing and substituting text; however, they are famous for their sharp learning curve and cryptic syntax. I avoid regular expressions whenever possible." This disregard for regular expressions effectively wipes out a powerful toolset for budding developers. Regular expressions are no harder to learn than PHP.
The reasoning behind his disdain is also flawed: "The regular expression engine used by PHP is not as efficient as engines used in other languages, and is certainly less efficient than PHP's built-in functions for parsing HTML." PHP's preg_* functions use PCRE, a regular expression library modeled on the engine used (very effectively) in Perl. There have been many studies showing that preg_* style expressions outperform basic text matching in PHP. In this assessment the author is terribly wrong.
The book does a great job of explaining how to make single-use scripts for scraping, but never how to create a larger infrastructure. There is no focus on creating multi-process engines with pcntl_fork() or proc_open(), which are critical for scaling web scraping applications. A single script scraping a few hundred websites in a single process would take ages compared with a multi-process engine.
If you are looking to break into web scraping and are not sure where to start, this is likely the best (and possibly only) book on the market. If you are intermediate or advanced, you will quickly question the author's logic and see that scaling will become the number one issue you have to overcome.