Mike Loukides has an excellent piece on O’Reilly Radar entitled “What is data science?” In the article, Loukides covers making data products, the data lifecycle, working with data at scale (Big Data), storytelling, and data scientists.
Throughout the article, Loukides introduces the reader to many data science concepts, tools, experts and skills.
A few items stand out. I love the term “data exhaust”:
“These recommendations are “data products” that help to drive Amazon’s more traditional retail business. They come about because Amazon understands that a book isn’t just a book, a camera isn’t just a camera, and a customer isn’t just a customer; customers generate a trail of “data exhaust” that can be mined and put to use, and a camera is a cloud of data that can be correlated with the customers’ behavior, the data they leave every time they visit the site.”
I think this “make lemonade” sentiment on data quality is crucial:
“Once you’ve parsed the data, you can start thinking about the quality of your data. Data is frequently missing or incongruous. If data is missing, do you simply ignore the missing points? That isn’t always possible. If data is incongruous, do you decide that something is wrong with badly behaved data (after all, equipment fails), or that the incongruous data is telling its own story, which may be more interesting? It’s reported that the discovery of global warming was delayed because automated data collection tools discarded readings that were too low [1]. In data science, what you have is frequently all you’re going to get. It’s usually impossible to get “better” data, and you have no alternative but to work with the data at hand.”
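As a rough illustration of the point about not discarding incongruous data, here is a minimal Python sketch that labels readings instead of dropping them. Everything in it is an assumption made for illustration: the function name `flag_readings`, the thresholds, and the sample values are hypothetical and do not come from Loukides’ article.

```python
def flag_readings(readings, low, high):
    """Label each reading instead of silently dropping it.

    Returns a list of (value, status) pairs, where status is
    "ok", "missing", or "out_of_range". Nothing is discarded.
    """
    labeled = []
    for value in readings:
        if value is None:
            labeled.append((value, "missing"))
        elif value < low or value > high:
            labeled.append((value, "out_of_range"))
        else:
            labeled.append((value, "ok"))
    return labeled


# Hypothetical sensor readings; None marks a missing point.
readings = [312.0, None, 305.5, 120.0, 298.4]

# The expected range is an illustrative assumption, not a real calibration.
labeled = flag_readings(readings, low=200.0, high=400.0)

# Out-of-range values stay available for inspection rather than vanishing.
suspicious = [value for value, status in labeled if status == "out_of_range"]
print(suspicious)
```

Keeping the flagged values around means someone can later ask whether the “bad” readings were equipment failure or a story of their own.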
The big data definition is excellent. It’s about the problem, not the (product) solutions:
“The most meaningful definition I’ve heard: “big data” is when the size of the data itself becomes part of the problem. We’re discussing data problems ranging from gigabytes to petabytes of data. At some point, traditional techniques for working with data run out of steam.”
And the information platforms / dataspaces concept ties in with my active information tier:
“What are we trying to do with data that’s different? According to Jeff Hammerbacher [2] (@hackingdata), we’re trying to build information platforms or dataspaces. Information platforms are similar to traditional data warehouses, but different. They expose rich APIs, and are designed for exploring and understanding the data rather than for traditional analysis and reporting. They accept all data formats, including the most messy, and their schemas evolve as the understanding of the data changes.”
If you want to learn something today, read the article. Then bookmark it for future reference.