Monday, 10 November 2008

Day 1 - 17:45 : When you have too much data, good enough is good enough (ARC303)

Presented by Pat Helland, this was a high-level session that stayed largely in the theoretical space. It offered no real answers; instead, problems were posed and left as exercises for the attendee to think about.

The session aimed to challenge how we think about data and how rigid and prescriptive we are about our interfaces to the data and the usages of that data.

This culminated in a look at how organisations often find themselves compromising on data quality as the volume of data grows. Amazon was used as a case study: their merchant API contracts are deliberately not very prescriptive about the data they expect, favouring merchant adoption over data quality. Instead, they have processes which attempt to reconcile the incoming data, but ultimately they sacrifice data quality for the sake of simplicity for the merchants.

For example, take shoes - they have no unique code, no ISBN or similar identification system. Yet if you were able to buy shoes on Amazon, pair of shoes X from manufacturer Y, sold by merchant A, would appear on the site as the same product as the one sold by merchant B, with the same unique Amazon product code. Merchant A may have sent distinctly less data than Merchant B, yet the Amazon service is able to apply logic to work out that the products are the same thing and present them as such.

Merchants A and B might both send the colour and manufacturer name, for instance, whilst merchant B might provide a host of additional information on top of this. The idea is that the colour and manufacturer name form the prescriptive contract, whilst the additional information from Merchant B is completely optional and not strictly defined. If it is available (from either merchant), it is used to flesh out the product data (for both merchants). He even went so far as to suggest that contracts offer key/value pairs to allow any data to be passed optionally and used. (This gives me chills of the bad variety, I've got to say).
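To make the idea concrete, here is a minimal sketch of such a contract in Python. It is purely illustrative and not anything shown in the session: the `Listing` type, the `extras` field and the `merge_listings` function are all names I've made up, and the "first value seen wins" rule is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Listing:
    """One merchant's submission (hypothetical shape, not Amazon's real API)."""
    merchant: str
    manufacturer: str  # prescriptive part of the contract
    colour: str        # prescriptive part of the contract
    extras: dict = field(default_factory=dict)  # optional, loosely defined key/value pairs


def merge_listings(listings):
    """Combine listings already judged to be the same product into one catalogue entry."""
    product = {
        "manufacturer": listings[0].manufacturer,
        "colour": listings[0].colour,
        "merchants": [],
        "extras": {},
    }
    for listing in listings:
        product["merchants"].append(listing.merchant)
        for key, value in listing.extras.items():
            # Assumption: the first value seen for a key wins.
            product["extras"].setdefault(key, value)
    return product


a = Listing("Merchant A", "Y", "black")
b = Listing("Merchant B", "Y", "black", {"material": "leather", "sole": "rubber"})
catalogue_entry = merge_listings([a, b])
```

Merchant A sent only the required fields, yet the merged entry carries Merchant B's extra attributes for both of them.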

Of course this is where the quality issues appear - it's not always 100% possible to match the two together so sometimes the same product will be brought into the catalogue as separate items. For Amazon, this is deemed acceptable rather than forcing a regimented API that merchants must adhere to.
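A rough illustration of how that can go wrong, again in Python. The matching rule here (lowercase everything and strip punctuation) is my own stand-in, not the logic Amazon uses, and the merchants and product names are invented.

```python
import re


def match_key(manufacturer, name, colour):
    """Build a crude comparison key: lowercase, punctuation collapsed to single spaces."""
    def norm(text):
        return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return (norm(manufacturer), norm(name), norm(colour))


def assign_to_catalogue(listings):
    """Group merchants under the catalogue item their listing matches."""
    catalogue = {}
    for merchant, manufacturer, name, colour in listings:
        key = match_key(manufacturer, name, colour)
        catalogue.setdefault(key, []).append(merchant)
    return catalogue


listings = [
    ("Merchant A", "Acme", "Trail Runner 2", "Black"),
    ("Merchant B", "ACME", "trail-runner 2", "black"),
    ("Merchant C", "Acme Inc.", "Trail Runner II", "Black"),  # same shoe, described differently
]
catalogue = assign_to_catalogue(listings)
```

Merchants A and B end up on the same item, but Merchant C's listing doesn't match and becomes a separate catalogue item even though it's the same shoe. That's the "good enough" trade-off in miniature.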

The ideas presented revolved around how classic RDBMS systems offer crisp answers over relatively small amounts of data, but new systems have huge amounts of data, high rates of change and large volumes of queries. As systems grow, data quality and its meaning become more fuzzy - any schema, if it's even present, may vary across the data, and the data may already be stale by the time it reaches us, so we must be able to work with it within given tolerances of staleness.

For example, say we have an ordering API that allows our customers to place orders for products with us, and that API also exposes a list of prices that updates at midnight each evening. If someone submitted an order to our services at 11:59pm and it was processed at 12:01am, is our system going to reject the order because the pricing is stale? No - it should allow either the stale or the current pricing to be used for a period of time before enforcing such rules.
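One way to picture that tolerance is the Python sketch below. The 15-minute grace period, the `accepted_prices` function and the prices (in pence) are all my own assumptions for illustration; no specific numbers were given in the session.

```python
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(minutes=15)  # hypothetical tolerance for stale prices


def accepted_prices(processed_at, price_changed_at, old_price, new_price, grace=GRACE_PERIOD):
    """Return the set of prices an order may quote when it is processed."""
    if processed_at < price_changed_at:
        return {old_price}             # change not in effect yet
    if processed_at - price_changed_at <= grace:
        return {old_price, new_price}  # inside the grace window: either is fine
    return {new_price}                 # window closed: enforce current pricing


midnight = datetime(2008, 11, 11, 0, 0)
just_after = accepted_prices(datetime(2008, 11, 11, 0, 1), midnight, 4999, 5499)
much_later = accepted_prices(datetime(2008, 11, 11, 9, 0), midnight, 4999, 5499)
```

The order processed at 12:01am can carry either price, whilst one processed at 9am must use the current one.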

This was discussed further as the concept of inside data and outside data. Inside data lives in the transactional systems we're all used to - you start a process, you freeze the database in time using a transaction and then commit when you're done. This is the historic model of databases, but today we also have services to contend with that sit outside the transaction and so aren't within the same space/time as the database transaction. Outside data is what passes between those services, and our systems will have to deal with it in future.

To cut the remainder of the story short, he theorised that many businesses, just like Amazon, are happy with "good enough" if it gives them a benefit elsewhere.
