Skip to main content

Capture Everything

This week I've started planning for the next version of our data collection system. The key realization for me is that I do not know all the questions we will need to answer in the future. Our current focus is on specific sequences of click events, but in the future we might want to look at browser versions or behavioral patterns related to IP addresses. If we don't capture user-agent, for example, we won't be able to answer questions about browser versions. If we don't capture IP then we cannot look for patterns in IP addresses. We should store data in a way that maximizes the range of questions we can address in the future.

In the past few years, the cost of storing data have continued to fall. We use AWS extensively.  Amazon S3 costs are very reasonable and guarantees a high level of availability. Also, lower compute costs and open source tools like Hadoop that process large data volumes have greatly increased our ability to extract valuable insights from data. So storing more data than we need causes minimal inconvenience.

Another consideration is that the cost and difficulty of modifying data ingestion is higher than other parts of a data processing system. There is nothing to reconcile captured data to in the wild. We need to take extra steps to ensure data doesn't get misplaced, malformed or lost. I don't want to change those systems frequently.

All of these factors push me towards wanting to capture as much information as I possibly can,  and in as unaltered a format as allowed by the components I am working with for data ingestion. This approach maximizes our ability to answer questions in the future, incurs negligible storage cost and reduces the frequency of expensive changes to the ingestion flow.

Comments

Popular posts from this blog

ReactJS, NPM and Maven

I'm just starting to get into working with ReactJS, Facebook's open source rendering framework. My project uses SpringBoot for annotation-driven dependency injection and MVC. I thought it would be great if I could use a bit of ReactJS to enhance the application. If you're looking for a basic conceptual intro, I recommend ReactJS for Stupid People and of course the official documentation  is quite good. In full disclosure, I still have no idea how to do "flux" yet. As an experienced Java backend developer, I'm pretty decent at hacking Maven builds - which is precisely what this blog post is going to be about. First, a word about how React likes to be built. Like many front-end tools, there is a toolkit for the node package manager (NPM). From the command prompt, one might run npm install -g react-tools  which installs the jsx command. The  jsx  command provides the ability to transform JSX syntax into ordinary JavaScript, which is precisely what I want...

Solved: Unable to Locate Spring Namespace Handler

I attempted to run a Spring WebMVC application, and when starting up the application complained that it didn't know how to handle the MVC namespace in my XML configuration. The project runs JDK 7 and Spring 4.0.6 using Maven as the build system. The following is my XML configuration file: <?xml version="1.0" encoding="UTF-8"?> <beans xmlns="http://www.springframework.org/schema/beans"        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"        xmlns:mvc="http://www.springframework.org/schema/mvc"        xsi:schemaLocation="         http://www.springframework.org/schema/beans         http://www.springframework.org/schema/beans/spring-beans.xsd         http://www.springframework.org/schema/mvc         http://www.springframework.org/schema/mvc/spring-mvc.xsd">          <mvc:annotation-driven/> ...

AWS S3 versus CloudFront Performance

Yesterday I took Amazon CloudFront for a spin. Creating the CloudFront distribution was pretty simple - the wizard process flowed nicely. I found myself relying on the help text in places, but the most surprising thing was how long it took for the distribution to become enabled. I didn't time it exactly, but I probably spent 45 minutes waiting for my new CloudFront distribution to change from "In Progress" to "Enabled" status. The performance is a bit confusing. Compared to the S3 bucket, I didn't see any improvement in performance in a few tries - in fact, the CloudFront CDN performance was worse than the S3 bucket on its own for my 217 KB image file. I decided to take a larger sample, loading the same image 30 times in Chrome and noting the timing data from the "network" tab in the developer tools. I'm located in Brooklyn, have CloudFront configured for the US/Europe with download mode configured. My S3 bucket is in the US Standard zone, w...