Archives

Visual pattern recognition

Today finds me in Frankfurt working at a customer site. Between jet lag and late nights, I have less time to work on this blog. But it’s Saturday and I have some time, so here goes. Today’s post is a brief survey of interesting visualization techniques for complex data.

I have a lot of experience in this area, dating back to my graduate studies, in which part of my work was to reconstruct the tracks of subatomic particles through detectors. In my experiment (E802, at Brookhaven National Laboratory) the accelerator took silicon atoms, stripped off their electrons and accelerated the resulting ions to ultra-relativistic speeds. The accelerator would then fire the ions at a thin strip of metal, and the resulting collisions would spray particles like pions, kaons, protons and deuterons (and so on, there were many types created!) into the detector apparatus. This is essentially what the experiment setup looked like:

The collision would spray particles into the detectors T1, T2, T3 and T4. These detectors consisted of wires strung from side to side of a large plastic frame. When a particle passed through, it would ionize the pressurized gas, and the ions would drift to the nearest wire and get detected as a pulse in the electronics. One of the trickiest aspects of such an experiment is knowing exactly when an interesting collision occurred and making sure that you read out the data at precisely the right time. Since collisions were happening all the time, you had only a very small window during which you could read out the data. I’m leaving out many details here.

Nonetheless, for my purpose today I’ll demonstrate how the pattern recognition algorithm worked. The magnet between T2 and T3 bent the charged particles (positive one direction, negative the other). The amount of bend revealed the momentum of the particle: if you had two particles of the same type, the faster one would bend less than the slower one. This is what a simple pattern might look like:

The purpose of my algorithm was to connect the dots, combine the information with other information available in other detectors after T4, and determine what type of particle it was and how fast it was going. Thing is, the detectors were not perfect. Sometimes (as in T2 here) they would not detect any ionization. Sometimes they detected other particles that were flying around the room, and you had no idea how to connect those signals up from one detector to the next.
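To make the "connect the dots" idea concrete, here is a minimal Python sketch. It is not the actual E802 code. It assumes each side of the magnet has a few wire planes at known positions along the beam, that tracks are straight outside the magnet, and that the magnet can be treated as a uniform field of a given length with a small bend angle. The function names, plane positions, tolerance, field strength and magnet length are all hypothetical.

```python
import math
from itertools import combinations

def find_segments(hits_by_plane, plane_z, tolerance, min_hits=3):
    """Find straight segments on one side of the magnet.

    hits_by_plane: one list of x positions per wire plane (a list may be empty
    when a plane failed to fire). Every pair of hits in different planes seeds
    a line; hits within `tolerance` of that line are attached. A segment is
    kept if at least `min_hits` planes contribute, so one silent plane is
    tolerated and stray hits off the line are ignored.
    """
    segments = []
    planes = range(len(plane_z))
    for i, j in combinations(planes, 2):
        for xi in hits_by_plane[i]:
            for xj in hits_by_plane[j]:
                slope = (xj - xi) / (plane_z[j] - plane_z[i])
                matched = {}
                for k in planes:
                    predicted = xi + slope * (plane_z[k] - plane_z[i])
                    near = [x for x in hits_by_plane[k]
                            if abs(x - predicted) <= tolerance]
                    if near:
                        matched[k] = min(near, key=lambda x: abs(x - predicted))
                if len(matched) >= min_hits and matched not in segments:
                    segments.append(matched)
    return segments

def fit_slope(matched, plane_z):
    """Least-squares slope dx/dz through the matched hits of one segment."""
    zs = [plane_z[k] for k in matched]
    xs = list(matched.values())
    z_mean = sum(zs) / len(zs)
    x_mean = sum(xs) / len(xs)
    num = sum((z - z_mean) * (x - x_mean) for z, x in zip(zs, xs))
    den = sum((z - z_mean) ** 2 for z in zs)
    return num / den

def bend_angle(slope_in, slope_out):
    """Angle in radians between the segments before and after the magnet."""
    return math.atan(slope_out) - math.atan(slope_in)

def momentum_gev(theta, field_tesla, length_m, charge=1):
    """Small-angle estimate p = 0.2998 * |q| * B * L / |theta| in GeV/c.

    Assumes a uniform field (hypothetical values, not E802's magnet).
    The sign of theta separates positive from negative particles.
    """
    return 0.2998 * abs(charge) * field_tesla * length_m / abs(theta)
```

A track candidate would pair one upstream segment with one downstream segment, compute the bend angle from their two fitted slopes, and turn that into a momentum estimate. A smaller bend gives a larger momentum, which matches the observation that the faster of two like particles bends less.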

Sometimes certain wires went haywire and fired all the time. And sometimes you got enormous “splat” events that looked like this:

In such an environment it was extremely important to review the algorithm by scanning through events by hand and trying to see if you could reconstruct any more patterns that the algorithm did not find, or reject any paths that the algorithm had reconstructed erroneously. Ultimately what we used for the data analysis were the reconstructed tracks only – we removed all the extraneous data and dealt with only the good information.

This is not necessarily a good recipe for much of the complex data we receive every day. There is far too much ambiguity for any sort of program to know exactly what it is we want to know. Search engine results are a good example of this. They try really hard, but there is ambiguity in the way we express what we are searching for, in the possible interpretations of the results, and in the available data – that is, the web pages themselves are ambiguous. Another example is one I touched on in an earlier post, where I proposed that programs that monitor real-time data use sound feedback to help grab our attention when something has changed in the incoming data.

I came across an interesting iPhone application today that helps people visualize complex information. Initially the application was intended for medical purposes, but I could see it being used in any number of situations where you are dealing with numeric data or data that can somehow be converted to numeric form. Check out the photo gallery – gory, but cool. Not sure what compelled the developers to deploy the app to the iPhone instead of to the internet generally, but I suppose if you can represent complex data in a very small visual display then you can do it for a larger display. It may even be that forcing it into a very small window helps you focus on removing as much extraneous information as possible.

The final category of visualization tool that I’d like to talk about today is a class of programs that help you visualize what’s on your hard disk. For instance, DiskView is a simple classifier that uses the extension in each file name (plus other information) to classify your files. What’s especially important is that it not only presents the results in a highly readable format, it also lets you navigate quickly to the files in which you are interested. Check out the Flash demo. These folks really understand how to convey complex information – in this case a demonstration of how you would use a relatively complex product – to a reader. Such a demo is quickly becoming the only acceptable way to present this sort of complex information. The extra time invested today in setting up and recording such a demo not only generates more sales, but helps train users quickly so they get the most out of the product.

For sheer visual fun, however, I really like WinDirStat. I have one animated GIF from LifeHacker that basically demonstrates how it works.
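The core idea behind tools like these, grouping files by extension and adding up their sizes, can be sketched in a few lines of Python. This is only an illustration of the classification step, not how DiskView or WinDirStat are actually implemented. The function name `bytes_by_extension` and the "(none)" label for files without an extension are my own choices.

```python
import os
from collections import Counter

def bytes_by_extension(root):
    """Walk the tree under `root` and total file sizes per extension.

    Extensions are lower-cased so ".TXT" and ".txt" land in one bucket.
    Files that cannot be read (for example, broken links) are skipped.
    """
    totals = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower() or "(none)"
            try:
                totals[ext] += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue
    return totals

if __name__ == "__main__":
    # Print the ten extensions that occupy the most space under the current directory.
    for ext, size in bytes_by_extension(".").most_common(10):
        print(f"{ext:>10}  {size:>14,d} bytes")
```

A visual tool would take totals like these and map them to colored rectangles sized by byte count. The table printed here is the same information without the picture.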

Data centralization

I read an interesting article about how the pendulum is swinging back to centralization of IT resources. Two of the most important results are an increased effectiveness in decision rights – the way technology investment decisions are made – and in information flows from IT to the rest of the business. This is related to, but not identical to some of the observations I’ve been making in this blog. With centralized data, everyone can see the same information as it

Read more …

Mining for information

Mining for information is established on a much less solid foundation because the nature of the information you will uncover is by definition unknown, and hence the potential profits are also unknown. I’m not talking about business information systems that analyze sales trends and help identify emerging trends so you can make money by recognizing and keeping up with the trends. These are clearly tied to the profit motive and the business imperative is undeniable. I’m talking about the other types of information in the system such as project statistics, emails, twitter feeds, source code repositories, feature requests, performance statistics, operations logs, and so

Read more …

Tweets in your information architecture

If there’s one thing we’ve learned over the past 30+ years of user interface development, it’s that a given action should be invocable via multiple gestures. For a given command there is no single gesture that everyone can agree is perfect for the task. Personally, when I’ve used a command more than about 3 or 4 times I start looking for its keyboard equivalent. Other people seem content to use a mouse for everything, but my carpal tunnel starts

Read more …

When to purge data

Occasionally I have to shift around my financial records from the early stone age – 1997 and before. My wife gets on my case to get rid of the paperwork but I never seem to get around to it. It doesn’t take up much room so why bother? In the back of my mind I can imagine a social anthropologist finding the files in an attic some 200 years from now and wondering in awe over the quaint practices of

Read more …

Baselines

Volunteering is a great way to give back to the community, meet new and interesting people, and keep me from playing Civilization until the wee hours of the morning. OK, sometimes not so wee. I know it’s seriously time for bed when I hear traffic on the road – that means people are heading in to work and perhaps I should think about it too. Fortunately I don’t do that too often. Anyway, one of the things I have volunteered

Read more …

Categories of metrics

In my previous post I discussed the challenges related to the quality of the predictions and inferences you can make from data you collect in a corporate repository. Fortunately, the data you collect are not all required for making inferences. Some of them are extremely useful no matter how much data you have. It all depends on what you intend to do with them.

The following is a list of categories of metrics that can be used to shed light on your organization

Read more …

Metrics and macroeconomics

I can see in my mind a Hollywood movie where a junior employee (a junior actress in a cameo role) runs to the shop floor to her cigar-chomping boss (a well-known character actor in the latter part of his illustrious career), shows him a chart that clearly demonstrates a fatal flaw in the factory operations, he takes decisive actions to fix the situation, and together they take over the world of widget manufacturers. This, of course, never happens. Somehow Hollywood

Read more …

Metrics frustration

Ultimately, that is the lesson of metrics in a knowledge-based industry: No one metric is going to tell you the answer you are looking for. Metrics that work in some circumstances will not work in others. Even from one company to the next, in the same industry, following the same methodologies, you are going to find some unexplainable

Read more …

User data and web service constraints

Cloud computing is so big these days that I’m even getting emails from more or less reputable organizations imploring me to jump on the bandwagon and get rich off this latest fad. Clearly we’re well on our way if not already in the trough of disillusionment. As we pull out of the trough, we will have to deal with some very challenging issues – both technical and economic. On the technical side, I’ve been dealing with an interesting aspect of

Read more …