Today I’m trying to clear up a major correlation misconception I have consistently run across over the last two years when speaking with customers, prospects, and even fellow employees. People believe there is a silver bullet for device correlation, and that asset deduplication can be achieved by technology alone.
Underlying Causes Of Miscorrelation
Culprit #1: Bad Data Leads to Unresolvable Technology Conflicts
When working with customers, the most common bad data problems we see come primarily from two sources: vulnerability scanners and CMDBs.
Vulnerability scanners are notorious for reporting incorrect OS types and versions, often guessing based on indirect data collection and fingerprinting methods. Poor hygiene and error-prone manual data entry are the primary causes of bad CMDB data, and these problems are further exacerbated by constant changes to upstream data sources.
Culprit #2: Lack of Commonly Agreed Naming Standards
The fact is that the very data sources that contain asset data elements and inform any inventory management tool are complicit in the problem because they lack consistency. In the absence of conventional naming standards, high-integrity correlation is difficult to achieve.
Each technology manufacturer is left to their own devices (no pun intended) when it comes to how they handle the naming convention for key data fields — as well as the way data is actually populated.
Important data elements like MAC address, IP address, OS type, and OS version are not labelled the same way from one data source to the next. Using MAC address as the example, we see the field names range from MAC Address to Network Interface Manufacturer: MAC Address, and from NIC to Adapter Address, and from Ethernet Hardware Address to Physical Address.
Now imagine this problem across 20 data fields and 20 unique data sources. It is a tangled skein fraught with challenges.
Culprit #3: Lack of Overlapping Field Data from Data Sources
Another major cause of inefficient correlation is a lack of overlap in the key data fields between data sources. Using only a limited number of sources translates into fewer data points, and with fewer data points to work with, correlation algorithms struggle to produce accurate results. Some data sources come sparsely populated with the data elements relevant to correlation. Some customers only leverage a few data sources in their attempt to build an inventory. In either case, the simple lack of data produces low-confidence results.
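One way to see this problem before running any correlation is to measure which fields each pair of sources actually populates in common. The following is a minimal sketch in Python, not a description of any product's internals. The source names, field names, and sample records are all hypothetical, and it assumes field names have already been mapped to a common label.

```python
from itertools import combinations

# Hypothetical sample records, already keyed by common field names.
SOURCES = {
    "scanner": [{"ip_address": "10.0.0.5", "os_type": "Linux"}],
    "cmdb": [{"hostname": "web-01", "serial_number": "SN123", "ip_address": "10.0.0.5"}],
    "edr": [{"hostname": "web-01", "mac_address": "00:1a:2b:3c:4d:5e", "serial_number": ""}],
}


def populated_fields(records):
    """Return the fields that hold a non-empty value in at least one record."""
    return {key for rec in records for key, value in rec.items() if value not in (None, "")}


def field_overlap(sources):
    """Map each pair of source names to the fields that both sources populate."""
    populated = {name: populated_fields(recs) for name, recs in sources.items()}
    return {(a, b): populated[a] & populated[b] for a, b in combinations(sorted(populated), 2)}


overlap = field_overlap(SOURCES)
for pair, fields in overlap.items():
    print(pair, sorted(fields))
```

In this made-up data set the scanner and the EDR tool share no populated fields at all, so nothing could tie their records together directly. Note that the empty serial number in the EDR record is treated as missing, which is the "sparsely populated" case described above.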
Techniques To Improve Correlation
Technique #1: Normalization
One of the best ways to solve the lack of a common naming convention for data fields is to build a normalization framework into the data collection process. Given the large number of data fields available in many data sources — and the sheer number of available data sources themselves — it isn’t feasible to normalize every single field of data. That said, there are a finite number of common fields across data sources that are most critical for device correlation and deduplication. Some of these include:
- MAC address
- Device ID
- Device serial number
- Asset name
- Device name
- Host name
Any collection or aggregation process should include normalization so that common device characteristic values can be correlated efficiently and managed easily.
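To make the idea concrete, here is a minimal sketch of a normalization step in Python. It is an illustration under assumptions rather than a reference implementation: the alias table, the canonical field names, and the helper functions are all hypothetical, and a real framework would cover many more fields and value formats.

```python
import re

# Hypothetical alias table: vendor-specific labels mapped to one canonical name.
FIELD_ALIASES = {
    "mac address": "mac_address",
    "network interface manufacturer: mac address": "mac_address",
    "nic": "mac_address",
    "adapter address": "mac_address",
    "ethernet hardware address": "mac_address",
    "physical address": "mac_address",
    "host name": "hostname",
    "hostname": "hostname",
    "serial number": "serial_number",
    "device serial number": "serial_number",
}


def normalize_mac(value):
    """Return a lowercase colon-separated MAC address, or None if the value is not one."""
    cleaned = re.sub(r"[\s:.\-]", "", value).lower()
    if not re.fullmatch(r"[0-9a-f]{12}", cleaned):
        return None
    return ":".join(cleaned[i:i + 2] for i in range(0, 12, 2))


def normalize_record(raw):
    """Keep only known fields, rename them, and normalize their values."""
    out = {}
    for key, value in raw.items():
        canonical = FIELD_ALIASES.get(key.strip().lower())
        if canonical is None or value is None:
            continue  # unknown field or missing value: not used for correlation
        value = str(value).strip()
        if canonical == "mac_address":
            value = normalize_mac(value)
        elif canonical == "hostname":
            value = value.lower()
        if value:
            out[canonical] = value
    return out


record = normalize_record(
    {"Physical Address": "00-1A-2B-3C-4D-5E", "Host Name": "WEB-01 ", "Vendor Notes": "x"}
)
print(record)  # {'mac_address': '00:1a:2b:3c:4d:5e', 'hostname': 'web-01'}
```

Normalizing values matters as much as normalizing labels: two sources can agree that a field is a MAC address and still fail to match if one writes it with dashes and the other with colons.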
Technique #2: A Preponderance of Comparison Metrics
As mentioned above, it isn’t feasible to normalize every single field from every single data source. But it is important to consider a fairly sizable number of these fields as part of the variable comparison metrics.
Correlation needs enough common data points to confirm that the device from one data source is indeed the same device from another data source. The more of these comparison points a platform can incorporate into the normalization framework and correlation algorithm, the better the deduplication results.
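A simple way to picture this is a weighted comparison across several fields. The sketch below is one possible approach, not the algorithm any particular platform uses. The field names and weights are hypothetical and would need tuning against real data.

```python
# Hypothetical weights: how strongly agreement on a field suggests the same device.
WEIGHTS = {"serial_number": 5.0, "mac_address": 4.0, "hostname": 2.0, "ip_address": 1.0}


def match_score(a, b, weights=WEIGHTS):
    """Return (score, compared) over the fields that both records populate.

    Agreement adds the field's weight and disagreement subtracts it.
    A field missing on either side counts as no evidence in either direction.
    """
    score = 0.0
    compared = 0
    for field, weight in weights.items():
        value_a, value_b = a.get(field), b.get(field)
        if not value_a or not value_b:
            continue
        compared += 1
        score += weight if value_a == value_b else -weight
    return score, compared


edr = {"hostname": "web-01", "mac_address": "00:1a:2b:3c:4d:5e", "ip_address": "10.0.0.5"}
dhcp = {"hostname": "web-01", "mac_address": "00:1a:2b:3c:4d:5e", "ip_address": "10.0.0.9"}

print(match_score(edr, dhcp))  # (5.0, 3)
```

Returning the number of compared fields alongside the score is deliberate: a high score built on a single shared field deserves less trust than the same score built on three or four.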
Technique #3: Data Source Diversity
The criticality of data source diversity can’t be overstated. Data source diversity statistically improves the density of correlation points that are available to the deduplication algorithm.
Using a varied array of data source categories not only provides a wealth of context, it also ensures the requisite overlap of common fields needed for tight correlation. It is within this very overlap that we reach the level of statistical confidence needed to match up devices from a range of disparate sources. Achieving a high level of confidence ultimately means that the correlation engine does not combine device data that are not a match.
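The effect of overlap on confidence can be sketched with a conservative merge rule: only combine two records when they agree on at least two key fields and conflict on none. This is an illustrative sketch under assumptions, not a production algorithm. The key fields, the threshold, and the sample records are hypothetical, and the transitive grouping used here is a design choice that can over-merge on messy data.

```python
KEY_FIELDS = ("serial_number", "mac_address", "hostname")


def agreement(a, b):
    """Count key fields, populated on both sides, that agree and that conflict."""
    agree = conflict = 0
    for field in KEY_FIELDS:
        value_a, value_b = a.get(field), b.get(field)
        if value_a and value_b:
            if value_a == value_b:
                agree += 1
            else:
                conflict += 1
    return agree, conflict


def correlate(records, min_agree=2):
    """Group record indexes that are linked by confident pairwise matches."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            agree, conflict = agreement(records[i], records[j])
            if agree >= min_agree and conflict == 0:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


records = [
    {"hostname": "web-01", "serial_number": "SN123"},                                      # 0: CMDB
    {"hostname": "web-01", "serial_number": "SN123", "mac_address": "00:1a:2b:3c:4d:5e"},  # 1: EDR
    {"hostname": "web-01"},                                                                # 2: scanner
    {"hostname": "web-01", "mac_address": "00:1a:2b:3c:4d:5e"},                            # 3: DHCP
]

print(correlate(records))  # [[0, 1, 3], [2]]
```

The scanner record shares only a hostname with the others, so it stays on its own rather than being merged on thin evidence. The DHCP record overlaps with the CMDB record on just one field, yet it still lands in the same group because the EDR record overlaps with both. That bridging role is what a diverse set of sources buys you.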
Correlation & Deduplication: An Imperfect Science
Even when the above techniques are combined with other tricks of the trade, miscorrelations will always occur with an out-of-the-box solution. Every customer will have different data situations. Even customers with the exact same data sources will present different correlation challenges because of the differences in field population between companies.
One hundred percent accuracy is not possible with technology alone. Regardless of the techniques leveraged, big data collisions will occur and edge cases will still be prevalent, leaving some small percentage of devices miscorrelated. It isn’t possible to avoid this situation, because technology simply cannot account for every single possibility. This is why achieving a complete, credible and unique inventory has been so elusive until now.