Data Quality in Discovery

By: admin
Date: Dec 18, 2020
Category: General

The quality of the data that comes out of a discovery process matters for everything done with that data. Bad data leads to bad decisions and to poor acceptance of the systems that use it. The examples in our earlier articles and videos on CMDB integration and on security processes and operations repeatedly stressed this point.

But what is data quality really? Which qualities matter, and what do they mean in practice? In this article and video we take a closer look at why data quality is needed, at its dimensions, and at concrete examples of each.

https://youtu.be/ngRn1lZza2E

Importance of Data Quality

The users of a discovery system are other tools: first and foremost a CMDB, but also IT and security monitoring tools and more. These clients rely on the discovery having done a good job of identifying, classifying and relating resources in its object model. They cannot easily fix defects in the discovery data, and where they can, it takes considerable and often manual effort. This clearly needs to be avoided. Beyond functionality, therefore, the quality of the data provided by a discovery tool deserves major attention.

Quality Dimensions

Software is often judged mainly by the functionality of its components. But its qualities are equally important. Quality can be defined as:

“Quality is the standard of something as measured against other things of a similar kind; the degree of excellence of something.”
Source: Oxford Languages

Qualities can be grouped in a quality tree; for software quality, for example, ISO 25010 defines one. Such a tree helps to organize the different kinds of qualities into categories.
In this article we look at the quality of data rather than software, and we focus only on the principal categories as they apply to network discovery.

The following six qualities are essential for discovery:

Timeliness
Accuracy
Completeness
Consistency
Validity
Uniqueness

We will have a look at each quality in the following sections.

Timeliness

The timeliness quality answers the question “Is the data available when you need it?”. For discovery, this means that the information in the data model must always be up to date and accessible. Being up to date means that discovery needs to scan the assets in the managed environment continuously, or at least as often as is feasible.

For this to work, discovery needs to be scheduled often and run fast while minimizing its impact on the discovered environment. Satisfying this quality requires an efficient and intelligent discovery process.

A high level of parallelism discovers more devices and details per hour, but it must not exhaust the discovery server, so it needs to be tunable. Different network zones require different levels of freshness, so the flexibility to use separate discovery jobs for them is critical. The schedule should be configurable to the needs of the business and should also allow blackout times, e.g. when a backup already saturates the network.
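To make the scheduling idea concrete, here is a minimal Python sketch of a per-zone freshness check with a blackout window. The zone names, the maximum ages, the blackout hours and the function names are all assumptions made for illustration; they do not describe how any particular discovery tool is configured.

```python
from datetime import datetime, time, timedelta

# Hypothetical per-zone freshness targets; the values are illustrative only.
MAX_AGE = {"dmz": timedelta(hours=1), "office": timedelta(hours=24)}

# Hypothetical blackout window, e.g. while a nightly backup runs.
BLACKOUT_START = time(1, 0)
BLACKOUT_END = time(3, 0)


def in_blackout(now):
    """Return True if `now` falls inside the blackout window."""
    return BLACKOUT_START <= now.time() < BLACKOUT_END


def scan_due(zone, last_scan, now):
    """A zone is due when its data is older than its target and no blackout applies."""
    stale = now - last_scan > MAX_AGE[zone]
    return stale and not in_blackout(now)
```

With these assumed values, a DMZ last scanned two hours ago is due for a new scan at noon, while an office network scanned at the same time is not. Nothing is started between 01:00 and 03:00.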

Accuracy

Accuracy is the quality behind the question “How well does a piece of data reflect reality?” In the case of discovery: is the data in the discovery model consistent with the state and details of the real assets? Discovery should not produce invalid or wrong information, as many other systems, such as CMDBs, depend on it. Discovery is the source of this data, and high accuracy at the source is the best way to achieve high quality in the end.

A discovery tool should be able to assess its own data quality and give the user recommendations and explanations so that they can help improve the information. Examples are adding more credentials for protocols or systems, or modifying some configuration settings of a device.

Completeness

Completeness relates to the question “Does it fulfill your expectations of what is comprehensive?”. Transparency is critical in two respects: the completeness of the list of systems on the one hand, and the amount of detail discovered for each device on the other.

Otherwise, as we have seen in https://blog.jdisc.com/2020/12/04/discovery-for-operational-security-audits/, an asset that discovery misses can have a significant impact. A discovery tool should therefore offer usable diagnostic and troubleshooting tools to find unidentified devices, problems with protocols or device access during discovery, parsing errors, and even duplicate device names.
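A simple completeness report could look like the following Python sketch. It assumes that a list of expected addresses is available from some other source (for example an address plan) and that the discovery result maps each address to a device name, with None for devices that could not be identified. The function name and the data layout are hypothetical.

```python
from collections import Counter


def coverage_report(expected, discovered):
    """Compare discovered devices against a set of expected addresses.

    `expected` is a set of IP address strings; `discovered` maps IP address
    strings to device names (None if the device could not be identified).
    """
    found = expected & set(discovered)
    names = Counter(name for name in discovered.values() if name)
    return {
        # Share of expected addresses that discovery actually found.
        "coverage": len(found) / len(expected) if expected else 1.0,
        "missing": sorted(expected - found),
        "unidentified": sorted(ip for ip, name in discovered.items() if name is None),
        "duplicate_names": sorted(n for n, count in names.items() if count > 1),
    }
```

The report covers both aspects of completeness in a rough way: the list of systems (coverage, missing) and the detail per device (unidentified, duplicate names). Note that the addresses are sorted as plain strings, which is good enough for a sketch but not numerically correct.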

Consistency

The consistency dimension answers the question “Does data stored in one place match relevant data stored elsewhere?”. A discovery service is the basis for a CMDB, so the information in both should be consistent. The same holds for information discovered by two discovery servers, e.g. when a firewall separates a DMZ from the enterprise intranet.
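As an illustration, a consistency check between two copies of the same record can be as small as the following Python sketch. The field names and the idea of comparing plain dictionaries are assumptions for the example, not a description of a real CMDB interface.

```python
def field_mismatches(record_a, record_b, fields):
    """Return {field: (value_a, value_b)} for every field whose values differ."""
    return {
        field: (record_a.get(field), record_b.get(field))
        for field in fields
        if record_a.get(field) != record_b.get(field)
    }


# Hypothetical records for the same device in discovery and in the CMDB.
discovery = {"serial": "ABC123", "os": "Debian 10", "ram_gb": 16}
cmdb = {"serial": "ABC123", "os": "Debian 9", "ram_gb": 16}
mismatches = field_mismatches(discovery, cmdb, ["serial", "os", "ram_gb"])
```

Here only the operating system differs, so `mismatches` contains a single entry for "os". A field that is missing on one side shows up as None in the result.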

Validity

Validity answers the question “Is the information in a specific format and does it follow business rules, or is it in an unusable format?”. For a discovery tool, this concerns how it stores its information and how users and external systems can access the data.

The object model of a discovery tool should represent its information in a consistent way so that other solutions can rely on its information quality.
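What a validity check means in practice can be sketched in a few lines of Python. The record layout (name, ip, mac) and the rule that MAC addresses are stored in lowercase colon-separated form are assumptions chosen for the example; a real object model will have its own rules.

```python
import ipaddress
import re

# Assumed storage format for MAC addresses: aa:bb:cc:dd:ee:ff (lowercase).
MAC_PATTERN = re.compile(r"^([0-9a-f]{2}:){5}[0-9a-f]{2}$")


def validation_errors(device):
    """Return a list of human-readable problems found in a device record."""
    errors = []
    if not device.get("name"):
        errors.append("name is missing")
    try:
        # ip_address() raises ValueError for anything that is not IPv4 or IPv6.
        ipaddress.ip_address(device.get("ip") or "")
    except ValueError:
        errors.append("ip is not a valid address")
    if not MAC_PATTERN.match(device.get("mac") or ""):
        errors.append("mac is not in aa:bb:cc:dd:ee:ff form")
    return errors
```

An empty list means the record follows the assumed rules. Returning messages instead of a plain True or False makes it easier for a consumer to see why a record was rejected.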

Uniqueness

The uniqueness quality dimension answers the question “Is this the only instance in which this information appears in the database?”. This might sound like a strange question, but it is very important and relates to the process of normalization.

In the real world, the same asset is captured by multiple systems and each system can identify the asset and its properties with a slightly different identifier or name. For a human being this is usually not a problem as our brain automatically recognizes that both assets are indeed the same. A computer is not able to do this as easily as we do.
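The following Python sketch shows the basic idea of normalization. It assumes that the MAC address is a good enough key to recognize the same asset and that host names only differ in case and domain suffix. Both are simplifications made for the example; a real normalization process has to weigh many more identifiers and heuristics.

```python
import re


def normalize_mac(mac):
    """Reduce a MAC address to 12 lowercase hex digits, ignoring separators."""
    return re.sub(r"[^0-9a-f]", "", mac.lower())


def normalize_hostname(name):
    """Lowercase a host name and drop any domain suffix."""
    return name.strip().lower().split(".")[0]


def deduplicate(records):
    """Merge records that share a normalized MAC address."""
    merged = {}
    for record in records:
        clean = dict(
            record,
            mac=normalize_mac(record["mac"]),
            name=normalize_hostname(record["name"]),
        )
        # The first value seen for a field wins; later records only fill gaps.
        target = merged.setdefault(clean["mac"], {})
        for field, value in clean.items():
            target.setdefault(field, value)
    return list(merged.values())
```

With this sketch, "WEB01.example.com" with MAC "00-1A-2B-3C-4D-5E" and "web01" with MAC "00:1a:2b:3c:4d:5e" end up as one record instead of two.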

JDisc Whitepaper

A white paper on data quality in network discovery, with more details and concrete examples, has also been available for download on the JDisc documentation page for some time.

Summary

Six dimensions determine data quality, and data quality is the critical property of a discovery service. JDisc Discovery therefore puts an exceptionally high focus on this aspect. It offers a fast and scalable discovery process, the highest accuracy thanks to built-in heuristics for a heterogeneous world, measured and visualized data quality indicators, diagnostic and troubleshooting tools to improve accuracy at the source, a concise object model, and a sophisticated normalization process to ensure uniqueness.

JDisc Discovery is the quality network discovery solution your business needs.
