We are living in a data-driven economy, where enterprises increasingly collect and process large volumes of data in order to optimize their business processes and drive their decision making. Moreover, the millions of enterprises and the billions of internet users worldwide produce mind-boggling amounts of data. This explains why BigData is steadily one of the most trending technologies of our time. To the average reader the term BigData signifies very large data volumes, yet there is no strict threshold above which data are classified as BigData. Rather, BigData applications are defined as those that handle very large data volumes, which far exceed the capacity and capabilities of conventional database systems. Moreover, BigData systems are usually characterized by their ability to handle data from a great variety of heterogeneous sources, while also dealing with streaming data that arrive at very high ingestion rates. Volume, Variety and Velocity are three of the most prominent characteristics of BigData applications, commonly referred to as the 3Vs of BigData. These 3Vs were originally introduced in a 2001 research note by analyst Doug Laney (then at META Group, which was later acquired by Gartner) and are still the three most commonly used properties for characterizing BigData.
Over the years, there has been some inflation in the number of Vs used to describe and characterize BigData applications. For example, a fourth V that usually accompanies BigData descriptions is “Veracity”, which refers to the fact that BigData applications deal with datasets that are uncertain, imprecise and difficult to trust. Likewise, several BigData experts have introduced “Value” as a core element of BigData datasets. Value refers to the business value of the BigData application and is considered a prerequisite for all non-trivial enterprise-scale applications. If you bet that 5Vs are enough to describe BigData, you would probably lose, as many researchers and practitioners have recently introduced even more Vs to characterize BigData applications. The 15 most popular Vs of BigData are as follows:
- Volume: This V refers to the amount of data processed as part of the application in a given time frame, so it is not associated with a single numerical threshold. Applications that process large amounts of information in quite short timescales are also characterized as BigData. For example, processing matrices with billions of rows and billions of columns in very short timescales (e.g., a few seconds) is a BigData operation. Likewise, batch operations that process many petabytes overnight can be classified as BigData too.
- Velocity: The Velocity of a data application is reflected in the speed of the data ingested into the system, as well as the speed of the transformed data produced as output from it. For instance, the numerous data streams that stem from an autonomous vehicle are characterized by high velocity. In order to successfully deal with high-Velocity streams, responsive and effective systems that offer ultra-low latency are needed.
- Variety: Variety is the term used to characterize the large number and great heterogeneity of the data sources handled in the scope of a BigData application. This means dealing with different formats, different ingestion rates of data streams, different connectors to databases and datastores, as well as handling both data in motion (i.e. streams) and data at rest. It also implies processing of structured, unstructured and semi-structured data. BigData developers and deployers usually have a hard time processing and analyzing data of great variety, as they need to do a lot of preprocessing and harmonization work in order to collect and consolidate data from the various sources. This preprocessing is usually more labor intensive and time consuming than the data analysis itself.
- Veracity: Veracity refers to the reliability and trustworthiness of the data sources that provide data in a given context, which also provides some insight into how meaningful it is to analyze these data. The trustworthiness of a dataset hinges on various factors, such as the methodology of the data collection, the statistical properties of the data, whether some “noise” was present in the data and more. A key characteristic of the Veracity property is that it tends to drop as the other Vs (e.g., Variety, Volume) increase.
- Value: In order for a BigData application to be meaningful, it has to deliver business value to the customer. Value can be reflected in cost-efficiency (e.g., cost savings), but also in the ability to do things that were not possible without BigData. As outlined above, the Value of BigData typically translates to more efficient business processes, as well as improved decision making.
- Viability: This V refers to the data model used for capturing and analyzing BigData. BigData sets are said to be viable if they constitute good representations of the real world. In most cases, the data models used in BigData applications represent a small subset of the real world (i.e. a “mini-world”) as a means of accelerating calculations and ensuring the computational viability of the problem to be solved.
- Validity: This property is similar to Veracity in that it characterizes the accuracy and correctness of the data for its target use. As such, it is one more measure of the data processing and data cleansing effort that will be required prior to analyzing datasets. A high degree of validity for a dataset is an indication of good data governance practices, including regular quality audits.
- Vulnerability: The Vulnerability of a BigData dataset indicates whether it is susceptible to data breaches and cyber security attacks. The recent Cambridge Analytica scandal involving Facebook data is only the tip of the iceberg, as many other providers of BigData have been attacked over the years. Hence, vulnerability is one of the BigData properties that you certainly want to minimize.
- Variability: Variability may sound like Variety, but it is not actually used to characterize the diversity of the data sources. Rather, it refers to the consistency of the datasets. High variability means that datasets have several inconsistencies, which are in practice reflected in the presence of many outliers and anomalies. Likewise, variability can also be used to indicate large inconsistencies in the ingestion rates of the streaming sources that may feed the BigData application.
- Volatility: This V is associated with the life cycle of the datasets, including the data storage and archival policies that datasets are subject to. Volatility is about considering how much historical data you need to process in order to establish proper rules for data currency and availability. Storage is currently very cheap, but the volume of BigData may be a good reason for specifying policies under which some of the older data (e.g., data with low business value and use) eventually get deleted.
- Vocabulary: Vocabulary refers to the semantics and ontologies that should accompany a BigData set to enable its effective analysis. The quality of a dataset’s vocabulary is key to developing smart applications, such as those powered by Artificial Intelligence. It should capture the structural relationships between the entities that comprise a dataset. Conversely, datasets that are not accompanied by a proper vocabulary are more difficult to mine and more susceptible to biased processing.
- Visualization: At the end of the day, BigData datasets and the outcomes of their analysis should be properly visualized and presented to business users. In this context, the Visualization V denotes how challenging it is to visualize the datasets of a BigData system. Visualizing very large datasets is very challenging and requires novel models for data representation and, in several cases, novel diagrams and graphs for the ultimate presentation of the data.
- Virality: This V comes from the word “viral”. Its purpose is to denote whether and how data spread among other users and applications. As many BigData applications and datasets are connected to the internet, their spreading power can be indicative of both their popularity and their usefulness.
- Viscosity: Viscosity provides a measure of a data scientist’s difficulty in working with a particular dataset. It is closely related to other Vs, such as Velocity and Veracity: the higher the Velocity and the lower the Veracity of a dataset, the more difficult it will be to analyze it.
- Vagueness: Vagueness reflects the difficulty in deriving meaningful (rather than vague) patterns and correlations from a dataset. In most cases it’s really straightforward to identify some patterns and rules in a dataset. However, it is much more challenging to interpret them and to avoid spurious and problematic patterns of knowledge. Vagueness is a measure of our ability to derive meaningful patterns of knowledge that we can later understand and interpret.
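To make the preprocessing and harmonization work behind Variety a bit more concrete, here is a minimal Python sketch that consolidates records arriving in two different formats (a CSV export and a JSON feed) into one common schema. The field names and sample data are purely hypothetical; a real pipeline would add validation and error handling:

```python
import csv
import io
import json

# Hypothetical records about the same entities, arriving in two formats
# with different field names (an assumption for illustration only).
csv_data = "user_id,signup\n42,2021-03-01\n43,2021-03-02\n"
json_data = '[{"id": 42, "signup_date": "2021-03-01"}, {"id": 44, "signup_date": "2021-03-05"}]'

def from_csv(text):
    # Map the CSV column names onto a common schema.
    for row in csv.DictReader(io.StringIO(text)):
        yield {"user_id": int(row["user_id"]), "signup": row["signup"]}

def from_json(text):
    # Map the JSON field names onto the same schema.
    for obj in json.loads(text):
        yield {"user_id": obj["id"], "signup": obj["signup_date"]}

# Consolidate both sources, de-duplicating on user_id.
records = {}
for rec in list(from_csv(csv_data)) + list(from_json(json_data)):
    records.setdefault(rec["user_id"], rec)

print(sorted(records))  # → [42, 43, 44]
```

Even in this toy example, the mapping code outweighs the "analysis" (a simple de-duplication), which mirrors the observation above that harmonization usually dominates the effort.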
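Likewise, the inconsistencies that high Variability implies are often spotted with simple statistical checks. The sketch below flags outliers using z-scores; the threshold and sample readings are assumptions, and production systems typically prefer more robust methods (e.g., median-based scores):

```python
import statistics

def flag_outliers(values, z_threshold=3.0):
    """Return the values whose z-score exceeds the threshold.

    A minimal illustration of spotting the outliers and anomalies
    that high Variability implies; not a production-grade detector.
    """
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical: nothing can be an outlier
    return [v for v in values if abs(v - mean) / stdev > z_threshold]

# Hypothetical sensor readings with one anomalous spike.
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 55.0]
print(flag_outliers(readings, z_threshold=2.0))  # → [55.0]
```

A dataset where this kind of check fires frequently is a strong hint that extra cleansing (and hence Validity work) will be needed before analysis.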
These 15 Vs describe some of the most commonly used properties of BigData, i.e. the properties that matter the most. However, the list of Vs is not exhaustive, as other sources and articles list up to 50 different Vs. In your next BigData application it’s probably worth thinking about the Vs of your datasets, and about the properties that facilitate or hinder their successful processing and integration into applications with significant business value.