Big Data

Big Data refers to a very large volume of heterogeneous data that cannot be handled by traditional data mining and processing techniques or software. New sources of alternative data include the Web (news, press releases, blogs), mobile devices and social media (texts, posts, audio, video), satellites and drones (pictures), private company information (financial statements, job postings), and the Internet of Things (sensor signals, traffic information, and the transportation of goods), to name just a few.

In particular, Big Data are commonly characterized by the four V’s:

1) High Volume. There is no particular size threshold that defines Big Data: it depends on the kind of data, the processing technology used, and the question the data scientist is trying to answer. Still, Big Data is generally understood to involve at least multiple terabytes or petabytes.

2) High Variety (structured, unstructured, semi-structured) and, hence, Variability (or complexity).

Structured data can usually be identified by an exact location in a file (such as the tabular data found in a relational database or spreadsheet) and are easily retrieved with simple algorithms. Unstructured data are stored in non-uniform formats (audio, video, texts, emails, images) and require complex analytical skills and advanced technology to process. Semi-structured data sit in between; a common example is XML (Extensible Markup Language), a text-based markup language used to exchange data on the web, whose user-defined tags make the data machine-readable.

3) High Velocity, the speed at which data are generated and analyzed. Speed also matters in the decision making based on the data we have or wish we had: the value added by ‘alternative’ data quickly evaporates once competitors gain access to it as well.

4) Uncertain Veracity. Much Big Data can be unreliable, such as customer sentiment expressed on social media, which is uncertain by nature because it entails human judgement.
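
To make the Variety point more concrete, the difference between structured and semi-structured data can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the sample data, the field names (ticker, price, revenue, note) and the two helper functions are all invented for this example and do not come from any real data source.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented purely for illustration.
CSV_TEXT = "ticker,price\nABC,10.5\nXYZ,20.0\n"
XML_TEXT = """<filings>
  <filing ticker="ABC"><revenue>120</revenue></filing>
  <filing ticker="XYZ"><revenue>95</revenue><note>restated</note></filing>
</filings>"""


def read_structured(text):
    """Structured data: every value sits at a known row and column."""
    rows = csv.DictReader(io.StringIO(text))
    return {row["ticker"]: float(row["price"]) for row in rows}


def read_semi_structured(text):
    """Semi-structured data: tags name each value, but records may differ in shape."""
    root = ET.fromstring(text)
    records = []
    for filing in root.findall("filing"):
        record = {"ticker": filing.get("ticker")}
        # Collect whatever child tags this particular record happens to carry.
        for child in filing:
            record[child.tag] = child.text
        records.append(record)
    return records


prices = read_structured(CSV_TEXT)
filings = read_semi_structured(XML_TEXT)
```

In the tabular case every record has the same columns, so a value can be looked up by position. In the XML case the second record carries an extra note tag that the first one lacks, and the reader has to rely on the tags rather than on a fixed layout. Unstructured data (free text, audio, images) offers neither a fixed layout nor tags, which is why it calls for more advanced techniques.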