The volume of data stored on our planet, collected by a wide range of devices and sensors, backed up in the "cloud" or shared on the Internet, has reached a mind-boggling 33 zettabytes and is expected to grow five-fold by 2025. Its value lies in the fact that it can be used for modeling, prediction and decision-making.
A data point is a record of an observation of the world around us. As a representation of that observation, it can be encoded and stored, notably in digital form. Once data is accessible, it can be processed and analyzed. Statistics models data as the realization of a random variable, paving the way for its use by machine learning algorithms, which are vital components of intelligent systems.
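To make the statistical framing concrete, here is a minimal sketch, not taken from the presentation, in which each data point is treated as the realization of a random variable (a noisy sensor reading, an assumption chosen purely for illustration) and a simple "learning" step recovers the hidden parameters from the observations:

```python
import random
import statistics

# Assumption for illustration: each data point is a noisy temperature
# reading, modeled as a realization of a Gaussian random variable
# with true mean 20.0 and standard deviation 2.0.
random.seed(0)
TRUE_MEAN, TRUE_STD = 20.0, 2.0
data = [random.gauss(TRUE_MEAN, TRUE_STD) for _ in range(10_000)]

# A learning algorithm, in its simplest form, estimates the hidden
# parameters of that random variable from the observed realizations.
estimated_mean = statistics.fmean(data)
estimated_std = statistics.stdev(data)
```

With 10,000 observations the estimates land very close to the true parameters, which is the basic reason large datasets are so valuable to learning algorithms.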
Big data first emerged with the rise of the Internet, search engines and web-page indexing, which brought the need to process complex, non-relational data (beyond SQL) and to scale computation (the MapReduce model). More recently, three developments have come together to once again pave the way for artificial intelligence, heralding a second step change: the availability of huge annotated databases, a significant increase in computing power and progress in machine learning algorithms, especially deep learning.
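The MapReduce model mentioned above can be sketched in a few lines. This is an illustrative toy version (word counting, the classic example), not the original distributed implementation: a map step emits key/value pairs, a shuffle groups them by key, and a reduce step aggregates each group:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document: str):
    # Map: emit a (word, 1) pair for every word in one document.
    return [(word, 1) for word in document.lower().split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each group (here, sum the counts).
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data needs scaling", "data drives learning"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
counts = reduce_phase(shuffle(pairs))  # counts["data"] == 2
```

Because each map call touches a single document and each reduce call a single key, both phases can be spread across many machines, which is what made web-scale data processing tractable.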
This second step change, which turns data into the raw material essential to the design of intelligent systems, presents a number of challenges: the direct influence of data quality on a learning algorithm's outputs, fairness and privacy during data collection and processing, and the reliability of the data itself all need to be addressed.
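The influence of data quality on a learning algorithm's output can be shown with a toy experiment of my own construction (not from the talk): estimating a mean from clean sensor readings versus the same readings with roughly 10% of values corrupted by a +50 offset:

```python
import random
import statistics

# Assumed setup for illustration: 1,000 Gaussian sensor readings
# around a true value of 20.0, then a copy in which about 10% of
# readings are corrupted by a large positive fault (+50.0).
random.seed(1)
clean = [random.gauss(20.0, 2.0) for _ in range(1_000)]
corrupted = [x if random.random() > 0.10 else x + 50.0 for x in clean]

clean_estimate = statistics.fmean(clean)
corrupted_estimate = statistics.fmean(corrupted)
# The corrupted dataset biases the estimate upward by roughly +5,
# even though 90% of the data is untouched.
```

A small fraction of bad data shifts the output noticeably, which is why data quality has to be treated as a first-class concern rather than an afterthought.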
This text summarizes Florence d'Alché-Buc's presentation at the ENGIE "Réveil Digital" event on 5 June 2019.