How to speed up your AI projects
Correct orchestration of the data pipeline determines how fast data can be processed in AI projects.
IT managers should be aware that the success of AI projects depends on the scope and quality of the data. Many AI applications require large amounts of data to develop robust models. Only when data is available in sufficient quantity - and in usable quality - can the AI models be trained on which predictions, assessments and recommendations are then based.
Basically, AI projects, like other data-centric projects, are confronted with the three Vs: volume, velocity and variety. The more extensive the data set (volume), the better the models generally become. These enormous amounts of data must be moved as quickly and efficiently as possible between CPU, storage and other components (velocity). And this data is typically no longer held in simple relational databases, but arrives in a wide variety of formats: from structured data from transactions, sensors or Bluetooth signals to unstructured data such as text, images or video (variety).
The three Vs are a challenge in themselves. In AI projects, the difficulty is compounded by the fact that each of the three dimensions changes dramatically at each stage of data processing. The amount of data collected initially may be in the petabyte range, while only gigabytes flow into the actual training, and the finished model may comprise only a fraction of that.
Orchestration is what matters
In addition, the read and write requirements differ greatly between the phases: data collection is 100 percent writes, preparation and training are a 50/50 read/write mix, and analysis is 100 percent reads.
The solution to the problem of the three Vs is to orchestrate the data pipeline according to these requirements. A data pipeline processes the data in a sequence of interconnected processing steps across several phases. Gartner stressed in a recent report how important it is to understand these phases: "The success of AI and machine learning initiatives depends on the orchestration of effective data pipelines that provide high-quality data in the right formats during the various stages of the AI pipeline in a timely manner."
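To make the idea of orchestration more concrete, here is a minimal Python sketch of a pipeline as a chain of phase functions. The phase names, the simple runner and the placeholder logic are assumptions for illustration, not a specific orchestration framework.

```python
from typing import Callable, Iterable

def run_pipeline(data, phases: Iterable[Callable]):
    """Pass the data through each phase in order and return the result."""
    for phase in phases:
        data = phase(data)
    return data

# Placeholder phases; in practice each has very different I/O characteristics
# (ingest: sequential writes, prepare: mixed read/write, train: heavy reads).
ingest = lambda sources: [record for source in sources for record in source]
prepare = lambda records: [r for r in records if r is not None]
train = lambda records: {"model": "fitted", "n_samples": len(records)}

model = run_pipeline([[1, 2, None], [3]], [ingest, prepare, train])
print(model)  # {'model': 'fitted', 'n_samples': 3}
```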
The data pipeline starts with the collection of the data (ingest phase).
The four phases of the data pipeline
In this first phase, data from a variety of different sources must be compiled and managed. The scope of the collected data can vary as much as its format. In this phase, storage sees almost exclusively sequential write operations.
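A minimal sketch of such an ingest step is shown below, assuming newline-delimited JSON records appended sequentially to a local file; the file name and record structure are purely illustrative.

```python
import json

def ingest_records(records, path="raw_data.jsonl"):
    """Append incoming records sequentially to an ingest file.

    This mirrors the access pattern of the ingest phase:
    almost exclusively sequential writes, no random reads.
    """
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

# Example: collect readings from two hypothetical sources.
ingest_records([
    {"source": "sensor-a", "value": 21.5},
    {"source": "sensor-b", "value": 19.8},
])
```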
In the preparation phase, data must be labeled, compressed, deduplicated, transformed and cleaned. Missing or incomplete data is enriched and inconsistencies are eliminated. This is an iterative process with varying amounts of data that must be read and written both randomly and sequentially.
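As an illustration, a minimal preparation step with pandas might look like the sketch below. It assumes the raw ingest file from the previous sketch exists and that it contains a numeric "value" column; both are assumptions, not part of any specific product.

```python
import pandas as pd

# Read the hypothetical raw ingest file.
df = pd.read_json("raw_data.jsonl", lines=True)

# Typical preparation steps: deduplicate, enrich missing values, transform.
df = df.drop_duplicates()
df["value"] = df["value"].fillna(df["value"].mean())
df["value_norm"] = (df["value"] - df["value"].min()) / (
    df["value"].max() - df["value"].min()
)

# Write the cleaned data back for the training phase.
df.to_csv("prepared_data.csv", index=False)
```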
For the training - the third phase of the AI pipeline - the data sets must be moved in their entirety. This step is extremely resource-intensive and involves repeatedly executing mathematical functions on the prepared data until the desired results are achieved. The storage requirements are correspondingly high; for more complex models such as deep learning models with many layers, they are higher still. To make this process as efficient as possible, compute utilization should be optimized.
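The following sketch illustrates this repeated execution of mathematical functions with plain gradient descent in NumPy on synthetic data. It stands in for the far heavier workloads of real model training; the data, learning rate and output file name are illustrative assumptions.

```python
import numpy as np

# Synthetic "prepared" data: features X and targets y (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=10_000)

w = np.zeros(8)
lr = 0.1

# Training means running the same mathematical operations over the data
# again and again -- this loop is what makes the phase compute-intensive.
for epoch in range(100):
    predictions = X @ w
    gradient = 2 * X.T @ (predictions - y) / len(y)
    w -= lr * gradient

print("learned weights:", np.round(w, 3))
np.save("model_weights.npy", w)  # persist the trained model for inferencing
```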
Finally, in inferencing, the trained model is used in live operation to make decisions. It can run in the data center or on site on edge devices. In this phase, the trained model is read from storage into the CPU, where the current data to be evaluated is written. The derived results are then fed back into the training component to improve accuracy. Real-time edge deployments therefore place even higher demands on performance.
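A minimal sketch of this inferencing step, assuming the weight file produced by the training sketch above: the model is read from storage, newly arriving data is scored, and the results are logged so they can flow back into the next training round. The file names and the simple feedback log are hypothetical simplifications.

```python
import json
import numpy as np

# Read the trained model from storage -- here just a weight vector.
w = np.load("model_weights.npy")

def infer(sample):
    """Score a newly arriving sample and return a decision."""
    score = float(sample @ w)
    return {"score": score, "accept": score > 0.0}

# Newly written "current" data is evaluated as it arrives.
new_sample = np.random.default_rng(1).normal(size=w.shape[0])
result = infer(new_sample)

# Collect results so they can be fed back to the training component.
with open("feedback_log.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(result) + "\n")
```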
Storage optimization in every phase
At all stages, the combination of Intel's storage technologies and Intel Optane technology can help speed up these processes enormously. Optane technology offers high performance and low latency for fast storage and caching. 3D NAND SSDs, in turn, consolidate storage, scale with growing capacity requirements and speed up access.
Optane plays a crucial role in all phases of the data pipeline.
- When collecting data, AI workloads benefit from the high write performance and low latency of Intel Optane technology.
- The preparation phase consumes up to 80 percent of AI resources. Storage devices such as Optane SSDs with low latency, high quality of service and high throughput are therefore important to reduce preparation time. Optane also offers a balanced mix of read and write performance, which further shortens data preparation.
- During training, the high random read throughput and low latency of Optane SSDs allow critical training resources to be used optimally. In addition, Optane SSDs can accelerate the processing of temporary data during data modeling.
- Optane also pays off in the practical use of the models - inferencing. Optane technology makes inferencing much faster, which is especially valuable for trained models at the edge.
Numerous companies are now using Optane technology in AI and analytics environments. The US healthcare provider Montefiore in the Bronx, for example, has developed an AI solution with its patient-centered analytical machine learning platform that can tap into multiple data stores, regardless of where the information is located or how it is structured. The decisive building block here is a pair of 375 GB Intel Optane SSDs. They not only provide fast storage, but are also configured to expand the system memory. According to the IT managers, the system thus has the potential to become a "real powerhouse" for the data.