Artificial Intelligence Technology


These attention weightings let the model capture semantics and long-range dependencies in language. During inferencing, the model uses the token weightings to connect tokens and produce, one token at a time, a coherent response to a prompt.


Fig 1: A vision language model identifying multiple elements within a video feed


Keeping data on the device reduces its exposure to network and cloud vulnerabilities. On-device processing also makes it easier to comply with local laws concerning data sovereignty and privacy that may be affected by cloud hosting in a different territory. With applications such as crowd or traffic control in public places, AI agents at the edge can ensure that personally identifiable information is only transmitted where necessary, such as registration plates when a vehicle is seen committing a traffic violation.


The costs of communicating, recording and processing large quantities of data provide further reasons for implementing AI in edge devices. While initial hardware investments for edge AI agents might be higher, the long-term operational costs are significantly lower. Cloud AI incurs recurring costs based on data transfer, storage and compute usage, which can escalate rapidly with continuous, high-volume data streams. Edge AI agents, by processing data locally and sending only summarized insights (not raw video) to the cloud, if anything at all, dramatically cut these ongoing expenses, offering a more predictable and often lower total cost of ownership. The agentic nature further amplifies these benefits: these intelligent entities can autonomously manage tasks, optimize resource usage, and make proactive decisions without constant human intervention or cloud-based orchestration. All of that leads to greater efficiency and resilience in disconnected or intermittently connected environments.
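The data-transfer saving can be made concrete with a back-of-the-envelope comparison. The figures below are illustrative assumptions only (a 2 Mbit/s camera stream versus roughly 1 MB of summarized insights per day, at an assumed $0.10 per GB transferred), not quoted tariffs:

```python
# Hypothetical monthly data-transfer cost for one camera.
# All numbers are illustrative assumptions, not real tariffs.
SECONDS_PER_MONTH = 30 * 24 * 3600            # 2,592,000 s

cloud_gb = 0.25e-3 * SECONDS_PER_MONTH        # continuous 2 Mbit/s = 0.25 MB/s stream
edge_gb = 30 * 1e-3                           # ~1 MB/day of summarized insights
RATE = 0.10                                   # assumed $/GB transferred

print(f"cloud: ${cloud_gb * RATE:.2f}/month, edge: ${edge_gb * RATE:.4f}/month")
```

Under these assumptions the raw stream moves roughly 648 GB per month against a few tens of megabytes for the edge agent, which is where the recurring-cost gap comes from.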


Foundation models


Agentic AI will often rely on the reasoning-like capabilities of large language models (LLMs) or foundation models to orchestrate and schedule other AI models. In surveillance and security applications, these models will often be a combination of convolutional neural networks (CNNs) and larger, more capable foundation models. Foundation models rely on the Transformer neural-network structure. Introduced in the 2017 paper “Attention Is All You Need” by a group of computer scientists working at Google, the Transformer is key to training an AI model on the relationships between words in text. Each word roughly equates to a token that is represented as a vector embedding. This embedding maps the token into a vector space with many dimensions, designed so that tokens with similar meanings and usages cluster together. The self-attention mechanism used by the Transformer adds contextual information, weighting each token’s relationships to all the other tokens in the training data. This lets the model capture context across the entire input sequence.
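The weighting step can be sketched in a few lines. Below is a minimal single-head scaled dot-product self-attention over a handful of token embeddings; the random projection matrices stand in for the weights a trained model would learn:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Single-head scaled dot-product self-attention.

    X: (seq_len, d) matrix of token embeddings. The query/key/value
    projections here are random stand-ins; a trained model learns them.
    """
    seq_len, d = X.shape
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Every token's query is scored against every token's key, so the
    # resulting weights encode relationships across the whole sequence.
    weights = softmax(Q @ K.T / np.sqrt(d))   # (seq_len, seq_len), rows sum to 1
    return weights @ V                        # context-enriched token vectors

# Four 8-dimensional token embeddings in, four context-mixed vectors out.
tokens = np.random.default_rng(1).standard_normal((4, 8))
out = self_attention(tokens)
```

Each output row is a mixture of all the value vectors, weighted by how strongly that token attends to every other token, which is exactly the contextual weighting described above.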


Though developed for natural language, the Transformer and the token-based processing it enables have gone much further thanks to their generality. Transformer networks support the transformation of an input sequence of tokens into a different but contextually relevant output sequence. The tokens for each of these streams can come from visual and audio data as much as from text. This versatility has provided the foundation for a wide range of multimedia-oriented models, including vision-language models (VLMs) and vision-language-action (VLA) models.
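To show how non-text data becomes tokens, here is a minimal sketch of the patch-embedding step that vision Transformers use: the image is cut into fixed-size patches, each patch is flattened, and a linear projection (random here, learned in a real VLM) maps it into the embedding space:

```python
import numpy as np

def image_to_tokens(image, patch=4, d_model=16):
    """Turn an image into a sequence of token embeddings, vision-Transformer
    style: split into patches, flatten, linearly project. The projection is
    a random stand-in for learned weights."""
    h, w, c = image.shape
    patches = []
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            patches.append(image[y:y + patch, x:x + patch].reshape(-1))
    patches = np.stack(patches)               # (n_patches, patch*patch*c)
    proj = np.random.default_rng(0).standard_normal((patches.shape[1], d_model))
    return patches @ proj                     # (n_patches, d_model) token sequence

img = np.zeros((16, 16, 3))                   # toy 16x16 RGB frame
vision_tokens = image_to_tokens(img)          # 16 patches -> 16 tokens
```

Once visual input is in this token form, the same attention machinery shown earlier applies to it unchanged, which is what makes the mixed text-image-audio models possible.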


Perception and reasoning

Systems can go further in harnessing the ability of foundation models to make connections between diverse data inputs and generate reasoning-like conclusions. Over the past few years, developers have implemented techniques like chain-of-thought prompting to lead foundation models to greater levels of insight and capabilities on problems in logic and common-sense reasoning. Early applications of chain-of-thought prompting involved high levels of human intervention.
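That human intervention typically meant hand-writing worked examples into the prompt. A minimal sketch, with one invented surveillance-flavoured exemplar (real systems use several):

```python
def chain_of_thought_prompt(question):
    """Build a few-shot chain-of-thought prompt. The worked example is
    hand-written, illustrating the human effort early applications needed."""
    example = (
        "Q: A camera records 30 frames per second. How many frames in 2 minutes?\n"
        "A: Let's think step by step. 2 minutes is 120 seconds. "
        "30 frames/s * 120 s = 3600 frames. The answer is 3600.\n"
    )
    # The trailing cue nudges the model to emit its own reasoning steps.
    return example + f"Q: {question}\nA: Let's think step by step."

prompt = chain_of_thought_prompt(
    "Three vehicles pass every minute; how many pass in an hour?")
```

The exemplar demonstrates the step-by-step format, and the trailing cue asks the model to reproduce it for the new question before committing to an answer.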




If a foundation model is trained to generate chain-of-thought prompts automatically and then follow those prompts, or feed them to a separate model, the result is an AI that can act as an autonomous agent. Such agents can call external tools, such as databases and simulation engines, using API calls, letting the foundation model build potentially complex sequences of actions that respond to changes in real-time inputs.
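The control flow of such an agent can be sketched as a simple loop: the model emits either a tool call or a final answer, and tool results are fed back as observations. Everything below is hypothetical, with a stub standing in for the foundation model and a stubbed tool in place of a real database or camera API:

```python
import json

# Hypothetical tool registry; in a real agent these would be API calls
# to databases, simulation engines, or camera controls.
TOOLS = {
    "count_vehicles": lambda zone: {"zone": zone, "count": 7},  # stubbed result
}

def run_agent(model, task, max_steps=5):
    """Minimal agent loop: the model returns JSON naming either a tool
    call or a final answer; tool results are appended as observations."""
    history = [task]
    for _ in range(max_steps):
        action = json.loads(model(history))
        if action["type"] == "final":
            return action["answer"]
        result = TOOLS[action["tool"]](**action["args"])
        history.append(f"observation: {json.dumps(result)}")
    return None

# Stub model: first requests a tool, then answers from the observation.
def stub_model(history):
    if not any(h.startswith("observation:") for h in history):
        return json.dumps({"type": "tool", "tool": "count_vehicles",
                           "args": {"zone": "A"}})
    return json.dumps({"type": "final", "answer": "7 vehicles in zone A"})

answer = run_agent(stub_model, "How many vehicles are in zone A?")
```

The loop is what makes the behaviour agentic: the model decides at each step whether more information is needed, and the sequence of tool calls is built dynamically rather than scripted in advance.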


The mix of models in an agentic AI application for surveillance and security will depend on the various trade-offs that developers need to consider. Some may employ several simpler CNNs, each tuned to a particular sensor modality or source. Others may have more complex requirements that benefit from a VLM, where the inputs from multiple cameras as well as audio sources need to be considered. But at the core is a foundation model that takes in multimodal inputs to produce intelligence tokens that further feed other action-based VLA models in an agentic framework to perform the necessary actions.

Fig 2: A demonstration of Ambarella’s CV72 AI system-on-chip (SoC) running CLIP to identify elements within a 30-minute video (left) and the same processor running Phi-2 to describe a live-video feed.


On edge platforms, the ability to generate a stream of tokens in real time relies on hardware acceleration that performs inferencing efficiently for Transformer-based models. Many hardware platforms can deliver high throughput for CNN-type models. But in dispensing with convolutions and recurrent neural-network structures, the Transformer’s use of memory and compute during inferencing is sufficiently different to require specific acceleration support.

That is why Ambarella has developed an AI SoC architecture that can handle simultaneous inferencing for both Transformers and more traditional neural-network models. Now in its third generation, the CVflow AI accelerator integrated into Ambarella’s latest SoCs is architected together with all of the other processing blocks, including image signal processing, a video pipeline and general compute, to ensure energy-efficient execution and high throughput. This enables the architecture to process tokens at the real-time rates needed to bring VLMs and agentic AI into surveillance and security, while staying within the power and thermal budgets to do so inside the camera. Additionally, Ambarella’s N1x family of edge infrastructure AI SoCs, featuring its CVflow 3.0 AI acceleration architecture, provides the ability to aggregate and perform AI processing on the data from multiple cameras in an on-premise AI box, without having to upgrade each camera.
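A rough sense of what “real-time token rates” means in practice can be worked out from the workload itself. The numbers below are assumptions for illustration, not Ambarella specifications:

```python
# Hypothetical real-time budget for an in-camera VLM describing a live feed.
# All figures are illustrative assumptions, not vendor specifications.
captions_per_second = 1       # one scene description emitted each second
tokens_per_caption = 40       # typical length of a short description
prefill_tokens = 600          # visual tokens ingested per caption

decode_rate_needed = captions_per_second * tokens_per_caption   # output tokens/s
prefill_rate_needed = captions_per_second * prefill_tokens      # input tokens/s

print(f"sustain >= {decode_rate_needed} tokens/s generated "
      f"and >= {prefill_rate_needed} tokens/s ingested")
```

Even this modest scenario requires hundreds of tokens per second to be processed continuously, which is why sustained Transformer throughput, not just peak CNN throughput, is the figure that matters at the edge.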


In both cases, the result is an architecture where AI agents can run at the edge without requiring cloud connectivity. As these systems become more capable, we will see networks of edge devices that can self-organize and collaborate to handle increasingly complex tasks autonomously. These edge AI agents will significantly enhance situational awareness, response times and overall operational efficiency. They will transform how safety and surveillance systems are designed and implemented. The most responsive systems will be those built on hardware designed from the ground up to handle the underlying Transformer-based workloads.


ambarella.com Components in Electronics October 2025 19

