The "Android" of the AR world is in China: Rokid sets off a spatial computing frenzy

Original Source: Light Cone Intelligence

Author: Liu Yuqi

Image source: Generated by Unbounded AI

It may be hard to imagine writing a 5,000-word article with no monitor and no mouse, using only a pair of AR glasses and a pocket-sized host.

Yet that is exactly what happened on August 26 at the 2023 Rokid Jungle new product launch. At the event, Rokid released Rokid AR Studio, a consumer-grade OST (optical see-through) personal spatial computing platform comprising two hardware products: Rokid Max Pro (4,999 yuan) and Rokid Station Pro (3,999 yuan).

Zhu Mingming, founder and CEO of Rokid, said at the press conference: "Spatial computing can be more naturally integrated into daily life and work, and let Rokid AR Studio become your first spatial computer."

This is very different from how people have perceived AR glasses in the past. Until now, AR glasses have been "locked" into entertainment, relying on the two pillars of film and television and games to survive. Rokid AR Studio, by contrast, works as a personal productivity tool: instant messaging, writing articles, writing code, searching for information, and other work tasks can all be done on the new hardware.

**Expanding the usage scenarios moves AR devices from marginal use cases toward real practical value. Once consumers are willing to pay, the entire AR industry chain can enter a positive cycle in the consumer market.**

Zhu Mingming, who describes himself as "socially anxious", is a product and technology obsessive through and through. He once killed two early versions of the product design internally, which nearly drove the product department "crazy". But when the product department quietly brought out a design it had finished, Zhu immediately ordered all resources to be devoted to it. "I only care about one statistic, which is the user's usage time. At present, our real users' usage time is close to one and a half hours, and the weekly retention rate exceeds 20%. If this is done well, users will grow naturally."

**The cumulative number of users has reached the million level, which also means the AR industry has entered a second stage: building software systems and ecosystems. In recent years, more and more system vendors, application vendors, and content vendors have joined in building the AR ecosystem.**

"A group of lunatics, a dream, ten years."

As Zhu Mingming put it, it took Rokid 10 years to go from entertainment scenarios to productivity tools. Behind this lies not only a leap in thinking but also a big step forward from hardware to software, and across the entire industry chain. Apple and Rokid have opened the second stage of the AR race, and competition in the industry is accelerating.

**Monocular SLAM: how does it redefine interaction?**

**The most surprising thing at the launch was not the 76g body of the Rokid Max Pro but the fact that, with only one camera, it handles SLAM (simultaneous localization and mapping, a spatial positioning technology), micro-gesture interaction, first-person view sharing, VPS (visual positioning), and other interaction capabilities in one integrated package.**

After physical controllers (handles), voice interaction, and gesture interaction, AR/VR devices are moving toward eye tracking and today's multi-sensory fusion interaction.

However, multi-sensory fusion places higher demands on hardware. Beyond meeting basic needs, the device must capture the user's movements and gestures from all directions and multiple angles in order to complete interactions accurately.

**How difficult is it to do SLAM interaction with a single camera?**

Visual SLAM consists of two modules. Tracking estimates the device's position from the known positions of 3D points, which is the basic localization step; Mapping updates the positions of those 3D points. Whichever module or method is involved, monocular means there is only one camera, at a fixed position and a fixed angle, which poses great challenges for recognition range, tracking speed, and accuracy.
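The split between Tracking and Mapping can be illustrated with a deliberately simplified toy. This is not Rokid's algorithm: it assumes a 2D world, a camera that never rotates, and observations that arrive as direct offsets from the camera to each landmark. A real monocular camera sees only 2D image projections, so none of these shortcuts hold in practice. The names `track`, `update_map`, and `blend` are hypothetical.

```python
from statistics import mean

Vec = tuple[float, float]


def track(landmarks: dict[str, Vec], observations: dict[str, Vec]) -> Vec:
    """Tracking: estimate the camera position from landmarks already in the map.

    Each observation is the offset (landmark - camera), so every known
    landmark votes for camera = landmark - offset; the votes are averaged.
    """
    known = [name for name in observations if name in landmarks]
    if not known:
        raise ValueError("no known landmark is visible")
    x = mean(landmarks[n][0] - observations[n][0] for n in known)
    y = mean(landmarks[n][1] - observations[n][1] for n in known)
    return (x, y)


def update_map(
    landmarks: dict[str, Vec],
    observations: dict[str, Vec],
    camera: Vec,
    blend: float = 0.5,  # assumed smoothing factor, purely illustrative
) -> dict[str, Vec]:
    """Mapping: refine known landmark positions and add newly seen ones."""
    updated = dict(landmarks)
    for name, (ox, oy) in observations.items():
        seen = (camera[0] + ox, camera[1] + oy)
        if name in updated:
            px, py = updated[name]
            updated[name] = (px + blend * (seen[0] - px), py + blend * (seen[1] - py))
        else:
            updated[name] = seen
    return updated


# Two landmarks are already mapped; "c" is seen for the first time.
world_map = {"a": (0.0, 0.0), "b": (4.0, 0.0)}
frame = {"a": (-1.0, -2.0), "b": (3.0, -2.0), "c": (1.0, 1.0)}

camera = track(world_map, frame)                  # localize first
world_map = update_map(world_map, frame, camera)  # then extend the map
```

Running the two steps in alternation, frame after frame, is the basic loop: the map makes localization possible, and each new localization lets the map grow.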

"The industry believes that monocular SLAM is unbelievable and difficult to achieve," Zhu Mingming jokingly said, "This may also be an affirmation of Rokid."

At present, the few AR glasses on the market that offer spatial interaction carry at least three cameras to serve the algorithms. **The difference in visual approach has also formed two camps: VST (video see-through), represented by Apple, and OST (optical see-through), represented by Rokid.**

Take Apple Vision Pro as an example. It uses 12 cameras to "stack up" fast positioning capture, high-precision panoramic perception, and precise tracking, and it uses VST: the cameras shoot the outside world in real time, and the user sees it on the device's displays.

However, stacking hardware for interaction raises the cost and multiplies the price with it, which creates two major obstacles to shipping: the weight of the device and the difficulty of mass production. This is the fundamental reason why Apple Vision Pro is priced at $3,499 and will not be mass-produced until 2024.

The OST solution that Rokid insists on has technical barriers of its own. Given the complex pipeline design, the limited field of view of the display, and the high cost of optical components, Rokid can only rely on technological breakthroughs to bring the combined cost down.

So how is the monocular SLAM that the industry finds "unbelievable" achieved? **After the event, Light Cone Intelligence spoke at length with Zhu Mingming and found that Rokid's "unique trick" is to use AI algorithms to break through hardware barriers.**

Zhu Mingming explained that although monocular SLAM has existed for a long time, it had never been applied to AR glasses. The front cameras of mobile phones use this kind of technology too. The only difference is the algorithm.

Going from AI to AR looks like a leap but is in fact a continuous path. It rests on Rokid's accumulation in AI over the past few years: a multi-dimensional set of visual algorithm models, including visual positioning and enhancement, digital human technology, 2D/3D gesture recognition, and OCR, lets AI land in specific scenarios.

For example, the AR visual positioning and enhancement function is meant to overcome the limitation of a single camera. By building a centimeter-level visual map, virtual information can be accurately overlaid on and fused with real-world objects, achieving high-precision 3D reconstruction of objects and scenes.

Wang Junjie, vice president of Rokid and head of its XR center, said: "Spatial positioning is based on SLAM technology, and stable, natural interaction can then be performed in space. The algorithm takes 1 to 2 seconds to initialize quickly and build the spatial map."

Most devices on the market still use binocular solutions, but binocular fusion has many problems of its own. Besides the cost of an extra camera, algorithms must continuously fit the data from the two cameras in real time, which introduces further complexity.

From this point of view, if the monocular solution works smoothly, Rokid will be the first to catch a technology trend. Rokid was also the first manufacturer in the industry to offer a Station host, and separating the glasses from the host has proven to be the best solution for user experience in the industry.

In gesture recognition, Rokid uses micro-gestures: pinch your fingers to click and select, or move your hand left or right to switch the interface or content you are browsing. Simple, logically defined gestures such as pinch and slide feel more natural and are quicker to learn.
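As a rough illustration of how pinch and slide gestures could be derived from tracked hand points, here is a minimal sketch. It is not Rokid's implementation: the function names, the input format (fingertip positions in meters), and both thresholds are assumptions chosen only for the example.

```python
import math
from typing import Optional, Sequence

Point3 = tuple[float, float, float]

PINCH_THRESHOLD_M = 0.02  # assumed: fingertips closer than 2 cm count as a pinch
SWIPE_THRESHOLD_M = 0.10  # assumed: 10 cm of sideways travel counts as a slide


def is_pinch(thumb_tip: Point3, index_tip: Point3,
             threshold: float = PINCH_THRESHOLD_M) -> bool:
    """A pinch is detected when thumb and index fingertips nearly touch."""
    return math.dist(thumb_tip, index_tip) < threshold


def swipe_direction(hand_x: Sequence[float],
                    threshold: float = SWIPE_THRESHOLD_M) -> Optional[str]:
    """Classify a short history of horizontal hand positions as left/right/none."""
    if len(hand_x) < 2:
        return None
    travel = hand_x[-1] - hand_x[0]
    if travel > threshold:
        return "right"
    if travel < -threshold:
        return "left"
    return None


# Fingertips 1 cm apart: treated as a click/select.
clicked = is_pinch((0.00, 0.0, 0.3), (0.01, 0.0, 0.3))
# Hand moved 15 cm to the right across three frames: switch content.
direction = swipe_direction([0.00, 0.05, 0.15])
```

A production recognizer would also need to smooth noisy joint positions and debounce the gesture over several frames; the thresholds above only show the basic decision.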

According to our on-site tests, Rokid supports bare-hand spatial interaction with both hands. At present, Rokid's gesture recognition algorithm handles complex conditions such as horizontal/spatial axis rotation and bright or dark lighting, and it recognizes many types of gestures. The algorithm is precise: the overall recognition rate is about 90%, with millisecond-level response and a 99% reliability guarantee.

According to Rokid, its monocular 3D gesture algorithm, built on deep learning and a large amount of experimental data, can reconstruct hand pose parameters in real time on mobile devices, including hand 6DoF, hand joint point 6DoF, and hand mesh information, providing a good algorithmic basis for AR gesture interaction.

At present, Rokid's gesture recognition supports a variety of operations in 3D space, including point, pinch, grasp, hold, drag, and pull, which fully meets the needs of AR interactive applications. For example, put on the Rokid Max Pro, stretch out your hand, and open your palm in front of your eyes to bring up the menu.

Ultimately, the hero behind such a complex algorithm stack is not only the camera but also the computing power and performance of the "brain", that is, the Rokid Station Pro.

**A spatial computer in your pocket**

**For a long time, the VR/AR industry has faced an "impossible triangle" of computing power, comfort, and price. Devices with higher computing power tend to be heavier and more expensive, while lightweight, comfortable devices cannot meet users' performance needs.**

Judging from the actual situation, there is no "perfect" solution at present, and mainstream manufacturers are trying to balance these factors. Two types of solution dominate the current market: one, represented by Apple, integrates display and computing and connects the battery externally; the other, represented by Rokid, splits display and computing.

Apple's integrated design packs in two micro-OLED screens, multiple cameras, sensors, speakers, and other components. This is more efficient for display and computation, but it also increases the weight of the headset itself, which is why the battery has to be external.

The split design that Rokid insists on maximizes wearability. Compared with Vision Pro's 454g, the 76g glasses weigh almost the same as ordinary glasses. The host's computing power is also less constrained by space, and discomfort caused by heat is avoided to some extent.

**In general, the split route lets the portability of the glasses and the computing power of the host each be pushed to the limit, and it is more flexible: computing power and glasses technology can iterate asynchronously.**

Based on the split design, Rokid Station Pro upgrades computing power to create an all-in-one terminal integrating computing, imaging, communication, and other functions. It could also be called a "super terminal" for productivity.

According to Light Cone Intelligence, Rokid Station Pro is equipped with the Qualcomm Snapdragon XR2+, 12GB RAM + 128GB ROM, and supports Wi-Fi 6/6E and Bluetooth 5.1. With improved heat dissipation and higher performance, it achieves centimeter-level 6DoF tracking accuracy and extremely low MTP (motion-to-photon) rendering latency.

According to public information, Snapdragon XR2+ is the latest flagship XR platform from Qualcomm, offering 50% higher sustained power and 30% better thermal performance, which enables a richer and more immersive experience in a smaller, thinner device. The Snapdragon XR2+ platform also introduces a new image processing pipeline that achieves latency of less than 10 milliseconds and enables a full-color video see-through MR experience.

Judging from Light Cone Intelligence's on-site experience, whether watching movies, playing games, or bringing up a keyboard for work, and especially during high-frequency interaction and combat in games, the picture was silky smooth and responsive.

It is worth mentioning that the core algorithm in most products currently on the market is still 3DoF (three-degree-of-freedom tracking). This means the device can detect the rotation of the head about three axes, but it cannot detect the head's displacement in space: forward and back, left and right, up and down.

The 6DoF algorithm adopted by the upgraded Station Pro detects not only the change in viewing angle caused by head rotation but also the displacement caused by body movement along the three axes: up and down, forward and back, left and right.
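The difference between the two tracking modes can be made concrete with a small sketch. It is an illustration under assumptions, not any vendor's API: `HeadPose`, `tracked_by_3dof`, `tracked_by_6dof`, and the coordinate convention (x right, y up, z forward, in meters) are all hypothetical.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class HeadPose:
    # Rotation in radians: the three degrees of freedom that 3DoF tracks.
    yaw: float = 0.0
    pitch: float = 0.0
    roll: float = 0.0
    # Translation in meters: the three extra degrees of freedom in 6DoF.
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0


def tracked_by_3dof(true_pose: HeadPose) -> HeadPose:
    """A 3DoF tracker reports rotation only; the head never appears to move."""
    return replace(true_pose, x=0.0, y=0.0, z=0.0)


def tracked_by_6dof(true_pose: HeadPose) -> HeadPose:
    """A 6DoF tracker reports rotation and translation."""
    return true_pose


def distance_to_anchor(pose: HeadPose, anchor: tuple[float, float, float]) -> float:
    """Distance from the tracked head position to a virtual object fixed in the room."""
    return math.dist((pose.x, pose.y, pose.z), anchor)


anchor = (0.0, 0.0, 2.0)  # virtual object 2 m in front of the starting point
# The user turns 30 degrees and steps 0.5 m forward.
moved = HeadPose(yaw=math.radians(30), z=0.5)

seen_3dof = distance_to_anchor(tracked_by_3dof(moved), anchor)  # step is ignored
seen_6dof = distance_to_anchor(tracked_by_6dof(moved), anchor)  # object gets closer
```

Under 3DoF the object stays the same distance away no matter how far the user walks, whereas under 6DoF it gets closer as the user approaches, which is what lets virtual content behave as if it were fixed in the room.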

This upgrade matters most for the player's freedom of movement. For example, when fighting zombies under 3DoF, the shooting range is limited to a certain angle in front of you; after the upgrade, zombies can appear from anywhere in 360 degrees, and the physical sensation of turning around to find a zombie behind you is beyond anything the former can offer.

In other words, not only is computing power higher and the experience smoother; the extra headroom in computing power also makes a huge difference to the sense of physical presence.

Said Bakadir, senior director of XR product management at Qualcomm Technologies, commented: "The first-generation Snapdragon XR2+ platform is the best choice to enable the next generation of XR experiences. Qualcomm Technologies provides the industry-leading platform for Rokid Station Pro, supporting it in creating its own unique AR application ecosystem."

**Building the iOS of the AR industry**

Of course, Apple's phones have dominated the mobile phone market year after year not only because of their hardware but also because of the system and ecosystem. The barriers built by cultivating user habits through software are often stronger than the hardware itself.

**This is part of the reason Rokid developed its own AR spatial operating system, YodaOS-Master, but not the whole reason.**

On Rokid Open Day in March this year, Rokid officially launched YodaOS-Master and released the "Lingjing" AR space creation platform, which lets everyone create AR content in 3D space, lowering the threshold for AR creation and unleashing the ecosystem's potential.

**If monocular SLAM, 3D gesture recognition, Snapdragon XR2+, and the Lingjing platform are all sharp blades, then YodaOS-Master is the self-developed system through which these moves are unleashed.**

To put it simply, Rokid is taking a road that no one has traveled before, and its philosophy is "software defines everything". All software must be carried and delivered by the system in order to realize its value.

Focusing on perception, understanding, interaction, presentation, collaboration, and digital creation, YodaOS-Master has made major upgrades in chip optimization, hardware design, software architecture, AR algorithms, and creation tools. It may be the most complete spatial operating system for the AR era at present.

At the press conference, Rokid also demonstrated the openness and convenience that the self-developed system brings. To give an obvious example: based on the self-developed system and the Snapdragon XR2+ platform, Rokid has developed a multi-task parallel mode that breaks the earlier single-task constraint. Chatting, writing code, and viewing documents can happen at the same time, taking full advantage of the large screen in space so that productivity is maximized.

**Another highly innovative case is that Rokid has redefined spatial search on top of its self-developed system.** Zhu Mingming explained that this breaks with the previous way of displaying search information: search results are no longer presented on a two-dimensional plane but exist in three-dimensional space. "The results that are most relevant to the question will be the closest to you, and the results that are somewhat relevant are on the secondary layer. The farther away, the less relevant. Of course, you can also swipe away the previous results and dynamically select the results you want."
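The idea that more relevant results sit closer to the user can be sketched as a simple mapping from a relevance score to a viewing distance. This is only an illustration of the principle described in the quote, not Rokid's implementation: the linear mapping, the 1 m and 3 m bounds, and the names `SearchResult`, `depth_for`, and `layout` are all assumptions.

```python
from dataclasses import dataclass

NEAR_M = 1.0  # assumed distance for a perfectly relevant result
FAR_M = 3.0   # assumed distance for a completely irrelevant result


@dataclass
class SearchResult:
    title: str
    relevance: float  # assumed to be a score between 0.0 and 1.0


def depth_for(relevance: float, near: float = NEAR_M, far: float = FAR_M) -> float:
    """Map relevance to distance: 1.0 lands at `near`, 0.0 lands at `far`."""
    clamped = min(max(relevance, 0.0), 1.0)
    return near + (1.0 - clamped) * (far - near)


def layout(results: list[SearchResult]) -> list[tuple[str, float]]:
    """Order results from most to least relevant and assign each a distance."""
    ranked = sorted(results, key=lambda r: r.relevance, reverse=True)
    return [(r.title, depth_for(r.relevance)) for r in ranked]


placed = layout([
    SearchResult("somewhat related", 0.5),
    SearchResult("best match", 1.0),
    SearchResult("barely related", 0.0),
])
# placed lists "best match" at 1.0 m, "somewhat related" at 2.0 m,
# and "barely related" at 3.0 m.
```

A real layout would also have to spread results sideways and vertically so they do not overlap, and handle the user dismissing results, which this sketch leaves out.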

This immediately gives a sense of the future, and it shows the essential difference from first-stage AR devices.

**It can be seen that the AR industry's open ecosystem has begun to enter its second stage. Apple and Rokid shape not only the direction of hardware but also the development of the industry's system software and ecosystem. Through co-creation among hardware, algorithms, software ecosystem, developers, users, and platforms, AR will move into a second stage of rapid development in a fully open ecosystem.**

Shi Wenfeng, chief engineer of Rokid system research and development, said: "The YodaOS-Master operating system integrates Rokid's core technologies, such as voice recognition, gesture recognition, and SLAM, into system services through a service-oriented approach, and provides a variety of client SDKs so that developers can work efficiently. One example is the SDK for Unity, which lets Unity developers (developer application channel: open platform, ar.rokid.com) quickly build on Rokid's core technology."

From hardware to software, from system to ecology, Rokid's development path is a bit like Apple in the Jobs era.

"The AR industry is just before dawn," Zhu Mingming said.
