The landscape of computing is undergoing a profound transformation with the emergence of spatial computing platforms (AR and VR). As we step into this new era, the intersection of virtual reality, augmented reality, and on-device machine learning presents unprecedented opportunities for developers to create experiences that seamlessly blend digital content with the physical world.
The introduction of visionOS marks a significant milestone in this evolution. Apple’s spatial computing platform combines sophisticated hardware capabilities with powerful development frameworks, enabling developers to build applications that can understand and interact with the physical environment in real time. This convergence of spatial awareness and on-device machine learning opens up new possibilities for object recognition and tracking applications that were previously challenging to implement.
What We’re Building
In this guide, we’ll be building an app that showcases the power of on-device machine learning in visionOS. We’ll create an app that can recognize and track a diet soda can in real time, overlaying visual indicators and information directly in the user’s field of view.
Our app will leverage several key technologies in the visionOS ecosystem. When a user runs the app, they’re presented with a window containing a rotating 3D model of our target object along with usage instructions. As they look around their environment, the app continuously scans for diet soda cans. Upon detection, it displays dynamic bounding lines around the can and places a floating text label above it, all while maintaining precise tracking as the object or user moves through space.
Before we begin development, let’s ensure we have the necessary tools and understanding in place. This tutorial requires:
- The latest version of Xcode 16 with the visionOS SDK installed
- visionOS 2.0 or later running on an Apple Vision Pro device
- Basic familiarity with SwiftUI and the Swift programming language
The development process will take us through several key stages, from capturing a 3D model of our target object to implementing real-time tracking and visualization. Each stage builds upon the previous one, giving you a thorough understanding of developing features powered by on-device machine learning for visionOS.
Building the Foundation: 3D Object Capture
The first step in creating our object recognition system involves capturing a detailed 3D model of our target object. Apple provides a powerful app for this purpose: Reality Composer, available for iOS through the App Store.
When capturing a 3D model, environmental conditions play a crucial role in the quality of our results. Setting up the capture environment properly ensures we get the best possible data for our machine learning model. A well-lit space with consistent lighting helps the capture system accurately detect the object’s features and dimensions. The diet soda can should be placed on a surface with good contrast, making it easier for the system to distinguish the object’s boundaries.
The capture process begins by launching the Reality Composer app and selecting “Object Capture” from the available options. The app guides us through positioning a bounding box around our target object. This bounding box is critical, as it defines the spatial boundaries of our capture volume.

Once we’ve captured the soda can from all angles with the help of the in-app guide and the images have been processed, we get a .usdz file containing our 3D model. This file format is specifically designed for AR/VR applications and contains not just the visual representation of our object, but also important information that will be used in the training process.
Training the Reference Model
With our 3D model in hand, we move to the next crucial phase: training our recognition model using Create ML. Apple’s Create ML application provides a straightforward interface for training machine learning models, including specialized templates for spatial computing applications.
To begin the training process, we launch Create ML and select the “Object Tracking” template from the spatial category. This template is specifically designed for training models that can recognize and track objects in three-dimensional space.

After creating a new project, we import our .usdz file into Create ML. The system automatically analyzes the 3D model and extracts key features that will be used for recognition. The interface provides options for configuring how our object should be recognized in space, including viewing angles and tracking preferences.
Once you’ve imported the 3D model and reviewed it from various angles, click “Train”. Create ML will process the model and begin the training phase, during which the system learns to recognize our object from different viewpoints and under different conditions. Training can take several hours as the system builds a comprehensive understanding of the object’s characteristics.

The output of this training process is a .referenceobject file, which contains the trained model data optimized for real-time object detection in visionOS. This file encapsulates all the learned features and recognition parameters that will enable our app to identify diet soda cans in the user’s environment.
The successful creation of our reference object marks an important milestone in our development process. We now have a trained model capable of recognizing our target object in real time, setting the stage for implementing the actual detection and visualization functionality in our visionOS application.
Initial Project Setup
Now that we have our trained reference object, let’s set up our visionOS project. Launch Xcode and select “Create a new Xcode project”. In the template selector, choose visionOS under the platforms filter and select “App”. This template provides the basic structure needed for a visionOS application.

In the project configuration dialog, configure your project with these primary settings:
- Product Name: SodaTracker
- Initial Scene: Window
- Immersive Space Renderer: RealityKit
- Immersive Space: Mixed
After project creation, we need to make a few essential modifications. First, delete the file named ToggleImmersiveSpaceButton.swift as we won’t be using it in our implementation.
Next, we’ll add our previously created assets to the project. In Xcode’s Project Navigator, locate the “RealityKitContent.rkassets” folder and add the 3D model file (SodaModel.usdz). This 3D model will be used in our informative view. Create a new group named “ReferenceObjects” and add the “Diet Soda.referenceobject” file we generated using Create ML.
The final setup step is to configure the necessary permission for object tracking. Open your project’s Info.plist file and add a new key: NSWorldSensingUsageDescription. Set its value to “Used to track diet sodas”. This permission is required for the app to detect and track objects in the user’s environment.
With these setup steps complete, we have a properly configured visionOS project ready for implementing our object tracking functionality.
Entry Point Implementation
Let’s start with SodaTrackerApp.swift, which was automatically created when we set up our visionOS project. We need to modify this file to support our object tracking functionality. Replace the default implementation with the following code:
import SwiftUI
/**
SodaTrackerApp is the main entry point for the application.
It configures the app's window and immersive space, and manages
the initialization of object detection capabilities.
The app automatically launches into an immersive experience
where users can see Diet Soda cans being detected and highlighted
in their environment.
*/
@main
struct SodaTrackerApp: App {
/// Shared model that manages object detection state
@State private var appModel = AppModel()
/// System environment value for launching immersive experiences
@Environment(\.openImmersiveSpace) var openImmersiveSpace
var body: some Scene {
WindowGroup {
ContentView()
.environment(appModel)
.task {
// Load the reference object first, then launch the immersive experience
await appModel.initializeDetector()
await openImmersiveSpace(id: appModel.immersiveSpaceID)
}
}
.windowStyle(.plain)
.windowResizability(.contentSize)
// Configure the immersive space for object detection
ImmersiveSpace(id: appModel.immersiveSpaceID) {
ImmersiveView()
.environment(appModel)
}
// Use mixed immersion to blend virtual content with reality
.immersionStyle(selection: .constant(.mixed), in: .mixed)
// Hide system UI for a more immersive experience
.persistentSystemOverlays(.hidden)
}
}
The key aspect of this implementation is the initialization and management of our object detection system. When the app launches, we initialize our AppModel which handles the ARKit session and object tracking setup. The initialization sequence is crucial:
.task {
await appModel.initializeDetector()
await openImmersiveSpace(id: appModel.immersiveSpaceID)
}
This asynchronous initialization loads our trained reference object, and because both awaits run in the same task, loading completes before we open the immersive space, where the ARKit session starts and the actual detection occurs.
The immersive space configuration is particularly important for object tracking:
.immersionStyle(selection: .constant(.mixed), in: .mixed)
The mixed immersion style is essential for our object tracking implementation as it allows RealityKit to blend our visual indicators (bounding boxes and labels) with the real-world environment where we’re detecting objects. This creates a seamless experience where digital content accurately aligns with physical objects in the user’s space.
With these modifications to SodaTrackerApp.swift, our app is ready to begin the object detection process, with ARKit, RealityKit, and our trained model working together in the mixed reality environment. In the next section, we’ll examine the core object detection functionality in AppModel.swift, another file that was created during project setup.
Core Detection Model Implementation
AppModel.swift, created during project setup, serves as our core detection system. This file manages the ARKit session, loads our trained model, and coordinates the object tracking process. Let’s examine its implementation:
import SwiftUI
import RealityKit
import ARKit
/**
AppModel serves as the core model for the soda can detection application.
It manages the ARKit session, handles object tracking initialization,
and maintains the state of object detection throughout the app's lifecycle.
This model is designed to work with visionOS's object tracking capabilities,
specifically optimized for detecting Diet Soda cans in the user's environment.
*/
@MainActor
@Observable
class AppModel {
/// Unique identifier for the immersive space where object detection occurs
let immersiveSpaceID = "SodaTracking"
/// ARKit session instance that manages the core tracking functionality
/// This session coordinates with visionOS to process spatial data
private var arSession = ARKitSession()
/// Dedicated provider that handles the real-time tracking of soda cans
/// This maintains the state of currently tracked objects
private var sodaTracker: ObjectTrackingProvider?
/// Collection of reference objects used for detection
/// These objects contain the trained model data for recognizing soda cans
private var targetObjects: [ReferenceObject] = []
/**
Initializes the object detection system by loading and preparing
the reference object (Diet Soda can) from the app bundle.
This method loads a pre-trained model that contains spatial and
visual information about the Diet Soda can we want to detect.
*/
func initializeDetector() async {
guard let objectURL = Bundle.main.url(forResource: "Diet Soda", withExtension: "referenceobject") else {
print("Error: Failed to locate reference object in bundle - ensure Diet Soda.referenceobject exists")
return
}
do {
let referenceObject = try await ReferenceObject(from: objectURL)
self.targetObjects = [referenceObject]
} catch {
print("Error: Failed to initialize reference object: \(error)")
}
}
/**
Starts the active object detection process using ARKit.
This method initializes the tracking provider with loaded reference objects
and begins the real-time detection process in the user's environment.
Returns: An ObjectTrackingProvider if successfully initialized, nil otherwise
*/
func beginDetection() async -> ObjectTrackingProvider? {
guard !targetObjects.isEmpty else { return nil }
let tracker = ObjectTrackingProvider(referenceObjects: targetObjects)
do {
try await arSession.run([tracker])
self.sodaTracker = tracker
return tracker
} catch {
print("Error: Failed to initialize tracking: \(error)")
return nil
}
}
/**
Terminates the object detection process.
This method safely stops the ARKit session and cleans up
tracking resources when object detection is no longer needed.
*/
func endDetection() {
arSession.stop()
}
}
At the core of our implementation is ARKitSession, visionOS’s gateway to spatial computing capabilities. The @MainActor attribute ensures our object detection operations run on the main thread, which is crucial for synchronizing with the rendering pipeline.
private var arSession = ARKitSession()
private var sodaTracker: ObjectTrackingProvider?
private var targetObjects: [ReferenceObject] = []
The ObjectTrackingProvider is a specialized component in visionOS that handles real-time object detection. It works in conjunction with ReferenceObject instances, which contain the spatial and visual information from our trained model. We maintain these as private properties to ensure proper lifecycle management.
The initialization process is particularly important:
let referenceObject = try await ReferenceObject(from: objectURL)
self.targetObjects = [referenceObject]
Here, we load our trained model (the .referenceobject file we created in Create ML) into a ReferenceObject instance. This process is asynchronous because the system needs to parse and prepare the model data for real-time detection.
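While our app only tracks a single object, the same loading pattern scales to several. As a rough sketch (the helper and the folder name are assumptions for illustration, not part of the project files), a variant of initializeDetector could load every .referenceobject bundled in a “ReferenceObjects” folder reference; AppModel.swift already imports ARKit, so nothing else is needed:
func loadAllReferenceObjects() async -> [ReferenceObject] {
    // The subdirectory only exists if the files were added as a folder reference.
    guard let urls = Bundle.main.urls(forResourcesWithExtension: "referenceobject",
                                      subdirectory: "ReferenceObjects") else {
        return []
    }
    var objects: [ReferenceObject] = []
    for url in urls {
        do {
            // Same async initializer used above for the single Diet Soda model.
            objects.append(try await ReferenceObject(from: url))
        } catch {
            print("Skipping \(url.lastPathComponent): \(error)")
        }
    }
    return objects
}
Because ObjectTrackingProvider accepts an array of reference objects, beginDetection would work unchanged with whatever this helper returns.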
The beginDetection method sets up the actual tracking process:
let tracker = ObjectTrackingProvider(referenceObjects: targetObjects)
try await arSession.run([tracker])
When we create the ObjectTrackingProvider, we pass in our reference objects. The provider uses these to establish the detection parameters — what to look for, what features to match, and how to track the object in 3D space. The ARKitSession.run call activates the tracking system, beginning the real-time analysis of the user’s environment.
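Two defensive checks are worth considering before this call, although the tutorial works without them: confirming that the current device supports object tracking, and explicitly requesting the world-sensing permission backed by our NSWorldSensingUsageDescription entry. Here is a minimal sketch of a helper that could sit next to beginDetection in AppModel.swift (the helper name is hypothetical):
func canStartTracking() async -> Bool {
    // Object tracking is only available on hardware and OS versions that support it.
    guard ObjectTrackingProvider.isSupported else { return false }
    // Prompts the user for world-sensing access if they haven't been asked yet.
    let results = await arSession.requestAuthorization(for: [.worldSensing])
    return results[.worldSensing] == .allowed
}
beginDetection could call this first and return nil early, rather than waiting for arSession.run to throw.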
Immersive Experience Implementation
ImmersiveView.swift, provided in our initial project setup, manages the real-time object detection visualization in the user’s space. This view processes the continuous stream of detection data and creates visual representations of detected objects. Here’s the implementation:
import SwiftUI
import RealityKit
import ARKit
/**
ImmersiveView is responsible for creating and managing the augmented reality
experience where object detection occurs. This view handles the real-time
visualization of detected soda cans in the user's environment.
It maintains a collection of visual representations for each detected object
and updates them in real-time as objects are detected, moved, or removed
from view.
*/
struct ImmersiveView: View {
/// Access to the app's shared model for object detection functionality
@Environment(AppModel.self) private var appModel
/// Root entity that serves as the parent for all AR content
/// This entity provides a consistent coordinate space for all visualizations
@State private var sceneRoot = Entity()
/// Maps unique object identifiers to their visual representations
/// Enables efficient updating of specific object visualizations
@State private var activeVisualizations: [UUID: ObjectVisualization] = [:]
var body: some View {
RealityView { content in
// Initialize the AR scene with our root entity
content.add(sceneRoot)
Task {
// Begin object detection and track changes
let detector = await appModel.beginDetection()
guard let detector else { return }
// Process real-time updates for object detection
for await update in detector.anchorUpdates {
let anchor = update.anchor
let id = anchor.id
switch update.event {
case .added:
// Object newly detected - create and add visualization
let visualization = ObjectVisualization(for: anchor)
activeVisualizations[id] = visualization
sceneRoot.addChild(visualization.entity)
case .updated:
// Object moved - update its position and orientation
activeVisualizations[id]?.refreshTracking(with: anchor)
case .removed:
// Object no longer visible - remove its visualization
activeVisualizations[id]?.entity.removeFromParent()
activeVisualizations.removeValue(forKey: id)
}
}
}
}
.onDisappear {
// Clean up AR resources when view is dismissed
cleanupVisualizations()
}
}
/**
Removes all active visualizations and stops object detection.
This ensures proper cleanup of AR resources when the view is no longer active.
*/
private func cleanupVisualizations() {
for (_, visualization) in activeVisualizations {
visualization.entity.removeFromParent()
}
activeVisualizations.removeAll()
appModel.endDetection()
}
}
The core of our object tracking visualization lies in the detector’s anchorUpdates stream. This ARKit feature provides a continuous flow of object detection events:
for await update in detector.anchorUpdates {
let anchor = update.anchor
let id = anchor.id
switch update.event {
case .added:
// Object first detected
case .updated:
// Object position changed
case .removed:
// Object no longer visible
}
}
Each ObjectAnchor contains crucial spatial data about the detected soda can, including its position, orientation, and bounding box in 3D space. When a new object is detected (.added event), we create a visualization that RealityKit will render in the correct position relative to the physical object. As the object or user moves, the .updated events ensure our virtual content stays perfectly aligned with the real world.
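If you want to inspect that data while debugging, a small illustrative helper (not part of the project files) can summarize what each anchor carries; it uses only the ObjectAnchor properties that our visualization code relies on:
import ARKit
import RealityKit

/// Illustrative only: summarizes the spatial data ARKit delivers for a detected can.
func describe(_ anchor: ObjectAnchor) -> String {
    // The anchor's transform places it in world space; Transform extracts the translation.
    let position = Transform(matrix: anchor.originFromAnchorTransform).translation
    // The bounding box is expressed in the anchor's local coordinate space, in meters.
    let size = anchor.boundingBox.extent
    return "tracked: \(anchor.isTracked), position: \(position), " +
           "size: \(size.x) x \(size.y) x \(size.z) m"
}
Calling print(describe(anchor)) from the .updated branch is a quick way to check whether any jitter comes from ARKit’s tracking or from your own visualization code.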
Visual Feedback System
Create a new file named ObjectVisualization.swift for handling the visual representation of detected objects. This component is responsible for creating and managing the bounding box and text overlay that appears around detected soda cans:
import RealityKit
import ARKit
import UIKit
import SwiftUI
/**
ObjectVisualization manages the visual elements that appear when a soda can is detected.
This class handles both the 3D text label that appears above the object and the
bounding box that outlines the detected object in space.
*/
@MainActor
class ObjectVisualization {
/// Root entity that contains all visual elements
var entity: Entity
/// Entity specifically for the bounding box visualization
private var boundingBox: Entity
/// Width of bounding box lines - 0.003 provides optimal visibility without being too intrusive
private let outlineWidth: Float = 0.003
init(for anchor: ObjectAnchor) {
entity = Entity()
boundingBox = Entity()
// Set up the main entity's transform based on the detected object's position
entity.transform = Transform(matrix: anchor.originFromAnchorTransform)
entity.isEnabled = anchor.isTracked
createFloatingLabel(for: anchor)
setupBoundingBox(for: anchor)
refreshBoundingBoxGeometry(with: anchor)
}
/**
Creates a floating text label that hovers above the detected object.
The text uses Avenir Next font for optimal readability in AR space and
is positioned slightly above the object for clear visibility.
*/
private func createFloatingLabel(for anchor: ObjectAnchor) {
// 0.06 units provides optimal text size for viewing at typical distances
let labelSize: Float = 0.06
// Use Avenir Next for its clarity in AR, falling back to the system font if it's unavailable
let font = MeshResource.Font(name: "Avenir Next", size: CGFloat(labelSize)) ?? .systemFont(ofSize: CGFloat(labelSize))
let textMesh = MeshResource.generateText("Diet Soda",
extrusionDepth: labelSize * 0.15,
font: font)
// Create a material that makes text clearly visible against any background
var textMaterial = UnlitMaterial()
textMaterial.color = .init(tint: .orange)
let textEntity = ModelEntity(mesh: textMesh, materials: [textMaterial])
// Position text above object with enough clearance to avoid intersection
textEntity.transform.translation = SIMD3(
anchor.boundingBox.center.x - textMesh.bounds.max.x / 2,
anchor.boundingBox.extent.y + labelSize * 1.5,
0
)
entity.addChild(textEntity)
}
/**
Creates a bounding box visualization that outlines the detected object.
Uses a semi-transparent magenta color to provide a clear
but non-distracting visual boundary around the detected soda can.
*/
private func setupBoundingBox(for anchor: ObjectAnchor) {
let boxMesh = MeshResource.generateBox(size: [1.0, 1.0, 1.0])
// Create a single material for all edges with magenta color
let boundsMaterial = UnlitMaterial(color: .magenta.withAlphaComponent(0.4))
// Create all edges with uniform appearance
for _ in 0..<12 {
let edge = ModelEntity(mesh: boxMesh, materials: [boundsMaterial])
boundingBox.addChild(edge)
}
entity.addChild(boundingBox)
}
/**
Updates the visualization when the tracked object moves.
This ensures the bounding box and text maintain accurate positioning
relative to the physical object being tracked.
*/
func refreshTracking(with anchor: ObjectAnchor) {
entity.isEnabled = anchor.isTracked
guard anchor.isTracked else { return }
entity.transform = Transform(matrix: anchor.originFromAnchorTransform)
refreshBoundingBoxGeometry(with: anchor)
}
/**
Updates the bounding box geometry to match the detected object's dimensions.
Creates a precise outline that exactly matches the physical object's boundaries
while keeping the outline thickness constant.
*/
private func refreshBoundingBoxGeometry(with anchor: ObjectAnchor) {
let extent = anchor.boundingBox.extent
boundingBox.transform.translation = anchor.boundingBox.center
for (index, edge) in boundingBox.children.enumerated() {
guard let edge = edge as? ModelEntity else { continue }
switch index {
case 0...3: // Horizontal edges along width
edge.scale = SIMD3(extent.x, outlineWidth, outlineWidth)
edge.position = [
0,
extent.y / 2 * (index % 2 == 0 ? -1 : 1),
extent.z / 2 * (index < 2 ? -1 : 1)
]
case 4...7: // Vertical edges along height
edge.scale = SIMD3(outlineWidth, extent.y, outlineWidth)
edge.position = [
extent.x / 2 * (index % 2 == 0 ? -1 : 1),
0,
extent.z / 2 * (index < 6 ? -1 : 1)
]
case 8...11: // Depth edges
edge.scale = SIMD3(outlineWidth, outlineWidth, extent.z)
edge.position = [
extent.x / 2 * (index % 2 == 0 ? -1 : 1),
extent.y / 2 * (index < 10 ? -1 : 1),
0
]
default:
break
}
}
}
}
The bounding box creation is a key aspect of our visualization. Rather than using a single box mesh, we construct 12 individual edges that form a wireframe outline. This approach provides better visual clarity and allows for more precise control over the appearance. The edges are positioned using SIMD3 vectors for efficient spatial calculations:
edge.position = [
extent.x / 2 * (index % 2 == 0 ? -1 : 1),
extent.y / 2 * (index < 10 ? -1 : 1),
0
]
This mathematical positioning ensures each edge aligns perfectly with the detected object’s dimensions. The calculation uses the object’s extent (width, height, depth) and creates a symmetrical arrangement around its center point.
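To make the math concrete, here’s a worked example with illustrative numbers: a 12 fl oz can is roughly 6.6 cm wide and 12.2 cm tall, so ARKit might report an extent in that range. The snippet mirrors the depth-edge branch above for index 9:
import simd

let extent = SIMD3<Float>(0.066, 0.122, 0.066)  // width, height, depth in meters (illustrative)
let outlineWidth: Float = 0.003

// Depth edge at index 9: runs along z, pinned to the +x / -y corner of the box.
let index = 9
let scale = SIMD3<Float>(outlineWidth, outlineWidth, extent.z)
let position = SIMD3<Float>(
    extent.x / 2 * (index % 2 == 0 ? -1 : 1),   // +0.033, the right face
    extent.y / 2 * (index < 10 ? -1 : 1),       // -0.061, the bottom face
    0                                           // centered along the can's depth
)
print("edge 9: scale \(scale), position \(position)")
// scale is (0.003, 0.003, 0.066) and position is roughly (0.033, -0.061, 0),
// both relative to the bounding box center, so this edge hugs the bottom-right depth line of the can.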
This visualization system works in conjunction with our ImmersiveView to create real-time visual feedback. As the ImmersiveView receives position updates from ARKit, it calls refreshTracking on our visualization, which updates the transform matrices to maintain precise alignment between the virtual overlays and the physical object.
Informative View

ContentView.swift, provided in our project template, handles the informational interface for our app. Here’s the implementation:
import SwiftUI
import RealityKit
import RealityKitContent
/**
ContentView provides the main window interface for the application.
Displays a rotating 3D model of the target object (Diet Soda can)
along with clear instructions for users on how to use the detection feature.
*/
struct ContentView: View {
// State to control the continuous rotation animation
@State private var rotation: Double = 0
var body: some View {
VStack(spacing: 30) {
// 3D model display with rotation animation
Model3D(named: "SodaModel", bundle: realityKitContentBundle)
.padding(.vertical, 20)
.frame(width: 200, height: 200)
.rotation3DEffect(
.degrees(rotation),
axis: (x: 0, y: 1, z: 0)
)
.onAppear {
// Create continuous rotation animation
withAnimation(.linear(duration: 5.0).repeatForever(autoreverses: false)) {
rotation = 360
}
}
// Instructions for users
VStack(spacing: 15) {
Text("Diet Soda Detection")
.font(.title)
.fontWeight(.bold)
Text("Hold your diet soda can in front of you to see it automatically detected and highlighted in your space.")
.font(.body)
.multilineTextAlignment(.center)
.foregroundColor(.secondary)
.padding(.horizontal)
}
}
.padding()
.frame(maxWidth: 400)
}
}
This implementation displays our 3D-scanned soda model (SodaModel.usdz) with a rotating animation, providing users with a clear reference of what the system is looking for. The rotation helps users understand how to present the object for optimal detection.
With these components in place, our application now provides a complete object detection experience. The system uses our trained model to recognize diet soda cans, creates precise visual indicators in real time, and provides clear user guidance through the informational interface.
Conclusion

In this tutorial, we’ve built a complete object detection system for visionOS that showcases the integration of several powerful technologies. Starting from 3D object capture, through ML model training in Create ML, to real-time detection using ARKit and RealityKit, we’ve created an app that seamlessly detects and tracks objects in the user’s space.
This implementation represents just the beginning of what’s possible with on-device machine learning in spatial computing. As hardware continues to evolve with more powerful Neural Engines and dedicated ML accelerators, and as frameworks like Core ML mature, we’ll see increasingly sophisticated applications that can understand and interact with our physical world in real time. The combination of spatial computing and on-device ML opens up possibilities for applications ranging from advanced AR experiences to intelligent environmental understanding, all while maintaining user privacy and low latency.