The recently released Jetson Nano is an inexpensive alternative to the Jetson TX1:

| Platform | CPU | GPU | Memory | Storage | MSRP |
|---|---|---|---|---|---|
| Jetson TX1 (Tegra X1) | 4x ARM Cortex A57 @ 1.73 GHz | 256x Maxwell @ 998 MHz (1 TFLOP) | 4GB LPDDR4 (25.6 GB/s) | 16 GB eMMC | $499 |
| Jetson Nano | 4x ARM Cortex A57 @ 1.43 GHz | 128x Maxwell @ 921 MHz (472 GFLOPS) | 4GB LPDDR4 (25.6 GB/s) | Micro SD | $99 |

Basically, for 1/5 the price you get 1/2 the GPU. There’s a detailed comparison of the entire Jetson line if you want the full picture.
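
Or, as a rough GFLOPS-per-dollar calculation using the numbers from the table above (just shell arithmetic for illustration):

echo "scale=2; 1000 / 499" | bc   # TX1:  ~2.0 GFLOPS/$
echo "scale=2; 472 / 99" | bc     # Nano: ~4.8 GFLOPS/$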

The X1 is the SoC that debuted in 2015 with the Nvidia Shield TV:

Fun Fact: During the GDC announcement, when Jensen and Cevat “play” Crysis 3 together, their gamepads aren’t connected to anything. Seth W. from Nvidia (as Jensen) and I (as Cevat) are playing backstage. “Pay no attention to that man behind the curtain!”

Mini-Rant: Memory Bandwidth

The memory bandwidth of 25.6 GB/s is a little disappointing. We did some work with K1 and X1 hardware and memory ended up being the bottleneck. It’s “conveniently” left out of the above table, but the Xbox 360’s eDRAM / eDRAM-to-main / main memory bandwidth is 256/32/22.4 GB/s.

Put another way, the TX1’s GPU hits 1 TFLOP while the original Xbox One GPU is 1.31 TFLOPS with main memory bandwidth of 68.3 GB/s (plus ESRAM at over 100 GB/s, fugged-about-it). So, the Xbone has roughly 30% higher GPU performance but almost 2.7x the memory bandwidth.

When I heard the Nintendo Switch was using a “customized” X1, I assumed the customization involved a new memory solution. Nope. It’s the same LPDDR4 that (imho) would be a better fit for a GPU with 1/4-1/2 the performance. We haven’t done any Switch development, but I wouldn’t be surprised if many titles are bottlenecked on memory. The next most likely culprit is the CPU, if a title is overly dependent on 1-2 threads, but never the GPU.

Looks like we have to hold out until the TX2 to get “big boy pants”: 1.3 TFLOPS with 58.3 GB/s of memory bandwidth (almost 2.3x the X1).
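
For the curious, the bandwidth ratios quoted above work out as follows (using the figures in this post; exact numbers vary a bit by source):

echo "scale=2; 68.3 / 25.6" | bc   # Xbox One vs. X1/Nano: ~2.67x
echo "scale=2; 58.3 / 25.6" | bc   # TX2 vs. X1:           ~2.28x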

Installation

Follow the official directions. On Mac:

# For the SD card in /dev/disk2
sudo diskutil partitionDisk /dev/disk2 1 GPT "Free Space" "%noformat%" 100%
unzip -p ~/Downloads/nv-jetson-nano-sd-card-image-r32.3.1.zip | sudo dd of=/dev/rdisk2 bs=1m
# Wait 10-20 minutes
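
If you’re not sure the SD card really is /dev/disk2, it’s worth checking before running dd (disk2 throughout is just an example; substitute your actual identifier):

# Confirm which attached disk is the SD card before overwriting anything
diskutil list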

The OS install itself is over 9 GB and the Nvidia demos are quite large, so a 16 GB SD card fills up quickly. We recommend at least a 32 GB SD card.

It boots into Nvidia’s customized Ubuntu 18.04.
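
Once booted, you can see how much of the card the install actually consumed:

# Check used/available space on the root filesystem
df -h /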

Initial Shell Access

Connecting a keyboard/display directly to the Nano is the easiest way to get started.

It is also possible to do a “headless” install:

  1. Place a jumper on J48
  2. Connect a 5V DC power adapter to the DC barrel jack (5.5mm/2.1mm)
  3. Connect a USB cable between host PC and Nano’s micro USB (make sure the cable has data wires)
  4. On the host PC, identify the serial device connected to the Nano
    • Mac: /dev/tty.usbmodem*
    • Linux: /dev/ttyACM*
  5. Connect a terminal emulator like screen to the serial device at 115200 baud (see screen usage):
     screen /dev/tty.usbmodem0123456789 115200
    

Then, just follow the prompts.
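
A couple of standard screen key bindings come in handy once you’re connected:

# Detach from the serial session without closing it: Ctrl-a then d
# Kill the session entirely:                        Ctrl-a then k
# Re-attach to a detached session later
screen -r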

If you’re using a Mac running Catalina (10.15) there are some complications.

Install other software to taste; remember to grab the arm64/aarch64 versions of binaries instead of arm/arm32.
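
If in doubt about which flavor a package needs, the Nano identifies itself as arm64/aarch64:

uname -m                   # aarch64
dpkg --print-architecture  # arm64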

Benchmarks

For our Raven Ridge-like APU with a Vega GPU we ran a series of benchmarks:

But they’re either limited to Windows and/or x86. The de-facto standard for ARM platforms seems to be the Phoronix Test Suite. In Fall 2018, Phoronix did a comparison of a bunch of single-board computers that’s not exactly surprising, but still interesting.

Purely for amusement, we’re also throwing in results for the Raspberry Pi Zero, which, to be fair, is in a completely different device class with a different target use-case.

| Test | Pi Zero | Pi 3 B | Nano | Notes |
|---|---|---|---|---|
| glxgears | 107 | 560 | 2350 | FPS. Zero using “Full KMS”; without it, it only manages 7.7 FPS. 3B using “Fake KMS”; “Full KMS” caused the display to stop working. |
| glmark2 | 399 | 383 | 1996 | Score. Several tests failed to run on the Pis. |
| build runng | 3960 | 135 | 68 | Seconds. `cargo clean; time cargo build` of runng, like we did on the Pi Zero. |
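
For reference, the runng build number is just a timed clean build of the crate (the repository URL is an assumption; point it at wherever you have runng checked out):

# Timed clean build of runng, same as in the Pi Zero post
git clone https://github.com/jeikabu/runng.git
cd runng
cargo clean
time cargo build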

Phoronix Test Suite

PTS is pretty nice. It provides an easy way to (re-)run a set of benchmarks based on a unique identifier. For example, to run the tests from the Fall 2018 ARM article:

sudo apt-get install -y php-cli php-xml
# Download PTS somewhere and run/compare against article
phoronix-test-suite benchmark 1809111-RA-ARMLINUX005
# Wait a few hours...
# Results are placed in ~/.phoronix-test-suite/test-results/
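
Saved results can be listed and re-displayed later with PTS sub-commands (the result name below is a placeholder for whatever PTS saved yours as):

# List everything under ~/.phoronix-test-suite/test-results/
phoronix-test-suite list-saved-results
# Display a specific result set
phoronix-test-suite show-result <result-name>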

| Test | Pi Zero | Pi 3 B | Nano | TX1 | Notes |
|---|---|---|---|---|---|
| Tinymembench (memcpy) | 291 | 1297 | 3504 | 3862 | MB/s (higher is better) |
| TTSIOD 3D Renderer | | 15.66 | 40.83 | 45.05 | FPS (higher is better) |
| 7-Zip Compression | 205 | 1863 | 3996 | 4526 | MIPS (higher is better) |
| C-Ray | | 2357 | 943 | 851 | Seconds (lower is better) |
| Primesieve | | 1543 | 466 | 401 | Seconds (lower is better) |
| AOBench | 778 | 333 | 190 | 165 | Seconds (lower is better) |
| FLAC Audio Encoding | 971.18 | 387.09 | 103.57 | 78.86 | Seconds (lower is better) |
| LAME MP3 Encoding | 780 | 352.66 | 143.82 | 113.14 | Seconds (lower is better) |
| Perl (Pod2html) | 5.3830 | 1.2945 | 0.7154 | 0.6007 | Seconds (lower is better) |
| PostgreSQL (Read Only) | | 6640 | 12410 | 16079 | TPS (higher is better) |
| Redis (GET) | 34567 | 213067 | 568431 | 484688 | Requests/sec (higher is better) |
| PyBench | 76419 | 24349 | 7030 | 6348 | ms (lower is better) |
| Scikit-Learn | | 844 | 496 | 434 | Seconds (lower is better) |

The “Pi 3 B” and “TX1” columns are reproduced from the OpenBenchmarking.org results. There’s also an older set of benchmarks, 1703199-RI-ARMYARM4104.

Check out the graphs (woo hoo!):

These all seem to be predominantly CPU benchmarks where the TX1 predictably bests the Nano by 10-20% owing to its 20% higher CPU clock.

Don’t let the name “TTSIOD 3D Renderer” fool you: it’s a software renderer (i.e. non-hardware-accelerated; no GPUs were harmed by that test). This is further evidenced by the “Socionext Developerbox” showing. Socionext isn’t some new, up-and-coming GPU company; that device has a 24-core ARM Cortex A53 @ 1 GHz (yes, 24, that’s not a typo).

There are more results for the Nano, including things like Nvidia TensorRT and temperature monitoring both with and without a fan. But GLmark2 is likely one of the only things that will run everywhere.

glxgears

On Nano:

# Need to disable vsync for Nvidia hardware
__GL_SYNC_TO_VBLANK=0 glxgears

On Pi:

sudo raspi-config

Selecting Advanced Options > GL Driver > GL (Full KMS) > Ok adds dtoverlay=vc4-kms-v3d to the bottom of /boot/config.txt. Reboot and run glxgears.

GLmark2

Getting GLmark2 working on the Nano is easy:

sudo apt-get install -y glmark2
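
Then run it; --fullscreen is a standard glmark2 flag:

# Runs the benchmark suite and prints a final glmark2 Score
glmark2 --fullscreen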

On Pi, it’s currently broken.

You can use the commit right after Pi3 support was merged:

sudo apt-get install -y libpng-dev libjpeg-dev
git clone https://github.com/glmark2/glmark2.git
cd glmark2
git checkout 55150cfd2903f9435648a16e6da9427d99c059b4

There’s a build error:

../src/gl-state-egl.cpp: In member function ‘bool GLStateEGL::gotValidDisplay()’:
../src/gl-state-egl.cpp:448:17: error: ‘GLMARK2_NATIVE_EGL_DISPLAY_ENUM’ was not declared in this scope
                 GLMARK2_NATIVE_EGL_DISPLAY_ENUM, native_display_, NULL);
                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In src/gl-state-egl.cpp at line 427 add:

#else
// Platforms not in the above platform enums fall back to eglGetDisplay.
#define GLMARK2_NATIVE_EGL_DISPLAY_ENUM 0

Build everything and run it:

# `dispmanx-glesv2` is for the Pi
./waf configure --with-flavors=dispmanx-glesv2
./waf
sudo ./waf install
glmark2-es2-dispmanx --fullscreen

If it fails with failed to add service: already-in-use? take a look at:

Both mention commenting out dtoverlay=vc4-kms-v3d in /boot/config.txt, which was added when we enabled “GL (Full KMS)”.
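
In other words, the relevant bit of /boot/config.txt ends up looking like this (a sketch of the edit):

# /boot/config.txt - disable "GL (Full KMS)" so dispmanx works again
#dtoverlay=vc4-kms-v3d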

Hello AI World

After getting your system set up, take a look at “Hello AI World”, which does image recognition and comes pre-trained with 1000 objects. Start with “Building the Repo from Source”. It takes a while to install the dependencies, but then everything builds pretty quickly.
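
The build is a fairly standard cmake flow; roughly the following at the time of writing (check the upstream instructions in case the steps have changed):

sudo apt-get install -y git cmake
git clone --recursive https://github.com/dusty-nv/jetson-inference
cd jetson-inference
mkdir build && cd build
cmake ../        # runs a pre-build script that fetches models/dependencies
make -j$(nproc)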

cd jetson-inference/build/aarch64/bin
# Recognize what's in orange_0.jpg and place results in output.jpg
./imagenet-console orange_0.jpg output.jpg

# If you have a camera attached via CSI (e.g. Raspberry Pi Camera v2)
./imagenet-camera googlenet # or `alexnet`