Nvidia Jetson Nano

The recently released Jetson Nano is an inexpensive alternative to the Jetson TX1:

Platform	CPU	GPU	Memory	Storage	MSRP
Jetson TX1 (Tegra X1)	4x ARM Cortex A57 @ 1.73 GHz	256x Maxwell @ 998 MHz (1 TFLOP)	4GB LPDDR4 (25.6 GB/s)	16 GB eMMC	$499
Jetson Nano	4x ARM Cortex A57 @ 1.43 GHz	128x Maxwell @ 921 MHz (472 GFLOPS)	4GB LPDDR4 (25.6 GB/s)	Micro SD	$99

Basically, for 1/5 the price you get 1/2 the GPU. Detailed comparison of the entire Jetson line.

The X1 is the SoC that debuted in 2015 with the Nvidia Shield TV.

Fun Fact: During the GDC announcement, when Jensen and Cevat “play” Crysis 3 together, their gamepads aren’t connected to anything. Seth W. from Nvidia (Jensen) and I (Cevat) are playing backstage. “Pay no attention to that man behind the curtain!”

Mini-Rant: Memory Bandwidth

The memory bandwidth of 25.6 GB/s is a little disappointing. We did some work with K1 and X1 hardware, and memory ended up being the bottleneck. It’s “conveniently” left out of the above table, but Xbox 360 eDRAM/eDRAM-to-main/main memory bandwidth is 256/32/22.4 GB/s.

Put another way, the TX1’s GPU hits 1 TFLOP while the original Xbox One GPU is 1.31 TFLOPS with main memory bandwidth of 68.3 GB/s (also ESRAM with over 100 GB/s, fugged-about-it). So, Xbone is 30% higher performance but has almost 2.7x the memory bandwidth.

When I heard the Nintendo Switch was using a “customized” X1, I assumed the customization involved a new memory solution. Nope. Same LPDDR4 that (imho) would be a better fit for a GPU with 1/4-1/2 the performance. We haven’t done any Switch development, but I wouldn’t be surprised if many titles are bottlenecked on memory. The next most likely culprit is the CPU, if a title is overly dependent on 1-2 threads, but never the GPU.

Looks like we have to hold out until the TX2 to get “big boy pants”. It’s 1.3 TFLOPS with 58.3 GB/s of bandwidth (almost 2.3x the X1).
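To put the rant in numbers, here is a rough sketch that divides main-memory bandwidth by peak GPU throughput, using only the figures quoted above. `bytes_per_flop` is a hypothetical helper name, and treating 1 TFLOP as 1000 GFLOPS is a simplifying assumption.

```shell
# Bytes of main-memory bandwidth available per FLOP of peak GPU throughput.
# Arguments: bandwidth in GB/s, peak throughput in GFLOPS.
bytes_per_flop() {
    awk -v bw="$1" -v gflops="$2" 'BEGIN { printf "%.4f\n", bw / gflops }'
}

# Figures as quoted in the text (1 TFLOP taken as 1000 GFLOPS)
echo "Tegra X1: $(bytes_per_flop 25.6 1000)"
echo "Xbox One: $(bytes_per_flop 68.3 1310)"
echo "TX2:      $(bytes_per_flop 58.3 1300)"
```

By this crude measure the Xbox One has roughly twice the bandwidth per FLOP of the X1, and the TX2 lands in between.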

Installation

Follow the official directions. On Mac:

```
# For the SD card in /dev/disk2
sudo diskutil partitionDisk /dev/disk2 1 GPT "Free Space" "%noformat%" 100%
unzip -p ~/Downloads/nv-jetson-nano-sd-card-image-r32.3.1.zip | sudo dd of=/dev/rdisk2 bs=1m
# Wait 10-20 minutes
```

The OS install itself is over 9 GB, and the Nvidia demos are large enough that a 16 GB SD card fills up quickly. We recommend at least a 32 GB SD card.

It boots into an Nvidia-customized Ubuntu 18.04.

Initial Shell Access

Connecting a keyboard/display directly to the Nano is the easiest way to get started.

It is also possible to do a “headless” install:

Place a jumper on J48
Connect a 5V DC power adapter to the DC barrel jack (5.5mm/2.1mm)
Connect a USB cable between host PC and Nano’s micro USB (make sure the cable has data wires)
On the host PC, identify the serial device connected to the Nano
- Mac: /dev/tty.usbmodem*
- Linux: /dev/ttyACM*
Connect a terminal emulator like screen to the serial device at a baud rate of 115200 (see screen usage):
```
 screen /dev/tty.usbmodem0123456789 115200
```

Then, just follow the prompts.
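If you’re not sure which device node showed up, something like the following can list the candidates. It’s a minimal sketch: `find_nano_serial` is a hypothetical helper name, and it only looks for the two naming patterns mentioned in the steps above.

```shell
# List device nodes that match the Nano's serial naming patterns.
# Optional argument: directory to search (defaults to /dev).
find_nano_serial() {
    dir="${1:-/dev}"
    ls "$dir" 2>/dev/null | while read -r name; do
        case "$name" in
            tty.usbmodem*|ttyACM*) printf '%s/%s\n' "$dir" "$name" ;;
        esac
    done
}

# Prints nothing if no matching device is attached
find_nano_serial
```

Running it once before and once after plugging in the USB cable makes it easy to spot which entry belongs to the Nano.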

If you’re using a Mac running Catalina (10.15), there are some complications.

Install other software to taste, but remember to grab the arm64/aarch64 version of binaries instead of arm/arm32.
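When in doubt about which download to pick, `uname -m` reports the machine architecture. The sketch below maps its output to the labels used here; `arch_label` is a hypothetical helper name, and the mapping only covers the machines discussed in this post.

```shell
# Map `uname -m` output to the label usually found on download pages.
# Optional argument: a machine string to classify (defaults to this machine).
arch_label() {
    case "${1:-$(uname -m)}" in
        aarch64|arm64) echo "arm64" ;;   # e.g. Jetson Nano
        armv6l|armv7l) echo "arm32" ;;   # e.g. Pi Zero, Pi 3 B on 32-bit Raspbian
        x86_64|amd64)  echo "x86_64" ;;
        *)             echo "unknown" ;;
    esac
}

arch_label
```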

Benchmarks

For our Raven Ridge-like APU with Vega GPU we run a series of benchmarks:

3DMARK, PCMARK, Cinebench
Unigine (Heaven, Valley)
Several games that have benchmark/demo modes (e.g. “Rise of the Tomb Raider”, “Shadow of Mordor”, etc.)
Claymore/Phoenix Miner
A few others

But they’re limited to Windows and/or x86. The de facto standard for ARM platforms seems to be the Phoronix Test Suite. In fall 2018, Phoronix published a comparison of a bunch of single-board computers that’s not exactly surprising but still interesting.

Purely for amusement, we’re also throwing in the results for the Raspberry Pi Zero, which, to be fair, is in a completely different device class with a different target use case.

Test	Pi Zero	Pi 3 B	Nano	Notes
glxgears	107	560	2350	FPS. Zero using “Full KMS”; without it, the Zero only manages 7.7 FPS. 3B using “Fake KMS”; “Full KMS” caused the display to stop working.
glmark2	399	383	1996	On the Pis, several tests failed to run
build runng	3960	135	68	Seconds. `cargo clean; time cargo build` of runng, like we did on the Pi Zero

Phoronix Test Suite

PTS is pretty nice. It provides an easy way to (re-)run a set of benchmarks based on a unique identifier. For example, to run the tests from the Fall 2018 ARM article:

```
sudo apt-get install -y php-cli php-xml
# Download PTS somewhere and run/compare against article
phoronix-test-suite benchmark 1809111-RA-ARMLINUX005
# Wait a few hours...
# Results are placed in ~/.phoronix-test-suite/test-results/
```

Test	Pi Zero	Pi 3 B	Nano	TX1	Notes
Tinymembench (memcpy)	291	1297	3504	3862
TTSIOD 3D Renderer		15.66	40.83	45.05
7-Zip Compression	205	1863	3996	4526
C-Ray		2357	943	851	Seconds (lower is better)
Primesieve		1543	466	401	Seconds (lower is better)
AOBench	778	333	190	165	Seconds (lower is better)
FLAC Audio Encoding	971.18	387.09	103.57	78.86	Seconds (lower is better)
LAME MP3 Encoding	780	352.66	143.82	113.14	Seconds (lower is better)
Perl (Pod2html)	5.3830	1.2945	0.7154	0.6007	Seconds (lower is better)
PostgreSQL (Read Only)		6640	12410	16079
Redis (GET)	34567	213067	568431	484688
PyBench	76419	24349	7030	6348	ms (lower is better)
Scikit-Learn		844	496	434	Seconds (lower is better)

The “Pi 3 B” and “TX1” columns are reproduced from the OpenBenchmarking.org results. There’s also an older set of benchmarks, 1703199-RI-ARMYARM4104.

Check out the graphs (woo hoo!):

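These all seem to be predominantly CPU benchmarks where the TX1 predictably bests the Nano, typically by 10-20% (closer to 30% in FLAC, LAME, and PostgreSQL), owing largely to its 20% higher CPU clock. Redis (GET) is the one exception in the table, where the Nano came out ahead.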
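The gap is easy to check against the table. Here’s a small sketch that turns a pair of Nano/TX1 results into a speedup factor; `tx1_speedup` is a hypothetical helper name, and the third argument says whether lower or higher numbers are better for that test.

```shell
# TX1 speedup over the Nano for one benchmark result.
# Arguments: Nano result, TX1 result, "lower" or "higher" (which is better).
tx1_speedup() {
    awk -v nano="$1" -v tx1="$2" -v mode="$3" 'BEGIN {
        r = (mode == "lower") ? nano / tx1 : tx1 / nano
        printf "%.2f\n", r
    }'
}

# Values copied from the table above
echo "C-Ray:       $(tx1_speedup 943 851 lower)"
echo "7-Zip:       $(tx1_speedup 3996 4526 higher)"
echo "Redis (GET): $(tx1_speedup 568431 484688 higher)"
# CPU clock ratio for reference (1.73 GHz vs 1.43 GHz)
echo "Clock:       $(tx1_speedup 1.43 1.73 higher)"
```

A factor above 1 means the TX1 is ahead; below 1 means the Nano won that test.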

Don’t let the name “TTSIOD 3D Renderer” fool you; it’s a software renderer (i.e. non-hardware-accelerated; no GPUs were harmed by that test). This is further evidenced by the “Socionext Developerbox” showing. Socionext isn’t some new, up-and-coming GPU company; that device has a 24-core ARM Cortex A53 @ 1 GHz (yes, 24, that’s not a typo).

There are more results for the Nano, including things like Nvidia TensorRT and temperature monitoring both with and without a fan. But GLmark2 is likely one of the only things that will run everywhere.

glxgears

On Nano:

```
# Need to disable vsync for Nvidia hardware
__GL_SYNC_TO_VBLANK=0 glxgears
```

On Pi:

```
sudo raspi-config
```

Selecting Advanced Options > GL Driver > GL (Full KMS) > Ok adds dtoverlay=vc4-kms-v3d to the bottom of /boot/config.txt. Reboot and run glxgears.
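glxgears prints a frame-rate line every few seconds, so a single number for a table takes a bit of averaging. A minimal sketch, assuming the usual “N frames in 5.0 seconds = X FPS” report lines; `average_fps` is a hypothetical helper name.

```shell
# Average the FPS figures that glxgears prints every few seconds.
# Assumes report lines of the form "<N> frames in 5.0 seconds = <X> FPS".
average_fps() {
    awk '/ FPS$/ { sum += $(NF - 1); n++ }
         END { if (n) printf "%.1f\n", sum / n }'
}

# Usage on the Nano (the 30 second timeout is an arbitrary choice):
#   __GL_SYNC_TO_VBLANK=0 timeout 30 glxgears | average_fps
```

Lines that don’t end in “FPS” are ignored, so any driver chatter mixed into the output shouldn’t skew the result.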

GLmark2

Getting GLmark2 working on the Nano is easy:

```
sudo apt-get install -y glmark2
```

On Pi, it’s currently broken.

You can use the commit right after Pi3 support was merged:

```
sudo apt-get install -y libpng-dev libjpeg-dev
git clone https://github.com/glmark2/glmark2.git
cd glmark2
git checkout 55150cfd2903f9435648a16e6da9427d99c059b4
```

There’s a build error:

```
../src/gl-state-egl.cpp: In member function ‘bool GLStateEGL::gotValidDisplay()’:
../src/gl-state-egl.cpp:448:17: error: ‘GLMARK2_NATIVE_EGL_DISPLAY_ENUM’ was not declared in this scope
                 GLMARK2_NATIVE_EGL_DISPLAY_ENUM, native_display_, NULL);
                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

In src/gl-state-egl.cpp at line 427 add:

```
#else
// Platforms not in the above platform enums fall back to eglGetDisplay.
#define GLMARK2_NATIVE_EGL_DISPLAY_ENUM 0
```

Build everything and run it:

```
# `dispmanx-glesv2` is for the Pi
./waf configure --with-flavors=dispmanx-glesv2
./waf
sudo ./waf install
glmark2-es2-dispmanx --fullscreen
```

If it fails with “failed to add service: already-in-use?”, take a look at:

Both mention commenting out dtoverlay=vc4-kms-v3d in /boot/config.txt, which was added when we enabled “GL (Full KMS)”.
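Before re-running glmark2 it can help to confirm whether that overlay line is still active. A minimal sketch; `full_kms_enabled` is a hypothetical helper name, and it only matches an uncommented line at the start of a row.

```shell
# Succeeds if the Full KMS overlay is active (uncommented) in the config file.
# Optional argument: path to the config file (defaults to /boot/config.txt).
full_kms_enabled() {
    grep -q '^dtoverlay=vc4-kms-v3d' "${1:-/boot/config.txt}" 2>/dev/null
}

if full_kms_enabled; then
    echo "Full KMS overlay is active"
else
    echo "Full KMS overlay is not active"
fi
```

A line commented out with a leading `#` counts as not active. Changes to /boot/config.txt only take effect after a reboot.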

Hello AI World

After getting your system set up, take a look at “Hello AI World”, which does image recognition and is pre-trained with 1000 objects. Start with “Building the Repo from Source”. It took a while to install dependencies, but then everything builds pretty quickly.

```
cd jetson-inference/build/aarch64/bin
# Recognize what's in orange_0.jpg and place results in output.jpg
./imagenet-console orange_0.jpg output.jpg

# If you have a camera attached via CSI (e.g. Raspberry Pi Camera v2)
./imagenet-camera googlenet # or `alexnet`
```

Rendered Obsolete