Friday, February 20, 2015

A look at Raspberry Pi 2 performance and overclocking

Raspberry Pi 2 significantly improves on original model

The Raspberry Pi 2 significantly increases performance compared to the original Raspberry Pi. It corrects deficiencies in the design of the original SoC by replacing the single ARM11 core with four more modern and faster Cortex-A7 (ARMv7) CPU cores, all within the constraints of a similar 40 nm manufacturing process. Whereas the CPU inside the BCM2835 processor of the original Raspberry Pi effectively ran without an L2 cache (the cache was tied to the GPU), the new Broadcom BCM2836 SoC contains a dedicated 512 KB CPU cache, improving memory performance and performance in general. The amount of RAM has also doubled to 1 GB. Other changes include more USB ports and a MicroSD card slot for storage instead of SD.

Compatibility with Raspbian

Otherwise, the new SoC as well as the device itself has been engineered to maintain hardware and software compatibility with the original Raspberry Pi, while running considerably faster. When using the Raspbian OS, an ARM11-compatible Debian-based armhf distribution maintained specifically for the Raspberry Pi, only the kernel is specific to the Raspberry Pi 2, with the entire userland being 100% compatible. Although this misses out on some of the advantages of the newer ARMv7 instruction set (such as the reduced code size of Thumb2 instructions, which are used in ARMv7 Debian), applications that can take advantage of, for example, NEON SIMD instructions usually do so on a run-time detection basis (as they do in ARMv7 Debian). The most critical gains from the new instruction set can therefore, in theory, still be realized in Raspbian.

Nevertheless, the new device can run an OS specifically configured for ARMv7, such as Debian armhf and derived distributions such as Ubuntu, which take advantage of the reduced-size Thumb2 instruction set. An example of such a distribution that has been applied to the Raspberry Pi 2 is Ubuntu Snappy Core.

Components of Raspberry Pi 2 SoC clocked conservatively out of the box

The maximum CPU clock of the Cortex-A7 cores in the Raspberry Pi 2 is 900 MHz, while the L2 cache appears to be clocked at only 250 MHz by default, inheriting the clock rate of the original Pi's GPU cache. SDRAM is clocked at 450 MHz by default. The GPU is clocked at 250 MHz, similar to the original Raspberry Pi.

The configured speed of the L2 cache is particularly low, as we will see, since speeds up to 600 MHz seem to be stable when overclocking, resulting in a large performance increase. The CPU clock speed can also be bumped up somewhat.

The raspi-config utility in Raspbian at the time of writing contains just one overclocking option for the Raspberry Pi 2, which clocks the CPU at 1000 MHz, doubles L2 cache speed to 500 MHz and clocks SDRAM also at 500 MHz. Unfortunately, this setting turned out to be unstable on my device. This appears to be due to the SDRAM clock speed being set too high and causing problems. Bumping the SDRAM speed down to 483 MHz results in a stable system.

Overclocking test set-up

I have performed a number of overclocking tests with different clock configurations. The test set-up was as follows.

To prevent corruption of the root file system, I modified /etc/fstab to mount the root filesystem read-only at boot by adding "ro" to the mount flags. To remount with read-write capability when necessary after boot (on a stable system), I ran "sudo mount -o remount,rw /dev/mmcblk0p2 /".
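
As a rough sketch of the /etc/fstab change, the filter below appends "ro" to the mount options of the root entry. The function name add_ro_to_root is hypothetical and not part of any package; the result should be checked by hand before it replaces the real file.

```shell
# Hypothetical helper: reads fstab text on stdin and writes it to stdout,
# appending ",ro" to the mount options (field 4) of the entry mounted on "/".
# Comment lines, other entries, and a root entry that already has "ro"
# pass through unchanged.
# Note: awk rebuilds the edited line with single spaces between fields.
add_ro_to_root() {
    awk '$1 !~ /^#/ && $2 == "/" && index("," $4 ",", ",ro,") == 0 { $4 = $4 ",ro" }
         { print }'
}

# Example (writes a modified copy for inspection instead of editing in place):
#   add_ro_to_root < /etc/fstab > /tmp/fstab.new
```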

The main stability test was performed using the single-threaded memtester package (available in Raspbian and Debian) using the command line "memtester 16M 10" (16 MB memory region, 10 loops). In several cases four of these commands were run in parallel to fully occupy the CPU and provide reliable stability information. In unstable configurations, this test almost always shows errors.
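
Running several memtester instances at once can be scripted along the following lines. This is a minimal sketch: the helper name run_parallel is hypothetical, and it assumes that memtester signals detected errors through a non-zero exit status.

```shell
# Hypothetical helper: starts COUNT copies of a command in the background,
# waits for all of them, and returns non-zero if any copy failed.
run_parallel() {
    count=$1
    shift
    pids=""
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" &
        pids="$pids $!"
        i=$((i + 1))
    done
    status=0
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return "$status"
}

# Example: four concurrent memtester runs, one per CPU core.
#   run_parallel 4 memtester 16M 10 && echo "all runs passed"
```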

Memory performance was tested using a slightly modified version of the fastarm package with the command line "for x in 0 1 2 3 4 5 6 7 8 9; do ./benchmark --duration 1 --repeat 1 --memcpy e --test 0; done". Because of result variation due to cache allocation effects, I took the best result out of ten. Tests number 0 (memcpy of varying size, aligned, depends on CPU as well as memory) and 43 (4K page-aligned memcpy, a more pure memory subsystem test) were used.
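
Picking the best of the ten runs can be automated with a small filter. The sketch below assumes the throughput figures have already been extracted from the benchmark output, one number per line; how to extract them depends on the benchmark's output format, which is not shown here. The name best_of is hypothetical.

```shell
# Hypothetical helper: reads one throughput figure (for example MB/s) per line
# on stdin and prints the largest one.
best_of() {
    # LC_ALL=C makes sort treat "." as the decimal point regardless of locale.
    LC_ALL=C sort -n | tail -n 1
}
```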

For a real-world CPU performance indication I used the command line "time zcat bullet3-Bullet-2.83-alpha.tar.gz >/dev/null" performed multiple times, which is effectively gzip decompression of a large file out of buffer cache memory.

Table with stability testing results

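The following table shows stability testing results for a large number of CPU clock, core clock (L2 cache clock), and SDRAM clock configurations, along with memory and CPU benchmark scores where measured. Frequencies are in MHz. The first +Volt column is the over_voltage setting, and the second (p i c) lists the over_voltage_sdram_p, over_voltage_sdram_i and over_voltage_sdram_c settings. The memcpy columns are in MB/s (fastarm tests 0 and 43), and the zcat column is the time reported for the gzip decompression test.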

CPU     +Volt   Core    SDRAM   +Volt   Stability       Memcpy perf.
                                p i c   (memtester)     Varied  4K      zcat

900     ?       250     450     0 0 0   OK (slow)       716     1015    2.388s
Standard overclock (raspi-config "Pi 2" option):
1000    2       500     500     0 0 0   Fail
Other settings:
900     0       450     450     0 0 0   OK              778     1270    2.380s
900     0       600     467     0 0 0   Almost          804     1431    2.379s
900     2       600     467     0 0 0   OK (multi-test)
1000    0       467     467     0 0 0   OK (multi-test) 867     1410    2.146s
1000    0       500     483     0 0 0   OK (multi-test) 880     1502    2.146s
1000    0       500     483     2 0 0   OK (multi-test) 878     1502    2.169s
1000    2       500     500     0 0 0   Almost
1000    4       500     500     0 0 0   Almost
1000    0       500     500     2 2 0   Almost
1000    0       500     500     4 4 0   Almost?
1000    0       500     500     4 0 0   Fail            886     1415    2.143s
1000    2       500     500     4 0 0   Fail
1000    4       500     500     4 4 0   Fail (multi)
1000    0       500     500     6 6 6   ?
1000    2       600     467     0 0 0   OK (multi-test) 885     1518    2.145s
1000    2       600     500     4 0 0   OK (multi-test) 890     1553    2.142s
1000    2       667     500     4 0 0   Fail (freeze)
1000    6       667     500     6 0 0   Fail (freeze)
1050    0       466     466     4 4 4   OK
1050    0       466     533     4 4 4   Fail
1050    0       466     533     6 6 6   Fail (bitspr.)
1050    4       600     450     0 0 0   OK (multi-test) 916     1528    2.045s
1050    4       600     483     2 0 0   OK (multi-test) 924     1571    2.041s
1067    6       533     533     6 6 6   Fail
1067    4       533     533     8 8 0   Fail (bitflip)
1067    6       533     533     8 8 0   Fail (bitflip)
1067    6       533     500     4 4 0   Almost
1067    4       533     466     0 0 0   OK (multi-test) 925     1521    2.010s
1100    0       466     466     0 0 0   Fail (boot)
1100    4       466     466     0 0 0   OK?
1100    4       600     467     0 0 0   Fail
1100    4       500     500     6 6 6   OK?
1100    4       500     500     6 6 0   OK?
1100    4       500     500     4 0 0   Almost
1100    4       500     500     6 0 0   OK?             950     1532    1.950s
1100    6       500     500     6 0 0   Almost
1100    4       533     533     6 0 4   Fail            962     1593    1.948s
1100    4       550     483     0 0 0   OK (multi-test) 944     1549    1.951s
1133    4       567     466     0 0 0   Almost          974     1578    1.893s
1133    4       567     467     4 0 0   Almost
1133    5       567     453     0 0 0   Almost          971     1571    1.896s
1133    8       567     453     0 0 0   Fail
1166    4       466     466     0 0 0   Almost          960     1451    1.841s
1167    4       466     466     2 2 4   Fail
1166    6       466     466     0 0 0   Fail            962     1451    1.841s
1167    8       500     500     4 0 0   Fail                            1.839s
1167    8       500     500     8 8 8   Fail
1200    8       600     450     4 0 0   Fail
The stable configurations show "OK (multi-test)" in the stability column, meaning they were stable during a test with multiple memtester processes running concurrently. Most unstable configurations have an SDRAM clock speed of 500 MHz or higher, or a CPU speed higher than 1100 MHz.

CPU frequency corresponds with the "arm_freq=" setting in /boot/config.txt. The CPU/main SoC voltage is set with the over_voltage setting. The core clock (the L2 cache speed on the Raspberry Pi 2) is set with core_freq. The SDRAM frequency is set with sdram_freq, while voltage settings for the SDRAM physical layer, I/O and controller are set using over_voltage_sdram_p, over_voltage_sdram_i and over_voltage_sdram_c, of which the physical layer voltage seems to be the most relevant to overclocking. An example of the relevant lines in /boot/config.txt for a particular overclocking configuration (1000 MHz CPU, with stable 483 MHz SDRAM, as well as 256 MB memory reserved for GPU) follows.
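
A sketch of such a configuration, written as a small shell helper that prints the lines. The helper name is hypothetical, the core_freq and over_voltage_sdram_p values are taken from the matching row of the stability table rather than from a preserved listing, and gpu_mem is assumed to be the setting used for the GPU memory reservation.

```shell
# Hypothetical helper: prints config.txt lines for a 1000 MHz CPU, 500 MHz
# core (L2 cache), 483 MHz SDRAM configuration with 256 MB reserved for the GPU.
# over_voltage_sdram_p=2 follows the SDRAM overvoltage listed for this
# configuration in the tables; it is an assumption that it is needed here.
print_overclock_config() {
    cat <<'EOF'
arm_freq=1000
core_freq=500
sdram_freq=483
over_voltage_sdram_p=2
gpu_mem=256
EOF
}

# Example: review the output first, then append it to the real file.
#   print_overclock_config | sudo tee -a /boot/config.txt
```
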
See the official documentation for more details.

Observations based on stability testing

The following is apparent from testing my device:
  • The core_freq setting seems to directly control the clock of the CPU's L2 cache in the new SoC, which has a large effect on performance. Depending on other frequencies, core_freq values up to 600 MHz seem to be stable, giving a significant performance boost over the default configuration of 250 MHz.
  • When increasing CPU speed beyond roughly 1000 MHz, the CPU core voltage has to be bumped up.
  • Increasing SDRAM speed beyond about 483 MHz seems to cause instability on my device. Bumping up the SDRAM voltage (in particular the physical layer voltage, but not the I/O voltage or SDRAM controller voltage) may help a little for potential stability. However, SDRAM speeds of 500 MHz and higher tend to cause stability problems regardless of voltages on my device.
  • Certain divisor relationships between CPU clock and core (L2 cache) clock (such as 2:1) seem to enhance stability and performance.

CPU overclocking conclusions

  • The default Raspberry Pi 2 core_freq (L2 CPU cache) setting of 250 MHz appears to be extremely conservative. At the default CPU frequency of 900 MHz, 450 MHz (exactly half the CPU clock) appears to be very stable and even 600 MHz can be stable.
  • Unfortunately, the standard Raspberry Pi 2 overclocking setting available in raspi-config at the time of writing (1000 MHz CPU, 500 MHz core clock, 500 MHz SDRAM) appears to be unstable on my device due to an SDRAM clock speed that is slightly too high. Instead of bumping the CPU voltage as performed by this setting, increasing the SDRAM voltage (primarily the physical layer voltage) may improve stability, but clocking the SDRAM slightly lower at 483 or 467 MHz seems to be the best solution.
  • It seems likely that certain SDRAM parameters (CAS latency, etc.) are set to fixed values by the kernel and that higher SDRAM speeds will be possible when these parameters are configurable or appropriately adjusted by the kernel for higher SDRAM clock speeds. However, the actual RAM chip used is an Elpida/Micron EDB8132B4PB-8D-F LPDDR2-800 chip specified for 400 MHz clock frequency, so the overclocking headroom may not be that high.

Table with stable high-performance clock configurations

The following table shows stable high-performance clock configurations tested on my device and their clock frequency ratios:
CPU     Over-   Core    Base
clock   volt    clock   Clock   CPU : Core      SDRAM   Overv.

1067    +4      533     533     2 : 1           467
1050    +4      600     150     7 : 4           483     +2
1000    +2      600     100     5 : 3           500     +4
1000            500     500     2 : 1           483     +2
 900    +2      600     133     3 : 2           467
 900            450     450     2 : 1           450
However, I may have to retest the configuration with an SDRAM frequency of 500 MHz because other configurations show such a setting to be unstable after extensive testing. Additionally, the 1100 MHz CPU frequency setting turned out not to be completely stable.
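
The "CPU : Core" column in the table above is the ratio of the two clock frequencies in lowest terms. A minimal sketch of that arithmetic follows, with a hypothetical function name. It works on whole MHz values, so a pair such as 1067 and 533, presumably rounded from 1066.67 and 533.33, only comes out as 2 : 1 approximately and does not reduce exactly.

```shell
# Hypothetical helper: prints the reduced "CPU : Core" ratio of two integer
# clock frequencies in MHz, using Euclid's algorithm for the common divisor.
clock_ratio() {
    a=$1
    b=$2
    while [ "$b" -ne 0 ]; do
        t=$((a % b))
        a=$b
        b=$t
    done
    # $a now holds the greatest common divisor of the two frequencies.
    echo "$(($1 / a)) : $(($2 / a))"
}

clock_ratio 1050 600   # prints "7 : 4"
clock_ratio 900 450    # prints "2 : 1"
```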

Overclocking the GPU

By default, the Raspberry Pi as well as the Raspberry Pi 2 will use dynamic clocking, whereby the CPU speed, "core_freq" speed and SDRAM frequency are dynamically adjusted based on CPU load. Any GPU frequency settings, as governed by the "v3d_freq", "h264_freq" and "isp_freq" settings in config.txt, are ignored by default.

Using "force_turbo=1" allows overclocking of the GPU using the "v3d_freq", "h264_freq" and "isp_freq" options. "v3d_freq" corresponds to the frequency of the 3D block (the most relevant for overclocking), while "h264_freq" is the H.264 video block and "isp_freq" governs the camera interface. However, "force_turbo=1" also disables dynamic clocking, locking the CPU, core and SDRAM speeds to fixed maximum values, which is highly undesirable. Also note that using "force_turbo=1" may void the warranty of the device.

There is another setting, "avoid_pwm_pll=1", that allows "core_freq" to be set independently from that of the GPU on the original Raspberry Pi, at the cost of slightly reducing analog audio output quality. However, "force_turbo=1" is still required to be able to modify the GPU clock frequencies.

Because the Raspberry Pi 2 has an independent GPU with its own L2 cache separate from the L2 cache of the CPU, some of these limitations may have become unnecessary (in particular the requirement that the CPU is locked at a high speed with "force_turbo=1" in order to be able to overclock the GPU), and if that is the case these restrictions will hopefully be removed in the future.

When running 3D benchmarks, the following CPU and SDRAM settings were used (note that when using "force_turbo=1" to overclock the GPU, these frequencies are locked and do not scale down when the CPU is idle):
When running 3D GPU benchmarks without overclocking the GPU (force_turbo=0), the CPU and L2 cache frequencies appear to be scaled down quickly because the CPU load is relatively low. This creates a CPU bottleneck that reduces the throughput of the 3D benchmarks: the frame rate peaks initially and then drops to a lower base. To avoid this, we raise the sampling_down_factor of the ondemand cpufreq governor from 50 to 1000:
sudo sh -c "echo 1000 >/sys/devices/system/cpu/cpufreq/ondemand/sampling_down_factor"
The following settings overclock the 3D block (V3D) of the GPU from 250 MHz to 300 MHz:
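
As described above, this needs "force_turbo=1" together with "v3d_freq=300" in config.txt. One way to apply such settings is sketched below; the helper set_config_option is hypothetical, and it is safest to try it on a copy of the file first.

```shell
# Hypothetical helper: sets KEY=VALUE in a config.txt-style file, replacing an
# existing KEY= line or appending a new one if the key is not present.
set_config_option() {
    file=$1
    key=$2
    value=$3
    if grep -q "^${key}=" "$file"; then
        sed "s/^${key}=.*/${key}=${value}/" "$file" > "$file.tmp" &&
            mv "$file.tmp" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Example (as root, after backing up /boot/config.txt):
#   set_config_option /boot/config.txt force_turbo 1
#   set_config_option /boot/config.txt v3d_freq 300
```
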
These are the results of benchmark testing with different V3D clock speeds:
v3d_freq        demo1   demo1   demo2   demo2   demo2   demo5   demo9   game
                        lights          lights  shadows
default          81.1    20.5    26.1     8.87    0.98   50.5    46.4   112
300              95.3            28.4     9.88    1.12   56.7    49.3   130
350             109*     27.4    29.9    10.9     1.24   62.3    51.6   148
400             120*     30.6    31.4    11.7     1.35   40-52*  53.5   108*
450              80*     33.7    20.2*   12.3     1.45   40-56*  55.0   111*
Although the clock frequency of either the CPU or the 3D block seemed to be scaled down in some cases at higher V3D speeds (presumably due to temperature measurements or voltage readings resulting in throttling), there were never any signs of stability issues when overclocking the GPU, up to the maximum tested speed of 450 MHz.

Regular dynamic downclocking of the CPU can occur due to USB power supply/cable issues

Initially, downclocking by the Raspberry Pi 2 kernel's under-voltage monitor seemed to be triggered a lot more frequently than it is on the original Raspberry Pi. This results in a rainbow-colored icon being displayed in the top-right corner of the screen. This even happens briefly during boot. On such occasions, presumably the CPU and other components are downclocked in order to ensure stability.

The rainbow-colored square suggests a power supply issue since it indicates a voltage that is too low. As it turns out, replacing the USB power cable I was using with a shorter one that is better insulated eliminates the under-voltage warnings, with the same 5V/2A power supply.

Updated 1 March 2015 (update explanation for CPU speed throttling).
Updated 25 March 2015 (update with USB power cable findings).

Sunday, February 15, 2015

Optimizing performance on the Raspberry Pi and Raspberry Pi 2

The Raspberry Pi is a popular platform that usually runs from a flash-based SD card as the root filesystem. Most of the tips from the previous article apply to the Raspberry Pi. They work best on the common 512 MB memory variant of Raspberry Pi Model B, rather than the early 256 MB version.

Raspberry Pi's standard Raspbian OS configuration by default has an extensive set of logging options enabled for rsyslog, and the file system is configured in ordered data mode. While such a configuration might be understandable from the viewpoint of system stability and error reporting, it is not beneficial for performance, to say the least, making flash write access a bottleneck for overall system performance.

Reducing logging activity, using tmpfs and using ramdisk for cache directories

The Raspberry Pi's rsyslog configuration is stored in /etc/rsyslog.conf. Commenting out the many logging rules by prefixing them with a '#' will eliminate logging activity.

The following lines added to /etc/fstab cause /tmp and /var/tmp to be stored in ramdisks:
    tmpfs    /tmp       tmpfs    defaults    0 0
    tmpfs    /var/tmp   tmpfs    defaults    0 0
To move cache directories to ramdisk, create the file /etc/profile.d/ with the following content:
    export XDG_CACHE_HOME="/dev/shm/.cache"

Optimizing the file system

The Raspberry Pi does not accept the journal_async_commit mount option. However, write-back mode can be enabled and barriers can be disabled. Note that the risk of file system corruption with these settings is greater for the Raspberry Pi than for battery-powered devices.

It is best to store the performance options as default mount options in the filesystem itself using tune2fs. Data write-back mode (as opposed to the default ordered data mode) and no barriers are configured with the following command, assuming the root file system is stored in the /dev/mmcblk0p2 partition as it is on Raspbian:
    sudo tune2fs -o journal_data_writeback,nobarrier /dev/mmcblk0p2
When an error is detected during mounting at boot (including an option that is not accepted), the root file system will be mounted read-only, so that configuration files cannot be changed. It is however possible to remount the filesystem in read-write mode using the following command:
    sudo mount -o remount,rw /dev/mmcblk0p2 / 

Disabling X windows system error logging

Edit the file /etc/X11/Xsession and change the relevant lines (somewhere in the middle of the file) to look like this. The log-generating line is commented out and replaced with one that routes messages to /dev/null:
    #exec >> "$ERRFILE" 2>&1
    exec >> /dev/null 2>&1

Increasing the size of the console font

The Raspberry Pi comes configured with a relatively small font in the text console. However, it is easy to configure a larger font that greatly improves readability by editing /etc/default/console-setup to include, for example, the following two lines:
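
A plausible pair of lines is sketched here. The 16x32 size comes from the text, while the Terminus face name is an assumption based on the fonts shipped with Debian's console-setup package. The file uses shell variable syntax.

```shell
# Assumed contents for /etc/default/console-setup.
# FONTSIZE matches the 16x32 font discussed here; the FONTFACE value is an
# assumption and may need to be changed to a face installed on the system.
FONTFACE="Terminus"
FONTSIZE="16x32"
```

The new font should take effect after a reboot, or after running setupcon from a text console.
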
This configures a highly readable 16x32 font. This amounts to 80x32 characters on a 1280x1024 monitor. On a 1920x1080 monitor, this amounts to 120x33 characters with 24 scanlines unused.
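
The character counts follow from integer division of the screen size by the font size. A small sketch of the arithmetic, with a hypothetical function name; for a 1920x1080 display it yields 120x33 characters with 24 scanlines left over.

```shell
# Hypothetical helper: prints how many text columns and rows fit on a display
# of WIDTH x HEIGHT pixels with a font of FONT_W x FONT_H pixels, plus the
# number of scanlines left over at the bottom.
# Usage: console_size WIDTH HEIGHT FONT_W FONT_H
console_size() {
    cols=$(($1 / $3))
    rows=$(($2 / $4))
    unused=$(($2 - rows * $4))
    echo "${cols}x${rows} characters, ${unused} scanlines unused"
}

console_size 1280 1024 16 32   # prints "80x32 characters, 0 scanlines unused"
console_size 1920 1080 16 32   # prints "120x33 characters, 24 scanlines unused"
```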

Editing nano context highlighting configuration

The convenient nano text editor comes configured with context-sensitive color highlighting by default. It operates using rules read from a configuration file, which depends on the file extension of the file being edited. Some of these rules are very complex and much too slow for a device such as the Raspberry Pi, resulting in sluggish editing. For example, the rules for C/C++ source files in /usr/share/nano/c.nanorc include a rule labelled with the comment "This string is VERY resource intensive!". Commenting out this rule by putting a '#' in front of it greatly improves the experience of editing C/C++ source files. Similar resource-intensive rules exist in the configuration for assembler files in /usr/share/nano/asm.nanorc.


Overclocking options can be set using the raspi-config configuration utility. Overclocking is often stable on the device. I had no problems with the highest "Turbo" setting, which overclocks the CPU to 1000 MHz and the RAM to 600 MHz. The "core frequency" is also doubled from 250 MHz to 500 MHz. This frequency is tied to the L2 cache used by the GPU (it does appear to cause a large increase in GPU performance).

On the Raspberry Pi 2, the default clock speed is 900 MHz, while the "core frequency" (which has a different meaning than it has for the original Raspberry Pi) is a conservative 250 MHz. The main overclocking option intended for the Raspberry Pi 2 clocks the CPU cores at 1000 MHz, RAM at 500 MHz, and doubles the "core frequency" to 500 MHz. This has a positive effect on performance (especially memory performance), suggesting that the core clock is indeed correlated with CPU cache speed on the Raspberry Pi 2, which has a separate L2 cache for the CPU. However, this setting turned out to be not quite 100% stable, with the culprit being an SDRAM speed that is slightly too high.

The amount of memory allocated to the GPU (memory split) should be set to something like 128 MB if you want to be able to run a wider range of OpenGL ES 2.0 applications, assuming a 512 MB Raspberry Pi.


The performance increase from overclocking the CPU and RAM is measurable in low-level benchmarks. The following benchmarks were run with framebuffer_depth=32 in /boot/config.txt and a monitor resolution of 1280x1024.

Running the benchmark program from the "fastarm" repository as follows shows a significant performance increase from the default settings:
    ./benchmark --memset a --test 4
    ./benchmark --memcpy a --test 43
    ./benchmark --memcpy a --test 0
The memset benchmark (which reflects sequential DRAM write access) shows a performance increase from 1220 MB/s to 1929 MB/s (58%). The memcpy benchmark (which reflects copying a sequential memory region of 4 KB) shows an increase from 320 MB/s to 432 MB/s (35%). The third benchmark (which reflects smaller memory copies of varying size) is also dependent on CPU performance. It shows an impressive increase from 206 MB/s to 353 MB/s (71%).
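
The percentages quoted above are plain relative increases. A minimal sketch for recomputing them, with a hypothetical function name:

```shell
# Hypothetical helper: prints the percentage increase from OLD to NEW,
# rounded to the nearest whole percent.
# Usage: speedup_percent OLD NEW
speedup_percent() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.0f\n", (new / old - 1) * 100 }'
}

speedup_percent 1220 1929   # memset: prints 58
speedup_percent 320 432     # 4K memcpy: prints 35
speedup_percent 206 353     # memcpy of varying size: prints 71
```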

On the Raspberry Pi 2, the memset benchmark only scores 1090 MB/s with default settings. This seems to be a bandwidth bottleneck related to CPU and L2 cache speed limitations, since other memset variants (including one using NEON SIMD instructions) show the same result. Multi-threaded benchmarks may potentially show higher throughput. However, when using the overclocking option, which significantly increases the clock speed of the L2 CPU cache, memset performance increases to 1730 MB/s, in line with the Raspberry Pi.

The second benchmark (4K memcpy) reports 965 MB/s, which is significantly faster than the original Raspberry Pi. This seems to imply that write bandwidth provides the bottleneck. When overclocking, the result further increases to 1380 MB/s. The third benchmark (memcpy of regions of varying size) reports 673 MB/s, significantly higher than the original Raspberry Pi, which increases to 785 MB/s when overclocking.

The low-level benchx program can be used to measure pixel performance in the X server. A command line similar to the one below was used:
    ./benchx --window All
Overclocking shows the following improvements (measurements in MBytes/s) on the Raspberry Pi:
    Test                          Standard     Overclock (Turbo)   Speed-up
    ScreenCopy (33x33)               84.7           150               77%
    ScreenCopy (549x549)            184             277               51%
    FillRect (33x33)                842            1282               52%
    FillRect (549x549)             1161            1835               58%
    PutImage (549x549)               45.1            72.8             74%
    ShmPutImage (549x549)           229             365               59%
    ShmPutImageFullWidth (109x109)  298             534               79%
    ShmPutImageFullWidth (549x549)  233             364               56%
The SRE real-time rendering library allows us to measure 3D GPU performance. The following benchmarks were run:
    ./sre-demo --benchmark --multi-pass --multiple-lights --shadow-volumes demo2
    ./sre-demo --benchmark demo2
    ./sre-demo --benchmark demo9
The first benchmark, a complex 3D benchmark with multiple lights, multi-pass rendering and shadows, shows an increase from 1.118 fps to 1.325 fps (19%) when overclocking. The second benchmark uses only a single light with single-pass rendering and no shadows and shows an impressive increase from 13.2 fps to 27.2 fps (106%). The third benchmark, a lighter benchmark which uses a lot of alpha blending, increased from 30.2 fps to 48.0 fps (59%).

On the Raspberry Pi 2, the first benchmark reports an even slower framerate of 0.91 fps, while the second benchmark scores 20.5 fps, and the third benchmark scores 32.5 fps. Surprisingly, these results do remain about the same when overclocking. This is probably because on the Raspberry Pi 2, the core_freq setting does not directly affect the GPU.

File system benchmarks

Testing low-level file system performance using flash-bench gives an indication of flash storage performance. A high-speed UHS 1 MicroSD card was used (using an SD card adapter). Measurements in MB/s.

                                          Seq.    Seq.    Random  Random
                                          Read    Write   Read    Write
    PC (using USB SD card adapter)        18.1    15.7     3.55   15.3
    Raspberry Pi, optimized fs options    17.2    13.3     4.35    0.96
    Raspberry Pi (turbo overclock)        17.2    14.4     4.83    0.86
    Raspberry Pi 2                        17.4    14.5     5.09    1.47
    Raspberry Pi 2 (same card via USB)    16.2    10.9     4.38    1.19
    Raspberry Pi 2 (different card)       17.7     9.5     4.46    1.31
    Raspberry Pi 2 (same card, overclock) 17.7     9.7     4.52    1.56
    Raspberry Pi 2 eMMC via USB           17.8    14.4     4.38    6.80
The results show very slow random write performance, despite the use of the journal_data_writeback and nobarrier options, when compared to testing on other devices. Whether this is reflected in actual real-world performance is unclear.

The last entry is for a high-performance eMMC flash module fitted on an eMMC-to-MicroSD adapter, which was in turn inserted in a USB MicroSD reader. This shows that higher random write performance is possible given high-performance flash memory. Although I have not tested this set-up with the Raspberry Pi 2's internal MicroSD card slot, it would probably deliver similar or better performance.

Updated February 25, 2015.