Best Android Phones: Q3 2016

After the usual summer respite, our Q3 2016 Best Android Smartphones guide arrives in the middle of the fall frenzy, which has already produced a number of new phones, including the Samsung Galaxy Note7, Apple iPhone 7 and 7 Plus, Honor 8, the new modular Moto Z family from Motorola, and even a new brand from Huawei—Nova—to name just a few. There are still several high-profile products to come, too, such as the LG V20, two new Nexus phones, and a new Mate phablet from Huawei. Several of the Chinese OEMs are also releasing some interesting phones at reasonable prices that will appeal to our international readers.

Keeping in mind that our guide only includes phones that we’ve reviewed, and that we do not have the bandwidth to review every phone that’s available, here are the Android phones we currently like.

Best Android Phablet: Samsung Galaxy Note5

Until Samsung sorts out the Galaxy Note7’s battery issue, the Note5 remains our top phablet choice. Its 5.7-inch 2560x1440 SAMOLED display is still one of the best available, with excellent black levels, reasonably good brightness, and several different display modes ranging from a very accurate sRGB mode to a couple of wider gamut modes with more vivid colors. Its 16MP rear camera with PDAF and OIS is also one of the best we’ve tested. The Note7’s camera focuses and snaps photos more quickly, and produces better images in low-light scenes, but the Note5’s camera still has the edge in daytime image quality.

The Note5’s Exynos 7420 SoC was the first to use Samsung’s 14nm LPE FinFET process, and its four ARM Cortex-A57 CPU cores running at up to 2.1GHz and four Cortex-A53 cores running at up to 1.5GHz still deliver quick performance. Its 4GB of LPDDR4 RAM gives Samsung’s memory-hungry TouchWiz UI some extra room to work.

The piece of hardware that really makes Samsung’s Note line unique is the S-Pen. Being able to jot down notes, sketch pictures, sign documents, annotate screenshots, and select and manipulate text with the active stylus makes the Note5 a good choice for people who use their phone for work as well as communication and killing time. Just be sure not to insert it into its silo backwards, or you’ll have to break the phone to get it back out.

People have a love/hate relationship with TouchWiz, and while some questionable design elements remain and it suffers from some performance hiccups, it does include some useful phablet features, including the ability to shrink the whole screen by pressing the home button three times, the option to use a smaller keyboard for one-handed thumb typing, and the two-pane Multi Window feature that allows you to work in two apps at the same time. The Note5 should receive an update to Android 7 at some point in the future, but Samsung has not set an exact date.

Best High-End Android Smartphones: Galaxy S7 and HTC 10

Earlier this year, Samsung released its seventh generation Galaxy S series. The Galaxy S7 improves upon the design and features of the popular Galaxy S6. The design is very similar, but Samsung has tweaked the curvature of the back, edges, and cover glass to make the phone significantly more ergonomic. The chassis does get thicker and heavier, but this allows for a significant reduction to the camera hump and an increase in battery capacity.

As far as specs go, the Galaxy S7 comes in two versions. Both have 5.1-inch 2560x1440 SAMOLED displays, 32GB or 64GB of UFS 2.0 NAND, 4GB of LPDDR4 memory, a 12MP Sony IMX260 camera with a f/1.7 aperture, and a 3000mAh battery. Depending on where you live you'll either get Qualcomm's Snapdragon 820 or Samsung's Exynos 8890 SoC, both of which use custom ARM CPU cores. More specifically, the US, Japan, and China versions receive Snapdragon 820, while the rest of the world gets Exynos 8890.

Regardless of which Galaxy S7 you get, you'll be getting the best hardware that Samsung has to offer. The Galaxy S6 was a good phone, but it was not perfect. The S7 addresses several of its predecessor's shortcomings with a more ergonomic design, a larger battery, support for microSD cards, and the return of IP68 dust and water protection.

The other phone worth discussing at the high end is the HTC 10, which manages to best the Galaxy S7 in at least a few areas. In terms of audio quality, design, OEM UI, and other areas like perceptual latency, I would argue that HTC is clearly ahead of Samsung. HTC also has proper USB 3.1 and USB-C support, which makes the device more future-proof than the Galaxy S7 with its microUSB connector. The front-facing camera is also clearly better, on the basis of having OIS and optics that can actually focus on a subject instead of being set to infinity at all times.

However, Samsung is clearly ahead in display quality, and the Galaxy S7's camera is the fastest I’ve ever seen in any phone, bar none. Samsung is also shipping better WiFi implementations right now in terms of antenna sensitivity and software, along with IP68 water resistance and magstripe payments for the US and South Korea.

To further muddy the waters, there are areas where HTC and Samsung trade blows. While Samsung’s camera is clearly faster, HTC often captures better detail, especially at the center of the frame, while the Galaxy S7 retains better detail at the edges. The HTC 10's noise reduction tends to be a bit less heavy-handed, and its sharpening artifacts aren’t nearly as strong as the Galaxy S7’s. HTC’s larger sensor also means that it’s possible to get actual DSLR-like bokeh with macro shots, something I’ve never seen from a smartphone camera before.

Overall, I think it’s fair to say that the HTC 10 is a solid choice. If I had to pick between the two I would probably lean towards the HTC 10, but this is based upon personal priorities; I don’t think you can really go wrong with either. The HTC 10 is currently 699 USD when bought unlocked through HTC in Carbon Gray or Glacial Silver with 32GB of internal storage, which is a bit more than the Galaxy S7. But considering that smartphones are often used for 2-3 years now, I don’t think 50 dollars should be a major point for or against a phone.

Best Mid-Range Android Smartphone: OnePlus 3

The OnePlus 3, with its list of impressive hardware at a reasonable price, is still our (upper) mid-range choice. The Motorola Moto Z Play Droid is about the same price and includes a nice display, a good camera, and a large battery—not to mention support for Moto Mods such as the Hasselblad True Zoom Mod—but its eight Cortex-A53 CPU cores and Adreno 506 GPU cannot offer the same level of performance as the OnePlus 3’s Snapdragon 820 SoC. The Moto Z Play Droid also comes with less RAM (3GB), less internal storage (32GB), and lacks 802.11ac Wi-Fi. Its little brother, the Moto G4 Plus, costs less than the OnePlus 3—$299 for 4GB of RAM and 64GB of internal NAND—but again falls short of the OnePlus 3’s overall user experience.

Huawei’s Honor 8 is another contender that costs the same as the OnePlus 3 and is available in the US and internationally. We’re not far enough into our review to give it a thumbs up or thumbs down, but it’s a nice looking phone with decent specs. It also has a smaller 5.2-inch display, giving it a smaller footprint than the OnePlus 3.

When we first looked at the OnePlus 3, Brandon discovered that the display’s grayscale and color accuracy were quite poor, its video quality was subpar, and it evicted apps from RAM too aggressively, especially considering that it comes with 6GB of LPDDR4; however, in subsequent software updates OnePlus has either fixed or improved each of these issues.

The build quality of the OnePlus 3 is excellent, its 16MP rear camera with PDAF and OIS takes nice photos, and its Snapdragon 820 SoC delivers good performance. It also includes 64GB of internal UFS 2.0 NAND storage but no microSD slot, and the usual array of wireless connectivity options including NFC—something the OnePlus 2 lacked. The OnePlus 3 comes in only one configuration and costs $399.

Best Budget Android Smartphones: Huawei Honor 5X (US) and Xiaomi Redmi Note 3 Pro

While the rest of the planet is awash with lower-cost phones containing decent hardware, it’s difficult to recommend a budget smartphone for the US market. Take the Xiaomi Redmi Note 3 Pro, for example. Its Snapdragon 650 SoC contains two high-performance Cortex-A72 CPU cores running at up to 1.8GHz and four Cortex-A53 cores at up to 1.4GHz, which easily outperforms the standard octa-core A53 SoCs common at this price point. Its performance is really quite remarkable, rivaling some upper mid-range and flagship devices. The Adreno 510 GPU supports the latest graphics APIs, including support for tessellation, and is fast enough to play most games currently available. Battery life is excellent too, thanks in part to a large 4050 mAh battery. There’s even an infrared blaster and support for 802.11ac Wi-Fi and FM radio.

Of course some sacrifices need to be made to reach such a low price. The Redmi Note 3 Pro’s weakest component is its 5.5-inch 1080p IPS display, whose poor black level and inaccurate white point and gamma calibration hurt image quality. The panel’s backlight does not fully cover the sRGB gamut, which further reduces color accuracy. While its display is not perfect, the phone as a whole clearly moves the bar higher in this segment and raises our expectations for future lower-cost phones.

Unfortunately, the Redmi Note 3 Pro, like most phones made by Chinese OEMs, is not sold in the US and does not support the LTE frequencies used by US carriers. Instead, US consumers must choose from a number of underwhelming phones such as the LG X Power and its Snapdragon 212 SoC that uses four Cortex-A7 CPU cores—not even A53s—and 1.5GB of RAM. The Huawei Honor 5X cannot match the Redmi Note 3 Pro’s performance or photo quality, but it remains a solid option for the US despite being almost a year old. Even the recently released Moto G4 and G4 Play really do not bring anything new. The Honor 5X recently received a long-awaited update to Android 6.0 and EMUI 4.0 and is still available for about $200.

from AnandTech
via anandtech

NVIDIA Teases Xavier, a High-Performance ARM SoC for Drive PX & AI

Ever since NVIDIA bowed out of the highly competitive (and high pressure) market for mobile ARM SoCs, there has been quite a bit of speculation over what would happen with NVIDIA’s SoC business. With the company enjoying a good degree of success with projects like the Drive system and Jetson, signs have pointed towards NVIDIA continuing their SoC efforts. But in what direction they would go remained a mystery, as the public roadmap ended with the current-generation Parker SoC. However we finally have an answer to that, and the answer is Xavier.

At NVIDIA’s GTC Europe 2016 conference this morning, the company has teased just a bit of information on the next generation Tegra SoC, which the company is calling Xavier (ed: in keeping with comic book codenames, this is Professor Xavier of the X-Men). Details on the chip are light – the chip won’t even sample until over a year from now – but NVIDIA has laid out just enough information to make it clear that the Tegra group has left mobile behind for good, and now the company is focused on high performance SoCs for cars and other devices further up the power/performance spectrum.

                       Xavier                   Parker                    Erista (Tegra X1)
CPU                    8x NVIDIA Custom ARM     2x NVIDIA Denver +        4x ARM Cortex-A57 +
                                                4x ARM Cortex-A57         4x ARM Cortex-A53
GPU                    Volta, 512 CUDA Cores    Pascal, 256 CUDA Cores    Maxwell, 256 CUDA Cores
Memory                 ?                        LPDDR4, 128-bit Bus       LPDDR3, 64-bit Bus
Video Processing       7680x4320                3840x2160p60 Decode       3840x2160p60 Decode
                       Encode & Decode          3840x2160p60 Encode       3840x2160p30 Encode
Transistors            7B                       ?                         ?
Manufacturing Process  TSMC 16nm FinFET+        TSMC 16nm FinFET+         TSMC 20nm Planar

So what’s Xavier? In a nutshell, it’s the next generation of Tegra, done bigger and badder. NVIDIA is essentially aiming to capture much of the complete Drive PX2 system’s computational power (2x SoC + 2x dGPU) on a single SoC. This SoC will have 7 billion transistors – about as many as a GP104 GPU – and will be built on TSMC’s 16nm FinFET+ process.

Under the hood, NVIDIA has revealed just a bit of information about what to expect. The CPU will be composed of 8 custom ARM cores. The name “Denver” wasn’t used in this presentation, so at this point it’s anyone’s guess whether this is Denver 3 or another new design altogether. Meanwhile on the GPU side, we’ll be looking at a Volta-generation design with 512 CUDA cores. Unfortunately we don’t know anything substantial about Volta at this time; the architecture was pushed further down NVIDIA’s previous roadmaps to make room for Pascal, and as Pascal just launched in the last few months, NVIDIA hasn’t said anything further about Volta.

Meanwhile NVIDIA’s performance expectations for Xavier are significant. As mentioned before, the company wants to condense much of Drive PX2 into a single chip.  With Xavier, NVIDIA wants to get to 20 Deep Learning Tera-Ops (DL TOPS), which is a metric for measuring 8-bit Integer operations. 20 DL TOPS happens to be what Drive PX2 can hit, and about 43% of what NVIDIA’s flagship Tesla P40 can offer in a 250W card. And perhaps more surprising still, NVIDIA wants to do this all at 20W, or 1 DL TOPS-per-watt, which is one-quarter of the power consumption of Drive PX 2, a lofty goal given that this is based on the same 16nm process as Pascal and all of the Drive PX2’s various processors.
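As a quick sanity check of the quoted numbers (the Tesla P40's commonly cited 47 INT8 TOPS rating is an assumption on our part, and everything here is an announced target, not a measurement):

```python
# All figures are NVIDIA's announced targets, not measurements.
xavier_dl_tops = 20    # Deep Learning Tera-Ops (8-bit integer)
xavier_watts = 20      # claimed power envelope
tesla_p40_tops = 47    # commonly quoted INT8 rating for the 250W Tesla P40

print(xavier_dl_tops / xavier_watts)                 # 1.0 DL TOPS per watt
print(round(100 * xavier_dl_tops / tesla_p40_tops))  # ~43% of a Tesla P40
```

The numbers hang together: 20 TOPS in 20W is exactly the 1 DL TOPS-per-watt NVIDIA is claiming.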

NVIDIA’s envisioned application for Xavier, as you might expect, is focused on further ramping up their automotive business. They are pitching Xavier as an “AI Supercomputer” in relation to its planned high INT8 performance, which in turn is a key component of fast neural network inferencing. What NVIDIA is essentially proposing then is a beast of an inference processor, one that unlike their Tesla discrete GPUs can function on a stand-alone basis. Coupled with this will be some new computer vision hardware to feed Xavier, including a pair of 8K video processors and what NVIDIA is calling a “new computer vision accelerator.”

Wrapping things up, as we mentioned before, Xavier is a far-future product for NVIDIA. While the company is teasing it today, the SoC won’t begin sampling until Q4 of 2017, and that in turn implies that volume shipments won’t come until 2018. That said, with their new focus on the automotive market, NVIDIA has shifted from an industry of agile competitors and cut-throat competition to one where their customers would like as much of a heads-up as possible. So these kinds of early announcements are likely to become par for the course for NVIDIA.


GTC Europe 2016: NVIDIA Keynote Live Blog with CEO Jen-Hsun Huang

I'm here at the first GTC Europe event, ready to go for the Keynote talk hosted by CEO Jen-Hsun Huang.


Xiaomi Mi 5s and Mi 5s Plus Announced

Xiaomi announced two new flagship smartphones today. The Mi 5s and Mi 5s Plus are updates to the Mi 5 and Mi 5 Pro phones that were announced at MWC 2016 in February, and pack some new hardware inside a new brushed-aluminum chassis.

Both the Mi 5s and Mi 5s Plus use Qualcomm’s Snapdragon 821 SoC, which itself is an updated version of the popular Snapdragon 820 that’s inside the Mi 5 phones and many of the other flagship phones we’ve seen this year. With Snapdragon 821, max frequencies increase to 2.34GHz for the two Kryo CPU cores in the performance cluster and 2.19GHz for the two Kryo cores in the efficiency cluster. Complementing the quad-core CPU is Qualcomm’s Adreno 530 GPU, which also sees a small 5% increase in peak frequency to 653MHz. While it’s unclear if the 821 includes any changes to its micro-architecture, Qualcomm has likely done some layout optimization, as it’s quoting a 5% increase in power efficiency. The Mi 5s and Mi 5s Plus still pair the SoC with LPDDR4 RAM and UFS 2.0 NAND like their predecessors.
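The quoted frequency bumps are easy to sanity-check against the Snapdragon 820 clocks from the spec sheets (all values are quoted specs, not measurements):

```python
# Snapdragon 820 (top Mi 5 bin) vs Snapdragon 821 peak clocks, in MHz
adreno_820, adreno_821 = 624, 653   # Adreno 530 GPU
kryo_820, kryo_821 = 2150, 2340     # performance-cluster Kryo cores

print(round((adreno_821 / adreno_820 - 1) * 100, 1))  # 4.6 -> the "small 5%" GPU bump
print(round((kryo_821 / kryo_820 - 1) * 100, 1))      # 8.8% on the CPU side
```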

Xiaomi Mi 5 Series

SoC:
  Mi 5 (Mi 5 Pro): Snapdragon 820, 2x Kryo @ 1.80/2.15GHz + 2x Kryo @ 1.36/1.59GHz, Adreno 530 @ 624MHz
  Mi 5s / Mi 5s Plus: Snapdragon 821 (MSM8996 Pro), 2x Kryo @ 2.34GHz + 2x Kryo @ 2.19GHz, Adreno 530 @ 653MHz

NAND:
  Mi 5 (Mi 5 Pro): 32GB / 64GB / (128GB) UFS 2.0
  Mi 5s / Mi 5s Plus: 64GB / 128GB UFS 2.0

Display:
  Mi 5 (Mi 5 Pro): 5.15-inch 1920x1080 IPS LCD
  Mi 5s: 5.15-inch 1920x1080 IPS LCD
  Mi 5s Plus: 5.7-inch 1920x1080 IPS LCD

Dimensions:
  Mi 5 (Mi 5 Pro): 144.55 x 69.23 x 7.25 mm, 129 / (139) grams
  Mi 5s / Mi 5s Plus: ?

Modem (all models): Qualcomm X12 LTE (integrated), 2G / 3G / 4G LTE (Category 12/13)

Front Camera (all models): 4MP, 2.0μm pixels, f/2.0

Rear Camera:
  Mi 5 (Mi 5 Pro): 16MP, 1/2.8" Sony IMX298 Exmor RS, 1.12µm pixels, f/2.0, PDAF, 4-axis OIS, Auto HDR, dual-tone LED flash
  Mi 5s: 12MP, 1/2.3" Sony IMX378 Exmor RS, 1.55µm pixels, f/2.0, PDAF, Auto HDR, dual-tone LED flash
  Mi 5s Plus: 2x 13MP (color + monochrome)

Battery:
  Mi 5 (Mi 5 Pro): 3000 mAh (11.55 Wh)
  Mi 5s: 3200 mAh
  Mi 5s Plus: 3800 mAh

Connectivity:
  Mi 5 (Mi 5 Pro): 802.11a/b/g/n/ac, BT 4.2, NFC, GPS/GNSS, USB 2.0 Type-C
  Mi 5s / Mi 5s Plus: 802.11a/b/g/n/ac 2x2 MU-MIMO, BT 4.2, NFC, GPS/GNSS, USB 2.0 Type-C

Launch OS:
  Mi 5 (Mi 5 Pro): Android 6.0 with MIUI 7
  Mi 5s / Mi 5s Plus: Android 6.0 with MIUI 8

Launch Price (No Contract):
  Mi 5 (Mi 5 Pro): 3GB / 32GB / 1.80GHz ¥1999; 3GB / 64GB / 2.15GHz ¥2299; (4GB / 128GB / 2.15GHz) ¥2699
  Mi 5s: 3GB / 64GB ¥1999; 4GB / 128GB ¥2299
  Mi 5s Plus: 4GB / 64GB ¥2299; 6GB / 128GB ¥2599
Note: We're still trying to confirm the Mi 5s and Mi 5s Plus specifications with Xiaomi.

The Mi 5s still comes with a 5.15-inch 1080p IPS LCD. This is an extended color gamut panel that will display exceptionally vivid, but inaccurate, colors. Xiaomi claims the display will reach a peak brightness of 600 nits, which it achieves by increasing the number of LEDs in the backlight assembly from the 12 to 14 typical of edge-lit IPS displays to 16, a feature also shared with the Mi 5. This improves power efficiency by 17%, according to Xiaomi, presumably by driving more LEDs at lower individual output levels. The Mi 5s Plus has a larger 5.7-inch 1080p IPS display with a pixel density of 386ppi, which is still decent for an LCD.

Xiaomi Mi 5s

While the front camera still uses a 4MP sensor with large 2.0μm pixels, both new phones receive new rear cameras. The Mi 5s looks to improve low-light performance by using a larger format Sony IMX378 Exmor RS sensor that features 1.55µm pixels; however, image resolution drops to 12MP, the same as Samsung’s Galaxy S7 and Apple’s iPhone 7. The Mi 5s Plus has the more interesting camera setup, employing dual 13MP sensors. Similar to Huawei’s P9 and Honor 8, the Mi 5s Plus uses one sensor for capturing color images and the other sensor for capturing black and white images. The black and white camera lacks an RGB Bayer filter, allowing it to capture more light than a color camera. By combining the output of both sensors, the Mi 5s Plus can theoretically capture brighter images with higher contrast and less noise. The P9 and Honor 8 also use the second camera for measuring depth, aiding camera focusing and allowing the user to adjust bokeh effects after the image is captured, but it’s not clear if the Mi 5s Plus also has these capabilities.
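To make the color-plus-monochrome idea concrete, here is a toy numpy sketch of one simple fusion approach: keep the color sensor's chroma but take luminance from the cleaner, brighter mono sensor. This is an illustration of the principle only, not Xiaomi's (or Huawei's) actual pipeline; the function name and the Rec. 601 luma weights are our own choices.

```python
import numpy as np

def fuse_color_mono(color_rgb, mono):
    """Replace the color frame's luma with the mono frame's luma.
    color_rgb: HxWx3 array in [0, 1]; mono: HxW array in [0, 1]."""
    # Approximate luma of the color image (Rec. 601 weights)
    luma = (0.299 * color_rgb[..., 0] +
            0.587 * color_rgb[..., 1] +
            0.114 * color_rgb[..., 2])
    # Scale all three channels so the fused luma matches the mono image,
    # keeping the color sensor's chroma ratios intact
    ratio = mono / np.clip(luma, 1e-6, None)
    return np.clip(color_rgb * ratio[..., None], 0.0, 1.0)

# A dim color frame fused with a brighter, cleaner mono frame gets brighter
color = np.full((2, 2, 3), 0.25)
mono = np.full((2, 2), 0.5)
print(fuse_color_mono(color, mono)[0, 0])  # [0.5 0.5 0.5]
```

Real pipelines also have to register the two sensors' slightly different viewpoints before fusing, which is where the depth information mentioned above comes from.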

Xiaomi Mi 5s Plus

The other big change is a completely new chassis made entirely from brushed aluminum. The back edges are still curved, but there’s no longer any glass or ceramic on the back, as found on the Mi 5 and Mi 5 Pro, respectively. The change to aluminum means the Mi 5s now includes plastic antenna lines on the top and bottom of the back panel. The Mi 5s Plus goes a different route by using plastic inserts at the top and bottom that try to blend in by mimicking the color and texture of the surrounding aluminum.

The Mi 5s Plus includes a circular, capacitive fingerprint sensor on the back that’s slightly recessed, making it easier to locate. The Mi 5s goes the less conventional route with an ultrasonic fingerprint sensor that sits below the edge-to-edge cover glass on the front. Both phones use capacitive buttons rather than onscreen navigation controls and 2.5D cover glass that blends into a chamfered edge on the aluminum frame.

Both phones come in four different colors—silver, gray, gold, and pink—and will be available for sale in China starting September 29.


Razer Updates The DeathAdder Elite Gaming Mouse

Although Razer has become one of the most well-known gaming computer companies, they got their start with gaming mice, and today Razer is launching the next iteration of the best-selling gaming mouse of all time: the Razer DeathAdder Elite. The DeathAdder series was first introduced in 2006.

As an iterative update, this could have amounted to just some new lights or whatnot, but the DeathAdder Elite brings a new Razer 5G optical sensor, rated for up to 16,000 DPI, which is the highest yet. It can also track at 450 inches per second, yet another new standard, and supports up to 50 g of acceleration. Razer also says the DeathAdder Elite has the highest measured resolution accuracy in a gaming mouse at 99.4 percent. If high speed and precision are required, this mouse appears to have that sewn up.

The more interesting bit though is that Razer has also upped their game on the switches. Razer has co-designed and produced new mechanical switches with Omron, which are “optimized for the fastest response times” and more importantly to me, an increased durability rating of 50 million clicks.

Razer has also included an improved tactile scroll wheel design. I’ve used the DeathAdder in the past, and one of the things that made me abandon it was the scroll wheel, which gave plenty of grip, but would actually wear through the skin on my finger due to the sharp nubs on the wheel. Hopefully the new version is improved in this regard. For fast gaming, the extra grip is likely a nice bonus, but for everyday use I found it uncomfortable.

The overall design hasn’t changed, which is a good thing, since it was a pretty comfortable and ergonomic gaming mouse. It also keeps the Razer Chroma RGB LED lighting system as well, so you can customize away. The mouse has seven programmable buttons, 1000 Hz polling, and a 2.1 m / 7 ft braided USB cable. It weighs in at 105 grams.

The mouse is available for pre-order starting today for $69.99 USD, with worldwide shipments starting in October.

Source: Razer


New ARM IP Launched: CMN-600 Interconnect for 128 Cores and DMC-620, an 8Ch DDR4 IMC

You need much more than a good CPU core to conquer the server world. As more cores are added, the way data moves from one part of the silicon to another gets more important. ARM has today announced a new and faster member of its SoC interconnect IP offerings in the form of the CMN-600 (CMN stands for 'coherent mesh network', as opposed to the 'cache coherent network' of the CCN line). This is a direct update to the CCN-500 series, which we've discussed at AnandTech before.

The idea behind a coherent mesh between cores, as it stands in the ARM server SoC space, is that you can put a number of CPU clusters (e.g. four lots of 4x A53) and accelerators (custom or other IP) into one piece of silicon. Each part of the SoC has to work with everything else, and for that ARM offers a variety of interconnect licences for users who want to choose from ARM's IP range. For ARM licensees who pick multiple ARM parts, this makes it easier to combine high core counts and accelerators in one large SoC.

The previous-generation interconnect, the CCN-512, could support 12 clusters of 4 cores and maintain coherency, allowing for large 48-core chips. The new CMN-600 can support up to 128 cores (32 clusters of 4). Also part of the announcement is an 'agile system cache', which gives I/O devices a way to allocate memory and cache lines directly into the L3, reducing I/O latency without having to touch a core.

Also in the announcement is a new memory controller. The old DMC-520, which was limited to four channels of DDR3, is being superseded by the DMC-620 controller, which supports eight channels of DDR4. Each DMC-620 channel can address up to 1 TB of DDR4, giving a potential SoC memory capacity of 8 TB.

According to ARM's simulations, the improved memory controller offers 50% lower latency and up to 5 times more bandwidth. The new DMC is also advertised as supporting DDR4-3200. 3200 MT/s offers twice the bandwidth of 1600 MT/s, and doubling the channels doubles bandwidth again, which accounts for a 4x increase. It is interesting, then, that ARM claims 5x, which would suggest efficiency improvements as well.
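The bandwidth arithmetic is easy to verify, assuming the usual 64-bit-wide DDR channels and a DDR3-1600 baseline (both are our assumptions for illustration, not ARM's stated configuration):

```python
def ddr_bandwidth_gbs(mt_per_s, bus_bits, channels):
    # transfers/s x bytes per transfer per channel x channel count
    return mt_per_s * 1e6 * (bus_bits / 8) * channels / 1e9

old = ddr_bandwidth_gbs(1600, 64, 4)  # DMC-520 class: 4 channels of DDR3-1600
new = ddr_bandwidth_gbs(3200, 64, 8)  # DMC-620: 8 channels of DDR4-3200
print(old, new, new / old)            # 51.2 GB/s, 204.8 GB/s, 4.0x
```

That accounts for the 4x; the remaining gap to ARM's claimed 5x would have to come from efficiency gains rather than raw pin bandwidth.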

If you double the number of cores and memory controllers, you would expect twice as much performance in the almost perfectly scaling SPECint_rate2006. ARM claims that its simulations show 64 A72 cores running 2.5 times faster than 32 A72 cores, courtesy of the improved memory controller. If true, that is quite impressive; by comparison, we did not see such a jump in performance in the Xeon world when DDR3 was replaced by DDR4. Even more impressive is the claim that the maximum compute performance of a 64x A72 SoC can be up to six times that of a 16x A57 variant. But we must note that the A57 was not exactly a success in the server world: so far only AMD has cooked up a server SoC with it, and it was slower and more power hungry than the much older Atom C2000.

We have little doubt we will find the new CMN-600 and/or DMC-620 in many server solutions. The big question will be one of application: who will use this interconnect technology in their server SoCs? As most licensees do not disclose this information, it is hard to find out. As far as we know, Cavium uses its own interconnect technology, which would suggest Qualcomm or Avago/Broadcom are the most likely candidates. 


CEVA Launches Fifth-Generation Machine Learning Image and Vision DSP Solution: CEVA-XM6

Deep learning, neural networks, and image/vision processing already form a large field, but many of the applications that rely on them are still in their infancy. Automotive is the prime example that uses all of these areas, and solutions to the automotive 'problem' require significant understanding and development in both hardware and software. The ability to process data with high accuracy in real time opens up a number of doors for other machine learning codes, and all that comes afterwards is cost and power. The CEVA-XM4 DSP was aimed at being the first programmable DSP to support deep learning, and the new XM6 IP (along with its software ecosystem) is being launched today under the heading of stronger efficiency, more compute, and new patents covering power-saving features.

Playing the IP Game

When CEVA launched the XM4 DSP, with the ability to run inference on pre-trained networks in fixed-point math to within ~1% of the accuracy of the full algorithms, it won a number of awards from analysts in the field, claiming high performance and power efficiency over competing solutions and the initial progression for a software framework. The IP announcement was back in Q1 2015, with licensees coming on board over the next year and the first production silicon using the IP rolling off the line this year. Since then, CEVA has announced its CDNN2 platform, a one-button compilation tool that converts trained networks into suitable code for CEVA's XM IPs. The new XM6 integrates the previous XM4 features with improved configurations, new hardware accelerators and access to them, and it retains compatibility with the CDNN2 platform, such that code suitable for the XM4 can run on the XM6 with improved performance.

CEVA is in the IP business, like ARM, and works with semiconductor licensees that then sell to OEMs. This typically results in a long time-to-market, especially when industries such as security and automotive are moving at a rapid pace. CEVA is promoting the XM6 as a scalable, programmable DSP that can scale across markets with a single code base, while also using additional features to improve power, cost and performance.

The announcement today covers the new XM6 DSP, CEVA's new set of imaging and vision software libraries, a set of new hardware accelerators and integration into the CDNN2 ecosystem. CDNN2 is a one-button compilation tool, detecting convolution and applying the best methodology for data transfer over the logic blocks and accelerators.

XM6 will support OpenCL and C++ development tools, and the software elements include CEVA's computer vision, neural network and vision processing libraries with third-party tools as well. The hardware implements an AXI interconnect for the processing parts of the standard XM6 core to interact with the accelerators and memory. Along with the XM6 IP, there are hardware accelerators for convolution (CDNN assistance) allowing lower power fixed function hardware to cope with difficult parts of neural network systems such as GoogleNet, De-Warp for adjusting images taken on fish-eye or distorted lenses (once the distortion of an image is known, the math for the transform is fixed-function friendly), as well as other third party hardware accelerators.

The XM6 promotes two new specific hardware features that will aid the majority of image processing and machine learning algorithms. The first is scatter-gather, or the ability to read values from 32 addresses in L1 cache into vector registers in a single cycle. The CDNN2 compilation tool identifies serial code loading and implements vectorization to allow this feature, and scatter-gather improves data loading time when the data required is distributed through the memory structure. As the XM6 is configurable IP, the size/associativity of the L1 data store is adjustable at the silicon design level, and CEVA has stated that this feature will work with any size L1. The vector registers for processing at this level are 8-wide VLIW implementations, meaning 'feed the beast' is even more important than usual.
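The difference between the serial pattern and the gathered pattern can be sketched in a few lines of numpy. This is an analogy for the memory access pattern that CDNN2 vectorizes, not a model of the XM6 hardware itself:

```python
import numpy as np

data = np.arange(64, dtype=np.float32)        # stand-in for the L1 data store
idx = np.array([3, 17, 42, 8, 55, 0, 31, 9])  # scattered addresses

# Serial pattern: one load per element, one address at a time
serial = np.empty(len(idx), dtype=np.float32)
for i, a in enumerate(idx):
    serial[i] = data[a]

# Gathered pattern: one vectorized load from all the addresses at once,
# which is the access pattern the XM6's scatter-gather feature accelerates
gathered = data[idx]

print(np.array_equal(serial, gathered))  # True
```

The result is identical either way; what the hardware feature changes is how many cycles those loads cost when the addresses are scattered through memory.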

The second feature is called 'sliding-window' data processing, and this specific technique for vision processing has been patented by CEVA. There are many ways to process an image for either processing or intelligence, and typically an algorithm will use a block or tile of pixels at once to perform what it needs to. For the intelligence part, a number of these blocks will overlap, resulting in areas of the image being reused at different parts of the computation. CEVA's method is to retain that data, resulting in fewer bits being needed in the next step of analysis. If this sounds straightforward (I was doing something similar with 3D differential equation analysis back in 2009), it is, and I was surprised that it had not been implemented in vision/image processing before. Reusing old data (assuming you have somewhere to store it) saves time and saves energy.
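A one-dimensional sketch of the reuse principle, using a simple windowed sum (CEVA's patented scheme applies this to overlapping 2-D pixel blocks; the function here is purely illustrative):

```python
import numpy as np

def sliding_window_sums(row, w):
    """Sum every length-w window of `row`, reusing the overlap between
    consecutive windows instead of re-reading all w values each step."""
    sums = [float(np.sum(row[:w]))]  # only the first window is computed in full
    for i in range(1, len(row) - w + 1):
        # Reuse the previous sum: subtract the element that slid out of the
        # window and add the one that slid in
        sums.append(sums[-1] - float(row[i - 1]) + float(row[i + w - 1]))
    return sums

row = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(sliding_window_sums(row, 3))  # [6.0, 9.0, 12.0]
```

Each step after the first touches two values instead of w, which is exactly the kind of saved loads and saved energy described above.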

CEVA is claiming up to a 3x performance gain in heavy vector workloads for XM6 over XM4, with an average of 2x improvement for like-for-like ported kernels. The XM6 is also more configurable than the XM4 from a code perspective, offering '50% more control'.

With the specific CDNN hardware accelerator (HWA), CEVA cites that convolution layers in ecosystems such as GoogleNet consume the majority of cycles. The CDNN HWA takes this code and implements fixed hardware for it with 512 MACs using 16-bit support for up to an 8x performance gain (and 95% utilization). CEVA mentioned that a 12-bit implementation would save die area and cost for a minimal reduction in accuracy, however there are a number of developers requesting full 16-bit support for future projects, hence the choice.

The two big competitors for CEVA in this space, for automotive image/visual processing, are Mobileye and NVIDIA, with the latter promoting the TX1 for both training and inference for neural networks. Based on the TX1 on a TSMC 20nm planar process at 690 MHz, CEVA states that its internal simulations give a single XM6-based platform 25x the efficiency and 4x the speed on AlexNet and GoogleNet (with the XM6 also at 20nm, even though it will most likely be implemented at 16nm FinFET or 28nm). This would mean, extrapolating the single-batch TX1 data published, that the XM6 running AlexNet at FP16 can process 268 images a second compared to 67, at around 800 mW compared to 5.1W. At 16FF, this power number is likely to be significantly lower (CEVA told us that its internal metrics were initially done at 28nm/16FF, but were redone on 20nm for an apples-to-apples comparison with the TX1). It should be noted that the TX1 numbers were provided for multi-batch operation, which offers better efficiency than single batch; however, other comparison numbers were not provided. CEVA also implements power gating with a DVFS scheme that allows low-power modes when various parts of the DSP or accelerators are idle.
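Taking CEVA's quoted figures at face value (these are CEVA's simulation results, not our measurements), the throughput and efficiency multiples are internally consistent:

```python
# All figures are CEVA's quoted simulation results, not measurements.
tx1_fps, tx1_watts = 67, 5.1    # TX1, AlexNet FP16, single batch, 20nm
xm6_fps, xm6_watts = 268, 0.8   # XM6 normalized to the same 20nm node

speedup = xm6_fps / tx1_fps
efficiency = (xm6_fps / xm6_watts) / (tx1_fps / tx1_watts)
print(speedup)               # 4.0x the throughput
print(round(efficiency, 1))  # 25.5x the images-per-watt, in line with "25x"
```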

Obviously the advantages NVIDIA has with its solution are availability and the CUDA/OpenCL software ecosystem, both of which CEVA is attempting to address with push-button software platforms like CDNN2 and improved hardware such as the XM6. It will be interesting to see which semiconductor partners combine this image processing with machine learning in future implementations. CEVA states that smartphones, automotive, security, and commercial (drones, automation) applications are prime targets.

from AnandTech
via anandtech

AMD Announces Embedded Radeon E9260 & E9550 - Polaris for Embedded Markets

While it’s AMD’s consumer products that get the most fanfare with new GPU launches – and rightfully so – AMD and their Radeon brand also have a solid (if quiet) business in the discrete embedded market. Here, system designers utilize discrete video cards for commercial, all-in-one products. And while the technology is much the same as on the consumer side, the use cases differ, as do the support requirements. For that reason, AMD offers a separate lineup of products just for this market under the Radeon Embedded moniker.

Now that we’ve seen AMD’s new Polaris architecture launch in the consumer world, AMD is taking the next step by refreshing the Radeon Embedded product lineup to use these new parts. To that end, this morning AMD is announcing two new Radeon Embedded video cards: the E9260 and the E9550. Based on the Polaris 11 and Polaris 10 GPUs respectively, these parts are updating the “high performance” and “ultra-high performance” segments of AMD’s embedded offerings.

AMD Embedded Radeon Discrete Video Cards
                   Radeon E9550  Radeon E9260  Radeon E8950  Radeon E8870
Stream Processors  2304          896           2048          768
GPU Base Clock     1.12GHz       ?             750MHz        1000MHz
GPU Boost Clock    ~1.26GHz      ~1.4GHz       N/A           N/A
Memory Clock       7Gbps GDDR5   7Gbps GDDR5?  6Gbps GDDR5   6Gbps GDDR5
Memory Bus Width   256-bit       128-bit       256-bit       128-bit
Displays           6             5             6             6
TDP                Up To 95W     Up To 50W     95W           75W
GPU                Polaris 10    Polaris 11    Tonga         Bonaire
Architecture       GCN 4         GCN 4         GCN 1.2       GCN 1.1
Form Factor        MXM           MXM & PCIe    MXM           MXM & PCIe

We’ll start things off with the Embedded Radeon E9550, which is the new top-performance card in AMD’s embedded lineup. Based on AMD’s Polaris 10 GPU, this is essentially an embedded version of the consumer Radeon RX 480, offering the same number of SPs at roughly the same clockspeed. This part supersedes the last-generation E8950, which is based on AMD’s Tonga GPU, and is rated to offer around 93% better performance, thanks to the slightly wider GPU and generous clockspeed bump.

The E9550 is offered in a single design, an MXM Type-B card that’s rated for 95W. These embedded-class MXM cards are typically based on AMD’s mobile consumer designs, and while I don’t have proper photos for comparison – AMD’s supplied photos are stock photos of older products – I’m sure it’s the same story here. Otherwise, the card is outfitted with 8GB of GDDR5, like the E8950 before it, and is capable of driving up to 6 displays. Finally, AMD will be offering the card for sale for 3 years, which is par for the course for AMD’s embedded products.

Following up behind the E9550 is the E9260, the next step down in the refreshed Embedded Radeon lineup. This card is based on AMD’s Polaris 11 GPU, and is similar to the consumer Radeon RX 460, meaning it’s not quite a fully enabled GPU. Within AMD’s lineup it replaces the E8870, offering 2.5 TFLOPS of single-precision floating point performance versus the E8870’s 1.5 TFLOPS. AMD doesn’t list official clockspeeds for this card, but based on the throughput rating this puts its boost clock at around 1.4GHz. The card is paired with 4GB of GDDR5 on a 128-bit bus.
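The clockspeed estimate, and AMD's performance-uplift claim for the E9550, can both be sanity-checked with the standard GCN peak-rate formula (2 FLOPs per SP per clock from the fused multiply-add); the clocks used for the E9550/E8950 comparison are taken from the spec table above.

```python
def sp_tflops(sps, clock_ghz):
    """Peak FP32 TFLOPS for a GCN part: SPs x 2 FLOPs/clock (FMA) x clock."""
    return sps * 2 * clock_ghz / 1000.0

# E9260: back out the boost clock implied by the rated 2.5 TFLOPS and 896 SPs
boost_ghz = 2.5 * 1000.0 / (896 * 2)   # ~1.40 GHz

# E9550 vs E8950: 2304 SPs at ~1.26GHz against 2048 SPs at 750MHz
uplift = sp_tflops(2304, 1.26) / sp_tflops(2048, 0.75) - 1.0   # ~0.89
```

The ~89% computed uplift lines up reasonably well with AMD's quoted "around 93% better performance" for the E9550 over the E8950.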

Meanwhile on the power front, the E9260 is being rated for up to 50W. Notably, this is down from the 75W designation of its predecessor, as the underlying Polaris 11 GPU aims for lower power consumption. And unlike its more powerful sibling, the E9260 is being offered in two form factors: an MXM Type-A card, and a half height half length (HHHL) PCIe card. Both cards have identical performance specifications, differing only in their form factor and display options. Both cards can support up to 5 displays, though the PCIe card only has 4 physical outputs (so you’d technically need an MST hub for the 5th). Finally, both versions of the card will be offered by AMD for 5 years, which at this point would mean through 2021.

Moving on, besides the immediate performance benefits of Polaris, AMD is also looking to leverage Polaris’s updated display controller and multimedia capabilities for the embedded market. Of particular note here is support for full H.265 video encoding and decoding, something the previous generation products lacked. And display connectivity is greatly improved too, with both HDMI 2.0 support and DisplayPort 1.3/1.4 support.

The immediate market for these cards will be the same general markets that previous-generation products have been pitched at, including digital signage, casino gaming, and medical, all of which make use of GPUs to varying degrees and need parts to be available for a defined period of time. Across all of these markets AMD is especially playing up the 4K and HDR capabilities of the new cards, along of course with overall improved performance.

At the same time however, AMD’s embedded group is also looking towards the future, trying to encourage customers to make better use of their GPUs for compute tasks, a market AMD considers to be in its infancy. This includes automated image analysis/diagnosis, machine learning inferencing to allow a casino machine or digital sign to react to a customer, and GPU beamforming for medical. And of course, AMD always has an eye on VR and AR, though for the embedded market in particular that’s going to be more off the beaten path.

Wrapping things up, AMD tells us that the new Embedded Radeon cards will be shipping over the next quarter: the E9260 goes into production in the next couple of weeks, while the E9550 will arrive towards the end of Q4.
