1. Version 4 of PhotoRaw and PhotoRaw Lite are now available on the App Store. The new version is a full 64-bit rewrite of PhotoRaw, and takes full advantage of the speed of Apple's new devices. I've talked in previous posts about the speed advantages that 64-bit operation can bring.

    If you haven't tried PhotoRaw since the first version of PhotoRaw and the iPad 1, you should try the new version on an iPad Air - you'll be surprised......

  2. In my last blog post, I went all technical, and talked about how to use the SIMD hardware acceleration, otherwise known as NEON, on Apple's new 64-bit processor (aka ARM64, aka ARMv8-A) on the iPad Air and iPhone 5s.

    But the question is, is it actually worthwhile? Writing code for a SIMD processor is hard at the best of times, and in this case the documentation is near non-existent, and Apple's compiler is buggy. (It turns out that Apple use their own undocumented instruction naming convention which is aliased to the ARM names. But sometimes, the alias isn't quite right.)

    Now this is interesting because the new processor actually has two distinct personalities - it can either be a 32-bit processor (aka ARMv7), which looks just the same as previous generation iPad/iPhone processors, or it can be a 64-bit processor with a different instruction set (ARMv8-A). By way of background, there have been numbers of "experts" on the web stating that 64-bit would make no difference. Which is of course in theory true - all other things being equal, you can build a 32-bit processor as fast as a 64-bit one, but that rather misses the point. The point being, did Apple decide to make all other things equal, or not? Given X amount of chip area to work with, Apple could choose to use that area either to make the 32-bit part of the chip fast, or the 64-bit part.

    So I set out to find out (a) just how fast the new iPad is in imaging applications, and (b) whether either 64-bit mode or using the SIMD instruction set would make a significant difference.

    The benchmark

    My interest in this is practical, and is just about how to optimize my products, in this case PhotoRaw. So I chose to measure the performance of just one stage of PhotoRaw's pipeline which happens to be fairly "SIMD friendly", and is already SIMD accelerated for 32-bit under the older ARM processors. Note:

    • This is just a single point test - the stage in question is typical of an image processing pipeline, but your results may vary. A lot. Also, it's real production code, and it's the whole stage, so when I say SIMD, that actually means a mix of SIMD and C++.
    • The stage is multi-threaded, so will use all cores. Specifically, note that the iPad 1 is single-core, versus the dual-core architecture of the later iPads.
    • The NEON SIMD code is hand optimized. Interestingly, the SIMD code in the core loop on the 64-bit ARMv8-A is 23 instructions vs. 27 for the 32-bit code, so about a 15% saving there, although that's not hugely meaningful as different instructions take different numbers of cycles to complete.
    • Finally, it so happens that this stage runs identically in AccuRaw, which allows me to also benchmark the same code, in x86 form, on an Intel Core i7 processor, which is quad-core.
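    For anyone wanting to run a similar single-stage measurement, a minimal timing harness might look like the sketch below. It is not PhotoRaw's actual benchmark code: `now_ms`, `best_time_ms` and `demo_stage` are hypothetical names, and it assumes a POSIX system where `clock_gettime` is available. Because it measures wall-clock time around the whole call, it works the same way whether the stage is single-threaded or spreads itself over all cores.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* ask for clock_gettime on strict compilers */
#endif
#include <time.h>

/* Current time in milliseconds from a monotonic (non-adjustable) clock. */
double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
}

/* Runs stage(ctx) `runs` times and returns the fastest wall-clock time in ms.
   Taking the best of several runs reduces noise from caches and scheduling. */
double best_time_ms(void (*stage)(void *), void *ctx, int runs)
{
    double best = -1.0;
    for (int r = 0; r < runs; ++r) {
        double t0 = now_ms();
        stage(ctx);
        double dt = now_ms() - t0;
        if (best < 0.0 || dt < best)
            best = dt;
    }
    return best;
}

/* Hypothetical stand-in for a real pipeline stage: adds 0.5 a million times. */
void demo_stage(void *ctx)
{
    volatile float *acc = ctx;
    for (int i = 0; i < 1000000; ++i)
        *acc += 0.5f;
}
```

    Calling `best_time_ms(demo_stage, &acc, 3)` with a `float acc` returns the fastest of three runs; a real test would pass the stage under test instead of `demo_stage`.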

    The results

    Times in ms, lower is better.

                               C++      SIMD
    iPad 1                   6,056     2,813
    iPad 4                     581       514
    iPad Air 32-bit            321       474
    iPad Air 64-bit            230       108
    Intel Core i7 4.2 GHz       46


    The results are interesting, and probably not quite what you'd expect:
    1. Unsurprisingly, the iPad 1 just gets completely outclassed - it has a slow single core processor, and just can't keep up at all. On the iPad 1 however, SIMD makes a real difference, which is how SIMD originally found its way into PhotoRaw.
    2. The iPad 4 is much better, but there's a surprise - SIMD code only helps a little.
    3. On the iPad Air, there's another surprise - running in 64-bit mode instantly gains you about 40% - 230 ms vs 321 ms just using compiled C++ code.
    4. SIMD on the iPad Air is the real shocker. Firstly, in 32-bit mode, it's slower than straight C++ code. If I had to guess, I'd say that Apple deliberately built the 32-bit SIMD side of the new chip to just match the iPad 4, for compatibility reasons. However, in 64-bit mode, it's screamingly fast, clocking a full 5 times faster than the iPad 4, and twice as fast as compiled C++.
    5. Apple have claimed that the processor in the iPad Air is "desktop class". Well, sort of. Versus a Core i7 clocking at 4.2 GHz, it's about 1/5 the speed. But on a per-core basis, that's close to half the speed. That's from a device that, including memory, screen, battery, etc., takes up about 10% of the space that the Core i7's heat sink and fan take up!
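    The ratios quoted in the list above can be checked directly against the table. The sketch below just does the arithmetic; the function names `speedup` and `print_ratios` are made up for illustration, and the timings are copied from the results table.

```c
#include <stdio.h>

/* How many times faster a run taking fast_ms is than one taking slow_ms. */
double speedup(double slow_ms, double fast_ms)
{
    return slow_ms / fast_ms;
}

/* Prints the ratios discussed in the text, using the table's timings (ms). */
void print_ratios(void)
{
    const double ipad4_simd = 514.0;
    const double air32_cpp  = 321.0;
    const double air64_cpp  = 230.0;
    const double air64_simd = 108.0;
    const double core_i7    = 46.0;

    printf("Air 64-bit C++ vs Air 32-bit C++:  %.2fx\n",
           speedup(air32_cpp, air64_cpp));
    printf("Air 64-bit SIMD vs iPad 4 SIMD:    %.2fx\n",
           speedup(ipad4_simd, air64_simd));
    printf("Air 64-bit SIMD vs Air 64-bit C++: %.2fx\n",
           speedup(air64_cpp, air64_simd));
    printf("Core i7 vs Air 64-bit C++:         %.2fx\n",
           speedup(air64_cpp, core_i7));
}
```

    Bear in mind that these are ratios of single measurements from one pipeline stage, so they are indicative rather than precise.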

    Conclusions

    First conclusion - if you were wondering whether the whole bother of rebuilding apps for 64-bit is worthwhile, then the answer is that if they are CPU intensive imaging apps, then it is probably worth the bother. In this benchmark, the uptick in performance was about 40% right there.

    Second conclusion - SIMD might be worthwhile for you, but only if you're going to 64-bit mode and have a real need. Otherwise, don't bother.

    Third conclusion - all those web "experts" that said that 64-bit doesn't matter - well, Apple made it matter.

    Finally, various people have speculated as to whether Apple's 64-bit chip could find its way into a desktop product. The answer is, yes, probably. If you built a 4-core version, up-clocked it and added heat sinking, it probably still wouldn't quite compete with the top-of-the-line Intel chips. But it would be quite capable.

  3. So this one is for the serious techies.

    It's been widely publicized that the new iPad Air and iPhone 5s have 64-bit processors. What's not so well understood is that the 64-bit processors can actually run either old 32-bit code or new 64-bit code. If you run the new 64-bit code, you're running an instruction set that's quite different to the old one; it's not like it just got bigger registers. Apple just refer to the new architecture as ARM64, but in official ARM speak, the new instruction set is actually ARMv8-A; the 32-bit set was ARMv7.
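    Which of the two instruction sets a given binary uses is decided when it is compiled, not at run time. As a small illustration, the sketch below reports what the current build targets, using macros that GCC and Clang predefine (`__aarch64__`, `__arm__`, `__x86_64__`, `__i386__`). The function names are hypothetical and chosen just for this example.

```c
#include <stdio.h>

/* Short name for the instruction set this file was compiled for. */
const char *target_isa(void)
{
#if defined(__aarch64__)
    return "ARM 64-bit";
#elif defined(__arm__)
    return "ARM 32-bit";
#elif defined(__x86_64__) || defined(__i386__)
    return "x86";
#else
    return "other";
#endif
}

/* Pointer width in bits: 64 in a 64-bit build, 32 in a 32-bit build
   (assuming the usual 8-bit bytes). */
int pointer_bits(void)
{
    return (int)(sizeof(void *) * 8);
}

/* Prints one line describing the build target. */
void print_target(void)
{
    printf("%s, %d-bit pointers\n", target_isa(), pointer_bits());
}
```

    Built once per architecture, the same source would report a different target each time, even when both binaries run on the same 64-bit chip.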

    Now a lot of apps used the NEON SIMD extensions to improve performance - SIMD instructions effectively perform the same instruction simultaneously on multiple pieces of data, so speeding up operations.
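    To make the idea concrete, here is a minimal sketch that multiplies an array by a constant. It uses the compiler's NEON intrinsics from `arm_neon.h` rather than hand-written assembly, processes four floats per instruction where NEON is available, and falls back to plain C elsewhere. The function name `scale_f32` is made up for this example.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* out[i] = a[i] * gain for n floats. */
void scale_f32(const float *a, float gain, float *out, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* SIMD path: each iteration handles 4 floats at once. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t x = vld1q_f32(a + i);   /* load 4 floats       */
        x = vmulq_n_f32(x, gain);           /* 4 multiplies in one */
        vst1q_f32(out + i, x);              /* store 4 floats      */
    }
#endif
    /* Scalar path: leftover elements, or everything when NEON is absent. */
    for (; i < n; ++i)
        out[i] = a[i] * gain;
}
```

    The scalar loop doubles as the tail handler, so the function gives the same results for any `n` on any platform.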

    How to write NEON code on the old 32-bit iDevices was well understood; there were libraries such as Math-Neon available that you could look at and/or use directly. Not so for the 64-bit devices; there is not much out there other than ARM's deep-dive engineering manuals. The shortest ARM document I found is a 112 page "Instruction Set Overview", and is not exactly easy reading, and of course is not Apple specific. In fact, ARM don't seem to even call the SIMD extensions NEON anymore, it's just "Advanced SIMD Floating-Point". The only third-party exception that I found is the folks at Linaro, who have some presentations about porting to ARMv8, and have also ported the libjpeg-turbo library to ARMv8.

    However, the Linaro code isn't Apple friendly, doesn't show how to do ASM blocks, how to select between 32-bit and 64-bit mode, etc.

    Better yet, Apple refuse to help. I asked their Apple Developer Technical Support (DTS) service for suggestions on best practice with a paid tech support incident. What I got back was: "Thank you for contacting Apple Developer Technical Support (DTS). DTS does not provide support for ARM assembly." So not much support there.

    However, not one to let an absence of documentation or support stop me, below is a simple technology demonstrator of how it's possible to support NEON-32 and NEON-64 code in the form of inline asm blocks in an Xcode based project.

    The example below is quite simple, just a 3x3 matrix by 3 element vector multiply routine, but it shows a structure that works. The 64-bit code is just a straight port of the 32-bit code; it may be possible to further optimize it, but this is just intended to show the principle.


    #if !defined(__i386__) && defined(__ARM_NEON__)
    #if  (!defined(__LP64__) && !defined(_LP64))
    #define __MATH_NEON_32
    #else
    #define __MATH_NEON_64
    #endif
    #endif

    void
    matvec3_RowMajor(float matrix[3][3], float v[3], float d[3])
    {
    float *m = (float *) matrix;
    d[0] = m[0]*v[0] + m[1]*v[1] + m[2]*v[2];
    d[1] = m[3]*v[0] + m[4]*v[1] + m[5]*v[2];
    d[2] = m[6]*v[0] + m[7]*v[1] + m[8]*v[2];
    }

    void __attribute__((noinline)) matvec3_neon_RowMajor(float m[3][3], float v[3], float d[3])
    {
    #if defined(__MATH_NEON_32)
      __asm__ volatile (
        "vld1.32 {d6}, [%1]! \n\t"         // Q3 = v
        "vld1.32 d7[0], [%1] \n\t"
     
        "vld3.32 {d0, d2, d4}, [%0]! \n\t" // Q0 = {x, m0_6, m0_3, m0_0} = {d1, d0}
        "vld3.32 {d1[0], d3[0], d5[0]}, [%0] \n\t" // Q1 = {x, m0_7, m0_4, m0_1} = {d3, d2}
                                                                            // Q2 = {x, m0_8, m0_5, m0_2} = {d5, d4}
     
        "vmul.f32 q9, q0, d6[0] \n\t" // Multiply out
        "vmla.f32 q9, q1, d6[1] \n\t" //
        "vmla.f32 q9, q2, d7[0] \n\t" //
        "vmov.f32 q0, q9 \n\t" //
     
        "vst1.32 d0, [%2]! \n\t" //r2 = D24
        "fsts s2, [%2] \n\t" //r2 = D25[0]
     
        : "+r"(m), "+r"(v), "+r"(d)
        :
        : "q0", "q1", "q2", "q3", "q9", "memory"
      );
    #elif defined(__MATH_NEON_64)
      __asm__ volatile (
        "ld1 {v3.2s}, [%1], 8 \n\t" // V3 = v
        "ld1 {v3.s}[2], [%1] \n\t"

        "ld3 {v0.2s, v1.2s, v2.2s}, [%0], 24 \n\t" // V0 = {x, m0_6, m0_3, m0_0}
        "ld3 {v0.s, v1.s, v2.s}[2], [%0] \n\t" // V1 = {x, m0_7, m0_4, m0_1}
                                                                            // V2 = {x, m0_8, m0_5, m0_2}
                          
        "fmul v9.4s, v0.4s, v3.s[0] \n\t" // Multiply out
        "fmla v9.4s, v1.4s, v3.s[1] \n\t" //
        "fmla v9.4s, v2.4s, v3.s[2] \n\t" //
                          
        "st1 {v9.2s}, [%2], 8 \n\t" // Result in V9
        "st1 {v9.s}[2], [%2] \n\t"
     
        : "+r"(m), "+r"(v), "+r"(d)
        :
        : "v0", "v1", "v2", "v3", "v9", "memory"
      );
    #else
    matvec3_RowMajor(m, v, d);
    #endif
    }
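    For comparison, the same multiply can also be sketched with NEON intrinsics instead of inline asm. This is an alternative approach, not the code used in PhotoRaw: the intrinsics below (`vld1q_f32`, `vmulq_n_f32`, `vmlaq_n_f32`, `vst1q_f32`) compile for both 32-bit and 64-bit ARM, so one source path covers both. The names `matvec3_ref` and `matvec3_intrin` are hypothetical, and the columns are gathered with ordinary C for clarity rather than speed.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* d = M * v for a row-major 3x3 matrix stored as 9 floats (plain C). */
void matvec3_ref(const float m[9], const float v[3], float d[3])
{
    d[0] = m[0]*v[0] + m[1]*v[1] + m[2]*v[2];
    d[1] = m[3]*v[0] + m[4]*v[1] + m[5]*v[2];
    d[2] = m[6]*v[0] + m[7]*v[1] + m[8]*v[2];
}

/* Same result via NEON intrinsics where available, plain C otherwise. */
void matvec3_intrin(const float m[9], const float v[3], float d[3])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* The three columns of M, each padded to 4 lanes (lane 3 unused). */
    const float c0[4] = { m[0], m[3], m[6], 0.0f };
    const float c1[4] = { m[1], m[4], m[7], 0.0f };
    const float c2[4] = { m[2], m[5], m[8], 0.0f };
    float r[4];

    /* acc = c0*v[0] + c1*v[1] + c2*v[2], computed lane by lane. */
    float32x4_t acc = vmulq_n_f32(vld1q_f32(c0), v[0]);
    acc = vmlaq_n_f32(acc, vld1q_f32(c1), v[1]);
    acc = vmlaq_n_f32(acc, vld1q_f32(c2), v[2]);

    vst1q_f32(r, acc);
    d[0] = r[0];
    d[1] = r[1];
    d[2] = r[2];
#else
    matvec3_ref(m, v, d);
#endif
}
```

    Whether the compiler's output from intrinsics matches hand-tuned asm is something you would have to measure; the point here is only that the structure carries over.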

  4. Lloyd Chambers, of the diglloyd blog, recently published a review of the Nikon D800M in which he used AccuRaw Monochrome for raw processing. So what is AccuRaw Monochrome and why should you care? Here's the story:

    By way of background, the D800M is a Nikon D800 that's been modified by the folks at MaxMax.com to remove the Bayer color filter layer on the sensor, creating a pure monochrome camera. So, no Bayer demosaicing artifacts, beautiful tonality, etc, etc. Now you could just go out and buy a pure monochrome camera off the shelf, in the form of Leica's M Monochrom, which I spoke about on this blog in some previous posts. Problem is, once you've bought an M Monochrom, and a few lenses, you won't have much change from say $20,000. And then, much as I love Leica's M cameras, you have a camera that is really only at its best for lenses between 35mm and 75mm, doesn't have auto-focus, etc, etc. Enter the folks at MaxMax. A lot of their work is for scientific and engineering applications, but they will build you a camera modified to pure monochrome. You can choose anything from a pocketable point-and-shoot to a top-of-the-line Nikon. You can also get cameras without IR filters, UV filters, etc, etc to suit your intended use.

    Lloyd's various posts go into detail about image quality, usability, etc, and are well worth the read if you're interested in monochrome work, so I won't go into any of that here. What I will talk about is the technicalities of processing the image from a camera modified to monochrome.

    The problem of course is, how do you do the raw processing? Well, if you're handy with command line options and don't mind a fairly complex multi-step process, you can persuade DCRaw to treat the image as a single color. Or you can just use a conventional raw processor such as ACR or Lightroom (or the "normal" version of AccuRaw, for that matter). Of course, the raw processor is going to think that it's still dealing with a Bayer matrix camera, and as a result is going to try to demosaic monochrome data. While the end result of that is actually not as bad as you might imagine, it isn't ideal (see later for an example).

    Enter AccuRaw Monochrome. Now, to be clear, "AccuRaw" and "AccuRaw Monochrome" are separate products. AccuRaw Monochrome is dedicated to monochrome applications. Its primary target is actually conventional unmodified off-the-shelf cameras. For those cameras, AccuRaw Monochrome has a special demosaicing algorithm dedicated to creating monochrome images. Because AccuRaw Monochrome is dedicated just to that, it can do a better job than conventional demosaicing algorithms that are optimized for good color results.

    However, AccuRaw Monochrome also has another trick up its sleeve - it can also do true monochrome processing for a camera modified to monochrome operation such as by MaxMax. It's in this role that Lloyd Chambers was using the AccuRaw Monochrome beta. (If you'd like to be a beta tester, either with a conventional or modified camera, drop me an email to the contact address on the AccuRaw website.)

    True raw processing

    To demonstrate the difference that true monochrome processing makes, I'll use an image provided by the folks at MaxMax.com that was created with a D800M. Firstly, let's look at the entire image, as processed with Lightroom 5.2 and AccuRaw Monochrome Beta 0.9.1. Note that I'm using Lightroom here just because that is what is commonly used - you would get similar results from any conventional raw processor, including the normal version of AccuRaw for example.

    Lightroom 5.2, default settings, saturation set to zero


    AccuRaw Monochrome Beta 0.9.1, default settings

    Looking at the entire image, reduced to a size that fits well on this page, there is no discernible difference between LR and AccuRaw Monochrome. The images might as well be identical - exposure is the same, overall contrast is the same, etc. However, let's take a closer look. But before we do that, hold onto the "overall contrast is the same" thought - it will be important later.

    100% crops

    Now let's look at some 100% crops:

     Lightroom 5.2, 100% crop, default settings, saturation = 1


    Lightroom 5.2, 100% crop, default settings, saturation = 0



    AccuRaw Monochrome beta, 100% crop, default settings

    The first Lightroom crop, with saturation set to 1, clearly shows the demosaicing problem in the form of color artifacts, e.g., the "On" lettering and the "System" lettering. Setting the saturation to 0, in the second crop, sorts that (and the white balance) out. However, the artifacts are still there - they're just not as obvious. In the third crop, this one from AccuRaw Monochrome, those artifacts aren't there, and the image is noticeably sharper all over.

    400% crops

    Looking at some crops at 400% will make what is happening more obvious:

    Lightroom 5.2, 400% crop, default settings, saturation = 0 


    AccuRaw Monochrome beta, 400% crop, default settings

    Comparing the two 400% crops, what you see is really two things - firstly, the demosaicing process is creating artifacts - you can most easily see that around the lettering. But secondly, and more subtly, there is also a loss of micro-contrast in the LR image relative to the AccuRaw Monochrome image. E.g., take a look at the "2000i" text. And remember, the overall contrast is the same. That loss of micro-contrast makes a real difference, and is primarily why AccuRaw Monochrome's 100% crop looks much sharper all over, not just where there are artifacts.
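    A toy model shows where that loss of micro-contrast can come from. The sketch below is a deliberately simplified one-dimensional illustration with two alternating "colors" and linear interpolation. It is neither Lightroom's nor AccuRaw Monochrome's algorithm, and all names in it are made up. Each site on a monochrome sensor holds a true luminance sample, but a Bayer-style pipeline treats alternate sites as different channels, interpolates each channel across the gaps, and then averages the channels when the image is desaturated.

```c
#define ROW_LEN 5

/* Rebuilds one channel of a 1-D mosaic. Sites where i % 2 == phase hold real
   samples; the other sites are filled with the mean of their neighbours. */
void interpolate_channel(const float raw[ROW_LEN], int phase, float out[ROW_LEN])
{
    for (int i = 0; i < ROW_LEN; ++i) {
        if (i % 2 == phase) {
            out[i] = raw[i];
        } else {
            /* At the borders only one neighbour exists, so reuse it. */
            float left  = (i > 0)           ? raw[i - 1] : raw[i + 1];
            float right = (i < ROW_LEN - 1) ? raw[i + 1] : raw[i - 1];
            out[i] = 0.5f * (left + right);
        }
    }
}

/* Demosaics both channels, then averages them (i.e. saturation = 0). */
void demosaic_then_desaturate(const float raw[ROW_LEN], float luma[ROW_LEN])
{
    float a[ROW_LEN], b[ROW_LEN];
    interpolate_channel(raw, 0, a);
    interpolate_channel(raw, 1, b);
    for (int i = 0; i < ROW_LEN; ++i)
        luma[i] = 0.5f * (a[i] + b[i]);
}

/* Difference between the brightest and darkest value in the row. */
float peak_to_peak(const float x[ROW_LEN])
{
    float lo = x[0], hi = x[0];
    for (int i = 1; i < ROW_LEN; ++i) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    return hi - lo;
}
```

    For a row such as {10, 10, 200, 10, 10}, a one-pixel bright line, using the samples directly keeps the full peak-to-peak range, while the demosaic-then-desaturate path both lowers the peak and spreads it into the neighbouring pixels.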

    Now I just need to work on persuading myself that a MaxMax modified Nikon Df is absolutely necessary to my artistic development.......might not be hard.....

  5. For those who have been looking for an opportunity to buy AccuRaw, it's 30% off on the App Store for the whole Black Friday weekend, through Cyber Monday.

    AccuRaw on the App Store

About Me
Author of AccuRaw, PhotoRaw, CornerFix, pcdMagic, pcdtojpeg, dcpTool, WinDat Opener and occasional photographer....