==============================================================
ASF2VC1 v1.2: Additional Technical Info and Source Code Notes
==============================================================

Theory of Operation: Abstract
=============================

There's really no such thing as a VC-1 "Elementary Stream" (see "What is a VC-1 Elementary Stream, Anyway?" later in this document). But ignoring that minor issue, I'll continue.

A usable VC-1 ES needs a "Frame Start Code" (a simple 4-byte code) at the beginning of each frame. It also needs an "Entry Point Start Code", followed by some amount of associated data, prior to each keyframe. Additionally, it needs at least one "Sequence Start Code", along with its associated data, before the first frame, although it is legal, common, and desirable to simply include the Sequence / Entry Point pair before every keyframe (ASF2VC1 will do that by default).

A standard demux of the video stream of a WVC1 Advanced Profile .WMV file will give us a concatenated series of frames containing VC-1 video data. It would be nice if simply demuxing the video gave us a usable and proper "ES", in which case this application would be unnecessary. Unfortunately, it doesn't; it lacks all three of the above.

Because the frames are already delimited by the ASF container, Microsoft does not bother to add the "Frame Start Code" to the beginning of each frame; I assume they feel it's implied and would be redundant. It's just a simple 4-byte tag, and it's easy to add, but its absence in the demuxed stream is a critical problem. With that addition alone, SMPTE would consider the resulting stream a "conformant bitstream" (though not a "picture-producing conformant bitstream"; both are described later). Too bad - if that's all we needed to do, this app would be quite a bit simpler.
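As a rough illustration of the framing described above, here is a minimal C++ sketch. It is not taken from the ASF2VC1 source: the function name AppendFrame and the constant kFrameStartCode are hypothetical, the frame start code value (0x0000010D) is the commonly cited one and should be checked against the spec, and the Sequence / Entry Point header bytes are assumed to arrive already encoded, start codes included.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Start code prefix 0x000001 plus the type byte for a frame (0x0D).
// Assumption: commonly cited value, not copied from the ASF2VC1 source.
static const uint8_t kFrameStartCode[4] = {0x00, 0x00, 0x01, 0x0D};

// Append one demuxed ASF frame to the output stream, adding what the
// container left out. `headers` is an already-encoded Sequence / Entry
// Point header blob; it is emitted before keyframes only.
void AppendFrame(std::vector<uint8_t>& out,
                 const std::vector<uint8_t>& frame,
                 bool isKeyframe,
                 const std::vector<uint8_t>& headers)
{
    assert(!frame.empty());
    if (isKeyframe)
        out.insert(out.end(), headers.begin(), headers.end());
    out.insert(out.end(), kFrameStartCode, kFrameStartCode + 4);
    out.insert(out.end(), frame.begin(), frame.end());
}
```

Calling AppendFrame once per demuxed frame, in order, yields headers + start code + payload for keyframes and start code + payload for everything else.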
To their credit, Microsoft does make some attempt to include the Entry Point and Sequence Start codes and their associated data, and it's in the form of a pre-encoded bit string - "ready to go". Again, presumably to avoid redundancy, they only include it once, though. It's "hidden" (i.e. currently undocumented, to my knowledge) in the ASF_VIDEO_MEDIA_OBJECT structure, 1 byte after the last byte of the last documented member, a BITMAPINFOHEADER structure.

So, while demuxing, simply inserting a Frame Start Code before every demuxed frame, and also inserting this supplied Sequence Header / Entrypoint Header bit sequence, verbatim, before each frame marked as a keyframe, would make a usable VC-1 ES out of the resulting demuxed data. In fact, I believe that's exactly what version 1.0 of this app did, and I believe it did produce a technically "picture-producing conformant" VC-1 bitstream.

Most modern containers use a "byteless" stream of bits (meaning byte boundaries are meaningless) where the data is grouped into fields which are often variable length and/or entirely optional, based on previous values. VC-1 is no exception. This involves writing custom code to tediously "parse out" values, bit by bit, even if you are only interested in a single bit of info.

So here's the good part. While much of the ASF code is more conventionally written, I coded the VC-1 parsing (reading a bitstream) and "un-parsing" (creating a bitstream) as a prototype for a new "experimental" table-driven approach I thought up, which seems to have great promise (for my GSpot app in particular, which must handle dozens of these kinds of specs). Each entry in the table contains the name of a value (as an ID and as "human readable" text), the number of bits the value occupies, a condition, if any, upon which to "skip" it, and a place for the value itself (as well as a flag designating that value as valid).
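To make the table idea more concrete, here is a minimal C++ sketch of a table entry and a "parse" loop. It is an illustration only, not the code in the ASF2VC1 source: the Entry layout, the three demo fields, the "skip when equal" rule and all names (BitReader, ParseTable, MakeDemoTable) are assumptions made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum FieldId { F_HAS_EXTRA, F_EXTRA, F_TAIL };   // hypothetical field IDs
const int kNoCond = -1;

struct Entry {
    FieldId     id;
    const char* name;       // "human readable" text
    int         bits;       // number of bits the value occupies
    int         condField;  // index of the controlling field, or kNoCond
    uint32_t    condValue;  // skip this entry when that field == condValue
    uint32_t    value;      // the parsed (or to-be-written) value
    bool        valid;      // false means "uninitialized"
};

class BitReader {           // reads bits MSB first from a byte buffer
public:
    explicit BitReader(const std::vector<uint8_t>& d) : data_(d), pos_(0) {}
    uint32_t Get(int n) {
        uint32_t v = 0;
        for (int i = 0; i < n; ++i, ++pos_) {
            assert(pos_ < data_.size() * 8);
            v = (v << 1) | ((data_[pos_ / 8] >> (7 - pos_ % 8)) & 1u);
        }
        return v;
    }
private:
    const std::vector<uint8_t>& data_;
    size_t pos_;
};

// "Run" the table in parse mode: one entry at a time, honoring skips.
void ParseTable(std::vector<Entry>& table, BitReader& in)
{
    for (size_t i = 0; i < table.size(); ++i) {
        Entry& e = table[i];
        e.valid = false;
        if (e.condField != kNoCond) {
            const Entry& c = table[e.condField];
            if (!c.valid || c.value == e.condValue)
                continue;                 // field is absent from the stream
        }
        e.value = in.Get(e.bits);
        e.valid = true;
    }
}

std::vector<Entry> MakeDemoTable()
{
    return {
        { F_HAS_EXTRA, "has_extra", 1, kNoCond,     0, 0, false },
        { F_EXTRA,     "extra",     4, F_HAS_EXTRA, 0, 0, false },
        { F_TAIL,      "tail",      3, kNoCond,     0, 0, false },
    };
}
```

With this demo table, the single byte 0xD3 (bits 1 1010 011) parses to has_extra = 1, extra = 10, tail = 3, while 0x50 (bits 0 101 plus padding) leaves "extra" invalid and gives tail = 5.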
The values in the table are all readily accessible: they are constant-length (32-bit) values and can be accessed by simply going to their index. So values can be directly extracted or inserted at any time. The table can be "run" by a simple loop which, in the absence of any overriding "conditionals" within the table, simply "executes" one entry at a time and then moves on to the next one (the whole idea has similarities to an interpreted computer language). The table can furthermore be run in "parse" mode (read a bitstream and fill the table) or in reverse (use values in the table to create a bitstream), depending on the loop chosen to run it.

So, here's how it works:

1. The table is first "zeroed out" (well, it's actually "initialized to uninitialized", since zero is often a valid value, but that's a technicality). The "parser" loop is then run once on the Microsoft-supplied "short" VC-1 headers (the Sequence and Entrypoint headers mentioned above), "reading in" any values it finds there into the corresponding locations in the table.

2. The ASF file is then parsed (currently using more conventional, "straight line" code) to find values which I've found to be important yet missing in Microsoft's "short" VC-1 headers. Most significantly, the framerate is missing. But I also parse out the pixel aspect ratio (PAR) and perhaps some other stuff.

   This is where the program could be easily expanded. The table has a spot for every possible value that could exist in the Sequence and Entrypoint headers. The app could be modified to grab additional values from the ASF file or, and this would be particularly easy, to take them directly from the command line and insert them. That way, a value like PAR could be specified even if it's not present, or "overridden" when it is. Values like framerate (which should always be present) could be overridden as well.

3. The table is then "unparsed" back to bitstream form.
   The generated result, a "longer and more informative" bitstream than the original, is saved for use during the actual demux process in the next step.

4. Finally, the ASF file is demuxed (not a particularly trivial task, by the way). A Frame Start Code is inserted prior to every frame, and the "new and improved" Entrypoint and Sequence Header pair bitstream is inserted prior to every keyframe.

5. VC-1 has certainly upped the complexity level on the previously simple question of whether a frame is a keyframe or not (I think 3 of its approximately 13 possible frametypes are considered keyframes). In any event, the ASF file marks keyframes, and I perform a direct check of the VC-1 frametype as well. I believe the debug version of the code will "assert" if the ASF file contradicts the statement above (that 3 types are keyframes), and I have yet to see an assertion. For more info on this, just run the app in its most verbose mode.

What is a VC-1 Elementary Stream, Anyway?
=========================================

Technically, at the lowest level a VC-1 bitstream consists of a series of so-called "RBDUs" (raw bitstream decodable units), of which there are 14 different types. Each RBDU is prefixed by the start code "0x000001", followed by an additional byte that identifies the type of BDU that follows. The problem is, the sequence 0x000001 could randomly show up in the middle of an RBDU, which would, among other things, make re-synchronization virtually impossible. So Annex E of SMPTE-421M defines an "encapsulation mechanism", wherein all data within the RBDU after the actual start code is "escaped", as programmers commonly put it. The resulting BDU is now called an EBDU (you guessed it: "encapsulated bitstream decodable unit"). The above annex describes how to convert an RBDU into an EBDU and vice versa. I believe that, up until this point, the mechanism described is intended to apply to all VC-1 profiles - Simple, Main and Advanced.
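For readers unfamiliar with this kind of "escaping", here is a minimal C++ sketch of the general technique. It assumes the familiar scheme in which a 0x03 byte is inserted after two zero bytes whenever the next byte is 0x03 or less; that rule matches how open-source VC-1 decoders undo the escaping as far as I can tell, but it is stated here as an assumption, not quoted from Annex E, and any special handling at the end of a unit is ignored. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// RBDU payload -> escaped payload: break up any 00 00 0x pattern (x <= 3)
// by inserting an 0x03 byte, so 00 00 01 can never appear by accident.
std::vector<uint8_t> EscapePayload(const std::vector<uint8_t>& raw)
{
    std::vector<uint8_t> out;
    int zeros = 0;                       // consecutive zero bytes emitted
    for (size_t i = 0; i < raw.size(); ++i) {
        uint8_t b = raw[i];
        if (zeros >= 2 && b <= 0x03) {
            out.push_back(0x03);         // the escape byte
            zeros = 0;
        }
        out.push_back(b);
        zeros = (b == 0x00) ? zeros + 1 : 0;
    }
    return out;
}

// Escaped payload -> RBDU payload: drop each 0x03 that follows two zero
// bytes and precedes a byte <= 0x03.
std::vector<uint8_t> UnescapePayload(const std::vector<uint8_t>& esc)
{
    std::vector<uint8_t> out;
    int zeros = 0;
    for (size_t i = 0; i < esc.size(); ++i) {
        uint8_t b = esc[i];
        if (zeros >= 2 && b == 0x03 &&
            i + 1 < esc.size() && esc[i + 1] <= 0x03) {
            zeros = 0;
            continue;                    // skip the escape byte
        }
        out.push_back(b);
        zeros = (b == 0x00) ? zeros + 1 : 0;
    }
    return out;
}
```

The two functions are inverses of each other for the inputs covered by the stated rule, which is the property a round-trip check should confirm.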
In any event, this encapsulation work is already done for us when we demux the ASF file. The ASF2VC1 application does not have to perform any "encapsulation". I only mention it to avoid any possible terminology confusion.

Unfortunately, a series of concatenated EBDUs is not what we are trying to create - we're not nearly done yet. We want a playable concatenation of these units, something *we'll* call a VC-1 "Elementary Stream" (hereafter referred to as "ES") - a stream that could be decoded and played "by itself", i.e. without the benefit of a container. We're using the term Elementary Stream by way of analogy with MPEG: an MPEG Elementary Stream is an encapsulated, containerless video stream that can be "played by itself" - just like we want.

Our "ES" is defined in SMPTE-421M Annex G: "Bitstream Construction Constraints - Advanced Profile" and is technically called a "picture-producing conformant bitstream".

Note: to the best of my knowledge, the SMPTE specification does not appear to define a "conformant bitstream", much less its subset, a "picture-producing conformant bitstream", for profiles other than the Advanced. The common wisdom is, basically, that there is "no such thing". The Simple and Main profiles are defined to the point where they can be put in a container, but there does not appear to exist a definition of what we're calling an "Elementary Stream" for either the Simple or Main profiles.

Notes About the Source Code
===========================

I had big plans to neaten up this source, add features, remove some unused stuff, and improve comments and some function and class names, and I thus did not release it immediately. At the time I figured I'd release it in another week or two. But that was over six months ago, and in the interim I've received several requests for it, so I'm releasing the untouched source that was used to build ASF2VC1 v1.2, build 20070526, the version that's been posted here since that time.
The source code is extremely generic, and should compile "out of the box" using Visual Studio VC-2003 or VC-2005. The binary release that's been posted was compiled and "statically linked" using VC-2005 running on Windows Vista. But I've also added a second source code package, for those who prefer, which compiles "out of the box" on VC-6. It's basically the same source - just open the .dsw or the .dsp instead of the .sln or the .vcproj included in the former package. All I had to do to make the VC-6 version was add a few extra system #includes, change the prototype for "main()", and one or two other trivial things.

Porting
=======

Porting this to another O/S should be a snap. The existing readme for the executable says "This application has NO 'system requirements' to speak of... if your PC can read and write files, then the app should run fine. It consists of a single executable file, written from scratch. No DirectShow, no codecs, no SDK's, other apps or anything else is needed." And I would say the statement can be roughly applied to porting the source as well.

I haven't actually done it, but my guess is that porting to Linux or a similar environment would mostly consist of swapping a few CreateFile(), ReadFile() and WriteFile() calls with fopen(), fread() and fwrite(). I think I used an MS macro, DEFINE_GUID, but that could just be replaced with "const unsigned char myguid[16] = {bytes of some GUID}" (a GUID is 16 bytes), and other issues should be equally simple. All required GUID *values* are included in the supplied header files, though, so you do not need to pore through any obscure Microsoft header files. Most are only obtainable from the ASF specification doc anyway, and if I discovered any undocumented ones that were needed, they're there too.

Re: Byte ordering: If you're porting to a system that uses Motorola (Big Endian) byte ordering, I *know* for sure you will have to add a line or so to three functions: NextWordLe(), NextDWordLe(), and NextQWordLe() in the file FileCache2.cpp.
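One byte-order-independent way to read little-endian values is to assemble them from individual bytes instead of "casting" a pointer. The following is a minimal sketch of that technique only; the function names (ReadLe, ReadWordLe, ReadDWordLe, ReadQWordLe) are hypothetical and it reads from a plain byte buffer, not from the file cache class used by the actual source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assemble an n-byte little-endian value (n <= 8) one byte at a time.
// The result is the same on little-endian and big-endian hosts because
// no memory is reinterpreted; only shifts and ORs are used.
uint64_t ReadLe(const uint8_t* p, size_t n)
{
    assert(n <= 8);
    uint64_t v = 0;
    for (size_t i = 0; i < n; ++i)
        v |= static_cast<uint64_t>(p[i]) << (8 * i);
    return v;
}

uint16_t ReadWordLe(const uint8_t* p)  { return static_cast<uint16_t>(ReadLe(p, 2)); }
uint32_t ReadDWordLe(const uint8_t* p) { return static_cast<uint32_t>(ReadLe(p, 4)); }
uint64_t ReadQWordLe(const uint8_t* p) { return ReadLe(p, 8); }
```

For example, the bytes 78 56 34 12 read as a DWORD give 0x12345678 regardless of the host's byte order.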
NextWordLe(), NextDWordLe() and NextQWordLe() simply read the next two, four or eight bytes respectively from the ASF "input stream" and are expected to return a corresponding value for your host system. I probably would have thrown in ntohs() and ntohl(), but Microsoft specifies that all values in an ASF file be in Intel (Little Endian) order (even though it's a "streaming" format definition, they stuck to their old AVI ways and rejected the fact that "network order" was defined to be Big Endian - but I digress). I was writing a quick piece of code for a Windows (Intel) machine that read a file format defined as Little Endian, so I just "cast" it. I could make the code portable right now in less time than it's taking to type this, but I want to release the code untouched. Please double check, but it seems other similar functions, in particular the nearby "NextVarlenLe()", don't use "casting" and are inherently platform independent.

Addendum: I just noticed there's a definition "LITTLE_ENDIAN" near the top of "constants.h". Comment that out if it's not applicable to your machine. Apparently there are a few other places that are byte order sensitive, though it looks like I've put in the appropriate conditionals.

Other Oddities
==============

The whole project consists of around seven simple classes - two for ASF, one for VC1, two for "getting and putting" bits, a "file cache" and a Log function (the last of which isn't even really a class - just a global function).

The log() function was part of an unfinished plan to easily allow multiple "verbosity levels", either to the screen or to built-in "logging" without the necessity of redirecting console output. That was never really finished, and the entire function could probably be replaced with a simple "printf" or even completely eliminated if you're only interested in the final result.
As I recall, the current implementation does distinguish between "serious" ("fatal") errors and "informational" output, sending the former to stderr and the latter to stdout, so the existing version will display critical error info on the screen even if the output is being redirected (redirection being the only current mechanism for "logging"). Beyond that, this simple function can largely be ignored or replaced.

The CFileCache2 class is a simple "ring buffer" memory mapping of the file being read. The intent was to have a "black box" which would transparently reload from the physical file only in large chunks, when needed, allowing fast processing even when files would otherwise have to be parsed with multiple one or two byte reads. It maintains its own virtual internal file pointer, and keeps that aligned with the physical file pointer. Besides supporting single-byte "fgetc()" type calls and larger "fread()" type calls, it has a number of utility functions, such as getting the next WORD or DWORD or GUID. This simplifies coding by returning the value in the form needed and advancing the pointer automatically.

It was to become a better version of the original CFileCache() class I already use in my GSpot app. It's OK as it currently exists in this app, except the version here is "unfinished" insofar as it lacks any functions that aren't specifically needed for this app (e.g. it has a "get next little-endian DWORD" - the NextDWordLe() mentioned above - but no corresponding NextDWordBe() for possible use in other file formats). Again, I could add some of this in less time than it's taking me to type this, but I've decided to leave this current source distribution untouched. Furthermore, it doesn't support reads larger than the buffer size, and in this app the small reads are confined to parsing the ASF header; after that it starts demuxing in relatively large chunks.
In a very "un-object oriented" way, the current code simply bypasses the whole cache after the initial parse and uses the physical file handle thereafter.

And lastly, if you're actually examining it, you'll notice that it always reloads only 7/8 of the ring buffer when it becomes exhausted, not the whole thing. This may seem odd, but is quite intentional and was indeed of some use in the original version. What it meant is that you could actually perform relatively efficient, limited byte-by-byte *backwards* searches as well, because a "reload" created a memory map which contained values both ahead of and, to a lesser extent, *behind* the current virtual pointer. It's not a bad idea, but the feature is unused in this app.

Anyway, this is all just informational. For this app, the entire cache probably could be eliminated, since there aren't that many "small reads", I bypass it later anyway, and the "reverse" function is unused. But the various special "Getxxx()" functions are nice, and they'd have to be replaced if the class were dispensed with. My plan is to finish it up and use it for a variety of file formats in the next GSpot. Right now, it's probably best left as is.

ASF Weirdnesses
===============

The current code handles ASF files with multiple bitstreams - it automatically selects the one with the highest bitrate (a command line override to select a specified stream should really be added). In retrospect, I probably should have just specified that the user only attempt to demux single stream files. But I kept running into multi-stream files, and the program always picked the wrong one. I now know I was "running into them" because it's easy to make them "accidentally" with the Windows Media Encoder, and I was inadvertently creating them myself! For the record, if you use it, make sure you only have one checkbox ticked on the Session Properties "Compression" tab, or you'll be adding a lot of extra stuff to your file you may never see.
These multiple bitrate files are apparently intended as source material for Microsoft's streaming server, but most regular users would not have a use for such a thing.

Anyway, I only mention all this because of an odd idiosyncrasy of the ASF file format. It's explained in a comment, complete with a "pictorial representation", directly above the function "CAsfStream::ProcASF_Extended_Stream_Properties_Object".

We're looking for information in a structure called the "Stream Properties object". If there is more than one stream, there is more than one such object. So far, so good. But only *one*, apparently the "low-bandwidth" one, is in the normal place. To find the one we typically want, we have to look for a "Header Extension Object". And *inside* that there is an "Extended Stream Properties Object". And following that is a list of additional objects, one of which *may* be another "Stream Properties object" - nested three levels deeper than the original. I don't know how or why the ASF format ended up with such an apparent "hack" for adding additional streams, but it seriously complicates parsing a format which isn't exactly trivial to begin with. Be that as it may, the code is in there and appears to work.

The Good Stuff
==============

Finally, on to something I like discussing. The VC-1 bitstreams are encoded and decoded by way of a table-driven "interpreter", and I'm very happy with the way this worked out. This was previously described so I won't get into it too much now, but if you're looking for the "meat" of the VC-1 parsing, you'll find it in the functions "CVc1Parse::BitStringToTable" and "CVc1Parse::TableToBitString", whose names should be self-explanatory. For reference, the table that, along with the two short loops above, does all the VC-1 work is the main "parser table", "InitSeqAndEntryTbl" (see References).
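What follows is not the author's table or the CVc1Parse code, only a minimal C++ sketch of the "table to bitstring" direction: set a value in a table by name, then write every valid entry back out as bits. The Field layout and the names BitWriter, SetValue and TableToBits are hypothetical, and skip conditions are left out to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct Field {              // hypothetical, much simplified table entry
    const char* name;
    int         bits;       // number of bits the value occupies
    uint32_t    value;
    bool        valid;      // entries that are not valid are not written
};

class BitWriter {           // appends bits MSB first to a byte buffer
public:
    BitWriter() : nbits_(0) {}
    void Put(uint32_t v, int n) {        // write the low n bits of v
        for (int i = n - 1; i >= 0; --i) {
            if (nbits_ % 8 == 0) bytes_.push_back(0);
            if ((v >> i) & 1u)
                bytes_.back() |= static_cast<uint8_t>(0x80u >> (nbits_ % 8));
            ++nbits_;
        }
    }
    const std::vector<uint8_t>& Bytes() const { return bytes_; }
private:
    std::vector<uint8_t> bytes_;         // last byte is zero padded
    size_t nbits_;
};

// Insert or override a value by name; returns false if no such entry.
bool SetValue(std::vector<Field>& table, const char* name, uint32_t value)
{
    for (size_t i = 0; i < table.size(); ++i) {
        if (std::strcmp(table[i].name, name) == 0) {
            table[i].value = value;
            table[i].valid = true;
            return true;
        }
    }
    return false;
}

// "Run" the table in reverse: emit each valid entry in table order.
std::vector<uint8_t> TableToBits(const std::vector<Field>& table)
{
    BitWriter w;
    for (size_t i = 0; i < table.size(); ++i)
        if (table[i].valid)
            w.Put(table[i].value, table[i].bits);
    return w.Bytes();
}
```

In a real header, a field that is made present this way usually also needs its controlling flag set so that a decoder knows to expect it; the sketch leaves that to the caller.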
The table in the source comprises the full specification of possible values that may be included in Sequence and Entrypoint headers, so expanding the capabilities of this program should be a relatively simple matter. If you can figure out a way to get a new value you want in the bitstream headers, just call "CVc1Parse::SetTableValue" with the name of a parameter and its value at the appropriate time, and then the aforementioned "TableToBitString()" function will include your new value in the resulting headers.

I'm going to expand this table concept (right now it only handles two "comparison operators" - "equal" and "not equal" - but it would be trivial to add more). And in theory the table, which now only conditionally jumps forwards, could be made to loop and perform other complex parsing functions, and perhaps even to call a custom function when required (by adding a function pointer column right to the table). As it stands now, custom functions are needed in only around three cases. The BitStringToTable() and TableToBitString() functions, which would ideally be completely generic loops, are each hard coded to determine when that part of the table is reached, and then call a custom function. But with a little more work, I hope to have a generic table-driven parsing mechanism that I can use in a wide variety of circumstances. That would hopefully organize and reduce the huge amount of custom code in programs like GSpot, which read data defined by a wide variety of specifications.

References
==========

ASF: Almost everything you need to know is contained in Microsoft's "Advanced Systems Format (ASF) Specification", currently available for free download at http://www.microsoft.com/windows/windowsmedia/forpros/format/asfspec.aspx .
VC-1: Everything you need to know is available in the main SMPTE VC-1 specification document, "SMPTE STANDARD 421M-2006: VC-1 Compressed Video Bitstream Format and Decoding Process". Unfortunately, this 470 page document is not available for free; you have to purchase it from SMPTE, and indeed it's quite expensive. While I own a copy, it would obviously be illegal for me to make it available. But it's not illegal for me to divulge information from it, and everything required by this program is included right in the source code - often in nicely organized tables. Note, for example, that near the top of VC1Parse.cpp are tables showing which bit codes are assigned to which framerates, PAR values, etc. And, of course, the entire "structure" of the Sequence and Entrypoint headers is implied by my main "parser table", "InitSeqAndEntryTbl", located in the file VC-1Parse.h.

License
=======

/* ***************************************************************************************

The MIT License

Copyright (c) 2007 Steven G. Greenberg

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

******************************************************************************************* */

Addendum and Contact Info
=========================

As a final addendum, please note that I've just spent several hours writing up this document to go along with the release of the source code. For expediency, I haven't spent extensive time proof-reading it, so please take that into account.

Please direct all comments, bugs, suggestions, etc. to steve (at) headbands.com, and make sure to at least include the name "ASF2VC1" somewhere in the subject line. I will update my spam filter to automatically allow any such emails.

- Steven Greenberg
  12 December 2007