=============================================================
ASF2VC1 v1.2: Additional Technical Info and Source Code Notes
=============================================================


Theory of Operation: Abstract
=============================
There's really no such thing as a VC-1 "Elementary Stream" (see "What is a VC-1 
Elementary Stream, Anyway?" later in this document). But ignoring that minor 
issue, I'll continue.

A usable VC-1 ES needs a "Frame Start Code" (a simple 4 byte code) at the 
beginning of each frame. It also needs an "Entry Point Start Code", followed by 
some amount of associated data, prior to each keyframe. Additionally, it needs 
at least one "Sequence Start Code", along with its associated data, before the 
first frame, although it is legal, common, and desirable to simply include the 
pair before every keyframe (ASF2VC1 does that by default).
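
The layout described above can be sketched roughly as follows. This is a 
minimal illustration, not code from the ASF2VC1 source: DemuxedFrame and 
BuildEs are hypothetical names, the start code value 0x0000010D is the Frame 
Start Code from SMPTE 421M Annex E, and the header blob is assumed to already 
carry its own Sequence and Entry Point start codes.

```cpp
#include <cstdint>
#include <vector>

// Frame Start Code: the 00 00 01 prefix followed by the type byte 0x0D.
const uint8_t kFrameStartCode[4] = {0x00, 0x00, 0x01, 0x0D};

// Hypothetical stand-in for one frame pulled out of the ASF container.
struct DemuxedFrame {
    std::vector<uint8_t> data;  // frame payload exactly as stored in the file
    bool isKeyframe;            // keyframe flag reported by the container
};

// Builds the "ES": the Sequence/Entry Point header blob before each keyframe,
// and a Frame Start Code before every frame.
// ASSUMPTION: seqAndEntryHeaders already includes its own start codes.
std::vector<uint8_t> BuildEs(const std::vector<DemuxedFrame>& frames,
                             const std::vector<uint8_t>& seqAndEntryHeaders) {
    std::vector<uint8_t> es;
    for (const DemuxedFrame& f : frames) {
        if (f.isKeyframe) {
            es.insert(es.end(), seqAndEntryHeaders.begin(),
                      seqAndEntryHeaders.end());
        }
        es.insert(es.end(), kFrameStartCode, kFrameStartCode + 4);
        es.insert(es.end(), f.data.begin(), f.data.end());
    }
    return es;
}
```

Feeding it one keyframe and one non-keyframe yields: headers, start code, 
frame, start code, frame.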

A standard demux of a video stream of a WVC1 Advanced Profile .WMV file will 
give us a concatenated series of frames containing VC-1 video data. It would be 
nice if simply demuxing the video gave us a usable and proper "ES", in which 
case this application would be unnecessary. Unfortunately, it doesn't; it lacks 
all three of the above. 

Because the frames are already delimited by the ASF container, Microsoft does 
not bother to add the "frame start code" to the beginning of each frame; I 
assume they feel it's implied and would be redundant. It's just a simple 4-byte 
tag, though, and it's easy to add, but its absence in the demuxed stream is a 
critical problem. With that addition alone, SMPTE would then consider the 
resulting stream a "conformant bitstream" (though not a "picture producing 
conformant bitstream"; both are described later). Too bad - if that's all we 
needed to do, this app would be quite a bit simpler.

To their credit, Microsoft does make some attempt to include the Entry Point and 
Sequence Start codes and their associated data, and it's in the form of a pre-
encoded bit string - "ready to go". Again, presumably to avoid redundancy, they 
only include it once, though. It's "hidden" (i.e. currently undocumented, to my 
knowledge) in the ASF_VIDEO_MEDIA_OBJECT structure 1 byte after the last byte of 
the last documented member, a BITMAPINFOHEADER structure. 

So, while demuxing, simply inserting a Frame Start Code before every demuxed 
frame and also inserting this supplied Sequence Header / Entrypoint Header bit 
sequence, verbatim, before each frame marked as a keyframe would make a usable 
VC-1 ES out of the resulting demuxed data. In fact, I believe that's exactly 
what version 1.0 of this app did, and I believe it did produce a technically 
"picture producing conformant" VC-1 bitstream.

Most modern containers use a "byteless" stream of bits (meaning byte boundaries 
are meaningless) where the data is grouped into fields which are often variable 
length, entirely optional, or both, based on previous values. VC-1 is no 
exception.

This involves writing custom code to tediously "parse out" values, bit-by-bit, 
even if you are only interested in a single bit of info.
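
To give an idea of what that bit-by-bit parsing looks like, here is a minimal 
MSB-first bit reader. It is only a sketch: the BitReader class and its method 
names are made up for this note and are not the "get bits" class in the 
ASF2VC1 source.

```cpp
#include <cstddef>
#include <cstdint>

// Minimal MSB-first bit reader over a byte buffer (hypothetical helper).
class BitReader {
public:
    BitReader(const uint8_t* data, size_t sizeBytes)
        : data_(data), sizeBits_(sizeBytes * 8), pos_(0) {}

    // Reads 'count' bits (0..32) and returns them right-aligned.
    // Bits past the end of the buffer read as zero.
    uint32_t Get(unsigned count) {
        uint32_t value = 0;
        for (unsigned i = 0; i < count; ++i) {
            uint32_t bit = 0;
            if (pos_ < sizeBits_) {
                bit = (data_[pos_ / 8] >> (7 - pos_ % 8)) & 1u;
            }
            value = (value << 1) | bit;
            ++pos_;
        }
        return value;
    }

    // Current position, in bits, from the start of the buffer.
    size_t Position() const { return pos_; }

private:
    const uint8_t* data_;
    size_t sizeBits_;
    size_t pos_;
};
```

Even to reach a single flag, every field in front of it has to be read (or at 
least skipped) with the right width, which is why this kind of code gets 
tedious quickly.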

So here's the good part.  While much of the ASF code is more conventionally 
written, I coded the VC-1 parsing (reading a bitstream) and "un-parsing" 
(creating a bitstream) as a prototype for a new "experimental" table driven 
approach I thought up, which seems to have great promise (for my GSpot app, in 
particular, which must handle dozens of these kinds of specs).

Each entry in the table contains the name of a value (as an ID and as "human 
readable" text), the number of bits the value occupies, a condition, if any, 
upon which to "skip" it, and a place for the value itself (as well as a flag 
designating that value as valid).

The values in the table are all readily accessible: they are constant bit 
length (32 bits) values and can be accessed by simply going to their index. So 
values can be directly extracted or inserted at any time.

The table can be "run" by a simple loop, which, in the absence of any 
overriding "conditionals" within the table, simply "executes" one entry at a 
time and then moves on to the next one (the whole idea has similarities to an 
interpreted computer language). The table can furthermore be run in "parse" mode 
(read a bitstream & fill the table) or in reverse (use values in the table to 
create a bitstream) depending on the loop chosen to run it.
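
A stripped-down sketch of the idea follows. It is not the table or the loops 
from the ASF2VC1 source: the Entry layout, the single "equal" condition, and 
the names RunParse and RunUnparse are simplifications made up for 
illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

const uint32_t kUnset = 0xFFFFFFFFu;  // "initialized to uninitialized"
const int kNoCond = -1;               // row has no skip condition

// One row of the table (illustrative layout only).
struct Entry {
    const char* name;    // human readable name
    unsigned bits;       // field width in bits
    int condIndex;       // earlier row this one depends on, or kNoCond
    uint32_t condValue;  // row is present only if that row equals this value
    uint32_t value;      // parsed (or to-be-written) value
    bool valid;          // true once 'value' holds real data
};

static bool Skipped(const std::vector<Entry>& t, const Entry& e) {
    return e.condIndex != kNoCond &&
           !(t[e.condIndex].valid && t[e.condIndex].value == e.condValue);
}

// "Parse" mode: read a bitstream (MSB first) and fill the table.
void RunParse(const std::vector<uint8_t>& in, std::vector<Entry>& t) {
    size_t pos = 0;
    for (Entry& e : t) {
        e.value = kUnset;
        e.valid = false;
        if (Skipped(t, e)) continue;
        uint32_t v = 0;
        for (unsigned i = 0; i < e.bits; ++i, ++pos) {
            uint32_t bit = 0;
            if (pos / 8 < in.size()) bit = (in[pos / 8] >> (7 - pos % 8)) & 1u;
            v = (v << 1) | bit;
        }
        e.value = v;
        e.valid = true;
    }
}

// "Unparse" mode: walk the same table and emit a bitstream, zero padded
// to a whole number of bytes. Rows without a valid value are left out.
std::vector<uint8_t> RunUnparse(const std::vector<Entry>& t) {
    std::vector<uint8_t> out;
    size_t pos = 0;
    for (const Entry& e : t) {
        if (Skipped(t, e) || !e.valid) continue;
        for (unsigned i = e.bits; i-- > 0; ++pos) {
            if (pos % 8 == 0) out.push_back(0);
            if ((e.value >> i) & 1u) out[pos / 8] |= uint8_t(0x80 >> (pos % 8));
        }
    }
    return out;
}
```

Because every value lives at a known index, changing a field between the two 
passes is a plain assignment (for example t[2].value = 5), and the next 
RunUnparse call picks it up.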

So, here's how it works:

1. The table is first "zeroed out" (well, it's actually "initialized to 
uninitialized", since zero is often a valid value, but that's a technicality). 
The "parser" loop is then run once on the Microsoft supplied "short" VC-1 
headers - the Sequence and Entrypoint headers mentioned above - "reading in" 
any values it finds there into the corresponding locations in the table.

2. The ASF file is then parsed (currently using more conventional, "straight 
line" code) to find values which I've found to be important yet missing in 
Microsoft's "short" VC-1 headers. Most significantly, the framerate is missing.  
But I also parse out the pixel-aspect ratio and perhaps some other stuff.

This is where the program could be easily expanded. The table has a spot for 
every possible value that could exist in the Sequence and Entrypoint headers. 
The app could be modified to grab additional values from the ASF file or, and 
this would be particularly easy, to take them directly from the command line 
and insert them. That way, a value like PAR could be specified even when it's 
not present in the file, or "overridden" when it is. Values like framerate 
(which should always be present) could be overridden as well. 

3. The table is then "unparsed" back to bitstream form. The generated result, a 
new "longer and more informative" bitstream than the original, is saved for use 
during the actual demux process in the next step.

4. Finally the ASF file is demuxed (not a particularly trivial task, by the 
way). A Frame Start Code is inserted prior to every frame, and the "new and 
improved" Entrypoint and Sequence Header pair bitstream is inserted prior to 
every keyframe.

5. VC-1 has certainly upped the complexity level of the previously simple 
question of whether a frame is a keyframe or not (I think 3 of its 
approximately 13 possible frametypes are considered keyframes). In any event, 
the ASF file marks keyframes, and I perform a direct check of the VC-1 frametype 
as well. I believe the debug version of the code will "assert" if the ASF file 
contradicts the statement above (that 3 types are keyframes), and I have yet to 
see an assertion. For more info on this, just run the app in its most verbose 
mode.
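
As a rough illustration of the cross-check in step 5, the sketch below decodes 
the picture type from the first byte of a frame and compares it with the ASF 
keyframe flag. It covers only the simplest case and is not the check used in 
the ASF2VC1 source. It assumes an Advanced Profile stream with no interlace 
(so no FCM field precedes PTYPE), where the PTYPE codes are 0 = P, 10 = B, 
110 = I, 1110 = BI and 1111 = skipped; the function names are hypothetical.

```cpp
#include <cstdint>

enum PictureType { kPicP, kPicB, kPicI, kPicBI, kPicSkipped };

// Decodes the PTYPE code at the very start of a progressive Advanced Profile
// picture header by counting leading 1 bits (at most four).
// ASSUMPTION: INTERLACE == 0, so PTYPE is the first field of the frame.
PictureType DecodePtype(uint8_t firstPayloadByte) {
    int ones = 0;
    while (ones < 4 && (firstPayloadByte & (0x80 >> ones))) {
        ++ones;
    }
    return static_cast<PictureType>(ones);
}

// True when the container's keyframe flag matches what the bitstream says.
// In this simplified progressive-only case, only I pictures count as keyframes.
bool KeyframeFlagsAgree(bool asfSaysKeyframe, uint8_t firstPayloadByte) {
    return asfSaysKeyframe == (DecodePtype(firstPayloadByte) == kPicI);
}
```

A debug build could simply wrap KeyframeFlagsAgree() in an assert for every 
demuxed frame. Interlaced streams add field-pair types and would need a fuller 
decode than this.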


What is a VC-1 Elementary Stream, Anyway?
=========================================
Technically, at the lowest level a VC-1 bitstream consists of a series of 
so-called "RBDUs" (raw bitstream decodable units), of which there are 14 
different types. Each RBDU is prefixed by the start code "0x000001", followed 
by an additional byte that identifies the type of BDU that follows. The problem 
is, the sequence 0x000001 could randomly show up in the middle of an RBDU, 
which would, among other things, make re-synchronization virtually impossible.

So Annex E of SMPTE-421M defines an "encapsulation mechanism", wherein all data 
within the RBDU after the actual start code is "escaped", as programmers 
commonly put it. The resulting BDU is now called an EBDU (you guessed it: 
"encapsulated bitstream decodable unit").  The above annex describes how to 
convert an RBDU into an EBDU and vice-versa. I believe that, up until this 
point, the mechanism described is intended to apply to all VC-1 profiles - 
Simple, Main and Advanced.  In any event, this work is already done for us when 
we demux the ASF file. The ASF2VC1 application does not have to perform any 
"encapsulation". I only mention it to avoid any possible terminology confusion.
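
For the curious, the escaping works along these lines: whenever two zero bytes 
are followed by a byte of 0x03 or less, an extra 0x03 is slipped in front of 
that byte, so the payload can never contain 0x000001. The sketch below is a 
simplified rendering of that rule, with hypothetical function names. It is not 
part of ASF2VC1 (which never needs it) and it ignores the finer points of 
Annex E, such as trailing-byte handling.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// RBDU payload -> EBDU payload: after two consecutive zero bytes, a byte
// of value 0x03 or less gets an escape byte (0x03) inserted before it.
std::vector<uint8_t> Escape(const std::vector<uint8_t>& raw) {
    std::vector<uint8_t> out;
    int zeros = 0;  // consecutive zero bytes written so far
    for (uint8_t b : raw) {
        if (zeros >= 2 && b <= 0x03) {
            out.push_back(0x03);
            zeros = 0;
        }
        out.push_back(b);
        zeros = (b == 0x00) ? zeros + 1 : 0;
    }
    return out;
}

// EBDU payload -> RBDU payload: drop every 0x03 that follows two zero bytes
// and precedes a byte of value 0x03 or less.
std::vector<uint8_t> Unescape(const std::vector<uint8_t>& enc) {
    std::vector<uint8_t> out;
    for (size_t i = 0; i < enc.size(); ++i) {
        bool isEscape = i >= 2 && i + 1 < enc.size() && enc[i] == 0x03 &&
                        enc[i - 1] == 0x00 && enc[i - 2] == 0x00 &&
                        enc[i + 1] <= 0x03;
        if (!isEscape) out.push_back(enc[i]);
    }
    return out;
}
```

Running Unescape(Escape(x)) gives back x, and the escaped form contains no 
0x000001 sequence for a decoder to mistake for a start code.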
 
Unfortunately, a series of concatenated EBDU's is not what we are trying to 
create - we're not nearly done yet. We want a playable concatenation of these 
units, something *we'll* call a VC-1 "Elementary Stream" (hereafter referred to 
as "ES") - a stream that could be decoded and played "by itself", i.e. without the 
benefit of a container. We're using the term Elementary Stream by way of analogy 
with MPEG:  an MPEG Elementary Stream is an encapsulated, containerless video 
stream that can be "played by itself" - just like we want.

Our "ES" is defined in SMPTE-421M Annex G: "Bitstream Construction Constraints - 
Advanced Profile" and is technically called a "picture-producing conformant 
bitstream".

Note: to the best of my knowledge, the SMPTE specification does not appear to 
define a "conformant bitstream", much less its subset, a "picture-producing 
conformant bitstream", for profiles other than the Advanced. The common wisdom 
is, basically, that there is "no such thing". The Simple and Main profiles are 
defined to the point where they can be put in a container, but there does not 
appear to exist a definition for what we're calling an "Elementary Stream" for 
either the Simple or Main profiles. 


Notes About the Source Code
===========================
I had big plans to neaten up this source, add features, remove some unused 
stuff, improve comments & some function and class names, and I thus did not 
release it immediately. At the time I figured I'd release it in another week or 
two. But that was over six months ago, and in the interim I've received several 
requests for it, so I'm releasing the untouched source that was used to build 
ASF2VC1 v1.2, build 20070526, the version that's been posted here since that 
time.

The source code is extremely generic, and should compile "out of the box" using 
Visual Studio VC-2003 or VC-2005.  The binary release that's been posted was 
compiled and "statically linked" using VC-2005 running on Windows Vista.  But 
I've also added a second source code package, for those who prefer, which 
compiles "out of the box" on VC-6. It's basically the same source - just open 
the .dsw or the .dsp instead of the .sln or the .vcproj included in the 
former package. All I had to do to make the VC-6 version was add a few extra 
system #includes, change the prototype for "main()", and one or two other 
trivial things.

Porting
=======
Porting this to another O/S should be a snap. The existing readme for the 
executable says "This application has NO 'system requirements' to speak of... if 
your PC can read and write files, then the app should run fine. It consists of a 
single executable file, written from scratch. No DirectShow, no codecs, no 
SDK's, other apps or anything else is needed." And I would say the statement can 
be roughly applied to porting the source as well.  I haven't actually done it, 
but my guess is that porting to a Linux or similar environment would mostly 
consist of swapping a few CreateFile(), ReadFile() and WriteFile() calls with 
fopen(), fread() and fwrite(). I think I used an MS macro, DEFINE_GUID, but that 
could just be replaced with "const unsigned char myguid[16] = {bytes of some 
GUID}" (a GUID is 16 bytes long), and other issues should be equally simple. 
All required GUID *values* are included in the supplied header files, so you do 
not need to pore through any obscure Microsoft header files. Most are only 
obtainable from the ASF specification doc anyway, and if I discovered any 
undocumented ones that were needed, they're there too.
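
As a sketch of what that could look like on a non-Windows system, here is a 
GUID stored as a plain 16-byte array and compared with memcmp(), plus a file 
check that uses only fopen(), fread() and fclose(). The function names are 
hypothetical and this is not code from the ASF2VC1 source. The GUID shown is 
the ASF Header Object GUID (75B22630-668E-11CF-A6D9-00AA0062CE6C) in the byte 
order in which it appears at the start of an ASF file.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// ASF Header Object GUID, written out in on-disk byte order.
const uint8_t kAsfHeaderGuid[16] = {
    0x30, 0x26, 0xB2, 0x75, 0x8E, 0x66, 0xCF, 0x11,
    0xA6, 0xD9, 0x00, 0xAA, 0x00, 0x62, 0xCE, 0x6C};

// True if the 16 bytes at 'first16' are the ASF Header Object GUID.
bool IsAsfHeaderGuid(const uint8_t* first16) {
    return std::memcmp(first16, kAsfHeaderGuid, 16) == 0;
}

// Opens a file with the C stdio calls and checks its first 16 bytes.
// Returns false if the file cannot be opened or is too short.
bool FileLooksLikeAsf(const char* path) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    uint8_t head[16];
    size_t got = std::fread(head, 1, sizeof head, f);
    std::fclose(f);
    return got == sizeof head && IsAsfHeaderGuid(head);
}
```

Storing GUIDs as raw bytes also sidesteps byte order questions, since the 
comparison is done on the file's bytes directly.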

Re: Byte ordering: If you're porting to a system that uses Motorola (Big Endian) 
byte ordering, I *know* for sure you will have to add a line or so to three 
functions: NextWordLe(), NextDWordLe(), and NextQWordLe() in the file 
FileCache2.cpp. These functions simply read the next two, four or eight bytes 
respectively from the ASF "input stream" and are expected to return a 
corresponding value for your host system. I probably would have thrown in 
ntohs() and ntohl(), but Microsoft specifies that all values in an ASF file 
be in Intel (Little Endian) order (even though it's a "streaming" format 
definition, they stuck to their old AVI ways and rejected the fact that "network 
order" was defined to be Big Endian - but I digress).  I was writing a quick 
piece of code for a Windows (Intel) machine that read a file format defined as 
Little Endian, so I just "cast" it. I could make the code portable right now in 
less time than it's taking to type this, but I want to release the code 
untouched. Please double check, but it seems other similar functions, in 
particular the nearby NextVarlenLe(), don't use "casting" and are inherently 
platform independent. 
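
One portable way to write such readers is to assemble the value from 
individual bytes with shifts, which gives the same result on any host byte 
order. The following is a sketch of that technique only. The function names 
are hypothetical and these are not the functions in FileCache2.cpp.

```cpp
#include <cstdint>

// Each function reads a little-endian value from a byte pointer without
// casting the pointer, so the result does not depend on host byte order.

uint16_t ReadU16Le(const uint8_t* p) {
    return static_cast<uint16_t>(p[0] | (p[1] << 8));
}

uint32_t ReadU32Le(const uint8_t* p) {
    return static_cast<uint32_t>(p[0]) |
           (static_cast<uint32_t>(p[1]) << 8) |
           (static_cast<uint32_t>(p[2]) << 16) |
           (static_cast<uint32_t>(p[3]) << 24);
}

uint64_t ReadU64Le(const uint8_t* p) {
    uint64_t v = 0;
    for (int i = 7; i >= 0; --i) {
        v = (v << 8) | p[i];  // most significant byte (last in file) first
    }
    return v;
}
```

For example, the bytes 78 56 34 12 read as 0x12345678 whether the code runs on 
an Intel or a Motorola style machine.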

Addendum:  I just noticed there's a definition "LITTLE_ENDIAN" near the top of 
"constants.h". Comment that out if it's not applicable to your machine. 
Apparently there are a few other places that are byte order sensitive, though it 
looks like I've put in the appropriate conditionals.

Other Oddities
==============
The whole project consists of around seven simple classes - two for ASF, one for 
VC1, two for "getting and putting" bits, a "file cache" and a Log function (the 
last of which isn't even really a class - just a global function).

The log() function was part of an unfinished plan to easily allow multiple 
"verbosity levels", either to the screen or to have built-in "logging" without 
the necessity of redirecting console output. That was never really finished, and 
the entire function could probably be replaced with a simple "printf" or even 
completely eliminated if you're only interested in the final result. As I 
recall, the current implementation does distinguish between some "serious" 
("fatal") errors and "informational" output, sending the former to stderr and 
the latter to stdout. So the existing version will display critical error info 
on the screen even if the output is being redirected (redirection being the 
only current mechanism for "logging"). Beyond that, this simple function can 
largely be ignored or replaced.
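
If you do want to replace it, something along these lines would reproduce the 
behavior described above. This is a guess at a reasonable replacement, not the 
log() function from the source: the level names, the g_verbosity global and 
the signatures are all hypothetical.

```cpp
#include <cstdarg>
#include <cstdio>

enum LogLevel { kLogFatal = 0, kLogInfo = 1, kLogVerbose = 2 };

// Messages with a level above this are dropped (hypothetical global setting).
static int g_verbosity = kLogInfo;

// Fatal messages go to stderr so they stay visible when stdout is redirected.
std::FILE* LogTarget(LogLevel level) {
    return level == kLogFatal ? stderr : stdout;
}

// printf-style logging. Returns true if the message was actually written.
bool Log(LogLevel level, const char* fmt, ...) {
    if (level > g_verbosity) return false;
    va_list args;
    va_start(args, fmt);
    std::vfprintf(LogTarget(level), fmt, args);
    va_end(args);
    return true;
}
```

With the default setting, Log(kLogVerbose, ...) prints nothing, while 
Log(kLogFatal, ...) always reaches the screen.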

The CFileCache2 class is a simple "ring buffer" memory mapping of the file being 
read. The intent was to have a "black box" which would transparently reload from 
the physical file only in large chunks, when needed, allowing fast processing 
even when files would otherwise have to be parsed with multiple one or two byte 
reads. It maintained its own virtual internal file pointer, and kept that 
aligned with the physical file pointer. Besides supporting single byte "fgetc()" 
type calls and larger "fread()" type calls, it had a number of utility functions 
to get, for example, the next WORD or DWORD or GUID. This simplified coding by 
returning the value in the form needed & advancing the pointer automatically.

It was to become a better version of the original CFileCache() class I already 
use in my GSpot app. It's OK as it currently exists in this app, except the 
version here is "unfinished" insofar as it lacks any functions that aren't 
specifically needed for this app (e.g. it has a "get next little-endian DWORD" - 
the NextDWordLe() mentioned above - but no corresponding NextDWordBe() for 
possible use in other file formats). Again, I could add some of this in less 
time than it's taking me to type this, but I've decided to leave this current 
source distribution untouched.

Furthermore, it doesn't support reads larger than the buffer size, and in this 
app the number of small reads is minimal (while parsing the ASF header), then it 
starts demuxing in relatively large chunks. In a very "un-object oriented" way, 
the current code simply bypasses the whole cache after the initial parse and 
uses the physical file handle thereafter.

And lastly, if you're actually examining it, you'll notice that it always 
reloads only 7/8 of the ring buffer when it becomes exhausted, not the whole 
thing.  This may seem odd, but is quite intentional and was indeed of some use 
in the original version. What it meant is that you could actually perform 
relatively efficient, limited byte by byte *backwards* searches as well, because 
a "reload" created a memory map which contained values both ahead and, to a 
lesser extent, *behind* the current virtual pointer. It's not a bad idea, but 
the feature is unused in this app.
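
The 7/8 reload idea can be illustrated with the much simplified sketch below. 
It is not CFileCache2: it uses a flat window instead of a true ring, it reads 
from an in-memory vector standing in for the physical file, and every name in 
it is hypothetical. It only shows how keeping the last 1/8 of the buffer 
behind the pointer allows limited backwards stepping.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

class WindowCache {
public:
    WindowCache(const std::vector<uint8_t>& file, size_t bufSize)
        : file_(file), buf_(bufSize), start_(0), filled_(0), pos_(0) {}

    // Returns the byte at the virtual file pointer and advances it,
    // or -1 at end of file.
    int Next() {
        if (pos_ >= file_.size()) return -1;
        if (pos_ >= start_ + filled_) Reload();
        return buf_[pos_++ - start_];
    }

    // Moves the virtual pointer back one byte if that byte is still cached.
    bool StepBack() {
        if (pos_ == 0 || pos_ - 1 < start_) return false;
        --pos_;
        return true;
    }

    size_t Tell() const { return pos_; }

private:
    // Keeps up to 1/8 of the buffer behind the pointer and fills the
    // remaining 7/8 (or whatever is left of the file) with fresh data.
    void Reload() {
        size_t keep = std::min(pos_, buf_.size() / 8);
        start_ = pos_ - keep;
        filled_ = std::min(buf_.size(), file_.size() - start_);
        std::copy(file_.begin() + start_, file_.begin() + start_ + filled_,
                  buf_.begin());
    }

    const std::vector<uint8_t>& file_;  // stand-in for the physical file
    std::vector<uint8_t> buf_;          // the cached window
    size_t start_;                      // file offset of buf_[0]
    size_t filled_;                     // number of valid bytes in buf_
    size_t pos_;                        // virtual file pointer
};
```

With a 16 byte buffer, a reload leaves 2 bytes of history, so a short 
backwards search right after a reload never has to touch the file again.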

Anyway, this is just all informative. For this app, the entire cache probably 
could be eliminated, since there aren't that many "small reads", I bypass it 
later anyway, and the "reverse" function is unused. But the various special 
"Getxxx()" functions are nice, and they'd have to be replaced if the class were 
dispensed with. My plan is to finish it up and use it for a variety of file 
formats in the next GSpot. Right now, it's probably best left as is.


ASF Weirdnesses
===============
The current code handles ASF files with multiple bitstreams - it automatically 
selects the one with the highest bitrate (a command line override to select a 
specified stream should really be added). In retrospect, I probably should have 
just specified that the user only attempt to demux single stream files. But I 
kept running into multi-stream files, and it always picked the wrong one. I now 
know I was "running into them" because it's easy to make them "accidentally" 
with the Windows Media Encoder, and I was inadvertently creating them myself! 
For the record, if you use it, make sure you only have one checkbox ticked on 
the Session Properties "Compression" tab, or you'll be adding a lot of extra 
stuff to your file you may never see.  These multiple bitrate files are 
apparently intended when creating files to be used as source material with their 
streaming server, but most regular users would not have a use for such a thing.

Anyway, I only mention all this because of an odd idiosyncrasy of the ASF file 
format. It's explained in a comment, complete with a "pictorial representation", 
directly above the function 
"CAsfStream::ProcASF_Extended_Stream_Properties_Object".

We're looking for information in a structure called the "Stream Properties 
Object". If there is more than one stream, there is more than one such object. 
So far, so good. But only *one*, apparently the "low-bandwidth" one, is in the 
normal place. To find the one we typically want, we have to look for a "Header 
Extension Object". And *inside* that there is an "Extended Stream Properties 
Object". And following that is a list of additional objects, one of which *may* 
be another "Stream Properties Object" - nested three levels deeper than the 
original. I don't know how or why the ASF format ended up with such an 
apparent "hack" for adding additional streams, but it seriously complicates 
parsing a format which isn't exactly trivial to begin with. Be that as it may, 
the code is in there and appears to work.
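
What makes this nesting tractable is that every ASF object, at any depth, 
starts with the same prefix: a 16-byte GUID followed by a 64-bit little-endian 
size that counts the whole object, prefix included. The sketch below splits a 
byte range into the objects it contains. It is an illustration only, with 
hypothetical names, and is not the parsing code from the ASF2VC1 source.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct AsfObject {
    const uint8_t* guid;  // points at the object's 16-byte GUID
    const uint8_t* body;  // points just past the 24-byte GUID + size prefix
    uint64_t bodySize;    // object size minus the 24-byte prefix
};

// Lists the objects laid end to end in data[0..len). Stops quietly at the
// first object whose size field is malformed or runs past the range.
std::vector<AsfObject> ListObjects(const uint8_t* data, size_t len) {
    std::vector<AsfObject> found;
    size_t off = 0;
    while (len - off >= 24) {
        uint64_t size = 0;
        for (int i = 7; i >= 0; --i) {
            size = (size << 8) | data[off + 16 + i];  // little-endian QWORD
        }
        if (size < 24 || size > len - off) break;
        AsfObject obj = {data + off, data + off + 24, size - 24};
        found.push_back(obj);
        off += static_cast<size_t>(size);
    }
    return found;
}
```

To chase the nested "Stream Properties Object", the same function would be 
applied again to the part of an object's body that holds its child objects, 
after skipping that object's own fixed fields as laid out in the ASF 
specification.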

The Good Stuff
==============
Finally on to something I like discussing. The VC-1 bitstreams are encoded and 
decoded by way of a table driven "interpreter", and I'm very happy with the way 
this worked out. This was previously described so I won't get into it too much 
now, but if you're looking for the "meat" of the VC-1 parsing, you'll find it in 
the functions "CVc1Parse::BitStringToTable" and "CVc1Parse::TableToBitString", 
whose names should be self-explanatory. 

For reference, the table which, along with the two short loops above, does all 
the VC-1 work is the "InitSeqAndEntryTbl" mentioned under References. The table 
comprises the full specification of possible values that may be included in 
Sequence and Entrypoint headers, so expanding the capabilities of this program 
should be a relatively simple matter. If you can figure out a way to get a new 
value you want in the bitstream headers, just call "CVc1Parse::SetTableValue" 
with the name of a parameter and its value at the appropriate time, and then 
the aforementioned "TableToBitString()" function will include your new value in 
the resulting headers.

I'm going to expand this table concept (right now it only handles two 
"comparison operators" - "equal" and "not equal" - but it would be trivial to 
add more). And in theory the table, which now only conditionally jumps forwards, 
could be made to loop and perform other complex parsing functions, and even to 
perhaps call a custom function when required (by adding a function pointer 
column right to the table). As it stands now, custom functions are needed in 
only around three cases. The BitStringToTable() and TableToBitString() 
functions, which would ideally be completely generic loops, are each hard coded 
to determine when that part of the table is reached, and then call a custom 
function. But with a little more work, I hope to have a generic table driven 
parsing mechanism that I can use in a wide variety of circumstances. That would 
hopefully organize and reduce the huge amount of custom code in programs like 
GSpot, which read data defined by a wide variety of specifications.


References
==========
ASF: Almost everything you need to know is contained in Microsoft's "Advanced 
Systems Format (ASF) Specification", currently available for free download at 
http://www.microsoft.com/windows/windowsmedia/forpros/format/asfspec.aspx .

VC-1: Everything you need to know is available in the main SMPTE VC-1 
specification document, "SMPTE STANDARD 421M-2006: 
VC-1 Compressed Video Bitstream Format and Decoding Process"

Unfortunately, this 470 page document is not available for free; you have to 
purchase it from SMPTE and indeed it's quite expensive. While I own a copy, it 
would obviously be illegal for me to make it available. But it's not illegal for 
me to divulge information from it, and everything required by this program is 
included right in the source code - often in nicely organized tables. Note, for 
example, near the top of VC1Parse.cpp are tables showing which bit codes are 
assigned to which framerates, PAR values, etc.

And, of course, the entire "structure" of the Sequence and Entrypoint headers is 
implied by my main "parser table", "InitSeqAndEntryTbl", located in the file VC-
1Parse.h.


License
=======

/* ***************************************************************************************
The MIT License

Copyright (c) 2007 Steven G. Greenberg

Permission is hereby granted, free of charge, to any person obtaining a copy of this
software and associated documentation files (the "Software"), to deal in the Software
without restriction, including without limitation the rights to use, copy, modify, merge,
publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons
to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or
substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE
FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.
******************************************************************************************* */

Addendum and Contact Info
=========================

As a final addendum, please note that I've just spent several hours writing up this document
to go along with the release of the source code. For expediency, I haven't spent extensive
time proof-reading it, so please take that into account.

Please direct all comments, bugs, suggestions, etc to steve (at) headbands.com, and make sure to
at least include the name "ASF2VC1" somewhere in the subject line. I will update my spam filter
to automatically allow any such emails.

- Steven Greenberg
  12 December 2007