iommu information (and other hypervisor api doc issues)

I have sent mail to hypervisor@sun.com, but have yet to receive a response.

I am working on porting FreeBSD to sun4v, but the documentation for the IOMMU is lacking. There is no description of what an iotte is, what it looks like, what its bits mean, or how many bits are in an iotte. If it's a tte, then just say tte instead of iotte.

The HyperVisor doc references in 20.1.1:

[1] sun4v Bus Binding to Open Firmware

[2] VPCI Bus Binding to Open Firmware

Neither of these is available for download anywhere. Section 21.2 also references a doc [4], but never says what that doc is.

[620 byte] By [therealjmg] at [2008-1-7]
# 1

Well, since Sun doesn't seem to be answering this question, and we wanted to continue w/ the project, a friend did some experiments, and found that the IOMMU documentation is incorrect.

The io_page_list_p argument of the iommu_map call is not what the documentation says it is: "The page mapping addresses are described in the io_page_list defined by the argument io_page_list_p, which is a pointer to the io_page_list. The first entry in the io_page_list is the address for the first iotte, the 2nd entry for the 2nd iotte, and so on."

There is no such thing as an iotte, just an r_addr, and this isn't a double array. Notice how they say the first entry is the address for the first iotte, instead of the first entry IS the first iotte, or even better, the first entry IS the r_addr the entry will be mapped to. The io_page_list_p is an r_addr pointing to an array of r_addrs, each of which is the r_addr that the respective TSB entry will be mapped to.

io_page_list_p r_addr -> dma dest r_addr 0
                         dma dest r_addr 1
                         ...
                         dma dest r_addr n
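
A minimal C sketch of that layout, assuming a physically contiguous, page-aligned DMA buffer. The names `build_io_page_list` and `IO_PAGE_SIZE`, and the 8 KB page size, are assumptions made for illustration and are not taken from the Hypervisor API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed IOMMU page size, for illustration only (8 KB). */
#define IO_PAGE_SIZE 0x2000ULL

/*
 * Fill list[0..npages-1] with the real address of each page of a
 * contiguous DMA buffer that starts at buf_raddr. Entry i is itself the
 * r_addr that the i-th TSB entry will map to; there is no second level
 * of indirection.
 */
static void
build_io_page_list(uint64_t *list, size_t npages, uint64_t buf_raddr)
{
	assert((buf_raddr & (IO_PAGE_SIZE - 1)) == 0); /* page aligned */
	for (size_t i = 0; i < npages; i++)
		list[i] = buf_raddr + i * IO_PAGE_SIZE;
}
```

The r_addr of `list` itself is then what gets passed as io_page_list_p.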

Also, the docs for dma_sync say: "using the direction(s) defined by the argument io_sync_direction."

and io_sync_direction is defined as:

io_sync_direction "direction" definition for pci_dma_sync

A value specifying the direction for a memory/io sync operation, The direction value is a flag, one or both directions may be specified by the caller.

0x01 - For device (device read from memory)

0x02 - For cpu (device write to memory)

which explicitly says one or both directions may be specified. Well, sorry, you can't: you can only specify one direction at a time.
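
A small C sketch of the workaround: split a combined mask into one call per direction. The flag values are the ones quoted above. The `sync_fn` callback type, `sync_each_direction`, and the `fake_dma_sync` stand-in are hypothetical names for illustration, not the real hypervisor call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values quoted from the io_sync_direction definition. */
#define IO_SYNC_DEVICE 0x01 /* for device (device read from memory) */
#define IO_SYNC_CPU    0x02 /* for cpu (device write to memory) */

/* Hypothetical callback type standing in for the real dma_sync call. */
typedef int (*sync_fn)(uint64_t raddr, uint64_t size, int direction);

/*
 * Issue one sync call per direction set in mask.
 * Returns 0 on success, or the first non-zero status returned by fn.
 */
static int
sync_each_direction(sync_fn fn, uint64_t raddr, uint64_t size, int mask)
{
	static const int dirs[] = { IO_SYNC_DEVICE, IO_SYNC_CPU };

	assert(fn != NULL);
	for (int i = 0; i < 2; i++) {
		if (mask & dirs[i]) {
			int rc = fn(raddr, size, dirs[i]);
			if (rc != 0)
				return rc;
		}
	}
	return 0;
}

/* Stand-in that mimics the observed behaviour: one direction per call. */
static int fake_calls;

static int
fake_dma_sync(uint64_t raddr, uint64_t size, int direction)
{
	(void)raddr;
	(void)size;
	if (direction != IO_SYNC_DEVICE && direction != IO_SYNC_CPU)
		return -1; /* combined or empty mask rejected */
	fake_calls++;
	return 0;
}
```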

It's one thing for documentation to be incomplete (no definition of what an iotte is) or incorrect (io_page_list_p), but it's another thing for Sun to completely ignore requests for help. If Sun wants these forums to be successful, they need to have engineers watching and responding to these requests.

therealjmg at 2007-7-6
# 2

Thanks for your comments; having re-read section 20 of the Hypervisor API spec

I agree it probably could use a narrative section to provide more informative

detail. We'll do that in the forthcoming update. I hope the following will

confirm and explain what you have been seeing:

The Niagara-1 system re-uses a PCI-Express interface chip ("Fire") originally

designed for an earlier family of processors. The J-bus interface on Niagara-1

is specifically to be able to connect to this interface chip.

The Fire chip has two PCI-Express interfaces, each of which is supported by

its own IOMMU to translate both 32-bit and 64-bit PCI bus addresses into host

system memory addresses for J-bus transactions.

Each of the two IOMMUs in Fire contains a TLB for caching frequently used

PCI-bus to host memory address translations. These translations are loaded

on demand by Fire hardware from a lookup table specified in

Niagara's main memory.

The lookup table itself is very simple; the (virtual) address presented

by a PCI device is simply shifted by a specified IOMMU page size, and used as

an index into the translation table. A basic linear table lookup is fine

since it is the operating system that is assigning the PCI bus address space -

making it possible to arrange for a compact linear address map.
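
As a rough C illustration of that linear lookup (not the Fire hardware's actual logic): the index is just the bus address shifted right by the page shift. Subtracting a window base `dvma_base` first is an assumption added here so the table can start at index 0; `iotsb_index` and both parameters are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Index of the translation entry for a PCI bus address: the offset from
 * the (assumed) base of the DMA window, shifted right by the IOMMU page
 * shift. Pass dvma_base = 0 for a plain shift of the bus address.
 */
static uint64_t
iotsb_index(uint64_t pci_addr, uint64_t dvma_base, unsigned page_shift)
{
	assert(pci_addr >= dvma_base);
	return (pci_addr - dvma_base) >> page_shift;
}
```

With 8 KB pages (shift of 13), a bus address 0x6000 bytes into the window selects entry 3.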

However, because Fire was originally architected without virtualization in mind,

each in-memory translation table contains IO translation table

entries (IO TTEs) that specify physical bus addresses.

Therefore we could not allow the guest operating system to

own and manage these tables directly - instead the Hypervisor must

check and validate the specified address mapping for each IO TTE to ensure

that the guest has not specified an illegal mapping - in order, for example,

to DMA to/from hypervisor memory, or other guest OS' memory.

Furthermore, there is no hardware mechanism to interrupt the Niagara

processor and allow us to "fault-in" a missing translation, (as we do with

the CPU's virtual memory translations), so all the PCI->host memory

address translations have to be fully specified in the in-memory translation

table before IO DMA can begin. This prevented a "shadow" page-table design.

Consequently, these IOMMU translation tables are defined in hypervisor-private

memory and the hypervisor API is used to cause the Hypervisor to fill

in each IO TTE entry. This in turn means that we keep the IO TTE format

hypervisor private also. The guest OS doesn't need to know the TTE format, which leaves the hardware designers free to change the format and layout of the

translation table in the future and the Hypervisor can maintain compatibility

in software. The Hypervisor API is optimised for the kind of behaviour

Solaris wants to perform for each DMA operation - namely map a list of

pages with the same permissions.

So to set up the IO translations required for a DMA transfer, the Hypervisor

API allows the guest OS to identify which translation table it wishes to use

(the abstraction allows for more than one table, and more than one IO TLB).

The remaining arguments specify the starting index in the IO translation

table at which to start populating the mappings, the permissions to

apply to each of the mappings, and a pointer identifying the list of pages

to use for each of the mappings.

Section 20.3 describes each of the parameters in more detail; but the tsbid

field really is a tuple of the identifier for the translation table in the

upper 32 bits, and the index within the table in the lower 32 bits.

(By the way, TSB = Translation Storage Buffer, another name for translation

table. The distinction is simply that these tables are linear rather than

requiring a hierarchical walk like x86 virtual-memory page tables.)
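
A short C sketch of the tsbid tuple as described: table identifier in the upper 32 bits, index in the lower 32 bits. The helper names `make_tsbid`, `tsbid_table`, and `tsbid_index` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a translation table identifier and an entry index into a tsbid. */
static uint64_t
make_tsbid(uint32_t tsb_num, uint32_t index)
{
	return ((uint64_t)tsb_num << 32) | index;
}

/* Upper 32 bits: which translation table. */
static uint32_t
tsbid_table(uint64_t tsbid)
{
	return (uint32_t)(tsbid >> 32);
}

/* Lower 32 bits: index within that table. */
static uint32_t
tsbid_index(uint64_t tsbid)
{
	return (uint32_t)(tsbid & 0xffffffffu);
}
```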

For the map API the Hypervisor will try to create and insert as many IO TTEs

as possible into the identified IO TSB using the information given. However, to avoid

problems like priority inversions while in the Hypervisor, the API

may return before building all the IO TTE mappings. In this event

the guest OS may simply call the API again for the remaining mappings.
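
A hedged C sketch of that retry loop. The `map_fn` signature is a hypothetical stand-in for the map call (in particular, the real API takes the page list as an r_addr rather than a C pointer), and `fake_map` is a toy that maps at most two pages per call so the loop has something to retry against.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the map call: tries to map npages entries
 * starting at tsbid, stores the number actually mapped in *nmapped, and
 * returns 0 on success.
 */
typedef int (*map_fn)(uint64_t tsbid, size_t npages, int attrs,
    const uint64_t *page_list, size_t *nmapped);

/* Keep calling fn until every page is mapped or an error occurs. */
static int
map_all(map_fn fn, uint64_t tsbid, size_t npages, int attrs,
    const uint64_t *page_list)
{
	size_t done = 0;

	assert(fn != NULL);
	while (done < npages) {
		size_t n = 0;
		/* Index is in the low 32 bits, so adding advances it. */
		int rc = fn(tsbid + done, npages - done, attrs,
		    page_list + done, &n);

		if (rc != 0)
			return rc;
		if (n == 0)
			return -1; /* no progress; avoid spinning forever */
		done += n;
	}
	return 0;
}

/* Toy hypervisor: an 8-entry table, at most two mappings per call. */
static uint64_t fake_tsb[8];
static int fake_map_calls;

static int
fake_map(uint64_t tsbid, size_t npages, int attrs,
    const uint64_t *page_list, size_t *nmapped)
{
	size_t idx = (size_t)(tsbid & 0xffffffffu);
	size_t n = npages < 2 ? npages : 2;

	(void)attrs;
	for (size_t i = 0; i < n; i++)
		fake_tsb[idx + i] = page_list[i];
	*nmapped = n;
	fake_map_calls++;
	return 0;
}
```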

The dma_sync API is intended to allow for possible future hardware that

may not be cache-coherent for DMA transactions. For example, an application

may write data into a software buffer which sits as dirty lines, for example,

in a write-back level 2 cache. With non-coherent DMA a device

read of the buffer memory may return the incorrect contents of main memory

without the L2 cache snooping and returning the correct data for each

memory read. On Niagara this is essentially a no-op since DMA is coherent,

but the API is provided to enable binary compatibility of kernels

from one sun4v platform to another.

With that in mind, the flag argument was originally intended to enable the

sync to be specified for both directions simultaneously. However, in practice

the operation typically needs to be invoked at different times: the sync for

a device prior to a DMA read, and the sync for the cpu after a DMA write

completes. So in practice we have had no need to specify both operations

in one sync API call - which is why the issue you raise was not fixed

with the Niagara-1 hypervisor. We'll correct this in the next release

of the HV specification.

AshleySaulsbury at 2007-7-6