Although we try to avoid mentioning specific companies in these guidelines, probably the most widely used method for capturing spoken addresses is the Nuance advanced dialog module for addresses (see: Nuance Address).
Whether you build your own or use the Nuance grammars, keep in mind that address grammars need to be updated regularly to stay in sync with real addresses. The United States Postal Service is generally a good source of address data and provides quarterly updates. Surprisingly and unfortunately, some locales are known to be missing from that data.
As for the caller experience, the usual strategy is to start with ZIP Code collection, then use the ZIP Code to (1) determine the city and state and (2) constrain street numbers and names to those in the indicated area. A separate dialog module is usually provided for the caller to communicate unit or apartment numbers, if applicable.
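The ZIP-first strategy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ZipArea` type, the `ADDRESS_DATA` table, and the function names are hypothetical, and the sample ZIP Code, city, and streets are made-up placeholders rather than real postal data. A production system would compile the candidate list into a recognition grammar instead of checking it in application code.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ZipArea:
    """City, state, and known streets for one ZIP Code (hypothetical structure)."""
    city: str
    state: str
    streets: Dict[str, range]  # street name -> valid house numbers


# Made-up placeholder data standing in for a licensed address database.
ADDRESS_DATA: Dict[str, ZipArea] = {
    "00000": ZipArea(
        city="Exampleville",
        state="XX",
        streets={"MAIN ST": range(1, 500), "OAK AVE": range(100, 300)},
    ),
}


def lookup_zip(zip_code: str) -> Optional[ZipArea]:
    """Step 1: the ZIP Code determines the city and state."""
    return ADDRESS_DATA.get(zip_code)


def street_candidates(zip_code: str) -> List[str]:
    """Step 2: only streets inside the ZIP Code are offered to the recognizer."""
    area = lookup_zip(zip_code)
    return sorted(area.streets) if area else []


def is_valid_address(zip_code: str, number: int, street: str) -> bool:
    """Check a recognized house number and street against the constrained set."""
    area = lookup_zip(zip_code)
    if area is None:
        return False
    valid_numbers = area.streets.get(street.upper())
    return valid_numbers is not None and number in valid_numbers
```

Constraining the street set this way keeps the active vocabulary small, which is the reason the ZIP Code is collected first.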
For improved performance, consider data sources that enable reverse lookup of address information. The caller's ANI is used to search publicly available information for the address linked to the phone number, which is then played back to the caller via TTS for confirmation. Completion rates and handle time both improve with this approach, since recognition of a yes/no response is easier and more accurate than a multi-step speech input process.
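The reverse-lookup flow might look roughly like the following sketch. Everything here is an assumption for illustration: `REVERSE_DIRECTORY` stands in for whatever data source is licensed, the phone number and address are placeholders, and `confirm` and `collect_by_speech` are hypothetical callbacks representing the TTS yes/no prompt and the full multi-step address dialog.

```python
from typing import Callable, Dict, Optional

# Placeholder table standing in for a reverse-lookup data source.
REVERSE_DIRECTORY: Dict[str, str] = {
    "5550100": "12 Main Street, Exampleville",
}


def capture_address(
    ani: Optional[str],
    confirm: Callable[[str], bool],
    collect_by_speech: Callable[[], str],
) -> str:
    """Try ANI reverse lookup first; fall back to the spoken address dialog.

    confirm: plays the candidate address (for example via TTS) and returns
        True if the caller answers yes.
    collect_by_speech: runs the multi-step address collection and returns
        the captured address.
    """
    candidate = REVERSE_DIRECTORY.get(ani) if ani else None
    if candidate is not None and confirm(candidate):
        return candidate
    # No ANI, no match, or the caller rejected the address.
    return collect_by_speech()
```

The fallback matters because the ANI can be missing or blocked, and the address on file may be out of date, so the spoken dialog still has to exist.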
A side note on TTS: although designers typically avoid TTS as much as possible so that callers do not get a robotic experience, playback of address information is one case where it's not realistic to expect to use recorded speech for the entirety of the experience.
Since TTS, speech licenses, and regular grammar updates are required to enable this address automation, it can become quite expensive. With completion rates anywhere from 30% to 90%, depending on how well the solution is implemented and maintained, fairly high call volumes are required for address automation to be cost effective.
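To see why call volume matters, a rough break-even calculation can help. The sketch below assumes a simple model that is not from these guidelines: a fixed monthly cost for licenses and maintenance, and a fixed saving each time a call completes in automation instead of going to an agent. The function name and all parameters are hypothetical, and real cost structures are usually more complicated.

```python
import math


def break_even_calls(
    fixed_monthly_cost: float,
    completion_rate: float,
    saving_per_completed_call: float,
) -> int:
    """Monthly call volume at which savings cover the fixed cost.

    Assumed model: savings = calls * completion_rate * saving_per_completed_call.
    """
    if not 0 < completion_rate <= 1:
        raise ValueError("completion_rate must be in (0, 1]")
    if saving_per_completed_call <= 0:
        raise ValueError("saving_per_completed_call must be positive")
    return math.ceil(
        fixed_monthly_cost / (completion_rate * saving_per_completed_call)
    )
```

Under this model, the required volume is inversely proportional to the completion rate, so a solution completing 30% of calls needs three times the volume of one completing 90% to pay for itself.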