
Java SIP AI


Build AI voice agents, automated customer support, AI receptionists, real-time transcription and translation, call summarization, compliance monitoring, voice-driven automation or any other kind of intelligent call handler with the JVoIP SIP library for Java.
JVoIP, our mature Java SIP library, now offers both flexible APIs and a built-in AI connector to bring these capabilities to your projects: SIP calls are routed to the AI and the AI answer is sent back to the SIP peer.

With the new AI features, you can programmatically:

  • Extract raw audio streams from live SIP/WebRTC/phone calls and feed them to any AI service (STS end-to-end speech models or STT/LLM/TTS)
  • Inject the AI-generated speech/text back into the SIP session
  • Execute complex call logic based on LLM decisions, including call transfers, DTMF injection or API calls
  • Run headless on servers for intelligent IVR, chatbots or AI-driven call centers

JVoIP is a SIP library for Java with AI connection capabilities: a Swiss Army knife for Java developers that can also be consumed from any JVM language.
JVoIP has been developed with special care for flexibility, interoperability and backward API compatibility. 
It is compatible with all common VoIP devices, servers and softphones, including Asterisk, FreeSWITCH, OpenSIPS, Twilio, Cisco, and any other SIP based platform, enabling on-premise voice AI implementations for privacy-sensitive sectors (healthcare, finance, legal, military). 

Prerequisites:

  • Java SE JDK/JRE on Windows, Linux, macOS, or any platform with Java support (x86, x64, ARM, PPC)
  • JVoIP: download
  • A SIP server (optional: JVoIP can also act as a server)
  • API key from your chosen AI provider (OpenAI, etc.)

    
Deployment Models: 

  • Client-Side: the JVoIP SIP stack runs on an end-user device (desktop, server, or embedded) and connects to your existing VoIP server to add AI voice capabilities to a custom Java softphone or desktop VoIP application
  • Server-Side (Standalone): JVoIP acts as a SIP UAS or server, listening on a port and answering calls directly, to implement lightweight AI receptionists or any AI integration in environments where you control the full stack
  • Server-Side + VoIP Server: JVoIP registers with (or accepts calls from) an existing PBX or SIP server like FreeSWITCH, Asterisk or OpenSIPS. This model suits production deployments where you want to keep your existing telephony infrastructure and overlay AI services

    
The call-flow for speech-to-speech (S2S/STS) usually looks like this: 
SIP Peer <-> RTP <-> JVoIP <-> WebSocket <-> AI Provider    
Note: if your VoIP server supports WebRTC, you can handle WebRTC-originated calls the same way.
  
There are two ways to implement AI integration with JVoIP:
A) Using the JVoIP built-in AI connector
B) Using media streaming


A) Using the built-in AI connector

For each call (or chat) session an AI connection is started automatically (as specified by the ai_start_at parameter) or on demand (when you call the API_AI function).
The audio from the SIP call is routed to the AI as specified by the ai_media_dir_in parameter (by default, the audio received from the peer is forwarded to the AI).
The audio from the AI is forwarded as specified by the ai_media_dir_out parameter (by default, it is forwarded to the peer SIP endpoint).
You can fine-tune the behavior with the ai_xxx parameters: 

  • ai_enable: enable/disable the built-in AI module. -1: auto (yes if provider/api_key is set or per endpoint), 0: no, 1: yes
  • ai_dir: specify for which calls to use AI. -1: auto (all), 0: undefined, 1: incoming calls, 2: outgoing calls, 3: all calls
  • ai_start_at: specify when to connect to the real-time API. -1: auto (at call connect and on demand), 0: on explicit request only (API_AI), 1: on demand, 2: after announcement playback if any, 3: on call connected, 4: on SDP, 5: on call init
  • ai_media_dir_in: specify which streams to be sent to AI -1: auto (will default to 1), 0: from local API only (not recommended), 1: incoming (text and RTP received to JVoIP from the remote peer), 2: outgoing (text and RTP sent by JVoIP to remote peer), 3: both.
  • ai_media_dir_out: specify where to forward the AI answers: -1: auto (depending on ai_media_dir_in, usually 1), 0: only API/notifications/local streaming (note: sendmedia_aformat with RTP header is not supported here),  1: to peer (forward the answer to the connected peer), 2: playback locally, 3: to peer + local API (unnecessary, because 1 will cover also 0), 4: all
  • ai_media_type_in: specify what kind of media to send to AI. -1: auto (defaults to audio+text), 1: audio only, 2: text only, 3: audio+text
  • ai_media_type_out: specify what kind of media to expect from AI (initial AI answer modalities). -1: auto (def to audio, but text only for text), 0: don't set (avoid this), 1: receive audio only, 2: receive text only, 3: receive audio+text
  • ai_provider: -1: defaults to 1 or depending on the ai_url setting, 0: undefined, 1: OpenAI ChatGPT realtime API, 2: Google Gemini Live API, 3+: others: Deepgram Voice Agent, Amazon AWS Nova Sonic, Azure GPT Realtime API, Azure Voice Live API, Hume EVI, FastRTC, etc.
  • ai_url: optional, set it only if you don’t want to use the default. By default it is derived from ai_provider.
  • ai_api_key: apiKey / API key. Mandatory.
  • ai_model: optional if you don’t want to use the default. for example: gpt-realtime, gpt-realtime-mini, gpt-realtime-2, gpt-realtime-translate, gpt-realtime-whisper, gemini-live-2.5-flash-native-audio, etc
  • ai_voice: ChatGPT voices: alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar
  • ai_language: language output expected from AI (important for translations)
  • ai_answer_on_connect: Answer on connect. Auto-create response when connected. If 0, then you should play some local prompt first from file (on incoming call connected). Possible values: -1 default (yes for ChatGPT, otherwise no), 0: no, 1: yes
  • ai_transcriptions: -1: default (disabled), 0: disable, 1: input transcription (for the speech sent from user to AI), 2: output transcription (for the speech received from AI), 3: both, 4: also partial
  • ai_system_prompt: System prompt / instructions. Example: You are a helpful AI assistant. Be concise and friendly. Say "Hi, how can I help you?" at the beginning of each session.
  • ai_transfer_detect: transfer the call on user intent. -1: auto (yes if ai_transfer_num is set, otherwise no), 0: no, 1: yes
  • ai_record: save conversations to file. -1: default/no, 0: no, 1: yes, record the audio from AI to file
  • ...and many other parameters
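
As a rough illustration of how a set of ai_xxx parameters might be collected before startup, the sketch below keeps them in an ordered map and pushes them through a callback. The AiConfigSketch class, its method names and the chosen values are hypothetical and not part of JVoIP; it also assumes that numeric settings can be passed as strings. In a real application the callback would be something like (k, v) -> jvoip.API_SetParameter(k, v).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical helper (not part of JVoIP): collects ai_xxx settings and applies them in order.
class AiConfigSketch {
    // Builds an example parameter set; the values are illustrative picks from the list above.
    static Map<String, String> defaults(String apiKey) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("ai_provider", "1");        // 1: OpenAI realtime API
        p.put("ai_api_key", apiKey);      // mandatory
        p.put("ai_dir", "1");             // 1: incoming calls only
        p.put("ai_start_at", "3");        // 3: on call connected
        p.put("ai_media_dir_in", "1");    // 1: text and RTP received from the remote peer
        p.put("ai_media_dir_out", "1");   // 1: forward AI answers to the peer
        p.put("ai_transcriptions", "3");  // 3: input and output transcription
        p.put("ai_system_prompt", "You are a helpful AI assistant. Be concise and friendly.");
        return p;
    }

    // Passes every entry to the given setter and returns how many were applied.
    static int apply(Map<String, String> params, BiConsumer<String, String> setter) {
        int count = 0;
        for (Map.Entry<String, String> e : params.entrySet()) {
            setter.accept(e.getKey(), e.getValue());
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in setter that only prints; replace it with the real JVoIP call.
        int n = apply(defaults("YOUR_OPENAI_API_KEY"),
                (k, v) -> System.out.println(k + "=" + v));
        System.out.println(n + " parameters applied");
    }
}
```

Keeping the settings in one map makes it easy to log or diff the effective configuration before the stack is started.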

AI event notifications provide status updates from the built-in AI connector:
SIPNotification.AI ai = (SIPNotification.AI) notification;
switch (ai.getStatus()) {
    case STATUS_CONNECTED:    // WebSocket connected
        break;
    case STATUS_SENDRECV:     // Audio flowing both ways
        break;
    case STATUS_TRANSFER:     // Call transfer initiated
        break;
    case STATUS_DISCONNECTED: // Session ended
        break;
    case STATUS_ERROR:        // Error occurred
        break;
}


Here is a minimal example using JVoIP as an extension on your SIP server: 
    webphone jvoip = new webphone();
    jvoip.API_SetParameter("serveraddress", "YOUR_SIP_SERVER_DOMAIN_OR_IP");
    jvoip.API_SetParameter("username", "SIP_USERNAME");
    jvoip.API_SetParameter("password", "SIP_PASSWORD");
    jvoip.API_SetParameter("ai_api_key", "YOUR_OPENAI_API_KEY");
    jvoip.API_Start();
    // JVoIP is now ready to answer or initiate phone calls with AI    


The built-in AI connector module is fully customizable; contact Mizutech (info@mizu-voip.com) if you have requirements that are not covered by the existing options.

B) Using media streaming

Get the audio stream from JVoIP with media events, forward it to your preferred AI provider (usually via WebSocket) and send the AI answer back to the SIP peer using the API_StreamSoundBuff or API_StreamSoundStream functions.
This means implementing the AI connection yourself for complete flexibility and using only the media streaming capabilities of JVoIP.

Implementation details: 
1. Configure JVoIP for media streaming. Set sendmedia_mode=3 (media events), sendmedia_direction=1 (incoming audio from peer), and sendmedia_aformat=24000 (24 kHz PCM) to match typical AI requirements.

2. Implement a media listener to receive raw audio packets as they arrive:
class MySIPMediaListener extends SIPMediaListener {
    @Override
    public void onMedia(SIPNotification.Media packet) {
        if(packet.getDirection() == DIRECTION_IN && !packet.isEof()) {
            byte[] audio = packet.getData();
            // Forward to your AI module (WebSocket, HTTP, etc.)
            myAIConnector.sendAudio(audio);
        }
    }
}
jvoip.API_SetMediaListener(new MySIPMediaListener());
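
What myAIConnector.sendAudio does depends on the AI provider. As a minimal sketch, assuming the OpenAI Realtime API over WebSocket, which accepts audio as base64 text inside an input_audio_buffer.append event, the packet bytes could be wrapped as shown here. The AudioEventEncoder class is hypothetical and the actual WebSocket send is left out.

```java
import java.util.Base64;

// Hypothetical helper (not part of JVoIP): wraps raw PCM bytes into a JSON text frame.
class AudioEventEncoder {
    // Assumption: the provider accepts base64 audio in an "input_audio_buffer.append"
    // event, as the OpenAI Realtime API does. Other providers use different framing.
    static String toAppendEvent(byte[] pcm) {
        String b64 = Base64.getEncoder().encodeToString(pcm);
        // The standard base64 alphabet needs no JSON escaping, so concatenation is safe here.
        return "{\"type\":\"input_audio_buffer.append\",\"audio\":\"" + b64 + "\"}";
    }

    public static void main(String[] args) {
        byte[] sample = {0, 1, 2, 3}; // stands in for packet.getData()
        System.out.println(toAppendEvent(sample));
    }
}
```

The returned string would be sent as a WebSocket text frame. It is probably safest to hand it to a separate sender thread or queue so that network delays do not block the media callback.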

3. Implement an AI connector module that receives responses from the AI service and sends them back into the call (to the SIP peer) using the API_StreamSoundBuff function:
jvoip.API_StreamSoundBuff(1, line, buff, len, samplerate, true);
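
AI providers often return the answer as base64-encoded PCM fragments of arbitrary size. The sketch below, assuming 16-bit mono PCM at 24 kHz, decodes one fragment and cuts it into 20 ms chunks before handing each chunk to a callback. The PcmChunker class and the PcmSink interface are hypothetical; in a real application the callback could be (buff, len, rate) -> jvoip.API_StreamSoundBuff(1, line, buff, len, rate, true). Whether JVoIP needs fixed-size chunks is not stated here, so the chunking mainly illustrates the buffer arithmetic.

```java
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper (not part of JVoIP): decodes base64 PCM and forwards it in small chunks.
class PcmChunker {
    // Stand-in for the call that streams audio into the SIP session.
    interface PcmSink {
        void write(byte[] buff, int len, int sampleRate);
    }

    // Bytes needed for the given duration, assuming 16-bit (2 bytes per sample) mono PCM.
    static int chunkBytes(int sampleRate, int millis) {
        return sampleRate * 2 * millis / 1000;
    }

    // Decodes the fragment, forwards it chunk by chunk and returns the number of chunks sent.
    static int forward(String base64Audio, int sampleRate, int millis, PcmSink sink) {
        byte[] pcm = Base64.getDecoder().decode(base64Audio);
        int size = chunkBytes(sampleRate, millis);
        int chunks = 0;
        for (int off = 0; off < pcm.length; off += size) {
            int len = Math.min(size, pcm.length - off); // the last chunk may be shorter
            sink.write(Arrays.copyOfRange(pcm, off, off + len), len, sampleRate);
            chunks++;
        }
        return chunks;
    }

    public static void main(String[] args) {
        // 2000 bytes of silence stand in for an AI audio fragment.
        String fragment = Base64.getEncoder().encodeToString(new byte[2000]);
        int sent = forward(fragment, 24000, 20,
                (buff, len, rate) -> System.out.println("chunk of " + len + " bytes at " + rate + " Hz"));
        System.out.println(sent + " chunks forwarded");
    }
}
```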

Download example code from here.

See the JVoIP documentation for more details, especially the "AI Integration" FAQ points.