How are people streaming real time output from models?

Posted by Mephidia@reddit | LocalLLaMA | 1 comment

As the title says: how are people streaming model output through an API? I don't want the request to return the whole output only after generation completes; I want it to emit output continuously as it is generated. Is there some sort of Kafka integration, are people polling the backend over and over, or are people just content to wait several seconds for output?
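For anyone landing here with the same question, one common pattern is to keep a single HTTP response open and write each token to it as soon as it is generated, often framed as Server-Sent Events (SSE). Below is a minimal sketch of just the framing logic, with no web server or model involved. The names `fake_token_stream`, `to_sse`, and `read_sse` are hypothetical, the "model" is a stand-in that splits a string into words, and the `[DONE]` sentinel is an assumption borrowed from the convention some OpenAI-style APIs use, not something required by SSE itself.

```python
import json
from typing import Iterable, Iterator


def fake_token_stream(text: str) -> Iterator[str]:
    """Hypothetical stand-in for a model: yields one 'token' at a time."""
    for word in text.split():
        yield word + " "


def to_sse(tokens: Iterable[str]) -> Iterator[str]:
    """Server side: wrap each token in an SSE frame as soon as it exists."""
    for token in tokens:
        # One "data:" line followed by a blank line is one SSE event.
        yield f"data: {json.dumps({'token': token})}\n\n"
    # Assumed end-of-stream marker (a convention, not part of SSE).
    yield "data: [DONE]\n\n"


def read_sse(frames: Iterable[str]) -> Iterator[str]:
    """Client side: decode frames one by one and yield tokens as they arrive."""
    prefix = "data: "
    for frame in frames:
        payload = frame[len(prefix):].strip()
        if payload == "[DONE]":
            break
        yield json.loads(payload)["token"]


if __name__ == "__main__":
    # Tokens are printed as each frame is consumed, not after the whole
    # response has been built.
    for token in read_sse(to_sse(fake_token_stream("hello streaming world"))):
        print(token, end="", flush=True)
    print()
```

In a real setup the `to_sse` generator would be handed to whatever streaming-response mechanism the web framework provides, and the client would read the response body incrementally instead of waiting for it to finish. No polling or message queue is needed for the basic case.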



1 Comment

gedw99@reddit