Streaming and longer context lengths for LLMs on Workers AI

Andra Smith
Last updated: November 15, 2023
7 Min Read

Streaming LLMs and longer context lengths available in Workers AI

Contents
  • Server-sent events: a little gem in the browser API
  • Higher precision, longer context and sequence lengths

Workers AI is our serverless GPU-powered inference platform running on top of Cloudflare’s global network. It provides a growing catalog of off-the-shelf models that run seamlessly with Workers and enable developers to build powerful and scalable AI applications in minutes. We’ve already seen developers doing amazing things with Workers AI, and we can’t wait to see what they do as we continue to expand the platform. To that end, today we’re excited to announce some of our most-requested new features: streaming responses for all Large Language Models (LLMs) on Workers AI, larger context and sequence windows, and a full-precision Llama-2 model variant.

If you’ve used ChatGPT before, then you’re familiar with the benefits of response streaming, where responses flow in token by token. LLMs work internally by generating responses sequentially using a process of repeated inference: the full output of an LLM is essentially a sequence of hundreds or thousands of individual prediction tasks. For this reason, while it only takes a few milliseconds to generate a single token, generating the full response takes longer, on the order of seconds. The good news is we can start displaying the response as soon as the first tokens are generated, and append each additional token until the response is complete. This yields a much better experience for the end user: displaying text incrementally as it’s generated not only provides instant responsiveness, but also gives the end user time to read and interpret the text.
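
To make the token-by-token flow concrete, here is a small runnable toy (not from the original post, and not a real model call): each "prediction" simply yields the next word of a canned reply after a short delay, and the consuming loop appends tokens as soon as they arrive.

 // Toy simulation of token-by-token generation: each "prediction" just
 // yields the next word of a canned reply, with a small delay standing in
 // for one inference pass. No real model is involved.
 const cannedReply = ["New", " York", " is", " located", " in", " the", " USA."];

 async function* generateTokens() {
     for (const token of cannedReply) {
         await new Promise((resolve) => setTimeout(resolve, 50));
         yield token;
     }
 }

 (async () => {
     let text = "";
     for await (const token of generateTokens()) {
         text += token;        // append each token as soon as it is generated
         console.log(text);    // a UI would re-render here instead of logging
     }
 })();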

As of today, you can now use response streaming for any LLM model in our catalog, including the very popular Llama-2 model. Here’s how it works.

Server-sent events: a little gem in the browser API

Server-sent events are easy to use, simple to implement on the server side, standardized, and broadly available across many platforms natively or as a polyfill. Server-sent events fill a niche of handling a stream of updates from the server, removing the need for the boilerplate code that would otherwise be necessary to handle the event stream.

                        Easy-to-use   Streaming   Bidirectional
    fetch               ✅
    Server-sent events  ✅            ✅
    Websockets                        ✅          ✅

Comparing fetch, server-sent events, and websockets
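
For reference, here is a minimal sketch (not from the original post) of what an event stream looks like on the wire: each event is a data: line terminated by a blank line, and a Worker can hand-build such a response with standard Web APIs. The payloads below are made up for illustration.

 // Hand-rolled text/event-stream response using standard Web APIs.
 // Each SSE event is a "data: ..." line followed by a blank line.
 export default {
     async fetch(request) {
         const encoder = new TextEncoder();
         const events = [
             'data: {"response":"Hello"}\n\n',
             'data: {"response":" world"}\n\n',
             "data: [DONE]\n\n",
         ];
         const stream = new ReadableStream({
             start(controller) {
                 for (const event of events) {
                     controller.enqueue(encoder.encode(event));
                 }
                 controller.close();
             },
         });
         return new Response(stream, {
             headers: { "content-type": "text/event-stream" },
         });
     },
 };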

To get started using streaming on Workers AI’s text generation models with server-sent events, set the “stream” parameter to true in the input of the request. This will change the response format and MIME type to text/event-stream.

Here’s an example of using streaming with the REST API:

 curl -X POST \
   "https://api.cloudflare.com/client/v4/accounts//ai/run/@cf/meta/llama-2-7b-chat-int8" \
   -H "Authorization: Bearer " \
   -H "Content-Type:application/json" \
   -d '{ "prompt": "where is new york?", "stream": true }'

 data: {"response":"New"}
 data: {"response":" York"}
 data: {"response":" is"}
 data: {"response":" located"}
 data: {"response":" in"}
 data: {"response":" the"}
 ...
 data: [DONE]

And here’s an example using a Worker script:

 import { Ai } from "@cloudflare/ai";

 export default {
     async fetch(request, env, ctx) {
         const ai = new Ai(env.AI, { sessionOptions: { ctx: ctx } });
         const stream = await ai.run(
             "@cf/meta/llama-2-7b-chat-int8",
             { prompt: "where is new york?", stream: true }
         );
         return new Response(stream, {
             headers: { "content-type": "text/event-stream" },
         });
     },
 };

If you want to consume the output event-stream from this Worker in a browser page, the client-side JavaScript is something like:

 const source = new EventSource("/worker-endpoint");
 source.onmessage = (event) => {
     if (event.data == "[DONE]") {
         // SSE spec says the connection is restarted
         // if we don't explicitly close it
         source.close();
         return;
     }
     const data = JSON.parse(event.data);
     el.innerHTML += data.response;
 };

You can use this simple code with any plain HTML page or with complex SPAs built with React or other Web frameworks.
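
If EventSource is not a good fit (for example, when the request needs a POST body or custom headers), a rough alternative is to read the same stream with fetch and parse the data: lines yourself. This is only a sketch: /worker-endpoint and el are the same placeholders as above, and the parsing assumes each chunk contains whole lines.

 // Sketch: consuming the event stream with fetch + ReadableStream instead
 // of EventSource. Simplified parsing; assumes whole "data: ..." lines.
 async function readStream(el) {
     const response = await fetch("/worker-endpoint");
     const reader = response.body
         .pipeThrough(new TextDecoderStream())
         .getReader();
     while (true) {
         const { value, done } = await reader.read();
         if (done) break;
         for (const line of value.split("\n")) {
             if (!line.startsWith("data: ")) continue;
             const payload = line.slice("data: ".length);
             if (payload === "[DONE]") return;
             el.innerHTML += JSON.parse(payload).response;
         }
     }
 }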

This creates a much more interactive experience for the user, who now sees the page update as the response is incrementally created, instead of waiting with a spinner until the entire response sequence has been generated. Try out streaming on ai.cloudflare.com.

Workers AI supports streaming text responses for the Llama-2 model and any future LLMs we add to our catalog.

But this is not all.

Higher precision, longer context and sequence lengths

Another top request we heard from our community after the launch of Workers AI was for longer questions and answers in our Llama-2 model. In LLM terminology, this translates to a higher context length (the number of tokens the model takes as input before making the prediction) and a higher sequence length (the number of tokens the model generates in the response).

We’re listening, and in conjunction with streaming, today we are adding a 16-bit full-precision Llama-2 variant to the catalog, and increasing the context and sequence lengths for the existing 8-bit version.

    Model                           Context length (in)   Sequence length (out)
    @cf/meta/llama-2-7b-chat-int8   2048 (768 before)     1800 (256 before)
    @cf/meta/llama-2-7b-chat-fp16   3072                  2500
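
To put the new limits in context, the sketch below reuses the Worker pattern from earlier but targets the fp16 variant, which accepts longer prompts and can generate longer replies. The max_tokens option used to cap the output is our assumption about the input schema, not something stated in this post.

 import { Ai } from "@cloudflare/ai";

 export default {
     async fetch(request, env, ctx) {
         const ai = new Ai(env.AI, { sessionOptions: { ctx: ctx } });
         // fp16 variant: up to 3072 input tokens, up to 2500 output tokens.
         const stream = await ai.run("@cf/meta/llama-2-7b-chat-fp16", {
             prompt: "Summarize the following article in detail: ...",
             stream: true,
             max_tokens: 2500, // assumed option for capping the sequence length
         });
         return new Response(stream, {
             headers: { "content-type": "text/event-stream" },
         });
     },
 };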

Streaming, higher precision, and longer context and sequence lengths provide a better user experience and enable new, richer applications using large language models in Workers AI.

Check the Workers AI developer documentation for more information and options. If you have any questions or feedback about Workers AI, please come see us in the Cloudflare Community and the Cloudflare Discord.
If you are interested in machine learning and serverless AI, the Cloudflare Workers AI team is building a global-scale platform and tools that enable our customers to run fast, low-latency inference tasks on top of our network. Check our jobs page for opportunities.


Source: cloudflare.com
