{
"filename": "Embeddings_What_they_are_and_why_they_matter_en",
"output_filepath": "files/test-set-small/output/Embeddings_What_they_are_and_why_they_matter_en.json",
"language": "en",
"chunk_size": 1000,
"token_count": 8705,
"light_model": "ollama/qwen2.5:3b",
"expert_model": "groq/llama-3.3-70b-versatile",
"chunks": {
"chunk_01": {
"text": "So Simon Wilson is going to be our last speaker before our break. So let's delve into the embeddings, what are they, and why they matter. Our distinguished speaker is Simon Wilson today. Beyond creating data sets and open source tool revolutionizing data exploration, Simon has spent time as a JSK Journalism Fellow at Stanford, developing tool rooted in his experience as a data journalist for the UK's Guardian. He's an integral part of Eventbrite after they acquired Lanry, a company he co-founded. Plus, many web developers here might recognize him as the co-creator of Django, their framework. Please join me in giving a warm welcome to Simon Wilson. Okay. Well, good afternoon, PyBay. Today, I'm going to be talking about embeddings, which... And I'm really using this as a kind of excuse to talk about a whole bunch of other stuff as well. I'm going to be demonstrating a variety of tools that I've been building over the past 12 months to six years now to help explore data and prototype things and do quick and interesting experiments. But embeddings are a technology that is adjacent to this whole field of large language models. You know, the technology behind ChatGPT and BARD, all of that kind of stuff, which has been consuming my life over the past 12 months, because I can't tear myself away from how weird and slightly horrifying they are. And embeddings is a sort of small part of that overall slice, which I feel is one of those little things where it's this trick that once you know how to use it, you suddenly realize you can apply it to all sorts of interesting problems. that you might come across. So I'll start with a very sort of high-level idea of what embeddings even are. Embeddings is a trick, and the trick is that you can take a piece of content, in this case we've got a blog entry, and you can turn that piece of content into an array of floating-point numbers. And that's the entirety of the trick. You take content, in whatever shape it is, and you turn it into this array of numbers. The key thing about this array of numbers is that it's a fixed length long. So based on the embedding model that you're using, you will get back 300 floating point numbers, or 1000 floating point numbers, or 1536 floating point numbers, which you can then do stuff with. Because these numbers that come back are actually coordinates. They're a location within a very weird many multi-dimensional space. So if you have 1536 numbers, It's hard to visualize, like here I visualized it in 3D with just three dimensions, but anything that you can do this trick to ends up located somewhere in that space. And the reason that's interesting, and that's useful, is that what's powerful about this is what's nearby. because the embedding vector, this bizarre array of numbers, represents what this model understands about the meaning of that content. And there are hundreds of different models that can do this, but a lot of them can take that and turn it into some kind of semantic meaning. So those numbers represent facts about the world, and colors, and shapes, and concepts, and all of these different things. Nobody really understands what the numbers mean, but we know that if you plot them into this dimensional space, you can start doing interesting things. So I'm going to start with one of the first experiments I did around embeddings, and that's to solve the problem of serving up related content on one of my sites. So this right here is my TIL blog. 
TIL stands for Today I've Learned, and this is a blog where every few days I post up an article about something that I've figured out. And what I love about this as a writer is this is a very liberating format because the only barrier to should I write about it is did I just learn this thing? I don't have to think about is this expanding the frontiers of human knowledge and making an exciting point. No, no, no, this is I figured out a for loop in bash. I'll write about for loops in bash. It's mainly for me. It's like public notes. If it's useful to you as well that's great. But I've got a lot of these now. I've got what 470 of them I think and at the very bottom of each one there's this table, there's this list of related content. And this is entirely automated, and it's actually really good, right?",
"s_ids": [
"S1",
"S2",
"S3",
"S4",
"S5",
"S6",
"S7",
"S8",
"S9",
"S10",
"S11",
"S12",
"S13",
"S14",
"S15",
"S16",
"S17",
"S18",
"S19",
"S20",
"S21",
"S22",
"S23",
"S24",
"S25",
"S26",
"S27",
"S28",
"S29",
"S30",
"S31",
"S32",
"S33",
"S34",
"S35",
"S36",
"S37",
"S38",
"S39",
"S40",
"S41",
"S42"
]
},
"chunk_02": {
"text": "This is an article about geospatial SQL queries, and it says, oh, well, that's probably related to GeoPoly and SQLite, well, it's related to GeoPackage and KNN queries and all of this stuff. There's not much content about GIS stuff on my blog, but all of it came up as related. And I can navigate through and see, okay, mbtiles, that's a tiling thing, that's related to GeoPackage. all of this stuff just starts relating together. And the secret behind this feature is it's done purely using this concept of embeddings. Now, this website here, this TIL website, is actually running on top of one of my major open source projects, which is this thing I've been building called Dataset. And Dataset is a web-based front end for a SQLite database. So you get a bunch of data about anything you like, you stick it in the SQLite database, and you run Dataset on top of it, and you get this interface that shows you the tables that are available, and you can run SQL queries against them and do bits and pieces like that. And one of the features of Dataset is custom templates. So actually, all of this website, it's just a bunch of custom Jinja templates that I put on top of that sort of default application. But because we've got these in the database, we can start running queries. And I've got this table called, I've got a table called TIL, which is all of my content. each of those articles has a path and a title and the URL and the markdown body and that kind of thing. And then I created the separate table called embeddings. And all that does is it maps each of these article titles to this weird blob of numbers. In this case, it's 6,000 bytes of binary data. I tried to compress it down into an efficient format. And I can run a hex function against it. So I can say, OK, turn that into hexadecimal. It still looks like binary data. Or I've got this custom Python function I wrote, llm embed decode, and that actually shows you what these things are. So this right here, it's a binary representation of that list of 1,536 floating point numbers. But again, where this stuff gets fun is when you start using comparisons between those locations to figure out what, in this case, what's related. So I do everything in SQLite and with SQLite and SQL functions these days. This is my sort of default for hacking around with interesting data, because I've been building this software that lets you do this kind of thing. And so I built this custom Python function called llmbedcosine, which runs a cosine similarity just to judge the distance between two vectors. And this is a dataset plugin which adds that as a SQLite function. And now I can call it. So I can say SELECT id, LLEMBED cos, so the cosine distance between the embedding in that table. And here I've just got a little subselect that looks up the embedding of one specific article, that one I started with. Select that as score, orders by score descending. When I run the SQL query, I get back a 1.0 perfect score for the same article, which makes sense. You'd hope that they're exactly in the same spot. And then you can see these distance scores to these other articles that are related. If I order by score and cut off at 10, I get back a decent set of related results. So really, it's actually really simple to pull this together. I've got, what, 470 items in here, so I can just do a SQL query that does literally a brute force comparison of scores between all of them, and it works, and I get back that related content. 
What I actually ended up doing, it was taking about 400 milliseconds to do each of these, so I ended up pre-calculating them. I've got a table here called similarities, which has 4,900 rows, and it's just, for each article, what are the similarity scores for the other sort of top 20, I think, articles that that's related to. And then I've got a bit of Python code that runs a SQL query against that, and a bit of template code that drops that into the template, and that's the entirety of the feature. That's the whole thing. So, The the way the site works. It's actually deployed.",
"s_ids": [
"S43",
"S44",
"S45",
"S46",
"S47",
"S48",
"S49",
"S50",
"S51",
"S52",
"S53",
"S54",
"S55",
"S56",
"S57",
"S58",
"S59",
"S60",
"S61",
"S62",
"S63",
"S64",
"S65",
"S66",
"S67",
"S68",
"S69",
"S70",
"S71",
"S72",
"S73",
"S74",
"S75",
"S76",
"S77",
"S78",
"S79",
"S80",
"S81",
"S82",
"S83",
"S84",
"S85"
]
},
"chunk_03": {
"text": "It's deployed using the cell which is a very inexpensive hosting provider where you can stick state you get the idea with the cell is it's Serverless hosting for stateless projects so you can stick some code on there and it'll run whenever somebody visits your website Traditionally you wouldn't use databases with step with her stateless hosting because the database is a thing that has state you need a like hard disk and backups and all of that but because this website here is read-only, right? It's a blog. Nobody's writing anything to these except when I publish a new story. I can actually publish the SQLite database as part of the application that gets deployed to Vercel. I call this the bake-to-data architectural pattern because you're baking your data into your deployment asset. And then the way that it's actually built is just GitHub Actions, right? This is a GitHub Actions workflow. It runs every time I commit to my repository full of TIL documents. And one of the things it does, it builds everything, it loads it into a SQLite database. It actually generates screenshots for them to use in social media cards as well. And then it runs a little command I wrote that hits the OpenAI API to pull back these embedding vectors and writes them into the database, calculates similarity. It's all just one big build process, but that functionality gives me those related results. If you want to see exactly how this worked, I wrote up a TIL on my TIL website about how my TIL website's embeddings work. It's all here. So this is a very detailed article that shows exactly how all of this stuff works, and also links to some sort of other SQL queries that you can start running. This query is quite fun, actually. This one here, I decided to say, okay, of all of my 470 articles, which are the most similar pairs, right? If you were to calculate the similarity scores between everything, what's the most similar? and it turns out the most similar is running tests against Postgres in a service container and talking to a Postgres service container from inside a Docker container, which are practically the same article, and I wrote them several months apart without remembering that I'd already written about them. And yeah, NanoGP, this is actually kind of cool, right? It's kind of fun being able to see, oh, look at all of the most similar pairings of content that I've got out there. So that's kind of a demonstration of quite how much fun you can start having once you've got this ability to calculate the similarity scores between just arbitrary chunks of text in this case. But where do we actually get these things from? So I just mentioned the OpenAI API. For this particular experiment, what I've been doing is hitting OpenAI's embeddings API, which is one of the easiest APIs I've ever used. You do an HTTP post to v1 slash embeddings. You pass it in some text, what is ShotScraper in this case, but you can give it an entire article of content up here, pass in an API key, and it gives you back that list of floating point numbers. That's kind of cool. There is one catch, which is that OpenAI are still quite a new organization, and they haven't quite figured out the importance of the longevity of these APIs. So a few months ago, they announced that they were shutting down a whole bunch of their models and saying, hey, upgrade to this new model instead. 
one of the models they shut down was their older embedding model, and that means that if you've spent a bunch of money embedding millions of documents worth of content, suddenly those lists of floating point numbers are useless to you, because you don't have access to that model to generate more in the future. So actually, although this works really well, I have slight regrets in having built this feature against this proprietary model that I can't guarantee will keep on working. The good news is, it's really easy to run these models yourself. And I'll talk about that in at length in a moment. The other thing I want to talk about is just to give you a little bit more of an idea about how these things actually work. I mean, this is like modern machine learning language model stuff. So there is nobody on earth who really understands how it works. This is one of the things that's so fascinating and frustrating about this field.",
"s_ids": [
"S86",
"S87",
"S88",
"S89",
"S90",
"S91",
"S92",
"S93",
"S94",
"S95",
"S96",
"S97",
"S98",
"S99",
"S100",
"S101",
"S102",
"S103",
"S104",
"S105",
"S106",
"S107",
"S108",
"S109",
"S110",
"S111",
"S112",
"S113",
"S114",
"S115",
"S116",
"S117",
"S118",
"S119",
"S120",
"S121",
"S122"
]
},
"chunk_04": {
"text": "But there's a demonstration that I think helps a little bit is something that Google Google Research put out, oh wow, 10 years ago now, this was the, I mean, it's an academic paper, so the title is terrible, Efficient Estimation of Word Representations in Vector Space. This is the Word2Vec paper. They created this model called Word2Vec, which could take a single word and turn it into a list of numbers, and they wrote it up. And this was really where the widespread interest in this embeddings technique really started, was this thing 10 years ago. So Word2Vec, here's a demo that somebody built of Word2Vec. You can see this has a JSON file, effectively a JSON file, with a bunch of words, and each word has a list of 300 floating point numbers associated with it. And those numbers try and capture something about the meaning of those words. And there's a really interesting thing you can do with that. You can look up words like Paris and see what are the words that are most similar according to those scores. So France, French, Brussels, Madrid, Rome. We've got a mixture of French things and European cities in here. But the really neat trick is that you can do arithmetic with these things. I can say take the vector for Germany, add Paris and subtract France. What do we get back? And the answer is we get back a location that's closest to Berlin. So something about this model has captured the idea of different nationalities and capitals of countries and so forth, to the point that you can use arithmetic to learn weird sort of numeric facts around the world. That's kind of fascinating. And I think that illustrates a little bit of what's going on under the hood with these things. Although again, like I said, nobody fully understands quite why these things work. Word2Vec, they gave it 1.6 billion words of content and trained up this vocabulary of, I think, about 30,000 words. The models that we're using today, 10 years later, just dwarf that. They're absolutely colossal. But the thing works. We have this trick that we can now use to do interesting things. So let's talk about running these models ourselves. And this is where I get to talk about the other major open source project I'm working on these days, which is this piece of software I've been building called LLM. And I've lost the tab with that. There we go. LLM is a command-line utility and Python library I've been building to manipulate and work with large language models. And it's something you can install. You can pip install LLM. You can install it from Homebrew as well. So you can do brew install LLM. And then you get a command-line app that you can start using to fire things through language models. Out of the box, it can work with the OpenAI API. So you can say, LLM 10 fun names for a pet pelican, and it'll give you 10 fun names for pet pelican. Let's do that. And that's just making an API call direct. Here we go. Paddle, Squawk, Feathers, Flipper, Nibbles. These are quite good. They're quite good names for a pet pelican. All of my examples end up having pelicans in for some reason. But that works. But you can also install plugins for it. So there are plugins for LLM that will add models that can run directly on your laptop. I've got an M2 MacBook Pro here. It can run some pretty decent language models. I've had things run on here that feel like they're getting towards the quality of the ChatGPT model running entirely locally. 
I'm actually running low on battery, so I'm nervous to try one right now because the CPU and the GPU will start crunching for like 15 seconds to get a result out of it, but it does work. So, What I did a few months ago is I extended LLM to add tools for working with embeddings as well. So now you can take my LLM tool, you can install it, you can say pip install LLM, then you can install a plugin for one of these embedding models. There's a library called Sentence Transformers from Hugging Face, which makes it, which sort of opens up a whole world of models that you can run on your own machine. I've got a plugin for it called LLM Sentence Transformers, which you can install.",
"s_ids": [
"S123",
"S124",
"S125",
"S126",
"S127",
"S128",
"S129",
"S130",
"S131",
"S132",
"S133",
"S134",
"S135",
"S136",
"S137",
"S138",
"S139",
"S140",
"S141",
"S142",
"S143",
"S144",
"S145",
"S146",
"S147",
"S148",
"S149",
"S150",
"S151",
"S152",
"S153",
"S154",
"S155",
"S156",
"S157",
"S158",
"S159",
"S160",
"S161",
"S162",
"S163",
"S164",
"S165",
"S166",
"S167",
"S168",
"S169",
"S170",
"S171",
"S172",
"S173",
"S174",
"S175",
"S176"
]
},
"chunk_05": {
"text": "Then you can register a embedding model, which will download that model onto your computer. Here I'm registering the AllMiniLM L6 V2 model. These things all have very catchy names. But then once you've done that, you can run commands on your computer that will embed content and store those embeddings locally for you to do interesting things with. There's a command called embedmulti, which takes the name of a collection. So I'm going to say, create me a collection of embeddings called readmes. run them using the Sentence Transformer's MiniLM one, and then look for every file in my home directory which matches **readme.md, and find all of those files, and run them through the embeddings models, and then store that in a SQLite database on my computer. And so I've done that, and as a result, I can now, I've now got this database of, it turns out, how many is it? It turns out there are a lot of READMES on my computer. So I've got an embeddings collection. Where are we? So I've got a collection called READMES. There were 16,796 README files on my machine. This took, I think, about half an hour to run, but it worked. And now I can see for each of those READMES, I've stored the full content of it, but I've also got this weird magic embedding number that I can then use to start running searches. So, let's do that right now. I'm going to run another command that I wrote called llmsimilar. I seem to have lost it. Here we go. llmsimilar readme is \"-c\", and I'm going to say I want things that are related to SQLite backups. The \"-c\", means take this content from this string. And that dumped out a whole bunch of stuff. If I pipe it through jq and say, just give me the IDs, here are the top 10 results of readmes on my computer that relate to the concept of SQLite backups. This is good, actually. SQLite dump, there's a repair tool. These are all decent results for the thing at the bottom, which is a backup of my blog that uses SQLite as well. So this worked. And what's interesting about this is that It's not guaranteed that the term backups appeared in those readme texts itself, but there's a way of thinking about this. Sometimes we call it somatic search. It's sort of vibe-based search, right? The vibes of those readmes related in this weird multidimensional space version of meaning of words, they ended up somewhat similar to this concept of a SQLite backup. Absurdly useful. If anyone's built a search engine for a website, you know, you build it as full-text search, and then none of your users use it. They use Google instead because Google built a better search engine than you did, because Google are better at search, right? But when you start messing around with this kind of stuff, it almost feels like we can start building that sort of better level of search ourselves, right? Like exact matching search is useful if you're searching for function names. For a lot of the search problems that we want to solve, you don't really need exact matching. And so actually this idea of semantic search is incredibly powerful. So, another tool that I built is a tool which was, again, originally as part of my explorations into language models. I built this tool called Symbex. And the idea with Symbex is I wanted a way to see the Python functions and classes, the Python symbols in my code base really easily. So I can say things like, Symbex-S, and it'll output just the signatures. So this is classes and functions and so forth. 
I can say Symbex dash dash function and get back just a list of all of the functions that exist in my code base. And I'd already built this tool when I was building my embedding stuff and I realized, hang on a second, what if Symbex could grew the ability to output like JSON or CSV representing the things that are found, and then I could pipe those into my embedding tool and generate embeddings for all of the functions in my code base. And so that's exactly what I've done here. I've got a embeddings database that has embeddings, this time using a brand new model called GTE tiny, which is only about 60 megabytes.",
"s_ids": [
"S177",
"S178",
"S179",
"S180",
"S181",
"S182",
"S183",
"S184",
"S185",
"S186",
"S187",
"S188",
"S189",
"S190",
"S191",
"S192",
"S193",
"S194",
"S195",
"S196",
"S197",
"S198",
"S199",
"S200",
"S201",
"S202",
"S203",
"S204",
"S205",
"S206",
"S207",
"S208",
"S209",
"S210",
"S211",
"S212",
"S213",
"S214",
"S215",
"S216",
"S217",
"S218",
"S219",
"S220",
"S221",
"S222"
]
},
"chunk_06": {
"text": "Some of these things are actually quite small, but this is embeddings of all of the functions in my main project. And now I can do searches. So I can say things like list plugins, and it'll do an embedding of the term list plugins and compare that to these pre-calculated embeddings of all of the functions. And sure enough, the top result is a function called plugins, which is a click thing that lists all of my plugins. There's one called get plugins that does the same thing. I've now got vibe-based search against a code base. And this is something I knocked out in sort of 30 seconds. I ran a couple of commands and I was up and running and now I can start doing this as well. So really the key idea here is When you've got SQLite as your central substrate, anything you can get into SQLite, you can use these other tools with. And you've got command line tooling that can be piped together. You can start building some really sophisticated combinations of these things. Here I've got one tool that can output JSON representing all of my functions. I've got another tool that can take that JSON, run it through an embeddings model, and store those embeddings. And then I've got my dataset interface here that lets me run those searches on top. And it all ties together into I mean, right now I've got semantic vibe-based search against code, but you can imagine pretty much anything else that you could pipe through the same process, you could do the same kinds of things with. Which leads me to, I think, to my current favorite embeddings model, which is this thing called Clip. So, Clip is actually an open AI. This is back when open AI were doing things in the open. They released this thing for anyone to use. You can actually download Clip. This was January 2021. and Clip is a embeddings model that can do two things. It can embed text, so you can give it the word dog and it'll give you back a list of numbers, and it can embed images. You can give it a photograph of a dog and it will give you back a list of numbers, but the magic is that those numbers, they exist in the same vector space and the text for the word dog will end up in a similar location to a photograph of a dog, which is wildly exciting and kind of confusing when you start thinking about it originally. So I built this demo. This is actually running the clip model directly in the browser, because a lot of these models have been ported to JavaScript now as well. And so here what I can do is I can upload a, well, I can open up an image on my computer. I open up this image of a beach, and then I can give it, oops, and then I can give it text. So I can say, let's see, is this similar to the word city? And it says the similar to score to the word city is 22%. To beach is 29 percent. To beach sunny is 29 percent. If I add California, it goes up to 30 percent. This is a beach in California, I don't know if it definitely knew that. But it's kind of fascinating, right, because this is my browser doing all of this work. It's taking the text here, it's turning that into a weird vector of floating point numbers, doing this cosine similarity distance between that and the image, and giving me back that score. Out of the box, this is kind of useless, right? There's not much It's not particularly useful to look at a photograph and go, how similar is that to the word, I don't know, chaos theory? It's not very similar to the word chaos theory, but that didn't really help me that much. 
But of course, the trick is when you start building additional interfaces on top of this, using this to find photographs that are similar to other photographs, or doing this sort of vibes-based search against them. And that's what a friend of mine did. This is Drew Brunegg, who hangs around on the dataset discord and plays with all sorts of projects in that space. He was renovating his bathroom and he needed to buy faucets for his bathroom and being a nerd he ended up scraping 20,000 photographs of faucets from a faucet supplier and running clip against them using my LLM clip clip tool and he built this.",
"s_ids": [
"S223",
"S224",
"S225",
"S226",
"S227",
"S228",
"S229",
"S230",
"S231",
"S232",
"S233",
"S234",
"S235",
"S236",
"S237",
"S238",
"S239",
"S240",
"S241",
"S242",
"S243",
"S244",
"S245",
"S246",
"S247",
"S248",
"S249",
"S250",
"S251",
"S252",
"S253",
"S254",
"S255",
"S256",
"S257",
"S258",
"S259",
"S260",
"S261",
"S262",
"S263",
"S264"
]
},
"chunk_07": {
"text": "This is called Faucet Finder and what this lets you do is it lets you find a really, if you find a really expensive faucet that you love, you can use this and say, okay, find similar faucets to this really expensive one. And if you're lucky, it'll come up with some cheap ones that have the same kind of vibes as your expensive faucet. I love this. Like it's, it's such a beautifully niche thing to build. And I mean, look at this, like the similarities are actually pretty great. But of course the really fun trick with this is that you can now do text search against faucets. And he's running this, his demo runs on Dataset. This is, If you hack around the URL, you can find his database of all of these embeddings. And so I built this. This is an observable notebook that hits his API. I've set up an API that can do clip text embeddings. It compares against his API. So now I can do things like search for bird, and I will get back, fingers crossed. Ooh, I hope this works. This is a previously cached search for... There we go. Look at this. There are faucets that look like a bird. That's amazing. I don't know why you always get boring ones come up. I think these faucets are so average that whatever calculation you run, they somehow end up in there. But yeah, or you can say terror and get really frightening looking faucets. There we go. That one right there, I think that is quite a frightening faucet. Or if you search for gold, you'll get the gold ones. It works. We now have vibes-based search for faucets. Let's do Nintendo. I think that one comes up quite well as well. Yep, there we go. That one right there has definitely got a bit of a Nintendo 64 vibe going on. It's amazing, right? We can now apply this to all sorts of weird and wonderful things in our lives. And again, this isn't very difficult to do. Once you know the trick, once you know that you can grab this model, you can download it, you can, in this case, I've got support in my LLM tool for just running a command line script or running a Python function that will give you back these embeddings. Once you've got them, you store them somewhere, and then the trick is to just compare them in that way. There are a bunch of other fun things you can do with these. One thing that's kind of fun, you can use them for clustering, because again, they exist in locations in this weird multidimensional space. I wrote a plugin called LLM Cluster that lets you say, okay, cluster, in this case, all of the issues that have been reported against my LLM project. And when you cluster, you say basically, Give me 10 clusters. That's kind of frustrating. I want it to pick the right number. I've not figured out how to do that yet, but it'll cluster them into 10 clusters, and the clusters do end up being kind of similar, right? There's a cluster here that appears to be things about different command line options I was running. A fun trick I did with this is I added an option where you can ask it for a summary, and it will then take each of those clusters, feed them through a language model like GPT-4, and use that to generate a heading for those clusters, which is kind of neat. So you end up with Like these ones are relating to continuing the conversation mechanism and management, all of that kind of stuff. Another thing you can do is you can take those 50, that weird dimensional space, and you can run a thing called PCA, which I forget what it stands for, Principal Components Analysis or something. Yes, which reduces the dimensions. 
So in this case, Matt Webb ran embeddings against every episode of a BBC podcast. and use that to reduce them all the way down to two dimensions. And now you can hover over and say, okay, the 30 years war, the Indian mutiny, the battle, this is all war-y ones. And over here, you've got science in the 20th century, the physics of time, chaos theory, relativity, those are the sort of like academically scientific ones.",
"s_ids": [
"S265",
"S266",
"S267",
"S268",
"S269",
"S270",
"S271",
"S272",
"S273",
"S274",
"S275",
"S276",
"S277",
"S278",
"S279",
"S280",
"S281",
"S282",
"S283",
"S284",
"S285",
"S286",
"S287",
"S288",
"S289",
"S290",
"S291",
"S292",
"S293",
"S294",
"S295",
"S296",
"S297",
"S298",
"S299",
"S300",
"S301",
"S302",
"S303",
"S304",
"S305",
"S306",
"S307",
"S308",
"S309",
"S310",
"S311",
"S312",
"S313",
"S314",
"S315"
]
},
"chunk_08": {
"text": "And it kind of works, like it's pretty amazing that reducing, I think this was the 1500 dimensions, just two, still gives you clusters of things that are kind of meaningful when you start scanning through them. One more trick. Amelia Wattenberger wrote up a brilliant idea where she was trying to do an analysis of text that people were writing to help them write better by saying, okay, try and write, try and use, try and use, some have a difference between your concrete terms and your abstract terms in terms of sentences. How do you Calculate if a sentence is concrete or abstract. You come up with a list of 20 concrete sentences, a list of 20 abstract sentences, embed them, calculate the average of each of those things, and then in a new input you can say, okay, which of those extremes is it closest to? And you can even turn that into a color scheme. So here's like sort of a color scale of how close a sentence is to the sort of average of these previously picked things. So you can use this for Categorization for picking topics things there's all sorts of applications of this kind of technique here as well It's kind of fun, right? There's some once you once you understand how to use these things There's a surprising array of problems that you can start pointing them to and I will finish with one last demo which is Why I got interested in beddings in the first place and that's this idea of using them to answer questions Well answer questions about content. It's this idea called retrieval augmented generation and In my case, what I did is I built a thing against my blog that can answer questions using data from my blog. So I can say things like, what is Shotscraper, which is a piece of software I wrote a couple of years ago, and it'll tell me what Shotscraper is as a paragraph of English text. The way it does that is, firstly, it looks for all of the paragraphs of my blog that are similar to the question that was asked. and then it cobbles them all together into a block of text, sticks them through GPT-4 or LLAMA2 on my laptop or whatever, sticks the question at the bottom, says okay, answer this question using this context that I've found of relevant content. The super interesting thing about this one is this is another example of one of these embedding models. This is a thing called E5LargeV2, terrible name, but what this lets you do is it lets you embed two types of sentence. You can have Sentences that are passages, so that's like passage, colon, a paragraph of text from my blog. And then you can have sentences that are queries, which is a question that somebody is answering. The reason you do that is if you want to answer somebody's question, the similarity between a question like, what is Shotscraper? That might not match exactly to a sentence that tells you what Shotscraper is, because they're different sort of ways of discussing the world. But this embedding model has been trained to know that Query colon is a question, passage colon is something full of facts. Plot those into the same space such that a question that is likely to be answered by a passage of text will end up in the same spot. Weed's trick, it totally works. Like I've got a thing, I now have a script that can run on my laptop where I can say, what is shot scraper? Oops. And it will, there's actually like, I've had this work completely offline, no internet connection at all. 
using models that are running locally, and it gives me back really good answers for questions that are being answered directly, but answered using, in this case, 18,000 paragraphs of text that I pulled in from my site. So, this is kind of cool, right? There's a lot of really neat things you can do with this. I will be turning my notes from this talk into a very detailed write-up with links to source code and examples and things that you can play with. So please check out my website in probably a couple of days, simonwilson.net. I'll have all of that information for you. And yeah, I think I've got some time for questions. Thank you, Simon. All right, any questions? Here we go. Let's go. Hey, Bhupesh here. So, I think you're also familiar with langchain, which also like vectorizes the words and paragraphs into the embeddings which chapGBT can understand. This is one of the tricks that langchain has. Yeah, langchain...",
"s_ids": [
"S316",
"S317",
"S318",
"S319",
"S320",
"S321",
"S322",
"S323",
"S324",
"S325",
"S326",
"S327",
"S328",
"S329",
"S330",
"S331",
"S332",
"S333",
"S334",
"S335",
"S336",
"S337",
"S338",
"S339",
"S340",
"S341",
"S342",
"S343",
"S344",
"S345",
"S346",
"S347",
"S348",
"S349",
"S350",
"S351",
"S352",
"S353",
"S354"
]
},
"chunk_09": {
"text": "The problem with langchain is that it's huge and it does everything. It could take you a month just to understand everything it did. But yeah, one of the initial capabilities of langchain was almost exactly that demo I showed at the end. The thing where you take content, you do the embeddings vectors, stick them in some kind of vector storage. and then use them to answer questions. And yeah, it's sort of fundamental to a whole bunch of the exciting stuff people are doing around language models at the moment. But as I hope I've just demonstrated, you can use it for all kinds of other things that aren't directly related to the language model stuff as well. Yeah, I really like the part where you kept the flexibility of playing around with embeddings, like with the cluster or with the image or with the text, because Langtian doesn't give you that. It's more of a text. Yeah, I really like that part. My approach is quite, like, LangChain is trying to be one framework that does everything. I'm sort of taking this on from a slightly different approach of lots of, the Unix-style philosophy, lots of little tools that can all speak to each other and solve different parts of the problem. And then I'm using SQLite database files as my sort of central coordination point for this stuff. Yeah, that's the best part. Thanks. Thanks for the session. Hi, thank you for the demonstration. I was wondering, so early on you picked a cosine similarity function, I think it was. I was wondering if you did a lot of playing with changing that and seeing how much that... I've done no playing with that at all. Basically, there are a bunch of different distance functions you can use. Everyone else defaults to cosine, so I went with cosine. And in fact, I've got like five different implementations of cosine similarity. I didn't write any of them. ChatGPT is so good at writing cosine similarity functions. Like, yeah, write one in JavaScript. Now do it in Python. Now do it in Python that decodes this binary format first. It's all just like that, yeah. But yeah, that's um, one of the things that's so interesting about this is there are so many knobs that you can tune. You can tune which distance function you're using, which embedding model you're using, what kind of prompts you're using to answer questions. There's just, and the hardest question in all of this is, getting the exact right set of content to feed to a language model to answer a question. And I feel like we're just getting started figuring out the best approaches for that at the moment. Thank you. More questions? Hi, what do you need to adjust if you have like 1 billion objects? If you have 1 billion objects, what do you need to towards us in your code system. So most of all of them, as I showed today, were just brute force because I had like up to 20,000 and you can brute force 20,000 co-sign similarities really quickly. But yeah, if you want to do this against much larger contents, that's where you want some kind of specialized index. And there's actually a Like, every week there's a new vector database startup launching. That's all these things are, right? Vector databases are just databases that are really good at doing indexed, optimized, cosine similarity style comparisons. I don't think vector databases are\u2026 I'm unconvinced by them. I think what we need to index is vector indexes for our existing databases. There's a couple of options for Postgres these days. SQLite has a\u2026 there's an extension called SQLite VSS that does that. 
There's lots and lots of options here, but yeah, so if you want to do this stuff quickly You need to do a bit more work, but there are dozens of solutions. You can you can select from Have have Have you found your pelican faucet yet I There's no pelican faucet the closest. There's a sort of swan one, but I sent that to my partner, I was like, hey, we should get this for a buff, and she's like, absolutely not. So yeah. Any more questions? Here we go. So you've talked about a lot of interesting things that vector embeddings can do. Is there something that they can't do that you're super excited that maybe in the future they could?",
"s_ids": [
"S355",
"S356",
"S357",
"S358",
"S359",
"S360",
"S361",
"S362",
"S363",
"S364",
"S365",
"S366",
"S367",
"S368",
"S369",
"S370",
"S371",
"S372",
"S373",
"S374",
"S375",
"S376",
"S377",
"S378",
"S379",
"S380",
"S381",
"S382",
"S383",
"S384",
"S385",
"S386",
"S387",
"S388",
"S389",
"S390",
"S391",
"S392",
"S393",
"S394",
"S395",
"S396",
"S397",
"S398",
"S399",
"S400",
"S401",
"S402",
"S403",
"S404",
"S405",
"S406",
"S407"
]
},
"chunk_10": {
"text": "I mean, the thing that got me so excited about Clip is that Clip is multimodal, right? It's images and it's text in the same space. That feels like a fascinating direction for me. There's a Facebook put out one called ImageBind, which can also do audio and various weird 3D formats and things into the same space as the text. That, that's sort of astonishing. But yeah, so the thing I'm most excited about, I also, I like them getting smaller. Like, I want to be able to run them in my browser, like I did earlier with one of those demos. That feels really interesting. And that's a theme for me, just for all of this stuff generally, as it gets smaller, the range of things you can do with it get more interesting. Like the, I demonstrated one earlier, the tiny one that I use for my function lookup, that came out a couple of weeks ago and kind of, it stunned me how good it was considering it's only sort of 60 megabytes. But yeah, so I want to see them get smaller. And I also love it when they, the multimodality I think is really exciting. Worth it. Maybe one more. Okay. All right. Thank you, Simon. Thank you. That was great.",
"s_ids": [
"S408",
"S409",
"S410",
"S411",
"S412",
"S413",
"S414",
"S415",
"S416",
"S417",
"S418",
"S419",
"S420",
"S421",
"S422",
"S423",
"S424",
"S425",
"S426"
]
}
},
"summary": "Embeddings, a technology used in large language models, represent text as numerical vectors, allowing for the understanding of concepts and relationships between them. This technology is utilized in various applications, including semantic searches, content recommendation, and text analysis. Researchers and developers, such as Simon Wilson, have demonstrated the potential of embeddings in creating systems that serve up related content and identify similar articles based on their similarity scores. The use of embeddings has also been explored in other areas, including geospatial SQL queries, image embedding, and audio incorporation. Overall, embeddings have become a fundamental component in the development of language models and AI frameworks, such as LangChain, enabling advanced capabilities in text and data analysis.",
"keywords": [
"geospatial SQL",
"language models",
"semantic search",
"embedding model",
"code functions",
"clustering",
"related content",
"concrete vs abstract",
"SQLite",
"README files",
"function lookup",
"vibebased search",
"vector databases",
"GeoPoly",
"retrieval augmented generation",
"browser compatibility",
"LLM",
"TIL blog",
"GeoPackage",
"multimodal",
"ImageBind",
"GitHub Actions",
"text search",
"text representation",
"cosine similarity",
"blog question answering",
"text clustering",
"Faucet Finder",
"LangChain",
"OpenAI API",
"Serverless hosting",
"Word2Vec",
"PCA",
"Clip"
],
"toc": [
{
"headline": "Embeddings and Similarity",
"topics": [
{
"summary": "Understanding Embeddings in Data Exploration",
"location": "S4"
},
{
"summary": "Using Embeddings for Similarity Analysis",
"location": "S85"
},
{
"summary": "OpenAI API for Embeddings",
"location": "S86"
},
{
"summary": "Analyzing Text for Clusters and Embeddings",
"location": "S316"
},
{
"summary": "Flexible Embedding Tools",
"location": "S407"
}
]
},
{
"headline": "Data Journalism and Tools",
"topics": [
{
"summary": "Simon Wilson's Background and Achievements",
"location": "S20"
},
{
"summary": "Geospatial SQL Queries Overview",
"location": "S43"
}
]
},
{
"headline": "Serverless and Cloud",
"topics": [
{
"summary": "Bake-to-Data Architecture Pattern",
"location": "S122"
},
{
"summary": "LLM Command-Line Utility",
"location": "S123"
}
]
},
{
"headline": "Search and Information Retrieval",
"topics": [
{
"summary": "Vibe-Based Semantic Search for Readmes",
"location": "S222"
},
{
"summary": "Image-Text Similarity in Browser",
"location": "S223"
},
{
"summary": "Vibes-based search for faucets and other items",
"location": "S265"
}
]
},
{
"headline": "Databases and Storage",
"topics": [
{
"summary": "Specialized Indexing Solutions",
"location": "S355"
}
]
},
{
"headline": "Multimodal and Text Processing",
"topics": [
{
"summary": "Fascinating Multimodal Space",
"location": "S408"
}
]
},
{
"headline": "Browser and Accessibility",
"topics": [
{
"summary": "Smaller and More Accessible",
"location": "S426"
}
]
}
]
}