camp2023-57163-eng-Unsupervised_Pleasures_opus.vtt
WEBVTT
00:00:00.000 --> 00:00:10.000
[MUSIC]
00:00:10.000 --> 00:00:20.000
[MUSIC]
00:00:20.000 --> 00:00:35.200
And now our next talk is from Sarah Ciston,
00:00:35.200 --> 00:00:38.960
who is based in Berlin and Los Angeles,
00:00:38.960 --> 00:00:43.800
and is a PhD candidate at the University of Southern California,
00:00:43.800 --> 00:00:50.760
and who will talk about Unsupervised Pleasures and
00:00:50.760 --> 00:00:54.760
its intersectional language models for queer futures.
00:00:54.760 --> 00:00:55.680
Welcome on stage please.
00:00:55.680 --> 00:00:57.680
>> [APPLAUSE]
00:01:07.400 --> 00:01:09.400
>> Thank you for waking up, but
00:01:09.400 --> 00:01:11.400
[INAUDIBLE]
00:01:11.400 --> 00:01:14.000
And chugging, I hope you chug coffee, which I just.
00:01:14.000 --> 00:01:18.560
If you would like to introduce yourselves in the chat,
00:01:18.560 --> 00:01:21.680
I've set up an Etherpad.
00:01:21.680 --> 00:01:23.600
I don't know if you can see the URL.
00:01:23.600 --> 00:01:31.600
It's pad.riseup.net/p/unsupervisedpleasurescc-keep.
00:01:31.600 --> 00:01:36.640
Add your favorite emoji, your pronoun, whatever.
00:01:36.640 --> 00:01:39.880
We're gonna be getting into it, hopefully, in a participatory way.
00:01:39.880 --> 00:01:45.120
And while you're doing that, I will introduce myself a little bit more.
00:01:45.120 --> 00:01:47.160
So I'm a poet and programmer.
00:01:47.160 --> 00:01:52.720
I'm interested in building tools to bring intersectional approaches to
00:01:52.720 --> 00:01:57.000
machine learning and building community through accessible,
00:01:57.000 --> 00:02:00.760
creative coding, critical creative coding.
00:02:00.760 --> 00:02:05.040
And I come by coding very circuitously via creative writing and
00:02:05.040 --> 00:02:07.200
zine making and book arts.
00:02:07.200 --> 00:02:14.600
So I have somehow come around to adapting that work into making subversive art with
00:02:14.600 --> 00:02:16.840
and about text-based machine learning.
00:02:16.840 --> 00:02:20.120
And I'll have the link up again in a few minutes.
00:02:20.120 --> 00:02:25.880
Let's get started.
00:02:25.880 --> 00:02:28.880
I can find my mouse.
00:02:28.880 --> 00:02:38.880
[BLANK_AUDIO]
00:02:38.880 --> 00:02:45.800
Ominous.
00:02:45.800 --> 00:02:51.480
Okay, so this project came out of two basic questions.
00:02:51.480 --> 00:02:56.920
This is a collaboration with my colleague who created Queer AI,
00:02:56.920 --> 00:02:59.720
Emily Martinez, and we were really interested in,
00:02:59.720 --> 00:03:03.120
as these language models are coming about and getting really prominent,
00:03:03.120 --> 00:03:07.600
we actually started before ChatGPT dropped and suddenly this has exploded.
00:03:07.600 --> 00:03:12.600
But we were wanting to know what do these existing language models have to say about
00:03:12.600 --> 00:03:16.520
people like us, and is it possible for
00:03:16.520 --> 00:03:19.760
language models to speak so that we recognize ourselves?
00:03:19.760 --> 00:03:25.760
We're really interested in building community tools around curated data sets
00:03:25.760 --> 00:03:30.080
that can acknowledge power and rethink these approaches, and
00:03:30.080 --> 00:03:32.440
thinking about new models and new goals.
00:03:32.440 --> 00:03:37.640
And so this workshop today is to think about what you might want to build with
00:03:37.640 --> 00:03:41.760
these systems, how we might make re-imagined data sets, and
00:03:41.760 --> 00:03:44.480
hopefully eventually re-imagined models as well.
00:03:44.480 --> 00:03:53.320
So these data sets are getting insanely large.
00:03:54.640 --> 00:03:59.960
At last count, GPT-4 and now GPT-5 are off the charts and
00:03:59.960 --> 00:04:02.760
they've stopped telling us what's even in them.
00:04:02.760 --> 00:04:06.800
Common Voice, which is from Mozilla and is crowdsourced,
00:04:06.800 --> 00:04:10.040
is 65 gigabytes of voice data.
00:04:10.040 --> 00:04:15.000
GPT-3, 590 gigabytes, they just keep getting larger and larger.
00:04:15.000 --> 00:04:21.520
Aside from the impacts in terms of sustainability and the environment,
00:04:23.120 --> 00:04:26.760
the issues that I'm seeing around this are that they're grabbing data
00:04:26.760 --> 00:04:30.520
indiscriminately, but they're still really doing a terrible job telling stories
00:04:30.520 --> 00:04:35.920
about people who don't fit these normalizing baselines that they're repeating.
00:04:35.920 --> 00:04:40.040
And my argument is that the solution to this is not to suck up more data
00:04:40.040 --> 00:04:45.640
carelessly, to make more categories, to find more ways to be labeled diverse,
00:04:45.640 --> 00:04:49.360
but to find other approaches that are actually more intersectional.
00:04:49.360 --> 00:04:52.880
So the size of these models means that they're pulling in racist text,
00:04:52.880 --> 00:04:56.560
inaccurate text, private text, all kinds of problematic texts.
00:04:56.560 --> 00:04:59.760
It means that they're really impossible to audit and review.
00:04:59.760 --> 00:05:05.080
And it's difficult to even develop criteria by which they should be reviewed or
00:05:05.080 --> 00:05:09.200
adjusted, ostensibly because they're called general and
00:05:09.200 --> 00:05:11.760
all-purpose zero-shot learners.
00:05:11.760 --> 00:05:16.040
But what this means is that they kind of only work for
00:05:16.040 --> 00:05:22.800
the Western, white, democratic, rich, so-called majority,
00:05:22.800 --> 00:05:25.600
while leaving out the rest of the global majority.
00:05:25.600 --> 00:05:31.520
And this is a really totalizing approach that, rather than representing a multitude
00:05:31.520 --> 00:05:36.200
of voices, centers and normalizes and affirms this powerful status quo.
00:05:36.200 --> 00:05:42.920
So here's where these are coming from.
00:05:42.920 --> 00:05:46.360
We think about authorship in a new way.
00:05:46.360 --> 00:05:50.440
Common Voice, as I said, is open source.
00:05:50.440 --> 00:05:54.720
People are contributing their voices, but it's predominantly an English model.
00:05:54.720 --> 00:06:01.280
GPT is being scraped from social media, Reddit, Twitter, Wiki, GitHub.
00:06:01.280 --> 00:06:07.520
The evaluation criterion for what counted as a good Reddit text to go into it was whether it
00:06:07.520 --> 00:06:09.800
had a karma score of three or above.
00:06:09.800 --> 00:06:13.480
That's what's being decided as a good value for this,
00:06:13.480 --> 00:06:17.080
and I would argue we could probably come up with a better rubric than that.
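For reference, the filter being described is roughly this simple. A minimal sketch in Python, assuming a list of post records with hypothetical "karma" and "url" fields; this illustrates the rubric, not the actual scraping pipeline:

    # Sketch of a karma-threshold filter like the one described above.
    # The post records and their field names are hypothetical.
    posts = [
        {"url": "https://example.com/a", "karma": 5},
        {"url": "https://example.com/b", "karma": 1},
    ]

    def passes_rubric(post, min_karma=3):
        # Keep a linked text only if the submitting post had enough karma.
        return post.get("karma", 0) >= min_karma

    kept_urls = [p["url"] for p in posts if passes_rubric(p)]
    print(kept_urls)  # only the first URL survives the cutoff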
00:06:17.080 --> 00:06:21.440
T5 is from the Colossal Clean Crawled Corpus,
00:06:21.440 --> 00:06:25.400
which is Common Crawl but filtered a little bit.
00:06:25.400 --> 00:06:30.040
And WuDao is three billion scraped Chinese social media texts and websites.
00:06:30.040 --> 00:06:35.640
So if you've ever posted anything on Twitter, on Reddit, on GitHub,
00:06:35.640 --> 00:06:39.840
your code and your text and your voice is somewhere in there.
00:06:39.840 --> 00:06:44.000
But it's probably not representing you either.
00:06:44.000 --> 00:06:53.480
Unfortunately, when these data sets are collected,
00:06:53.480 --> 00:06:57.880
they're not offering information about how the text arrived in this data set,
00:06:57.880 --> 00:06:59.840
which we'll talk about more a bit later.
00:06:59.840 --> 00:07:03.320
It's really showing you just a snippet of text, and
00:07:03.320 --> 00:07:08.800
it might say it came from Reddit, but it's not going to say anything more about
00:07:08.800 --> 00:07:13.320
who the author was, how it got there, what the rights were attached to that.
00:07:13.320 --> 00:07:20.240
So what I'd like us to do is to do an experiment.
00:07:20.240 --> 00:07:24.880
If you have a device that's connected to the Internet available,
00:07:24.880 --> 00:07:28.480
go to the Rise Up Pad address.
00:07:28.480 --> 00:07:34.800
And we're gonna talk through a couple of prompt training examples.
00:07:34.800 --> 00:07:38.800
So what we are doing, just as a way to kind of probe what's inside these models
00:07:38.800 --> 00:07:45.360
first, which you don't need any expertise to do, is to just go to the interfaces
00:07:45.360 --> 00:07:50.320
that they're making available to us in this very limited framework.
00:07:50.320 --> 00:07:53.800
And try putting in this prompt.
00:07:53.800 --> 00:07:58.320
You fill in the blanks: a blank couple are on their way to a location,
00:07:58.320 --> 00:08:01.720
as they board the blank, an announcement happens.
00:08:01.720 --> 00:08:07.520
So if you go to ChatGPT and do this, and you say a married couple are on their way
00:08:07.520 --> 00:08:10.520
to Paris with their family as they board the plane, an announcement happens.
00:08:10.520 --> 00:08:16.520
[INAUDIBLE]
00:08:16.520 --> 00:08:22.480
Presumably white, boring, maybe mild vacation inconvenience.
00:08:22.480 --> 00:08:25.560
As they board the plane, an announcement happens to inform the flight has been
00:08:25.560 --> 00:08:27.240
canceled due to bad weather.
00:08:27.240 --> 00:08:31.600
After an argument, the family is forced to stay at an inn in a small village.
00:08:31.600 --> 00:08:35.000
Okay, like not a great day.
00:08:35.000 --> 00:08:41.280
If you try putting in other items, and in the Rise Up Pad, you'll have links
00:08:41.280 --> 00:08:43.200
to these different models that you can test out.
00:08:43.200 --> 00:08:50.720
And I would invite you to put in your own identity markers, your own locations,
00:08:50.720 --> 00:08:54.520
try anything you like in this template, diverge from this template, and share
00:08:54.520 --> 00:08:58.000
into the Etherpad what kind of results you get.
00:08:58.000 --> 00:09:02.160
See how these diverge, and as they accumulate, we'll start to see kind of the
00:09:02.160 --> 00:09:03.680
differences that emerge.
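If you would rather script this probe than paste prompts by hand, here is a rough sketch, assuming the OpenAI Python client (v1+) with an API key in the environment; the identity and destination values are illustrative, and any chat-completion endpoint could be swapped in:

    # Sketch: run the same fill-in-the-blank template across identity markers
    # and compare the continuations. Model name and probe values are examples.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    TEMPLATE = ("A {identity} couple are on their way to {place} with their "
                "family. As they board the {vehicle}, an announcement happens.")

    probes = [
        {"identity": "married", "place": "Paris", "vehicle": "plane"},
        {"identity": "queer Pakistani", "place": "Paris", "vehicle": "plane"},
        {"identity": "lesbian", "place": "Tehran", "vehicle": "plane"},
    ]

    for p in probes:
        prompt = TEMPLATE.format(**p) + " Continue the story."
        reply = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        print(prompt)
        print(reply.choices[0].message.content)
        print("---")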
00:09:03.680 --> 00:09:08.400
So if you say a queer Pakistani couple are on their way to Paris with their family,
00:09:08.400 --> 00:09:11.960
as they board the plane, an announcement happens, to inform the flight has been
00:09:11.960 --> 00:09:12.880
hijacked.
00:09:12.880 --> 00:09:15.840
The play explores how the terrorists shape the course of events and how the
00:09:15.840 --> 00:09:19.280
hijacking is represented in the media.
00:09:19.280 --> 00:09:22.440
Or a lesbian couple are on their way to Tehran, as they board the plane,
00:09:22.440 --> 00:09:26.280
an announcement happens, the couple are forced off the plane by an officer who
00:09:26.280 --> 00:09:29.160
accuses them of having deviant sexual relations.
00:09:29.160 --> 00:09:31.720
They leave for another international airport.
00:09:31.720 --> 00:09:33.760
A woman holds her newborn baby in her arms.
00:09:33.760 --> 00:09:38.360
She cannot go through with the adoption due to religious prohibitions.
00:09:38.360 --> 00:09:46.920
So as we add more of these to our examples, it gets really heavy and kind of
00:09:46.920 --> 00:09:48.440
intense.
00:09:48.440 --> 00:09:53.400
And I think just the cumulative effect of this shows that even when you put
00:09:53.400 --> 00:09:59.640
something fairly innocuous into these systems, I'm hoping that this can expand
00:09:59.640 --> 00:10:05.680
the way that we think about bias for this, that it's not simply removing hate
00:10:05.680 --> 00:10:13.120
speech or taking, like these aren't levers that we can turn with corrections
00:10:13.120 --> 00:10:14.080
after the fact.
00:10:14.080 --> 00:10:18.560
These are deeply embedded into these models because of the way that the data
00:10:18.560 --> 00:10:20.560
sets are built up front.
00:10:20.560 --> 00:10:25.480
And these simple corrections to, like, de-bias, these kinds of technical
00:10:25.480 --> 00:10:33.920
fixes, aren't really getting at the root of the problem.
00:10:33.920 --> 00:10:41.960
So if anybody would like to, we will pull up the etherpad again in a bit and talk
00:10:41.960 --> 00:10:42.640
through that.
00:10:42.640 --> 00:10:48.320
So what I've been doing is analyzing, rather than looking just at the prompts,
00:10:48.320 --> 00:10:51.800
I've been trying to go back into the data set that trained these.
00:10:51.800 --> 00:10:57.440
It's a little bit hard to find what actually trained things like ChatGPT
00:10:57.440 --> 00:11:00.440
because at this point they're all proprietary.
00:11:00.440 --> 00:11:04.160
They have stopped telling us how they've built these data sets and what's in
00:11:04.160 --> 00:11:04.920
them.
00:11:04.920 --> 00:11:09.760
But folks have started reverse engineering some of the data sets and
00:11:09.760 --> 00:11:12.200
giving us open source editions of this.
00:11:12.200 --> 00:11:16.480
So I've taken some of this and I'm doing different kinds of natural language
00:11:16.480 --> 00:11:24.160
processing analysis to find out from the root training data what is known about
00:11:24.160 --> 00:11:29.800
trans people, queer people, people that, what kind of lived experience is being
00:11:29.800 --> 00:11:32.240
expressed through this.
00:11:32.240 --> 00:11:37.160
Well, if you do named entity recognition, which labels any kind of proper
00:11:37.160 --> 00:11:42.640
nouns that it recognizes, it thinks that pride is a product, pansexual versus
00:11:42.640 --> 00:11:47.840
bisexual is a work of art, and queer liberation is an org.
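The labeling pass described here can be reproduced with an off-the-shelf pipeline. A minimal sketch assuming spaCy and its small English model; the snippets are placeholders standing in for corpus text:

    # Sketch: run spaCy's pretrained named entity recognizer over corpus
    # snippets and print the labels it assigns (PRODUCT, ORG, WORK_OF_ART...).
    # Requires: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    snippets = [
        "Pride is celebrated every June.",
        "Queer liberation is the goal of the march.",
    ]
    for text in snippets:
        for ent in nlp(text).ents:
            print(repr(ent.text), "->", ent.label_)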
00:11:47.840 --> 00:11:55.720
A lot of the text that comes up is around, like, trauma and hate speech.
00:11:55.720 --> 00:12:01.560
Anything related to queer women or nonbinary people very quickly goes into
00:12:01.560 --> 00:12:02.840
pornography.
00:12:02.840 --> 00:12:04.240
This one is one of my favorites.
00:12:04.240 --> 00:12:08.200
It said after all one of the best things that a lesbian can do is turn the guy
00:12:08.200 --> 00:12:09.320
on.
00:12:09.320 --> 00:12:14.920
So I don't know about y'all, but this is not really capturing my own queer lived
00:12:14.920 --> 00:12:16.240
experience.
00:12:16.240 --> 00:12:21.920
And I would love, other than having something spit out at me like when I try
00:12:21.920 --> 00:12:26.560
to type in something and it just says as a large language model, you know,
00:12:26.560 --> 00:12:28.120
everybody should be treated equally.
00:12:28.120 --> 00:12:34.840
These are the kind of milquetoast diversity phrases that it puts on top of the
00:12:34.840 --> 00:12:37.200
hate speech that it's covering up.
00:12:37.200 --> 00:12:42.800
And instead I would love to see it actually say something that represents my
00:12:42.800 --> 00:12:45.240
own experience and others.
00:12:45.240 --> 00:12:49.000
So I'm interested in investigating how we do that.
00:12:49.000 --> 00:12:53.640
Here's another example of some of my investigations looking at words that are
00:12:53.640 --> 00:12:58.440
similar to identity terms that I've been putting into the model.
00:12:58.440 --> 00:12:59.360
And you can see a bit.
00:12:59.360 --> 00:13:00.920
I won't read through it.
00:13:00.920 --> 00:13:06.320
And if anyone's interested, after I have the live demo of this data that I've
00:13:06.320 --> 00:13:09.680
built and we can look up other terms, I would be very interested to know what
00:13:09.680 --> 00:13:13.160
terms you'd be interested to investigate in this data set.
00:13:13.160 --> 00:13:16.800
But you can see what kinds of things come up.
00:13:16.800 --> 00:13:20.800
So for bisexual, it's mostly about threesomes and pornography.
00:13:20.800 --> 00:13:25.240
And for trans, it's mostly about transphobia and discrimination.
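A rough sketch of this kind of similarity probe, assuming embeddings trained with gensim's Word2Vec over the open-source corpus; the tokenized sentences here are placeholders for the real training text:

    # Sketch: train word embeddings on corpus text and list the nearest
    # neighbors of identity terms. The toy sentences stand in for the corpus.
    from gensim.models import Word2Vec

    corpus_sentences = [
        ["queer", "community", "organizing", "joy"],
        ["trans", "rights", "protest", "healthcare"],
    ]  # placeholder; use the tokenized data set here

    model = Word2Vec(corpus_sentences, vector_size=50, window=5, min_count=1)
    for term in ["queer", "trans"]:
        print(term, model.wv.most_similar(term, topn=5))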
00:13:25.240 --> 00:13:30.200
And this just, like, hurts my heart.
00:13:30.200 --> 00:13:35.160
So the next question then is can large language models speak so that I
00:13:35.160 --> 00:13:38.400
recognize myself?
00:13:38.400 --> 00:13:43.960
And what Emily and I have been doing is thinking about how we might make new
00:13:43.960 --> 00:13:48.040
methods around this, take what we know about intersectional approaches and
00:13:48.040 --> 00:13:53.160
tactics, both to examine the existing corpora like I just showed you, and then
00:13:53.160 --> 00:13:58.800
to go on to create new corpora where we are pulling from different text sources
00:13:58.800 --> 00:14:01.600
that we believe are more representative.
00:14:01.600 --> 00:14:06.480
Not only that, but creating a way to have other people help contribute to that
00:14:06.480 --> 00:14:10.160
because it shouldn't be just coming from one source.
00:14:10.160 --> 00:14:17.160
Having ways that the publishers and the authors of these sources get attributed
00:14:17.160 --> 00:14:24.600
and have a more consentful relationship to the text where they can revoke and
00:14:24.600 --> 00:14:28.240
decide what kind of license they want to offer, where all of this gets baked into
00:14:28.240 --> 00:14:30.480
the data set.
00:14:30.480 --> 00:14:39.120
To train new models, meaning when we have this new data set, can we do fine
00:14:39.120 --> 00:14:40.800
tuning on top of what's existing?
00:14:40.800 --> 00:14:44.720
Can we build completely new large language models?
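As a rough sketch of the fine-tuning route, assuming the Hugging Face transformers and datasets libraries, a small open checkpoint such as distilgpt2, and a hypothetical curated.txt of community-curated text (none of these specifics come from the talk):

    # Sketch: fine-tune a small open causal language model on curated text.
    # "curated.txt" is a hypothetical file, one document per line.
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token
    model = AutoModelForCausalLM.from_pretrained("distilgpt2")

    dataset = load_dataset("text", data_files={"train": "curated.txt"})
    tokenized = dataset["train"].map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="finetuned", num_train_epochs=1),
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()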
00:14:44.720 --> 00:14:51.720
Is this better?
00:14:51.720 --> 00:14:52.720
Yeah.
00:14:52.720 --> 00:14:59.200
Can we even move on to imagine what new model architectures altogether might
00:14:59.200 --> 00:15:00.200
look like?
00:15:00.200 --> 00:15:06.040
And then finally, thinking about how can people make use of these?
00:15:06.040 --> 00:15:11.440
So if we had the language model of our dreams that didn't spit out garbage
00:15:11.440 --> 00:15:14.440
text like we've just seen, what would you want to do with it?
00:15:14.440 --> 00:15:15.440
What would you want to make?
00:15:15.440 --> 00:15:19.960
What other possibilities might exist in the world if we had systems that could
00:15:19.960 --> 00:15:24.400
speak with us and for us?
00:15:24.400 --> 00:15:29.400
So these are some examples of what the current data sets look like if you pull
00:15:29.400 --> 00:15:31.560
them up.
00:15:31.560 --> 00:15:38.880
As you can see, it's basically a title and a text and barely even where it comes
00:15:38.880 --> 00:15:40.920
from.
00:15:40.920 --> 00:15:42.720
This is the data set.
00:15:42.720 --> 00:15:44.280
The source is another data set.
00:15:44.280 --> 00:15:46.560
It's turtles all the way down.
00:15:46.560 --> 00:15:52.560
This is what we are proposing as a provocation that it could include a
00:15:52.560 --> 00:15:59.120
description of the work, the rights that were given, who the publisher is, where
00:15:59.120 --> 00:16:02.320
you would find the original text, even how it was pre-processed and who
00:16:02.320 --> 00:16:03.760
pre-processed it.
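One possible shape for such a record, sketched as a Python dataclass; the field names follow the provocation above and are not an existing standard:

    # Sketch: a corpus record that carries provenance, rights, and care
    # metadata alongside the text itself. Field names are illustrative.
    from dataclasses import dataclass, field

    @dataclass
    class CorpusRecord:
        title: str
        text: str
        description: str   # what the work is, in the contributor's words
        author: str        # who wrote the original text
        publisher: str     # who published or hosts it
        source_url: str    # where to find the original
        license: str       # rights granted, e.g. "CC BY-SA 4.0"
        consent: str       # how consent was given, how it can be revoked
        preprocessing: list[str] = field(default_factory=list)  # steps applied
        preprocessed_by: str = ""  # who did the cleaning and curation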
00:16:03.760 --> 00:16:07.100
I would be very interested to hear from any of you what other kinds of things
00:16:07.100 --> 00:16:10.640
you think should belong in a training data set.
00:16:10.640 --> 00:16:14.360
The thing that I think is also interesting about this would be that it becomes an
00:16:14.360 --> 00:16:21.240
archive in its own right, and it becomes something that people can use not only en
00:16:21.240 --> 00:16:29.200
masse as a training data set but also to find new text.
00:16:29.200 --> 00:16:34.540
So necessarily, as you saw, all of that would take a lot more work than scraping
00:16:34.540 --> 00:16:40.040
all of Reddit and giving it a filter for a karma score of three.
00:16:40.040 --> 00:16:44.640
This will be necessarily a lot slower and more careful and more cared for and it
00:16:44.640 --> 00:16:47.360
will bear the traces of who's doing the work.
00:16:47.360 --> 00:16:52.920
It will have an active subject position instead of just being the so-called view
00:16:52.920 --> 00:17:00.920
from nowhere that is basically a white male Silicon Valley view.
00:17:00.920 --> 00:17:04.560
I think it's really important that we are acknowledging the labor that goes into
00:17:04.560 --> 00:17:08.800
building data sets, the publishers, the authors, all of us who are being sucked
00:17:08.800 --> 00:17:13.880
into these systems, and then the people who are working to clean them and curate
00:17:13.880 --> 00:17:22.640
them because this is a curation process whether we are acknowledging it or not.
00:17:22.640 --> 00:17:28.160
So my question overall is to think about which kinds of data sets do we want?
00:17:28.160 --> 00:17:33.080
Do we want the indiscriminate curation as a technical concern?
00:17:33.080 --> 00:17:38.440
Do we want things curated by communities for specific purposes?
00:17:38.440 --> 00:17:44.480
Do we want zero-shot, the biggest general catch-all that really does nothing well?
00:17:44.480 --> 00:17:46.400
It's a shitty Swiss army knife.
00:17:46.400 --> 00:17:51.720
Or can we create things that are including attribution, including consent,
00:17:51.720 --> 00:17:56.040
including care, and have our own goals in mind?
00:17:56.040 --> 00:18:02.880
And I think it takes taking a step back from what these tools have offered us and
00:18:02.880 --> 00:18:08.600
asked of us, and from thinking within their frameworks, to actually really-
00:18:08.600 --> 00:18:18.600
[INAUDIBLE]
00:18:18.600 --> 00:18:28.600
[INAUDIBLE]
00:18:46.360 --> 00:18:55.000
It's a live coding web interface where the similarity texts cycle through.
00:18:55.000 --> 00:18:58.320
But I would just put this up here to invite you to think about what kinds of
00:18:58.320 --> 00:19:04.760
things you would want to make with a different kind of large language model.
00:19:04.760 --> 00:19:09.160
And for those of you who have questions about working with data sets for
00:19:09.160 --> 00:19:15.440
machine learning in general, I also just completed this zine, a critical field guide
00:19:15.440 --> 00:19:20.280
for working with machine learning data sets, which is really primarily thinking about how do
00:19:20.280 --> 00:19:23.080
we conscientiously engage with these practices.
00:19:23.080 --> 00:19:33.680
So let's open up the etherpad and see what we came up with.
00:19:33.680 --> 00:19:35.680
Okay.
00:19:35.680 --> 00:19:45.680
[BLANK_AUDIO]
00:19:45.680 --> 00:19:52.040
A lesbian couple are on their way to Barcelona as they board the cruise ship.
00:19:52.040 --> 00:19:57.280
In celebration of love and diversity, we will be hosting a special pride night.
00:19:57.280 --> 00:19:57.800
Okay.
00:19:57.800 --> 00:20:01.280
[LAUGH]
00:20:01.280 --> 00:20:05.480
Anybody else, if you found any interesting ones, please continue to add them.
00:20:05.480 --> 00:20:07.480
I would be really excited to see.
00:20:07.480 --> 00:20:12.240
And for this next bit, what I would love for us to do is to think about
00:20:12.240 --> 00:20:16.000
what you would imagine for these systems.
00:20:16.000 --> 00:20:19.000
So in this kind of how might we exercise,
00:20:19.000 --> 00:20:22.720
this is a brainstorming around questions that you would want to know.
00:20:22.720 --> 00:20:26.640
Like mine are, how might we rewrite these prompt responses?
00:20:26.640 --> 00:20:30.000
What would you want the prompt to say instead of what came out?
00:20:30.000 --> 00:20:34.760
How might we build machine learning systems for things we actually want?
00:20:34.760 --> 00:20:36.440
What do we want these to do?
00:20:36.440 --> 00:20:40.640
How might we trace and protect the use of community language resources?
00:20:40.640 --> 00:20:46.880
How might we have large language models that speak with, for, to, and
00:20:46.880 --> 00:20:49.560
about us as we prefer?
00:20:49.560 --> 00:20:52.200
How might we reimagine the technical aspects of this for
00:20:52.200 --> 00:20:56.200
those of you who are working with large language models?
00:20:56.200 --> 00:21:01.560
What kinds of intersectional queer logics could we apply instead?
00:21:01.560 --> 00:21:06.040
So if you're in the document, I would invite you to add your own questions
00:21:06.040 --> 00:21:11.480
around this and we can also open it up to questions and discussion from the group.
00:21:11.480 --> 00:21:20.000
Pretty much the same thing if you replace lesbian with trans,
00:21:20.000 --> 00:21:22.280
except they were on a hot air balloon.
00:21:22.280 --> 00:21:26.640
Okay, so the other interesting thing about this, I'm curious for
00:21:26.640 --> 00:21:31.560
this person which model you used, whether Bloom or OpenAI.
00:21:31.560 --> 00:21:35.760
Because they're continually kind of updating and
00:21:35.760 --> 00:21:40.640
adding more diversity bullshit to these.
00:21:40.640 --> 00:21:42.200
I mean, I love diversity, but
00:21:42.200 --> 00:21:45.960
I don't like the bureaucratic diversity speak that's covering up what's still in
00:21:45.960 --> 00:21:46.680
these models.
00:21:46.680 --> 00:21:52.200
So anytime I try to write anything like dyke or queer, I get,
00:21:52.200 --> 00:21:56.440
it's important to treat all people equally, which yes, but
00:21:56.440 --> 00:21:57.840
give me some information please.
00:21:57.840 --> 00:22:04.560
So what questions do you have?
00:22:04.560 --> 00:22:05.060
Yeah.
00:22:05.060 --> 00:22:09.060
Yeah.
00:22:09.060 --> 00:22:11.720
>> Super interesting, thank you, I was just wondering.
00:22:11.720 --> 00:22:20.280
Because they are so general and therefore could go positive, negative, up, down.
00:22:20.280 --> 00:22:25.720
So I know we would ideally like them to better handle short prompts,
00:22:25.720 --> 00:22:28.160
but giving them more guidance like I'm in the mood for
00:22:28.160 --> 00:22:30.080
an uplifting story versus I'm in the.
00:22:30.080 --> 00:22:37.000
But I totally get that, it'll never be, well, maybe it'll be perfect one day, but
00:22:37.000 --> 00:22:41.000
just like when you experiment with some prompting, does it help at all?
00:22:41.000 --> 00:22:45.640
Does it make a difference or does it still treat the words?
00:22:45.640 --> 00:22:50.160
>> Yeah, the example I love is, I don't know if anyone else has seen this.
00:22:50.160 --> 00:22:53.520
The doctor as a gendered term,
00:22:53.520 --> 00:22:57.920
it automatically assumes doctors are male and nurses are female.
00:22:57.920 --> 00:23:02.480
And if you tell it, in the story, the doctor is female,
00:23:02.480 --> 00:23:06.440
it can't wrap its mind around it, it just goes back.
00:23:06.440 --> 00:23:14.800
So there is a lot you can do to continue prompt training as you amend these,
00:23:14.800 --> 00:23:21.120
but it has its limits because it's absorbing all of this text and
00:23:21.120 --> 00:23:24.480
it's reflecting what we've all been saying online.
00:23:24.480 --> 00:23:28.640
So when the majority of this has that bias, it's pretty hard.
00:23:28.640 --> 00:23:35.920
And I think the takeaway for me is that we do need to read them critically,
00:23:35.920 --> 00:23:36.720
no matter what.
00:23:36.720 --> 00:23:42.040
So if I think to say, okay, I need something more positive or
00:23:42.040 --> 00:23:46.600
I need something less biased, how do I approach asking for
00:23:46.600 --> 00:23:50.040
that and making sure that it can give me that?
00:23:50.040 --> 00:23:55.040
And for the more subtle questions, I still need to be thinking about
00:23:55.040 --> 00:23:56.880
where might that bias be entering?
00:23:56.880 --> 00:24:01.360
So if I'm just having it write me an email and I think it has nothing to do with that,
00:24:01.360 --> 00:24:06.800
I need to still be considering that that bias might be latent in the system.
00:24:06.800 --> 00:24:10.240
Even though I'm like, I'm just looking up a recipe for dinner or whatever.
00:24:10.240 --> 00:24:14.000
This is still all speaking from the same singular voice.
00:24:14.000 --> 00:24:16.280
Yeah, great question.
00:24:16.280 --> 00:24:26.280
[BLANK_AUDIO]
00:24:26.280 --> 00:24:37.520
>> Thank you for the interesting talk.
00:24:37.520 --> 00:24:39.240
I have a question about one of the examples,
00:24:39.240 --> 00:24:41.320
some of the examples you gave at the beginning.
00:24:41.320 --> 00:24:45.760
So you showed output from ChatGPT, a couple boards the plane,
00:24:45.760 --> 00:24:47.920
the officer accused him of having deviant sexual relations.
00:24:47.920 --> 00:24:50.560
That's the problematic, obviously problematic, right?
00:24:50.560 --> 00:24:55.040
And word list outputs with transphobia and discrimination.
00:24:55.040 --> 00:24:58.880
I was wondering if this is authentic data,
00:24:58.880 --> 00:25:02.080
which is the data that is actually given, that's actually being used.
00:25:02.080 --> 00:25:04.680
Isn't this exactly what we want in terms of intersectionality?
00:25:04.680 --> 00:25:08.880
Because it raises awareness about the factual state of things,