<!DOCTYPE html>
<meta charset=utf-8>
<title>HTTP Web Services - Dive Into Python 3</title>
<!--[if IE]><script src=j/html5.js></script><![endif]-->
<link rel=stylesheet href=dip3.css>
<style>
body{counter-reset:h1 14}
mark{display:inline}
</style>
<link rel=stylesheet media='only screen and (max-device-width: 480px)' href=mobile.css>
<link rel=stylesheet media=print href=print.css>
<meta name=viewport content='initial-scale=1.0'>
<p id=level>Difficulty level: <span class=u title=advanced>♦♦♦♦♢</span>
<h1>HTTP Web Services</h1>
<blockquote class=q>
<p><span class=u>❝</span> A ruffled mind makes a restless pillow. <span class=u>❞</span><br>—
Charlotte Brontë
</blockquote>
<p id=toc>
<h2 id=divingin>Diving In</h2>
<p class=f>Philosophically, I can describe HTTP web services in 12 words: exchanging data with remote servers
using nothing but the operations of <abbr>HTTP</abbr>. If you want to get data from the server, use <abbr>HTTP</abbr>
<code>GET</code>. If you want to send new data to the server, use <abbr>HTTP</abbr> <code>POST</code>. Some
more advanced <abbr>HTTP</abbr> web service <abbr>API</abbr>s also allow creating, modifying, and deleting
data, using <abbr>HTTP</abbr> <code>PUT</code> and <abbr>HTTP</abbr> <code>DELETE</code>. That’s it. No
registries, no envelopes, no wrappers, no tunneling. The “verbs” built into the <abbr>HTTP</abbr>
protocol (<code>GET</code>, <code>POST</code>, <code>PUT</code>, and <code>DELETE</code>) map directly to
application-level operations for retrieving, creating, modifying, and deleting data.
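<p>To make that mapping concrete, here is a minimal sketch that builds (but never sends) one request per operation with the standard library’s <code>urllib.request.Request</code>. The <code>VERBS</code> table, the <code>build_request()</code> helper, and the <code>example.com</code> address are illustrative assumptions, not part of any real web service.

```python
import urllib.request

# Hypothetical table: application-level operation -> HTTP verb.
VERBS = {
    "retrieve": "GET",
    "create": "POST",
    "modify": "PUT",
    "delete": "DELETE",
}

def build_request(operation, url, body=None):
    """Return an unsent Request that uses the verb for this operation."""
    return urllib.request.Request(url, data=body, method=VERBS[operation])

# Nothing goes over the network here; the requests are only constructed.
for operation in VERBS:
    request = build_request(operation, "http://example.com/entries/1")
    print(operation, "->", request.get_method(), request.full_url)
```

<p>The <code>method</code> argument requires Python 3.3 or later; on older versions, <code>Request</code> only chooses between <code>GET</code> and <code>POST</code> based on whether a body is present.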
<p>The main advantage of this approach is simplicity, and its simplicity has proven popular.
Data — usually <a href=xml.html><abbr>XML</abbr></a> or <a href=serializing.html#json><abbr>JSON</abbr></a> — can
be built and stored statically, or generated dynamically by a server-side script, and all major programming
languages (including Python, of course!) include an <abbr>HTTP</abbr> library for downloading it. Debugging
is also easier; because each resource in an <abbr>HTTP</abbr> web service has a unique address (in the form
of a <abbr>URL</abbr>), you can load it in your web browser and immediately see the raw data.
<p>Examples of <abbr>HTTP</abbr> web services:
<ul>
<li><a href=http://code.google.com/apis/gdata/>Google Data <abbr>API</abbr>s</a> allow you to interact
with a wide variety of Google services, including <a href=http://www.blogger.com/>Blogger</a> and <a
href=http://www.youtube.com/>YouTube</a>.
<li><a href=http://www.flickr.com/services/api/>Flickr
Services</a> allow you to upload and download photos from <a href=http://www.flickr.com/>Flickr</a>.
<li><a href=http://apiwiki.twitter.com/>Twitter <abbr>API</abbr></a> allows you to publish status
updates on <a href=http://twitter.com/>Twitter</a>.
<li><a href='http://www.programmableweb.com/apis/directory/1?sort=mashups'>…and
many more</a>
</ul>
<p>Python 3 comes with two different libraries for interacting with <abbr>HTTP</abbr> web services:
<ul>
<li><a href=http://docs.python.org/3.1/library/http.client.html> <code>http.client</code></a> is a
low-level library that implements <a href=http://www.w3.org/Protocols/rfc2616/rfc2616.html> <abbr>RFC</abbr>
2616</a>, the <abbr>HTTP</abbr> protocol.
<li><a href=http://docs.python.org/3.1/library/urllib.request.html> <code>urllib.request</code></a> is
an abstraction layer built on top of <code>http.client</code>. It provides a standard <abbr>API</abbr>
for accessing both <abbr>HTTP</abbr> and <abbr>FTP</abbr> servers, automatically follows <abbr>HTTP</abbr>
redirects, and handles some common forms of <abbr>HTTP</abbr> authentication.
</ul>
<p>So which one should you use? Neither of them. Instead, you should use <a href=http://code.google.com/p/httplib2/>
<code>httplib2</code></a>, an open source third-party library that implements <abbr>HTTP</abbr> more
fully than <code>http.client</code> but provides a better abstraction than <code>urllib.request</code>.
<p>To understand why <code>httplib2</code> is the right choice, you first need to understand <abbr>HTTP</abbr>.
<p class=a>⁂
<h2 id=http-features>Features of HTTP</h2>
<p>There are five important features which all <abbr>HTTP</abbr> clients should support.
<h3 id=caching>Caching</h3>
<p>The most important thing to understand about any type of web service is that network access is
incredibly expensive. I don’t mean “dollars and cents” expensive (although
bandwidth ain’t free). I mean that it takes an extraordinarily long time to open a
connection, send a request, and retrieve a response from a remote server. Even on the fastest
broadband connection, <i>latency</i> (the time it takes to send a request and start retrieving
data in a response) can still be higher than you anticipated. A router misbehaves, a packet is
dropped, an intermediate proxy is under attack — there’s <a href=http://isc.sans.org/>never
a dull moment</a> on the public internet, and there may be nothing you can do about it.
<aside><code>Cache-Control: max-age</code> means “don't bug me until next week.”</aside>
<p><abbr>HTTP</abbr> is designed with caching in mind. There is an entire class of devices
(called “caching proxies”) whose only job is to sit between you and the rest of
the world and minimize network access. Your company or <abbr>ISP</abbr> almost certainly
maintains caching proxies, even if you’re unaware of them. They work because caching
is built into the <abbr>HTTP</abbr> protocol.
<p>Here’s a concrete example of how caching works. You visit <a href=http://diveintomark.org/>
<code>diveintomark.org</code></a> in your browser. That page includes a background
image, <a href=http://wearehugh.com/m.jpg> <code>wearehugh.com/m.jpg</code></a>. When
your browser downloads that image, the server includes the following <abbr>HTTP</abbr>
headers:
<pre class=nd><code>HTTP/1.1 200 OK
Date: Sun, 31 May 2009 17:14:04 GMT
Server: Apache
Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT
ETag: "3075-ddc8d800"
Accept-Ranges: bytes
Content-Length: 12405
<mark>Cache-Control: max-age=31536000, public</mark>
<mark>Expires: Mon, 31 May 2010 17:14:04 GMT</mark>
Connection: close
Content-Type: image/jpeg</code></pre>
<p>The <code>Cache-Control</code> and <code>Expires</code> headers tell your browser (and
any caching proxies between you and the server) that this image can be cached for up to
a year. <em>A year!</em> And if, in the next year, you visit another page which also
includes a link to this image, your browser will load the image from its cache <em>without
generating any network activity whatsoever</em>.
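<p>If you want to check the “up to a year” arithmetic yourself, the following sketch parses the two highlighted headers with the standard library. The <code>parse_cache_control()</code> helper is a hypothetical name and a deliberately simple parser; it is not how any particular browser or proxy does it.

```python
from email.utils import parsedate_to_datetime

def parse_cache_control(value):
    """Split a Cache-Control header value into a dict of directives."""
    directives = {}
    for part in value.split(","):
        name, _, argument = part.strip().partition("=")
        if name:
            # Directives without an argument (like "public") map to None.
            directives[name.lower()] = argument.strip('"') or None
    return directives

# Header values copied from the response shown above.
directives = parse_cache_control("max-age=31536000, public")
max_age = int(directives["max-age"])
print(directives)
print(max_age / 86400, "days")

# Expires should agree with Date + max-age.
date = parsedate_to_datetime("Sun, 31 May 2009 17:14:04 GMT")
expires = parsedate_to_datetime("Mon, 31 May 2010 17:14:04 GMT")
print((expires - date).total_seconds() == max_age)
```

<p>Both headers say the same thing here: 31536000 seconds is 365 days, which is exactly the gap between <code>Date</code> and <code>Expires</code>.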
<p>But wait, it gets better. Let’s say your browser purges the image from your
local cache for some reason. Maybe it ran out of disk space; maybe you manually
cleared the cache. Whatever. But the <abbr>HTTP</abbr> headers said that this data
could be cached by public caching proxies. (Technically, the important thing is what
the headers <em>don’t</em> say; the <code>Cache-Control</code> header
doesn’t have the <code>private</code> keyword, so this data is cacheable by
default.) Caching proxies are designed to have tons of storage space, probably far
more than your local browser has allocated.
<p>If your company or <abbr>ISP</abbr> maintains a caching proxy, the proxy may still
have the image cached. When you visit <code>diveintomark.org</code> again, your
browser will look in its local cache for the image, but it won’t find it, so
it will make a network request to try to download it from the remote server. But if
the caching proxy still has a copy of the image, it will intercept that request and
serve the image from <em>its</em> cache. That means that your request will never
reach the remote server; in fact, it will never leave your company’s network.
That makes for a faster download (fewer network hops) and saves your company money
(less data being downloaded from the outside world).
<p><abbr>HTTP</abbr> caching only works when everybody does their part. On one
side, servers need to send the correct headers in their response. On the other
side, clients need to understand and respect those headers before they request
the same data twice. The proxies in the middle are not a panacea; they can only
be as smart as the servers and clients allow them to be.
<p>Python’s <abbr>HTTP</abbr> libraries do not support caching, but <code>httplib2</code>
does.
<h3 id=last-modified>Last-Modified Checking</h3>
<p>Some data never changes, while other data changes all the time. In between,
there is a vast field of data that <em>might</em> have changed, but
hasn’t. CNN.com’s feed is updated every few minutes, but my
weblog’s feed may not change for days or weeks at a time. In the latter
case, I don’t want to tell clients to cache my feed for weeks at a
time, because then when I do actually post something, people may not read it
for weeks (because they’re respecting my cache headers which said
“don’t bother checking this feed for weeks”). On the other
hand, I don’t want clients downloading my entire feed once an hour if
it hasn’t changed!
<aside><code>304: Not Modified</code> means “same shit, different
day.”</aside>
<p><abbr>HTTP</abbr> has a solution to this, too. When you request data for
the first time, the server can send back a <code>Last-Modified</code>
header. This is exactly what it sounds like: the date that the data was
last changed. That background image referenced from <code>diveintomark.org</code>
included a <code>Last-Modified</code> header.
<pre class=nd><code>HTTP/1.1 200 OK
Date: Sun, 31 May 2009 17:14:04 GMT
Server: Apache
<mark>Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT</mark>
ETag: "3075-ddc8d800"
Accept-Ranges: bytes
Content-Length: 12405
Cache-Control: max-age=31536000, public
Expires: Mon, 31 May 2010 17:14:04 GMT
Connection: close
Content-Type: image/jpeg
</code></pre>
<p>When you request the same data a second (or third or fourth) time, you
can send an <code>If-Modified-Since</code> header with your request, with
the date you got back from the server last time. If the data has changed
since then, then the server gives you the new data with a <code>200</code>
status code. But if the data <em>hasn’t</em> changed since then,
the server sends back a special <abbr>HTTP</abbr> <code>304</code> status
code, which means “this data hasn’t changed since the last
time you asked for it.” You can test this on the command line,
using <a href=http://curl.haxx.se/>curl</a>:
<pre class='nd screen'>
<samp class=p>you@localhost:~$ </samp><kbd>curl -I <mark>-H "If-Modified-Since: Fri, 22 Aug 2008 04:28:16 GMT"</mark> http://wearehugh.com/m.jpg</kbd>
<samp>HTTP/1.1 304 Not Modified
Date: Sun, 31 May 2009 18:04:39 GMT
Server: Apache
Connection: close
ETag: "3075-ddc8d800"
Expires: Mon, 31 May 2010 18:04:39 GMT
Cache-Control: max-age=31536000, public</samp>
</pre>
<p>Why is this an improvement? Because when the server sends a <code>304</code>,
<em>it doesn’t re-send the data</em>. All you get is the status
code. Even after your cached copy has expired, last-modified checking
ensures that you won’t download the same data twice if it
hasn’t changed. (As an extra bonus, this <code>304</code> response
also includes caching headers. Proxies will keep a copy of data even
after it officially “expires,” in the hopes that the data
hasn’t <em>really</em> changed and the next request responds with a
<code>304</code> status code and updated cache information.)
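<p>The decision the server makes can be sketched in a few lines. This is a simplified model under stated assumptions, not the logic of Apache or any other real server: <code>status_for()</code> is a hypothetical helper that compares the resource’s <code>Last-Modified</code> date with the date the client sent in <code>If-Modified-Since</code>.

```python
from email.utils import parsedate_to_datetime

def status_for(last_modified, if_modified_since=None):
    """Return 304 if the client's copy is still current, otherwise 200."""
    if if_modified_since is None:
        return 200  # unconditional request: always send the data
    changed = parsedate_to_datetime(last_modified)
    seen = parsedate_to_datetime(if_modified_since)
    return 200 if changed > seen else 304

# Date copied from the Last-Modified header shown above.
last_modified = "Fri, 22 Aug 2008 04:28:16 GMT"

print(status_for(last_modified))                                   # 200
print(status_for(last_modified, "Fri, 22 Aug 2008 04:28:16 GMT"))  # 304
print(status_for(last_modified, "Thu, 21 Aug 2008 00:00:00 GMT"))  # 200
```

<p>The first call models a first-time visitor, the second a client whose copy is current, and the third a client whose copy predates the last change.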
<p>Python’s <abbr>HTTP</abbr> libraries do not support
last-modified date checking, but <code>httplib2</code> does.
<h3 id=etags>ETag Checking</h3>
<p>ETags are an alternate way to accomplish the same thing as <a
href=#last-modified>last-modified checking</a>. With ETags, the
server sends a hash code in an <code>ETag</code> header along with
the data you requested. (Exactly how this hash is determined is
entirely up to the server. The only requirement is that it changes
when the data changes.) That background image referenced from <code>diveintomark.org</code>
had an <code>ETag</code> header.
<pre class=nd><code>HTTP/1.1 200 OK
Date: Sun, 31 May 2009 17:14:04 GMT
Server: Apache
Last-Modified: Fri, 22 Aug 2008 04:28:16 GMT
<mark>ETag: "3075-ddc8d800"</mark>
Accept-Ranges: bytes
Content-Length: 12405
Cache-Control: max-age=31536000, public
Expires: Mon, 31 May 2010 17:14:04 GMT
Connection: close
Content-Type: image/jpeg
</code></pre>
<aside><code>ETag</code> means “there’s nothing new under
the sun.”</aside>
<p>The second time you request the same data, you include the ETag
hash in an <code>If-None-Match</code> header of your request. If
the data hasn’t changed, the server will send you back a
<code>304</code> status code. As with the last-modified date
checking, the server sends back <em>only</em> the <code>304</code>
status code; it doesn’t send you the same data a second time.
By including the ETag hash in your second request, you’re
telling the server that there’s no need to re-send the same
data if it still matches this hash, since <a href=#caching>you
still have the data from the last time</a>.
<p>Again with the <kbd>curl</kbd>:
<pre class='nd screen'>
<a><samp class=p>you@localhost:~$ </samp><kbd>curl -I <mark>-H "If-None-Match: \"3075-ddc8d800\""</mark> http://wearehugh.com/m.jpg</kbd> <span class=u>①</span></a>
<samp>HTTP/1.1 304 Not Modified
Date: Sun, 31 May 2009 18:04:39 GMT
Server: Apache
Connection: close
ETag: "3075-ddc8d800"
Expires: Mon, 31 May 2010 18:04:39 GMT
Cache-Control: max-age=31536000, public</samp></pre>
<ol>
<li>ETags are commonly enclosed in quotation marks, but <em>the
quotation marks are part of the value</em>. That means you
need to send the quotation marks back to the server in the
<code>If-None-Match</code> header.
</ol>
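<p>Here is a rough sketch of the comparison, mostly to show why the quotation marks matter. The helper names are hypothetical, and the matching is simplified on purpose: it splits the header on commas and ignores weak validators, which a real server would handle more carefully.

```python
def etag_matches(if_none_match, current_etag):
    """Return True if the If-None-Match value lists the current ETag."""
    if if_none_match.strip() == "*":
        return True
    candidates = [tag.strip() for tag in if_none_match.split(",")]
    return current_etag in candidates

def status_for(current_etag, if_none_match=None):
    """Return 304 when the client already holds this version, else 200."""
    if if_none_match is not None and etag_matches(if_none_match, current_etag):
        return 304
    return 200

# The quotation marks are part of the value the server sent.
etag = '"3075-ddc8d800"'

print(status_for(etag, '"3075-ddc8d800"'))  # 304: quotes sent back intact
print(status_for(etag, '3075-ddc8d800'))    # 200: quotes dropped, no match
```

<p>Drop the quotation marks and the values no longer compare equal, so you download the whole thing again.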
<p>Python’s <abbr>HTTP</abbr> libraries do not support
ETags, but <code>httplib2</code> does.
<h3 id=compression>Compression</h3>
<p>When you talk about <abbr>HTTP</abbr> web services,
you’re almost always talking about moving text-based
data back and forth over the wire. Maybe it’s <abbr>XML</abbr>,
maybe it’s <abbr>JSON</abbr>, maybe it’s just <a
href=strings.html#boring-stuff title='there ain’t no such thing as plain text'>plain
text</a>. Regardless of the format, text compresses well.
The example feed in <a href=xml.html>the XML chapter</a> is
3070 bytes uncompressed, but would be 941 bytes after gzip
compression. That’s about 31% of the original size!
<p><abbr>HTTP</abbr> supports <a href=http://www.iana.org/assignments/http-parameters>several
compression algorithms</a>. The two most common types are
<a href=http://www.ietf.org/rfc/rfc1952.txt>gzip</a> and
<a href=http://www.ietf.org/rfc/rfc1951.txt>deflate</a>.
When you request a resource over <abbr>HTTP</abbr>, you
can ask the server to send it in compressed format. You
include an <code>Accept-Encoding</code> header in your
request that lists which compression algorithms you
support. If the server supports any of the same
algorithms, it will send you back compressed data (with a
<code>Content-Encoding</code> header that tells you which
algorithm it used). Then it’s up to you to
decompress the data.
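<p>The “up to you” part can be sketched with the standard <code>gzip</code> and <code>zlib</code> modules. The <code>decode_body()</code> helper is a hypothetical name, and the sample data is made up for the demonstration; the raw-deflate fallback is an assumption about servers that omit the zlib wrapper.

```python
import gzip
import zlib

def decode_body(body, content_encoding=None):
    """Undo the compression named in a Content-Encoding header."""
    encoding = (content_encoding or "identity").strip().lower()
    if encoding == "identity":
        return body
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            return zlib.decompress(body)
        except zlib.error:
            # Assumption: the server sent a raw deflate stream
            # without the zlib wrapper.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    raise ValueError("unsupported Content-Encoding: " + encoding)

# Made-up, repetitive sample data standing in for a feed.
original = b"<entry><title>Dive into history</title></entry>" * 20
compressed = gzip.compress(original)

assert decode_body(compressed, "gzip") == original
print(len(original), "bytes uncompressed,", len(compressed), "bytes gzipped")
```

<p>Repetitive markup like this compresses even better than a real feed would, so don’t read too much into the exact numbers.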
<blockquote class=note>
<p><span class=u>☞</span>Important tip for
server-side developers: make sure that the compressed
version of a resource has a different <a href=#etags>ETag</a>
than the uncompressed version. Otherwise, caching
proxies will get confused and may serve the
compressed version to clients that can’t handle
it. Read the discussion of <a href="https://issues.apache.org/bugzilla/show_bug.cgi?id=39727">Apache
bug 39727</a> for more details on this subtle
issue.
</blockquote>
<p>Python’s <abbr>HTTP</abbr> libraries do not
support compression, but <code>httplib2</code> does.
<h3 id=redirects>Redirects</h3>
<p><a href=http://www.w3.org/Provider/Style/URI>Cool
<abbr>URI</abbr>s don’t change</a>, but many
<abbr>URI</abbr>s are seriously uncool. Web sites get
reorganized, pages move to new addresses. Even web
services can reorganize. A syndicated feed at <code>http://example.com/index.xml</code>
might be moved to <code>http://example.com/xml/atom.xml</code>.
Or an entire domain might move, as an organization
expands and reorganizes; <code>http://www.example.com/index.xml</code>
becomes <code>http://server-farm-1.example.com/index.xml</code>.
<aside><code>Location</code> means “look over
there!”</aside>
<p>Every time you request any kind of resource from
an <abbr>HTTP</abbr> server, the server includes a
status code in its response. Status code <code>200</code>
means “everything’s normal,
here’s the page you asked for”. Status
code <code>404</code> means “page not
found”. (You’ve probably seen 404
errors while browsing the web.) Status codes in the
300’s indicate some form of redirection.
<p><abbr>HTTP</abbr> has several different ways of
signifying that a resource has moved. The two
most common techniques are status codes <code>302</code>
and <code>301</code>. Status code <code>302</code>
is a <i>temporary redirect</i>; it means
“oops, that got moved over here
temporarily” (and then gives the temporary
address in a <code>Location</code> header).
Status code <code>301</code> is a <i>permanent
redirect</i>; it means “oops, that got
moved permanently” (and then gives the new
address in a <code>Location</code> header). If
you get a <code>302</code> status code and a new
address, the <abbr>HTTP</abbr> specification says
you should use the new address to get what you
asked for, but the next time you want to access
the same resource, you should retry the old
address. But if you get a <code>301</code> status
code and a new address, you’re supposed to
use the new address from then on.
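<p>That rule is easy to model. The sketch below is a hypothetical <code>RedirectMemory</code> class written for illustration only; it is not how <code>httplib2</code> or any other library is implemented. It remembers <code>301</code> redirects and deliberately forgets <code>302</code> redirects.

```python
class RedirectMemory:
    """Remember permanent (301) redirects; ignore temporary (302) ones."""

    def __init__(self):
        self._permanent = {}

    def record(self, status, old_url, new_url):
        """Note a redirect that a server just sent."""
        if status == 301:
            self._permanent[old_url] = new_url

    def resolve(self, url):
        """Return the address to request the next time you want this URL."""
        seen = set()
        # Follow chains of permanent redirects, guarding against loops.
        while url in self._permanent and url not in seen:
            seen.add(url)
            url = self._permanent[url]
        return url

memory = RedirectMemory()
memory.record(301, "http://example.com/index.xml",
              "http://example.com/xml/atom.xml")
memory.record(302, "http://www.example.com/index.xml",
              "http://server-farm-1.example.com/index.xml")

print(memory.resolve("http://example.com/index.xml"))
print(memory.resolve("http://www.example.com/index.xml"))
```

<p>The first address is rewritten to its new home; the second is left alone, because a temporary redirect says nothing about where the resource will be next time.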
<p>The <code>urllib.request</code> module
automatically “follows” redirects
when it receives the appropriate status code
from the <abbr>HTTP</abbr> server, but it
doesn’t tell you that it did so.
You’ll end up getting data you asked for,
but you’ll never know that the underlying
library “helpfully” followed a
redirect for you. So you’ll continue
pounding away at the old address, and each time
you’ll get redirected to the new address,
and each time the <code>urllib.request</code>
module will “helpfully” follow the
redirect. In other words, it treats permanent
redirects the same as temporary redirects. That
means two round trips instead of one, which is
bad for the server and bad for you.
<p><code>httplib2</code> handles permanent
redirects for you. Not only will it tell you
that a permanent redirect occurred, it will
keep track of them locally and automatically
rewrite redirected <abbr>URL</abbr>s before
requesting them.
<p class=a>⁂
<h2 id=dont-try-this-at-home>How Not To
Fetch Data Over HTTP</h2>
<p>Let’s say you want to download a
resource over <abbr>HTTP</abbr>, such as
<a href=xml.html>an Atom feed</a>. Because
it’s a feed, you’re not just going to
download it once; you’re going to
download it over and over again. (Most
feed readers will check for changes once
an hour.) Let’s do it the
quick-and-dirty way first, and then see
how you can do better.
<pre class='nd screen'>
<samp class=p>>>> </samp><kbd class=pp>import urllib.request</kbd>
<samp class=p>>>> </samp><kbd class=pp>a_url = 'http://diveintopython3.org/examples/feed.xml'</kbd>
<a><samp class=p>>>> </samp><kbd class=pp>data = urllib.request.urlopen(a_url).read()</kbd> <span class=u>①</span></a>
<a><samp class=p>>>> </samp><kbd class=pp>type(data)</kbd> <span class=u>②</span></a>
<samp class=pp>&lt;class 'bytes'></samp>
<samp class=p>>>> </samp><kbd class=pp>print(data)</kbd>
<samp class=pp>&lt;?xml version='1.0' encoding='utf-8'?>
&lt;feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
&lt;title>dive into mark&lt;/title>
&lt;subtitle>currently between addictions&lt;/subtitle>
&lt;id>tag:diveintomark.org,2001-07-29:/&lt;/id>
&lt;updated>2009-03-27T21:56:07Z&lt;/updated>
&lt;link rel='alternate' type='text/html' href='http://diveintomark.org/'/>
…
</samp></pre>
<ol>
<li>Downloading anything over <abbr>HTTP</abbr>
is incredibly easy in Python; in
fact, it’s a one-liner. The
<code>urllib.request</code> module
has a handy <code>urlopen()</code>
function that takes the address of
the page you want, and returns a
file-like object that you can just
<code>read()</code> from to get the
full contents of the page. It just
can’t get any easier.
<li>The <code>urlopen().read()</code>
method always returns <a href=strings.html#byte-arrays>a
<code>bytes</code> object, not a
string</a>. Remember, bytes are
bytes; characters are an abstraction.
<abbr>HTTP</abbr> servers don’t
deal in abstractions. If you request
a resource, you get bytes. If you
want it as a string, you’ll
need to <a href=http://feedparser.org/docs/character-encoding.html>determine
the character encoding</a> and
explicitly convert it to a string.
</ol> <p>So what’s wrong with
this? For a quick one-off during
testing or development,
there’s nothing wrong with
it. I do it all the time. I wanted
the contents of the feed, and I got
the contents of the feed. The same
technique works for any web page.
But once you start thinking in
terms of a web service that you
want to access on a regular basis (<i>e.g.</i>
requesting this feed once an hour),
then you’re being
inefficient, and you’re being
rude.
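<p>As for turning those bytes into a string, here is one minimal way to do it, assuming the server declares a charset in its <code>Content-Type</code> header and that <abbr>UTF-8</abbr> is an acceptable fallback when it doesn’t. The <code>decode_body()</code> helper is hypothetical, and the headers object is built by hand so the sketch runs without a network connection; the headers on a real <code>urlopen()</code> response offer the same <code>get_content_charset()</code> method.

```python
from email.message import Message

def decode_body(data, headers, default="utf-8"):
    """Decode response bytes using the charset from Content-Type, if any."""
    charset = headers.get_content_charset() or default  # fallback is an assumption
    return data.decode(charset)

# Hand-built stand-in for the headers of a real response.
headers = Message()
headers["Content-Type"] = "application/xml; charset=utf-8"

data = "<title>dive into mark</title>".encode("utf-8")
text = decode_body(data, headers)
print(type(data), type(text))
print(text)
```

<p>This ignores encodings declared inside the document itself, such as the <code>encoding</code> attribute of an <abbr>XML</abbr> declaration, which a thorough client would also consult.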
<p class=a>⁂
<h2 id=whats-on-the-wire>What’s
On The Wire?</h2>
<p>To see why this is inefficient
and rude, let’s turn on
the debugging features of
Python’s <abbr>HTTP</abbr>
library and see what’s
being sent “on the
wire” (<i>i.e.</i> over
the network).
<pre class=screen>
<samp class=p>>>> </samp><kbd class=pp>from http.client import HTTPConnection</kbd>
<a><samp class=p>>>> </samp><kbd class=pp>HTTPConnection.debuglevel = 1</kbd> <span class=u>①</span></a>
<samp class=p>>>> </samp><kbd class=pp>from urllib.request import urlopen</kbd>
<a><samp class=p>>>> </samp><kbd class=pp>response = urlopen('http://diveintopython3.org/examples/feed.xml')</kbd> <span class=u>②</span></a>
<samp><a>send: b'GET /examples/feed.xml HTTP/1.1 <span class=u>③</span></a>
<a>Host: diveintopython3.org <span class=u>④</span></a>
<a>Accept-Encoding: identity <span class=u>⑤</span></a>
<a>User-Agent: Python-urllib/3.1' <span class=u>⑥</span></a>
Connection: close
reply: 'HTTP/1.1 200 OK'
…further debugging information omitted…</samp></pre>
<ol>
<li>As I mentioned at the
beginning of the chapter,
<code>urllib.request</code>
relies on another standard
Python library, <code>http.client</code>.
Normally you don’t
need to touch <code>http.client</code>
directly. (The <code>urllib.request</code>
module imports it
automatically.) But we
import it here so we can
toggle the debugging flag
on the <code>HTTPConnection</code>
class that <code>urllib.request</code>
uses to connect to the
<abbr>HTTP</abbr> server.
<li>Now that the debugging
flag is set, information on
the <abbr>HTTP</abbr>
request and response is
printed out in real time.
As you can see, when you
request the Atom feed, the
<code>urllib.request</code>
module sends five lines to
the server.
<li>The first line specifies
the <abbr>HTTP</abbr> verb
you’re using, and the
path of the resource (minus
the domain name).
<li>The second line specifies
the domain name from which
we’re requesting this
feed.
<li>The third line specifies
the compression algorithms
that the client supports.
As I mentioned earlier, <a
href=#compression><code>urllib.request</code>
does not support
compression</a> by
default.
<li>The fourth line specifies
the name of the library
that is making the request.
By default, this is <code>Python-urllib</code>
plus a version number. Both
<code>urllib.request</code>
and <code>httplib2</code>
support changing the user
agent, simply by adding a
<code>User-Agent</code>
header to the request
(which will override the
default value).
</ol>
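<p>Overriding the user agent with <code>urllib.request</code> looks roughly like this. The agent string <code>my-feed-reader/1.0</code> is made up for the example, and the request is only constructed, never sent.

```python
import urllib.request

request = urllib.request.Request(
    "http://diveintopython3.org/examples/feed.xml",
    headers={"User-Agent": "my-feed-reader/1.0"},  # hypothetical agent string
)

# Request normalizes header names with str.capitalize(),
# so the stored name is "User-agent".
print(request.get_header("User-agent"))
print(request.header_items())
```

<p>A header set this way replaces the default <code>Python-urllib</code> value when the request is eventually opened.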
<aside>We’re downloading
3070 bytes when we could have
just downloaded 941.</aside>
<p>Now let’s look at what
the server sent back in its
response.
<pre class=screen>
# continued from previous example
<a><samp class=p>>>> </samp><kbd class=pp>print(response.headers.as_string())</kbd> <span class=u>①</span></a>
<samp><a>Date: Sun, 31 May 2009 19:23:06 GMT <span class=u>②</span></a>
Server: Apache
<a>Last-Modified: Sun, 31 May 2009 06:39:55 GMT <span class=u>③</span></a>
<a>ETag: "bfe-93d9c4c0" <span class=u>④</span></a>
Accept-Ranges: bytes
<a>Content-Length: 3070 <span class=u>⑤</span></a>
<a>Cache-Control: max-age=86400 <span class=u>⑥</span></a>
Expires: Mon, 01 Jun 2009 19:23:06 GMT
Vary: Accept-Encoding
Connection: close
Content-Type: application/xml</samp>
<a><samp class=p>>>> </samp><kbd class=pp>data = response.read()</kbd> <span class=u>⑦</span></a>
<samp class=p>>>> </samp><kbd class=pp>len(data)</kbd>
<samp class=pp>3070</samp></pre>
<ol>
<li>The <var>response</var>
returned from the <code>urllib.request.urlopen()</code>
function contains all the
<abbr>HTTP</abbr> headers
the server sent back. It
also contains methods to
download the actual data;
we’ll get to that
in a minute.
<li>The server tells you
when it handled your
request.
<li>This response includes
a <a href=#last-modified><code>Last-Modified</code></a>
header.
<li>This response includes
an <a href=#etags><code>ETag</code></a>
header.
<li>The data is 3070 bytes
long. Notice what <em>isn’t</em>
here: a <code>Content-Encoding</code>
header. Your request
stated that you only
accept uncompressed data
(<code>Accept-Encoding:
identity</code>), and
sure enough, this
response contains
uncompressed data.
<li>This response includes
caching headers that
state that this feed can
be cached for up to 24
hours (86400 seconds).
<li>And finally, download
the actual data by
calling <code>response.read()</code>.
As you can tell from the
<code>len()</code>
function, this fetched a
total of 3070 bytes.
</ol>
<p>As you can see, this code
is already inefficient: it
asked for (and received)
uncompressed data. I know
for a fact that this server
supports <a href=#compression>gzip
compression</a>, but
<abbr>HTTP</abbr>
compression is opt-in. We
didn’t ask for it, so
we didn’t get it.
That means we’re
fetching 3070 bytes when we
could have fetched 941. Bad
dog, no biscuit.
<p>But wait, it gets worse!
To see just how
inefficient this code is,
let’s request the
same feed a second time.
<pre class='nd screen'>
# continued from the <a href=#whats-on-the-wire>previous example</a>
<samp class=p>>>> </samp><kbd class=pp>response2 = urlopen('http://diveintopython3.org/examples/feed.xml')</kbd>
<samp>send: b'GET /examples/feed.xml HTTP/1.1
Host: diveintopython3.org
Accept-Encoding: identity
User-Agent: Python-urllib/3.1'
Connection: close
reply: 'HTTP/1.1 200 OK'
…further debugging information omitted…</samp></pre>
<p>Notice anything
peculiar about this
request? It
hasn’t changed!
It’s exactly the
same as the first
request. No sign of <a
href=#last-modified><code>If-Modified-Since</code>
headers</a>. No sign
of <a href=#etags><code>If-None-Match</code>
headers</a>. No
respect for the caching
headers. Still no
compression.
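<p>For contrast, here is a sketch of what a more polite second request could look like, built from the headers of the first response. The <code>polite_request()</code> helper is hypothetical, and the request is constructed but not sent.

```python
import urllib.request

def polite_request(url, previous_headers):
    """Build a conditional, compression-friendly request for a repeat fetch."""
    headers = {"Accept-Encoding": "gzip"}
    if "ETag" in previous_headers:
        headers["If-None-Match"] = previous_headers["ETag"]
    if "Last-Modified" in previous_headers:
        headers["If-Modified-Since"] = previous_headers["Last-Modified"]
    return urllib.request.Request(url, headers=headers)

# Values copied from the first response shown earlier.
first_response_headers = {
    "Last-Modified": "Sun, 31 May 2009 06:39:55 GMT",
    "ETag": '"bfe-93d9c4c0"',
}

request = polite_request("http://diveintopython3.org/examples/feed.xml",
                         first_response_headers)
for name, value in sorted(request.header_items()):
    print(name + ":", value)
```

<p>Building the request is the easy half. If you actually sent it with <code>urlopen()</code>, a <code>304</code> response would surface as an <code>HTTPError</code> exception and a gzipped body would arrive still compressed, so you would have to handle both cases yourself.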
<p>And what happens
when you do the same
thing twice? You get
the same response.
Twice.
<pre class=screen>
# continued from the previous example
<a><samp class=p>>>> </samp><kbd class=pp>print(response2.headers.as_string())</kbd> <span class=u>①</span></a>
<samp>Date: Mon, 01 Jun 2009 03:58:00 GMT
Server: Apache
Last-Modified: Sun, 31 May 2009 22:51:11 GMT
ETag: "bfe-255ef5c0"
Accept-Ranges: bytes
Content-Length: 3070
Cache-Control: max-age=86400
Expires: Tue, 02 Jun 2009 03:58:00 GMT
Vary: Accept-Encoding
Connection: close
Content-Type: application/xml</samp>
<samp class=p>>>> </samp><kbd class=pp>data2 = response2.read()</kbd>
<a><samp class=p>>>> </samp><kbd class=pp>len(data2)</kbd> <span class=u>②</span></a>
<samp class=pp>3070</samp>
<a><samp class=p>>>> </samp><kbd class=pp>data2 == data</kbd> <span class=u>③</span></a>
<samp class=pp>True</samp></pre>
<ol>
<li>The server is
still sending the
same array of
“smart”
headers: <code>Cache-Control</code>
and <code>Expires</code>
to allow caching,
<code>Last-Modified</code>
and <code>ETag</code>
to enable
“not-modified”
tracking. Even
the <code>Vary:
Accept-Encoding</code>
header hints that
the server would
support
compression, if
only you would
ask for it. But
you didn’t.
<li>Once again,
this request
fetches the whole
3070
bytes…
<li>…the
exact same 3070
bytes you got
last time.
</ol>
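<p>For comparison, this is roughly what a polite second request would involve if you did the bookkeeping by hand with <code>urllib</code>. Treat it as a sketch under assumptions: the helper names and the <code>opener</code> parameter are invented for illustration, and the header values are copied from the response shown above.

```python
import urllib.error
import urllib.request


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a GET request that asks 'has this changed since my copy?'"""
    request = urllib.request.Request(url)
    if etag:
        request.add_header('If-None-Match', etag)
    if last_modified:
        request.add_header('If-Modified-Since', last_modified)
    return request


def fetch_if_changed(request, cached_body, opener=urllib.request.urlopen):
    """Return (status, body), reusing cached_body when the server says 304."""
    try:
        response = opener(request)
    except urllib.error.HTTPError as error:
        # urlopen() reports 304 Not Modified as an HTTPError.
        if error.code == 304:
            return 304, cached_body
        raise
    with response:
        return response.status, response.read()


# Build the request only; nothing is sent over the network here.
request = build_conditional_request(
    'http://diveintopython3.org/examples/feed.xml',
    etag='"bfe-255ef5c0"',
    last_modified='Sun, 31 May 2009 22:51:11 GMT',
)
print(request.header_items())
```

<p>Even this much leaves out storing the validators between runs, honoring <code>Cache-Control</code> and <code>Expires</code>, and handling compression. That is a lot of plumbing to write and maintain yourself.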
<p><abbr>HTTP</abbr>
is designed to work
better than this.
<code>urllib</code>
speaks <abbr>HTTP</abbr>
like I speak
Spanish — enough
to get by in a jam,
but not enough to
hold a
conversation. <abbr>HTTP</abbr>
is a conversation.
It’s time to
upgrade to a
library that speaks
<abbr>HTTP</abbr>
fluently.
<p class=a>⁂
<h2 id=introducing-httplib2>Introducing
<code>httplib2</code></h2>
<p>Before you can
use <code>httplib2</code>,
you’ll
need to install
it. Visit <a
href=http://code.google.com/p/httplib2/>
<code>code.google.com/p/httplib2/</code></a>
and download
the latest
version. <code>httplib2</code>
is available
for Python 2.x
and Python 3.x;
make sure you
get the Python
3 version,
named something
like <code>httplib2-python3-0.5.0.zip</code>.
<p>Unzip the
archive, open
a terminal
window, and
go to the
newly created
directory, named
something like
<code>httplib2-python3-0.5.0</code>. On
Windows, open
the <code>Start</code>
menu, select
<code>Run...</code>,
type <kbd>cmd.exe</kbd>,
and press
<kbd>ENTER</kbd>.
<pre class=screen>
<samp class=p>c:\Users\pilgrim\Downloads> </samp><kbd><mark>dir</mark></kbd>
<samp> Volume in drive C has no label.
Volume Serial Number is DED5-B4F8
Directory of c:\Users\pilgrim\Downloads
07/28/2009 12:36 PM <DIR> .
07/28/2009 12:36 PM <DIR> ..
07/28/2009 12:36 PM <DIR> httplib2-python3-0.5.0
07/28/2009 12:33 PM 18,997 httplib2-python3-0.5.0.zip
1 File(s) 18,997 bytes
3 Dir(s) 61,496,684,544 bytes free</samp>
<samp class=p>c:\Users\pilgrim\Downloads> </samp><kbd><mark>cd httplib2-python3-0.5.0</mark></kbd>
<samp class=p>c:\Users\pilgrim\Downloads\httplib2-python3-0.5.0> </samp><kbd><mark>c:\python31\python.exe setup.py install</mark></kbd>
<samp>running install
running build
running build_py
running install_lib
creating c:\python31\Lib\site-packages\httplib2
copying build\lib\httplib2\iri2uri.py -> c:\python31\Lib\site-packages\httplib2
copying build\lib\httplib2\__init__.py -> c:\python31\Lib\site-packages\httplib2
byte-compiling c:\python31\Lib\site-packages\httplib2\iri2uri.py to iri2uri.pyc
byte-compiling c:\python31\Lib\site-packages\httplib2\__init__.py to __init__.pyc
running install_egg_info
Writing c:\python31\Lib\site-packages\httplib2-python3_0.5.0-py3.1.egg-info</samp></pre>
<p>On Mac OS
X, run the
<code>Terminal.app</code>
application
in your
<code>/Applications/Utilities/</code>
folder. On
Linux, run
the <code>Terminal</code>
application,
which is
usually in
your <code>Applications</code>
menu under
<code>Accessories</code>
or <code>System</code>.
<pre class='screen cmdline'>
<samp class=p>you@localhost:~/Desktop$ </samp><kbd><mark>unzip httplib2-python3-0.5.0.zip</mark></kbd>
<samp>Archive: httplib2-python3-0.5.0.zip
inflating: httplib2-python3-0.5.0/README
inflating: httplib2-python3-0.5.0/setup.py
inflating: httplib2-python3-0.5.0/PKG-INFO
inflating: httplib2-python3-0.5.0/httplib2/__init__.py
inflating: httplib2-python3-0.5.0/httplib2/iri2uri.py</samp>
<samp class=p>you@localhost:~/Desktop$ </samp><kbd><mark>cd httplib2-python3-0.5.0/</mark></kbd>
<samp class=p>you@localhost:~/Desktop/httplib2-python3-0.5.0$ </samp><kbd><mark>sudo python3 setup.py install</mark></kbd>
<samp>running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.1
creating build/lib.linux-x86_64-3.1/httplib2
copying httplib2/iri2uri.py -> build/lib.linux-x86_64-3.1/httplib2
copying httplib2/__init__.py -> build/lib.linux-x86_64-3.1/httplib2
running install_lib
creating /usr/local/lib/python3.1/dist-packages/httplib2
copying build/lib.linux-x86_64-3.1/httplib2/iri2uri.py -> /usr/local/lib/python3.1/dist-packages/httplib2
copying build/lib.linux-x86_64-3.1/httplib2/__init__.py -> /usr/local/lib/python3.1/dist-packages/httplib2
byte-compiling /usr/local/lib/python3.1/dist-packages/httplib2/iri2uri.py to iri2uri.pyc
byte-compiling /usr/local/lib/python3.1/dist-packages/httplib2/__init__.py to __init__.pyc
running install_egg_info
Writing /usr/local/lib/python3.1/dist-packages/httplib2-python3_0.5.0.egg-info</samp></pre>
<p>To use
<code>httplib2</code>,
create an
instance
of the
<code>httplib2.Http</code>
class.
<pre
class=screen>
<samp class=p>>>> </samp><kbd class=pp>import httplib2</kbd>
<a><samp class=p>>>> </samp><kbd class=pp>h = httplib2.Http('.cache')</kbd> <span class=u>①</span></a>
<a><samp class=p>>>> </samp><kbd class=pp>response, content = h.request('http://diveintopython3.org/examples/feed.xml')</kbd> <span class=u>②</span></a>
<a><samp class=p>>>> </samp><kbd class=pp>response.status</kbd> <span class=u>③</span></a>
<samp class=pp>200</samp>
<a><samp class=p>>>> </samp><kbd class=pp>content[:52]</kbd> <span class=u>④</span></a>
<samp class=pp>b"<?xml version='1.0' encoding='utf-8'?>\r\n<feed xmlns="</samp>
<samp class=p>>>> </samp><kbd class=pp>len(content)</kbd>
<samp class=pp>3070</samp></pre>
<ol>
<li>The
primary
interface
to
<code>httplib2</code>
is
the
<code>Http</code>
object.
For
reasons
you’ll
see
in
the
next
section,
you
should
always
pass
a
directory
name
when
you
create
an
<code>Http</code>
object.
The
directory
does
not
need
to
exist;
<code>httplib2</code>
will
create
it if
necessary.
<li>Once
you
have
an
<code>Http</code>
object,
retrieving
data
is as
simple
as
calling
the
<code>request()</code>
method
with
the
address
of
the
data
you
want.
This
will
issue
an
<abbr>HTTP</abbr>
<code>GET</code>
request
for
that
<abbr>URL</abbr>.
(Later
in
this
chapter,
you’ll
see
how
to
issue
other
<abbr>HTTP</abbr>
requests,
like
<code>POST</code>.)
<li>The
<code>request()</code>
method
returns
two
values.
The
first
is an
<code>httplib2.Response</code>
object,
which
contains
the
<abbr>HTTP</abbr>
status
code
and
all
the
headers
the
server
returned.
For
example,
a
<code>status</code>
of
<code>200</code>
indicates
that
the
request
was
successful.
<li>The
<var>content</var>
variable
contains
the
actual
data
that
was
returned
by
the
<abbr>HTTP</abbr>
server.