|
337 | 337 | </li>
|
338 | 338 |
|
339 | 339 | <li class="md-nav__item">
|
340 |
| - <a href="#dont-steal-copyright-and-fair-use" class="md-nav__link"> |
341 |
| - Don't steal: Copyright and fair use |
| 340 | + <a href="#copyright-respecting-others-intellectual-property" class="md-nav__link"> |
| 341 | + Copyright: respecting others' intellectual property |
342 | 342 | </a>
|
343 | 343 |
|
344 | 344 | </li>
|
|
348 | 348 | Better be safe than sorry
|
349 | 349 | </a>
|
350 | 350 |
|
| 351 | + <nav class="md-nav"> |
| 352 | + <ul class="md-nav__list"> |
| 353 | + |
| 354 | + <li class="md-nav__item"> |
| 355 | + <a href="#challenge" class="md-nav__link"> |
| 356 | + Challenge |
| 357 | + </a> |
| 358 | + |
| 359 | +</li> |
| 360 | + |
| 361 | + </ul> |
| 362 | + </nav> |
| 363 | + |
351 | 364 | </li>
|
352 | 365 |
|
353 | 366 | <li class="md-nav__item">
|
|
427 | 440 | </li>
|
428 | 441 |
|
429 | 442 | <li class="md-nav__item">
|
430 |
| - <a href="#dont-steal-copyright-and-fair-use" class="md-nav__link"> |
431 |
| - Don't steal: Copyright and fair use |
| 443 | + <a href="#copyright-respecting-others-intellectual-property" class="md-nav__link"> |
| 444 | + Copyright: respecting others' intellectual property |
432 | 445 | </a>
|
433 | 446 |
|
434 | 447 | </li>
|
|
438 | 451 | Better be safe than sorry
|
439 | 452 | </a>
|
440 | 453 |
|
| 454 | + <nav class="md-nav"> |
| 455 | + <ul class="md-nav__list"> |
| 456 | + |
| 457 | + <li class="md-nav__item"> |
| 458 | + <a href="#challenge" class="md-nav__link"> |
| 459 | + Challenge |
| 460 | + </a> |
| 461 | + |
| 462 | +</li> |
| 463 | + |
| 464 | + </ul> |
| 465 | + </nav> |
| 466 | + |
441 | 467 | </li>
|
442 | 468 |
|
443 | 469 | <li class="md-nav__item">
|
@@ -499,16 +525,21 @@ <h3 id="dont-break-the-web-denial-of-service-attacks">Don't break the web: Denia
|
499 | 525 | <p>In fact, this is such an efficient way to disrupt a web site that hackers often do it on purpose. This is called a <a href="https://en.wikipedia.org/wiki/Denial-of-service_attack">Denial of Service (DoS) attack</a>.</p>
|
500 | 526 | <p>Since DoS attacks are unfortunately a common occurrence on the Internet, modern web servers include measures to ward off such illegitimate use of their resources. They watch for large numbers of requests appearing to come from a single computer or IP address, and their first line of defense often involves refusing any further requests coming from this IP address.</p>
|
501 | 527 | <p>A web scraper, even one with legitimate purposes and no intent to bring a website down, can exhibit similar behaviour and, if we are not careful, result in our computer being banned from accessing a website.</p>
|
502 |
| -<h3 id="dont-steal-copyright-and-fair-use">Don't steal: Copyright and fair use<a class="headerlink" href="#dont-steal-copyright-and-fair-use" title="Permanent link">¶</a></h3> |
503 |
| -<p>It is important to recognize that in certain circumstances web scraping can be illegal. If the terms and conditions of the web site we are scraping specifically prohibit downloading and copying its content, then we could be in trouble for scraping it.</p> |
504 |
| -<p>In practice, however, web scraping is a tolerated practice, provided reasonable care is taken not to disrupt the “regular” use of a web site, as we have seen above.</p> |
505 |
| -<p>In a sense, web scraping is no different than using a web browser to visit a web page, in that it amounts to using computer software (a browser vs a scraper) to acccess data that is publicly available on the web.</p> |
506 |
| -<p>In general, if data is publicly available (the content that is being scraped is not behind a password-protected authentication system), then it is OK to scrape it, provided we don’t break the web site doing so. What is potentially problematic is if the scraped data will be shared further. For example, downloading content off one website and posting it on another website (as our own), unless explicitely permitted, would constitute copyright violation and be illegal.</p> |
507 |
| -<p>However, most copyright legislations recognize cases in which reusing some, possibly copyrighted, information in an aggregate or derivative format is considered “fair use”. In general, unless the intent is to pass off data as our own, copy it word for word or trying to make money out of it, reusing publicly available content scraped off the internet is OK.</p> |
| 528 | +<h3 id="copyright-respecting-others-intellectual-property">Copyright: respecting others' intellectual property<a class="headerlink" href="#copyright-respecting-others-intellectual-property" title="Permanent link">¶</a></h3> |
| 529 | +<p>It is important to recognize that in certain circumstances web scraping can be illegal, and this <strong>differs from country to country</strong>.</p> |
| 530 | +<p>If the terms and conditions of the web site we are scraping specifically prohibit downloading and copying its content, then we could be in trouble for scraping it. In practice, however, web scraping is generally tolerated, provided reasonable care is taken not to disrupt the “regular” use of a web site, as we have seen above. Even so, you must be aware that without permission from the copyright owner you <em>may</em> be in breach of copyright law.</p> |
| 531 | +<p>In a sense, web scraping is no different than using a web browser to visit a web page, in that it amounts to using computer software (a browser vs a scraper) to access data that is publicly available on the web. However, researchers should be aware of the legal risk, since the law views automated web scraping differently from web browsing.</p> |
| 532 | +<p>In general, if data is publicly available (the content that is being scraped is not behind a password-protected authentication system), then it may be OK to scrape it, provided we don’t break the web site doing so. What is potentially problematic is if the scraped data will be shared further. For example, downloading content off one website and posting it on another website (as our own), unless explicitly permitted, may constitute a violation of copyright law.</p> |
| 533 | +<p>Copyright law in some countries recognises "fair use" (USA) or "fair dealing" (Australia), which may, under very specific circumstances, allow reusing some copyrighted material. However, the scope of these exceptions is narrow and you should not assume they apply to your case.</p> |
| 534 | +<p>For an interesting (Australian) copyright case involving web scraping, see <a href="https://www.claytonutz.com/knowledge/2009/april/copyright-in-compilations-under-the-spotlight-in-high-court">IceTV vs Channel Nine</a>.</p> |
508 | 535 | <h3 id="better-be-safe-than-sorry">Better be safe than sorry<a class="headerlink" href="#better-be-safe-than-sorry" title="Permanent link">¶</a></h3>
|
509 | 536 | <p>Be aware that copyright and data privacy legislation typically differs from country to country. Be sure to check the laws that apply in your context. For example, in Australia, it can be illegal to scrape and store personal information such as names, phone numbers and email addresses, even if they are publicly available.</p>
|
510 | 537 | <p>If you are looking to scrape data for your own personal use, then the above guidelines should probably be all that you need to worry about. However, if you plan to start harvesting a large amount of data for research or commercial purposes, you should probably seek legal advice first.</p>
|
511 |
| -<p>If you work in a university, chances are it has a copyright office that will help you sort out the legal aspects of your project. The university library is often the best place to start looking for help on copyright.</p> |
| 538 | +<p>If you work in a university, chances are it has a copyright office that will help you sort out the legal aspects of your project. The university library is often the best place to start looking for help on copyright-related queries.</p> |
| 539 | +<h4 id="challenge">Challenge<a class="headerlink" href="#challenge" title="Permanent link">¶</a></h4> |
| 540 | +<ul> |
| 541 | +<li>What are the contact details for the copyright office (or similar) at your organisation?</li> |
| 542 | +</ul> |
512 | 543 | <h3 id="be-nice-ask-and-share">Be nice: ask and share<a class="headerlink" href="#be-nice-ask-and-share" title="Permanent link">¶</a></h3>
|
513 | 544 | <p>Depending on the scope of your project, it might be worthwhile to consider asking the owners or curators of the data you are planning to scrape if they have it already available in a structured format that could suit your project. If your aim is to use their data for research, or to use it in a way that could potentially interest them, asking could not only save you the trouble of writing a web scraper, but also help clarify straight away what you can and cannot do with the data.</p>
|
514 | 545 | <p>On the other hand, when you are publishing your own data, as part of a research project, documentation or a public website, you might want to think about whether someone might be interested in getting your data for their own project. If you can, try to provide others with a way to download your raw data in a structured format, and thus save them the trouble of trying to scrape your pages!</p>
|
|