What is Wayback Machine? | Definition from TechTarget

What is Wayback Machine?

The Internet Archive’s Wayback Machine is a digital archive of information on the internet. The Internet Archive, a nonprofit organization based in San Francisco, made it public in 2001.

Users can access archived versions of webpages with Wayback Machine. Wayback Machine holds more than 832 billion archived webpages, dating back to 1996. In addition to webpages, the Internet Archive stores books, movies, television, music and other content. The Internet Archive takes up more than 40 petabytes of data storage, and Wayback Machine takes up a significant portion of that.

Why is Wayback Machine important?

The Internet Archive was one of the first organizations to archive the internet. Wayback Machine, therefore, serves as a unique record of the internet’s early days, from a time before most other organizations were preserving it.

The internet is continually growing and changing, and webpages can be deleted or edited at any time without leaving behind any artifact. Wayback Machine preserves the history of the internet even after those pages have been edited or deleted.

How does Wayback Machine work?

Wayback Machine automatically crawls and captures snapshots of webpages at various points in time. These snapshots are then stored, attached to timestamps and made accessible to users.
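As an illustration of how a timestamp is attached to a snapshot, the sketch below builds a snapshot address from an original URL and a capture time. It assumes the publicly visible address pattern of web.archive.org/web/ followed by a 14-digit timestamp and the original URL. The function name snapshot_url is hypothetical and is not part of any official library.

```python
from datetime import datetime

# Assumption: archived pages are addressed as
# https://web.archive.org/web/<YYYYMMDDhhmmss>/<original URL>
WAYBACK_PREFIX = "https://web.archive.org/web"


def snapshot_url(original_url: str, captured_at: datetime) -> str:
    """Build the address of a snapshot for the given capture time."""
    # Format the capture time as a 14-digit timestamp: YYYYMMDDhhmmss.
    stamp = captured_at.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{stamp}/{original_url}"


print(snapshot_url("http://example.com/", datetime(2001, 1, 1)))
```

If no capture exists at the exact time requested, the service typically serves the capture closest to it.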

Wayback Machine uses several different crawlers — some from third-party sources and some from the Internet Archive. Users can also submit a page for manual archival.

Websites are typically constructed using a combination of files, such as Hypertext Markup Language (HTML), JavaScript, Cascading Style Sheets (CSS) and image files. Each file has its own URL, which Wayback Machine captures so it can display the full page as it looked to the user. Images on a webpage, for instance, have URLs separate from that of the main page. These file URLs may be captured at different times from the URL of the page itself, so an image might be crawled and recorded days after the main HTML of a page is crawled.
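To make the timing difference concrete, the following minimal sketch measures how far each file capture sits from the capture of the main page. The timestamps and URLs are invented for illustration, and the helper names parse_stamp and capture_skew are hypothetical.

```python
from datetime import datetime, timedelta


def parse_stamp(stamp: str) -> datetime:
    """Parse a 14-digit capture timestamp (YYYYMMDDhhmmss)."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S")


def capture_skew(page_stamp: str, resource_stamps: dict[str, str]) -> dict[str, timedelta]:
    """Return how long after (or before) the page each resource was captured."""
    page_time = parse_stamp(page_stamp)
    return {
        url: parse_stamp(stamp) - page_time
        for url, stamp in resource_stamps.items()
    }


# Hypothetical captures: the HTML was crawled on Jan. 5, 2010, at noon.
skew = capture_skew(
    "20100105120000",
    {
        "http://example.com/style.css": "20100105120500",
        "http://example.com/logo.png": "20100108120000",
    },
)
for url, delta in skew.items():
    print(url, delta)
```

A large gap for a given file suggests the replayed page may mix content from different moments in the site's history.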

To search from the Wayback Machine homepage, users enter a site’s URL into the search bar and a date range for the content they want to access.

The Wayback Machine search results page shows a graph of the number of times a webpage has been crawled since 1996 and a calendar that lists crawls per day. Users can hover over each crawl to see its date, time and reason.

Wayback Machine has several different features to display webpage data, including the following:

  • Collections page. This lets users see why a page was crawled.
  • Changes page. This shows how much a page has changed over time.
  • Compare feature. This lets users compare two different captures from two different times side by side.
  • Summary feature. This shows information about the entire domain.
  • Sitemap feature. This shows information about the linking structure of the site over time.
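The idea behind the Changes and Compare features can be approximated locally. The sketch below is not the Wayback Machine's own method; it uses Python's standard difflib module on two invented captures to estimate how much a page changed and to list the differing lines. The names change_ratio and diff_lines are hypothetical.

```python
import difflib


def change_ratio(old_text: str, new_text: str) -> float:
    """Return 0.0 for identical captures and 1.0 for no lines in common."""
    matcher = difflib.SequenceMatcher(
        None, old_text.splitlines(), new_text.splitlines()
    )
    return 1.0 - matcher.ratio()


def diff_lines(old_text: str, new_text: str) -> list[str]:
    """Return a unified diff of the two captures, one entry per line."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="older capture",
            tofile="newer capture",
            lineterm="",
        )
    )


# Invented page content standing in for two captures of the same URL.
older = "<h1>Welcome</h1>\n<p>Opening soon.</p>"
newer = "<h1>Welcome</h1>\n<p>Now open.</p>"

print(change_ratio(older, newer))
print("\n".join(diff_lines(older, newer)))
```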

Users can click on a particular capture and view the provenance of a page. Users can also save pages to a personal web archive in their account.

In addition to searching by URL, users can search by keyword. Keyword search on Wayback Machine differs from keyword search on Google or similar search engines: it looks for entire domains related to a keyword, not individual pages.

The Save Page Now feature saves the one URL entered in the search bar. There are also Wayback Machine Chrome extensions, web browser add-ons, a WordPress plugin and an iOS app.
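Save Page Now can also be reached with a plain HTTP request. The sketch below only prepares such a request and does not send it. It assumes the commonly used address pattern of web.archive.org/save/ followed by the page URL; the function name save_page_now_request is hypothetical, and limits or authentication requirements may apply to real submissions.

```python
from urllib.request import Request

# Assumption: a GET request to this prefix plus the page URL asks the
# service to capture that one URL.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def save_page_now_request(page_url: str) -> Request:
    """Prepare (but do not send) a capture request for a single URL."""
    return Request(SAVE_ENDPOINT + page_url, method="GET")


request = save_page_now_request("http://example.com/")
print(request.get_method(), request.full_url)

# Sending the request needs network access, for example:
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       print(response.status)
```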

How is Wayback Machine used?

Here are some basic ways to use Wayback Machine:

  • View and compare changes between two iterations of a webpage.
  • See why or when a page was crawled.
  • See who is crawling what webpages.
  • View old versions of webpages.
  • View webpages that no longer exist.
  • Troubleshoot problems with a webpage.
  • Save pages manually to Wayback Machine.
  • Link to old webpages.
  • Conduct large-scale crawls.

These basic functions have many applied uses, including search engine optimization (SEO), web development, journalism, open source intelligence (OSINT) gathering and legal research. For example, SEO-motivated users can find old versions of websites that were never redirected to live versions and fix broken links. They can also revisit old versions of pages that performed better to see if there are any elements worth re-including in new content.

Users can also look at Wayback Machine to see how frequently their competitors update content. Legal researchers could use the tool to gather evidence for a legal case. Web developers could use it to troubleshoot or debug websites by comparing past versions of a site to pinpoint when a particular bug was introduced. Journalists could use the service to access historical documents or perform fact checks. Cybersecurity researchers could look for OSINT in older iterations of a webpage or in information that has since been deleted. And archivists at Wikipedia can use Wayback Machine to help alleviate link rot.

Wayback Machine application programming interfaces (APIs) let users automate data retrieval functions at scale. Internet Archive APIs can also read and write metadata, media and other files to and from items in the archive. Wayback Machine has several APIs, including the following:

  • Wayback Availability JSON. This tests if a URL is archived and accessible in Wayback Machine.
  • Memento. This provides additional interfaces for querying snapshots in Wayback Machine.
  • Wayback CDX Server. This enables complex filtering, querying and analysis of Wayback Machine capture data.
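As a rough illustration of the first and third APIs, the sketch below builds query URLs and parses responses without making any network calls. The endpoint addresses and field names follow the publicly documented formats as best understood here and should be checked against current documentation. The helper names and the sample responses are illustrative assumptions.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Assumed endpoints; verify against the current API documentation.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def availability_query(url: str, timestamp: Optional[str] = None) -> str:
    """Build a Wayback Availability query, optionally near a timestamp."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response_text: str) -> Optional[str]:
    """Return the closest archived URL from a JSON response, if any."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def cdx_query(url: str, start: Optional[str] = None,
              end: Optional[str] = None, limit: Optional[int] = None) -> str:
    """Build a CDX Server query that asks for JSON output."""
    params = {"url": url, "output": "json"}
    if start:
        params["from"] = start
    if end:
        params["to"] = end
    if limit:
        params["limit"] = limit
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def cdx_rows(response_text: str) -> list[dict[str, str]]:
    """Turn CDX JSON (a header row, then capture rows) into dictionaries."""
    rows = json.loads(response_text)
    if not rows:
        return []
    header, *captures = rows
    return [dict(zip(header, row)) for row in captures]


# Sample responses with illustrative values, not live data.
SAMPLE_AVAILABILITY = json.dumps({"archived_snapshots": {"closest": {
    "available": True, "status": "200", "timestamp": "20060101064348",
    "url": "http://web.archive.org/web/20060101064348/http://www.example.com:80/"}}})
SAMPLE_CDX = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/", "20020120142510", "http://example.com:80/", "text/html",
     "200", "HT2DYGA5UKZCPBSFVCV3JOBXGW2G5UUA", "1792"],
])

print(availability_query("example.com", "20060101"))
print(closest_snapshot(SAMPLE_AVAILABILITY))
print(cdx_query("example.com", start="2010", end="2011", limit=5))
print(cdx_rows(SAMPLE_CDX)[0]["timestamp"])
```

Keeping URL construction and response parsing separate from the network call makes the logic easy to test. A real script would pass each query URL to an HTTP client and feed the response body to the matching parser.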

The Internet Archive’s subscription service — Archive-It — lets organizations archive websites and create custom collections of content.

History of Wayback Machine

The Internet Archive, a nonprofit, was founded in 1996 by Brewster Kahle and Bruce Gilliat to archive the internet in its nascent stages and to pursue the goal of providing universal access to all knowledge. Wayback Machine began archiving webpages in 1996 and was formally released to the public in 2001, by which time it contained over 10 billion archived pages. Kahle also co-founded the for-profit web crawling company Alexa Internet, which was for many years one of the Internet Archive’s most prominent sources of crawl data.

The Internet Archive now hosts several other projects, including the National Aeronautics and Space Administration images archive and the book information site Open Library. The Internet Archive also collaborates with many institutions to maintain these libraries, including the Library of Congress and Smithsonian Institution.

The name Wayback Machine is a reference to the animated cartoon The Adventures of Rocky and Bullwinkle and Friends. In it, the characters used the WABAC — pronounced wayback — machine to travel through time and participate in various historical events.

Limitations of Wayback Machine

Not all webpages are archived in Wayback Machine. Some websites block Wayback Machine’s crawlers. Others might not be archived for various reasons, such as site owners requesting exclusion or pages requiring a password to access. Sometimes, a site’s robots.txt file keeps the site from being crawled. Robots.txt files give instructions to web crawlers, indicating which parts of a site they can and can’t visit. Pages without inbound links from other websites are more difficult to archive, too. In some cases, JavaScript can be hard to archive as well. HTML is the easiest type of content for Wayback Machine to archive.
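The effect of a robots.txt rule can be checked with Python's standard urllib.robotparser module. In the sketch below, the file content is invented and the user-agent token ia_archiver is an assumption used for illustration; the crawlers that feed Wayback Machine may identify themselves differently and may not all follow these rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for example.com.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /private/

User-agent: *
Disallow:
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether the rules let the given crawler fetch the URL."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for path in ("/about", "/private/report.html"):
    url = "https://example.com" + path
    print(path, is_crawl_allowed(ROBOTS_TXT, "ia_archiver", url))
```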

Additionally, the frequency of snapshots can vary, so not every change to a website is captured. It can sometimes take months for a webpage to appear in Wayback Machine after being collected.
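One way to see how uneven snapshot frequency can be is to measure the gaps between consecutive captures of a URL. The sketch below does this for a list of 14-digit capture timestamps; the sample values are invented and the function name capture_gaps is hypothetical.

```python
from datetime import datetime, timedelta


def capture_gaps(stamps: list[str]) -> list[timedelta]:
    """Return the time between each pair of consecutive captures."""
    times = sorted(datetime.strptime(s, "%Y%m%d%H%M%S") for s in stamps)
    return [later - earlier for earlier, later in zip(times, times[1:])]


# Invented capture times for a single URL.
stamps = ["20200101000000", "20200102000000", "20200401000000"]
gaps = capture_gaps(stamps)
print("Longest gap between captures:", max(gaps))
```

Any edit made and reverted inside a long gap would leave no trace in the archive.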

In general, Wayback Machine doesn’t collect or archive personal emails or chats from private sources. It also doesn’t capture dynamic content well. For example, a user could not open an archived copy of the Google search page from 2010 and use it to run searches as they would have worked at the time.