DOM
Extraction Beginner
What is the DOM?
The Document Object Model (DOM) is a programming interface for HTML and XML documents. It represents the page so that programs can change the document structure, style, and content dynamically. When you open a webpage, the browser parses the HTML and builds a tree-like structure where every element is a “node” with relationships to other nodes.
The DOM is the browser’s in-memory representation of the parsed page, and it is what a scraper queries when it locates elements. Once you understand how it is structured, you can target data on almost any webpage. The DOM is also what lets JavaScript do everything from updating a single paragraph to rebuilding the entire page layout, all without reloading.
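To make "programs can change the document structure, style, and content" concrete, here is a minimal sketch in Python. It uses the standard-library `xml.etree.ElementTree` parser as a stand-in for a browser's DOM. That is an assumption made purely for illustration: a real browser exposes the DOM to JavaScript and tolerates malformed HTML, which this parser does not. The `SOURCE` fragment is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, kept well-formed so the XML parser accepts it.
SOURCE = '<div id="container"><p class="content">Text here</p></div>'

root = ET.fromstring(SOURCE)      # parse the markup into a tree of nodes
paragraph = root.find("p")        # reach a child node through its parent

paragraph.text = "Updated text"                 # change content
paragraph.set("class", "content highlighted")   # change an attribute
note = ET.SubElement(root, "span")              # change structure: add a node
note.text = "Added later"

# Serialize the modified tree back to markup.
updated = ET.tostring(root, encoding="unicode")
print(updated)
```

The point of the sketch is that every change is made on the tree, not on the original text; the markup is only regenerated when the tree is serialized.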
Understanding DOM Structure
<!-- HTML Source -->
<html>
  <body>
    <div id="container">
      <h1>Title</h1>
      <p class="content">Text here</p>
    </div>
  </body>
</html>

<!-- DOM Tree -->
document
└── html
    └── body
        └── div#container
            ├── h1
            │   └── "Title"
            └── p.content
                └── "Text here"
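The mapping from source to tree can be reproduced with a short script. The sketch below assumes the sample markup is well-formed, so Python's standard-library `xml.etree.ElementTree` can parse it; the helper names `label` and `outline` are hypothetical. It is a simplification of what a browser builds: there is no `document` node, and whitespace-only text is skipped.

```python
import xml.etree.ElementTree as ET

# The sample page from this section.
PAGE = """<html>
  <body>
    <div id="container">
      <h1>Title</h1>
      <p class="content">Text here</p>
    </div>
  </body>
</html>"""


def label(element):
    """Return a CSS-like label such as div#container or p.content."""
    name = element.tag
    if element.get("id"):
        name += "#" + element.get("id")
    if element.get("class"):
        name += "." + ".".join(element.get("class").split())
    return name


def outline(element, depth=0):
    """Return one indented line per element and per non-empty text node."""
    lines = ["  " * depth + label(element)]
    text = (element.text or "").strip()
    if text:
        lines.append("  " * (depth + 1) + f'"{text}"')
    for child in element:  # iterating an element yields its direct children
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(ET.fromstring(PAGE))
print("\n".join(tree_lines))
```

Each level of indentation in the printed outline corresponds to one parent-child step in the tree.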
Key DOM Operations for Scraping
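| Operation | JavaScript | XPath | CSS Selector |
|---|---|---|---|
| Select by ID | getElementById() | //*[@id='x'] | #x |
| Select by class | getElementsByClassName() | //*[contains(concat(' ', normalize-space(@class), ' '), ' x ')] | .x |
| Query single | querySelector() | (//x)[1] | x |
| Query all | querySelectorAll() | //x | x |
| Navigate parent | parentNode | .. | N/A |
| Navigate children | children | //x/y | x > y |
| Navigate descendants | querySelectorAll() called on the element | //x//y | x y |

Note that //*[@class='x'] only matches elements whose class attribute is exactly "x"; the longer expression above also matches elements with several classes, which is how getElementsByClassName() and .x behave.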
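As a rough illustration of these operations outside the browser, the sketch below runs comparable queries with Python's standard-library `xml.etree.ElementTree`, which supports only a subset of XPath. Several things here are assumptions for the example: the second paragraph with `class="content note"` is hypothetical and was added to show how class matching behaves, the `has_class` helper is hypothetical, and the markup is well-formed so that the XML parser can read it.

```python
import xml.etree.ElementTree as ET

# Sample page plus one hypothetical paragraph carrying two classes.
PAGE = """<html>
  <body>
    <div id="container">
      <h1>Title</h1>
      <p class="content">Text here</p>
      <p class="content note">More text</p>
    </div>
  </body>
</html>"""

root = ET.fromstring(PAGE)

# Select by ID.
by_id = root.find(".//*[@id='container']")

# Exact attribute match: misses the element with class="content note".
exact_class = root.findall(".//*[@class='content']")


def has_class(element, name):
    """True if `name` is one of the space-separated class tokens."""
    return name in (element.get("class") or "").split()


# Token-based match, closer to how a CSS class selector behaves.
by_class_token = [el for el in root.iter() if has_class(el, "content")]

first_p = root.find(".//p")                 # first match only
all_p = root.findall(".//p")                # every match
parent_of_h1 = root.find(".//h1/..")        # step up to the parent
direct_children = root.findall(".//div/*")  # direct children only
descendants = root.findall(".//body//p")    # any depth below body

print(by_id.tag, len(exact_class), len(by_class_token))
print(first_p.text, len(all_p), parent_of_h1.get("id"))
print([el.tag for el in direct_children], len(descendants))
```

For real pages, a dedicated HTML parser is usually a better fit than an XML parser, because most HTML on the web is not well-formed XML.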
Pro tip: In browser DevTools, $0 refers to the currently selected element. Inspect an element, then run $0.parentElement to quickly access its parent in the DOM.