You are copying and pasting from 47 browser tabs. There is a better way.
Content audits, competitive research, link prospecting. They all start the same way. You open a webpage, scan the headings, copy some text into a spreadsheet, open the next tab, and repeat. By tab fifteen your eyes glaze over. By tab forty-seven you have missed things. And your spreadsheet is a mess of inconsistent formatting.
The analysis part is not the problem. Getting the raw text into a spreadsheet is the problem.
SheetMagic gives you formulas that extract headings, paragraphs, and full page content from any URL. The data goes directly into Google Sheets. You write one formula, point it at a URL, and the content lands in your cell. Then you drag the formula down a column of URLs. You can process hundreds of pages in minutes instead of hours.
This guide covers every content extraction formula. It also includes three real workflows you can build today, plus techniques for cleaning and organizing extracted text at scale.
If you are new to SheetMagic, start with the getting started guide to install the add-on and run your first formula. For a broader overview of all scraping tools, see the complete web scraping guide.
The content extraction formulas
SheetMagic includes purpose-built formulas for every level of content extraction. From a single H1 tag to the full text of an entire page. Each formula targets a specific piece of content so you pull exactly what you need.
=GETH1() -- The main heading
=GETH1(A2)
Returns the H1 heading from the page at the URL in A2. Most pages have exactly one H1 -- the main title.
The H1 is the single most important heading on any page. It tells you (and search engines) what the page is about. Pulling H1s across your own site shows you duplicate topics, missing pages, and keyword overlap.
=GETH2() -- Section headings
=GETH2(A2)
Returns all H2 headings from the page. H2s form the content outline -- the major sections and subtopics covered.
H2 headings are the backbone of long-form content. Pulling them gives you an instant outline of any page. You don't even have to read a single paragraph.
=GETHEADINGS() -- Full heading hierarchy
=GETHEADINGS(A2)
Returns every heading on the page (H1 through H6) in document order. Shows the complete content structure at a glance.
When you need the full picture, =GETHEADINGS() returns every heading from H1 through H6. This is the formula to use for content audits where heading hierarchy and nesting matter.
=GETP() -- Paragraph text
=GETP(A2)
Returns all paragraph text from the page. Strips navigation, footers, and sidebars -- just the body content.
=GETP() extracts the paragraph-level text from a page. This is useful when you need the actual written content for analysis, word counting, or feeding into an AI formula.
=VISIT() -- Full page content
=VISIT(A2)
Returns up to 50,000 characters of text content from the URL. Supports JavaScript-rendered pages.
=VISIT() is the most powerful extraction formula. It pulls the full rendered text of a page, including headings, paragraphs, lists, and other text elements. Use this when you need everything. Or when you plan to process the content with =AITEXT() to pull out specific pieces.
=GETSELECTOR() -- Precision extraction
=GETSELECTOR(A2, "article.post-content")
Extracts only the content matching the CSS selector. Targets a specific section of the page with surgical precision.
When the other formulas return too much or too little, =GETSELECTOR() lets you target exactly the HTML element you need. You do need to know the CSS selector (right-click, then Inspect in your browser). But the precision is hard to beat.
Not sure which formula to use? Start with =GETH1() and =GETH2() for quick structural analysis. Use =GETHEADINGS() for full audits. Use =VISIT() when you need the complete text for AI processing. Use =GETSELECTOR() when you know exactly which HTML element contains your target data.
Workflow 1: Content audit, extract every heading from your site
This is the highest-value use case for content extraction. You paste your sitemap URLs into a spreadsheet and extract the heading structure of every page on your site. In fifteen minutes you have a complete content inventory that would take a full day to build by hand.
List your page URLs
Paste your sitemap URLs into column A. If you do not have a clean list, use =SERP("site:yourdomain.com") to pull indexed URLs from Google, or export your sitemap XML.
For this example, assume column A has 200 URLs from your blog.
Extract H1 headings
=GETH1(A2)
Pulls the main heading from each page. Put this in B2 and drag down to B201.
Scan column B for duplicates and missing values. Duplicate H1s point to keyword overlap. Missing H1s point to broken pages or poor template markup.
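If you would rather not scan by eye, a native Sheets formula can flag the duplicates for you. No credits needed. This sketch assumes your H1s sit in B2:B201 as in this example:

=COUNTIF(B$2:B$201, B2)>1
Returns TRUE in any row whose H1 appears more than once in the column. Drag it down an adjacent column, then filter for TRUE.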
Extract H2 outlines
=GETH2(A2)
Pulls all H2 section headings. Put this in C2 and drag down. Each cell contains the full outline of that page.
Now you can see what every page covers without opening a single browser tab. Compare outlines across pages to find content overlap.
Pull meta titles and descriptions
=GETMETATITLE(A2)
Extracts the title tag. Compare against H1 to check for consistency.
=GETMETADESCRIPTION(A2)
Extracts the meta description. Check for missing or duplicate descriptions across pages.
Put these in columns D and E. You now have H1, H2s, title tag, and meta description for every page on your site. Those are the four pillars of an on-page SEO audit.
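Before the AI pass below, two native Sheets formulas can quantify the most common problems. No credits needed. This sketch assumes titles in D2:D201 and descriptions in E2:E201, as in this example:

=COUNTBLANK(E2:E201)
Counts pages with no meta description at all.

=COUNTIF(D$2:D$201, D2)>1
Returns TRUE in any row whose title tag is duplicated elsewhere in the column.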
For a deeper dive into building SEO audit workflows, watch for the upcoming guide on meta titles and descriptions.
Flag issues with AI
=AITEXT("Compare this page title and H1 heading. Are they aligned on the same topic? Is the title under 60 characters? Is the H1 clear and specific? Return: OK if everything looks good, or a one-sentence issue description if not.", D2&" ||| "&B2)
Automated quality check across every page. Put this in F2 and drag down, then filter column F for anything that is not OK to build your fix list.
This turns raw extraction into something you can act on. The AI checks every page against SEO best practices and flags the ones that need your attention.
You can try all of this on the free tier. No credit card needed. You get 10 integration credits (one-time trial), enough to audit a handful of pages and see how the workflow feels. Paid plans start at 1,000 credits/month. Each URL extraction uses 1 credit.
Workflow 2: Competitive content research
You want to write a blog post that outranks the current top 10 results. Before writing a single word, you need to understand what the competition covers, what they miss, and how they structure their content.
Pull the top-ranking URLs
=SERP("extract text from webpage Google Sheets")
Returns up to 20 search results with titles and URLs. The top 10 are your direct competitors for this keyword.
Copy the top 10 URLs into column A of a new sheet.
Extract every competitor's outline
=GETHEADINGS(A2)
Pulls the full heading hierarchy of each competing page. Put in B2 and drag down to B11.
You now have the complete content structure of every page ranking for your target keyword. That is your competitive landscape in ten cells.
Extract the full body text
=VISIT(A2)
Pulls the full page content for deeper analysis. Put in C2 and drag down.
This gives you the raw material for content gap analysis. You can see what topics competitors cover (from headings) and how deeply they cover them.
Analyze content gaps
=AITEXT("I am analyzing 5 competing articles for the keyword: extract text from webpage Google Sheets. Here are their heading outlines. Identify 3-5 subtopics that are poorly covered or completely missing across all of them. Return as a numbered list.", B2&" ||| "&B3&" ||| "&B4&" ||| "&B5&" ||| "&B6)
Feeds multiple competitor outlines into a single AI analysis. The gaps it identifies become your content advantage.
The subtopics the AI finds are your opportunity. Cover what the competition misses and you have a structural advantage before writing quality even enters the picture.
Generate your content brief
=AITEXT("Based on these content gaps and the top-ranking outlines, create a detailed content brief for a blog post targeting the keyword: extract text from webpage Google Sheets. Include: recommended H2 sections, key points to cover under each, and suggested word count per section.", F2&" ||| "&B2&" ||| "&B3)
Turns competitive intelligence into a writing blueprint. Your content brief is informed by real SERP data, not guesswork.
This entire workflow takes about ten minutes and 15-20 integration credits. The alternative is reading ten articles by hand, taking notes, and hoping you catch the patterns. The spreadsheet approach is faster, more organized, and produces a brief you can hand to any writer.
Workflow 3: Research compilation
You are writing a report and need to gather information from dozens of sources. Academic papers, industry reports, news articles. They all sit on different websites. And they all need manual copy-paste to get the relevant sections into your working document.
Collect your source URLs
Paste the URLs of your research sources into column A. These could come from a SERP search, a curated reading list, or bookmarks you have been collecting.
Extract the full text
=VISIT(A2)
Pulls the full text content of each source. One credit per URL.
Extract specific information with AI
=AITEXT("From this article, extract: 1) The main thesis or finding, 2) Any statistics or data points cited, 3) The methodology if mentioned. Format as three sections separated by pipes.", B2)
Turns a full page of text into three structured data points. Customize the extraction prompt for your research needs.
Use =SPLIT(C2, "|") in adjacent columns to break the extraction into separate cells. Now each source has its key findings in structured columns you can sort, filter, and analyze.
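If the AI leaves stray spaces around the pipe characters, a TRIM wrapper cleans each piece as it splits. This is a native Sheets pattern, no credits needed:

=ARRAYFORMULA(TRIM(SPLIT(C2, "|")))
Splits the extraction into adjacent cells and trims whitespace from every piece in one formula.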
Synthesize across sources
=AITEXT("Synthesize the key findings from these 10 sources into a 3-paragraph summary. Identify points of agreement, disagreement, and gaps in the research.", C2:C11)
Cross-source synthesis that would take hours to do manually. The AI identifies patterns across your entire source list.
Cleaning and formatting extracted text
Raw extracted text is not always clean. You might see extra whitespace, HTML artifacts, or inconsistent formatting. Here are patterns for handling the common issues.
Strip extra whitespace
=TRIM(CLEAN(B2))
CLEAN removes non-printable characters. TRIM removes leading, trailing, and repeated spaces. Use on any extracted text.
This is a native Google Sheets formula. No credits needed. Run it on any column of extracted content as a standard cleanup step.
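Rather than dragging the cleanup formula down row by row, one ARRAYFORMULA can clean a whole column at once. This sketch assumes your extracted text sits in B2:B201:

=ARRAYFORMULA(TRIM(CLEAN(B2:B201)))
Applies CLEAN and TRIM to every cell in the range from a single formula.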
Normalize heading formats
=AITEXT("Clean up this list of headings. Remove any numbering prefixes, HTML artifacts, or extra whitespace. Return each heading on its own line, properly capitalized.", B2)
Normalizes messy heading extractions into clean, consistent text.
Extract word count from page content
=LEN(TRIM(B2))-LEN(SUBSTITUTE(TRIM(B2)," ",""))+1
Counts words in extracted text using native Sheets functions. No credits used.
This is great for content audits where you want to compare article lengths across your site or against competitors.
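One caveat: the word-count formula above returns 1 for an empty cell, because the +1 fires even when there are no words. A guarded variant, still native Sheets, returns 0 instead:

=IF(LEN(TRIM(B2))=0, 0, LEN(TRIM(B2))-LEN(SUBSTITUTE(TRIM(B2)," ",""))+1)
Returns 0 for blank or whitespace-only cells, otherwise the word count.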
Extract specific sections with AI
When =VISIT() returns the full page but you only need one section:
=AITEXT("From this page content, extract only the section about pricing. If there is no pricing section, return NO PRICING FOUND.", B2)
AI-powered section extraction from full page text. More flexible than CSS selectors when you do not know the HTML structure.
Batch processing: extracting from 100+ URLs
When you are working with large URL lists, a few practices keep things running smoothly.
Process in batches of 50-100 rows. Do not write your formula in row 2 and drag it straight down to row 500. Start with rows 2-51. Check that the results look right. Then continue to the next batch. This catches problems early, like a broken URL pattern, a site that blocks scraping, or a formula that returns unexpected data.
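A quick way to sanity-check a batch before moving on is to count how many cells errored out. This is a native Sheets sketch, assuming the batch results sit in B2:B51:

=SUMPRODUCT(--ISERROR(B2:B51))
Counts the cells in the batch that returned an error. Zero means the batch is clean; anything higher tells you how many URLs to investigate.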
Use PAGEDATA for bulk metadata. When you need title, description, H1, and status for many URLs, =PAGEDATA() is more efficient than running four separate formulas per URL:
=PAGEDATA(A2)
Returns title, description, H1, and HTTP status in one call. One credit instead of four.
Take advantage of caching. SheetMagic caches scraping results for 30 minutes. If you need to adjust your AI processing instructions, rerun the =AITEXT() formulas without re-scraping. The cached page content is reused automatically. That saves you credits.
Handle errors gracefully. Some URLs will fail. Pages behind login walls, sites that block automated access, or URLs that have moved. Use =IFERROR() to keep your sheet clean:
=IFERROR(GETH1(A2), "FAILED")
Returns FAILED instead of an error if the extraction fails. Filter for FAILED to identify problem URLs.
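Once failures are marked this way, a native FILTER formula pulls the problem URLs into their own list for retrying. This sketch assumes URLs in A2:A501 and the IFERROR-wrapped results in B2:B501:

=FILTER(A2:A501, B2:B501="FAILED")
Returns just the URLs whose extraction failed, ready to recheck or rerun.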
For more on handling large-scale formula runs, see the batch processing guide.
Extraction results depend on the target website. Sites behind login walls, pages that require JavaScript interaction (clicks, scrolls), and pages with aggressive bot protection may return incomplete data. =VISIT() handles standard JavaScript rendering, but some single-page applications may need the Apify integrations available through the AI Chat sidebar.
Putting it all together
The content extraction formulas are building blocks. On their own, each one pulls a specific piece of data from a URL. Combined with AI formulas and native Sheets functions, they become the foundation for workflows that replace hours of manual research with a few minutes of formula work.
For content audits: Use =GETH1() + =GETH2() + =GETMETATITLE() + =GETMETADESCRIPTION() across your sitemap URLs. Add =AITEXT() for automated quality checks. You get a complete on-page audit in one sheet.
For competitive research: Use =SERP() to find competitors, =GETHEADINGS() to extract their outlines, and =AITEXT() to spot content gaps. Your content brief is built on data, not guesswork.
For research compilation: Use =VISIT() to gather source text, =AITEXT() to extract and structure findings, then synthesize across all sources in a single prompt.
Start with one formula and one URL. See the result. Then drag down.
=GETHEADINGS("https://example.com")
Try this now. Replace the URL with any page you are curious about. The heading structure appears in your cell.
You can try all of this on the free tier. No credit card needed. You get 10 integration credits (one-time trial), enough to extract content from 10 pages and see how these formulas fit your workflow.
Install SheetMagic free from the Google Workspace Marketplace and extract your first heading in the next sixty seconds.
More web scraping guides: Complete scraping guide | Batch processing | Getting started | AI formulas

