Add Polars documentation #2685

Merged
simon04 merged 2 commits into freeCodeCamp:main from mcagriardic:add-polars-documentation on May 26, 2026

@mcagriardic
Contributor

@mcagriardic mcagriardic commented May 24, 2026

Add a scraper for the Polars Python API reference
(https://docs.pola.rs/api/python/stable/reference/), pinned to the stable 1.41.0 release.

The site uses the pydata-sphinx-theme, so the scraper reuses the shared sphinx/clean_html filter alongside Polars-specific filters:

  • clean_html removes the theme chrome (sidebars, in-page TOC, prev/next navigation, footer) and tags code blocks for Python highlighting.
  • entries names each page from its heading and groups entries into types (DataFrame, LazyFrame, Series, Expressions, Functions, Data Types, Input/output, etc.). Top-level members are stored flat under api/ and are classified by their member name.
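
For illustration, the grouping rule described above could be sketched roughly as follows. This is a simplified stand-in, not the filter code from this PR: the `SECTION_TYPES` table, the `IO_PREFIX` pattern, the `'Miscellaneous'` fallback and the `entry_type` helper are all hypothetical, and the real filter may classify members differently.

```ruby
# Hypothetical lookup table: first path segment => entry type.
SECTION_TYPES = {
  'dataframe'   => 'DataFrame',
  'lazyframe'   => 'LazyFrame',
  'series'      => 'Series',
  'expressions' => 'Expressions'
}.freeze

# Hypothetical rule for flat api/ members: I/O helpers are recognised by prefix.
IO_PREFIX = /\A(?:read|scan|write)_/

# Map a page slug to the type it would be grouped under.
def entry_type(slug)
  section, rest = slug.split('/', 2)
  return SECTION_TYPES.fetch(section, 'Miscellaneous') unless section == 'api'

  # 'polars.read_csv' => 'read_csv'
  member = rest.to_s.sub(/\Apolars\./, '')
  member.match?(IO_PREFIX) ? 'Input/output' : 'Functions'
end
```

Pages that live in a section directory take that section's type, while flat `api/` members fall back to a name-based rule.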

Latest-version detection uses the most recent GitHub release and strips the py- prefix, since the repo also tags Rust (rs-) releases.
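
As a rough sketch of why the prefix matters: the scraper itself only strips `py-` from the most recent release, but a stricter variant could also skip Rust tags explicitly. The `latest_python_version` helper and the sample tag names below are hypothetical and only illustrate the idea.

```ruby
require 'rubygems' # provides Gem::Version for numeric version comparison

# Hypothetical helper: pick the newest Python release from a list of tag names.
def latest_python_version(tags)
  tags
    .select { |tag| tag.start_with?('py-') }        # ignore rs- (Rust) tags
    .map { |tag| tag.sub(/\Apy-/, '') }             # 'py-1.41.0' => '1.41.0'
    .max_by { |version| Gem::Version.new(version) } # 1.41.0 sorts above 1.9.0
end
```

Comparing through `Gem::Version` avoids the string-ordering trap where `1.9.0` would sort above `1.41.0`.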

If you’re adding a new scraper, please ensure that you have:

  • Tested the scraper on a local copy of DevDocs
  • Ensured that the docs are styled similarly to other docs on DevDocs
  • Added these files to the public/icons/docs/polars/ directory:
    • 16.png: a 16×16 pixel icon for the doc
    • 16@2x.png: a 32×32 pixel icon for the doc
    • SOURCE: A text file containing the URL to the page the image can be found on or the URL of the original image itself

@mcagriardic mcagriardic requested a review from a team as a code owner May 24, 2026 17:06
@mcagriardic
Contributor Author

@simon04, would appreciate reviews. Many thanks!

@simon04
Contributor

simon04 commented May 26, 2026

Thanks, the PR looks good, except for one issue: the method signature is printed poorly.

Example: http://localhost:9292/polars/series/api/polars.series.sample

I've spent an hour fighting against the nested dl/dd/dt. Nokogiri seems to change the DOM tree. Giving up for tonight.

Related error from the W3C validator: https://validator.w3.org/nu/?doc=https%3A%2F%2Fdocs.pola.rs%2Fapi%2Fpython%2Fstable%2Freference%2Fseries%2Fapi%2Fpolars.Series.to_numpy.html

Here's a not very successful attempt.

diff --git a/lib/docs/filters/polars/clean_html.rb b/lib/docs/filters/polars/clean_html.rb
index e270c8d9..ed03e690 100644
--- a/lib/docs/filters/polars/clean_html.rb
+++ b/lib/docs/filters/polars/clean_html.rb
@@ -20,6 +20,12 @@ module Docs
         css('img').remove if root_page?
 
         # Make sure every code block is tagged so Prism highlights it as Python.
+        css('.sig').each do |node|
+          node.after node.css('.reference.external:contains("[source]")').remove
+          node.name = 'pre'
+          node.content = node.content.strip
+          node['data-language'] = 'python'
+        end
         css('.highlight pre').each do |node|
           node.content = node.content
           node['data-language'] = 'python'
diff --git a/lib/docs/scrapers/polars.rb b/lib/docs/scrapers/polars.rb
index fa332be7..c84f9e15 100644
--- a/lib/docs/scrapers/polars.rb
+++ b/lib/docs/scrapers/polars.rb
@@ -30,5 +30,17 @@ module Docs
     def get_latest_version(opts)
       get_latest_github_release('pola-rs', 'polars', opts).sub(/\Apy-/, '')
     end
+
+    private
+
+    def parse(response) # Hook here because Nokogiri dislikes <dl> nested inside <dt>
+      response.body.gsub! %r{<dd}, '<div'
+      response.body.gsub! %r{<dl}, '<div'
+      response.body.gsub! %r{<dt}, '<div'
+      response.body.gsub! %r{</dd}, '</div'
+      response.body.gsub! %r{</dl}, '</div'
+      response.body.gsub! %r{</dt}, '</div'
+      super
+    end
   end
 end

@mcagriardic
Contributor Author

mcagriardic commented May 26, 2026

Great catch, and apologies for not noticing it :/ Thanks for the detailed look, it's much appreciated! Got it sorted; the nested <dl> was indeed the root cause. It turns out we can sidestep the Nokogiri DOM rewriting entirely by using Nokogiri's HTML5 parser instead of the default libxml2 one; it's spec-compliant and handles Sphinx's nested <dl><dt> structure without mangling it.

Hooked it into parse on the scraper, walked each .sig on the clean tree, stripped [source] + headerlink, formatted the signature one param per line, and emitted a <pre data-language="python"> so Prism highlights it. No regex tag-swapping needed.
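
The one-param-per-line step can be sketched in plain Ruby, leaving out the Nokogiri part. This is only an approximation of the idea and not the code merged in this PR: `split_params` and `format_signature` are hypothetical helper names, and the four-space indent is an assumed style.

```ruby
# Split a parameter list on top-level commas only, so commas inside
# brackets (e.g. type hints like dict[str, int]) stay intact.
def split_params(params)
  parts = [+'']
  depth = 0
  params.each_char do |char|
    case char
    when '(', '[', '{' then depth += 1
    when ')', ']', '}' then depth -= 1
    end
    if char == ',' && depth.zero?
      parts << +''
    else
      parts.last << char
    end
  end
  parts.map(&:strip).reject(&:empty?)
end

# Rewrite 'name(a, b) -> T' so that each parameter sits on its own line.
# Signatures with fewer than two parameters are returned unchanged.
def format_signature(signature)
  match = signature.match(/\A(?<name>[^(]+)\((?<params>.*)\)(?<tail>[^)]*)\z/m)
  return signature.strip unless match

  params = split_params(match[:params])
  return signature.strip if params.length < 2

  body = params.map { |param| "    #{param}" }.join(",\n")
  "#{match[:name].strip}(\n#{body}\n)#{match[:tail].rstrip}"
end

# Simplified, illustrative signature (not the full upstream one).
puts format_signature('Series.sample(n: int | None = None, *, seed: int | None = None) -> Series')
```

Tracking bracket depth rather than splitting on every comma keeps defaults such as `(1, 2)` and annotations such as `dict[str, int]` on a single line.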

Result matches the upstream Polars docs rendering.

Contributor

@simon04 simon04 left a comment

Amazing, thank you!

@simon04 simon04 merged commit 7b0c412 into freeCodeCamp:main May 26, 2026
2 checks passed