Proxy and user agent #14

Open
rcepka opened this issue Feb 7, 2024 · 7 comments · May be fixed by #21

rcepka commented Feb 7, 2024

Hello,
how can I set a proxy and a custom user agent, please?

@ashbythorpe (Owner)

It depends on the browser and session you are using.

User Agent

If you are using chromote (the default), you can use Network.setUserAgentOverride:

session <- selenider_session()

session$driver$Network$setUserAgentOverride(
  userAgent = "My user agent"
)

If you are using Selenium and Chrome:

session <- selenider_session(
  "selenium",
  browser = "chrome",
  options = selenium_options(
    client_options = selenium_client_options(
      capabilities = list(
        "goog:chromeOptions" = list(
          args = list("user-agent=My user agent")
        )
      )
    )
  )
)

Selenium and Firefox:

session <- selenider_session(
  "selenium",
  browser = "firefox",
  options = selenium_options(
    client_options = selenium_client_options(
      capabilities = list(
        "moz:firefoxOptions" = list(
          prefs = list(
            "general.useragent.override" = "My user agent"
          )
        )
      )
    )
  )
)

You can get the current user agent with:

execute_js_expr("return navigator.userAgent")
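
For instance, continuing from the chromote snippet above (a quick sketch; the URL is just an arbitrary example), you can check that pages actually see the override:

open_url("https://www.r-project.org/")

# navigator.userAgent should now report the overridden value
execute_js_expr("return navigator.userAgent")
#> [1] "My user agent"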

Proxy Server

Note that I haven't tested these, and used these two Stack Overflow questions as a reference:
https://stackoverflow.com/questions/48498349/set-proxy-server-with-selenium-and-chrome
https://stackoverflow.com/questions/70479865/how-to-use-selenium-with-firefox-proxy-in-selenium-4-x

In chromote, you have to pass arguments to the Chrome process using chromote::set_chrome_args()

chromote::set_chrome_args(c(
  chromote::get_chrome_args(),
  "--proxy-server=HOST:PORT"
))

session <- selenider_session()
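
A quick way to sanity-check that traffic is actually going through the proxy (a sketch, assuming selenider is attached and the proxy needs no authentication) is to visit an IP-echo site:

open_url("https://api.ipify.org/")

# Should print the proxy's IP rather than your own
elem_text(s("*"))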

With Selenium and Chrome:

session <- selenider_session(
  "selenium",
  browser = "chrome",
  options = selenium_options(
    client_options = selenium_client_options(
      capabilities = list(
        "goog:chromeOptions" = list(
          args = list("--proxy-server=HOST:PORT")
        )
      )
    )
  )
)

Selenium and Firefox:

session <- selenider_session(
  "selenium",
  browser = "firefox",
  options = selenium_options(
    client_options = selenium_client_options(
      capabilities = list(
        "moz:firefoxOptions" = list(
          prefs = list(
            "network.proxy.type" = 1,
            "network.proxy.socks" = "HOST",
            "network.proxy.socks_port" = PORT,
            "network.proxy.socks_remote_dns" = FALSE
          )
        )
      )
    )
  )
)

I might support this feature directly in selenider in the future.

rcepka commented Feb 8, 2024

@ashbythorpe Thank you so much for your exhaustive answer; very valuable information.
I see that the implementation might not be rocket science, but it is also definitely not trivial for less advanced users like myself. So a huge "subscribe" from me for supporting this directly in selenider. 👍

For static websites I am using the approach below; it is simple, easy and straightforward. It would be great to have something like this in selenider for dynamic websites as well.

response <- httr::GET(
  link,
  httr::user_agent("user agent string"),
  httr::use_proxy(
    url = IP,
    port = port,
    username = username,
    password = password
  )
)

page <- xml2::read_html(response)

I was trying to reproduce the code above for selenider and chromote using your hints.
I think I was able to set the user agent, though I was not able to verify it, because I could not find the user agent information in the response object or in the session. It simply loaded the website page...

session <- selenider::selenider_session("chromote", timeout = 10)

session$driver$Network$setUserAgentOverride(
  userAgent = "user agent string"
)

response <- selenider::open_url(link)

page <- read_html(response)

I was less lucky with the proxy implementation. Besides the IP address and port, I also have to provide credentials, a username and a password. I did not find any documentation on how to do this, neither in the links you provided in your post above, nor here (referenced from the chromote project page): https://peter.sh/experiments/chromium-command-line-switches/
Can you please advise me how to implement proxy credentials?

rcepka commented Feb 9, 2024

Hello @ashbythorpe ,

I did some more testing of changing the user agent, but with unsatisfactory results; please see below.

session <- selenider::selenider_session("chromote", timeout = 10)

session <- session$driver$Network$setUserAgentOverride(
  userAgent = "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0; MDDCJS)"
)

response <- selenider::open_url("https://www.r-project.org/")

browser_user_agent <- response$driver$Browser$getVersion()
browser_user_agent$userAgent
#> [1] "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/121.0.6167.86 Safari/537.36"

Created on 2024-02-09 with reprex v2.0.2

@ashbythorpe (Owner)

I managed to get proxy authentication to work with chromote:

library(selenider)

chromote::set_chrome_args(c(
  chromote::default_chrome_args(),
  "--proxy-server=HOST:PORT"
))

session <- selenider_session()

authenticate <- function(x) {
  id <- x$requestId
  
  response <- list(
    response = "ProvideCredentials",
    username = "USERNAME",
    password = "PASSWORD"
  )
  
  session$driver$Fetch$continueWithAuth(
    requestId = id,
    authChallengeResponse = response
  )
}

continue_request <- function(x) {
  id <- x$requestId
  
  session$driver$Fetch$continueRequest(requestId = id)
}

session$driver$Fetch$enable(
  patterns = list(
    list(urlPattern = "*")
  ),
  handleAuthRequests = TRUE
)

session$driver$Fetch$requestPaused(
  callback_ = continue_request
)

session$driver$Fetch$authRequired(
  callback_ = authenticate
)

You essentially need to intercept every request that needs authentication, which is why the code is quite complicated. This will also cause a warning every time you navigate to a new webpage, since right now you can't use .enable methods manually (see rstudio/chromote#144).
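
If you later want to stop intercepting requests (for example, to get rid of the per-navigation warning while debugging), the corresponding disable call undoes the setup above (a minimal sketch):

# Stop intercepting requests once proxy authentication is no longer needed
session$driver$Fetch$disable()
#> named list()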

I'll probably add this as an explicit option to selenider_session() in the next release.

ashbythorpe commented Feb 9, 2024

Weird, it seems Browser.getVersion gives a different value from JavaScript's navigator.userAgent:

library(selenider)

session <- selenider_session()

session$driver$Network$setUserAgentOverride(
  userAgent = "My user agent"
)
#> named list()

execute_js_expr("return navigator.userAgent")
#> [1] "My user agent"

session$driver$Browser$getVersion()$userAgent
#> [1] "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/121.0.6167.161 Safari/537.36"

Hopefully this means that Browser.getVersion returns an outdated user agent, rather than Network.setUserAgentOverride not working.

EDIT:
Yes, Browser.getVersion returns the default user agent, not the one you have set for the page:
puppeteer/puppeteer#2261 (comment)

puppeteer uses Network.setUserAgentOverride to set the user agent and Browser.getVersion to get the user agent, which is why this issue is relevant.

rcepka commented Feb 11, 2024

@ashbythorpe, thank you for your code, it's quite complex :). Unfortunately I wasn't able to get it to work.

When I ran the code with reprex(), the web page load timed out; I have no clue why. Please see below.

If I run this code manually, then selenider::open_url("https://www.myip.com/") works, but myip.com returns the IP address of my computer instead of the IP address of the proxy supplied in the code.
I am sure that the proxy is alright; it works fine with the GET() function of the httr package.
If this same code works for you, the only issue I can think of is whether I am providing the proxy IP address and port in the correct format:
"--proxy-server=69.58.9.215:7285"
Is this alright, please?

# Code of @ashbythorpe START

  library(selenider)
#> Warning: package 'selenider' was built under R version 4.3.2

  chromote::set_chrome_args(c(
    chromote::default_chrome_args(),
    "--proxy-server=69.58.9.215:7285"
  ))

  session <- selenider::selenider_session()
  
  authenticate <- function(x) {
    id <- x$requestId

    response <- list(
      response = "ProvideCredentials",
      username = "myusername",
      password = "mypassword"
    )

    session$driver$Fetch$continueWithAuth(
      requestId = id,
      authChallengeResponse = response
    )
  }

  continue_request <- function(x) {
    id <- x$requestId

    session$driver$Fetch$continueRequest(requestId = id)
  }

  session$driver$Fetch$enable(
    patterns = list(
      list(urlPattern = "*")
    ),
    handleAuthRequests = TRUE
  )
#> named list()

  session$driver$Fetch$requestPaused(
    callback_ = continue_request
  )

  session$driver$Fetch$authRequired(
    callback_ = authenticate
  )
  
  # Code of @ashbythorpe END

  
  
  selenider::open_url("https://www.myip.com/")
  #> Error: Chromote: timed out waiting for event Page.loadEventFired
  
  selenider::s("#ip") |> selenider::elem_text()
  #> Error in `selenider::elem_text()`:
  #> ! To get the text inside `x`, it must exist.
  #> ℹ After 4 seconds, `x` was not present.

Created on 2024-02-11 with reprex v2.1.0

@ashbythorpe (Owner)

Hi @rcepka,
Your IP and port are in the correct format. I have done some testing and I think this problem happens when your proxy server does not support HTTPS requests.

Example 1: using proxy.py

proxy.py only supports HTTP requests, so the proxy server connection is successful, but the IP seen by HTTPS websites is wrong.

In the command line:
proxy --basic-auth=username:password

In R:

library(selenider)

chromote::set_chrome_args(c(
  chromote::default_chrome_args(),
  "--proxy-server=127.0.0.1:8899"
))

session <- selenider_session()

x <- session$driver$Fetch$requestPaused(
  callback_ = function(x) NULL
)

session$driver$Fetch$disable()
#> named list()

authenticate <- function(x) {
  id <- x$requestId
  
  response <- list(
    response = "ProvideCredentials",
    username = "username",
    password = "password"
  )
  
  session$driver$Fetch$continueWithAuth(
    requestId = id,
    authChallengeResponse = response
  )
}

continue_request <- function(x) {
  id <- x$requestId
  
  session$driver$Fetch$continueRequest(requestId = id)
}

session$driver$Fetch$enable(
  patterns = list(
    list(urlPattern = "*")
  ),
  handleAuthRequests = TRUE
)
#> named list()

session$driver$Fetch$requestPaused(
  callback_ = continue_request
)

session$driver$Fetch$authRequired(
  callback_ = authenticate
)

open_url("http://api.ipify.org/")

elem_text(s("*"))
# my local IP

open_url("https://api.ipify.org/")

elem_text(s("*"))
# my local IP

However, while the IP is wrong, the logs on the command line (with excess information removed) show us that we are connecting to the proxy server:

GET None:None/ - None None - 0 bytes - 0.35ms
CONNECT api.ipify.org:443 - 5914 bytes - 386.98ms
GET api.ipify.org:80/ - 200 OK - 230 bytes - 702.69ms

Notably, we get the exact same result if we do this without authentication.

Now, we use a random proxy server from https://free-proxy-list.net/. We choose one that supports HTTPS.

library(selenider)

chromote::set_chrome_args(c(
  chromote::default_chrome_args(),
  "--proxy-server=167.86.115.218:8888"
))

session <- selenider_session()

open_url("http://api.ipify.org/")

elem_text(s("*"))
#> [1] "206.217.216.17"

open_url("https://api.ipify.org/")

elem_text(s("*"))
#> [1] "206.217.216.17"

This IP is different from my local IP, demonstrating that this proxy works.

So yeah, I think the problem is most likely that your proxy server does not support HTTPS requests.
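
If you want to confirm that outside the browser, one quick check is to compare an HTTP and an HTTPS request through the same proxy with httr, which you already have working (a sketch; HOST, PORT, USERNAME and PASSWORD are placeholders):

library(httr)

proxy <- use_proxy(
  url = "HOST",
  port = PORT,
  username = "USERNAME",
  password = "PASSWORD"
)

# If the plain-HTTP request succeeds through the proxy but the HTTPS one
# errors out, the proxy most likely cannot handle HTTPS (CONNECT) traffic.
content(GET("http://api.ipify.org/", proxy), as = "text")
content(GET("https://api.ipify.org/", proxy), as = "text")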

ashbythorpe linked a pull request May 19, 2024 that will close this issue