Proxy and user agent #14

Open
rcepka opened this issue Feb 7, 2024 · 7 comments · May be fixed by #21

rcepka commented Feb 7, 2024

Hello,
how can I set a proxy and a custom user agent, please?

@ashbythorpe (Owner)

It depends on the browser and session you are using.

User Agent

If you are using chromote (the default), you can use Network.setUserAgentOverride:

session <- selenider_session()

session$driver$Network$setUserAgentOverride(
  userAgent = "My user agent"
)

If you are using Selenium and Chrome:

session <- selenider_session(
  "selenium",
  browser = "chrome",
  options = selenium_options(
    client_options = selenium_client_options(
      capabilities = list(
        "goog:chromeOptions" = list(
          args = list("user-agent=My user agent")
        )
      )
    )
  )
)

Selenium and Firefox:

session <- selenider_session(
  "selenium",
  browser = "firefox",
  options = selenium_options(
    client_options = selenium_client_options(
      capabilities = list(
        "moz:firefoxOptions" = list(
          prefs = list(
            "general.useragent.override" = "My user agent"
          )
        )
      )
    )
  )
)

You can get the current user agent with:

execute_js_expr("return navigator.userAgent")
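
For instance, continuing from the chromote snippet above (a quick sketch; the URL is just an arbitrary example), you can check that pages actually see the override:

open_url("https://www.r-project.org/")

# navigator.userAgent should now report the overridden value
execute_js_expr("return navigator.userAgent")
#> [1] "My user agent"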

Proxy Server

Note that I haven't tested these, and used these two Stack Overflow questions as a reference:
https://stackoverflow.com/questions/48498349/set-proxy-server-with-selenium-and-chrome
https://stackoverflow.com/questions/70479865/how-to-use-selenium-with-firefox-proxy-in-selenium-4-x

In chromote, you have to pass arguments to the Chrome process using chromote::set_chrome_args()

chromote::set_chrome_args(c(
  chromote::get_chrome_args(),
  "--proxy-server=HOST:PORT"
))

session <- selenider_session()
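
A quick way to sanity-check that traffic is actually going through the proxy (a sketch, assuming selenider is attached and the proxy needs no authentication) is to visit an IP-echo site:

open_url("https://api.ipify.org/")

# Should print the proxy's IP rather than your own
elem_text(s("*"))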

With Selenium and Chrome:

session <- selenider_session(
  "selenium",
  browser = "chrome",
  options = selenium_options(
    client_options = selenium_client_options(
      capabilities = list(
        "goog:chromeOptions" = list(
          args = list("--proxy-server=HOST:PORT")
        )
      )
    )
  )
)

Selenium and Firefox:

session <- selenider_session(
  "selenium",
  browser = "firefox",
  options = selenium_options(
    client_options = selenium_client_options(
      capabilities = list(
        "moz:firefoxOptions" = list(
          prefs = list(
            "network.proxy.type" = 1,
            "network.proxy.socks" = "HOST",
            "network.proxy.socks_port" = PORT,
            "network.proxy.socks_remote_dns" = FALSE
          )
        )
      )
    )
  )
)

I might support this feature directly in selenider in the future.

rcepka commented Feb 8, 2024

@ashbythorpe Thank you so much for your exhaustive answer; very valuable information.
I see that the implementation might not be rocket science, but it is also definitely not trivial for less advanced users like myself. So a huge "subscribe" from me for supporting this directly in selenider. 👍

For static websites I am using the approach below; it is simple, easy and straightforward. It would be great to have something like this in selenider for dynamic websites as well.

response <- httr::GET(
  link,
  httr::user_agent("user agent string"),
  httr::use_proxy(
    url = IP,
    port = port,
    username = username,
    password = password
  )
)

page <- xml2::read_html(response)

I was trying to reproduce the code above for selenider and chromote using your hints.
I think I was able to set the user agent, though I was not able to verify it, because I could not find the user agent information in the response object or in the session. It simply loaded the website page...

session <- selenider::selenider_session("chromote", timeout = 10)

session$driver$Network$setUserAgentOverride(
  userAgent = "user agent string"
)

response <- selenider::open_url(link)

page <- read_html(response)

I was less lucky with the proxy implementation. Besides the IP address and port, I also have to provide credentials, a username and a password. I did not find any documentation on how to do this, neither in the links you provided in your post above, nor here (referenced from the chromote project page): https://peter.sh/experiments/chromium-command-line-switches/
Can you please advise me how to implement proxy credentials?

rcepka commented Feb 9, 2024

Hello @ashbythorpe ,

I did some more testing of changing the user agent, but with unsatisfactory results; please see below.

session <- selenider::selenider_session("chromote", timeout = 10)

session <- session$driver$Network$setUserAgentOverride(
  userAgent = "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0; MDDCJS)"
)

response <- selenider::open_url("https://www.r-project.org/")

browser_user_agent <- response$driver$Browser$getVersion()
browser_user_agent$userAgent
#> [1] "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/121.0.6167.86 Safari/537.36"

Created on 2024-02-09 with reprex v2.0.2

@ashbythorpe (Owner)

I managed to get proxy authentication to work with chromote:

library(selenider)

chromote::set_chrome_args(c(
  chromote::default_chrome_args(),
  "--proxy-server=HOST:PORT"
))

session <- selenider_session()

authenticate <- function(x) {
  id <- x$requestId
  
  response <- list(
    response = "ProvideCredentials",
    username = "USERNAME",
    password = "PASSWORD"
  )
  
  session$driver$Fetch$continueWithAuth(
    requestId = id,
    authChallengeResponse = response
  )
}

continue_request <- function(x) {
  id <- x$requestId
  
  session$driver$Fetch$continueRequest(requestId = id)
}

session$driver$Fetch$enable(
  patterns = list(
    list(urlPattern = "*")
  ),
  handleAuthRequests = TRUE
)

session$driver$Fetch$requestPaused(
  callback_ = continue_request
)

session$driver$Fetch$authRequired(
  callback_ = authenticate
)

You essentially need to intercept every request that needs authentication, which is why the code is quite complicated. This will also cause a warning every time you navigate to a new webpage, since right now you can't use .enable methods manually (see rstudio/chromote#144).
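
If you later want to stop intercepting requests (for example, to get rid of the per-navigation warning while debugging), the corresponding disable call undoes the setup above (a minimal sketch):

# Stop intercepting requests once proxy authentication is no longer needed
session$driver$Fetch$disable()
#> named list()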

I'll probably add this as an explicit option to selenider_session() in the next release.

ashbythorpe commented Feb 9, 2024

Weird, it seems Browser.getVersion gives a different value from JavaScript's navigator.userAgent:

library(selenider)

session <- selenider_session()

session$driver$Network$setUserAgentOverride(
  userAgent = "My user agent"
)
#> named list()

execute_js_expr("return navigator.userAgent")
#> [1] "My user agent"

session$driver$Browser$getVersion()$userAgent
#> [1] "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/121.0.6167.161 Safari/537.36"

Hopefully this means that Browser.getVersion returns an outdated user agent, rather than Network.setUserAgentOverride not working.

EDIT:
Yes, Browser.getVersion returns the default user agent, not the one you have set for the page:
puppeteer/puppeteer#2261 (comment)

puppeteer uses Network.setUserAgentOverride to set the user agent and Browser.getVersion to get the user agent, which is why this issue is relevant.

rcepka commented Feb 11, 2024

@ashbythorpe, thank you for your code, it's quite complex :). Unfortunately I wasn't able to get it to work.

When I ran the code with reprex(), the web page load timed out; I have no clue why. Please see below.

If I run this code manually, then selenider::open_url("https://www.myip.com/") works, but myip.com returns the IP address of my computer instead of the IP address of the proxy supplied in the code.
I am sure that the proxy is alright; it works fine with the GET() function of the httr package.
If this same code works for you, the only issue I can think of is whether I am providing the proxy IP address and port in the correct format:
"--proxy-server=69.58.9.215:7285"
Is this alright, please?

# Code of @ashbythorpe START

  library(selenider)
#> Warning: package 'selenider' was built under R version 4.3.2

  chromote::set_chrome_args(c(
    chromote::default_chrome_args(),
    "--proxy-server=69.58.9.215:7285"
  ))

  session <- selenider::selenider_session()
  
  authenticate <- function(x) {
    id <- x$requestId

    response <- list(
      response = "ProvideCredentials",
      username = "myusername",
      password = "mypassword"
    )

    session$driver$Fetch$continueWithAuth(
      requestId = id,
      authChallengeResponse = response
    )
  }

  continue_request <- function(x) {
    id <- x$requestId

    session$driver$Fetch$continueRequest(requestId = id)
  }

  session$driver$Fetch$enable(
    patterns = list(
      list(urlPattern = "*")
    ),
    handleAuthRequests = TRUE
  )
#> named list()

  session$driver$Fetch$requestPaused(
    callback_ = continue_request
  )

  session$driver$Fetch$authRequired(
    callback_ = authenticate
  )
  
  # Code of @ashbythorpe END

  
  
  selenider::open_url("https://www.myip.com/")
  #> Error: Chromote: timed out waiting for event Page.loadEventFired
  
  selenider::s("#ip") |> selenider::elem_text()
  #> Error in `selenider::elem_text()`:
  #> ! To get the text inside `x`, it must exist.
  #> ℹ After 4 seconds, `x` was not present.

Created on 2024-02-11 with reprex v2.1.0

@ashbythorpe (Owner)

Hi @rcepka,
Your IP and port are in the correct format. I have done some testing and I think this problem happens when your proxy server does not support HTTPS requests.

Example 1: using proxy.py

proxy.py only supports HTTP requests, so the proxy server connection is successful, but the IP seen by HTTPS websites is wrong.

In the command line:
proxy --basic-auth=username:password

In R:

library(selenider)

chromote::set_chrome_args(c(
  chromote::default_chrome_args(),
  "--proxy-server=127.0.0.1:8899"
))

session <- selenider_session()

x <- session$driver$Fetch$requestPaused(
  callback_ = function(x) NULL
)

session$driver$Fetch$disable()
#> named list()

authenticate <- function(x) {
  id <- x$requestId
  
  response <- list(
    response = "ProvideCredentials",
    username = "username",
    password = "password"
  )
  
  session$driver$Fetch$continueWithAuth(
    requestId = id,
    authChallengeResponse = response
  )
}

continue_request <- function(x) {
  id <- x$requestId
  
  session$driver$Fetch$continueRequest(requestId = id)
}

session$driver$Fetch$enable(
  patterns = list(
    list(urlPattern = "*")
  ),
  handleAuthRequests = TRUE
)
#> named list()

session$driver$Fetch$requestPaused(
  callback_ = continue_request
)

session$driver$Fetch$authRequired(
  callback_ = authenticate
)

open_url("http://api.ipify.org/")

elem_text(s("*"))
# my local IP

open_url("https://api.ipify.org/")

elem_text(s("*"))
# my local IP

However, while the IP is wrong, the logs on the command line (with excess information removed) show us that we are connecting to the proxy server:

GET None:None/ - None None - 0 bytes - 0.35ms
CONNECT api.ipify.org:443 - 5914 bytes - 386.98ms
GET api.ipify.org:80/ - 200 OK - 230 bytes - 702.69ms

Notably, we get the exact same result if we do this without authentication.

Now, we use a random proxy server from https://free-proxy-list.net/. We choose one that supports HTTPS.

library(selenider)

chromote::set_chrome_args(c(
  chromote::default_chrome_args(),
  "--proxy-server=167.86.115.218:8888"
))

session <- selenider_session()

open_url("http://api.ipify.org/")

elem_text(s("*"))
#> [1] "206.217.216.17"

open_url("https://api.ipify.org/")

elem_text(s("*"))
#> [1] "206.217.216.17"

This IP is different from my local IP, demonstrating that this proxy works.

So yeah, I think the problem is most likely that your proxy server does not support HTTPS requests.
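
If you want to confirm that outside the browser, one quick check is to compare an HTTP and an HTTPS request through the same proxy with httr, which you already have working (a sketch; HOST, PORT, USERNAME and PASSWORD are placeholders):

library(httr)

proxy <- use_proxy(
  url = "HOST",
  port = PORT,
  username = "USERNAME",
  password = "PASSWORD"
)

# If the plain-HTTP request succeeds through the proxy but the HTTPS one
# errors out, the proxy most likely cannot handle HTTPS (CONNECT) traffic.
content(GET("http://api.ipify.org/", proxy), as = "text")
content(GET("https://api.ipify.org/", proxy), as = "text")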

ashbythorpe linked a pull request May 19, 2024 that will close this issue