[mod] Google: reverse engineered & upgraded to data_type: traits_v1

Partial reverse engineering of the Google engines, including improved language
and region handling based on the engine.traits_v1 data.

Whenever possible, the implementations of the Google engines make use of the
async REST APIs.  The get_lang_info() function has been generalized to
get_google_info(); in particular, the region handling has been improved by
adding the cr parameter (see the sketch below).
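
For illustration, a minimal sketch of how a SearXNG locale might be mapped to
Google's hl, lr and cr parameters from traits_v1 data.  The helper name and the
traits layout shown here are simplified assumptions, not the actual
get_google_info() implementation:

  # hypothetical sketch -- not the real get_google_info()
  def sketch_google_info(sxng_locale, traits):
      eng_lang = traits.get('languages', {}).get(sxng_locale.split('-')[0], 'en')
      eng_region = traits.get('regions', {}).get(sxng_locale, '')  # e.g. 'FR' for 'fr-FR'

      params = {
          'hl': eng_lang,            # UI language
          'lr': 'lang_' + eng_lang,  # restrict results to this language
      }
      if eng_region:
          # the new cr parameter restricts results to a country / region
          params['cr'] = 'country' + eng_region

      # region code --> Google subdomain, taken from traits.custom (assumed layout)
      subdomain = traits.get('custom', {}).get('supported_domains', {}).get(eng_region, 'www.google.com')
      return {'params': params, 'subdomain': subdomain}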

searx/data/engine_traits.json
  Add data type "traits_v1" generated by the fetch_traits() functions from:

  - Google (WEB),
  - Google images,
  - Google news,
  - Google scholar and
  - Google videos

  and remove the data of the obsolete data type "supported_languages".

  A traits.custom type that maps region codes to *supported_domains* is fetched
  from https://www.google.com/supported_domains (see the sketch below).
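
  To make the new data type concrete, here is an abbreviated, *assumed* shape of
  a traits_v1 entry and a rough sketch of building the supported_domains mapping.
  Only the URL above is given; the key names and parsing details are assumptions:

    # assumed, abbreviated shape of one traits_v1 entry
    example_traits = {
        "data_type": "traits_v1",
        "languages": {"fr": "fr"},         # SearXNG language --> Google hl value
        "regions": {"fr-FR": "FR"},        # SearXNG region  --> Google region code
        "custom": {"supported_domains": {"FR": "www.google.fr"}},
    }

    # rough sketch of fetching and parsing https://www.google.com/supported_domains
    import httpx

    resp = httpx.get('https://www.google.com/supported_domains')
    supported_domains = {}
    for domain in resp.text.split():       # one entry per line, e.g. '.google.fr'
        tld = domain.strip().split('.')[-1]
        if tld == 'com':
            continue                       # www.google.com is the default domain
        supported_domains[tld.upper()] = 'www' + domain.strip()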

searx/autocomplete.py:
  Reverse engineered the autocomplete from Google WEB.  Supports Google's languages and
  subdomains.  The old API suggestqueries.google.com/complete has been replaced
  by the async REST API: https://{subdomain}/complete/search?{args}

searx/engines/google.py
  Reverse engineering and extensive testing ..
  - fetch_traits():  Fetch languages & regions from Google properties.
  - always use the async REST API (formerly known as 'use_mobile_ui')
  - use *supported_domains* from traits
  - improved the result list by fetching './/div[@data-content-feature]'
    and parsing the type of the various *content features*; thumbnails are
    now added to the results (see the sketch below)
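
  A rough sketch of how the content-feature parsing for thumbnails could look;
  only the @data-content-feature container is taken from the note above, the
  inner XPath and attribute handling are assumptions:

    def extract_thumbnail(result_node):
        """result_node: lxml element of a single Google result (sketch only)."""
        for feature in result_node.xpath('.//div[@data-content-feature]'):
            for src in feature.xpath('.//img/@src'):
                # inline base64 thumbnails and plain image URLs both occur
                if src.startswith('data:image') or src.startswith('http'):
                    return src
        return None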

searx/engines/google_images.py
  Reverse engineering and extensive testing ..
  - fetch_traits():  Fetch languages & regions from Google properties.
  - use *supported_domains* from traits
  - if present, freshness_date is added to the result
  - issue 1864: result list has been improved a lot (due to the new cr parameter)

searx/engines/google_news.py
  Reverse engineering and extensive testing ..
  - fetch_traits():  Fetch languages & regions from Google properties.
    *supported_domains* is not needed but a ceid list has been added.
  - different region handling compared to Google WEB
  - fixes for various languages & regions (due to the new ceid parameter);
    the CONSENT page is avoided (see the sketch below)
  - Google News no longer supports time ranges
  - result list has been fixed: XPath of pub_date and pub_origin
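
  A simplified sketch of how the request could combine region & language into a
  ceid value and skip the consent interstitial; the URL layout and the cookie
  value are assumptions based on the notes above, not the exact implementation:

    from urllib.parse import urlencode

    def sketch_news_request(query, params, eng_region='US', eng_lang='en'):
        ceid = '%s:%s' % (eng_region, eng_lang)    # e.g. 'US:en'
        args = urlencode({'q': query, 'hl': eng_lang, 'gl': eng_region, 'ceid': ceid})
        params['url'] = 'https://news.google.com/search?' + args
        # pre-set the consent cookie so the request is not redirected to the CONSENT page
        params.setdefault('cookies', {})['CONSENT'] = 'YES+'
        return params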

searx/engines/google_videos.py
  - fetch_traits():  Fetch languages & regions from Google properties.
  - use *supported_domains* from traits
  - add paging support
  - implement an async request ('asearch': 'arc' & 'async':
    'use_ac:true,_fmt:html'), see the sketch below
  - simplified code (thanks to the '_fmt:html' request)
  - issue 1359: fixed XPath of the video length data
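
  A sketch of the async video request described above; the 'asearch' / 'async'
  values and the paging arithmetic follow the notes, the remaining parameters
  (subdomain, tbm) are assumptions:

    from urllib.parse import urlencode

    def sketch_videos_request(query, pageno=1, subdomain='www.google.com', hl='en'):
        args = {
            'q': query,
            'tbm': 'vid',                      # Google's video vertical
            'hl': hl,
            'start': (pageno - 1) * 10,        # paging support
            'asearch': 'arc',                  # async REST request ...
            'async': 'use_ac:true,_fmt:html',  # ... answered with plain HTML
        }
        return 'https://' + subdomain + '/search?' + urlencode(args)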

searx/engines/google_scholar.py
  - fetch_traits():  Fetch languages & regions from Google properties.
  - use *supported_domains* from traits
  - request(): include patents & citations (see the sketch below)
  - response(): fixed CAPTCHA detection (Scholar has its own CAPTCHA manager)
  - hardening XPath to iterate over results
  - fixed XPath of pub_type (the class has been changed from gs_ct1 to gs_cgt2)
  - issue 1769 fixed: new request implementation is no longer incompatible
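
  A sketch of the Scholar specifics noted above; the as_sdt / as_vis values and
  the CAPTCHA form id are illustrative assumptions, not the exact code:

    from urllib.parse import urlencode

    def sketch_scholar_args(query, pageno=1):
        return urlencode({
            'q': query,
            'start': (pageno - 1) * 10,
            'as_sdt': '2007',  # include patents in the result list
            'as_vis': '0',     # include citations in the result list
        })

    def sketch_detect_captcha(dom):
        # Scholar serves its own "not a robot" dialog instead of Google's
        # generic sorry/ page -- the real engine raises SearXNG's CAPTCHA
        # exception when such a form is found.
        if dom.xpath("//form[@id='gs_captcha_f']"):
            raise RuntimeError('Google Scholar CAPTCHA')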

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Author: Markus Heiser, 2022-12-04 22:57:22 +01:00
Commit: 2499899554 (parent: c80e82a855)
11 changed files with 2510 additions and 2174 deletions

searx/autocomplete.py

@@ -5,14 +5,17 @@
 """
 # pylint: disable=use-dict-literal
-from json import loads
+import json
 from urllib.parse import urlencode
-from lxml import etree
+import lxml
 from httpx import HTTPError
 from searx import settings
-from searx.engines import engines
+from searx.engines import (
+    engines,
+    google,
+)
 from searx.network import get as http_get
 from searx.exceptions import SearxEngineResponseException
@@ -55,7 +58,7 @@ def dbpedia(query, _lang):
     results = []
     if response.ok:
-        dom = etree.fromstring(response.content)
+        dom = lxml.etree.fromstring(response.content)
         results = dom.xpath('//Result/Label//text()')
     return results
@@ -81,18 +84,32 @@ def duckduckgo(query, sxng_locale):
     return ret_val
-def google(query, lang):
-    # google autocompleter
-    autocomplete_url = 'https://suggestqueries.google.com/complete/search?client=toolbar&'
+def google_complete(query, sxng_locale):
+    """Autocomplete from Google. Supports Google's languages and subdomains
+    (:py:obj:`searx.engines.google.get_google_info`) by using the async REST
+    API::
-    response = get(autocomplete_url + urlencode(dict(hl=lang, q=query)))
+        https://{subdomain}/complete/search?{args}
+    """
+    google_info = google.get_google_info({'searxng_locale': sxng_locale}, engines['google'].traits)
+    url = 'https://{subdomain}/complete/search?{args}'
+    args = urlencode(
+        {
+            'q': query,
+            'client': 'gws-wiz',
+            'hl': google_info['params']['hl'],
+        }
+    )
     results = []
-    if response.ok:
-        dom = etree.fromstring(response.text)
-        results = dom.xpath('//suggestion/@data')
+    resp = get(url.format(subdomain=google_info['subdomain'], args=args))
+    if resp.ok:
+        json_txt = resp.text[resp.text.find('[') : resp.text.find(']', -3) + 1]
+        data = json.loads(json_txt)
+        for item in data[0]:
+            results.append(lxml.html.fromstring(item[0]).text_content())
     return results
@@ -132,7 +149,7 @@ def swisscows(query, _lang):
     # swisscows autocompleter
     url = 'https://swisscows.ch/api/suggest?{query}&itemsCount=5'
-    resp = loads(get(url.format(query=urlencode({'query': query}))).text)
+    resp = json.loads(get(url.format(query=urlencode({'query': query}))).text)
     return resp
@@ -184,7 +201,7 @@ def yandex(query, _lang):
     # yandex autocompleter
     url = "https://suggest.yandex.com/suggest-ff.cgi?{0}"
-    resp = loads(get(url.format(urlencode(dict(part=query)))).text)
+    resp = json.loads(get(url.format(urlencode(dict(part=query)))).text)
     if len(resp) > 1:
         return resp[1]
     return []
@@ -193,7 +210,7 @@ def yandex(query, _lang):
 backends = {
     'dbpedia': dbpedia,
     'duckduckgo': duckduckgo,
-    'google': google,
+    'google': google_complete,
     'seznam': seznam,
     'startpage': startpage,
     'swisscows': swisscows,