[mod] Google: reverse engineered & upgraded to data_type: traits_v1

Partial reverse engineering of the Google engines, including improved language
and region handling based on the engine.traits_v1 data.

Whenever possible, the implementations of the Google engines make use of the
async REST APIs.  The get_lang_info() function has been generalized to
get_google_info(); in particular, the region handling has been improved by
adding the cr parameter.
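
A minimal sketch of the new request flow, assuming the dictionary keys returned
by get_google_info() are the ones used in the diff below (subdomain, params,
headers, cookies); the example locale values are only illustrative::

  from urllib.parse import urlencode
  from searx.engines.google import get_google_info

  def request(query, params):
      # 'traits' is the module global holding the engine's traits_v1 data
      google_info = get_google_info(params, traits)
      # google_info['params'] carries the language & region arguments,
      # e.g. {'hl': 'de', 'lr': 'lang_de', 'cr': 'countryDE'} (illustrative)
      params['url'] = (
          'https://' + google_info['subdomain'] + '/search?'
          + urlencode({'q': query, **google_info['params']})
      )
      params['cookies'] = google_info['cookies']
      params['headers'].update(google_info['headers'])
      return params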

searx/data/engine_traits.json
  Add data type "traits_v1" generated by the fetch_traits() functions from:

  - Google (WEB),
  - Google images,
  - Google news,
  - Google scholar and
  - Google videos

  and remove data from obsolete data type "supported_languages".

  A traits.custom type that maps region codes to *supported_domains* is fetched
  from https://www.google.com/supported_domains
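
  A traits_v1 entry in engine_traits.json has roughly the following shape
  (heavily shortened; the field names follow the EngineTraits model, the
  concrete mapping values are only illustrative)::

    "google": {
        "data_type": "traits_v1",
        "languages": { "de": "lang_de" },
        "regions":   { "de-DE": "DE" },
        "custom": {
            "supported_domains": { "DE": "www.google.de" }
        }
    }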

searx/autocomplete.py:
  Reverse engineered autocomplete from Google WEB.  Supports Google's languages and
  subdomains.  The old API suggestqueries.google.com/complete has been replaced
  by the async REST API: https://{subdomain}/complete/search?{args}
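
  A self-contained sketch of such a request; it uses the public
  ``client=firefox`` flavour of the endpoint, which returns plain JSON, while
  the engine itself may use a different client and response format::

    from json import loads
    from urllib.parse import urlencode
    import httpx   # stand-in for searx's own network layer

    def google_complete(query, subdomain='www.google.com', lang='en'):
        args = urlencode({'q': query, 'client': 'firefox', 'hl': lang})
        url = f'https://{subdomain}/complete/search?{args}'
        resp = httpx.get(url, timeout=3.0)
        # response shape: ["<query>", ["<suggestion 1>", "<suggestion 2>", ...]]
        return loads(resp.text)[1]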

searx/engines/google.py
  Reverse engineering and extensive testing ..
  - fetch_traits():  Fetch languages & regions from Google properties.
  - always use the async REST API (formerly known as 'use_mobile_ui')
  - use *supported_domains* from traits
  - improved the result list by fetching './/div[@data-content-feature]'
    and parsing the type of the various *content features*; thumbnails are
    now added (a rough sketch follows this list)
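
  A rough sketch of the *content feature* parsing; only the XPath itself is
  taken from the implementation, the attribute values and the thumbnail
  extraction are simplified assumptions::

    from lxml import html

    def parse_result(result_html: str):
        """Pick content and a thumbnail from one result block (sketch)."""
        dom = html.fromstring(result_html)
        content, thumbnail = '', None
        for div in dom.xpath('.//div[@data-content-feature]'):
            feature = div.get('data-content-feature')
            if feature == '1':            # assumed: textual snippet
                content = div.text_content()
            else:                         # assumed: media block with a thumbnail
                src = div.xpath('.//img/@src')
                if src:
                    thumbnail = src[0]
        return content, thumbnail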

searx/engines/google_images.py
  Reverse engineering and extensive testing ..
  - fetch_traits():  Fetch languages & regions from Google properties.
  - use *supported_domains* from traits
  - if available, the freshness_date is added to the result
  - issue 1864: result list has been improved a lot (due to the new cr parameter)

searx/engines/google_news.py
  Reverse engineering and extensive testing ..
  - fetch_traits():  Fetch languages & regions from Google properties.
    *supported_domains* is not needed but a ceid list has been added.
  - different region handling compared to Google WEB
  - fixed for various languages & regions (due to the new ceid parameter, see
    the sketch after this list) and the CONSENT page is avoided
  - Google News no longer supports time ranges
  - result list has been fixed: XPath of pub_date and pub_origin
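
  A hedged sketch of a region aware Google News request; apart from ``ceid``
  itself, the parameter names and values are illustrative examples::

    from urllib.parse import urlencode

    def request(query, params):
        # ceid combines region and UI language, e.g. "US:en" or "DE:de"
        args = urlencode({'q': query, 'hl': 'en-US', 'gl': 'US', 'ceid': 'US:en'})
        params['url'] = f'https://news.google.com/search?{args}'
        return params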

searx/engines/google_videos.py
  - fetch_traits():  Fetch languages & regions from Google properties.
  - use *supported_domains* from traits
  - add paging support
  - implement an async request ('asearch': 'arc' & 'async':
    'use_ac:true,_fmt:html'), see the sketch after this list
  - simplified code (thanks to '_fmt:html' request)
  - issue 1359: fixed xpath of video length data
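
  A hedged sketch of the async request; the ``asearch`` / ``async`` arguments
  are the ones named above, the remaining parameters (``tbm``, ``start``) are
  assumptions::

    from urllib.parse import urlencode

    def request(query, params):
        args = urlencode({
            'q': query,
            'tbm': 'vid',                           # Google's video vertical
            'start': 10 * (params['pageno'] - 1),   # assumed paging scheme
            'asearch': 'arc',
            'async': 'use_ac:true,_fmt:html',       # request the HTML flavour
        })
        params['url'] = f'https://www.google.com/search?{args}'
        return params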

searx/engines/google_scholar.py
  - fetch_traits():  Fetch languages & regions from Google properties.
  - use *supported_domains* from traits
  - request(): include patents & citations (see the sketch after this list)
  - response(): fixed CAPTCHA detection (Scholar has its own CAPTCHA manager)
  - hardening XPath to iterate over results
  - fixed XPath of pub_type (the class has been changed from gs_ct1 to gs_cgt2)
  - issue 1769 fixed: new request implementation is no longer incompatible
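
  A hedged sketch of a request() that includes patents & citations, assuming
  Scholar's ``as_sdt`` / ``as_vis`` URL parameters are the switches for that;
  the other arguments are illustrative::

    from urllib.parse import urlencode

    def request(query, params):
        args = urlencode({
            'q': query,
            'start': 10 * (params['pageno'] - 1),
            'as_sdt': '2007',   # assumed: include patents
            'as_vis': '0',      # assumed: include citations
        })
        params['url'] = f'https://scholar.google.com/scholar?{args}'
        return params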

Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>

--- a/searx/engines/google_images.py
+++ b/searx/engines/google_images.py
@@ -1,31 +1,38 @@
 # SPDX-License-Identifier: AGPL-3.0-or-later
 # lint: pylint
-"""This is the implementation of the google images engine using the google
-internal API used the Google Go Android app.
+"""This is the implementation of the Google Images engine using the internal
+Google API used by the Google Go Android app.

 This internal API offer results in

-- JSON (_fmt:json)
-- Protobuf (_fmt:pb)
-- Protobuf compressed? (_fmt:pc)
-- HTML (_fmt:html)
-- Protobuf encoded in JSON (_fmt:jspb).
+- JSON (``_fmt:json``)
+- Protobuf_ (``_fmt:pb``)
+- Protobuf_ compressed? (``_fmt:pc``)
+- HTML (``_fmt:html``)
+- Protobuf_ encoded in JSON (``_fmt:jspb``).
+
+.. _Protobuf: https://en.wikipedia.org/wiki/Protocol_Buffers

 """

+from typing import TYPE_CHECKING
+
 from urllib.parse import urlencode
 from json import loads

+from searx.engines.google import fetch_traits  # pylint: disable=unused-import
 from searx.engines.google import (
-    get_lang_info,
+    get_google_info,
     time_range_dict,
     detect_google_sorry,
 )

-# pylint: disable=unused-import
-from searx.engines.google import supported_languages_url, _fetch_supported_languages, fetch_traits
+if TYPE_CHECKING:
+    import logging
+    from searx.enginelib.traits import EngineTraits
+
+    logger: logging.Logger
+    traits: EngineTraits
-# pylint: enable=unused-import

 # about
 about = {
@@ -40,7 +47,6 @@ about = {
 # engine dependent config
 categories = ['images', 'web']
 paging = True
-use_locale_domain = True
 time_range_support = True
 safesearch = True
 send_accept_language_header = True
@@ -51,20 +57,18 @@ filter_mapping = {0: 'images', 1: 'active', 2: 'active'}
 def request(query, params):
     """Google-Image search request"""

-    lang_info = get_lang_info(params, supported_languages, language_aliases, False)
+    google_info = get_google_info(params, traits)

     query_url = (
         'https://'
-        + lang_info['subdomain']
+        + google_info['subdomain']
         + '/search'
         + "?"
         + urlencode(
             {
                 'q': query,
                 'tbm': "isch",
-                **lang_info['params'],
-                'ie': "utf8",
-                'oe': "utf8",
+                **google_info['params'],
                 'asearch': 'isch',
                 'async': '_fmt:json,p:1,ijn:' + str(params['pageno']),
             }
@@ -77,9 +81,8 @@ def request(query, params):
         query_url += '&' + urlencode({'safe': filter_mapping[params['safesearch']]})

     params['url'] = query_url
-    params['headers'].update(lang_info['headers'])
-    params['headers']['User-Agent'] = 'NSTN/3.60.474802233.release Dalvik/2.1.0 (Linux; U; Android 12; US) gzip'
-    params['headers']['Accept'] = '*/*'
+    params['cookies'] = google_info['cookies']
+    params['headers'].update(google_info['headers'])

     return params
@@ -111,7 +114,11 @@ def response(resp):

         copyright_notice = item["result"].get('iptc', {}).get('copyright_notice')
         if copyright_notice:
-            result_item['source'] += ' / ' + copyright_notice
+            result_item['source'] += ' | ' + copyright_notice
+
+        freshness_date = item["result"].get("freshness_date")
+        if freshness_date:
+            result_item['source'] += ' | ' + freshness_date

         file_size = item.get('gsa', {}).get('file_size')
         if file_size: