first commit

solr_update
Simon Bowie 2 years ago
commit
318907ad3d
97 changed files with 11081 additions and 0 deletions
  1. +5
    -0
      .gitignore
  2. +21
    -0
      LICENSE
  3. +53
    -0
      README.md
  4. +23
    -0
      config.env.template
  5. +37
    -0
      docker-compose.prod.yml
  6. +25
    -0
      docker-compose.yml
  7. +16
    -0
      nginx-conf/patents.conf
  8. +67
    -0
      solr_config/currency.xml
  9. +42
    -0
      solr_config/elevate.xml
  10. +2
    -0
      solr_config/email_url_types.txt
  11. +8
    -0
      solr_config/lang/contractions_ca.txt
  12. +15
    -0
      solr_config/lang/contractions_fr.txt
  13. +5
    -0
      solr_config/lang/contractions_ga.txt
  14. +23
    -0
      solr_config/lang/contractions_it.txt
  15. +5
    -0
      solr_config/lang/hyphenations_ga.txt
  16. +6
    -0
      solr_config/lang/stemdict_nl.txt
  17. +420
    -0
      solr_config/lang/stoptags_ja.txt
  18. +125
    -0
      solr_config/lang/stopwords_ar.txt
  19. +193
    -0
      solr_config/lang/stopwords_bg.txt
  20. +220
    -0
      solr_config/lang/stopwords_ca.txt
  21. +172
    -0
      solr_config/lang/stopwords_cz.txt
  22. +110
    -0
      solr_config/lang/stopwords_da.txt
  23. +294
    -0
      solr_config/lang/stopwords_de.txt
  24. +78
    -0
      solr_config/lang/stopwords_el.txt
  25. +54
    -0
      solr_config/lang/stopwords_en.txt
  26. +356
    -0
      solr_config/lang/stopwords_es.txt
  27. +99
    -0
      solr_config/lang/stopwords_eu.txt
  28. +313
    -0
      solr_config/lang/stopwords_fa.txt
  29. +97
    -0
      solr_config/lang/stopwords_fi.txt
  30. +186
    -0
      solr_config/lang/stopwords_fr.txt
  31. +110
    -0
      solr_config/lang/stopwords_ga.txt
  32. +161
    -0
      solr_config/lang/stopwords_gl.txt
  33. +235
    -0
      solr_config/lang/stopwords_hi.txt
  34. +211
    -0
      solr_config/lang/stopwords_hu.txt
  35. +46
    -0
      solr_config/lang/stopwords_hy.txt
  36. +359
    -0
      solr_config/lang/stopwords_id.txt
  37. +303
    -0
      solr_config/lang/stopwords_it.txt
  38. +127
    -0
      solr_config/lang/stopwords_ja.txt
  39. +172
    -0
      solr_config/lang/stopwords_lv.txt
  40. +119
    -0
      solr_config/lang/stopwords_nl.txt
  41. +194
    -0
      solr_config/lang/stopwords_no.txt
  42. +253
    -0
      solr_config/lang/stopwords_pt.txt
  43. +233
    -0
      solr_config/lang/stopwords_ro.txt
  44. +243
    -0
      solr_config/lang/stopwords_ru.txt
  45. +133
    -0
      solr_config/lang/stopwords_sv.txt
  46. +119
    -0
      solr_config/lang/stopwords_th.txt
  47. +212
    -0
      solr_config/lang/stopwords_tr.txt
  48. +29
    -0
      solr_config/lang/userdict_ja.txt
  49. +34
    -0
      solr_config/params.json
  50. +21
    -0
      solr_config/protwords.txt
  51. +530
    -0
      solr_config/schema.xml
  52. +1368
    -0
      solr_config/solrconfig.xml
  53. +14
    -0
      solr_config/stopwords.txt
  54. +29
    -0
      solr_config/synonyms.txt
  55. +115
    -0
      solr_config/update-script.js
  56. +32
    -0
      solr_config/velocity/browse.vm
  57. +0
    -0
      solr_config/velocity/dropit.js
  58. +2
    -0
      solr_config/velocity/facet_doc_type.vm
  59. +12
    -0
      solr_config/velocity/facet_text_shingles.vm
  60. +24
    -0
      solr_config/velocity/facets.vm
  61. +29
    -0
      solr_config/velocity/footer.vm
  62. +290
    -0
      solr_config/velocity/head.vm
  63. +77
    -0
      solr_config/velocity/hit.vm
  64. BIN
      solr_config/velocity/img/english_640.png
  65. BIN
      solr_config/velocity/img/france_640.png
  66. BIN
      solr_config/velocity/img/germany_640.png
  67. BIN
      solr_config/velocity/img/globe_256.png
  68. +0
    -0
      solr_config/velocity/jquery.tx3-tag-cloud.js
  69. +97
    -0
      solr_config/velocity/js/dropit.js
  70. +763
    -0
      solr_config/velocity/js/jquery.autocomplete.js
  71. +70
    -0
      solr_config/velocity/js/jquery.tx3-tag-cloud.js
  72. +42
    -0
      solr_config/velocity/layout.vm
  73. +16
    -0
      solr_config/velocity/macros.vm
  74. +68
    -0
      solr_config/velocity/mime_type_lists.vm
  75. +20
    -0
      solr_config/velocity/results.vm
  76. +21
    -0
      solr_config/velocity/results_list.vm
  77. +144
    -0
      solr_import.sh
  78. +10
    -0
      web/Dockerfile
  79. +34
    -0
      web/app/__init__.py
  80. +16
    -0
      web/app/main.py
  81. +153
    -0
      web/app/ops.py
  82. +62
    -0
      web/app/random.py
  83. +51
    -0
      web/app/search.py
  84. +145
    -0
      web/app/solr.py
  85. +9
    -0
      web/app/static/js/main.js
  86. +10
    -0
      web/app/static/styles/custom.css
  87. +15
    -0
      web/app/templates/abstracts.html
  88. +51
    -0
      web/app/templates/base.html
  89. +71
    -0
      web/app/templates/compare.html
  90. +11
    -0
      web/app/templates/images.html
  91. +46
    -0
      web/app/templates/index.html
  92. +69
    -0
      web/app/templates/record.html
  93. +95
    -0
      web/app/templates/search.html
  94. +42
    -0
      web/app/templates/titles.html
  95. +23
    -0
      web/content/about.md
  96. +15
    -0
      web/content/home.md
  97. +6
    -0
      web/requirements.txt

+ 5
- 0
.gitignore

@@ -0,0 +1,5 @@
.DS_Store
config.env
config.env.prod
data
web/app/__pycache__/

+ 21
- 0
LICENSE

@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2021 Simon Bowie <ad7588@coventry.ac.uk>

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is furnished
to do so, subject to the following conditions:

The above copyright notice and this permission notice (including the next
paragraph) shall be included in all copies or substantial portions of the
Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS
OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF
OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

+ 53
- 0
README.md

@@ -0,0 +1,53 @@
# Archival Conversations patents data search engine

This repository contains the Docker Compose, Nginx, Python, and Solr config files for deploying the development environment for the Archival Conversations patents data search engine site.

## to deploy environment

### config.env

To deploy this environment, first copy config.env.template to a new file, config.env. Fill in the appropriate environment variables.
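
Before starting the containers, it can help to confirm that every variable copied from the template has a value. The following is a minimal sketch, not part of the repository: the helper name `unset_variables` is hypothetical, and it assumes config.env keeps the simple `KEY=VALUE` layout with `#` comments used in config.env.template.

```python
from pathlib import Path


def unset_variables(env_text: str) -> list[str]:
    """Return the names of KEY=VALUE entries whose value is empty."""
    missing = []
    for raw_line in env_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        if not value.strip():
            missing.append(key.strip())
    return missing


if __name__ == "__main__":
    # Assumption: the script is run from the repository root.
    config_path = Path("config.env")
    if config_path.exists():
        for name in unset_variables(config_path.read_text()):
            print(f"{name} is not set")
```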

Note that on Mac the Python container has to communicate with the Solr container using the hostname 'host.docker.internal' rather than 'localhost' or '127.0.0.1': https://stackoverflow.com/questions/24319662/from-inside-of-a-docker-container-how-do-i-connect-to-the-localhost-of-the-mach

On Linux, you can use the container name e.g. 'solr' as the Solr hostname in config.env.
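
As an illustration of how these settings fit together, the sketch below builds the base URL of a Solr core from the `SOLR_HOSTNAME`, `SOLR_PORT`, and `SOLR_CORE` variables in config.env. It is not taken from the application code: the function name `solr_core_url`, the fallback values, and the use of plain HTTP are assumptions made for the example.

```python
import os


def solr_core_url(hostname=None, port=None, core=None):
    """Build the base URL of a Solr core from arguments or the environment."""
    # Fall back to the environment, then to assumed defaults
    # ('solr' container name, port 8983, core 'all').
    hostname = hostname or os.environ.get("SOLR_HOSTNAME") or "solr"
    port = port or os.environ.get("SOLR_PORT") or "8983"
    core = core or os.environ.get("SOLR_CORE") or "all"
    # Assumption: Solr is reached over plain HTTP.
    return f"http://{hostname}:{port}/solr/{core}"


if __name__ == "__main__":
    print(solr_core_url())
```

On Mac, passing `hostname="host.docker.internal"` (or setting `SOLR_HOSTNAME` to that value) would produce the URL the Python container needs.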

### Docker Compose

In the command line, navigate to the directory where this repository is stored on your local machine and run:

`docker-compose up -d --build`

Docker should build the application environment, comprising a Python container (including ImageMagick), an Apache Solr container (set up for .rtf indexing following the instructions at: https://github.com/docker-solr/docker-solr), and an Nginx web server to serve the website.

The website should then be available in the browser at 'localhost:5000'.

To take down the environment, run:

`docker-compose down`

## populating Apache Solr

In order to fill the site with documents, you will have to populate the Apache Solr search engine. There is a solr_import.sh script to help with this. Place whatever files you want indexed in a directory called 'data' within the main directory.

In solr_import.sh, change the directory to point to the main directory and, if necessary, change the location parameters for the various cores.

We use different Solr cores for the different themes on the site: 'all' is a core containing all documents while 'active', 'expanding', etc. contain only documents for that theme.
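
Because each theme has its own core, a search against one theme only needs a different core name in the request path. The sketch below shows this using Solr's standard `/select` endpoint; the function name `theme_select_url`, the `solr_root` parameter, and the default `rows` value are hypothetical and not taken from the application code.

```python
from urllib.parse import urlencode


def theme_select_url(solr_root, theme, query, rows=10):
    """Build a Solr select URL for the core that holds one theme."""
    # The theme name doubles as the core name, e.g. 'all' or 'active'.
    params = urlencode({"q": query, "rows": rows})
    return f"{solr_root}/solr/{theme}/select?{params}"


if __name__ == "__main__":
    # Assumption: Solr is reachable at this address from where the script runs.
    print(theme_select_url("http://localhost:8983", "all", "wind turbine"))
```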

### legacy Solr commands

This section should be fully superseded by solr_import.sh and by the Solr config now included in the repository. The commands are left here for reference.

Created core using:

`docker exec -it solr solr create_core -c epo_data`

Note this fix to ensure that .rtf files can be indexed using Apache Tika: https://gitmemory.com/issue/docker-solr/docker-solr/341/682877640. Once you've created the core, run these commands:

`docker exec -ti --user=solr solr bash -c 'cp -r /opt/solr/example/files/conf/* /var/solr/data/{CORE_NAME}/conf/'`

`docker restart solr`

Add files to Solr using:

`docker run --rm -v "/Users/ad7588/Downloads/2018 (10381):/2018" --network=host solr:latest post -c epo_data /2018`

+ 23
- 0
config.env.template

@@ -0,0 +1,23 @@
# This config file contains the environment variables for the application

# Flask variables
FLASK_APP=app/__init__.py
FLASK_RUN_HOST=0.0.0.0
FLASK_DEBUG=1

# Solr variables
# Hostname for Solr
SOLR_HOSTNAME=
# Solr port, usually 8983
SOLR_PORT=
# Solr core, usually all
SOLR_CORE=

# OPS API variables
# Hostname for OPS API, usually https://ops.epo.org
OPS_URL=
# Hostname for OPS API image requests (for some reason different from the above), usually http://ops.epo.org
OPS_URL_IMAGES=
# API credentials from OPS https://developers.epo.org/
CONSUMER_KEY=
CONSUMER_SECRET=

+ 37
- 0
docker-compose.prod.yml

@@ -0,0 +1,37 @@
version: '3.9'

services:

  python:
    build: ./web
    container_name: python
    expose:
      - 5000
    env_file:
      - ./config.env.prod
    volumes:
      - ./web:/code
    command: gunicorn --bind 0.0.0.0:5000 "app:create_app()"

  nginx:
    image: nginx:latest
    container_name: nginx
    restart: unless-stopped
    ports:
      - "1337:80"
    volumes:
      - ./nginx-conf:/etc/nginx/conf.d
    depends_on:
      - python

  solr:
    container_name: solr
    image: solr:latest
    ports:
      - '8983:8983'
    volumes:
      - solrdata:/var/solr
      - ./solr_config:/opt/solr/server/solr/configsets/custom

volumes:
  solrdata:

+ 25
- 0
docker-compose.yml

@@ -0,0 +1,25 @@
version: '3.9'

services:

  python:
    build: ./web
    container_name: python
    ports:
      - "5000:5000"
    volumes:
      - ./web:/code
    env_file:
      - ./config.env

  solr:
    container_name: solr
    image: solr:latest
    ports:
      - '8983:8983'
    volumes:
      - solrdata:/var/solr
      - ./solr_config:/opt/solr/server/solr/configsets/custom

volumes:
  solrdata:

+ 16
- 0
nginx-conf/patents.conf

@@ -0,0 +1,16 @@
upstream patents {
    server python:5000;
}

server {

    listen 80;

    location / {
        proxy_pass http://patents;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $http_host;
        proxy_redirect off;
    }

}

+ 67
- 0
solr_config/currency.xml

@@ -0,0 +1,67 @@
<?xml version="1.0" ?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

<!-- Example exchange rates file for CurrencyField type named "currency" in example schema -->

<currencyConfig version="1.0">
<rates>
<!-- Updated from http://www.exchangerate.com/ at 2011-09-27 -->
<rate from="USD" to="ARS" rate="4.333871" comment="ARGENTINA Peso" />
<rate from="USD" to="AUD" rate="1.025768" comment="AUSTRALIA Dollar" />
<rate from="USD" to="EUR" rate="0.743676" comment="European Euro" />
<rate from="USD" to="BRL" rate="1.881093" comment="BRAZIL Real" />
<rate from="USD" to="CAD" rate="1.030815" comment="CANADA Dollar" />
<rate from="USD" to="CLP" rate="519.0996" comment="CHILE Peso" />
<rate from="USD" to="CNY" rate="6.387310" comment="CHINA Yuan" />
<rate from="USD" to="CZK" rate="18.47134" comment="CZECH REP. Koruna" />
<rate from="USD" to="DKK" rate="5.515436" comment="DENMARK Krone" />
<rate from="USD" to="HKD" rate="7.801922" comment="HONG KONG Dollar" />
<rate from="USD" to="HUF" rate="215.6169" comment="HUNGARY Forint" />
<rate from="USD" to="ISK" rate="118.1280" comment="ICELAND Krona" />
<rate from="USD" to="INR" rate="49.49088" comment="INDIA Rupee" />
<rate from="USD" to="XDR" rate="0.641358" comment="INTNL MON. FUND SDR" />
<rate from="USD" to="ILS" rate="3.709739" comment="ISRAEL Sheqel" />
<rate from="USD" to="JPY" rate="76.32419" comment="JAPAN Yen" />
<rate from="USD" to="KRW" rate="1169.173" comment="KOREA (SOUTH) Won" />
<rate from="USD" to="KWD" rate="0.275142" comment="KUWAIT Dinar" />
<rate from="USD" to="MXN" rate="13.85895" comment="MEXICO Peso" />
<rate from="USD" to="NZD" rate="1.285159" comment="NEW ZEALAND Dollar" />
<rate from="USD" to="NOK" rate="5.859035" comment="NORWAY Krone" />
<rate from="USD" to="PKR" rate="87.57007" comment="PAKISTAN Rupee" />
<rate from="USD" to="PEN" rate="2.730683" comment="PERU Sol" />
<rate from="USD" to="PHP" rate="43.62039" comment="PHILIPPINES Peso" />
<rate from="USD" to="PLN" rate="3.310139" comment="POLAND Zloty" />
<rate from="USD" to="RON" rate="3.100932" comment="ROMANIA Leu" />
<rate from="USD" to="RUB" rate="32.14663" comment="RUSSIA Ruble" />
<rate from="USD" to="SAR" rate="3.750465" comment="SAUDI ARABIA Riyal" />
<rate from="USD" to="SGD" rate="1.299352" comment="SINGAPORE Dollar" />
<rate from="USD" to="ZAR" rate="8.329761" comment="SOUTH AFRICA Rand" />
<rate from="USD" to="SEK" rate="6.883442" comment="SWEDEN Krona" />
<rate from="USD" to="CHF" rate="0.906035" comment="SWITZERLAND Franc" />
<rate from="USD" to="TWD" rate="30.40283" comment="TAIWAN Dollar" />
<rate from="USD" to="THB" rate="30.89487" comment="THAILAND Baht" />
<rate from="USD" to="AED" rate="3.672955" comment="U.A.E. Dirham" />
<rate from="USD" to="UAH" rate="7.988582" comment="UKRAINE Hryvnia" />
<rate from="USD" to="GBP" rate="0.647910" comment="UNITED KINGDOM Pound" />
<!-- Cross-rates for some common currencies -->
<rate from="EUR" to="GBP" rate="0.869914" />
<rate from="EUR" to="NOK" rate="7.800095" />
<rate from="GBP" to="NOK" rate="8.966508" />
</rates>
</currencyConfig>

+ 42
- 0
solr_config/elevate.xml

@@ -0,0 +1,42 @@
<?xml version="1.0" encoding="UTF-8" ?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

<!-- If this file is found in the config directory, it will only be
loaded once at startup. If it is found in Solr's data
directory, it will be re-loaded every commit.

See http://wiki.apache.org/solr/QueryElevationComponent for more info

-->
<elevate>
<!-- Query elevation examples
<query text="foo bar">
<doc id="1" />
<doc id="2" />
<doc id="3" />
</query>

for use with techproducts example
<query text="ipod">
<doc id="MA147LL/A" /> put the actual ipod at the top
<doc id="IW-02" exclude="true" /> exclude this cable
</query>
-->

</elevate>

+ 2
- 0
solr_config/email_url_types.txt

@@ -0,0 +1,2 @@
<URL>
<EMAIL>

+ 8
- 0
solr_config/lang/contractions_ca.txt

@@ -0,0 +1,8 @@
# Set of Catalan contractions for ElisionFilter
# TODO: load this as a resource from the analyzer and sync it in build.xml
d
l
m
n
s
t

+ 15
- 0
solr_config/lang/contractions_fr.txt

@@ -0,0 +1,15 @@
# Set of French contractions for ElisionFilter
# TODO: load this as a resource from the analyzer and sync it in build.xml
l
m
t
qu
n
s
j
d
c
jusqu
quoiqu
lorsqu
puisqu

+ 5
- 0
solr_config/lang/contractions_ga.txt

@@ -0,0 +1,5 @@
# Set of Irish contractions for ElisionFilter
# TODO: load this as a resource from the analyzer and sync it in build.xml
d
m
b

+ 23
- 0
solr_config/lang/contractions_it.txt

@@ -0,0 +1,23 @@
# Set of Italian contractions for ElisionFilter
# TODO: load this as a resource from the analyzer and sync it in build.xml
c
l
all
dall
dell
nell
sull
coll
pell
gl
agl
dagl
degl
negl
sugl
un
m
t
s
v
d

+ 5
- 0
solr_config/lang/hyphenations_ga.txt

@@ -0,0 +1,5 @@
# Set of Irish hyphenations for StopFilter
# TODO: load this as a resource from the analyzer and sync it in build.xml
h
n
t

+ 6
- 0
solr_config/lang/stemdict_nl.txt

@@ -0,0 +1,6 @@
# Set of overrides for the dutch stemmer
# TODO: load this as a resource from the analyzer and sync it in build.xml
fiets fiets
bromfiets bromfiets
ei eier
kind kinder

+ 420
- 0
solr_config/lang/stoptags_ja.txt

@@ -0,0 +1,420 @@
#
# This file defines a Japanese stoptag set for JapanesePartOfSpeechStopFilter.
#
# Any token with a part-of-speech tag that exactly matches one of those defined
# in this file is removed from the token stream.
#
# Set your own stoptags by uncommenting the lines below. Note that comments are
# not allowed on the same line as a stoptag. See LUCENE-3745 for frequency lists,
# etc. that can be useful for building your own stoptag set.
#
# The entire possible tagset is provided below for convenience.
#
#####
# noun: unclassified nouns
#名詞
#
# noun-common: Common nouns or nouns where the sub-classification is undefined
#名詞-一般
#
# noun-proper: Proper nouns where the sub-classification is undefined
#名詞-固有名詞
#
# noun-proper-misc: miscellaneous proper nouns
#名詞-固有名詞-一般
#
# noun-proper-person: Personal names where the sub-classification is undefined
#名詞-固有名詞-人名
#
# noun-proper-person-misc: names that cannot be divided into surname and
# given name; foreign names; names where the surname or given name is unknown.
# e.g. お市の方
#名詞-固有名詞-人名-一般
#
# noun-proper-person-surname: Mainly Japanese surnames.
# e.g. 山田
#名詞-固有名詞-人名-姓
#
# noun-proper-person-given_name: Mainly Japanese given names.
# e.g. 太郎
#名詞-固有名詞-人名-名
#
# noun-proper-organization: Names representing organizations.
# e.g. 通産省, NHK
#名詞-固有名詞-組織
#
# noun-proper-place: Place names where the sub-classification is undefined
#名詞-固有名詞-地域
#
# noun-proper-place-misc: Place names excluding countries.
# e.g. アジア, バルセロナ, 京都
#名詞-固有名詞-地域-一般
#
# noun-proper-place-country: Country names.
# e.g. 日本, オーストラリア
#名詞-固有名詞-地域-国
#
# noun-pronoun: Pronouns where the sub-classification is undefined
#名詞-代名詞
#
# noun-pronoun-misc: miscellaneous pronouns:
# e.g. それ, ここ, あいつ, あなた, あちこち, いくつ, どこか, なに, みなさん, みんな, わたくし, われわれ
#名詞-代名詞-一般
#
# noun-pronoun-contraction: Spoken language contraction made by combining a
# pronoun and the particle 'wa'.
# e.g. ありゃ, こりゃ, こりゃあ, そりゃ, そりゃあ
#名詞-代名詞-縮約
#
# noun-adverbial: Temporal nouns such as names of days or months that behave
# like adverbs. Nouns that represent amount or ratios and can be used adverbially,
# e.g. 金曜, 一月, 午後, 少量
#名詞-副詞可能
#
# noun-verbal: Nouns that take arguments with case and can appear followed by
# 'suru' and related verbs (する, できる, なさる, くださる)
# e.g. インプット, 愛着, 悪化, 悪戦苦闘, 一安心, 下取り
#名詞-サ変接続
#
# noun-adjective-base: The base form of adjectives, words that appear before な ("na")
# e.g. 健康, 安易, 駄目, だめ
#名詞-形容動詞語幹
#
# noun-numeric: Arabic numbers, Chinese numerals, and counters like 何 (回), 数.
# e.g. 0, 1, 2, 何, 数, 幾
#名詞-数
#
# noun-affix: noun affixes where the sub-classification is undefined
#名詞-非自立
#
# noun-affix-misc: Of adnominalizers, the case-marker の ("no"), and words that
# attach to the base form of inflectional words, words that cannot be classified
# into any of the other categories below. This category includes indefinite nouns.
# e.g. あかつき, 暁, かい, 甲斐, 気, きらい, 嫌い, くせ, 癖, こと, 事, ごと, 毎, しだい, 次第,
# 順, せい, 所為, ついで, 序で, つもり, 積もり, 点, どころ, の, はず, 筈, はずみ, 弾み,
# 拍子, ふう, ふり, 振り, ほう, 方, 旨, もの, 物, 者, ゆえ, 故, ゆえん, 所以, わけ, 訳,
# わり, 割り, 割, ん-口語/, もん-口語/
#名詞-非自立-一般
#
# noun-affix-adverbial: noun affixes that can behave as adverbs.
# e.g. あいだ, 間, あげく, 挙げ句, あと, 後, 余り, 以外, 以降, 以後, 以上, 以前, 一方, うえ,
# 上, うち, 内, おり, 折り, かぎり, 限り, きり, っきり, 結果, ころ, 頃, さい, 際, 最中, さなか,
# 最中, じたい, 自体, たび, 度, ため, 為, つど, 都度, とおり, 通り, とき, 時, ところ, 所,
# とたん, 途端, なか, 中, のち, 後, ばあい, 場合, 日, ぶん, 分, ほか, 他, まえ, 前, まま,
# 儘, 侭, みぎり, 矢先
#名詞-非自立-副詞可能
#
# noun-affix-aux: noun affixes treated as 助動詞 ("auxiliary verb") in school grammars
# with the stem よう(だ) ("you(da)").
# e.g. よう, やう, 様 (よう)
#名詞-非自立-助動詞語幹
#
# noun-affix-adjective-base: noun affixes that can connect to the indeclinable
# connection form な (aux "da").
# e.g. みたい, ふう
#名詞-非自立-形容動詞語幹
#
# noun-special: special nouns where the sub-classification is undefined.
#名詞-特殊
#
# noun-special-aux: The そうだ ("souda") stem form that is used for reporting news, is
# treated as 助動詞 ("auxiliary verb") in school grammars, and attach to the base
# form of inflectional words.
# e.g. そう
#名詞-特殊-助動詞語幹
#
# noun-suffix: noun suffixes where the sub-classification is undefined.
#名詞-接尾
#
# noun-suffix-misc: Of the nouns or stem forms of other parts of speech that connect
# to ガル or タイ and can combine into compound nouns, words that cannot be classified into
# any of the other categories below. In general, this category is more inclusive than
# 接尾語 ("suffix") and is usually the last element in a compound noun.
# e.g. おき, かた, 方, 甲斐 (がい), がかり, ぎみ, 気味, ぐるみ, (~した) さ, 次第, 済 (ず) み,
# よう, (でき)っこ, 感, 観, 性, 学, 類, 面, 用
#名詞-接尾-一般
#
# noun-suffix-person: Suffixes that form nouns and attach to person names more often
# than other nouns.
# e.g. 君, 様, 著
#名詞-接尾-人名
#
# noun-suffix-place: Suffixes that form nouns and attach to place names more often
# than other nouns.
# e.g. 町, 市, 県
#名詞-接尾-地域
#
# noun-suffix-verbal: Of the suffixes that attach to nouns and form nouns, those that
# can appear before スル ("suru").
# e.g. 化, 視, 分け, 入り, 落ち, 買い
#名詞-接尾-サ変接続
#
# noun-suffix-aux: The stem form of そうだ (様態) that is used to indicate conditions,
# is treated as 助動詞 ("auxiliary verb") in school grammars, and attach to the
# conjunctive form of inflectional words.
# e.g. そう
#名詞-接尾-助動詞語幹
#
# noun-suffix-adjective-base: Suffixes that attach to other nouns or the conjunctive
# form of inflectional words and appear before the copula だ ("da").
# e.g. 的, げ, がち
#名詞-接尾-形容動詞語幹
#
# noun-suffix-adverbial: Suffixes that attach to other nouns and can behave as adverbs.
# e.g. 後 (ご), 以後, 以降, 以前, 前後, 中, 末, 上, 時 (じ)
#名詞-接尾-副詞可能
#
# noun-suffix-classifier: Suffixes that attach to numbers and form nouns. This category
# is more inclusive than 助数詞 ("classifier") and includes common nouns that attach
# to numbers.
# e.g. 個, つ, 本, 冊, パーセント, cm, kg, カ月, か国, 区画, 時間, 時半
#名詞-接尾-助数詞
#
# noun-suffix-special: Special suffixes that mainly attach to inflecting words.
# e.g. (楽し) さ, (考え) 方
#名詞-接尾-特殊
#
# noun-suffix-conjunctive: Nouns that behave like conjunctions and join two words
# together.
# e.g. (日本) 対 (アメリカ), 対 (アメリカ), (3) 対 (5), (女優) 兼 (主婦)
#名詞-接続詞的
#
# noun-verbal_aux: Nouns that attach to the conjunctive particle て ("te") and are
# semantically verb-like.
# e.g. ごらん, ご覧, 御覧, 頂戴
#名詞-動詞非自立的
#
# noun-quotation: text that cannot be segmented into words, proverbs, Chinese poetry,
# dialects, English, etc. Currently, the only entry for 名詞 引用文字列 ("noun quotation")
# is いわく ("iwaku").
#名詞-引用文字列
#
# noun-nai_adjective: Words that appear before the auxiliary verb ない ("nai") and
# behave like an adjective.
# e.g. 申し訳, 仕方, とんでも, 違い
#名詞-ナイ形容詞語幹
#
#####
# prefix: unclassified prefixes
#接頭詞
#
# prefix-nominal: Prefixes that attach to nouns (including adjective stem forms)
# excluding numerical expressions.
# e.g. お (水), 某 (氏), 同 (社), 故 (~氏), 高 (品質), お (見事), ご (立派)
#接頭詞-名詞接続
#
# prefix-verbal: Prefixes that attach to the imperative form of a verb or a verb
# in conjunctive form followed by なる/なさる/くださる.
# e.g. お (読みなさい), お (座り)
#接頭詞-動詞接続
#
# prefix-adjectival: Prefixes that attach to adjectives.
# e.g. お (寒いですねえ), バカ (でかい)
#接頭詞-形容詞接続
#
# prefix-numerical: Prefixes that attach to numerical expressions.
# e.g. 約, およそ, 毎時
#接頭詞-数接続
#
#####
# verb: unclassified verbs
#動詞
#
# verb-main:
#動詞-自立
#
# verb-auxiliary:
#動詞-非自立
#
# verb-suffix:
#動詞-接尾
#
#####
# adjective: unclassified adjectives
#形容詞
#
# adjective-main:
#形容詞-自立
#
# adjective-auxiliary:
#形容詞-非自立
#
# adjective-suffix:
#形容詞-接尾
#
#####
# adverb: unclassified adverbs
#副詞
#
# adverb-misc: Words that can be segmented into one unit and where adnominal
# modification is not possible.
# e.g. あいかわらず, 多分
#副詞-一般
#
# adverb-particle_conjunction: Adverbs that can be followed by の, は, に,
# な, する, だ, etc.
# e.g. こんなに, そんなに, あんなに, なにか, なんでも
#副詞-助詞類接続
#
#####
# adnominal: Words that only have noun-modifying forms.
# e.g. この, その, あの, どの, いわゆる, なんらかの, 何らかの, いろんな, こういう, そういう, ああいう,
# どういう, こんな, そんな, あんな, どんな, 大きな, 小さな, おかしな, ほんの, たいした,
# 「(, も) さる (ことながら)」, 微々たる, 堂々たる, 単なる, いかなる, 我が」「同じ, 亡き
#連体詞
#
#####
# conjunction: Conjunctions that can occur independently.
# e.g. が, けれども, そして, じゃあ, それどころか
接続詞
#
#####
# particle: unclassified particles.
助詞
#
# particle-case: case particles where the subclassification is undefined.
助詞-格助詞
#
# particle-case-misc: Case particles.
# e.g. から, が, で, と, に, へ, より, を, の, にて
助詞-格助詞-一般
#
# particle-case-quote: the "to" that appears after nouns, a person’s speech,
# quotation marks, expressions of decisions from a meeting, reasons, judgements,
# conjectures, etc.
# e.g. ( だ) と (述べた.), ( である) と (して執行猶予...)
助詞-格助詞-引用
#
# particle-case-compound: Compounds of particles and verbs that mainly behave
# like case particles.
# e.g. という, といった, とかいう, として, とともに, と共に, でもって, にあたって, に当たって, に当って,
# にあたり, に当たり, に当り, に当たる, にあたる, において, に於いて,に於て, における, に於ける,
# にかけ, にかけて, にかんし, に関し, にかんして, に関して, にかんする, に関する, に際し,
# に際して, にしたがい, に従い, に従う, にしたがって, に従って, にたいし, に対し, にたいして,
# に対して, にたいする, に対する, について, につき, につけ, につけて, につれ, につれて, にとって,
# にとり, にまつわる, によって, に依って, に因って, により, に依り, に因り, による, に依る, に因る,
# にわたって, にわたる, をもって, を以って, を通じ, を通じて, を通して, をめぐって, をめぐり, をめぐる,
# って-口語/, ちゅう-関西弁「という」/, (何) ていう (人)-口語/, っていう-口語/, といふ, とかいふ
助詞-格助詞-連語
#
# particle-conjunctive:
# e.g. から, からには, が, けれど, けれども, けど, し, つつ, て, で, と, ところが, どころか, とも, ども,
# ながら, なり, ので, のに, ば, ものの, や ( した), やいなや, (ころん) じゃ(いけない)-口語/,
# (行っ) ちゃ(いけない)-口語/, (言っ) たって (しかたがない)-口語/, (それがなく)ったって (平気)-口語/
助詞-接続助詞
#
# particle-dependency:
# e.g. こそ, さえ, しか, すら, は, も, ぞ
助詞-係助詞
#
# particle-adverbial:
# e.g. がてら, かも, くらい, 位, ぐらい, しも, (学校) じゃ(これが流行っている)-口語/,
# (それ)じゃあ (よくない)-口語/, ずつ, (私) なぞ, など, (私) なり (に), (先生) なんか (大嫌い)-口語/,
# (私) なんぞ, (先生) なんて (大嫌い)-口語/, のみ, だけ, (私) だって-口語/, だに,
# (彼)ったら-口語/, (お茶) でも (いかが), 等 (とう), (今後) とも, ばかり, ばっか-口語/, ばっかり-口語/,
# ほど, 程, まで, 迄, (誰) も (が)([助詞-格助詞] および [助詞-係助詞] の前に位置する「も」)
助詞-副助詞
#
# particle-interjective: particles with interjective grammatical roles.
# e.g. (松島) や
助詞-間投助詞
#
# particle-coordinate:
# e.g. と, たり, だの, だり, とか, なり, や, やら
助詞-並立助詞
#
# particle-final:
# e.g. かい, かしら, さ, ぜ, (だ)っけ-口語/, (とまってる) で-方言/, な, ナ, なあ-口語/, ぞ, ね, ネ,
# ねぇ-口語/, ねえ-口語/, ねん-方言/, の, のう-口語/, や, よ, ヨ, よぉ-口語/, わ, わい-口語/
助詞-終助詞
#
# particle-adverbial/conjunctive/final: The particle "ka" when unknown whether it is
# adverbial, conjunctive, or sentence final. For example:
# (a) 「A か B か」. Ex:「(国内で運用する) か,(海外で運用する) か (.)」
# (b) Inside an adverb phrase. Ex:「(幸いという) か (, 死者はいなかった.)」
# 「(祈りが届いたせい) か (, 試験に合格した.)」
# (c) 「かのように」. Ex:「(何もなかった) か (のように振る舞った.)」
# e.g. か
助詞-副助詞/並立助詞/終助詞
#
# particle-adnominalizer: The "no" that attaches to nouns and modifies
# non-inflectional words.
助詞-連体化
#
# particle-adverbializer: The "ni" and "to" that appear following nouns and adverbs
# that are giongo, giseigo, or gitaigo.
# e.g. に, と
助詞-副詞化
#
# particle-special: A particle that does not fit into one of the above classifications.
# This includes particles that are used in Tanka, Haiku, and other poetry.
# e.g. かな, けむ, ( しただろう) に, (あんた) にゃ(わからん), (俺) ん (家)
助詞-特殊
#
#####
# auxiliary-verb:
助動詞
#
#####
# interjection: Greetings and other exclamations.
# e.g. おはよう, おはようございます, こんにちは, こんばんは, ありがとう, どうもありがとう, ありがとうございます,
# いただきます, ごちそうさま, さよなら, さようなら, はい, いいえ, ごめん, ごめんなさい
#感動詞
#
#####
# symbol: unclassified Symbols.
記号
#
# symbol-misc: A general symbol not in one of the categories below.
# e.g. [○◎@$〒→+]
記号-一般
#
# symbol-comma: Commas
# e.g. [,、]
記号-読点
#
# symbol-period: Periods and full stops.
# e.g. [..。]
記号-句点
#
# symbol-space: Full-width whitespace.
記号-空白
#
# symbol-open_bracket:
# e.g. [({‘“『【]
記号-括弧開
#
# symbol-close_bracket:
# e.g. [)}’”』」】]
記号-括弧閉
#
# symbol-alphabetic:
#記号-アルファベット
#
#####
# other: unclassified other
#その他
#
# other-interjection: Words that are hard to classify as noun-suffixes or
# sentence-final particles.
# e.g. (だ)ァ
その他-間投
#
#####
# filler: Aizuchi that occurs during a conversation or sounds inserted as filler.
# e.g. あの, うんと, えと
フィラー
#
#####
# non-verbal: non-verbal sound.
非言語音
#
#####
# fragment:
#語断片
#
#####
# unknown: unknown part of speech.
#未知語
#
##### End of file

+ 125
- 0
solr_config/lang/stopwords_ar.txt

@@ -0,0 +1,125 @@
# This file was created by Jacques Savoy and is distributed under the BSD license.
# See http://members.unine.ch/jacques.savoy/clef/index.html.
# Also see http://www.opensource.org/licenses/bsd-license.html
# Cleaned on October 11, 2009 (not normalized, so use before normalization)
# This means that when modifying this list, you might need to add some
# redundant entries, for example containing forms with both أ and ا
من
ومن
منها
منه
في
وفي
فيها
فيه
و
ف
ثم
او
أو
ب
بها
به
ا
أ
اى
اي
أي
أى
لا
ولا
الا
ألا
إلا
لكن
ما
وما
كما
فما
عن
مع
اذا
إذا
ان
أن
إن
انها
أنها
إنها
انه
أنه
إنه
بان
بأن
فان
فأن
وان
وأن
وإن
التى
التي
الذى
الذي
الذين
الى
الي
إلى
إلي
على
عليها
عليه
اما
أما
إما
ايضا
أيضا
كل
وكل
لم
ولم
لن
ولن
هى
هي
هو
وهى
وهي
وهو
فهى
فهي
فهو
انت
أنت
لك
لها
له
هذه
هذا
تلك
ذلك
هناك
كانت
كان
يكون
تكون
وكانت
وكان
غير
بعض
قد
نحو
بين
بينما
منذ
ضمن
حيث
الان
الآن
خلال
بعد
قبل
حتى
عند
عندما
لدى
جميع

+ 193
- 0
solr_config/lang/stopwords_bg.txt

@@ -0,0 +1,193 @@
# This file was created by Jacques Savoy and is distributed under the BSD license.
# See http://members.unine.ch/jacques.savoy/clef/index.html.
# Also see http://www.opensource.org/licenses/bsd-license.html
а
аз
ако
ала
бе
без
беше
би
бил
била
били
било
близо
бъдат
бъде
бяха
в
вас
ваш
ваша
вероятно
вече
взема
ви
вие
винаги
все
всеки
всички
всичко
всяка
във
въпреки
върху
г
ги
главно
го
д
да
дали
до
докато
докога
дори
досега
доста
е
едва
един
ето
за
зад
заедно
заради
засега
затова
защо
защото
и
из
или
им
има
имат
иска
й
каза
как
каква
какво
както
какъв
като
кога
когато
което
които
кой
който
колко
която
къде
където
към
ли
м
ме
между
мен
ми
мнозина
мога
могат
може
моля
момента
му
н
на
над
назад
най
направи
напред
например
нас
не
него
нея
ни
ние
никой
нито
но
някои
някой
няма
обаче
около
освен
особено
от
отгоре
отново
още
пак
по
повече
повечето
под
поне
поради
после
почти
прави
пред
преди
през
при
пък
първо
с
са
само
се
сега
си
скоро
след
сме
според
сред
срещу
сте
съм
със
също
т
тази
така
такива
такъв
там
твой
те
тези
ти
тн
то
това
тогава
този
той
толкова
точно
трябва
тук
тъй
тя
тях
у
харесва
ч
че
често
чрез
ще
щом
я

+ 220
- 0
solr_config/lang/stopwords_ca.txt

@@ -0,0 +1,220 @@
# Catalan stopwords from http://github.com/vcl/cue.language (Apache 2 Licensed)
a
abans
ací
ah
així
això
al
als
aleshores
algun
alguna
algunes
alguns
alhora
allà
allí
allò
altra
altre
altres
amb
ambdós
ambdues
apa
aquell
aquella
aquelles
aquells
aquest
aquesta
aquestes
aquests
aquí
baix
cada
cadascú
cadascuna
cadascunes
cadascuns
com
contra
d'un
d'una
d'unes
d'uns
dalt
de
del
dels
des
després
dins
dintre
donat
doncs
durant
e
eh
el
els
em
en
encara
ens
entre
érem
eren
éreu
es
és
esta
està
estàvem
estaven
estàveu
esteu
et
etc
ets
fins
fora
gairebé
ha
han
has
havia
he
hem
heu
hi
ho
i
igual
iguals
ja
l'hi
la
les
li
li'n
llavors
m'he
ma
mal
malgrat
mateix
mateixa
mateixes
mateixos
me
mentre
més
meu
meus
meva
meves
molt
molta
moltes
molts
mon
mons
n'he
n'hi
ne
ni
no
nogensmenys
només
nosaltres
nostra
nostre
nostres
o
oh
oi
on
pas
pel
pels
per
però
perquè
poc
poca
pocs
poques
potser
propi
qual
quals
quan
quant
que
què
quelcom
qui
quin
quina
quines
quins
s'ha
s'han
sa
semblant
semblants
ses
seu
seus
seva
seva
seves
si
sobre
sobretot
sóc
solament
sols
son
són
sons
sota
sou
t'ha
t'han
t'he
ta
tal
també
tampoc
tan
tant
tanta
tantes
teu
teus
teva
teves
ton
tons
tot
tota
totes
tots
un
una
unes
uns
us
va
vaig
vam
van
vas
veu
vosaltres
vostra
vostre
vostres

+ 172
- 0
solr_config/lang/stopwords_cz.txt

@@ -0,0 +1,172 @@
a
s
k
o
i
u
v
z
dnes
cz
tímto
budeš
budem
byli
jseš
můj
svým
ta
tomto
tohle
tuto
tyto
jej
zda
proč
máte
tato
kam
tohoto
kdo
kteří
mi
nám
tom
tomuto
mít
nic
proto
kterou
byla
toho
protože
asi
ho
naši
napište
re
což
tím
takže
svých
její
svými
jste
aj
tu
tedy
teto
bylo
kde
ke
pravé
ji
nad
nejsou
či
pod
téma
mezi
přes
ty
pak
vám
ani
když
však
neg
jsem
tento
článku
články
aby
jsme
před
pta
jejich
byl
ještě
bez
také
pouze
první
vaše
která
nás
nový
tipy
pokud
může
strana
jeho
své
jiné
zprávy
nové
není
vás
jen
podle
zde
být
více
bude
již
než
který
by
které
co
nebo
ten
tak
při
od
po
jsou
jak
další
ale
si
se
ve
to
jako
za
zpět
ze
do
pro
je
na
atd
atp
jakmile
přičemž
on
ona
ono
oni
ony
my
vy
ji
mne
jemu
tomu
těm
těmu
němu
němuž
jehož
jíž
jelikož
jež
jakož
načež

+ 110
- 0
solr_config/lang/stopwords_da.txt

@@ -0,0 +1,110 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/danish/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
| NOTE: To use this file with StopFilterFactory, you must specify format="snowball"

| A Danish stop word list. Comments begin with vertical bar. Each stop
| word is at the start of a line.

| This is a ranked list (commonest to rarest) of stopwords derived from
| a large text sample.


og | and
i | in
jeg | I
det | that (dem. pronoun)/it (pers. pronoun)
at | that (in front of a sentence)/to (with infinitive)
en | a/an
den | it (pers. pronoun)/that (dem. pronoun)
til | to/at/for/until/against/by/of/into, more
er | present tense of "to be"
som | who, as
på | on/upon/in/on/at/to/after/of/with/for, on
de | they
med | with/by/in, along
han | he
af | of/by/from/off/for/in/with/on, off
for | at/for/to/from/by/of/ago, in front/before, because
ikke | not
der | who/which, there/those
var | past tense of "to be"
mig | me/myself
sig | oneself/himself/herself/itself/themselves
men | but
et | a/an/one, one (number), someone/somebody/one
har | present tense of "to have"
om | round/about/for/in/a, about/around/down, if
vi | we
min | my
havde | past tense of "to have"
ham | him
hun | she
nu | now
over | over/above/across/by/beyond/past/on/about, over/past
da | then, when/as/since
fra | from/off/since, off, since
du | you
ud | out
sin | his/her/its/one's
dem | them
os | us/ourselves
op | up
man | you/one
hans | his
hvor | where
eller | or
hvad | what
skal | must/shall etc.
selv | myself/yourself/herself/ourselves etc., even
her | here
alle | all/everyone/everybody etc.
vil | will (verb)
blev | past tense of "to stay/to remain/to get/to become"
kunne | could
ind | in
når | when
være | present tense of "to be"
dog | however/yet/after all
noget | something
ville | would
jo | you know/you see (adv), yes
deres | their/theirs
efter | after/behind/according to/for/by/from, later/afterwards
ned | down
skulle | should
denne | this
end | than
dette | this
mit | my/mine
også | also
under | under/beneath/below/during, below/underneath
have | have
dig | you
anden | other
hende | her
mine | my
alt | everything
meget | much/very, plenty of
sit | his, her, its, one's
sine | his, her, its, one's
vor | our
mod | against
disse | these
hvis | if
din | your/yours
nogle | some
hos | by/at
blive | be/become
mange | many
ad | by/through
bliver | present tense of "to be/to become"
hendes | her/hers
været | be
thi | for (conj)
jer | you
sådan | such, like this/like that
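
The header of this file says it must be loaded with format="snowball". A minimal Python sketch of what that format implies, assuming only the rules stated in the header comments (a vertical bar starts a comment, stop words sit at the start of the line) plus whitespace-separated words on one line, as in the Finnish list. The function name `parse_snowball_stopwords` and the mixed sample are hypothetical and not part of Solr:

```python
def parse_snowball_stopwords(text):
    """Collect stop words from Snowball-format text.

    Everything after the first '|' on a line is a comment; the remainder
    may hold zero, one, or several whitespace-separated words.
    """
    words = set()
    for line in text.splitlines():
        content = line.split("|", 1)[0]  # drop the trailing comment, if any
        words.update(content.split())    # blank and comment-only lines add nothing
    return words


# Illustrative sample mixing Danish and Finnish style lines.
SAMPLE = """\
| A Danish stop word list. Comments begin with vertical bar.
og | and
selv | myself/yourself etc., even

minä minun minut | I
"""

if __name__ == "__main__":
    print(sorted(parse_snowball_stopwords(SAMPLE)))
```

Reading such a file in the plain one-word-per-line format instead would presumably keep the `| ...` glosses as part of each entry, which is why the header insists on the format attribute.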

+ 294
- 0
solr_config/lang/stopwords_de.txt

@@ -0,0 +1,294 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/german/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
| NOTE: To use this file with StopFilterFactory, you must specify format="snowball"

| A German stop word list. Comments begin with vertical bar. Each stop
| word is at the start of a line.

| The number of forms in this list is reduced significantly by passing it
| through the German stemmer.


aber | but

alle | all
allem
allen
aller
alles

als | than, as
also | so
am | an + dem
an | at

ander | other
andere
anderem
anderen
anderer
anderes
anderm
andern
anderr
anders

auch | also
auf | on
aus | out of
bei | by
bin | am
bis | until
bist | art
da | there
damit | with it
dann | then

der | the
den
des
dem
die
das

daß | that

derselbe | the same
derselben
denselben
desselben
demselben
dieselbe
dieselben
dasselbe

dazu | to that

dein | thy
deine
deinem
deinen
deiner
deines

denn | because

derer | of those
dessen | of him

dich | thee
dir | to thee
du | thou

dies | this
diese
diesem
diesen
dieser
dieses


doch | (several meanings)
dort | (over) there


durch | through

ein | a
eine
einem
einen
einer
eines

einig | some
einige
einigem
einigen
einiger
einiges

einmal | once

er | he
ihn | him
ihm | to him

es | it
etwas | something

euer | your
eure
eurem
euren
eurer
eures

für | for
gegen | towards
gewesen | p.p. of sein
hab | have
habe | have
haben | have
hat | has
hatte | had
hatten | had
hier | here
hin | there
hinter | behind

ich | I
mich | me
mir | to me


ihr | you, to her
ihre
ihrem
ihren
ihrer
ihres
euch | to you

im | in + dem
in | in
indem | while
ins | in + das
ist | is

jede | each, every
jedem
jeden
jeder
jedes

jene | that
jenem
jenen
jener
jenes

jetzt | now
kann | can

kein | no
keine
keinem
keinen
keiner
keines

können | can
könnte | could
machen | do
man | one

manche | some, many a
manchem
manchen
mancher
manches

mein | my
meine
meinem
meinen
meiner
meines

mit | with
muss | must
musste | had to
nach | to(wards)
nicht | not
nichts | nothing
noch | still, yet
nun | now
nur | only
ob | whether
oder | or
ohne | without
sehr | very

sein | his
seine
seinem
seinen
seiner
seines

selbst | self
sich | herself

sie | they, she
ihnen | to them

sind | are
so | so

solche | such
solchem
solchen
solcher
solches

soll | shall
sollte | should
sondern | but
sonst | else
über | over
um | about, around
und | and

uns | us
unse
unsem
unsen
unser
unses

unter | under
viel | much
vom | von + dem
von | from
vor | before
während | while
war | was
waren | were
warst | wast
was | what
weg | away, off
weil | because
weiter | further

welche | which
welchem
welchen
welcher
welches

wenn | when
werde | will
werden | will
wie | how
wieder | again
will | want
wir | we
wird | will
wirst | willst
wo | where
wollen | want
wollte | wanted
würde | would
würden | would
zu | to
zum | zu + dem
zur | zu + der
zwar | indeed
zwischen | between


+ 78
- 0
solr_config/lang/stopwords_el.txt

@@ -0,0 +1,78 @@
# Lucene Greek Stopwords list
# Note: by default this file is used after GreekLowerCaseFilter,
# so when modifying this file use 'σ' instead of 'ς'
ο
η
το
οι
τα
του
τησ
των
τον
την
και
κι
κ
ειμαι
εισαι
ειναι
ειμαστε
ειστε
στο
στον
στη
στην
μα
αλλα
απο
για
προσ
με
σε
ωσ
παρα
αντι
κατα
μετα
θα
να
δε
δεν
μη
μην
επι
ενω
εαν
αν
τοτε
που
πωσ
ποιοσ
ποια
ποιο
ποιοι
ποιεσ
ποιων
ποιουσ
αυτοσ
αυτη
αυτο
αυτοι
αυτων
αυτουσ
αυτεσ
αυτα
εκεινοσ
εκεινη
εκεινο
εκεινοι
εκεινεσ
εκεινα
εκεινων
εκεινουσ
οπωσ
ομωσ
ισωσ
οσο
οτι
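
The header note explains why every entry above ends in 'σ' rather than 'ς': the list is matched after lowercasing. A rough Python sketch of that folding, assuming the filter lowercases, drops accents, and rewrites final sigma; `greek_fold` is a hypothetical stand-in, not the actual GreekLowerCaseFilter code:

```python
import unicodedata

FINAL_SIGMA = "\u03c2"  # ς
SIGMA = "\u03c3"        # σ


def greek_fold(token):
    """Approximate the folding a token gets before the stop list is consulted."""
    lowered = token.lower()
    decomposed = unicodedata.normalize("NFD", lowered)
    # Remove combining marks such as the tonos accent.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.replace(FINAL_SIGMA, SIGMA)


if __name__ == "__main__":
    for raw in ("της", "Είναι", "ΌΠΩΣ"):
        print(raw, "->", greek_fold(raw))
```

Under these assumptions, text written as της reaches the stop filter as τησ, so an entry spelled with 'ς' would never match.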

+ 54
- 0
solr_config/lang/stopwords_en.txt

@@ -0,0 +1,54 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# a couple of test stopwords to test that the words are really being
# configured from this file:
stopworda
stopwordb

# Standard English stop words taken from Lucene's StopAnalyzer
a
an
and
are
as
at
be
but
by
for
if
in
into
is
it
no
not
of
on
or
such
that
the
their
then
there
these
they
this
to
was
will
with

+ 356
- 0
solr_config/lang/stopwords_es.txt

@@ -0,0 +1,356 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/spanish/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
| NOTE: To use this file with StopFilterFactory, you must specify format="snowball"

| A Spanish stop word list. Comments begin with vertical bar. Each stop
| word is at the start of a line.


| The following is a ranked list (commonest to rarest) of stopwords
| deriving from a large sample of text.

| Extra words have been added at the end.

de | from, of
la | the, her
que | who, that
el | the
en | in
y | and
a | to
los | the, them
del | de + el
se | himself, from him etc
las | the, them
por | for, by, etc
un | a
para | for
con | with
no | no
una | a
su | his, her
al | a + el
| es from SER
lo | him
como | how
más | more
pero | but
sus | su plural
le | to him, her
ya | already
o | or
| fue from SER
este | this
| ha from HABER
sí | himself etc
porque | because
esta | this
| son from SER
entre | between
| está from ESTAR
cuando | when
muy | very
sin | without
sobre | on
| ser from SER
| tiene from TENER
también | also
me | me
hasta | until
hay | there is/are
donde | where
| han from HABER
quien | whom, that
| están from ESTAR
| estado from ESTAR
desde | from
todo | all
nos | us
durante | during
| estados from ESTAR
todos | all
uno | a
les | to them
ni | nor
contra | against
otros | other
| fueron from SER
ese | that
eso | that
| había from HABER
ante | before
ellos | they
e | and (variant of y)
esto | this
mí | me
antes | before
algunos | some
qué | what?
unos | a
yo | I
otro | other
otras | other
otra | other
él | he
tanto | so much, many
esa | that
estos | these
mucho | much, many
quienes | who
nada | nothing
muchos | many
cual | who
| sea from SER
poco | few
ella | she
estar | to be
| haber from HABER
estas | these
| estaba from ESTAR
| estamos from ESTAR
algunas | some
algo | something
nosotros | we

| other forms

mi | me
mis | mi plural
tú | thou
te | thee
ti | thee
tu | thy
tus | tu plural
ellas | they
nosotras | we
vosotros | you
vosotras | you
os | you
mío | mine
mía |
míos |
mías |
tuyo | thine
tuya |
tuyos |
tuyas |
suyo | his, hers, theirs
suya |
suyos |
suyas |
nuestro | ours
nuestra |
nuestros |
nuestras |
vuestro | yours
vuestra |
vuestros |
vuestras |
esos | those
esas | those

| forms of estar, to be (not including the infinitive):
estoy
estás
está
estamos
estáis
están
esté
estés
estemos
estéis
estén
estaré
estarás
estará
estaremos
estaréis
estarán
estaría
estarías
estaríamos
estaríais
estarían
estaba
estabas
estábamos
estabais
estaban
estuve
estuviste
estuvo
estuvimos
estuvisteis
estuvieron
estuviera
estuvieras
estuviéramos
estuvierais
estuvieran
estuviese
estuvieses
estuviésemos
estuvieseis
estuviesen
estando
estado
estada
estados
estadas
estad

| forms of haber, to have (not including the infinitive):
he
has
ha
hemos
habéis
han
haya
hayas
hayamos
hayáis
hayan
habré
habrás
habrá
habremos
habréis
habrán
habría
habrías
habríamos
habríais
habrían
había
habías
habíamos
habíais
habían
hube
hubiste
hubo
hubimos
hubisteis
hubieron
hubiera
hubieras
hubiéramos
hubierais
hubieran
hubiese
hubieses
hubiésemos
hubieseis
hubiesen
habiendo
habido
habida
habidos
habidas

| forms of ser, to be (not including the infinitive):
soy
eres
es
somos
sois
son
sea
seas
seamos
seáis
sean
seré
serás
será
seremos
seréis
serán
sería
serías
seríamos
seríais
serían
era
eras
éramos
erais
eran
fui
fuiste
fue
fuimos
fuisteis
fueron
fuera
fueras
fuéramos
fuerais
fueran
fuese
fueses
fuésemos
fueseis
fuesen
siendo
sido
| sed also means 'thirst'

| forms of tener, to have (not including the infinitive):
tengo
tienes
tiene
tenemos
tenéis
tienen
tenga
tengas
tengamos
tengáis
tengan
tendré
tendrás
tendrá
tendremos
tendréis
tendrán
tendría
tendrías
tendríamos
tendríais
tendrían
tenía
tenías
teníamos
teníais
tenían
tuve
tuviste
tuvo
tuvimos
tuvisteis
tuvieron
tuviera
tuvieras
tuviéramos
tuvierais
tuvieran
tuviese
tuvieses
tuviésemos
tuvieseis
tuviesen
teniendo
tenido
tenida
tenidos
tenidas
tened


+ 99
- 0
solr_config/lang/stopwords_eu.txt

@@ -0,0 +1,99 @@
# example set of Basque stopwords
al
anitz
arabera
asko
baina
bat
batean
batek
bati
batzuei
batzuek
batzuetan
batzuk
bera
beraiek
berau
berauek
bere
berori
beroriek
beste
bezala
da
dago
dira
ditu
du
dute
edo
egin
ere
eta
eurak
ez
gainera
gu
gutxi
guzti
haiei
haiek
haietan
hainbeste
hala
han
handik
hango
hara
hari
hark
hartan
hau
hauei
hauek
hauetan
hemen
hemendik
hemengo
hi
hona
honek
honela
honetan
honi
hor
hori
horiei
horiek
horietan
horko
horra
horrek
horrela
horretan
horri
hortik
hura
izan
ni
noiz
nola
non
nondik
nongo
nor
nora
ze
zein
zen
zenbait
zenbat
zer
zergatik
ziren
zituen
zu
zuek
zuen
zuten

+ 313
- 0
solr_config/lang/stopwords_fa.txt

@@ -0,0 +1,313 @@
# This file was created by Jacques Savoy and is distributed under the BSD license.
# See http://members.unine.ch/jacques.savoy/clef/index.html.
# Also see http://www.opensource.org/licenses/bsd-license.html
# Note: by default this file is used after normalization, so when adding entries
# to this file, use the arabic 'ي' instead of 'ی'
انان
نداشته
سراسر
خياه
ايشان
وي
تاكنون
بيشتري
دوم
پس
ناشي
وگو
يا
داشتند
سپس
هنگام
هرگز
پنج
نشان
امسال
ديگر
گروهي
شدند
چطور
ده
و
دو
نخستين
ولي
چرا
چه
وسط
ه
كدام
قابل
يك
رفت
هفت
همچنين
در
هزار
بله
بلي
شايد
اما
شناسي
گرفته
دهد
داشته
دانست
داشتن
خواهيم
ميليارد
وقتيكه
امد
خواهد
جز
اورده
شده
بلكه
خدمات
شدن
برخي
نبود
بسياري
جلوگيري
حق
كردند
نوعي
بعري
نكرده
نظير
نبايد
بوده
بودن
داد
اورد
هست
جايي
شود
دنبال
داده
بايد
سابق
هيچ
همان
انجا
كمتر
كجاست
گردد
كسي
تر
مردم
تان
دادن
بودند
سري
جدا
ندارند
مگر
يكديگر
دارد
دهند
بنابراين
هنگامي
سمت
جا
انچه
خود
دادند
زياد
دارند
اثر
بدون
بهترين
بيشتر
البته
به
براساس
بيرون
كرد
بعضي
گرفت
توي
اي
ميليون
او
جريان
تول
بر
مانند
برابر
باشيم
مدتي
گويند
اكنون
تا
تنها
جديد
چند
بي
نشده
كردن
كردم
گويد
كرده
كنيم
نمي
نزد
روي
قصد
فقط
بالاي
ديگران
اين
ديروز
توسط
سوم
ايم
دانند
سوي
استفاده
شما
كنار
داريم
ساخته
طور
امده
رفته
نخست
بيست
نزديك
طي
كنيد
از
انها
تمامي
داشت
يكي
طريق
اش
چيست
روب
نمايد
گفت
چندين
چيزي
تواند
ام
ايا
با
ان
ايد
ترين
اينكه
ديگري
راه
هايي
بروز
همچنان
پاعين
كس
حدود
مختلف
مقابل
چيز
گيرد
ندارد
ضد
همچون
سازي
شان
مورد
باره
مرسي
خويش
برخوردار
چون
خارج
شش
هنوز
تحت
ضمن
هستيم
گفته
فكر
بسيار
پيش
براي
روزهاي
انكه
نخواهد
بالا
كل
وقتي
كي
چنين
كه
گيري
نيست
است
كجا
كند
نيز
يابد
بندي
حتي
توانند
عقب
خواست
كنند
بين
تمام
همه
ما
باشند
مثل
شد
اري
باشد
اره
طبق
بعد
اگر
صورت
غير
جاي
بيش
ريزي
اند
زيرا
چگونه
بار
لطفا
مي
درباره
من
ديده
همين
گذاري
برداري
علت
گذاشته
هم
فوق
نه
ها
شوند
اباد
همواره
هر
اول
خواهند
چهار
نام
امروز
مان
هاي
قبل
كنم
سعي
تازه
را
هستند
زير
جلوي
عنوان
بود
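
The header note says this list is applied after normalization, so new entries must use the Arabic 'ي' rather than the Persian 'ی'. A small Python sketch of converting a word typed on a Persian keyboard into the form the list expects. The helper `to_list_form` is hypothetical, and the kaf mapping is an assumption based on entries such as كه above; the header itself mentions only yeh:

```python
FARSI_YEH = "\u06cc"   # ی
ARABIC_YEH = "\u064a"  # ي
KEHEH = "\u06a9"       # ک (Persian kaf)
ARABIC_KAF = "\u0643"  # ك

# Assumed, simplified subset of the normalization applied before stopping.
_TO_LIST_FORM = str.maketrans({FARSI_YEH: ARABIC_YEH, KEHEH: ARABIC_KAF})


def to_list_form(word):
    """Rewrite Persian-specific letters to the Arabic code points used in the list."""
    return word.translate(_TO_LIST_FORM)


if __name__ == "__main__":
    typed = "\u0648\u06cc"  # 'وی' as usually typed in Persian
    print(to_list_form(typed) == "\u0648\u064a")  # matches the entry 'وي'
```

Running a candidate entry through a helper like this before appending it is one way to avoid adding a word that can never match.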

+ 97
- 0
solr_config/lang/stopwords_fi.txt

@@ -0,0 +1,97 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/finnish/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
| NOTE: To use this file with StopFilterFactory, you must specify format="snowball"
| forms of BE

olla
olen
olet
on
olemme
olette
ovat
ole | negative form

oli
olisi
olisit
olisin
olisimme
olisitte
olisivat
olit
olin
olimme
olitte
olivat
ollut
olleet

en | negation
et
ei
emme
ette
eivät

|Nom Gen Acc Part Iness Elat Illat Adess Ablat Allat Ess Trans
minä minun minut minua minussa minusta minuun minulla minulta minulle | I
sinä sinun sinut sinua sinussa sinusta sinuun sinulla sinulta sinulle | you
hän hänen hänet häntä hänessä hänestä häneen hänellä häneltä hänelle | he she
me meidän meidät meitä meissä meistä meihin meillä meiltä meille | we
te teidän teidät teitä teissä teistä teihin teillä teiltä teille | you
he heidän heidät heitä heissä heistä heihin heillä heiltä heille | they

tämä tämän tätä tässä tästä tähän tallä tältä tälle tänä täksi | this
tuo tuon tuotä tuossa tuosta tuohon tuolla tuolta tuolle tuona tuoksi | that
se sen sitä siinä siitä siihen sillä siltä sille sinä siksi | it
nämä näiden näitä näissä näistä näihin näillä näiltä näille näinä näiksi | these
nuo noiden noita noissa noista noihin noilla noilta noille noina noiksi | those
ne niiden niitä niissä niistä niihin niillä niiltä niille niinä niiksi | they

kuka kenen kenet ketä kenessä kenestä keneen kenellä keneltä kenelle kenenä keneksi| who
ketkä keiden ketkä keitä keissä keistä keihin keillä keiltä keille keinä keiksi | (pl)
mikä minkä minkä mitä missä mistä mihin millä miltä mille minä miksi | which what
mitkä | (pl)

joka jonka jota jossa josta johon jolla jolta jolle jona joksi | who which
jotka joiden joita joissa joista joihin joilla joilta joille joina joiksi | (pl)

| conjunctions

että | that
ja | and
jos | if
koska | because
kuin | than
mutta | but
niin | so
sekä | and
sillä | for
tai | or
vaan | but
vai | or
vaikka | although


| prepositions

kanssa | with
mukaan | according to
noin | about
poikki | across
yli | over, across

| other

kun | when
niin | so
nyt | now
itse | self


+ 186
- 0
solr_config/lang/stopwords_fr.txt

@@ -0,0 +1,186 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/french/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
| NOTE: To use this file with StopFilterFactory, you must specify format="snowball"

| A French stop word list. Comments begin with vertical bar. Each stop
| word is at the start of a line.

au | a + le
aux | a + les
avec | with
ce | this
ces | these
dans | in
de | of
des | de + les
du | de + le
elle | she
en | `of them' etc
et | and
eux | them
il | he
je | I
la | the
le | the
leur | their
lui | him
ma | my (fem)
mais | but
me | me
même | same; as in moi-même (myself) etc
mes | my (pl)
moi | me
mon | my (masc)
ne | not
nos | our (pl)
notre | our
nous | we
on | one
ou | or
par | by
pas | not
pour | for
qu | que before vowel
que | that
qui | who
sa | his, her (fem)
se | oneself
ses | his (pl)
son | his, her (masc)
sur | on
ta | thy (fem)
te | thee
tes | thy (pl)
toi | thee
ton | thy (masc)
tu | thou
un | a
une | a
vos | your (pl)
votre | your
vous | you

| single letter forms

c | c'
d | d'
j | j'
l | l'
à | to, at
m | m'
n | n'
s | s'
t | t'
y | there

| forms of être (not including the infinitive):
été
étée
étées
étés
étant
suis
es
est
sommes
êtes
sont
serai
seras
sera
serons
serez
seront
serais
serait
serions
seriez
seraient
étais
était
étions
étiez
étaient
fus
fut
fûmes
fûtes
furent
sois
soit
soyons
soyez
soient
fusse
fusses
fût
fussions
fussiez
fussent

| forms of avoir (not including the infinitive):
ayant
eu
eue
eues
eus
ai
as
avons
avez
ont
aurai
auras
aura
aurons
aurez
auront
aurais
aurait
aurions
auriez
auraient
avais
avait
avions
aviez
avaient
eut
eûmes
eûtes
eurent
aie
aies
ait
ayons
ayez
aient
eusse
eusses
eût
eussions
eussiez
eussent

| Later additions (from Jean-Christophe Deschamps)
ceci | this
cela | that
celà | that
cet | this
cette | this
ici | here
ils | they
les | the (pl)
leurs | their (pl)
quel | which
quels | which
quelle | which
quelles | which
sans | without
soi | oneself


+ 110
- 0
solr_config/lang/stopwords_ga.txt

@@ -0,0 +1,110 @@

a
ach
ag
agus
an
aon
ar
arna
as
b'
ba
beirt
bhúr
caoga
ceathair
ceathrar
chomh
chtó
chuig
chun
cois
céad
cúig
cúigear
d'
daichead
dar
de
deich
deichniúr
den
dhá
do
don
dtí
dár
faoi
faoin
faoina
faoinár
fara
fiche
gach
gan
go
gur
haon
hocht
i
iad
idir
in
ina
ins
inár
is
le
leis
lena
lenár
m'
mar
mo
na
nach
naoi
naonúr
níor
nócha
ocht
ochtar
os
roimh
sa
seacht
seachtar
seachtó
seasca
seisear
siad
sibh
sinn
sna
tar
thar
thú
triúr
trí
trína
trínár
tríocha
um
ár
é
éis
í
ó
ón
óna
ónár

+ 161
- 0
solr_config/lang/stopwords_gl.txt

@@ -0,0 +1,161 @@
# Galician stopwords
a
aínda
alí
aquel
aquela
aquelas
aqueles
aquilo
aquí
ao
aos
as
así
á
ben
cando
che
co
coa
comigo
con
connosco
contigo
convosco
coas
cos
cun
cuns
cunha
cunhas
da
dalgunha
dalgunhas
dalgún
dalgúns
das
de
del
dela
delas
deles
desde
deste
do
dos
dun
duns
dunha
dunhas
e
el
ela
elas
eles
en
era
eran
esa
esas
ese
eses
esta
estar
estaba
está
están
este
estes
estiven
estou
eu
é
facer
foi
foron
fun
había
hai
iso
isto
la
las
lle
lles
lo
los
mais
me
meu
meus
min
miña
miñas
moi
na
nas
neste
nin
no
non
nos
nosa
nosas
noso
nosos
nós
nun
nunha
nuns
nunhas
o
os
ou
ó
ós
para
pero
pode
pois
pola
polas
polo
polos
por
que
se
senón
ser
seu
seus
sexa
sido
sobre
súa
súas
tamén
tan
te
ten
teñen
teño
ter
teu
teus
ti
tido
tiña
tiven
túa
túas
un
unha
unhas
uns
vos
vosa
vosas
voso
vosos
vós

+ 235
- 0
solr_config/lang/stopwords_hi.txt

@@ -0,0 +1,235 @@
# This file was created by Jacques Savoy and is distributed under the BSD license.
# See http://members.unine.ch/jacques.savoy/clef/index.html.
# Also see http://www.opensource.org/licenses/bsd-license.html
# Note: by default this file also contains forms normalized by HindiNormalizer
# for spelling variation (see section below), such that it can be used whether or
# not you enable that feature. When adding additional entries to this list,
# please add the normalized form as well.
अंदर
अत
अपना
अपनी
अपने
अभी
आदि
आप
इत्यादि
इन
इनका
इन्हीं
इन्हें
इन्हों
इस
इसका
इसकी
इसके
इसमें
इसी
इसे
उन
उनका
उनकी
उनके
उनको
उन्हीं
उन्हें
उन्हों
उस
उसके
उसी
उसे
एक
एवं
एस
ऐसे
और
कई
कर
करता
करते
करना
करने
करें
कहते
कहा
का
काफ़ी
कि
कितना
किन्हें
किन्हों
किया
किर
किस
किसी
किसे
की
कुछ
कुल
के
को
कोई
कौन
कौनसा
गया
घर
जब
जहाँ
जा
जितना
जिन
जिन्हें
जिन्हों
जिस
जिसे
जीधर
जैसा
जैसे
जो
तक
तब
तरह
तिन
तिन्हें
तिन्हों
तिस
तिसे
तो
था
थी
थे
दबारा
दिया
दुसरा
दूसरे
दो
द्वारा
नहीं
ना
निहायत
नीचे
ने
पर
पर
पहले
पूरा
पे
फिर
बनी
बही
बहुत
बाद
बाला
बिलकुल
भी
भीतर
मगर
मानो
मे
में
यदि
यह
यहाँ
यही
या
यिह
ये
रखें
रहा
रहे
ऱ्वासा
लिए
लिये
लेकिन
वर्ग
वह
वह
वहाँ
वहीं
वाले
वुह
वे
वग़ैरह
संग
सकता
सकते
सबसे
सभी
साथ
साबुत
साभ
सारा
से
सो
ही
हुआ
हुई
हुए
है
हैं
हो
होता
होती
होते
होना
होने
# additional normalized forms of the above
अपनि
जेसे
होति
सभि
तिंहों
इंहों
दवारा
इसि
किंहें
थि
उंहों
ओर
जिंहें
वहिं
अभि
बनि
हि
उंहिं
उंहें
हें
वगेरह
एसे
रवासा
कोन
निचे
काफि
उसि
पुरा
भितर
हे
बहि
वहां
कोइ
यहां
जिंहों
तिंहें
किसि
कइ
यहि
इंहिं
जिधर
इंहें
अदि
इतयादि
हुइ
कोनसा
इसकि
दुसरे
जहां
अप
किंहों
उनकि
भि
वरग
हुअ
जेसा
नहिं

+ 211
- 0
solr_config/lang/stopwords_hu.txt

@@ -0,0 +1,211 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/hungarian/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
| NOTE: To use this file with StopFilterFactory, you must specify format="snowball"
| Hungarian stop word list
| prepared by Anna Tordai

a
ahogy
ahol
aki
akik
akkor
alatt
által
általában
amely
amelyek
amelyekben
amelyeket
amelyet
amelynek
ami
amit
amolyan
amíg
amikor
át
abban
ahhoz
annak
arra
arról
az
azok
azon
azt
azzal
azért
aztán
azután
azonban
bár
be
belül
benne
cikk
cikkek
cikkeket
csak
de
e
eddig
egész
egy
egyes
egyetlen
egyéb
egyik
egyre
ekkor
el
elég
ellen
elő
először
előtt
első
én
éppen
ebben
ehhez
emilyen
ennek
erre
ez
ezt
ezek
ezen
ezzel
ezért
és
fel
felé
hanem
hiszen
hogy
hogyan
igen
így
illetve
ill.
ill
ilyen
ilyenkor
ison
ismét
itt
jól
jobban
kell
kellett
keresztül
keressünk
ki
kívül
között
közül
legalább
lehet
lehetett
legyen
lenne
lenni
lesz
lett
maga
magát
majd
majd
már
más
másik
meg
még
mellett
mert
mely
melyek
mi
mit
míg
miért
milyen
mikor
minden
mindent
mindenki
mindig
mint
mintha
mivel
most
nagy
nagyobb
nagyon
ne
néha
nekem
neki
nem
néhány
nélkül
nincs
olyan
ott
össze
ő
ők
őket
pedig
persze
s
saját
sem
semmi
sok
sokat
sokkal
számára
szemben
szerint
szinte
talán
tehát
teljes
tovább
továbbá
több
úgy
ugyanis
új
újabb
újra
után
utána
utolsó
vagy
vagyis
valaki
valami
valamint
való
vagyok
van
vannak
volt
voltam
voltak
voltunk
vissza
vele
viszont
volna

+ 46
- 0
solr_config/lang/stopwords_hy.txt

@@ -0,0 +1,46 @@
# example set of Armenian stopwords.
այդ
այլ
այն
այս
դու
դուք
եմ
են
ենք
ես
եք
է
էի
էին
էինք
էիր
էիք
էր
ըստ
թ
ի
ին
իսկ
իր
կամ
համար
հետ
հետո
մենք
մեջ
մի
ն
նա
նաև
նրա
նրանք
որ
որը
որոնք
որպես
ու
ում
պիտի
վրա
և

+ 359
- 0
solr_config/lang/stopwords_id.txt

@@ -0,0 +1,359 @@
# from appendix D of: A Study of Stemming Effects on Information
# Retrieval in Bahasa Indonesia
ada
adanya
adalah
adapun
agak
agaknya
agar
akan
akankah
akhirnya
aku
akulah
amat
amatlah
anda
andalah
antar
diantaranya
antara
antaranya
diantara
apa
apaan
mengapa
apabila
apakah
apalagi
apatah
atau
ataukah
ataupun
bagai
bagaikan
sebagai
sebagainya
bagaimana
bagaimanapun
sebagaimana
bagaimanakah
bagi
bahkan
bahwa
bahwasanya
sebaliknya
banyak
sebanyak
beberapa
seberapa
begini
beginian
beginikah
beginilah
sebegini
begitu
begitukah
begitulah
begitupun
sebegitu
belum
belumlah
sebelum
sebelumnya
sebenarnya
berapa
berapakah
berapalah
berapapun
betulkah
sebetulnya
biasa
biasanya
bila
bilakah
bisa
bisakah
sebisanya
boleh
bolehkah
bolehlah
buat
bukan
bukankah
bukanlah
bukannya
cuma
percuma
dahulu
dalam
dan
dapat
dari
daripada
dekat
demi
demikian
demikianlah
sedemikian
dengan
depan
di
dia
dialah
dini
diri
dirinya
terdiri
dong
dulu
enggak
enggaknya
entah
entahlah
terhadap
terhadapnya
hal
hampir
hanya
hanyalah
harus
haruslah
harusnya
seharusnya
hendak
hendaklah
hendaknya
hingga
sehingga
ia
ialah
ibarat
ingin
inginkah
inginkan
ini
inikah
inilah
itu
itukah
itulah
jangan
jangankan
janganlah
jika
jikalau
juga
justru
kala
kalau
kalaulah
kalaupun
kalian
kami
kamilah
kamu
kamulah
kan
kapan
kapankah
kapanpun
dikarenakan
karena
karenanya
ke
kecil
kemudian
kenapa
kepada
kepadanya
ketika
seketika
khususnya
kini
kinilah
kiranya
sekiranya
kita
kitalah
kok
lagi
lagian
selagi
lah
lain
lainnya
melainkan
selaku
lalu
melalui
terlalu
lama
lamanya
selama
selama
selamanya
lebih
terlebih
bermacam
macam
semacam
maka
makanya
makin
malah
malahan
mampu
mampukah
mana
manakala
manalagi
masih
masihkah
semasih
masing
mau
maupun
semaunya
memang
mereka
merekalah
meski
meskipun
semula
mungkin
mungkinkah
nah
namun
nanti
nantinya
nyaris
oleh
olehnya
seorang
seseorang
pada
padanya
padahal
paling
sepanjang
pantas
sepantasnya
sepantasnyalah
para
pasti
pastilah
per
pernah
pula
pun
merupakan
rupanya
serupa
saat
saatnya
sesaat
saja
sajalah
saling
bersama
sama
sesama
sambil
sampai
sana
sangat
sangatlah
saya
sayalah
se
sebab
sebabnya
sebuah
tersebut
tersebutlah
sedang
sedangkan
sedikit
sedikitnya
segala
segalanya
segera
sesegera
sejak
sejenak
sekali
sekalian
sekalipun
sesekali
sekaligus
sekarang
sekarang
sekitar
sekitarnya
sela
selain
selalu
seluruh
seluruhnya
semakin
sementara
sempat
semua
semuanya
sendiri
sendirinya
seolah
seperti
sepertinya
sering
seringnya
serta
siapa
siapakah
siapapun
disini
disinilah
sini
sinilah
sesuatu
sesuatunya
suatu
sesudah
sesudahnya
sudah
sudahkah
sudahlah
supaya
tadi
tadinya
tak
tanpa
setelah
telah
tentang
tentu
tentulah
tentunya
tertentu
seterusnya
tapi
tetapi
setiap
tiap
setidaknya
tidak
tidakkah
tidaklah
toh
waduh
wah
wahai
sewaktu
walau
walaupun
wong
yaitu
yakni
yang

+ 303
- 0
solr_config/lang/stopwords_it.txt

@@ -0,0 +1,303 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/italian/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
| NOTE: To use this file with StopFilterFactory, you must specify format="snowball"

| An Italian stop word list. Comments begin with vertical bar. Each stop
| word is at the start of a line.

ad | a (to) before vowel
al | a + il
allo | a + lo
ai | a + i
agli | a + gli
all | a + l'
agl | a + gl'
alla | a + la
alle | a + le
con | with
col | con + il
coi | con + i (forms collo, cogli etc are now very rare)
da | from
dal | da + il
dallo | da + lo
dai | da + i
dagli | da + gli
dall | da + l'
dagl | da + gl'
dalla | da + la
dalle | da + le
di | of
del | di + il
dello | di + lo
dei | di + i
degli | di + gli
dell | di + l'
degl | di + gl'
della | di + la
delle | di + le
in | in
nel | in + il
nello | in + lo
nei | in + i
negli | in + gli
nell | in + l'
negl | in + gl'
nella | in + la
nelle | in + le
su | on
sul | su + il
sullo | su + lo
sui | su + i
sugli | su + gli
sull | su + l'
sugl | su + gl'
sulla | su + la
sulle | su + le
per | through, by
tra | among
contro | against
io | I
tu | thou
lui | he
lei | she
noi | we
voi | you
loro | they
mio | my
mia |
miei |
mie |
tuo |
tua |
tuoi | thy
tue |
suo |
sua |
suoi | his, her
sue |
nostro | our
nostra |
nostri |
nostre |
vostro | your
vostra |
vostri |
vostre |
mi | me
ti | thee
ci | us, there
vi | you, there
lo | him, the
la | her, the
li | them
le | them, the
gli | to him, the
ne | from there etc
il | the
un | a
uno | a
una | a
ma | but
ed | and
se | if
perché | why, because
anche | also
come | how
dov | where (as dov')
dove | where
che | who, that
chi | who
cui | whom
non | not
più | more
quale | who, that
quanto | how much
quanti |
quanta |
quante |
quello | that
quelli |
quella |
quelle |
questo | this
questi |
questa |
queste |
si | yes
tutto | all
tutti | all

| single letter forms:

a | at
c | as c' for ce or ci
e | and
i | the
l | as l'
o | or

| forms of avere, to have (not including the infinitive):

ho
hai
ha
abbiamo
avete
hanno
abbia
abbiate
abbiano
avrò
avrai
avrà
avremo
avrete
avranno
avrei
avresti
avrebbe
avremmo
avreste
avrebbero
avevo
avevi
aveva
avevamo
avevate
avevano
ebbi
avesti
ebbe
avemmo
aveste
ebbero
avessi
avesse
avessimo
avessero
avendo
avuto
avuta
avuti
avute

| forms of essere, to be (not including the infinitive):
sono
sei
è
siamo
siete
sia
siate
siano
sarò
sarai
sarà
saremo
sarete
saranno
sarei
saresti
sarebbe
saremmo
sareste
sarebbero
ero
eri
era
eravamo
eravate
erano
fui
fosti
fu
fummo
foste
furono
fossi
fosse
fossimo
fossero
essendo

| forms of fare, to do (not including the infinitive, fa, fat-):
faccio
fai
facciamo
fanno
faccia
facciate
facciano
farò
farai
farà
faremo
farete
faranno
farei
faresti
farebbe
faremmo
fareste
farebbero
facevo
facevi
faceva
facevamo
facevate
facevano
feci
facesti
fece
facemmo
faceste
fecero
facessi
facesse
facessimo
facessero
facendo

| forms of stare, to be (not including the infinitive):
sto
stai
sta
stiamo
stanno
stia
stiate
stiano
starò
starai
starà
staremo
starete
staranno
starei
staresti
starebbe
staremmo
stareste
starebbero
stavo
stavi
stava
stavamo
stavate
stavano
stetti
stesti
stette
stemmo
steste
stettero
stessi
stesse
stessimo
stessero
stando

+ 127
- 0
solr_config/lang/stopwords_ja.txt

@@ -0,0 +1,127 @@
#
# This file defines a stopword set for Japanese.
#
# This set is made up of hand-picked frequent terms from segmented Japanese Wikipedia.
# Punctuation characters and frequent kanji have mostly been left out. See LUCENE-3745
# for frequency lists, etc. that can be useful for making your own set (if desired)
#
# Note that there is an overlap between these stopwords and the terms stopped when used
# in combination with the JapanesePartOfSpeechStopFilter. When editing this file, note
# that comments are not allowed on the same line as stopwords.
#
# Also note that stopping is done in a case-insensitive manner. Change your StopFilter
# configuration if you need case-sensitive stopping. Lastly, note that stopping is done
# using the same character width as the entries in this file. Since this StopFilter is
# normally done after a CJKWidthFilter in your chain, you would usually want your romaji
# entries to be in half-width and your kana entries to be in full-width.
#
ある
いる
する
から
こと
として
れる
など
なっ
ない
この
ため
その
あっ
よう
また
もの
という
あり
まで
られ
なる
これ
によって
により
おり
より
による
なり
られる
において
なかっ
なく
しかし
について
だっ
その後
できる
それ
ので
なお
のみ
でき
における
および
いう
さらに
でも
たり
その他
に関する
たち
ます
なら
に対して
特に
せる
及び
これら
とき
では
にて
ほか
ながら
うち
そして
とともに
ただし
かつて
それぞれ
または
ほど
ものの
に対する
ほとんど
と共に
といった
です
とも
ところ
ここ
##### End of file
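
The header of stopwords_ja.txt notes that stopping is case-insensitive and compares entries at the character width left by the CJKWidthFilter. A minimal Python sketch of that idea follows. It assumes Unicode NFKC normalization as a stand-in for width folding (NFKC does more than Lucene's CJKWidthFilter, so this is only an approximation), and the two stopword entries are hypothetical examples, not taken from the file.

```python
import unicodedata


def fold_width(term):
    """Approximate width folding with NFKC (an assumption, not Lucene's CJKWidthFilter)."""
    # NFKC maps full-width Latin letters to half-width and
    # half-width katakana to full-width, among other things.
    return unicodedata.normalize("NFKC", term)


def is_stopped(term, stopwords):
    """Fold width first, then compare case-insensitively against the stop set."""
    return fold_width(term).lower() in stopwords


# Hypothetical entries: romaji in half-width, kana in full-width.
stopwords = {"solr", "カナ"}

print(is_stopped("ＳＯＬＲ", stopwords))  # full-width romaji input
print(is_stopped("ｶﾅ", stopwords))        # half-width katakana input
```

Because both inputs are folded before lookup, the stop set only needs one width per script, which is the reason the header recommends half-width romaji and full-width kana entries.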

+ 172
- 0
solr_config/lang/stopwords_lv.txt

@@ -0,0 +1,172 @@
# Set of Latvian stopwords from A Stemming Algorithm for Latvian, Karlis Kreslins
# the original list of over 800 forms was refined:
# pronouns, adverbs, interjections were removed
#
# prepositions
aiz
ap
ar
apakš
ārpus
augšpus
bez
caur
dēļ
gar
iekš
iz
kopš
labad
lejpus
līdz
no
otrpus
pa
par
pār
pēc
pie
pirms
pret
priekš
starp
šaipus
uz
viņpus
virs
virspus
zem
apakšpus
# Conjunctions
un
bet
jo
ja
ka
lai
tomēr
tikko
turpretī
arī
kaut
gan
tādēļ
ne
tikvien
vien
ir
te
vai
kamēr
# Particles
ar
diezin
droši
diemžēl
nebūt
ik
it
taču
nu
pat
tiklab
iekšpus
nedz
tik
nevis
turpretim
jeb
iekam
iekām
iekāms
kolīdz
līdzko
tiklīdz
jebšu
tālab
tāpēc
nekā
itin
jau
jel
nezin
tad
tikai
vis
tak
iekams
vien
# modal verbs
būt
biju
biji
bija
bijām
bijāt
esmu
esi
esam
esat
būšu
būsi
būs
būsim
būsiet
tikt
tiku
tiki
tika
tikām
tikāt
tieku
tiec
tiek
tiekam
tiekat
tikšu
tiks
tiksim
tiksiet
tapt
tapi
tapāt
topat
tapšu
tapsi
taps
tapsim
tapsiet
kļūt
kļuvu
kļuvi
kļuva
kļuvām
kļuvāt
kļūstu
kļūsti
kļūst
kļūstam
kļūstat
kļūšu
kļūsi
kļūs
kļūsim
kļūsiet
# verbs
varēt
varēju
varējām
varēšu
varēsim
var
varēji
varējāt
varēsi
varēsiet
varat
varēja
varēs

+ 119
- 0
solr_config/lang/stopwords_nl.txt

@@ -0,0 +1,119 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/dutch/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
| NOTE: To use this file with StopFilterFactory, you must specify format="snowball"

| A Dutch stop word list. Comments begin with vertical bar. Each stop
| word is at the start of a line.

| This is a ranked list (commonest to rarest) of stopwords derived from
| a large sample of Dutch text.

| Dutch stop words frequently exhibit homonym clashes. These are indicated
| clearly below.

de | the
en | and
van | of, from
ik | I, the ego
te | (1) chez, at etc, (2) to, (3) too
dat | that, which
die | that, those, who, which
in | in, inside
een | a, an, one
hij | he
het | the, it
niet | not, nothing, naught
zijn | (1) to be, being, (2) his, one's, its
is | is
was | (1) was, past tense of all persons sing. of 'zijn' (to be) (2) wax, (3) the washing, (4) rise of river
op | on, upon, at, in, up, used up
aan | on, upon, to (as dative)
met | with, by
als | like, such as, when
voor | (1) before, in front of, (2) furrow
had | had, past tense all persons sing. of 'hebben' (have)
er | there
maar | but, only
om | round, about, for etc
hem | him
dan | then
zou | should/would, past tense all persons sing. of 'zullen'
of | or, whether, if
wat | what, something, anything
mijn | possessive and noun 'mine'
men | people, 'one'
dit | this
zo | so, thus, in this way
door | through, by
over | over, across
ze | she, her, they, them
zich | oneself
bij | (1) a bee, (2) by, near, at
ook | also, too
tot | till, until
je | you
mij | me
uit | out of, from
der | Old Dutch form of 'van der' still found in surnames
daar | (1) there, (2) because
haar | (1) her, their, them, (2) hair
naar | (1) unpleasant, unwell etc, (2) towards, (3) as
heb | present first person sing. of 'to have'
hoe | how, why
heeft | present third person sing. of 'to have'
hebben | 'to have' and various parts thereof
deze | this
u | you
want | (1) for, (2) mitten, (3) rigging
nog | yet, still
zal | 'shall', first and third person sing. of verb 'zullen' (will)
me | me
zij | she, they
nu | now
ge | 'thou', still used in Belgium and south Netherlands
geen | none
omdat | because
iets | something, somewhat
worden | to become, grow, get
toch | yet, still
al | all, every, each
waren | (1) 'were', (2) to wander, (3) wares
veel | much, many
meer | (1) more, (2) lake
doen | to do, to make
toen | then, when
moet | noun 'spot/mote' and present form of 'to must'
ben | (1) am, (2) 'are' in interrogative second person singular of 'to be'
zonder | without
kan | noun 'can' and present form of 'to be able'
hun | their, them
dus | so, consequently
alles | all, everything, anything
onder | under, beneath
ja | yes, of course
eens | once, one day
hier | here
wie | who
werd | imperfect third person sing. of 'become'
altijd | always
doch | yet, but etc
wordt | present third person sing. of 'become'
wezen | (1) to be, (2) 'been' as in 'been fishing', (3) orphans
kunnen | to be able
ons | us/our
zelf | self
tegen | against, towards, at
na | after, near
reeds | already
wil | (1) present tense of 'want', (2) 'will', noun, (3) fender
kon | could; past tense of 'to be able'
niets | nothing
uw | your
iemand | somebody
geweest | been; past participle of 'be'
andere | other
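
The header above describes the snowball format: a vertical bar starts a comment, and each stop word sits at the start of a line. The rules can be sketched in Python as below. This is an illustrative reader written from that description only; the function name is hypothetical and it is not the loader Solr itself uses.

```python
def parse_snowball_stopwords(text):
    """Collect stop words from snowball-format text, ignoring '|' comments."""
    words = set()
    for line in text.splitlines():
        # Everything from the first vertical bar onward is a comment.
        content = line.split("|", 1)[0].strip()
        # Whatever remains is treated as whitespace-separated stop words;
        # blank and comment-only lines contribute nothing.
        words.update(content.split())
    return words


sample = """\
| A Dutch stop word list.
de | the
en | and

van | of, from
"""

print(sorted(parse_snowball_stopwords(sample)))
```

Lines such as `minhas` in the Portuguese list, which carry no comment at all, pass through the same code path unchanged.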

+ 194
- 0
solr_config/lang/stopwords_no.txt

@@ -0,0 +1,194 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/norwegian/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
| NOTE: To use this file with StopFilterFactory, you must specify format="snowball"

| A Norwegian stop word list. Comments begin with vertical bar. Each stop
| word is at the start of a line.

| This stop word list is for the dominant bokmål dialect. Words unique
| to nynorsk are marked *.

| Revised by Jan Bruusgaard <Jan.Bruusgaard@ssb.no>, Jan 2005

og | and
i | in
jeg | I
det | it/this/that
at | to (w. inf.)
en | a/an
et | a/an
den | it/this/that
til | to
er | is/am/are
som | who/that
på | on
de | they / you(formal)
med | with
han | he
av | of
ikke | not
ikkje | not *
der | there
så | so
var | was/were
meg | me
seg | you
men | but
ett | one
har | have
om | about
vi | we
min | my
mitt | my
ha | have
hadde | had
hun | she
nå | now
over | over
da | when/as
ved | by/know
fra | from
du | you
ut | out
sin | your
dem | them
oss | us
opp | up
man | you/one
kan | can
hans | his
hvor | where
eller | or
hva | what
skal | shall/must
selv | self (reflexive)
sjøl | self (reflexive)
her | here
alle | all
vil | will
bli | become
ble | became
blei | became *
blitt | have become
kunne | could
inn | in
når | when
være | be
kom | come
noen | some
noe | some
ville | would
dere | you
som | who/which/that
deres | their/theirs
kun | only/just
ja | yes
etter | after
ned | down
skulle | should
denne | this
for | for/because
deg | you
si | hers/his
sine | hers/his
sitt | hers/his
mot | against
å | to
meget | much
hvorfor | why
dette | this
disse | these/those
uten | without
hvordan | how
ingen | none
din | your
ditt | your
blir | become
samme | same
hvilken | which
hvilke | which (plural)
sånn | such a
inni | inside/within
mellom | between
vår | our
hver | each
hvem | who
vors | us/ours
hvis | whose
både | both
bare | only/just
enn | than
fordi | as/because
før | before
mange | many
også | also
slik | just
vært | been
være | to be
båe | both *
begge | both
siden | since
dykk | your *
dykkar | yours *
dei | they *
deira | them *
deires | theirs *
deim | them *
di | your (fem.) *
då | as/when *
eg | I *
ein | a/an *
eit | a/an *
eitt | a/an *
elles | or *
honom | he *
hjå | at *
ho | she *
hoe | she *
henne | her
hennar | her/hers
hennes | hers
hoss | how *
hossen | how *
ikkje | not *
ingi | noone *
inkje | noone *
korleis | how *
korso | how *
kva | what/which *
kvar | where *
kvarhelst | where *
kven | who/whom *
kvi | why *
kvifor | why *
me | we *
medan | while *
mi | my *
mine | my *
mykje | much *
no | now *
nokon | some (masc./neut.) *
noka | some (fem.) *
nokor | some *
noko | some *
nokre | some *
si | his/hers *
sia | since *
sidan | since *
so | so *
somt | some *
somme | some *
um | about*
upp | up *
vere | be *
vore | was *
verte | become *
vort | become *
varte | became *
vart | became *


+ 253
- 0
solr_config/lang/stopwords_pt.txt

@@ -0,0 +1,253 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/portuguese/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
| NOTE: To use this file with StopFilterFactory, you must specify format="snowball"

| A Portuguese stop word list. Comments begin with vertical bar. Each stop
| word is at the start of a line.


| The following is a ranked list (commonest to rarest) of stopwords
| deriving from a large sample of text.

| Extra words have been added at the end.

de | of, from
a | the; to, at; her
o | the; him
que | who, that
e | and
do | de + o
da | de + a
em | in
um | a
para | for
| é from SER
com | with
não | not, no
uma | a
os | the; them
no | em + o
se | himself etc
na | em + a
por | for
mais | more
as | the; them
dos | de + os
como | as, like
mas | but
| foi from SER
ao | a + o
ele | he
das | de + as
| tem from TER
à | a + a
seu | his
sua | her
ou | or
| ser from SER
quando | when
muito | much
| há from HAV
nos | em + os; us
já | already, now
| está from EST
eu | I
também | also
só | only, just
pelo | per + o
pela | per + a
até | up to
isso | that
ela | she
entre | between
| era from SER
depois | after
sem | without
mesmo | same
aos | a + os
| ter from TER
seus | his
quem | whom
nas | em + as
me | me
esse | that
eles | they
| estão from EST
você | you
| tinha from TER
| foram from SER
essa | that
num | em + um
nem | nor
suas | her
meu | my
às | a + as
minha | my
| têm from TER
numa | em + uma
pelos | per + os
elas | they
| havia from HAV
| seja from SER
qual | which
| será from SER
nós | we
| tenho from TER
lhe | to him, her
deles | of them
essas | those
esses | those
pelas | per + as
este | this
| fosse from SER
dele | of him

| other words. There are many contractions such as naquele = em+aquele,
| mo = me+o, but they are rare.
| Indefinite article plural forms are also rare.

tu | thou
te | thee
vocês | you (plural)
vos | you
lhes | to them
meus | my
minhas
teu | thy
tua
teus
tuas
nosso | our
nossa
nossos
nossas

dela | of her
delas | of them

esta | this
estes | these
estas | these
aquele | that
aquela | that
aqueles | those
aquelas | those
isto | this
aquilo | that

| forms of estar, to be (not including the infinitive):
estou
está
estamos
estão
estive
esteve
estivemos
estiveram
estava
estávamos
estavam
estivera
estivéramos
esteja
estejamos
estejam
estivesse
estivéssemos
estivessem
estiver
estivermos
estiverem

| forms of haver, to have (not including the infinitive):
hei
havemos
hão
houve
houvemos
houveram
houvera
houvéramos
haja
hajamos
hajam
houvesse
houvéssemos
houvessem
houver
houvermos
houverem
houverei
houverá
houveremos
houverão
houveria
houveríamos
houveriam

| forms of ser, to be (not including the infinitive):
sou
somos
são
era
éramos
eram
fui
foi
fomos
foram
fora
fôramos
seja
sejamos
sejam
fosse
fôssemos
fossem
for
formos
forem
serei
será
seremos
serão
seria
seríamos
seriam

| forms of ter, to have (not including the infinitive):
tenho
tem
temos
têm
tinha
tínhamos
tinham
tive
teve
tivemos
tiveram
tivera
tivéramos
tenha
tenhamos
tenham
tivesse
tivéssemos
tivessem
tiver
tivermos
tiverem
terei
terá
teremos
terão
teria
teríamos
teriam

+ 233
- 0
solr_config/lang/stopwords_ro.txt

@@ -0,0 +1,233 @@
# This file was created by Jacques Savoy and is distributed under the BSD license.
# See http://members.unine.ch/jacques.savoy/clef/index.html.
# Also see http://www.opensource.org/licenses/bsd-license.html
acea
aceasta
această
aceea
acei
aceia
acel
acela
acele
acelea
acest
acesta
aceste
acestea
aceşti
aceştia
acolo
acum
ai
aia
aibă
aici
al
ăla
ale
alea
ălea
altceva
altcineva
am
ar
are
aşadar
asemenea
asta
ăsta
astăzi
astea
ăstea
ăştia
asupra
aţi
au
avea
avem
aveţi
azi
bine
bucur
bună
ca
căci
când
care
cărei
căror
cărui
cât
câte
câţi
către
câtva
ce
cel
ceva
chiar
cînd
cine
cineva
cît
cîte
cîţi
cîtva
contra
cu
cum
cumva
curând
curînd
da
dacă
dar
datorită
de
deci
deja
deoarece
departe
deşi
din
dinaintea
dintr
dintre
drept
după
ea
ei
el
ele
eram
este
eşti
eu
face
fără
fi
fie
fiecare
fii
fim
fiţi
iar
ieri
îi
îl
îmi
împotriva
în
înainte
înaintea
încât
încît
încotro
între
întrucât
întrucît
îţi
la
lângă
le
li
lîngă
lor
lui
mâine
mea
mei
mele
mereu
meu
mi
mine
mult
multă
mulţi
ne
nicăieri
nici
nimeni
nişte
noastră
noastre
noi
noştri
nostru
nu
ori
oricând
oricare
oricât
orice
oricînd
oricine
oricît
oricum
oriunde
până
pe
pentru
peste
pînă
poate
pot
prea
prima
primul
prin
printr
sa
săi
sale
sau
său
se
şi
sînt
sîntem
sînteţi
spre
sub
sunt
suntem
sunteţi
ta
tăi
tale
tău
te
ţi
ţie
tine
toată
toate
tot
toţi
totuşi
tu
un
una
unde
undeva
unei
unele
uneori
unor
vi
voastră
voastre
voi
voştri
vostru
vouă
vreo
vreun

+ 243
- 0
solr_config/lang/stopwords_ru.txt

@@ -0,0 +1,243 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/russian/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
| NOTE: To use this file with StopFilterFactory, you must specify format="snowball"

| A Russian stop word list. Comments begin with vertical bar. Each stop
| word is at the start of a line.

| This is a ranked list (commonest to rarest) of stopwords derived from
| a large text sample.

| letter `ё' is translated to `е'.

и | and
в | in/into
во | alternative form
не | not
что | what/that
он | he
на | on/onto
я | i
с | from
со | alternative form
как | how
а | milder form of `no' (but)
то | conjunction and form of `that'
все | all
она | she
так | so, thus
его | him
но | but
да | yes/and
ты | thou
к | towards, by
у | around, chez
же | intensifier particle
вы | you
за | beyond, behind
бы | conditional/subj. particle
по | up to, along
только | only
ее | her
мне | to me
было | it was
вот | here is/are, particle
от | away from
меня | me
еще | still, yet, more
нет | no, there isn't/aren't
о | about
из | out of
ему | to him
теперь | now
когда | when
даже | even
ну | so, well
вдруг | suddenly
ли | interrogative particle
если | if
уже | already, but homonym of `narrower'
или | or
ни | neither
быть | to be
был | he was
него | prepositional form of его
до | up to
вас | you accusative
нибудь | indef. suffix preceded by hyphen
опять | again
уж | already, but homonym of `adder'
вам | to you
сказал | he said
ведь | particle `after all'
там | there
потом | then
себя | oneself
ничего | nothing
ей | to her
может | usually with `быть' as `maybe'
они | they
тут | here
где | where
есть | there is/are
надо | got to, must
ней | prepositional form of ей
для | for
мы | we
тебя | thee
их | them, their
чем | than
была | she was
сам | self
чтоб | in order to
без | without
будто | as if
человек | man, person, one
чего | genitive form of `what'
раз | once
тоже | also
себе | to oneself
под | beneath
жизнь | life
будет | will be
ж | short form of intensifier particle `же'
тогда | then
кто | who
этот | this
говорил | was saying
того | genitive form of `that'
потому | for that reason
этого | genitive form of `this'
какой | which
совсем | altogether
ним | prepositional form of `его', `они'
здесь | here
этом | prepositional form of `этот'
один | one
почти | almost
мой | my
тем | instrumental/dative plural of `тот', `то'
чтобы | full form of `in order that'
нее | her (acc.)
кажется | it seems
сейчас | now
были | they were
куда | where to
зачем | why
сказать | to say
всех | all (acc., gen. preposn. plural)
никогда | never
сегодня | today
можно | possible, one can
при | by
наконец | finally
два | two
об | alternative form of `о', about
другой | another
хоть | even
после | after
над | above
больше | more
тот | that one (masc.)
через | across, in
эти | these
нас | us
про | about
всего | in all, only, of all
них | prepositional form of `они' (they)
какая | which, feminine
много | lots
разве | interrogative particle
сказала | she said
три | three
эту | this, acc. fem. sing.
моя | my, feminine
впрочем | moreover, besides
хорошо | good
свою | one's own, acc. fem. sing.
этой | oblique form of `эта', fem. `this'
перед | in front of
иногда | sometimes
лучше | better
чуть | a little
том | preposn. form of `that one'
нельзя | one must not
такой | such a one
им | to them
более | more
всегда | always
конечно | of course
всю | acc. fem. sing of `all'
между | between


| b: some paradigms
|
| personal pronouns
|
| я меня мне мной [мною]
| ты тебя тебе тобой [тобою]
| он его ему им [него, нему, ним]
| она ее эи ею [нее, нэи, нею]
| оно его ему им [него, нему, ним]
|
| мы нас нам нами
| вы вас вам вами
| они их им ими [них, ним, ними]
|
| себя себе собой [собою]
|
| demonstrative pronouns: этот (this), тот (that)
|
| этот эта это эти
| этого эты это эти
| этого этой этого этих
| этому этой этому этим
| этим этой этим [этою] этими
| этом этой этом этих
|
| тот та то те
| того ту то те
| того той того тех
| тому той тому тем
| тем той тем [тою] теми
| том той том тех
|
| determinative pronouns
|
| (a) весь (all)
|
| весь вся все все
| всего всю все все
| всего всей всего всех
| всему всей всему всем
| всем всей всем [всею] всеми
| всем всей всем всех
|
| (b) сам (himself etc)
|
| сам сама само сами
| самого саму само самих
| самого самой самого самих
| самому самой самому самим
| самим самой самим [самою] самими
| самом самой самом самих
|
| stems of verbs `to be', `to have', `to do' and modal
|
| быть бы буд быв есть суть
| име
| дел
| мог мож мочь
| уме
| хоч хот
| долж
| можн
| нужн
| нельзя


+ 133
- 0
solr_config/lang/stopwords_sv.txt

@@ -0,0 +1,133 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/swedish/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
| NOTE: To use this file with StopFilterFactory, you must specify format="snowball"

| A Swedish stop word list. Comments begin with vertical bar. Each stop
| word is at the start of a line.

| This is a ranked list (commonest to rarest) of stopwords derived from
| a large text sample.

| Swedish stop words occasionally exhibit homonym clashes. For example
| så = so, but also seed. These are indicated clearly below.

och | and
det | it, this/that
att | to (with infinitive)
i | in, at
en | a
jag | I
hon | she
som | who, that
han | he
på | on
den | it, this/that
med | with
var | where, each
sig | him(self) etc
för | for
så | so (also: seed)
till | to
är | is
men | but
ett | a
om | if; around, about
hade | had
de | they, these/those
av | of
icke | not, no
mig | me
du | you
henne | her
då | then, when
sin | his
nu | now
har | have
inte | inte någon = no one
hans | his
honom | him
skulle | 'sake'
hennes | her
där | there
min | my
man | one (pronoun)
ej | nor
vid | at, by, on (also: vast)
kunde | could
något | some etc
från | from, off
ut | out
när | when
efter | after, behind
upp | up
vi | we
dem | them
vara | be
vad | what
över | over
än | than
dig | you
kan | can
sina | his
här | here
ha | have
mot | towards
alla | all
under | under (also: wonder)
någon | some etc
eller | or (else)
allt | all
mycket | much
sedan | since
ju | why
denna | this/that
själv | myself, yourself etc
detta | this/that
åt | to
utan | without
varit | was
hur | how
ingen | no
mitt | my
ni | you
bli | to be, become
blev | from bli
oss | us
din | thy
dessa | these/those
några | some etc
deras | their
blir | from bli
mina | my
samma | (the) same
vilken | who, that
er | you, your
sådan | such a
vår | our
blivit | from bli
dess | its
inom | within
mellan | between
sådant | such a
varför | why
varje | each
vilka | who, that
ditt | thy
vem | who
vilket | who, that
sitta | his
sådana | such a
vart | each
dina | thy
vars | whose
vårt | our
våra | our
ert | your
era | your
vilkas | whose


+ 119
- 0
solr_config/lang/stopwords_th.txt

@@ -0,0 +1,119 @@
# Thai stopwords from:
# "Opinion Detection in Thai Political News Columns
# Based on Subjectivity Analysis"
# Khampol Sukhum, Supot Nitsuwat, and Choochart Haruechaiyasak
ไว้
ไม่
ไป
ได้
ให้
ใน
โดย
แห่ง
แล้ว
และ
แรก
แบบ
แต่
เอง
เห็น
เลย
เริ่ม
เรา
เมื่อ
เพื่อ
เพราะ
เป็นการ
เป็น
เปิดเผย
เปิด
เนื่องจาก
เดียวกัน
เดียว
เช่น
เฉพาะ
เคย
เข้า
เขา
อีก
อาจ
อะไร
ออก
อย่าง
อยู่
อยาก
หาก
หลาย
หลังจาก
หลัง
หรือ
หนึ่ง
ส่วน
ส่ง
สุด
สําหรับ
ว่า
วัน
ลง
ร่วม
ราย
รับ
ระหว่าง
รวม
ยัง
มี
มาก
มา
พร้อม
พบ
ผ่าน
ผล
บาง
น่า
นี้
นํา
นั้น
นัก
นอกจาก
ทุก
ที่สุด
ที่
ทําให้
ทํา
ทาง
ทั้งนี้
ทั้ง
ถ้า
ถูก
ถึง
ต้อง
ต่างๆ
ต่าง
ต่อ
ตาม
ตั้งแต่
ตั้ง
ด้าน
ด้วย
ดัง
ซึ่ง
ช่วง
จึง
จาก
จัด
จะ
คือ
ความ
ครั้ง
คง
ขึ้น
ของ
ขอ
ขณะ
ก่อน
ก็
การ
กับ
กัน
กว่า
กล่าว

+ 212
- 0
solr_config/lang/stopwords_tr.txt

@@ -0,0 +1,212 @@
# Turkish stopwords from LUCENE-559
# merged with the list from "Information Retrieval on Turkish Texts"
# (http://www.users.muohio.edu/canf/papers/JASIST2008offPrint.pdf)
acaba
altmış
altı
ama
ancak
arada
aslında
ayrıca
bana
bazı
belki
ben
benden
beni
benim
beri
beş
bile
bin
bir
birçok
biri
birkaç
birkez
birşey
birşeyi
biz
bize
bizden
bizi
bizim
böyle
böylece
bu
buna
bunda
bundan
bunlar
bunları
bunların
bunu
bunun
burada
çok
çünkü
da
daha
dahi
de
defa
değil
diğer
diye
doksan
dokuz
dolayı
dolayısıyla
dört
edecek
eden
ederek
edilecek
ediliyor
edilmesi
ediyor
eğer
elli
en
etmesi
etti
ettiği
ettiğini
gibi
göre
halen
hangi
hatta
hem
henüz
hep
hepsi
her
herhangi
herkesin
hiç
hiçbir
için
iki
ile
ilgili
ise
işte
itibaren
itibariyle
kadar
karşın
katrilyon
kendi
kendilerine
kendini
kendisi
kendisine
kendisini
kez
ki
kim
kimden
kime
kimi
kimse
kırk
milyar
milyon
mu
nasıl
ne
neden
nedenle
nerde
nerede
nereye
niye
niçin
o
olan
olarak
oldu
olduğu
olduğunu
olduklarını
olmadı
olmadığı
olmak
olması
olmayan
olmaz
olsa
olsun
olup
olur
olursa
oluyor
on
ona
ondan
onlar
onlardan
onları
onların
onu
onun
otuz
oysa
öyle
pek
rağmen
sadece
sanki
sekiz
seksen
sen
senden
seni
senin
siz
sizden
sizi
sizin
şey
şeyden
şeyi
şeyler
şöyle
şu
şuna
şunda
şundan
şunları
şunu
tarafından
trilyon
tüm
üç
üzere
var
vardı
ve
veya
ya
yani
yapacak
yapılan
yapılması
yapıyor
yapmak
yaptı
yaptığı
yaptığını
yaptıkları
yedi
yerine
yetmiş
yine
yirmi
yoksa
yüz
zaten

+ 29
- 0
solr_config/lang/userdict_ja.txt

@@ -0,0 +1,29 @@
#
# This is a sample user dictionary for Kuromoji (JapaneseTokenizer)
#
# Add entries to this file in order to override the statistical model in terms
# of segmentation, readings and part-of-speech tags. Notice that entries do
# not have weights since they are always used when found. This is by-design
# in order to maximize ease-of-use.
#
# Entries are defined using the following CSV format:
# <text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>
#
# Notice that a single half-width space separates tokens and readings, and
# that the number of tokens and readings must match exactly.
#
# Also notice that the behavior of multiple entries with the same <text> is undefined.
#
# Whitespace only lines are ignored. Comments are not allowed on entry lines.
#

# Custom segmentation for kanji compounds
日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞

# Custom segmentation for compound katakana
トートバッグ,トート バッグ,トート バッグ,かずカナ名詞
ショルダーバッグ,ショルダー バッグ,ショルダー バッグ,かずカナ名詞

# Custom reading for former sumo wrestler
朝青龍,朝青龍,アサショウリュウ,カスタム人名
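
The entry format described in the header (four comma-separated fields, with tokens and readings separated by single half-width spaces and matching in count) can be checked with a short Python sketch. The names `UserDictEntry` and `parse_userdict_line` are hypothetical helpers for illustration. This is not how Kuromoji reads the file, only a validator written from the rules stated above.

```python
from typing import List, NamedTuple, Optional


class UserDictEntry(NamedTuple):
    text: str
    tokens: List[str]
    readings: List[str]
    pos: str


def parse_userdict_line(line: str) -> Optional[UserDictEntry]:
    """Parse one user dictionary line; return None for blank or comment lines."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    fields = stripped.split(",")
    if len(fields) != 4:
        raise ValueError(f"expected 4 comma-separated fields, got {len(fields)}")
    text, tokens, readings, pos = fields
    # A single half-width space separates tokens and readings.
    token_list = tokens.split(" ")
    reading_list = readings.split(" ")
    if len(token_list) != len(reading_list):
        raise ValueError("number of tokens and readings must match")
    return UserDictEntry(text, token_list, reading_list, pos)


entry = parse_userdict_line("日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞")
print(entry.tokens)
print(entry.readings)
```

Running a check like this over the file before deploying it could catch a mismatched token and reading count early, rather than at Solr core load time.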

+ 34
- 0
solr_config/params.json

@@ -0,0 +1,34 @@
{"params":{
"query":{
"defType":"edismax",
"q.alt":"*:*",
"rows":"10",
"fl":"*,score",
"":{"v":0}},
"facets":{
"facet":"on",
"facet.mincount":"1",
"f.doc_type.facet.mincount":"0",
"facet.field":["text_shingles","{!ex=type}doc_type", "language"],
"f.text_shingles.facet.limit":10,
"facet.query":"{!ex=type key=all_types}*:*",
"f.doc_type.facet.missing":true,
"":{"v":0}},
"browse":{
"type_fq":"{!field f=doc_type v=$type}",
"hl":"on",
"hl.fl":"content",
"v.locale":"${locale}",
"debug":"true",
"hl.simple.pre":"HL_START",
"hl.simple.post":"HL_END",
"echoParams": "explicit",
"_appends_": {
"fq": "{!switch v=$type tag=type case='*:*' case.all='*:*' case.unknown='-doc_type:[* TO *]' default=$type_fq}"
},
"":{"v":0}},
"velocity":{
"wt":"velocity",
"v.template":"browse",
"v.layout":"layout",
"":{"v":0}}}}
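
The param sets defined above (`query`, `facets`, `browse`, `velocity`) can be referenced by name at request time through Solr's `useParams` request parameter. A minimal Python sketch of building such a request URL follows. The host, port, and the core name `patents` are assumptions for illustration and are not confirmed by this commit.

```python
from urllib.parse import urlencode


def select_url(base_url, core, query, param_sets):
    """Build a /select URL that asks Solr to apply the named param sets."""
    params = {
        "q": query,
        # useParams takes a comma-separated list of param set names.
        "useParams": ",".join(param_sets),
    }
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Hypothetical host and core name.
print(select_url("http://localhost:8983", "patents", "solar panel", ["query", "facets"]))
```

Keeping defaults such as `defType` and `rows` in params.json means client code only has to send the query text and the names of the sets it wants.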

+ 21
- 0
solr_config/protwords.txt

@@ -0,0 +1,21 @@
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#-----------------------------------------------------------------------
# Use a protected word file to protect against the stemmer reducing two
# unrelated words to the same base word.

# Some non-words that normally won't be encountered,
# just to test that they won't be stemmed.
dontstems
zwhacky
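
The effect of a protected word file can be illustrated with a small Python sketch: tokens found in the protected set bypass the stemmer entirely. The `toy_stem` function is a deliberately naive suffix stripper invented for this example. It is not the Porter algorithm and not Solr's KeywordMarkerFilter, only a stand-in to show the control flow.

```python
def toy_stem(token):
    """Toy suffix stripper for illustration only; not the Porter algorithm."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters of stem so short words survive.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def stem_with_protection(tokens, protected):
    """Stem every token except those listed in the protected set."""
    return [t if t in protected else toy_stem(t) for t in tokens]


protected = {"dontstems", "zwhacky"}
print(stem_with_protection(["patents", "dontstems"], protected))
```

Here `patents` loses its trailing `s`, while `dontstems` is returned untouched because it appears in the protected set.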


+ 530
- 0
solr_config/schema.xml

@@ -0,0 +1,530 @@
<?xml version="1.0" encoding="UTF-8"?>
<!-- Solr managed schema - automatically generated - DO NOT EDIT -->
<schema name="example-data-driven-schema" version="1.6">
<uniqueKey>id</uniqueKey>
<fieldType name="ancestor_path" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
</analyzer>
</fieldType>
<fieldType name="binary" class="solr.BinaryField"/>
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
<fieldType name="booleans" class="solr.BoolField" sortMissingLast="true" multiValued="true"/>
<fieldType name="currency" class="solr.CurrencyFieldType" amountLongSuffix="_l_ns" codeStrSuffix="_s_ns" defaultCurrency="USD" currencyConfig="currency.xml" />
<fieldType name="descendent_path" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
</fieldType>
<fieldType name="ignored" class="solr.StrField" indexed="false" stored="false" multiValued="true"/>
<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType" geo="true" maxDistErr="0.001" distErrPct="0.025" distanceUnits="kilometers"/>
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="phonetic_en" class="solr.TextField" indexed="true" stored="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
</analyzer>
</fieldType>
<fieldType name="pdate" class="solr.DatePointField" docValues="true"/>
<fieldType name="pdates" class="solr.DatePointField" docValues="true" multiValued="true"/>
<fieldType name="pdouble" class="solr.DoublePointField" docValues="true"/>
<fieldType name="pdoubles" class="solr.DoublePointField" docValues="true" multiValued="true"/>
<fieldType name="pfloat" class="solr.FloatPointField" docValues="true"/>
<fieldType name="pfloats" class="solr.FloatPointField" docValues="true" multiValued="true"/>
<fieldType name="pint" class="solr.IntPointField" docValues="true"/>
<fieldType name="pints" class="solr.IntPointField" docValues="true" multiValued="true"/>
<fieldType name="plong" class="solr.LongPointField" docValues="true"/>
<fieldType name="plongs" class="solr.LongPointField" docValues="true" multiValued="true"/>
<fieldType name="point" class="solr.PointType" subFieldSuffix="_d" dimension="2"/>
<fieldType name="random" class="solr.RandomSortField" indexed="true"/>
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
<fieldType name="strings" class="solr.StrField" sortMissingLast="true" multiValued="true"/>
<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_ar.txt" ignoreCase="true"/>
<filter class="solr.ArabicNormalizationFilterFactory"/>
<filter class="solr.ArabicStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_bg" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_bg.txt" ignoreCase="true"/>
<filter class="solr.BulgarianStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_ca" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ElisionFilterFactory" articles="lang/contractions_ca.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_ca.txt" ignoreCase="true"/>
<filter class="solr.SnowballPorterFilterFactory" language="Catalan"/>
</analyzer>
</fieldType>
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.CJKBigramFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_cz" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_cz.txt" ignoreCase="true"/>
<filter class="solr.CzechStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_da" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_da.txt" ignoreCase="true"/>
<filter class="solr.SnowballPorterFilterFactory" language="Danish"/>
</analyzer>
</fieldType>
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_de.txt" ignoreCase="true"/>
<filter class="solr.GermanNormalizationFilterFactory"/>
<filter class="solr.GermanLightStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_el" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.GreekLowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_el.txt" ignoreCase="false"/>
<filter class="solr.GreekStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_en_splitting" class="solr.TextField" autoGeneratePhraseQueries="true" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="1" generateWordParts="1" catenateAll="0" catenateWords="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.FlattenGraphFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="0" generateNumberParts="1" splitOnCaseChange="1" generateWordParts="1" catenateAll="0" catenateWords="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_en_splitting_tight" class="solr.TextField" autoGeneratePhraseQueries="true" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" expand="false" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="1" generateNumberParts="0" generateWordParts="0" catenateAll="0" catenateWords="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.FlattenGraphFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" expand="false" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="1" generateNumberParts="0" generateWordParts="0" catenateAll="0" catenateWords="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_es.txt" ignoreCase="true"/>
<filter class="solr.SpanishLightStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_eu" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_eu.txt" ignoreCase="true"/>
<filter class="solr.SnowballPorterFilterFactory" language="Basque"/>
</analyzer>
</fieldType>
<fieldType name="text_fa" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<charFilter class="solr.PersianCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ArabicNormalizationFilterFactory"/>
<filter class="solr.PersianNormalizationFilterFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_fa.txt" ignoreCase="true"/>
</analyzer>
</fieldType>
<fieldType name="text_fi" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_fi.txt" ignoreCase="true"/>
<filter class="solr.SnowballPorterFilterFactory" language="Finnish"/>
</analyzer>
</fieldType>
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ElisionFilterFactory" articles="lang/contractions_fr.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_fr.txt" ignoreCase="true"/>
<filter class="solr.FrenchLightStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_ga" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ElisionFilterFactory" articles="lang/contractions_ga.txt" ignoreCase="true"/>
<filter class="solr.StopFilterFactory" words="lang/hyphenations_ga.txt" ignoreCase="true"/>
<filter class="solr.IrishLowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_ga.txt" ignoreCase="true"/>
<filter class="solr.SnowballPorterFilterFactory" language="Irish"/>
</analyzer>
</fieldType>
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_general_rev" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ReversedWildcardFilterFactory" maxPosQuestion="2" maxFractionAsterisk="0.33" maxPosAsterisk="3" withOriginal="true"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_gl" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_gl.txt" ignoreCase="true"/>
<filter class="solr.GalicianStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_hi" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.IndicNormalizationFilterFactory"/>
<filter class="solr.HindiNormalizationFilterFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_hi.txt" ignoreCase="true"/>
<filter class="solr.HindiStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_hu" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_hu.txt" ignoreCase="true"/>
<filter class="solr.SnowballPorterFilterFactory" language="Hungarian"/>
</analyzer>
</fieldType>
<fieldType name="text_hy" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_hy.txt" ignoreCase="true"/>
<filter class="solr.SnowballPorterFilterFactory" language="Armenian"/>
</analyzer>
</fieldType>
<fieldType name="text_id" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_id.txt" ignoreCase="true"/>
<filter class="solr.IndonesianStemFilterFactory" stemDerivational="true"/>
</analyzer>
</fieldType>
<fieldType name="text_it" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ElisionFilterFactory" articles="lang/contractions_it.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_it.txt" ignoreCase="true"/>
<filter class="solr.ItalianLightStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_ja" class="solr.TextField" autoGeneratePhraseQueries="false" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
<filter class="solr.JapaneseBaseFormFilterFactory"/>
<filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_ja.txt" ignoreCase="true"/>
<filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_ko" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KoreanTokenizerFactory" decompoundMode="discard" outputUnknownUnigrams="false"/>
<filter class="solr.KoreanPartOfSpeechStopFilterFactory" />
<filter class="solr.KoreanReadingFormFilterFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
<fieldType name="text_lv" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_lv.txt" ignoreCase="true"/>
<filter class="solr.LatvianStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_nl" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_nl.txt" ignoreCase="true"/>
<filter class="solr.StemmerOverrideFilterFactory" dictionary="lang/stemdict_nl.txt" ignoreCase="false"/>
<filter class="solr.SnowballPorterFilterFactory" language="Dutch"/>
</analyzer>
</fieldType>
<fieldType name="text_no" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_no.txt" ignoreCase="true"/>
<filter class="solr.SnowballPorterFilterFactory" language="Norwegian"/>
</analyzer>
</fieldType>
<fieldType name="text_pt" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_pt.txt" ignoreCase="true"/>
<filter class="solr.PortugueseLightStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_ro" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_ro.txt" ignoreCase="true"/>
<filter class="solr.SnowballPorterFilterFactory" language="Romanian"/>
</analyzer>
</fieldType>
<fieldType name="text_ru" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_ru.txt" ignoreCase="true"/>
<filter class="solr.SnowballPorterFilterFactory" language="Russian"/>
</analyzer>
</fieldType>
<fieldType name="text_sv" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_sv.txt" ignoreCase="true"/>
<filter class="solr.SnowballPorterFilterFactory" language="Swedish"/>
</analyzer>
</fieldType>
<fieldType name="text_th" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.ThaiTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_th.txt" ignoreCase="true"/>
</analyzer>
</fieldType>
<fieldType name="text_tr" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.TurkishLowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_tr.txt" ignoreCase="false"/>
<filter class="solr.SnowballPorterFilterFactory" language="Turkish"/>
</analyzer>
</fieldType>
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>

<fieldType name="text_email_url" class="solr.TextField">
<analyzer>
<tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
<filter class="solr.TypeTokenFilterFactory" types="email_url_types.txt" useWhitelist="true"/>
</analyzer>
</fieldType>

<fieldType name="text_shingles" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<!-- <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="false" /> -->
<filter class="solr.LengthFilterFactory" min="2" max="18"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="(^[^a-z]+$)" replacement="" replace="all"/>
<filter class="solr.ShingleFilterFactory" minShingleSize="3" maxShingleSize="3"
outputUnigrams="false" outputUnigramsIfNoShingles="false" tokenSeparator=" " fillerToken="*"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="(.*[\*].*)" replacement=""/>
<filter class="solr.TrimFilterFactory"/>

<!-- PRFF could have removed everything down to an empty string, remove if so -->
<filter class="solr.LengthFilterFactory" min="1" max="100"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>

<field name="id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
<field name="_version_" type="plong" indexed="true" stored="true"/>
<field name="content_type" type="string" indexed="true" stored="true"/>
<field name="doc_type" type="string" indexed="true" stored="true"/>
<field name="title" type="string" indexed="true" stored="true"/>
<field name="language" type="string" indexed="true" stored="true"/>
<field name="content" type="text_general" multiValued="false" indexed="true" stored="true"/>
<field name="text_shingles" type="text_shingles" indexed="true" stored="false"/>
<field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false"/>

<dynamicField name="*_txt_en_split_tight" type="text_en_splitting_tight" indexed="true" stored="true"/>
<dynamicField name="*_descendent_path" type="descendent_path" indexed="true" stored="true"/>
<dynamicField name="*_ancestor_path" type="ancestor_path" indexed="true" stored="true"/>
<dynamicField name="*_txt_en_split" type="text_en_splitting" indexed="true" stored="true"/>
<dynamicField name="*_coordinate" type="pdouble" indexed="true" stored="false"/>
<dynamicField name="ignored_*" type="ignored" multiValued="true"/>
<dynamicField name="*_txt_rev" type="text_general_rev" indexed="true" stored="true"/>
<dynamicField name="*_phon_en" type="phonetic_en" indexed="true" stored="true"/>
<dynamicField name="*_s_lower" type="lowercase" indexed="true" stored="true"/>
<dynamicField name="*_txt_cjk" type="text_cjk" indexed="true" stored="true"/>
<dynamicField name="random_*" type="random"/>
<dynamicField name="*_txt_en" type="text_en" indexed="true" stored="true"/>
<dynamicField name="*_txt_ar" type="text_ar" indexed="true" stored="true"/>
<dynamicField name="*_txt_bg" type="text_bg" indexed="true" stored="true"/>
<dynamicField name="*_txt_ca" type="text_ca" indexed="true" stored="true"/>
<dynamicField name="*_txt_cz" type="text_cz" indexed="true" stored="true"/>
<dynamicField name="*_txt_da" type="text_da" indexed="true" stored="true"/>
<dynamicField name="*_txt_de" type="text_de" indexed="true" stored="true"/>
<dynamicField name="*_txt_el" type="text_el" indexed="true" stored="true"/>
<dynamicField name="*_txt_es" type="text_es" indexed="true" stored="true"/>
<dynamicField name="*_txt_eu" type="text_eu" indexed="true" stored="true"/>
<dynamicField name="*_txt_fa" type="text_fa" indexed="true" stored="true"/>
<dynamicField name="*_txt_fi" type="text_fi" indexed="true" stored="true"/>
<dynamicField name="*_txt_fr" type="text_fr" indexed="true" stored="true"/>
<dynamicField name="*_txt_ga" type="text_ga" indexed="true" stored="true"/>
<dynamicField name="*_txt_gl" type="text_gl" indexed="true" stored="true"/>
<dynamicField name="*_txt_hi" type="text_hi" indexed="true" stored="true"/>
<dynamicField name="*_txt_hu" type="text_hu" indexed="true" stored="true"/>
<dynamicField name="*_txt_hy" type="text_hy" indexed="true" stored="true"/>
<dynamicField name="*_txt_id" type="text_id" indexed="true" stored="true"/>
<dynamicField name="*_txt_it" type="text_it" indexed="true" stored="true"/>
<dynamicField name="*_txt_ja" type="text_ja" indexed="true" stored="true"/>
<dynamicField name="*_txt_ko" type="text_ko" indexed="true" stored="true"/>
<dynamicField name="*_txt_lv" type="text_lv" indexed="true" stored="true"/>
<dynamicField name="*_txt_nl" type="text_nl" indexed="true" stored="true"/>
<dynamicField name="*_txt_no" type="text_no" indexed="true" stored="true"/>
<dynamicField name="*_txt_pt" type="text_pt" indexed="true" stored="true"/>
<dynamicField name="*_txt_ro" type="text_ro" indexed="true" stored="true"/>
<dynamicField name="*_txt_ru" type="text_ru" indexed="true" stored="true"/>
<dynamicField name="*_txt_sv" type="text_sv" indexed="true" stored="true"/>
<dynamicField name="*_txt_th" type="text_th" indexed="true" stored="true"/>
<dynamicField name="*_txt_tr" type="text_tr" indexed="true" stored="true"/>
<dynamicField name="*_point" type="point" indexed="true" stored="true"/>
<dynamicField name="*_srpt" type="location_rpt" indexed="true" stored="true"/>
<dynamicField name="attr_*" type="text_general" multiValued="true" indexed="true" stored="true"/>
<dynamicField name="*_l_ns" type="plong" indexed="true" stored="false"/>
<dynamicField name="*_s_ns" type="string" indexed="true" stored="false"/>
<dynamicField name="*_txt" type="text_general" indexed="true" stored="true"/>
<dynamicField name="*_dts" type="pdate" multiValued="true" indexed="true" stored="true"/>
<dynamicField name="*_is" type="pints" indexed="true" stored="true"/>
<dynamicField name="*_ss" type="strings" indexed="true" stored="true"/>
<dynamicField name="*_ls" type="plongs" indexed="true" stored="true"/>
<dynamicField name="*_bs" type="booleans" indexed="true" stored="true"/>
<dynamicField name="*_fs" type="pfloats" indexed="true" stored="true"/>
<dynamicField name="*_ds" type="pdoubles" indexed="true" stored="true"/>
<dynamicField name="*_dt" type="pdate" indexed="true" stored="true"/>
<dynamicField name="*_ws" type="text_ws" indexed="true" stored="true"/>
<dynamicField name="*_i" type="pint" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_l" type="plong" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text_general" indexed="true" stored="true"/>
<dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
<dynamicField name="*_f" type="pfloat" indexed="true" stored="true"/>
<dynamicField name="*_d" type="pdouble" indexed="true" stored="true"/>
<dynamicField name="*_p" type="location" indexed="true" stored="true"/>
<dynamicField name="*_c" type="currency" indexed="true" stored="true"/>

<copyField source="content" dest="text_shingles"/>
<copyField source="*" dest="_text_"/>

<!-- ADDED BY SIMON BOWIE 2022-04-04 -->
<copyField source="content" dest="year"/>
<field name="year" type="year" indexed="true" stored="true"/>

<fieldType name="year" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="=D[^\s]*\s[^\s]*\s[^\s]*\s[^\s]*\s(\d{4})" group="1" />
</analyzer>
</fieldType>
<!-- END -->

</schema>
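
A rough illustration of what the `year` field type is likely to extract, given that `PatternTokenizerFactory` with `group="1"` emits capture group 1 of each match as a token. The sketch is plain JavaScript (Node.js), not Solr itself. The function name `extractYears` and the sample input are hypothetical: the real layout of `content` is not shown in this commit.

```javascript
// Same pattern as the "year" fieldType, written as a JavaScript regex.
// The global flag mirrors the tokenizer scanning the whole field value.
const YEAR_PATTERN = /=D[^\s]*\s[^\s]*\s[^\s]*\s[^\s]*\s(\d{4})/g;

// Return capture group 1 of every match, as group="1" does in the schema.
function extractYears(content) {
  return Array.from(content.matchAll(YEAR_PATTERN), (m) => m[1]);
}

// Hypothetical input; the actual document format is an assumption.
const sample = "=D Mon Apr 04 2022 some other text";
console.log(extractYears(sample));
```

If the pattern never matches, the field simply receives no tokens, so documents without a recognisable date would not appear under any `year` value.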

+ 1368
- 0
solr_config/solrconfig.xml
File diff suppressed because it is too large


+ 14
- 0
solr_config/stopwords.txt

@@ -0,0 +1,14 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

+ 29
- 0
solr_config/synonyms.txt

@@ -0,0 +1,29 @@
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#-----------------------------------------------------------------------
#some test synonym mappings unlikely to appear in real input text
aaafoo => aaabar
bbbfoo => bbbfoo bbbbar
cccfoo => cccbar cccbaz
fooaaa,baraaa,bazaaa

# Some synonym groups specific to this example
GB,gib,gigabyte,gigabytes
MB,mib,megabyte,megabytes
Television, Televisions, TV, TVs
#notice we use "gib" instead of "GiB" so any WordDelimiterGraphFilter coming
#after us won't split it into two words.

# Synonym mappings can be used for spelling correction too
pixima => pixma


+ 115
- 0
solr_config/update-script.js

@@ -0,0 +1,115 @@
function get_class(name) {
var clazz;
try {
// Java8 Nashorn
clazz = eval("Java.type(name).class");
} catch(e) {
// Java7 Rhino
clazz = eval("Packages."+name);
}

return clazz;
}

function processAdd(cmd) {

var doc = cmd.solrDoc; // org.apache.solr.common.SolrInputDocument
var id = doc.getFieldValue("id");
logger.info("update-script#processAdd: id=" + id);

// Map the file's content_type value to a user-friendly doc_type value,
// so that types such as image/jpeg and image/tiff
// both fall under a single "image" facet

var ct = doc.getFieldValue("content_type");
if (ct) {
// strip off semicolon onward
var semicolon_index = ct.indexOf(';');
if (semicolon_index != -1) {
ct = ct.substring(0,semicolon_index);
}
// and split type/subtype
var ct_type = ct.substring(0,ct.indexOf('/'));
var ct_subtype = ct.substring(ct.indexOf('/')+1);

var doc_type;
switch(true) {
case /^application\/rtf/.test(ct) || /wordprocessing/.test(ct):
doc_type = "doc";
break;

case /html/.test(ct):
doc_type = "html";
break;

case /^image\/.*/.test(ct):
doc_type = "image";
break;

case /presentation|powerpoint/.test(ct):
doc_type = "presentation";
break;

case /spreadsheet|excel/.test(ct):
doc_type = "spreadsheet";
break;

case /^application\/pdf/.test(ct):
doc_type = "pdf";
break;

case /^text\/plain/.test(ct):
doc_type = "text";
break;

default:
break;
}

// Skip the type/subtype fields when the content type has no slash
if(doc_type) { doc.setField("doc_type", doc_type); }
if (ct.indexOf('/') != -1) {
doc.setField("content_type_type_s", ct_type);
doc.setField("content_type_subtype_s", ct_subtype);
}
}

var content = doc.getFieldValue("content");
if (!content) {
return; //No content found, so we are done here
}

var analyzer =
req.getCore().getLatestSchema()
.getFieldTypeByName("text_email_url")
.getIndexAnalyzer();

var token_stream =
analyzer.tokenStream("content", content);
var term_att = token_stream.getAttribute(get_class("org.apache.lucene.analysis.tokenattributes.CharTermAttribute"));
var type_att = token_stream.getAttribute(get_class("org.apache.lucene.analysis.tokenattributes.TypeAttribute"));
token_stream.reset();
while (token_stream.incrementToken()) {
doc.addField(type_att.type().replace(/\<|\>/g,'').toLowerCase()+"_ss", term_att.toString());
}
token_stream.end();
token_stream.close();
}

function processDelete(cmd) {
// no-op
}

function processMergeIndexes(cmd) {
// no-op
}

function processCommit(cmd) {
// no-op
}

function processRollback(cmd) {
// no-op
}

function finish() {
// no-op
}
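
The content-type handling in `processAdd` can be sketched as a standalone function. This is an illustrative version for a modern runtime such as Node.js, not the Nashorn script itself. `classifyContentType` and its return shape are hypothetical names, and the treatment of values without a slash is only one possible choice.

```javascript
// Split a MIME type into type/subtype and map it to a coarse facet label.
// docType is undefined when no rule matches.
function classifyContentType(rawContentType) {
  // Drop parameters such as "; charset=UTF-8".
  const ct = String(rawContentType).split(";")[0];

  // Treat a value with no slash as a bare type with an empty subtype.
  const slash = ct.indexOf("/");
  const type = slash === -1 ? ct : ct.substring(0, slash);
  const subtype = slash === -1 ? "" : ct.substring(slash + 1);

  // Ordered rules: the first matching pattern wins, as in the switch.
  const rules = [
    [/^application\/rtf|wordprocessing/, "doc"],
    [/html/, "html"],
    [/^image\//, "image"],
    [/presentation|powerpoint/, "presentation"],
    [/spreadsheet|excel/, "spreadsheet"],
    [/^application\/pdf/, "pdf"],
    [/^text\/plain/, "text"],
  ];
  const hit = rules.find(([pattern]) => pattern.test(ct));
  return { type, subtype, docType: hit ? hit[1] : undefined };
}

console.log(classifyContentType("text/html; charset=UTF-8"));
```

Keeping the rules in an ordered array makes the precedence explicit, which matters because a type such as `text/html` would otherwise be ambiguous between the `html` and `text` labels.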

+ 32
- 0
solr_config/velocity/browse.vm

@@ -0,0 +1,32 @@
<div id="query-box">
<form id="query-form" action="#{url_for_home}" method="GET">
$resource.find:
<input type="text" id="q" name="q" style="width: 50%" value="$!esc.html($request.params.get('q'))"/>
<input type="submit" value="$resource.submit"/>
<div id="debug_query" class="debug">
<span id="parsed_query">$esc.html($response.response.debug.parsedquery)</span>
</div>

<input type="hidden" name="type" value="#current_type"/>
#if("#current_locale"!="")<input type="hidden" name="locale" value="#current_locale"/>#end
#foreach($fq in $response.responseHeader.params.getAll("fq"))
<input type="hidden" name="fq" id="allFQs" value="$esc.html($fq)"/>
#end
</form>

<div id="constraints">
#foreach($fq in $response.responseHeader.params.getAll("fq"))
#set($previous_fq_count=$velocityCount - 1)
#if($fq != '')
&gt; $esc.html($fq)<a href="#url_for_filters($response.responseHeader.params.fq.subList(0,$previous_fq_count))">x</a>
#end
#end
</div>

</div>


<div id="browse_results">
#parse("results.vm")
</div>


+ 0
- 0
solr_config/velocity/dropit.js


+ 2
- 0
solr_config/velocity/facet_doc_type.vm

@@ -0,0 +1,2 @@
## intentionally empty


+ 12
- 0
solr_config/velocity/facet_text_shingles.vm

@@ -0,0 +1,12 @@
<div id="facet_$field.name">
<span class="facet-field">$resource.facet.top_phrases</span><br/>

<ul id="tagcloud">
#foreach($facet in $sort.sort($field.values,"name"))
<li data-weight="$math.mul($facet.count,1)">
<a href="#url_for_facet_filter($field.name, $facet.name)">$facet.name</a>
</li>

#end
</ul>
</div>

+ 24
- 0
solr_config/velocity/facets.vm

@@ -0,0 +1,24 @@
#if($response.facetFields.size() > 0)
#foreach($field in $response.facetFields)
#if($field.values.size() > 0)
#if($engine.resourceExists("facet_${field.name}.vm"))
#parse("facet_${field.name}.vm")
#else
<div id="facet_$field.name" class="facet_field">
<span class="facet-field">#label("facet.${field.name}",$field.name)</span><br/>

<ul>
#foreach($facet in $field.values)
<li><a href="#url_for_facet_filter($field.name, $facet.name)">#if($facet.name!=$null)#label("${field.name}.${facet.name}","${field.name}.${facet.name}")#else<em>missing</em>#end</a> ($facet.count)</li>
#end
</ul>
</div>
#end
#end
#end ## end if field.values > 0
#end ## end if facetFields > 0






+ 29
- 0
solr_config/velocity/footer.vm

@@ -0,0 +1,29 @@
<hr/>

<div>

<div id="admin"><a href="#url_root/index.html#/#{core_name}">Solr Admin</a></div>

<a href="#" onclick='jQuery(".debug").toggle(); return false;'>toggle debug mode</a>
<a href="#url_for_lens&wt=xml#if($debug)&debug=true#end">XML results</a> ## TODO: Add links for other formats, maybe dynamically?

</div>

<div>
<a href="http://lucene.apache.org/solr">Solr Home Page</a>
</div>


<div class="debug">
<hr/>
Request:
<pre>
$esc.html($request)
</pre>

<hr/>
Debug:
<pre>
$esc.html($response.response.debug)
</pre>
</div>

+ 290
- 0
solr_config/velocity/head.vm

@@ -0,0 +1,290 @@
<title>Solr browse: #core_name</title>

<meta http-equiv="content-type" content="text/html; charset=UTF-8"/>

<link rel="icon" type="image/x-icon" href="#{url_root}/img/favicon.ico"/>
<link rel="shortcut icon" type="image/x-icon" href="#{url_root}/img/favicon.ico"/>

<script type="text/javascript" src="#{url_root}/libs/jquery-3.4.1.min.js"></script>
<script type="text/javascript" src="#{url_for_solr}/admin/file?file=/velocity/js/jquery.tx3-tag-cloud.js&contentType=text/javascript"></script>
<script type="text/javascript" src="#{url_for_solr}/admin/file?file=/velocity/js/dropit.js&contentType=text/javascript"></script>
<script type="text/javascript" src="#{url_for_solr}/admin/file?file=/velocity/js/jquery.autocomplete.js&contentType=text/javascript"></script>

<script type="text/javascript">
$(document).ready(function() {

$("#tagcloud").tx3TagCloud({
multiplier: 1
});

$('.menu').dropit();

$( document ).ajaxComplete(function() {
$("#tagcloud").tx3TagCloud({
multiplier: 5
});
});

$('\#q').keyup(function() {
$('#browse_results').load('#{url_for_home}?#lensNoQ&v.layout.enabled=false&v.template=results&q='+encodeURI($('\#q').val()));

$("\#q").autocomplete('#{url_for_solr}/suggest', {
extraParams: {
'suggest.q': function() { return $("\#q").val();},
'suggest.build': 'true',
'wt': 'json',
}
}).keydown(function(e) {
if (e.keyCode === 13){
$("#query-form").trigger('submit');
}
});
});

});
</script>

<style>

html {
background-color: #F0F8FF;
}

body {
font-family: Helvetica, Arial, sans-serif;
font-size: 10pt;
}

#header {
width: 100%;
font-size: 20pt;
}

#header2 {
margin-left:1200px;
}

#logo {
width: 115px;
margin: 0px 0px 0px 0px;
border-style: none;
}

a {
color: #305CB3;
}

a.hidden {
display:none;
}

em {
color: #FF833D;
}

.error {
color: white;
background-color: red;
left: 210px;
width:80%;
position: relative;
}

.debug { display: none; font-size: 10pt}
#debug_query {
font-family: Helvetica, Arial, sans-serif;
font-size: 10pt;
font-weight: bold;
}
#parsed_query {
font-family: Courier, "Courier New", monospace;
font-size: 10pt;
font-weight: normal;
}

#admin {
text-align: right;
vertical-align: top;
}

#query-form {
width: 90%;
}

#query-box {
padding: 5px;
margin: 5px;
font-weight: normal;
font-size: 24px;
letter-spacing: 0.08em;
}
#constraints {
margin: 10px;
}

#tabs { }
#tabs li { display: inline; font-size: 10px;}
#tabs li a { border-radius: 20px; border: 2px solid #C1CDCD; padding: 10px;color: #42454a; background-color: #dedbde;}
#tabs li a:hover { background-color: #f1f0ee; }
#tabs li a.selected { color: #000; background-color: #f1f0ee; font-weight: bold; padding: 5px }
#tabs li a.no_results { color: #000; background-color: #838B8B; font-style: italic; padding: 5px; pointer-events: none;
cursor: default; text-decoration: none;}

.pagination {
width: 305px;
border-radius: 25px;
border: 2px solid #C1CDCD;
padding: 20px;
padding-left: 10%;
background: #eee;
margin-left: 190px;
margin-top : 42px;
padding-top: 5px;
padding-bottom: 5px;
text-align:left;
}

#results_list { width: 70%; }
.result-document {
border-radius: 25px;
border: 2px solid #C1CDCD;
padding: 10px;
/* width: 800px; */
/* height: 120px; */
margin: 5px;
/* margin-left: 60px; */
/* margin-right: 210px; */
/* margin-bottom: 15px; */
transition: 1s ease;
}
.result-document:hover
{
-webkit-transform: scale(1.1);
-ms-transform: scale(1.1);
transform: scale(1.1);
transition: 1s ease;
}
.result-document div {
padding: 5px;
}
.result-title {
width:60%;
}
.result-body {
background: #ddd;
}
.result-document:nth-child(2n+1) {
background-color: #FFFFFD;
}

#facets {
margin: 5px;
margin-top: 0px;
padding: 5px;
top: -20px;
position: relative;
float: right;
width: 25%;
}
.facet-field {
font-weight: bold;
}
#facets ul {
list-style: none;
margin: 0;
margin-bottom: 5px;
margin-top: 5px;
padding-left: 10px;
}
#facets ul li {
color: #999;
padding: 2px;
}

div.facet_field {
clear: left;
}

ul.tx3-tag-cloud { }
ul.tx3-tag-cloud li {
display: block;
float: left;
list-style: none;
margin-right: 4px;
}
ul.tx3-tag-cloud li a {
display: block;
text-decoration: none;
color: #c9c9c9;
padding: 3px 10px;
}
ul.tx3-tag-cloud li a:hover {
color: #000000;
-webkit-transition: color 250ms linear;
-moz-transition: color 250ms linear;
-o-transition: color 250ms linear;
-ms-transition: color 250ms linear;
transition: color 250ms linear;
}

.dropit {
list-style: none;
padding: 0;
margin: 0;
}
.dropit .dropit-trigger { position: relative; }
.dropit .dropit-submenu {
position: absolute;
top: 100%;
left: 0; /* dropdown left or right */
z-index: 1000;
display: none;
min-width: 150px;
list-style: none;
padding: 0;
margin: 0;
}
.dropit .dropit-open .dropit-submenu { display: block; }


/* autocomplete css */
.ac_results {
padding: 0px;
border: 1px solid black;
background-color: white;
overflow: hidden;
z-index: 99999;
}

.ac_results ul {
width: 100%;
list-style-position: outside;
list-style: none;
padding: 0;
margin: 0;
}

.ac_results li {
margin: 0px;
padding: 2px 5px;
cursor: default;
display: block;
font: menu;
font-size: 12px;
line-height: 16px;
overflow: hidden;
}

.ac_loading {
/* background: white url('˜indicator.gif') right center no-repeat; */
}

.ac_odd {
background-color: #eee;
}

.ac_over {
background-color: #0A246A;
color: white;
}
</style>

+ 77
- 0
solr_config/velocity/hit.vm View File

@@ -0,0 +1,77 @@

#set($docId = $doc.getFirstValue($request.schema.uniqueKeyField.name))

## Load Mime-Type List and Mapping
#parse('mime_type_lists.vm')

## Title
#if($doc.getFieldValue('title'))
#set($title = $esc.html($doc.getFirstValue('title')))
#else
#set($title = "$doc.getFirstValue('id').substring($math.add(1,$doc.getFirstValue('id').lastIndexOf('/')))")
#end

## Date
#if($doc.getFieldValue('attr_meta_creation_date'))
#set($date = $esc.html($doc.getFirstValue('attr_meta_creation_date')))
#else
#set($date = "No date found")
#end



## URL
#if($doc.getFieldValue('url'))
#set($url = $doc.getFieldValue('url'))
#elseif($doc.getFieldValue('resourcename'))
#set($url = "file:///$doc.getFirstValue('resourcename')")
#else
#set($url = "$doc.getFieldValue('id')")
#end

## Sort out Mime-Type
#set($ct = $doc.getFirstValue('content_type').split(";").get(0))
#set($filename = $doc.getFirstValue('resourcename'))
#set($filetype = false)
#set($filetype = $mimeExtensionsMap.get($ct))
#if(!$filetype)
#set($filetype = $filename.substring($filename.lastIndexOf(".")).substring(1))
#end
#if(!$filetype)
#set($filetype = "file")
#end
#if(!$supportedMimeTypes.contains($filetype))
#set($filetype = "file")
#end

<div class="result-document">
<span class="result-title">
<img src="#{url_root}/img/filetypes/${filetype}.png" align="center">
<b>$title</b>
</span>

<div>
id: $docId <br/>
</div>

#set($pad = "")
#foreach($v in $response.response.highlighting.get($docId).get("content"))
$pad$esc.html($v).replace("HL_START","<em>").replace("HL_END","</em>")
#set($pad = " ... ")
#end

</div>

<a href="#" class="debug" onclick='jQuery(this).next().toggle(); return false;'>toggle explain</a>
<pre style="display: none;">
$esc.html($response.getExplainMap().get($doc.getFirstValue('id')))
</pre>

<a href="#" class="debug" onclick='jQuery(this).next().toggle(); return false;'>show all fields</a>
<pre style="display:none;">
#foreach($fieldname in $doc.fieldNames)
<span>$fieldname :</span>
<span>#foreach($value in $doc.getFieldValues($fieldname))$esc.html($value)#end</span>
#end
</pre>


BIN
solr_config/velocity/img/english_640.png View File

Width: 640  |  Height: 480  |  Size: 135KB

BIN
solr_config/velocity/img/france_640.png View File

Width: 640  |  Height: 480  |  Size: 98KB

BIN
solr_config/velocity/img/germany_640.png View File

Width: 640  |  Height: 480  |  Size: 103KB

BIN
solr_config/velocity/img/globe_256.png View File

Width: 256  |  Height: 256  |  Size: 46KB

+ 0
- 0
solr_config/velocity/jquery.tx3-tag-cloud.js View File


+ 97
- 0
solr_config/velocity/js/dropit.js View File

@@ -0,0 +1,97 @@
/*
* Dropit v1.1.0
* http://dev7studios.com/dropit
*
* Copyright 2012, Dev7studios
* Free to use and abuse under the MIT license.
* http://www.opensource.org/licenses/mit-license.php
*/

;(function($) {

$.fn.dropit = function(method) {

var methods = {

init : function(options) {
this.dropit.settings = $.extend({}, this.dropit.defaults, options);
return this.each(function() {
var $el = $(this),
el = this,
settings = $.fn.dropit.settings;

// Hide initial submenus
$el.addClass('dropit')
.find('>'+ settings.triggerParentEl +':has('+ settings.submenuEl +')').addClass('dropit-trigger')
.find(settings.submenuEl).addClass('dropit-submenu').hide();

// Open on click
$el.off(settings.action).on(settings.action, settings.triggerParentEl +':has('+ settings.submenuEl +') > '+ settings.triggerEl +'', function(){
// Close click menus if clicked again
if(settings.action == 'click' && $(this).parents(settings.triggerParentEl).hasClass('dropit-open')){
settings.beforeHide.call(this);
$(this).parents(settings.triggerParentEl).removeClass('dropit-open').find(settings.submenuEl).hide();
settings.afterHide.call(this);
return false;
}

// Hide open menus
settings.beforeHide.call(this);
$('.dropit-open').removeClass('dropit-open').find('.dropit-submenu').hide();
settings.afterHide.call(this);

// Open this menu
settings.beforeShow.call(this);
$(this).parents(settings.triggerParentEl).addClass('dropit-open').find(settings.submenuEl).show();
settings.afterShow.call(this);

return false;
});

// Close if outside click
$(document).on('click', function(){
settings.beforeHide.call(this);
$('.dropit-open').removeClass('dropit-open').find('.dropit-submenu').hide();
settings.afterHide.call(this);
});

// If hover
if(settings.action == 'mouseenter'){
$el.on('mouseleave', '.dropit-open', function(){
settings.beforeHide.call(this);
$(this).removeClass('dropit-open').find(settings.submenuEl).hide();
settings.afterHide.call(this);
});
}

settings.afterLoad.call(this);
});
}

};

if (methods[method]) {
return methods[method].apply(this, Array.prototype.slice.call(arguments, 1));
} else if (typeof method === 'object' || !method) {
return methods.init.apply(this, arguments);
} else {
$.error( 'Method "' + method + '" does not exist in dropit plugin!');
}

};

$.fn.dropit.defaults = {
action: 'mouseenter', // The open action for the trigger
submenuEl: 'ul', // The submenu element
triggerEl: 'a', // The trigger element
triggerParentEl: 'li', // The trigger parent element
afterLoad: function(){}, // Triggers when plugin has loaded
beforeShow: function(){}, // Triggers before submenu is shown
afterShow: function(){}, // Triggers after submenu is shown
beforeHide: function(){}, // Triggers before submenu is hidden
afterHide: function(){} // Triggers after submenu is hidden
};

$.fn.dropit.settings = {};

})(jQuery);

+ 763
- 0
solr_config/velocity/js/jquery.autocomplete.js View File

@@ -0,0 +1,763 @@
/*
* Autocomplete - jQuery plugin 1.1pre
*
* Copyright (c) 2007 Dylan Verheul, Dan G. Switzer, Anjesh Tuladhar, Jörn Zaefferer
*
* Dual licensed under the MIT and GPL licenses:
* http://www.opensource.org/licenses/mit-license.php
* http://www.gnu.org/licenses/gpl.html
*
* Revision: Id: jquery.autocomplete.js 5785 2008-07-12 10:37:33Z joern.zaefferer $
*
*/

;(function($) {
$.fn.extend({
autocomplete: function(urlOrData, options) {
var isUrl = typeof urlOrData == "string";
options = $.extend({}, $.Autocompleter.defaults, {
url: isUrl ? urlOrData : null,
data: isUrl ? null : urlOrData,
delay: isUrl ? $.Autocompleter.defaults.delay : 10,
max: options && !options.scroll ? 10 : 150
}, options);
// if highlight is set to false, replace it with a do-nothing function
options.highlight = options.highlight || function(value) { return value; };
// if the formatMatch option is not specified, then use formatItem for backwards compatibility
options.formatMatch = options.formatMatch || options.formatItem;
return this.each(function() {
new $.Autocompleter(this, options);
});
},
result: function(handler) {
return this.bind("result", handler);
},
search: function(handler) {
return this.trigger("search", [handler]);
},
flushCache: function() {
return this.trigger("flushCache");
},
setOptions: function(options){
return this.trigger("setOptions", [options]);
},
unautocomplete: function() {
return this.trigger("unautocomplete");
}
});

$.Autocompleter = function(input, options) {

var KEY = {
UP: 38,
DOWN: 40,
DEL: 46,
TAB: 9,
RETURN: 13,
ESC: 27,
COMMA: 188,
PAGEUP: 33,
PAGEDOWN: 34,
BACKSPACE: 8
};

// Create $ object for input element
var $input = $(input).attr("autocomplete", "off").addClass(options.inputClass);

var timeout;
var previousValue = "";
var cache = $.Autocompleter.Cache(options);
var hasFocus = 0;
var lastKeyPressCode;
var config = {
mouseDownOnSelect: false
};
var select = $.Autocompleter.Select(options, input, selectCurrent, config);
var blockSubmit;
// prevent form submit in opera when selecting with return key
$.browser && $.browser.opera && $(input.form).bind("submit.autocomplete", function() {
if (blockSubmit) {
blockSubmit = false;
return false;
}
});
// only opera doesn't trigger keydown multiple times while pressed, others don't work with keypress at all
$input.bind((($.browser && $.browser.opera) ? "keypress" : "keydown") + ".autocomplete", function(event) {
// track last key pressed
lastKeyPressCode = event.keyCode;
switch(event.keyCode) {
case KEY.UP:
event.preventDefault();
if ( select.visible() ) {
select.prev();
} else {
onChange(0, true);
}
break;
case KEY.DOWN:
event.preventDefault();
if ( select.visible() ) {
select.next();
} else {
onChange(0, true);
}
break;
case KEY.PAGEUP:
event.preventDefault();
if ( select.visible() ) {
select.pageUp();
} else {
onChange(0, true);
}
break;
case KEY.PAGEDOWN:
event.preventDefault();
if ( select.visible() ) {
select.pageDown();
} else {
onChange(0, true);
}
break;
// matches also semicolon
case options.multiple && $.trim(options.multipleSeparator) == "," && KEY.COMMA:
case KEY.TAB:
case KEY.RETURN:
if( selectCurrent() ) {
// stop default to prevent a form submit, Opera needs special handling
event.preventDefault();
blockSubmit = true;
return false;
}
break;
case KEY.ESC:
select.hide();
break;
default:
clearTimeout(timeout);
timeout = setTimeout(onChange, options.delay);
break;
}
}).focus(function(){
// track whether the field has focus, we shouldn't process any
// results if the field no longer has focus
hasFocus++;
}).blur(function() {
hasFocus = 0;
if (!config.mouseDownOnSelect) {
hideResults();
}
}).click(function() {
// show select when clicking in a focused field
if ( hasFocus++ > 1 && !select.visible() ) {
onChange(0, true);
}
}).bind("search", function() {
// TODO why not just specifying both arguments?
var fn = (arguments.length > 1) ? arguments[1] : null;
function findValueCallback(q, data) {
var result;
if( data && data.length ) {
for (var i=0; i < data.length; i++) {
if( data[i].result.toLowerCase() == q.toLowerCase() ) {
result = data[i];
break;
}
}
}
if( typeof fn == "function" ) fn(result);
else $input.trigger("result", result && [result.data, result.value]);
}
$.each(trimWords($input.val()), function(i, value) {
request(value, findValueCallback, findValueCallback);
});
}).bind("flushCache", function() {
cache.flush();
}).bind("setOptions", function() {
$.extend(options, arguments[1]);
// if we've updated the data, repopulate
if ( "data" in arguments[1] )
cache.populate();
}).bind("unautocomplete", function() {
select.unbind();
$input.unbind();
$(input.form).unbind(".autocomplete");
});
function selectCurrent() {
var selected = select.selected();
if( !selected )
return false;
var v = selected.result;
previousValue = v;
if ( options.multiple ) {
var words = trimWords($input.val());
if ( words.length > 1 ) {
v = words.slice(0, words.length - 1).join( options.multipleSeparator ) + options.multipleSeparator + v;
}
v += options.multipleSeparator;
}
$input.val(v);
hideResultsNow();
$input.trigger("result", [selected.data, selected.value]);
return true;
}
function onChange(crap, skipPrevCheck) {
if( lastKeyPressCode == KEY.DEL ) {
select.hide();
return;
}
var currentValue = $input.val();
if ( !skipPrevCheck && currentValue == previousValue )
return;
previousValue = currentValue;
currentValue = lastWord(currentValue);
if ( currentValue.length >= options.minChars) {
$input.addClass(options.loadingClass);
if (!options.matchCase)
currentValue = currentValue.toLowerCase();
request(currentValue, receiveData, hideResultsNow);
} else {
stopLoading();
select.hide();
}
};
function trimWords(value) {
if ( !value ) {
return [""];
}
var words = value.split( options.multipleSeparator );
var result = [];
$.each(words, function(i, value) {
if ( $.trim(value) )
result[i] = $.trim(value);
});
return result;
}
function lastWord(value) {
if ( !options.multiple )
return value;
var words = trimWords(value);
return words[words.length - 1];
}
// fills in the input box w/the first match (assumed to be the best match)
// q: the term entered
// sValue: the first matching result
function autoFill(q, sValue){
// autofill in the complete box w/the first match as long as the user hasn't entered in more data
// if the last user key pressed was backspace, don't autofill
if( options.autoFill && (lastWord($input.val()).toLowerCase() == q.toLowerCase()) && lastKeyPressCode != KEY.BACKSPACE ) {
// fill in the value (keep the case the user has typed)
$input.val($input.val() + sValue.substring(lastWord(previousValue).length));
// select the portion of the value not typed by the user (so the next character will erase)
$.Autocompleter.Selection(input, previousValue.length, previousValue.length + sValue.length);
}
};

function hideResults() {
clearTimeout(timeout);
timeout = setTimeout(hideResultsNow, 200);
};

function hideResultsNow() {
var wasVisible = select.visible();
select.hide();
clearTimeout(timeout);
stopLoading();
if (options.mustMatch) {
// call search and run callback
$input.search(
function (result){
// if no value found, clear the input box
if( !result ) {
if (options.multiple) {
var words = trimWords($input.val()).slice(0, -1);
$input.val( words.join(options.multipleSeparator) + (words.length ? options.multipleSeparator : "") );
}
else
$input.val( "" );
}
}
);
}
if (wasVisible)
// position cursor at end of input field
$.Autocompleter.Selection(input, input.value.length, input.value.length);
};

function receiveData(q, data) {
if ( data && data.length && hasFocus ) {
stopLoading();
select.display(data, q);
autoFill(q, data[0].value);
select.show();
} else {
hideResultsNow();
}
};

function request(term, success, failure) {
if (!options.matchCase)
term = term.toLowerCase();
var data = cache.load(term);
data = null; // Avoid buggy cache and go to Solr every time
// receive the cached data
if (data && data.length) {
success(term, data);
// if an AJAX url has been supplied, try loading the data now
} else if( (typeof options.url == "string") && (options.url.length > 0) ){
var extraParams = {
timestamp: +new Date()
};
$.each(options.extraParams, function(key, param) {
extraParams[key] = typeof param == "function" ? param() : param;
});
$.ajax({
// try to leverage ajaxQueue plugin to abort previous requests
mode: "abort",
// limit abortion to this input
port: "autocomplete" + input.name,
dataType: options.dataType,
url: options.url,
data: $.extend({
q: lastWord(term),
limit: options.max
}, extraParams),
success: function(data) {
var parsed = options.parse && options.parse(data) || parse(data);
cache.add(term, parsed);
success(term, parsed);
}
});
} else {
// if we have a failure, we need to empty the list -- this prevents the [TAB] key from selecting the last successful match
select.emptyList();
failure(term);
}
};
function parse(data) {
var parsed = [];
var rows = data.split("\n");
for (var i=0; i < rows.length; i++) {
var row = $.trim(rows[i]);
if (row) {
row = row.split("|");
parsed[parsed.length] = {
data: row,
value: row[0],
result: options.formatResult && options.formatResult(row, row[0]) || row[0]
};
}
}
return parsed;
};

function stopLoading() {
$input.removeClass(options.loadingClass);
};

};

$.Autocompleter.defaults = {
inputClass: "ac_input",
resultsClass: "ac_results",
loadingClass: "ac_loading",
minChars: 1,
delay: 400,
matchCase: false,
matchSubset: true,
matchContains: false,
cacheLength: 10,
max: 100,
mustMatch: false,
extraParams: {},
selectFirst: false,
formatItem: function(row) { return row[0]; },
formatMatch: null,
autoFill: false,
width: 0,
multiple: false,
multipleSeparator: ", ",
highlight: function(value, term) {
return value.replace(new RegExp("(?![^&;]+;)(?!<[^<>]*)(" + term.replace(/([\^\$\(\)\[\]\{\}\*\.\+\?\|\\])/gi, "\\$1") + ")(?![^<>]*>)(?![^&;]+;)", "gi"), "<strong>$1</strong>");
},
scroll: true,
scrollHeight: 180
};

$.Autocompleter.Cache = function(options) {

var data = {};
var length = 0;
function matchSubset(s, sub) {
if (!options.matchCase)
s = s.toLowerCase();
var i = s.indexOf(sub);
if (options.matchContains == "word"){
i = s.toLowerCase().search("\\b" + sub.toLowerCase());
}
if (i == -1) return false;
return i == 0 || options.matchContains;
};
function add(q, value) {
if (length > options.cacheLength){
flush();
}
if (!data[q]){
length++;
}
data[q] = value;
}
function populate(){
if( !options.data ) return false;
// track the matches
var stMatchSets = {},
nullData = 0;

// no url was specified, we need to adjust the cache length to make sure it fits the local data store
if( !options.url ) options.cacheLength = 1;
// track all options for minChars = 0
stMatchSets[""] = [];
// loop through the array and create a lookup structure
for ( var i = 0, ol = options.data.length; i < ol; i++ ) {
var rawValue = options.data[i];
// if rawValue is a string, make an array otherwise just reference the array
rawValue = (typeof rawValue == "string") ? [rawValue] : rawValue;
var value = options.formatMatch(rawValue, i+1, options.data.length);
if ( value === false )
continue;
var firstChar = value.charAt(0).toLowerCase();
// if no lookup array for this character exists, create it now
if( !stMatchSets[firstChar] )
stMatchSets[firstChar] = [];

// if the match is a string
var row = {
value: value,
data: rawValue,
result: options.formatResult && options.formatResult(rawValue) || value
};
// push the current match into the set list
stMatchSets[firstChar].push(row);

// keep track of minChars zero items
if ( nullData++ < options.max ) {
stMatchSets[""].push(row);
}
};

// add the data items to the cache
$.each(stMatchSets, function(i, value) {
// increase the cache size
options.cacheLength++;
// add to the cache
add(i, value);
});
}
// populate any existing data
setTimeout(populate, 25);
function flush(){
data = {};
length = 0;
}
return {
flush: flush,
add: add,
populate: populate,
load: function(q) {
if (!options.cacheLength || !length)
return null;
/*
* if dealing w/local data and matchContains then we must make sure
* to loop through all the data collections looking for matches
*/
if( !options.url && options.matchContains ){
// track all matches
var csub = [];
// loop through all the data grids for matches
for( var k in data ){
// don't search through the stMatchSets[""] (minChars: 0) cache
// this prevents duplicates
if( k.length > 0 ){
var c = data[k];
$.each(c, function(i, x) {
// if we've got a match, add it to the array
if (matchSubset(x.value, q)) {
csub.push(x);
}
});
}
}
return csub;
} else
// if the exact item exists, use it
if (data[q]){
return data[q];
} else
if (options.matchSubset) {
for (var i = q.length - 1; i >= options.minChars; i--) {
var c = data[q.substr(0, i)];
if (c) {
var csub = [];
$.each(c, function(i, x) {
if (matchSubset(x.value, q)) {
csub[csub.length] = x;
}
});
return csub;
}
}
}
return null;
}
};
};

$.Autocompleter.Select = function (options, input, select, config) {
var CLASSES = {
ACTIVE: "ac_over"
};
var listItems,
active = -1,
data,
term = "",
needsInit = true,
element,
list;
// Create results
function init() {
if (!needsInit)
return;
element = $("<div/>")
.hide()
.addClass(options.resultsClass)
.css("position", "absolute")
.appendTo(document.body);
list = $("<ul/>").appendTo(element).mouseover( function(event) {
if(target(event).nodeName && target(event).nodeName.toUpperCase() == 'LI') {
active = $("li", list).removeClass(CLASSES.ACTIVE).index(target(event));
$(target(event)).addClass(CLASSES.ACTIVE);
}
}).click(function(event) {
$(target(event)).addClass(CLASSES.ACTIVE);
select();
// TODO provide option to avoid setting focus again after selection? useful for cleanup-on-focus
input.focus();
return false;
}).mousedown(function() {
config.mouseDownOnSelect = true;
}).mouseup(function() {
config.mouseDownOnSelect = false;
});
if( options.width > 0 )
element.css("width", options.width);
needsInit = false;
}
function target(event) {
var element = event.target;
while(element && element.tagName != "LI")
element = element.parentNode;
// more fun with IE, sometimes event.target is empty, just ignore it then
if(!element)
return [];
return element;
}

function moveSelect(step) {
listItems.slice(active, active + 1).removeClass(CLASSES.ACTIVE);
movePosition(step);
var activeItem = listItems.slice(active, active + 1).addClass(CLASSES.ACTIVE);
if(options.scroll) {
var offset = 0;
listItems.slice(0, active).each(function() {
offset += this.offsetHeight;
});
if((offset + activeItem[0].offsetHeight - list.scrollTop()) > list[0].clientHeight) {
list.scrollTop(offset + activeItem[0].offsetHeight - list.innerHeight());
} else if(offset < list.scrollTop()) {
list.scrollTop(offset);
}
}
};
function movePosition(step) {
active += step;
if (active < 0) {
active = listItems.length - 1;
} else if (active >= listItems.length) {
active = 0;
}
}
function limitNumberOfItems(available) {
return options.max && options.max < available
? options.max
: available;
}
function fillList() {
list.empty();
var max = limitNumberOfItems(data.length);
for (var i=0; i < max; i++) {
if (!data[i])
continue;
var formatted = options.formatItem(data[i].data, i+1, max, data[i].value, term);
if ( formatted === false )
continue;
var li = $("<li/>").html( options.highlight(formatted, term) ).addClass(i%2 == 0 ? "ac_even" : "ac_odd").appendTo(list)[0];
$.data(li, "ac_data", data[i]);
}
listItems = list.find("li");
if ( options.selectFirst ) {
listItems.slice(0, 1).addClass(CLASSES.ACTIVE);
active = 0;
}
// apply bgiframe if available
if ( $.fn.bgiframe )
list.bgiframe();
}
return {
display: function(d, q) {
init();
data = d;
term = q;
fillList();
},
next: function() {
moveSelect(1);
},
prev: function() {
moveSelect(-1);
},
pageUp: function() {
if (active != 0 && active - 8 < 0) {
moveSelect( -active );
} else {
moveSelect(-8);
}
},
pageDown: function() {
if (active != listItems.length - 1 && active + 8 > listItems.length) {
moveSelect( listItems.length - 1 - active );
} else {
moveSelect(8);
}
},
hide: function() {
element && element.hide();
listItems && listItems.removeClass(CLASSES.ACTIVE);
active = -1;
},
visible : function() {
return element && element.is(":visible");
},
current: function() {
return this.visible() && (listItems.filter("." + CLASSES.ACTIVE)[0] || options.selectFirst && listItems[0]);
},
show: function() {
var offset = $(input).offset();
element.css({
width: typeof options.width == "string" || options.width > 0 ? options.width : $(input).width(),
top: offset.top + input.offsetHeight,
left: offset.left
}).show();
if(options.scroll) {
list.scrollTop(0);
list.css({
maxHeight: options.scrollHeight,
overflow: 'auto'
});
if($.browser && $.browser.msie && typeof document.body.style.maxHeight === "undefined") {
var listHeight = 0;
listItems.each(function() {
listHeight += this.offsetHeight;
});
var scrollbarsVisible = listHeight > options.scrollHeight;
list.css('height', scrollbarsVisible ? options.scrollHeight : listHeight );
if (!scrollbarsVisible) {
// IE doesn't recalculate width when scrollbar disappears
listItems.width( list.width() - parseInt(listItems.css("padding-left")) - parseInt(listItems.css("padding-right")) );
}
}
}
},
selected: function() {
var selected = listItems && listItems.filter("." + CLASSES.ACTIVE).removeClass(CLASSES.ACTIVE);
return selected && selected.length && $.data(selected[0], "ac_data");
},
emptyList: function (){
list && list.empty();
},
unbind: function() {
element && element.remove();
}
};
};

$.Autocompleter.Selection = function(field, start, end) {
if( field.createTextRange ){
var selRange = field.createTextRange();
selRange.collapse(true);
selRange.moveStart("character", start);
selRange.moveEnd("character", end);
selRange.select();
} else if( field.setSelectionRange ){
field.setSelectionRange(start, end);
} else {
if( field.selectionStart ){
field.selectionStart = start;
field.selectionEnd = end;
}
}
field.focus();
};

})(jQuery);

+ 70
- 0
solr_config/velocity/js/jquery.tx3-tag-cloud.js View File

@@ -0,0 +1,70 @@
/*
* ----------------------------------------------------------------------------
* "THE BEER-WARE LICENSE" (Revision 42):
* Tuxes3 wrote this file. As long as you retain this notice you
* can do whatever you want with this stuff. If we meet some day, and you think
* this stuff is worth it, you can buy me a beer in return Tuxes3
* ----------------------------------------------------------------------------
*/
(function($)
{
var settings;
$.fn.tx3TagCloud = function(options)
{

//
// DEFAULT SETTINGS
//
settings = $.extend({
multiplier : 1
}, options);
main(this);

}

function main(element)
{
// adding style attr
element.addClass("tx3-tag-cloud");
addListElementFontSize(element);
}

/**
* calculates the font size on each li element
* according to their data-weight attribute
*/
function addListElementFontSize(element)
{
var hDataWeight = -9007199254740992;
var lDataWeight = 9007199254740992;
$.each(element.find("li"), function(){
var cDataWeight = getDataWeight(this);
if (isNaN(cDataWeight))
{
logWarning("No \"data-weight\" attribute defined on <li> element");
}
else
{
hDataWeight = cDataWeight > hDataWeight ? cDataWeight : hDataWeight;
lDataWeight = cDataWeight < lDataWeight ? cDataWeight : lDataWeight;
}
});
$.each(element.find("li"), function(){
var dataWeight = getDataWeight(this);
var percent = Math.abs((dataWeight - lDataWeight)/(lDataWeight - hDataWeight));
$(this).css('font-size', (1 + (percent * settings['multiplier'])) + "em");
});

}

function getDataWeight(element)
{
return parseInt($(element).attr("data-weight"), 10);
}

function logWarning(message)
{
console.log("[WARNING] " + Date.now() + " : " + message);
}

}(jQuery));

+ 42
- 0
solr_config/velocity/layout.vm View File

@@ -0,0 +1,42 @@
<html>
<head>
#parse("head.vm")
</head>
<body>
<div id="header">
<a href="#url_for_home"><img src="#{url_root}/img/solr.svg" id="logo" title="Solr"/></a> $resource.powered_file_search
</div>

<div id="header2" onclick="javascript:locale_select()">
<ul class="menu">

<li>
<a href="#"><img src="#{url_for_solr}/admin/file?file=/velocity/img/globe_256.png&contentType=image/png" id="locale_pic" title="locale_select" width="30px" height="27px"/></a>
<ul>
<li><a href="#url_for_locale('fr_FR')" #if("#current_locale"=="fr_FR")class="hidden"#end>
<img src="#{url_for_solr}/admin/file?file=/velocity/img/france_640.png&contentType=image/png" id="french_flag" width="40px" height="40px"/>Fran&ccedil;ais</a></li>
<li><a href="#url_for_locale('de_DE')" #if("#current_locale"=="de_DE")class="hidden"#end>
<img src="#{url_for_solr}/admin/file?file=/velocity/img/germany_640.png&contentType=image/png" id="german_flag" width="40px" height="40px"/>Deutsch</a></li>
<li><a href="#url_for_locale('')" #if("#current_locale"=="")class="hidden"#end>
<img src="#{url_for_solr}/admin/file?file=/velocity/img/english_640.png&contentType=image/png" id="english_flag" width="40px" height="40px"/>English</a></li>
</ul>
</li>
</ul>
</div>

#if($response.response.error.code)
<div class="error">
<h1>ERROR $response.response.error.code</h1>
$response.response.error.msg
</div>
#else
<div id="content">
$content
</div>
#end

<div id="footer">
#parse("footer.vm")
</div>
</body>
</html>

+ 16
- 0
solr_config/velocity/macros.vm View File

@@ -0,0 +1,16 @@
#macro(lensFilterSortOnly)?#if($response.responseHeader.params.getAll("fq").size() > 0)&#fqs($response.responseHeader.params.getAll("fq"))#end#sort($request.params.getParams('sort'))#end
#macro(lensNoQ)#lensFilterSortOnly&type=#current_type#if("#current_locale"!="")&locale=#current_locale#end#end
#macro(lensNoType)#lensFilterSortOnly#q#if("#current_locale"!="")&locale=#current_locale#end#end
#macro(lensNoLocale)#lensFilterSortOnly#q&type=#current_type#end

## lens modified for example/files - to use fq from responseHeader rather than request, and #debug removed too as it is built into browse params now, also added type to lens
#macro(lens)#lensNoQ#q#end

## Macros defined custom for the "files" example
#macro(url_for_type $type)#url_for_home#lensNoType&type=$type#end
#macro(current_type)#if($response.responseHeader.params.type)${response.responseHeader.params.type}#{else}all#end#end
#macro(url_for_locale $locale)#url_for_home#lensNoLocale#if($locale!="")&locale=$locale#end&start=$page.start#end
#macro(current_locale)$!{response.responseHeader.params.locale}#end

## Usage: #label(resource_key[, default_value]) - resource_key is used as label if no default value specified and no resource exists
#macro(label $key $default)#if($resource.get($key).exists)${resource.get($key)}#else#if($default)$default#else${key}#end#end#end

+ 68
- 0
solr_config/velocity/mime_type_lists.vm View File

@@ -0,0 +1,68 @@
#**
* Define some Mime-Types, short and long form
*#

## MimeType to extension map for detecting file type
## and showing proper icon
## List of types match the icons in /solr/img/filetypes

## Short MimeType Names
## Was called $supportedtypes
#set($supportedMimeTypes = "7z;ai;aiff;asc;audio;bin;bz2;c;cfc;cfm;chm;class;conf;cpp;cs;css;csv;deb;divx;doc;dot;eml;enc;file;gif;gz;hlp;htm;html;image;iso;jar;java;jpeg;jpg;js;lua;m;mm;mov;mp3;mpg;odc;odf;odg;odi;odp;ods;odt;ogg;pdf;pgp;php;pl;png;ppt;ps;py;ram;rar;rb;rm;rpm;rtf;sig;sql;swf;sxc;sxd;sxi;sxw;tar;tex;tgz;txt;vcf;video;vsd;wav;wma;wmv;xls;xml;xpi;xvid;zip")

## Long Form: map MimeType headers to our Short names
## Was called $extMap
#set( $mimeExtensionsMap = {
"application/x-7z-compressed": "7z",
"application/postscript": "ai",
"application/pgp-signature": "asc",
"application/octet-stream": "bin",
"application/x-bzip2": "bz2",
"text/x-c": "c",
"application/vnd.ms-htmlhelp": "chm",
"application/java-vm": "class",
"text/css": "css",
"text/csv": "csv",
"application/x-debian-package": "deb",
"application/msword": "doc",
"message/rfc822": "eml",
"image/gif": "gif",
"application/winhlp": "hlp",
"text/html": "html",
"application/java-archive": "jar",
"text/x-java-source": "java",
"image/jpeg": "jpeg",
"application/javascript": "js",
"application/vnd.oasis.opendocument.chart": "odc",
"application/vnd.oasis.opendocument.formula": "odf",
"application/vnd.oasis.opendocument.graphics": "odg",
"application/vnd.oasis.opendocument.image": "odi",
"application/vnd.oasis.opendocument.presentation": "odp",
"application/vnd.oasis.opendocument.spreadsheet": "ods",
"application/vnd.oasis.opendocument.text": "odt",
"application/pdf": "pdf",
"application/pgp-encrypted": "pgp",
"image/png": "png",
"application/vnd.ms-powerpoint": "ppt",
"audio/x-pn-realaudio": "ram",
"application/x-rar-compressed": "rar",
"application/vnd.rn-realmedia": "rm",
"application/rtf": "rtf",
"application/x-shockwave-flash": "swf",
"application/vnd.sun.xml.calc": "sxc",
"application/vnd.sun.xml.draw": "sxd",
"application/vnd.sun.xml.impress": "sxi",
"application/vnd.sun.xml.writer": "sxw",
"application/x-tar": "tar",
"application/x-tex": "tex",
"text/plain": "txt",
"text/x-vcard": "vcf",
"application/vnd.visio": "vsd",
"audio/x-wav": "wav",
"audio/x-ms-wma": "wma",
"video/x-ms-wmv": "wmv",
"application/vnd.ms-excel": "xls",
"application/xml": "xml",
"application/x-xpinstall": "xpi",
"application/zip": "zip"
})

+ 20
- 0
solr_config/velocity/results.vm View File

@@ -0,0 +1,20 @@
<div id="facets">
#parse("facets.vm")
</div>


<div id="results_list">
<div class="pagination">
<span class="results-found">$page.results_found</span> $resource.results_found_in.insert(${response.responseHeader.QTime})
$resource.page_of.insert($page.current_page_number,$page.page_count)
</div>

#parse("results_list.vm")

<div class="pagination">
#link_to_previous_page
<span class="results-found">$page.results_found</span> $resource.results_found.
$resource.page_of.insert($page.current_page_number,$page.page_count)
#link_to_next_page
</div>
</div>

+ 21
- 0
solr_config/velocity/results_list.vm View File

@@ -0,0 +1,21 @@
<ul id="tabs">
<li><a href="#url_for_type('all')" #if("#current_type"=="all")class="selected"#end>$resource.type.all ($response.response.facet_counts.facet_queries.all_types)</a></li>
#foreach($type in $response.response.facet_counts.facet_fields.doc_type)
#if($type.key)
<li><a href="#url_for_type($type.key)" #if($type.value=="0")class="no_results"#end #if("#current_type"==$type.key)class="selected"#end> #label("type.${type.key}.label", $type.key) ($type.value)</a></li>
#else
#if($type.value > 0)
<li><a href="#url_for_type('unknown')" #if("#current_type"=="unknown")class="selected"#end>$resource.type.unknown ($type.value)</a></li>
#end
#end
#end
</ul>


<div id="results">
#foreach($doc in $response.results)
#parse("hit.vm")
#end
</div>



+ 144
- 0
solr_import.sh View File

@@ -0,0 +1,144 @@
#!/bin/bash
# @name: solr_import.sh
# @version: 0.1
# @creation_date: 2022-03-11
# @license: The MIT License <https://opensource.org/licenses/MIT>
# @author: Simon Bowie <ad7588@coventry.ac.uk>
# @purpose: Runs imports of files into Solr indexes
# @acknowledgements:
# https://www.redhat.com/sysadmin/arguments-options-bash-scripts

############################################################
# Subprograms #
############################################################
License()
{
echo 'Copyright 2022 Simon Bowie <ad7588@coventry.ac.uk>'
echo
echo 'Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:'
echo
echo 'The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.'
echo
echo 'THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.'
}

Help()
{
# Display Help
echo "This script performs Solr import functions for different cores."
echo
echo "Syntax: solr_import.sh [-l|h|z|a|e|i|m|p|x|d|s|w]"
echo "options:"
echo "l Print the MIT License notification."
echo "h Print this Help."
echo "z Index all."
echo "a Index ACTIVE folder."
echo "e Index EXPANDING folder."
echo "i Index INVISIBLE folder."
echo "m Index MULTI-SPECIES folder."
echo "p Index PISSING & LEAKING folder."
echo "x Index SECRET folder."
echo "d Index SELF-DEFENDING folder."
echo "s Index SURVIVING folder."
echo "w Index WORKING folder."
echo
}

Import()
{
docker exec -it solr bin/solr delete -c $core

docker exec -it solr solr create_core -c $core -d custom

#docker exec -ti --user=solr solr bash -c "cp -r /opt/solr/example/files/conf/* /var/solr/data/$core/conf/"

#docker restart solr

sleep 30

docker run --rm -v "$directory/$location:/$core" --network=host solr:latest post -c $core /$core
}

Import_recursive()
{
docker run --rm -v "$directory/$subdirectory:/$core" --network=host solr:latest post -c $core /$core
}
############################################################
############################################################
# Main program #
############################################################
############################################################

# Set variables
directory="/Users/ad7588/projects/patent_site"

# Get the options
while getopts ":hlimzaespxdw" option; do
case $option in
l) # display License
License
exit;;
h) # display Help
Help
exit;;
z) # index all
core="all"
docker exec -it solr bin/solr delete -c $core
docker exec -it solr solr create_core -c $core -d custom
location="data/POP_Dataset_2022"
for subdirectory in $location/*/
do
subdirectory=${subdirectory%*/} # remove the trailing "/"
Import_recursive
done
exit;;
a) # index ACTIVE folder
core="active"
location="data/pop_rtfs/ACTIVE (160)"
Import
exit;;
e) # index EXPANDING folder
core="expanding"
location="data/pop_rtfs/EXPANDING (169)"
Import
exit;;
i) # index INVISIBLE folder
core="invisible"
location="data/pop_rtfs/IN.VISIBLE (204)"
Import
exit;;
m) # index MULTI-SPECIES folder
core="multispecies"
location="data/pop_rtfs/MULTI-SPECIES (180)"
Import
exit;;
p) # index PISSING & LEAKING folder
core="pissing"
location="data/pop_rtfs/PISSING & LEAKING (168)"
Import
exit;;
x) # index SECRET folder
core="secret"
location="data/pop_rtfs/SECRET (92)"
Import
exit;;
d) # index SELF-DEFENDING folder
core="defending"
location="data/pop_rtfs/SELF-DEFENDING (115)"
Import
exit;;
s) # index SURVIVING folder
core="surviving"
location="data/pop_rtfs/SURVIVING (166)"
Import
exit;;
w) # index WORKING folder
core="working"
location="data/pop_rtfs/WORKING (101)"
Import
exit;;
\?) # Invalid option
echo "Error: Invalid option" >&2
exit 1;;
esac
done

+ 10
- 0
web/Dockerfile View File

@@ -0,0 +1,10 @@
# syntax=docker/dockerfile:1
FROM python:3.10.7-slim-buster

RUN apt-get update -y && apt-get install -y imagemagick

WORKDIR /code
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt
COPY . .
CMD ["python3", "-m", "flask", "run"]

+ 34
- 0
web/app/__init__.py View File

@@ -0,0 +1,34 @@
# @name: __init__.py
# @creation_date: 2022-09-07
# @license: The MIT License <https://opensource.org/licenses/MIT>
# @author: Simon Bowie <ad7588@coventry.ac.uk>
# @purpose: Initialises the app, SQLAlchemy, and configuration variables
# @acknowledgements:
# https://www.digitalocean.com/community/tutorials/how-to-add-authentication-to-your-app-with-flask-login
# Config stuff adapted from https://blog.miguelgrinberg.com/post/the-flask-mega-tutorial-part-iii-web-forms

from flask import Flask
from flask_moment import Moment
import os

# initiate Moment for datetime functions
moment = Moment()

def create_app():
    app = Flask(__name__)

    moment.init_app(app)

    # blueprint for main parts of app
    from .main import main as main_blueprint
    app.register_blueprint(main_blueprint)

    # blueprint for search parts of app
    from .search import search as search_blueprint
    app.register_blueprint(search_blueprint)

    # blueprint for random parts of app
    from .random import random as random_blueprint
    app.register_blueprint(random_blueprint)

    return app

+ 16
- 0
web/app/main.py View File

@@ -0,0 +1,16 @@
# @name: main.py
# @creation_date: 2022-09-07
# @license: The MIT License <https://opensource.org/licenses/MIT>
# @author: Simon Bowie <ad7588@coventry.ac.uk>
# @purpose: Main route for index and other pages
# @acknowledgements:
# https://www.digitalocean.com/community/tutorials/how-to-add-authentication-to-your-app-with-flask-login

from flask import Blueprint, render_template

main = Blueprint('main', __name__)

# route for index page
@main.route('/')
def index():
    return render_template('index.html')

+ 153
- 0
web/app/ops.py View File

@@ -0,0 +1,153 @@
# @name: ops.py
# @version: 0.1
# @creation_date: 2022-09-08
# @license: The MIT License <https://opensource.org/licenses/MIT>
# @author: Simon Bowie <simon.bowie.19@gmail.com>
# @purpose: Performs functions against the European Patent Office's Open Patent Services (OPS) API
# @acknowledgements:
# OPS documented at https://www.epo.org/searching-for-patents/data/web-services/ops.html
# OPS RESTful API specification at http://documents.epo.org/projects/babylon/eponet.nsf/0/F3ECDCC915C9BCD8C1258060003AA712/$File/ops_v3.2_documentation_-_version_1.3.18_en.pdf
# OPS API functions list at https://developers.epo.org/ops-v3-2/apis

import os
import requests
import base64
from wand.image import Image

# get config variables from OS environment variables: set in env file passed through Docker Compose
ops_url = os.environ.get('OPS_URL')
ops_url_images = os.environ.get('OPS_URL_IMAGES')
consumer_key = os.environ.get('CONSUMER_KEY')
consumer_secret = os.environ.get('CONSUMER_SECRET')

def get_access_token():

    # OPS API credentials (details at http://documents.epo.org/projects/babylon/eponet.nsf/0/F3ECDCC915C9BCD8C1258060003AA712/$File/ops_v3.2_documentation_-_version_1.3.18_en.pdf)
    endpoint_url = ops_url + '3.2/auth/accesstoken'
    auth = consumer_key + ":" + consumer_secret
    auth_bytes = auth.encode("ascii")
    base64_bytes = base64.b64encode(auth_bytes)
    base64_string = base64_bytes.decode("ascii")

    # set up API call
    headers = {"Authorization": "Basic " + base64_string, "Content-Type": "application/x-www-form-urlencoded"}
    data = "grant_type=client_credentials"

    # request the token
    response = requests.post(endpoint_url, headers=headers, data=data)

    # raise on an HTTP error instead of falling through to an unbound variable
    response.raise_for_status()

    # turn the API response into useful JSON and return the token
    json = response.json()
    access_token = json['access_token']

    return access_token

def get_publication_details(doc_ref):

    access_token = get_access_token()

    # OPS API credentials (details at http://documents.epo.org/projects/babylon/eponet.nsf/0/F3ECDCC915C9BCD8C1258060003AA712/$File/ops_v3.2_documentation_-_version_1.3.16_en.pdf)
    endpoint_url = ops_url + 'rest-services/published-data/publication/docdb/' + doc_ref + '/biblio'

    # set up API call
    headers = {"Authorization": "Bearer " + access_token, "Accept": "application/json"}

    # get result
    response = requests.get(endpoint_url, headers=headers)

    output = {}

    if response.status_code == 200:
        # turn the API response into useful JSON
        json = response.json()

        # for each invention title, check if it's in the original language
        try:
            invention_titles = json['ops:world-patent-data']['exchange-documents']['exchange-document']['bibliographic-data']['invention-title']
            # a single title arrives as a dict rather than a list
            if isinstance(invention_titles, dict):
                invention_titles = [invention_titles]
            for invention_title in invention_titles:
                if invention_title.get('@lang') not in (None, 'en'):
                    output['original_title'] = invention_title['$']
        except (KeyError, TypeError):
            pass

        # for each abstract, check if it's in the original language
        try:
            abstracts = json['ops:world-patent-data']['exchange-documents']['exchange-document']['abstract']
            # a single abstract arrives as a dict rather than a list
            if isinstance(abstracts, dict):
                abstracts = [abstracts]
            for abstract in abstracts:
                if abstract.get('@lang') not in (None, 'en'):
                    output['original_abstract'] = abstract['p']['$']
        except (KeyError, TypeError):
            pass

    return output

def get_images(doc_ref):

    access_token = get_access_token()

    # OPS API credentials (details at http://documents.epo.org/projects/babylon/eponet.nsf/0/F3ECDCC915C9BCD8C1258060003AA712/$File/ops_v3.2_documentation_-_version_1.3.16_en.pdf)
    endpoint_url = ops_url + 'rest-services/published-data/publication/docdb/' + doc_ref + '/images'

    # set up API call
    headers = {"Authorization": "Bearer " + access_token, "Accept": "application/json"}

    # get result
    response = requests.get(endpoint_url, headers=headers)

    if response.status_code == 200:

        output = {}
        # None (not an empty dict) so the "no drawing found" checks below work
        drawings_url = None

        # turn the API response into useful JSON
        json = response.json()

        try:
            document_instances = json['ops:world-patent-data']['ops:document-inquiry']['ops:inquiry-result']['ops:document-instance']
            # a single document instance arrives as a dict rather than a list
            if isinstance(document_instances, dict):
                document_instances = [document_instances]

            # prefer the drawings; fall back to the full document
            for document_instance in document_instances:
                if document_instance['@desc'] == 'Drawing':
                    drawings_url = ops_url_images + '3.2/rest-services/' + document_instance['@link'] + '?Range=1'
            if drawings_url is None:
                for document_instance in document_instances:
                    if document_instance['@desc'] == 'FullDocument':
                        drawings_url = ops_url_images + '3.2/rest-services/' + document_instance['@link'] + '?Range=1'

            if drawings_url is not None:

                # set up API call
                headers = {"Authorization": "Bearer " + access_token, "Accept": "application/tiff"}

                # get the first page as a TIFF
                response = requests.get(drawings_url, headers=headers)

                if response.status_code == 200:
                    # convert the TIFF to a base64-encoded PNG for embedding in the page
                    with Image(blob = response.content) as image:
                        png_blob = image.make_blob('png')
                        base64_bytes = base64.b64encode(png_blob)
                        output['image'] = base64_bytes.decode("ascii")

        except (KeyError, TypeError):
            pass

    else:
        output = False

    return output

+ 62
- 0
web/app/random.py View File

@@ -0,0 +1,62 @@
# @name: random.py
# @creation_date: 2022-09-09
# @license: The MIT License <https://opensource.org/licenses/MIT>
# @author: Simon Bowie <ad7588@coventry.ac.uk>
# @purpose: Routes for random records, titles, abstracts, and images
# @acknowledgements:

from flask import Blueprint, render_template, request
from . import solr
from . import ops

random = Blueprint('random', __name__)

# route for random page
@random.route('/random/')
def random_record():
    core = 'all'
    results = solr.get_random_record(core)
    for result in results:
        publication_details = ops.get_publication_details(result['doc_ref'])
        result.update(publication_details)
        # call the images endpoint once and reuse the response
        image = ops.get_images(result['doc_ref'])
        if image:
            result.update(image)
    return render_template('record.html', results=results)

# route for comparing two random records
@random.route('/random/two/')
def two_random_records():
    core = 'all'
    results_list = []
    for _ in range(2):
        results = solr.get_random_record(core)
        for result in results:
            publication_details = ops.get_publication_details(result['doc_ref'])
            result.update(publication_details)
            # call the images endpoint once and reuse the response
            image = ops.get_images(result['doc_ref'])
            if image:
                result.update(image)
            results_list.append(result)
    return render_template('compare.html', results=results_list)

# route for getting ten random titles
@random.route('/random/titles/')
def ten_random_titles():
    titles = solr.get_ten_random_elements('title')
    additional_titles = solr.get_ten_random_elements('title')
    return render_template('titles.html', titles=titles, additional_titles=additional_titles)

# route for getting ten random abstracts
@random.route('/random/abstracts/')
def ten_random_abstracts():
    abstracts = solr.get_ten_random_elements('abstract')
    return render_template('abstracts.html', abstracts=abstracts)

# route for getting ten random images
@random.route('/random/images/')
def ten_random_images():
    results = solr.get_ten_random_images()
    return render_template('images.html', results=results)

+ 51
- 0
web/app/search.py View File

@@ -0,0 +1,51 @@
# @name: search.py
# @creation_date: 2022-09-07
# @license: The MIT License <https://opensource.org/licenses/MIT>
# @author: Simon Bowie <ad7588@coventry.ac.uk>
# @purpose: Routes for basic search and ID lookup
# @acknowledgements:
# https://www.digitalocean.com/community/tutorials/how-to-add-authentication-to-your-app-with-flask-login

from flask import Blueprint, render_template, request
from . import solr
from . import ops

search = Blueprint('search', __name__)

# route for search page
@search.route('/search/', methods=['POST'])
def basic_search():
    search = request.form.get('search')
    if request.form.get('core') is not None:
        core = request.form.get('core')
    else:
        core = 'all'
    if request.form.get('sort') is not None:
        sort = request.form.get('sort')
    else:
        sort = 'relevance'
    results = solr.solr_search(core, sort, search)
    return render_template('search.html', results=results, search=search, core=core, sort=sort)

# route for id_search page
@search.route('/search/id/')
def id_search():
    if request.args.get('core') is not None:
        core = request.args.get('core')
    else:
        core = 'all'
    if request.args.get('sort') is not None:
        sort = request.args.get('sort')
    else:
        sort = 'relevance'
    id = request.args.get('id')
    # pass the ID by keyword: the module-level name `search` is the blueprint, not a query string
    results = solr.solr_search(core, sort, id=id)

    for result in results:
        publication_details = ops.get_publication_details(result['doc_ref'])
        result.update(publication_details)
        # call the images endpoint once and reuse the response
        image = ops.get_images(result['doc_ref'])
        if image:
            result.update(image)

    return render_template('record.html', results=results)

+ 145
- 0
web/app/solr.py View File

@@ -0,0 +1,145 @@
# @name: solr.py
# @version: 0.1
# @creation_date: 2022-09-07
# @license: The MIT License <https://opensource.org/licenses/MIT>
# @author: Simon Bowie <simon.bowie.19@gmail.com>
# @purpose: Performs Solr functions
# @acknowledgements:

import os
import requests
import re
import urllib.parse
import random
from . import ops

# get config variables from OS environment variables: set in env file passed through Docker Compose
solr_hostname = os.environ.get('SOLR_HOSTNAME')
solr_port = os.environ.get('SOLR_PORT')

def solr_search(core, sort, search=None, id=None):

    # Assemble a query string to send to Solr. This uses the Solr hostname from config.env. Solr's query syntax can be found at many sites including https://lucene.apache.org/solr/guide/6_6/the-standard-query-parser.html
    # User-supplied values are URL-encoded so characters such as & cannot break the query string
    if id is not None:
        solrurl = 'http://' + solr_hostname + ':' + solr_port + '/solr/' + core + '/select?q.op=OR&q=id%3A"' + urllib.parse.quote_plus(id) + '"&wt=json'
    else:
        if (sort == 'relevance'):
            solrurl = 'http://' + solr_hostname + ':' + solr_port + '/solr/' + core + '/select?q.op=OR&q=content%3A' + urllib.parse.quote_plus(search) + '&wt=json'
        else:
            solrurl = 'http://' + solr_hostname + ':' + solr_port + '/solr/' + core + '/select?q.op=OR&q=content%3A' + urllib.parse.quote_plus(search) + '&wt=json&sort=' + urllib.parse.quote_plus(sort)

    # get result
    request = requests.get(solrurl)
    # turn the API response into useful JSON
    json = request.json()

    if (json['response']['numFound'] == 0):
        output = 'no results found'
    else:
        output = []
        for result in json['response']['docs']:
            # set ID variable
            id = result['id']
            # set content variable
            content = result['content']
            # parse result
            result_output = parse_result(id, content)
            output.append(result_output)
    return output

def parse_result(id, input):

    output = {}

    output['id'] = id

    # set document reference number (used for OPS API)
    doc_ref = re.search(r'=D\s(([^\s]*)\s([^\s]*)\s([^\s]*))', input)
    if doc_ref is None:
        doc_ref = re.search(r'=D&locale=en_EP\s(([^\s]*)\s([^\s]*)\s([^\s]*))', input)
    output['doc_ref'] = doc_ref.group(1).replace(" ", "")

    # search for the application ID in the content element and display it
    application_id = re.search(r'Application.*\n(.*)\n', input)
    if application_id is not None:
        output['application_id'] = application_id.group(1)

    # search for the EPO publication URL in the content element and display it
    epo_publication = re.search(r'Publication.*\n(.*)\n', input)
    if epo_publication is not None:
        output['epo_publication_url'] = epo_publication.group(1)

    # search for the IPC publication URL in the content element and display it
    ipc_publication = re.search(r'IPC.*\n(.*)\n', input)
    if ipc_publication is not None:
        output['ipc_publication_url'] = ipc_publication.group(1)

    # search for the title in the content element and display it
    title = re.search(r'Title.*\n(.*)\n', input)
    if title is not None:
        output['title'] = title.group(1)

    # search for the abstract in the content element and display it
    abstract = re.search(r'Abstract.*\n(.*)\n', input)
    if abstract is None:
        abstract = re.search(r'\(.\) \n\n(.*)\n', input)
    if abstract is not None:
        output['abstract'] = abstract.group(1)

    # search for the year in the content element and display it
    year = re.search(r'=D[^\s]*\s[^\s]*\s[^\s]*\s[^\s]*\s(\d{4})', input)
    if year is not None:
        output['year'] = year.group(1)
    return output

def get_random_record(core):

    rand = str(random.randint(0, 9999999))

    # Assemble a query string to send to Solr. This uses the Solr hostname from config.env. Solr's query syntax can be found at many sites including https://lucene.apache.org/solr/guide/6_6/the-standard-query-parser.html
    solrurl = 'http://' + solr_hostname + ':' + solr_port + '/solr/' + core + '/select?q.op=OR&q=*%3A*&wt=json&sort=random_' + rand + '%20asc&rows=1'

    # get result
    request = requests.get(solrurl)
    # turn the API response into useful JSON
    json = request.json()

    if (json['response']['numFound'] == 0):
        output = 'no results found'
    else:
        output = []
        for result in json['response']['docs']:
            # set ID variables
            id = result['id']
            # set content variable
            content = result['content']
            # parse result
            result_output = parse_result(id, content)
            output.append(result_output)
    return output

def get_ten_random_elements(field):
core = 'all'
output = []
i = 0
while i <= 9:
results = get_random_record(core)
for result in results:
if field in result:
dict = {'id': result['id'], field: result[field]}
output.append(dict)
i += 1
return output

def get_ten_random_images():
core = 'all'
output = []
i = 0
while i <= 9:
results = get_random_record(core)
for result in results:
if ops.get_images(result['doc_ref']):
image = ops.get_images(result['doc_ref'])
result.update(image)
output.append(result)
i += 1
return output

+ 9
- 0
web/app/static/js/main.js View File

@@ -0,0 +1,9 @@
/*
# @name: main.js
# @version: 0.1
# @creation_date: 2022-09-07
# @license: The MIT License <https://opensource.org/licenses/MIT>
# @author: Simon Bowie <ad7588@coventry.ac.uk>
# @purpose: JavaScript functions for various functions
# @acknowledgements:
*/

+ 10
- 0
web/app/static/styles/custom.css View File

@@ -0,0 +1,10 @@
/*
# @name: custom.css
# @version: 0.1
# @creation_date: 2022-09-07
# @license: The MIT License <https://opensource.org/licenses/MIT>
# @author: Simon Bowie <ad7588@coventry.ac.uk>
# @purpose: Custom CSS to override Bootstrap 5 defaults
# @acknowledgements:
# Bootstrap 5.1.3: https://getbootstrap.com/
*/

+ 15
- 0
web/app/templates/abstracts.html View File

@@ -0,0 +1,15 @@
{% extends "base.html" %}

{% block content %}

{% for abstract in abstracts %}

{{ abstract['abstract'] }}

<br><br>

<hr>

{% endfor %}

{% endblock %}

+ 51
- 0
web/app/templates/base.html View File

@@ -0,0 +1,51 @@
<!--
# @name: base.html
# @version: 0.1
# @creation_date: 2022-09-07
# @license: The MIT License <https://opensource.org/licenses/MIT>
# @author: Simon Bowie <ad7588@coventry.ac.uk>
# @purpose: Basic layout for all pages
# @acknowledgements:
# https://www.digitalocean.com/community/tutorials/how-to-make-a-web-application-using-flask-in-python-3
# Bootstrap 5.1.3: https://getbootstrap.com/
# Flask-Moment: https://flask-moment.readthedocs.io/en/latest/
# Bootstrap select: https://stackoverflow.com/questions/67942546/bootstrap-5-select-dropdown-with-the-multiple-attribute-collapsed

-->

<!DOCTYPE html>
<html>
<head>
<title>Performing Patents Otherwise: Archival conversations with 320,000 clothing inventions</title>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- Bootstrap CSS -->
<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-1BmE4kWBq78iYhFldvKuhfTAU6auU8tT94WrHftjDbrCEXSU1oBoqyl2QvZ6jIW3" crossorigin="anonymous">
<link href="{{ url_for('static',filename='styles/custom.css') }}" rel="stylesheet">
<!-- JavaScript -->
<script src="https://code.jquery.com/jquery-3.6.0.js" integrity="sha256-H+K7U5CnXl1h5ywQfKtSj8PCmoN9aaq30gDh27Xc0jk=" crossorigin="anonymous"></script>
<script src="{{ url_for('static',filename='js/main.js') }}"></script>
<script src="https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/js/bootstrap.bundle.min.js" integrity="sha384-ka7Sk0Gln4gmtz2MlQnikT1wXgYsOg+OMhuP+IlRH9sENBO0LRn5q+8nbTov4+1p" crossorigin="anonymous"></script>
</head>

<body class="d-flex flex-column min-vh-100">

<main class="flex-shrink-0">
<div class="container-fluid p-5 my-5 border">

{% block content %}
{% endblock %}

</div>
</main>

<footer class="footer py-3 mt-auto bg-light">
<div class="container">
<span class="text-muted">Data from the <a href="https://www.epo.org/">European Patent Office's</a> <a href="https://worldwide.espacenet.com/">Espacenet patent search engine</a> and reconfigured by Goldsmiths, University of London's Archival Conversations project.</span>
</div>
</footer>

</body>

</html>

+ 71
- 0
web/app/templates/compare.html View File

@@ -0,0 +1,71 @@
{% extends "base.html" %}

{% block content %}

<div class="row">
{% for result in results %}
<div class="col-6 text-center">

Application ID:

<a href="/search/id?id={{ result['id'] }}&core=all">
{{ result['application_id'] }}
</a>

<br><br>

Year:

{{ result['year'] }}

<br><br>

EPO publication:

<a href="{{ result['epo_publication_url'] }}">
{{ result['epo_publication_url'] }}
</a>

<br><br>

IPC publication:

<a href="{{ result['ipc_publication_url'] }}">
{{ result['ipc_publication_url'] }}
</a>

<br><br>

{% if result['title'] is defined %}
Title:
{{ result['title'] }}
<br><br>
{% endif %}

{% if result['original_title'] is defined %}
Original language title:
{{ result['original_title'] }}
<br><br>
{% endif %}

{% if result['abstract'] is defined %}
Abstract:
{{ result['abstract'] }}
<br><br>
{% endif %}

{% if result['original_abstract'] is defined %}
Original language abstract:
{{ result['original_abstract'] }}
<br><br>
{% endif %}

{% if result['image'] is defined %}
<img class="img-fluid" src="data:image/png;base64,{{ result['image'] }}" alt="Drawing of patent" />
{% endif %}
</div>
{% endfor %}
</div>

{% endblock %}

+ 11
- 0
web/app/templates/images.html View File

@@ -0,0 +1,11 @@
{% extends "base.html" %}

{% block content %}

{% for result in results %}

<img class="img-fluid" src="data:image/png;base64,{{ result['image'] }}" alt="Drawing accompanying patent for {{ result['title'] }}" />

{% endfor %}

{% endblock %}

+ 46
- 0
web/app/templates/index.html View File

@@ -0,0 +1,46 @@
{% extends "base.html" %}

{% block content %}

<div class="row">
<div class="col text-center">
<p class="h1">Performing Patents Otherwise</p>
<p class="h2">Archival conversations with 320,000 clothing inventions</p>
</div>
</div>

<div class="row justify-content-center p-3">
<div class="col-sm-6 text-center">
<form action="/search" method="POST">
<input type="text" name="search" placeholder="search for a patent record">
<input type="submit" id="submit" value="search">
</form>
</div>
</div>


<div class="row p-3">
<div class="col text-center">
<a href="/random">show a random record</a>
</div>

<div class="col text-center">
<a href="/random/two">compare two random records</a>
</div>
</div>

<div class="row p-3">
<div class="col text-center">
<a href="/random/titles">ten random titles</a>
</div>

<div class="col text-center">
<a href="/random/abstracts">ten random abstracts</a>
</div>

<div class="col text-center">
<a href="/random/images">ten random images (takes a long time to load)</a>
</div>
</div>

{% endblock %}

+ 69
- 0
web/app/templates/record.html View File

@@ -0,0 +1,69 @@
{% extends "base.html" %}

{% block content %}

{% for result in results %}
<div id="result">

Application ID:

<a href="/search/id?id={{ result['id'] }}&core=all">
{{ result['application_id'] }}
</a>

<br><br>

Year:

{{ result['year'] }}

<br><br>

EPO publication:

<a href="{{ result['epo_publication_url'] }}">
{{ result['epo_publication_url'] }}
</a>

<br><br>

IPC publication:

<a href="{{ result['ipc_publication_url'] }}">
{{ result['ipc_publication_url'] }}
</a>

<br><br>

{% if result['title'] is defined %}
Title:
{{ result['title'] }}
<br><br>
{% endif %}

{% if result['original_title'] is defined %}
Original language title:
{{ result['original_title'] }}
<br><br>
{% endif %}

{% if result['abstract'] is defined %}
Abstract:
{{ result['abstract'] }}
<br><br>
{% endif %}

{% if result['original_abstract'] is defined %}
Original language abstract:
{{ result['original_abstract'] }}
<br><br>
{% endif %}

{% if result['image'] is defined %}
<img class="img-fluid" src="data:image/png;base64,{{ result['image'] }}" alt="Drawing of patent" />
{% endif %}

</div>
{% endfor %}

{% endblock %}

+ 95
- 0
web/app/templates/search.html View File

@@ -0,0 +1,95 @@
{% extends "base.html" %}

{% block content %}

<div class="row p-3">
<form action="/search" method="POST">
<input type="hidden" name="search" value="{{ search }}">
<input type="hidden" name="searchopt" value="{{ core }}">
sort by:
<select name="sort" id="sort" onchange="this.form.submit()">
<option value="relevance" {% if sort == 'relevance' %} selected {% endif %}>relevance</option>
<option value="year desc" {% if sort == 'year desc' %} selected {% endif %}>year descending</option>
<option value="year asc" {% if sort == 'year asc' %} selected {% endif %}>year ascending</option>
</select>
<noscript>
<input type="submit" class="btn btn-default" value="Set" />
</noscript>
</form>
</div>

{% if results == 'no results found' %}

{{ results }}

{% else %}

{% for result in results %}

Application ID:

<a href="/search/id?id={{ result['id'] }}&core=all">
<span class="result-entry">
{{ result['application_id'] }}
</span>
</a>

<br><br>

Year:

{{ result['year'] }}

<br><br>

EPO publication:

<a href="{{ result['epo_publication_url'] }}">
{{ result['epo_publication_url'] }}
</a>

<br><br>

IPC publication:

<a href="{{ result['ipc_publication_url'] }}">
{{ result['ipc_publication_url'] }}
</a>

<br><br>

{% if result['title'] is defined %}
Title:
<span class="result-entry">
{{ result['title'] }}
</span>
<br><br>
{% endif %}

{% if result['abstract'] is defined %}
Abstract:
<span class="result-entry">
{{ result['abstract'] }}
</span>
<br><br>
{% endif %}

<hr>

{% endfor %}

{% endif %}

<script>
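// tojson emits a correctly quoted and escaped JavaScript string literal
const search_string = {{ search|tojson }};
// split on whitespace and drop empty terms so no empty pattern is built
const search_array = search_string.split(/\s+/).filter(Boolean);
// escape regex metacharacters so each term is matched literally
const escaped_terms = search_array.map(term => term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));
if (escaped_terms.length > 0) {
// one combined pattern, applied in a single pass, so inserted markup is never re-matched
const re = new RegExp("(" + escaped_terms.join("|") + ")", "g");
$("span.result-entry").html(function(_, html) {
return html.replace(re, '<span style="color:orange">$1</span>');
});
}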
</script>

{% endblock %}

+ 42
- 0
web/app/templates/titles.html

@@ -0,0 +1,42 @@
{% extends "base.html" %}

{% block content %}

<button class="float-end btn btn-danger" onclick="removeRandomTitle()">-</button>
<button class="float-end btn btn-danger" onclick="addRandomTitle()">+</button>

{% for title in titles %}

<span class="title">
<a href="/search/id?id={{ title['id'] }}&core=all">
{{ title['title'] }}
</a>
</span>

<br><br>

<hr>

{% endfor %}

<script>
var titles = {{ additional_titles|tojson }};

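var x = 0;

function addRandomTitle(){
// stop once every additional title has been shown
if (x >= titles.length) { return; }
var record_array = titles[x];
// build the nodes with the DOM API so the title is inserted as text, not parsed as HTML
var span = document.createElement('span');
span.className = 'title';
var link = document.createElement('a');
link.href = '/search/id?id=' + encodeURIComponent(record_array['id']) + '&core=all';
link.textContent = record_array['title'];
span.appendChild(link);
// same structure as the server-rendered entries: span > a, then two breaks and a rule
document.querySelector('.container-fluid').append(span, document.createElement('br'), document.createElement('br'), document.createElement('hr'));
x++;
}

function removeRandomTitle() {
var elts = document.getElementsByClassName("title");
// nothing to redact if no titles are on the page
if (elts.length === 0) { return; }
var RandomSpan = elts[Math.floor(Math.random() * elts.length)];
// replacing the span's content with plain text also drops the link inside it
RandomSpan.textContent = RandomSpan.textContent.replace(/\w/g,"-");
}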
</script>

{% endblock %}

+ 23
- 0
web/content/about.md

@@ -0,0 +1,23 @@
Version 0.1 of the Experimental Publishing Compendium was edited by members of COPIM’s experimental publishing group, formerly known as Work Package Six. Since then, a great many contributors, from tool and technology makers to authors, designers and publishers, have contributed, slowly transforming the Compendium from an edited volume into a collective resource.

The Compendium is © 2022–2022 [COPIM](https://copim.ac.uk) and licensed under a [Creative Commons Attribution 4.0 International License (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/) to make it open for reuse and disappropriation.

... List all contributors....

The Compendium is designed to be periodically updated, growing with the practices it aims to catalogue and support. Keeping the Compendium updated takes labour, care and attention, and, like any processual book, it will die at some point. Currently, the Compendium is hosted by the Centre for Postdigital Cultures, Coventry University, and is version 0.1.

## Preface

We — the editors of this compendium — do not wish to impose one version of experimental publishing, yet we recognise that a collection such as this is necessarily biased and thus political. In this preface to the first version, we are sharing how this particular version of the compendium came about, in the hope that this will open the compendium for amendments by those who maintain and use it.

The COPIM experimental publishing group, formerly known as work package six, worked for three and a half years on experimental publishing, in the context of the largely Anglo-American Community-led Open Publication Infrastructures for Monographs project, COPIM. At a time when commercial consolidation threatened to monopolise the emerging scholarly Open Access publishing landscape, COPIM gathered publishers, libraries and infrastructure providers to develop community-owned infrastructure that can support small and large players. We proposed open infrastructure as an alternative to proprietary platforms that extract value and control access. Under the banner of scaling small, COPIM worked towards a diverse publishing landscape characterised by community ownership, collective production and governance, scholar-led publishing, and the sharing of resources and open infrastructures amongst diverse institutions. COPIM’s work packages were largely dedicated to serious infrastructure building, with the exception of the experimental publishing group, which raises the question of how experimental publishing contributes to the ambition to establish infrastructures that allow diverse small initiatives to proliferate at scale.

The closely related metaphors of publishing landscape, ecology, ecosystems or bibliodiversity shaped COPIM’s work. Staying with these images of lively and abundant interdependence allows us to locate experimental publishing’s place in scholarly knowledge production. Speaking of publishing ecologies implies that scholarly publishing cannot be separated from the wider academic landscape. How scholarly work is published cannot be separated from how it is funded, conceptualised, written, valued, reviewed, rewarded, read and taught. In this metaphor, scholarly works, like all specimens, coevolve with the environment they inhabit.

Many things can be said about this environment: the contemporary academy. There isn’t one academy, for starters. Opinions and politics differ, as do the stakes and subject positions of the beholders.

We, invested in feminist techno-politics, yearn for more collective, inclusive, embodied, situated and caring modes of knowledge production. But the notion that changes in publishing affect the entire scholarly landscape applies just as neatly to those, for example, who pursue scholarly excellence through competition and streamlining. Our point here is that scholarly publishing ecologies reflect and materialise the wider scholarly landscape. Scholarly books, in this ecological view, are not containers of knowledge but relational nodes that materialise what does and doesn’t count as valuable practices, sites, labour, and subjects of knowledge.

The flow of water is commonly used to model the flow of knowledge, taking us further into the question of which forces shape the metaphorical scholarly landscape. Bureaucratic fantasies, enshrined in grant applications, project timetables, scholarly self-understanding and career paths, imagine scholarly publishing at the end of an orderly pipeline of knowledge. The way that institutions such as libraries, universities, publishers, funders, and intellectual property regimes are organised tends to reinforce the notion of a manageable flow from funding, to research question, to investigation, to publication, to evaluation. The metaphor of channeled flow and the premise of contained stages provide structure. Channeling the flow of valuable knowledge gives publishing a place and a form: the book, at the end of the pipe. But… you see it coming… where there are pipes there is breakage, spillage and blockage. And… without overflow and contamination… there won’t be much to be piped. A sanitised scholarly landscape of industrial pipage is a nightmare that evokes the very real nightmare streamlined industrial production has brought upon very real ecologies, leaving but scraplands for diversity, which alone can ensure life. And… also… despite all efforts to establish well-irrigated, drip-fed academies, the flow of scholarly knowledge is not easily channeled. Swamps, oceans, ice shields, underground currents, floods and drought-prone rivers evoke alternative models of flow that might inspire a diverse knowledge-scape that cannot be contained within the academy or otherwise.

Coming back to experimental publishing, new forms of publication might create new kinds of pipes or spill over into more relational circulation. Either way, we posit that experimental publishing is one of the sites where the shape of scholarly landscapes, and their relationship to other ecologies of knowledge and power, is negotiated and materialised in practice. How we publish matters. To experiment with scholarly books is to experiment with scholarly modes of knowledge production. This labour of love, like other experimental practice, takes place at the growing edges and in the cracks of established practices, where, by steady corrosion, underground commotion or capital-intensive incubation, forms of writing, making, sharing, reviewing, discovering, reading and cataloguing books come into being that will change what counts as scholarly work.

+ 15
- 0
web/content/home.md

@@ -0,0 +1,15 @@
# ExPub Compendium

The Experimental Publishing Compendium is for authors, designers, publishers, institutions and technologists who challenge, push and redefine the shape, form and rationale of scholarly works. The compendium offers a catalogue of tools, practices, publishers, and books to inspire experimental scholarly works.

## How to use the compendium

The compendium catalogues potential ingredients for the making of experimental publications.

- Under [Tools](/tools) you’ll find mostly software that supports experimental publication, from collective writing to the annotation and remix of published texts.
- [Practices](/practices) provide inspiration for experimental book making.
- [Books](/books) and [publishers](/publishers) provide examples of experimental books and those who publish them.

Each item is cross-linked so that a practice will take you to relevant tools or examples and vice versa.

The selection is tentative, reflecting the knowledge and biases of the contributors of current and previous versions. If you want to submit or edit anything in the catalogue please do so by doing XXthisXXX.

+ 6
- 0
web/requirements.txt

@@ -0,0 +1,6 @@
flask
flask-moment
gunicorn
markdown
requests
Wand
